# **EXPRESSION OF EMOTION IN MUSIC AND VOCAL COMMUNICATION**

**Topic Editors Anjali Bhatara, Petri Laukka and Daniel J. Levitin**

#### *FRONTIERS COPYRIGHT STATEMENT*

© Copyright 2007-2014 Frontiers Media SA. All rights reserved.

All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.

**ISSN** 1664-8714 **ISBN** 978-2-88919-263-2 **DOI** 10.3389/978-2-88919-263-2

### *ABOUT FRONTIERS*

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

### *FRONTIERS JOURNAL SERIES*

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing.

All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

### *DEDICATION TO QUALITY*

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view.

By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

### *WHAT ARE FRONTIERS RESEARCH TOPICS?*

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area!

Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

## **EXPRESSION OF EMOTION IN MUSIC AND VOCAL COMMUNICATION**

Topic Editors:

**Anjali Bhatara,** Université Paris Descartes, France; **Petri Laukka,** Stockholm University, Sweden; **Daniel J. Levitin,** McGill University, Canada

The owner of this image is Petri Laukka

Two of the most important social skills in humans are the ability to determine the moods of those around us, and to use this to guide our behavior. To accomplish this, we make use of numerous cues. Among the most important are vocal cues from both speech and non-speech sounds. Music is also a reliable method for communicating emotion. It is often present in social situations and can serve to unify a group's mood for ceremonial purposes (funerals, weddings) or general social interactions. Scientists and philosophers have speculated on the origins of music and language, and the possible common bases of emotional expression through music, speech and other vocalizations. They have found increasing evidence of commonalities among them. However, the domains in which researchers investigate these topics do not always overlap or share a common language, so communication between disciplines has been limited.

The aim of this Research Topic is to bring together research across multiple disciplines related to the production and perception of emotional cues in music, speech, and non-verbal vocalizations. This includes natural sounds produced by human and non-human primates as well as synthesized sounds. Research methodology includes survey, behavioral, and neuroimaging techniques investigating adults as well as developmental populations, including those with atypical development. Studies using laboratory tasks as well as studies in more naturalistic settings are included.

# Table of Contents


Sarah Weusthoff, Brian R. Baucom and Kurt Hahlweg


Rebecca Jürgens, Matthis Drolet, Ralph Pirow, Elisabeth Scheiner and Julia Fischer

*Perception of Emotionally Loaded Vocal Expressions and Its Connection to Responses to Music. A Cross-Cultural Investigation: Estonia, Finland, Sweden, Russia, and The USA*

Teija Waaramaa and Timo Leisiö

*Cross-Cultural Differences in the Processing of Non-Verbal Affective Vocalizations by Japanese and Canadian Listeners*

Michihiko Koeda, Pascal Belin, Tomoko Hama, Tadashi Masuda, Masato Matsuura and Yoshiro Okubo

*Cross-Cultural Decoding of Positive and Negative Non-Linguistic Emotion Vocalizations*

Petri Laukka, Hillary Anger Elfenbein, Nela Söder, Henrik Nordström, Jean Althoff, Wanda Chui, Frederick K. Iraki, Thomas Rockstuhl and Nutankumar S. Thingujam

*The Role of Motivation and Cultural Dialects in the In-Group Advantage for Emotional Vocalizations*

Disa Sauter


Lena Quinto, William Forde Thompson and Felicity Louise Keating

*On the Acoustics of Emotion in Audio: What Speech, Music, and Sound Have in Common*

Felix Weninger, Florian Eyben, Björn W. Schuller, Marcello Mortillaro and Klaus R. Scherer

*The "Musical Emotional Bursts": A Validated Set of Musical Affect Bursts to Investigate Auditory Affective Processing*

Sébastien Paquette, Isabelle Peretz and Pascal Belin


Sandrine Vieillard and Anne-Laure Gilet

**EDITORIAL** published: 05 May 2014 doi: 10.3389/fpsyg.2014.00399

## Expression of emotion in music and vocal communication: Introduction to the research topic

#### *Anjali Bhatara<sup>1,2</sup>\*, Petri Laukka<sup>3</sup> and Daniel J. Levitin<sup>4</sup>*

*<sup>1</sup> Sorbonne Paris Cité, Université Paris Descartes, Paris, France*

*<sup>2</sup> Laboratoire Psychologie de la Perception, CNRS, UMR 8242, Paris, France*

*<sup>3</sup> Department of Psychology, Stockholm University, Stockholm, Sweden*

*<sup>4</sup> Department of Psychology, McGill University, Montreal, QC, Canada*

*\*Correspondence: bhatara@gmail.com*

#### *Edited and reviewed by:*

*Luiz Pessoa, University of Maryland, USA*

#### **Keywords: music, speech, emotion, voice, cross-domain cognition**

In social interactions, we must gauge the emotional state of others in order to behave appropriately. We rely heavily on auditory cues, specifically speech prosody, to do this. Music is also a complex auditory signal with the capacity to communicate emotion rapidly and effectively and often occurs in social situations or ceremonies as an emotional unifier.

Scientists and philosophers have speculated about the common cognitive origins of music and language. Perhaps their common origin lies in their efficacy for emotional expression. Unlike semantic or syntactic aspects of language (and music), many of their acoustic and emotional aspects are shared with sounds made by other species (Fitch, 2006); music and speech share a common acoustic code for expressing emotion (Juslin and Laukka, 2003). Until recently, however, scientists working in the two domains of music and speech rarely communicated, so research was restricted to one domain or the other. The purpose of this Research Topic was to bring these researchers together and encourage cross-talk.

Over 25 groups of researchers contributed their expertise, and the included papers give an overview of the diversity of current research, both in terms of research questions and methodology. Some articles focus on aspects in one of the two domains, whereas other articles directly compare, contrast, or combine music and vocal communication.

Empirical studies on music perception include work by Eerola et al. (2013), in which they systematically manipulated musical cues to determine their effects on perception of emotion, and Droit-Volet et al. (2013), who altered acoustic elements associated with emotion to examine the effect of these changes on time perception. Effects of context on music understanding were also investigated: Spreckelmeyer et al. (2013) examined preattentive processing of emotion, measuring ERPs during the processing of a sad tone within the context of happy tones and the reverse. Schellenberg et al. (2012) demonstrated a listener preference for music that expressed emotion contrasting with an established context, and Loui et al. (2013) examined the role of vocals on perceived arousal and valence in songs.

Turning to emotional responses to music, Russo et al. (2013) developed models aimed at predicting the emotion being experienced using information in the listeners' physiological signals, and Altenmüller et al. (2014) used fMRI to investigate the neural basis of episodic memory for arousing film music. Following up on Gabrielsson's (2002) distinction between emotion felt by a listener and emotion expressed by a piece of music, Schubert (2013) provided a review and suggestions for future research on the internal and external loci of musical emotion. There were also two theoretical papers on musical emotions: Flaig and Large (2014) speculated that music may induce affective response by speaking to the brain in its own language by way of neurodynamics, and Allen et al. (2013) presented a view of the general nature of musical emotions based on studies on autism.

In the speech domain, Paulmann et al. (2013) used EEG to investigate influences of arousal and valence on cortical responses to emotional prosody. Rigoulot et al. (2013) used a gating paradigm to demonstrate the importance of utterance-final syllables in emotion recognition. Two papers focused on the role of specific acoustic cues in vocal expression: Weusthoff et al. (2013) discussed the role of fundamental frequency in the success of romantic relationships, and Yanushevskaya et al. (2013) examined the role of loudness, both independently and in conjunction with voice quality.

Several researchers undertook cross-cultural studies of emotion perception in speech and non-verbal vocalizations. Jürgens et al. (2013) examined the perception of German emotional speech tokens across three cultures. Waaramaa and Leisiö (2013) examined the recognition of emotion in Finnish pseudosentences by listeners from five countries. There were also three cross-cultural investigations of non-verbal vocalizations: Koeda et al. (2013) examined perception of emotional vocalizations by Canadian and Japanese listeners, Laukka et al. (2013) examined Swedish listeners' perception of vocalizations from four countries, and Sauter (2013) examined the role of motivation in the ingroup advantage for emotion recognition by presenting listeners with vocalizations produced by in- or out-group members.

Discussing the similarity between music and speech emotion expression, Juslin (2013) put forward the argument that this similarity lies at the "core" or basic emotion level, and that more complex emotions are more domain-specific. Several authors empirically tested the similarity and contrasts between music and vocal expression. Margulis (2013) posited that the relative preponderance of repetition in music compared to speech contributes to a fundamental difference between the two domains. Quinto et al. (2013) showed differences in the functions of pitch and rhythm between these domains. Weninger et al. (2013) synthesized information from databases including speech, music, and environmental sounds, and thereby took a step toward a holistic computational model of affect in sound. To aid future cross-domain research, Paquette et al. (2013) presented a new validated set of stimuli: a musical equivalent to vocal affective bursts. Bowling (2013) reviewed the affective character of musical modes, based in the biology of human vocal emotion expression, and Bryant (2013) further argued that research on music and emotion might benefit from research on form and function in non-human animal signals.

Three papers examined developmental and lifespan changes. Corbeil et al. (2013) contrasted the perception of speaking and singing in infancy, and found that it is not the domain (music or speech) that matters but rather the level of (positive) emotion. Wang et al. (2013) examined early auditory deprivation, asking children with cochlear implants to imitate happy and sad utterances. Vieillard and Gilet (2013) found an increase in positive responding to music with aging.

In sum, the main contribution of this Research Topic, along with highlighting the variety of research being done already, is to show the places of contact between the domains of music and vocal expression that occur at the level of emotional communication. In addition, we hope it will encourage future dialog among researchers interested in emotion in fields as diverse as computer science, linguistics, musicology, neuroscience, psychology, speech and hearing sciences, and sociology, who can each contribute knowledge necessary for studying this complex topic.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 March 2014; accepted: 15 April 2014; published online: 05 May 2014.*

*Citation: Bhatara A, Laukka P and Levitin DJ (2014) Expression of emotion in music and vocal communication: Introduction to the research topic. Front. Psychol. 5:399. doi: 10.3389/fpsyg.2014.00399*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2014 Bhatara, Laukka and Levitin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Emotional expression in music: contribution, linearity, and additivity of primary musical cues

#### *Tuomas Eerola<sup>1</sup>\*, Anders Friberg<sup>2</sup> and Roberto Bresin<sup>2</sup>*

*<sup>1</sup> Department of Music, University of Jyväskylä, Jyväskylä, Finland*

*<sup>2</sup> Department of Speech, Music, and Hearing, KTH - Royal Institute of Technology, Stockholm, Sweden*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Frank A. Russo, Ryerson University, Canada; Dan Bowling, University of Vienna, Austria*

#### *\*Correspondence:*

*Tuomas Eerola, Department of Music, University of Jyväskylä, Seminaarinkatu 35, Jyväskylä, FI-40014, Finland e-mail: tuomas.eerola@jyu.fi*

The aim of this study is to manipulate musical cues systematically to determine the aspects of music that contribute to emotional expression, whether these cues operate in an additive or interactive fashion, and whether the cue levels can be characterized as linear or non-linear. An optimized factorial design was used with six primary musical cues (mode, tempo, dynamics, articulation, timbre, and register) across four different music examples. Listeners rated 200 musical examples according to four perceived emotional characters (happy, sad, peaceful, and scary). The results exhibited robust effects for all cues, and the ranked importance of these was established by multiple regression. The most important cue was mode, followed by tempo, register, dynamics, articulation, and timbre, although the ranking varied across the emotions. The second main result suggested that most cue levels contributed to the emotions in a linear fashion, explaining 77–89% of variance in ratings. Quadratic encoding of cues did lead to minor but significant increases in the fit of the models (0–8%). Finally, the interactions between the cues were non-existent, suggesting that the cues operate mostly in an additive fashion, corroborating recent findings on emotional expression in music (Juslin and Lindström, 2010).

**Keywords: emotion, music cues, factorial design, discrete emotion ratings**

#### **INTRODUCTION**

One of the central reasons that music engages the listener so deeply is that it expresses emotion (Juslin and Laukka, 2004). Composers and performers capitalize on the potent emotional effects of music, as do the gaming, film, marketing, and music therapy industries. The way music arouses listeners' emotions has been studied from many different perspectives. One such method involves the use of self-report measures, where listeners note the emotions that they either recognize or actually experience while listening to the music (Zentner and Eerola, 2010). Another method involves the use of physiological and neurological indicators of the emotions aroused when listening to music (a recent overview of the field is given in Eerola and Vuoskoski, 2012). Although many extra-musical factors are involved in the induction of emotions (e.g., the context, associations, and individual factors, see Juslin and Västfjäll, 2008), the focus of this paper is on those properties inherent in the music itself that cause emotions to be perceived by the listener, which are generally related to the mechanism of emotional contagion (Juslin and Västfjäll, 2008).

Scientific experiments since the 1930s have attempted to determine the impact of such individual musical cues in the communication of certain emotions to the listener (Hevner, 1936, 1937). A recent summary of this work can be found in Gabrielsson and Lindström (2010), who state that the most potent musical cues, and also the most frequently studied, are *mode, tempo, dynamics, articulation, timbre*, and *phrasing*. For example, the distinction between happiness and sadness has received considerable attention; these emotions are known to be quite clearly distinguished through cues of *tempo, pitch height*, and *mode*: the expression of happiness is associated with faster tempi, a high pitch range, and a major rather than minor mode, and these cues are reversed in musical expressions of sadness (Hevner, 1935, 1936; Wedin, 1972; Crowder, 1985; Gerardi and Gerken, 1995; Peretz et al., 1998; Dalla Bella et al., 2001). Other combinations of musical cues have been implicated for different discrete emotions such as anger, fear, and peacefulness (e.g., Bresin and Friberg, 2000; Vieillard et al., 2008).

In real music, it is challenging to assess the exact contribution of individual cues to emotional expression because all cues are thoroughly intercorrelated. Here, the solution is to independently and systematically manipulate the cues in music by synthesizing variants of a given piece of music. Such a factorial design allows assessment of the causal role of each cue in expressing emotions in music. Previous studies on emotional expression in music using a factorial design have often focused on relatively few cues, as one has to manipulate each level of the factors separately, and the ensuing exhaustive combinations will quickly amount to an unfeasible total number of trials needed to evaluate the design. Because of this complexity, the existing studies have usually evaluated two or three separate factors using typically two or three discrete levels in each. For example, Dalla Bella et al. (2001) studied the contribution of *tempo* and *mode* to the happiness-sadness continuum. In a similar vein, Ilie and Thompson (2006) explored the contributions of *intensity*, *tempo*, and *pitch height* to three affect dimensions.

Interestingly, the early pioneers of music and emotion research did include a larger number of musical factors in their experiments. For example, Rigg's experiments (1937, 1940a,b, cited in Rigg, 1964) might have only used five musical phrases, but a total of seven cues were manipulated in each of these examples (*tempo, mode, articulation, pitch level, loudness, rhythm patterns*, and *interval content*). He asked listeners to choose between happy and sad emotion categories for each excerpt, as well as further describe the excerpts using precise emotional expressions. His main findings nevertheless indicated that *tempo* and *mode* were the most important cues. Hevner's classic studies (1935, 1937) manipulated six musical cues (*mode, tempo, pitch level, rhythm quality, harmonic complexity,* and *melodic direction*) and she observed that *mode, tempo* and *rhythm* were the determinant cues for emotions in her experiments. More contemporary, complex manipulations of musical cues have been carried out by Scherer and Oshinsky (1977), Juslin (1997c), and Juslin and Lindström (2010). Scherer and Oshinsky manipulated seven cues in synthesized sequences (*amplitude variation, pitch level, pitch contour, pitch variation, tempo, envelope*, and *filtration cut-off level*, as well as *tonality* and *rhythm* in their follow-up experiments) but again mostly with only two levels. They were able to account for 53–86% of the listeners' ratings of emotionally relevant semantic differential scales using linear regression. This suggests that a linear combination of the cues is able to account for most of the ratings, although some interactions did occur between the cues. Similar overall conclusions were drawn by Juslin (1997c), when he manipulated synthesized performances of "Nobody Knows The Trouble I've Seen" in terms of five musical cues (*tempo*, three levels; *dynamics*, three levels; *articulation*, two levels; *timbre*, three levels; and *tone attacks*, two levels).
The listeners rated happiness, sadness, anger, fearfulness, and tenderness on Likert scales. Finally, Juslin and Lindström (2010) carried out the most exhaustive study to date by manipulating a total of eight cues (*pitch, mode, melodic progression, rhythm, tempo, sound level, articulation*, and *timbre*), although seven of the cues were limited to two levels (for instance, tempo had a 70 bpm and a 175 bpm version). This design yielded 384 stimuli that were rated by 10 listeners for happiness, anger, sadness, tenderness, and fear. The cue contributions were determined by regression analyses. In all, 77–92% of the listener ratings could be predicted with the linear combination of the cues. The interactions between the cues only provided a small (4–7%) increase in predictive accuracy of the models and hence Juslin and Lindström concluded that the "backbone of emotion perception in music is constituted by the main effects of the individual cues, rather than by their interactions" (p. 353).

A challenge to the causal approach (experimental manipulation rather than correlational exploration) is choosing appropriate values for the cue levels. To estimate whether the cue levels operate in a linear fashion, they should also be varied in such a manner. Another significant problem is determining a priori whether the ranges of each cue level are musically appropriate, in the context of all the other cues and musical examples used. Fortunately, a recent study on emotional cues in music (Bresin and Friberg, 2011) established plausible ranges for seven musical cues, and this could be used as a starting point for a systematic factorial study of the cues and emotions. In their study, a synthesis approach was taken, in which participants could simultaneously adjust all seven cues of emotional expression to produce a compelling rendition of five emotions (neutral, happy, sad, scary, and peaceful) on four music examples. The results identified the optimal values and ranges for the individual musical cues, which can be directly utilized to establish both a reasonable range of each cue and also an appropriate number of levels so that each of the emotions could be well-represented in at least one position in the cue space for these same music examples.

#### **AIMS AND RATIONALE**

The general aim of the present study is to corroborate and test the hypotheses on the contribution of musical cues to the expression of emotions in music. The specific aims are: (1) to assess predictions from studies on musical cues regarding the causal relationships between primary cues and expressed emotions; (2) to assess whether the cue levels operate in a linear or non-linear manner; and (3) to test whether cues operate in an additive or interactive fashion. For such aims, a factorial manipulation of the musical cues is required since the cues are completely intercorrelated in a correlational design. Unfortunately, the full factorial design is especially demanding for such an extensive number of factors and their levels, as it requires a substantial number of trials (the product of the numbers of levels of all the factors) and an a priori knowledge of the settings for those factor levels. We already have the answers to the latter in the form of the previous study by Bresin and Friberg (2011). With regard to all the combinations required for such an extensive factorial design, we can reduce the full factorial design by using optimal design principles, in other words, by focusing on the factor main effects and low-order interactions while ignoring the high-order interactions that are confounded in the factor design matrix.

#### **MATERIALS AND METHODS**

A factorial listening experiment was designed in which six primary musical cues (*register, mode, tempo, dynamics, articulation*, and *timbre*) were varied on two to six scalar or nominal levels across four different *music structures*. First, we will go through the details of these musical cues, and then, we will outline the optimal design which was used to create the music stimuli.

#### **MANIPULATION OF THE CUES**

The six primary musical cues were, with one exception (mode), the same cues that were used in the production study by Bresin and Friberg (2011). Each of these cues has been previously implicated as having a central impact on emotions expressed by music [summary in Gabrielsson and Lindström (2010), and past factorial studies, e.g., Scherer and Oshinsky, 1977; Juslin and Lindström, 2010] and has a direct counterpart in speech expression (see Juslin and Laukka, 2003; except for mode, see Bowling et al., 2012). Five cues (*register, tempo, dynamics, timbre* and *articulation*, the scalar factors) could be seen as having linear or scalar levels, whereas *mode* (a nominal factor) contains two categories (major and minor). Based on observations from the production study, we chose to represent *register* with six levels, *tempo* and *dynamics* with five levels, and *articulation* with four levels. This meant that certain cues were deemed to need a larger range in order to accommodate different emotional characteristics, while others required less subtle differences between the levels (*articulation* and *timbre*). Finally, we decided to manipulate these factors across different *music structures* derived from a past study to replicate the findings using four different music excerpts, which we treat as an additional seventh factor. Because we assume that the physiological states have led to the configuration of cue codes, we derive predictions for each cue direction for each emotion based on the vocal expression of affect [from Juslin and Scherer (2005), summarized for our primary cues in **Table 3**]. For mode, which is not featured in speech studies, we draw on the recent cross-cultural findings, which suggest a link between emotional expression in modal music and speech mediated by the relative size of melodic/prosodic intervals (Bowling et al., 2012).
The comparisons of our results with those of past studies on musical expression of emotions rely on a summary by Gabrielsson and Lindström (2010) and individual factorial studies (e.g., Scherer and Oshinsky, 1977; Juslin and Lindström, 2010), which present a more or less comparable pattern of results to those obtained in the studies on vocal expression of emotions (Juslin and Laukka, 2003).

#### **OPTIMAL DESIGN OF THE EXPERIMENT**

A full factorial design with these particular factors would have required 14,400 unique trials to completely exhaust all factor and level couplings (6 × 5 × 5 × 4 × 2 × 3 × 4). As such an experiment is impractically large by any standards, a form of reduction was required. Reduced designs called fractional factorial designs (FFD) and response surface methodologies (RSM), collectively called *optimal designs*, provide applicable solutions; however, widespread usage of these techniques within the behavioral sciences is still rare in spite of their recommendation (see McClelland, 1997; Collins et al., 2009). The main advantage of optimal designs over full factorial designs is that they allow the research resources to be concentrated on particular questions, thereby minimizing redundancy and maximizing the statistical power. This is primarily done by eliminating high-order factor interactions (see Myers and Well, 2003, p. 332)<sup>1</sup>.
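The trial count quoted above follows from multiplying the number of levels of every factor. The following is a minimal sketch in Python; the assignment of level counts to factor names is our reading of the product and of the cue descriptions (in particular, the three levels for timbre are inferred from the product rather than stated in this section), and the variable names are hypothetical.

```python
import math
from itertools import product

# Level counts per factor; "timbre": 3 is inferred from the product in the text.
levels = {
    "register": 6,
    "tempo": 5,
    "dynamics": 5,
    "articulation": 4,
    "mode": 2,
    "timbre": 3,
    "music_structure": 4,
}

# Size of the full factorial design: the product of all level counts.
n_full = math.prod(levels.values())

# Equivalent check by enumerating every factor-level combination.
n_enumerated = sum(1 for _ in product(*(range(n) for n in levels.values())))

# Fraction of the full design covered by a 200-trial reduced design.
fraction = 200 / n_full
```

Under these assumptions, a reduced design of 200 trials samples well under 2% of the full factorial matrix, which is why the choice of trials matters.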

We constructed the factor design matrix so that the number of cases for each factor level was approximately equal for both main effects and first-order interactions. In this way, the design was compatible with traditional statistical analysis methods and also gave the listener a balanced array of factor combinations. In effect, this meant applying a D-optimal design algorithm to the full factorial matrix, to maximize the determinant of the information matrix (Box and Draper, 1987; Meyer and Nachtsheim, 1995). The number of maximum trials was set to 200, with the intention that each trial would use stimuli with a duration of 25 s, resulting in an estimated 80-min experiment. The factors are also orthogonal with respect to each other and, thus, are well-suited for statistical techniques such as regression. Details about the individual cues and their levels are given in the next section.
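The idea of selecting trials that maximize the determinant of the information matrix can be illustrated with a toy point-exchange search. This is only a sketch under stated assumptions: the three-factor candidate set, the main-effects model, the run count, and all function names are hypothetical, and it is not the algorithm or software used in the study.

```python
from itertools import product

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def info_det(rows):
    """det(X'X) for a model with an intercept and linear main effects."""
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    return det(xtx)

def d_optimal(candidates, n_runs):
    """Swap design rows for candidate rows while det(X'X) keeps increasing."""
    design = list(candidates[:n_runs])
    best = info_det(design)
    improved = True
    while improved:
        improved = False
        for i in range(n_runs):
            for cand in candidates:
                trial = design[:i] + [cand] + design[i + 1:]
                d = info_det(trial)
                if d > best + 1e-9:
                    design, best, improved = trial, d, True
    return design, best

# Toy candidate set: one three-level and two two-level factors, coded numerically.
candidates = list(product([-1.0, 0.0, 1.0], [-1.0, 1.0], [-1.0, 1.0]))
start_det = info_det(candidates[:6])
design, best_det = d_optimal(candidates, n_runs=6)
```

A simple exchange search like this can stop at a local optimum, so practical tools typically restart from several initial designs; the sketch only shows the criterion being optimized.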

#### **DETAILS OF THE SEVEN CUES**

#### *Mode (two nominal levels)*

The mode of each music example was altered using a modal translation so that an original piece in an Ionian major scale was altered to the Aeolian minor scale in the same key and vice versa. Thus, the translation from major to minor did not preserve a major dominant chord. For example, the V-I major progression was translated to Vm-Im. This translation was chosen because it allowed a simple automatic translation and also enhanced the minor quality of the examples according to informal listening.

#### *Tempo (five scalar levels)*

Tempo was represented by the average number of nonsimultaneous onsets per second over all voices (called notes per second, NPS). NPS was chosen to indicate tempo because the measure was nearly constant over different music examples when the subjects were asked to perform the same emotional expression in the production study (Bresin and Friberg, 2011). The five different levels were 1.2, 2, 2.8, 4.4, and 6 NPS, corresponding to approximately the median values for the different emotions in the production study.

#### *Dynamics (five scalar levels)*

The range of the dynamics was chosen corresponding to the typical range of an acoustic instrument, which is about 20 dB (Fletcher and Rossing, 1998). The step size corresponds roughly to the musical dynamics marks pp, p, mp/mf, f, ff: −10, −5, 0, +5, +10 dB, respectively. These values corresponded to the ones obtained in the production study. The dynamics values in dB controlled the sample synthesizer (see below). The resulting sound was not just a simple scaling of the sound level, since the timbre also changed according to the input control. This change corresponds to how the sound level and timbre change simultaneously according to played dynamics in the real counterpart of the respective acoustic instrument.

#### *Articulation (four scalar levels)*

The articulation here is defined as the duration of a note relative to its interonset interval. Thus, a value of 1 corresponds to *legato*, and a value of ∼0.5 to *staccato*. The articulation was applied using three rules from the previously developed rule system for music performance (Bresin, 2001; Friberg et al., 2006). The *Punctuation* rule finds small melodic fragments and performs the articulation on the last note of each fragment, so it is longer with a micropause after it (Friberg et al., 1998). The *Repetition* rule performs a repetition of the chosen note with a micropause between. Finally, the *Overall articulation* rule simply applies the articulation to all the notes except very short ones. In addition, a limit on the maximum articulation was imposed to ensure that the duration of each note would not be too short. Using this combination of rules, the exact amount of articulation varied depending on the note. However, the four different levels roughly corresponded to the values 1, 0.75, 0.5, 0.25, that is, a range from *legato* to *staccatissimo*. The same combination of rules was used in the production study.

<sup>1</sup>Consider a full factorial design with 8 factors, each with 2 levels (2<sup>8</sup>), requiring 256 combinations to be tested. For factor effects, the degrees of freedom (initially 255) would be 8 for factor *main effects*, 28 for *two-factor interaction effects* and the remaining 219 degrees of freedom (255 − 8 − 28 = 219) for the *higher order interaction* effects. In this design, 86% (219/255) of the research resources would be utilized to assess the higher-order (3rd, 4th, etc.) interaction effects that are of no primary interest and difficult to interpret. The extent of this waste of effort is proportional to the number of factor levels in the design and hence in our design, the higher order factor interactions cover 98.6% of the full factorial design matrix.
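The definition of articulation used in this subsection (sounding duration relative to the interonset interval, with a safeguard against overly short notes) can be illustrated as follows. This sketch covers only the basic ratio and a duration floor; it does not reproduce the *Punctuation* or *Repetition* rules, and the function name and the 50 ms floor are hypothetical values chosen for the example, not parameters reported in the study.

```python
def articulated_duration(ioi_ms, articulation, min_duration_ms=50.0):
    """Sounding duration of a note for a given articulation ratio.

    ioi_ms          -- interonset interval to the next note, in milliseconds
    articulation    -- duration / IOI (1.0 = legato, about 0.5 = staccato)
    min_duration_ms -- hypothetical floor so that notes never become too short
    """
    wanted = articulation * ioi_ms
    # Never shorter than the floor, never longer than the IOI itself.
    return min(ioi_ms, max(wanted, min_duration_ms))

# The four approximate articulation levels named in the text.
for level in (1.0, 0.75, 0.5, 0.25):
    print(level, articulated_duration(400.0, level))
```

With a 400 ms interonset interval, the four levels give sounding durations of 400, 300, 200, and 100 ms; for much shorter intervals the floor takes over, which is one way the effective articulation can vary from note to note.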

#### *Timbre (three scalar levels)*

Three different instrument timbres were used for the melody voice: flute, horn, and trumpet. The same timbres were also used in the production experiment and were initially chosen for their varied expressive character, namely brightness, which has been found to have a large impact on emotional ratings in a previous experiment (Eerola et al., 2012). The estimation of brightness was based on the amount of spectral energy below a cut-off of 1500 Hz, because this correlated strongly (*r* = −0.74, *p* < 0.001, *N* = 110) with the listeners' ratings when they were asked to judge the emotional valence of 110 isolated instrument sounds (Eerola et al., 2012). The flute has the lowest and the trumpet the highest brightness value.

#### *Register (six scalar levels)*

The whole piece was transposed so that the average pitches of the melody were the following: F3, B3, F4, B4, F5, and B5 corresponding to the MIDI note numbers 53, 59, 65, 71, 77, and 83, respectively. These values were close to the actual settings for the different emotions in the production study.
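The correspondence between the six register levels and their MIDI note numbers can be verified with a short conversion routine. The sketch assumes the octave convention in which middle C (C4) is MIDI note 60, which is the convention consistent with the numbers quoted above; the function name is illustrative.

```python
# Semitone offsets of the natural notes above C.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi_number(name):
    """Convert a natural note name such as 'F3' to a MIDI number (C4 = 60)."""
    letter, octave = name[0], int(name[1:])
    return 12 * (octave + 1) + NOTE_OFFSETS[letter]

register_levels = ["F3", "B3", "F4", "B4", "F5", "B5"]
midi_levels = [midi_number(n) for n in register_levels]
steps = [b - a for a, b in zip(midi_levels, midi_levels[1:])]
print(midi_levels, steps)
```

The successive differences show that the register levels are evenly spaced at six semitones, so the six levels span two and a half octaves.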

#### *Music structure (four nominal levels)*

Finally, the seventh cue, music structure, was added in order to extend the design across four different music examples chosen from the Montreal battery of composed emotion examples (Vieillard et al., 2008). Each example represented a different emotion and was selected according to how it had been validated by Vieillard et al. (2008). Therefore, the selected examples were from among the most unambiguous examples of sadness (T01.mid in the original stimulus set), happiness (G04.mid), peacefulness (A02.mid), and fear (P02.mid) from the study by Vieillard et al. Because the study consisted of four different musical examples, many compositional factors like melody, harmony, and rhythm varied simultaneously; these same four music examples were also used in the previous production study (Bresin and Friberg, 2011).

#### **CREATION OF THE STIMULI**

The stimulus examples were generated with an algorithm using the Director Musices software (Friberg et al., 2000). The resulting MIDI files were rendered into sound using the Vienna Symphonic Library with the Kontakt 2 sampler. This library contains high-quality, performed sounds for different instruments using different sound levels, registers, and playing techniques<sup>2</sup>. All the accompaniment voices were played on a sampled piano (Steinway light) and the melody voices were played on samples of each solo instrument (horn, flute, and trumpet). The sound level of each instrument was measured for a range of different MIDI velocity values and an interpolation curve was defined, making it possible to specify the dynamics in decibels, which was then translated to the right velocity value in the MIDI file. The onset delays were adjusted aurally for each solo instrument in such a manner that simultaneous notes in the piano and in the solo instrument were perceptually occurring at the same time. The resulting audio was saved in non-compressed stereo files (16-bit wav) with the sampling rate at 44.1 kHz. Examples of the stimuli are available as Supplementary material (Audio files 1–4 that represent prototypical examples of each rated emotion).
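The translation from a dynamics value in decibels to a MIDI velocity, as described above, amounts to inverting a measured level curve by interpolation. The sketch below uses piecewise-linear interpolation over a small calibration table; the table values are invented for illustration (the measured curves are not reported in the text), and the shape of the interpolation curve actually used is not specified, so linear interpolation is an assumption.

```python
from bisect import bisect_left

# Hypothetical calibration table: (MIDI velocity, measured relative level in dB).
# These numbers are invented for illustration only.
CALIBRATION = [(20, -14.0), (40, -8.0), (64, -2.0), (90, 4.0), (110, 8.0), (127, 11.0)]

def velocity_for_db(target_db, table=CALIBRATION):
    """Return the MIDI velocity whose interpolated level matches target_db."""
    levels = [db for _, db in table]
    if target_db <= levels[0]:
        return table[0][0]       # clamp below the measured range
    if target_db >= levels[-1]:
        return table[-1][0]      # clamp above the measured range
    i = bisect_left(levels, target_db)
    (v0, d0), (v1, d1) = table[i - 1], table[i]
    fraction = (target_db - d0) / (d1 - d0)
    return round(v0 + fraction * (v1 - v0))

# The five dynamics levels used in the experiment, in dB.
dynamics_db = [-10, -5, 0, 5, 10]
velocities = [velocity_for_db(db) for db in dynamics_db]
print(velocities)
```

A separate table per instrument would be needed in practice, since each sampled instrument responds differently to velocity.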

#### **PROCEDURE**

The subjects were seated either in a semi-anechoic room (Stockholm) or in a small laboratory room (Jyväskylä). Two loudspeakers (Audio-Pro 4–14 in Stockholm/Genelec 8030 in Jyväskylä) were placed slightly behind and on either side of the computer screen. The sound level at the listening position was calibrated to be at 72 dB (C). Several long notes of the horn were used as the calibration signal, performed at the middle scalar value of *dynamics* (0 dB, as detailed above).

The subjects were first asked to read the written instructions (in Swedish, English, or Finnish). Their task was to rate each example (*n* = 200) on each of the emotions provided (four concurrent ratings for each example). They were asked to focus on emotional expression (i.e., perceived emotions rather than felt emotional experiences) of the example and the ratings were made on a seven-point Likert scale. The emotions were tender/peaceful, happy, sad, angry/scary in Stockholm and tender, peaceful, happy, sad, and angry in Jyväskylä. One reason behind the variation in terms between the laboratories was to compare the terms used in the original study by Vieillard et al. (2008) to terms frequently used by other studies adopting the basic emotion concepts for music (e.g., Bresin and Friberg, 2000; Juslin, 2000; Juslin and Lindström, 2010; Eerola and Vuoskoski, 2011). The second reason to vary the labels was to explore whether collapsing the ratings of similar emotions (e.g., tender and peaceful) would result in large differences when compared to the uncollapsed versions of the same emotions. A free response box was also provided for the participants to use in cases where none of the given emotion labels could be satisfactorily used to describe the stimulus. However, we will not carry out a systematic analysis of these textual responses here, as they were relatively rare (the median number of excerpts commented on was 2 out of 200, the mean 3.4, *SD* = 4.7) and the participants that did comment did not comment on the same examples, which further hinders such an analysis.

The stimuli were presented in a different random order for each participant. The scale's position had no influence on response patterns. The experiment itself was run using the program Skatta<sup>3</sup> at Stockholm and a patch in MAX/MSP at Jyväskylä. For each example, there was a play button and four different sliders for the corresponding emotion labels. The subject was free to repeat the examples as many times as he/she wished. The whole session took between 1 and 2 h to complete. The subjects were also encouraged to take frequent pauses, and refreshments were available.

<sup>2</sup>More technical information about the Vienna Symphonic Library is available from http://vsl.co.at/ and about Kontakt 2 from http://www.native-instruments.com/.

<sup>3</sup>http://sourceforge.net/projects/skatta/

#### **PARTICIPANTS**

In all, 46 participants took part in the experiment, 20 in Stockholm and 26 in Jyväskylä. Because the ratings collected in these two laboratories were nearly identical (detailed later), we will not document all the data gathered in each of the laboratories separately. The mean age of all participants was 30.2 years (*SD* = 8.7); 20 of the participants were female and 25 were male, and one participant did not indicate his/her gender. Most of the participants had an extensive musical background as, between them, they reported having music as a hobby for an average of 16.1 years (*SD* = 10.5) and studying music at a professional level for an average of 7.0 years (*SD* = 6.3). Their musical taste was a mixture of many styles, and the participants also represented various ethnicities (some of whom were not native speakers of Swedish or Finnish). All participants were compensated for their efforts (≈9 €).

#### **RESULTS**

The description of the analysis will proceed according to the following plan. First, the consistencies of the ratings across and between the emotions will be reported. Next, the main hypotheses will be investigated using a series of regression analyses. The first regression analysis will address the contribution of cues to the emotions, the second one will address the linearity of the cue levels, and the third one will seek to quantify the degree of interactions between the cues in the data and compare the results with those obtained using additive models. All of the analyses will be carried out separately for each of the four emotions.

#### **INTER-RATER CONSISTENCY**

There were no missing data, and no univariate (in terms of the z-scores) or bivariate outliers were identified (using squared Mahalanobis distances with *p* < 0.05 according to Wilks' method, 1963). The inter-rater consistency among the participants was high at both laboratories (the Cronbach α scores were between 0.92 and 0.96 in Stockholm, and 0.94 and 0.97 in Jyväskylä). Because of substantial inter-participant agreement for each emotion, and because individual differences were not of interest, the analyses that follow treat the stimulus (*N* = 200) as the experimental unit, with the dependent variable being the mean rating averaged across all participants. The Pearson correlations between the mean ratings from the two laboratories were also high for the identical emotion labels (*r*[198] = 0.94 and 0.89 for happy and sad, both with *p* < 0.0001). For the emotion labels that were varied between the laboratories, significant correlations between the variants also existed; tender/peaceful (Stockholm) and peaceful (Jyväskylä) correlated highly (*r* = 0.81, *p* < 0.0001, *N* = 200), as did tender/peaceful (Stockholm) and tender (Jyväskylä), *r* = 0.89. In addition, angry/scary (Stockholm) and angry (Jyväskylä) exhibited a similar, highly linear trend (*r* = 0.96, *p* < 0.0001, *N* = 200). Due to these high correspondences between the data obtained from the two laboratories, we pooled tender/peaceful (Stockholm) with tender and peaceful (Jyväskylä) into *peaceful*, and angry/scary (Stockholm) with angry (Jyväskylä) into *scary*.
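For readers unfamiliar with the consistency index reported above, the following sketch computes Cronbach's α for a raters-by-stimuli table, treating each rater as an "item". The toy ratings are invented, and treating raters as items is our reading of how the index is applied to inter-rater consistency here rather than a detail stated in the text.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha with raters as items.

    ratings[r][s] is the rating given by rater r to stimulus s.
    """
    k = len(ratings)
    rater_variances = [variance(row) for row in ratings]
    # Sum over raters for each stimulus, then the variance of those sums.
    totals = [sum(column) for column in zip(*ratings)]
    return k / (k - 1) * (1 - sum(rater_variances) / variance(totals))

# Invented toy data: three raters, five stimuli, seven-point scale.
toy_ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 6],
]
alpha = cronbach_alpha(toy_ratings)
print(round(alpha, 3))
```

Values close to 1 indicate that the raters order the stimuli in nearly the same way, which is the justification given above for averaging ratings across participants.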

#### **CORRELATIONS BETWEEN PERCEIVED EMOTIONS**

Next, we explored intercorrelations between the emotion ratings by looking specifically at correlations between the four consistently rated emotions (*happy, sad, peaceful,* and *scary*). These displayed a typical pattern, wherein *happy* correlated negatively with *sad* (*r* = −0.79, *p* < 0.001, *N* = 200), *happy* correlated positively with *peaceful*, albeit weakly (*r* = 0.21, *p* < 0.01), and *happy* correlated negatively with *scary* (*r* = −0.56, *p* < 0.001). *Sad* was weakly correlated with *peaceful* (*r* = 0.16, *p* < 0.05), while *sad* showed no correlation with *scary* (*r* = 0.04, *p* = 0.55). Finally, *peaceful* and *scary* exhibited a significant opposite trend, as would perhaps be expected (*r* = −0.72, *p* < 0.001). Similar patterns have also been observed in a study by Eerola and Vuoskoski (2011).

Next, we investigated the emotion scales with examples that were judged highest for each emotion to see the overall discrimination of the scales (see **Figure 1**; these examples are also given as audio files 1–4). Each of these prototype examples is clearly separated from the other emotions, yet the overall pattern reveals how particular emotions are related to other emotions. For instance, happy and sad prototypes also get modest ratings in peaceful, and the peaceful prototype scores similar ratings in sadness. However, these overlaps do not imply explicit confusions between the emotions, as evidenced by 95% confidence intervals. This suggests that all four scales are measuring distinct aspects of emotions in this material. The exact cue levels for each prototype, shown on the top panels, clearly show four distinct cue patterns. Interestingly, there are not only extreme cue levels used in the optimal profiles (e.g., low tempo, dynamics, and articulation and high register for peaceful) but also intermediate levels being used (e.g., middle register and dynamics for sad and happy prototypes). However, a structured analysis of the cue contributions is carried out in the next sections.

**FIGURE 1 | Means and 95% confidence intervals of four emotion ratings for four prototype examples that received the highest mean on each emotion.**

#### **CUE CONTRIBUTIONS TO EMOTIONS**

For an overview of the cues and their levels for each emotion rating, a visualization of the mean ratings is given in **Figure 2**. Most cues exhibited a strikingly clear pattern across the levels for most of the four emotions. For example, *register* can be seen to have had a clear effect on the emotions happiness and fearfulness. A higher register corresponded to a higher happiness rating while a lower register corresponded to a higher fearfulness rating. Similar trends were displayed in *tempo, mode, dynamics* and *articulation*, though the specific emotions and the directions of the cue levels were different. It is also worth noting that the nominal cues, *mode* and *music structure*, showed large differences across the cue levels. This suggests that these cues had a powerful impact on each emotion rating scale. For *music structure*, the appropriate emotion can always be seen as a peak in the mean ratings of that emotion. In other words, the prototypically "happy" musical example was consistently rated by participants to be the highest in happiness, not in other emotions. This effect was most pronounced in the case of scary and least evident in peacefulness.

To assess the impact of each cue for each emotion, regression analyses were carried out for each emotion using all the cues (see **Table 1**).

As can be observed from **Table 1**, the ratings of all emotions can be predicted to a high degree (77–89%) by a linear coding of the five scalar cues. Beta coefficients facilitate the interpretation of the model, and the squared semipartial correlations (*sr*<sup>2</sup>) are useful for showing the importance of any particular cue within the regression equation, as they show the unique proportion of variance explained by that cue. The cues are ranked along the median *sr*<sup>2</sup> values across the emotions. Note that the *music structure* cue is displayed using three dummy-coded variables, allowing us to discriminate between the effects related to the four different music structures used. Scary is predominantly communicated by the structure of the music (a nominal cue), in that a combination of low register, minor mode, and high dynamics contributes to these ratings. The most effective way of expressing happiness is a major mode, fast tempo, high register, and staccato articulation within this particular set of examples. For sadness, the pattern of beta coefficients is almost the reverse of this, except that a darker timbre and a decrease in dynamics also contribute to the ratings. These patterns are intuitively clear and consistent with previous studies (Juslin, 1997c; Juslin and Lindström, 2003, 2010).


**Table 1 | Summary of regression models for each emotion with linear predictors (mode and music structure are encoded in a non-linear fashion).**

*df = 9, 190; \*p < 0.05, \*\*p < 0.01, \*\*\*p < 0.001. β, standardized betas; R<sup>2</sup><sub>adj</sub>, R<sup>2</sup> adjusted, corrected for multiple independent variables.*

The first thing we see is that the relative contributions of the cues vary markedly for each emotion, just as in previous studies (Scherer and Oshinsky, 1977; Juslin, 1997c, 2000; Juslin and Lindström, 2010). For example, *mode* is extremely important for happy and sad emotions (*sr*<sup>2</sup> = 0.48 and 0.54), whereas it has a relatively low impact on scary and peaceful (*sr*<sup>2</sup> = 0.08 and 0.05). Similar asymmetries are apparent in other cues as well. For instance, *dynamics* significantly contributes to scary and peaceful emotions (*sr*<sup>2</sup> = 0.08 and 0.14) but has little impact on happy and sad (*sr*<sup>2</sup> = −0.01 and 0.01). This latter observation is somewhat puzzling, as previously, dynamics has often been coupled with changes in valence (Ilie and Thompson, 2006) and happy or sad emotions (Adachi and Trehub, 1998; Juslin and Laukka, 2003). However, when direct comparisons are made with other factorial studies of emotional expression (Scherer and Oshinsky, 1977; Juslin, 1997c; Juslin and Lindström, 2010), it becomes clear that dynamics have also played a relatively weak role in sad and happy emotions in these studies. If we look at the cues that contributed the most to the ratings of sadness, namely *mode* and *tempo,* we can simply infer that the ratings were primarily driven by these two factors.
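The squared semipartial correlation discussed above can be obtained as the drop in R<sup>2</sup> when one predictor is removed from the full model. The sketch below illustrates this on synthetic data; it is not the analysis pipeline of the study, the data and coefficients are invented, the function names are illustrative, and NumPy is assumed to be available.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    ss_res = residuals @ residuals
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1.0 - ss_res / ss_tot

def squared_semipartials(X, y):
    """sr^2 for each column: R^2 of the full model minus R^2 without that column."""
    full = r_squared(X, y)
    return [full - r_squared(np.delete(X, j, axis=1), y) for j in range(X.shape[1])]

# Synthetic example: 200 observations, three predictors of decreasing importance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)
sr2 = squared_semipartials(X, y)
print([round(v, 3) for v in sr2])
```

Because removing a predictor from a least-squares model cannot increase R<sup>2</sup>, values computed this way are non-negative up to rounding.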

The overall results of the experiment show that the musical manipulations of all cues lead to a consistent variation in emotional evaluations and that the importance of the musical cues bears a semblance to the synthetic manipulations of musical cues made in previous studies. We will summarize these connections later in more detail. Instead of drawing premature conclusions on the importance of particular musical cues and the exceptions to the theory, we should wait until the specific properties of the cue levels have been taken into account. These issues will therefore be addressed in-depth in the next section.

#### **LINEARITY VERSUS NON-LINEARITY OF CUE LEVELS**

We used hierarchical regression analysis to estimate three qualities of the cue levels (namely linear, quadratic, and cubic) as well as the overall contribution of the cues themselves because this is the appropriate analysis technique for an optimal design with a partial factor interaction structure (e.g., Myers and Well, 2003, pp. 615–621; Rosenthal and Rosnow, 2008, p. 476).

The cue levels were represented using (a) linear, (b) quadratic, and (c) cubic encodings, with the mean ratings over subjects as the dependent variable (200 observations for each emotion). Each emotion was analyzed separately. This was applied to all five scalar cues. For completeness, the nominal cues (*mode* and *music structure*) were also included in the analysis and were coded using dummy variables.

*Mode* used one dummy variable, where 0 indicated a minor and 1 a major key; while music structure used three dummy variables in order to accommodate the non-linear nature of the cue levels. None of the cues were collinear (variance inflation factors <2 for all cues) as they were the by-product of optimal factorial design. **Table 1** displays the prediction rates, the standardized beta coefficients as well as squared semi-partial correlations for each cue and emotion.

Step 1 of the hierarchical regression is equal to the results reported in **Table 1**. Based on **Figure 2** and previous studies, we might think that linear coding does not do full justice to certain cues, such as *register* or *timbre*. To explore this, we add quadratic encoding of the five cues (register, tempo, dynamics, articulation, and timbre) to this regression model at Step 2. As quadratic encoding alone would reflect both linear and quadratic effects, the original linear version of the variable in question was kept in the analysis to partial out linear effects (Myers and Well, 2003, pp. 598–559). Adding the quadratic variables at Step 2 results in increased fit for scary [+3%, *F*(185, 5) = 10.0, *p* < 0.001], sad [+0.05%, *F*(185, 5) = 2.4, *p* < 0.05], and peaceful [+8%, *F*(185, 5) = 23.5, *p* < 0.001] emotions but no increase for the ratings of happy emotion (see **Table 2**). For the ratings of scary emotion, quadratic versions of *register, dynamics*, and *timbre* are responsible for the increased fit of the model, which suggests that these particular cues do contribute to the emotions in a non-linear fashion.

*df refers to the number of predictors in the model; F denotes the comparison of the models at Steps 1, 2, and 3 for the complete regression models, and t the individual significance of the cues. \*\*\*p < 0.001, \*\*p < 0.01, \*p < 0.05.*

A similar observation was made in the ratings of peacefulness. Quadratic variants of timbre, register, tempo, articulation, and dynamics provided a statistically significant change to the model at Step 2 (+8.0%, see **Table 2**). Ratings of sad emotion also received a marginal, albeit statistically significant, change at Step 2 due to the contribution of quadratic encoding of *tempo*. The overall improvement of these enhancements will be presented at the end of this section. At Step 3, cubic versions of the five cues (register, tempo, dynamics, articulation, and timbre) were added to the regression model, but these did not lead to any significant improvements beyond Step 2 in any emotion (see **Table 2**).

For all of these cues and emotions, cubic variants of the cue levels did not yield a better fit with the data than the quadratic versions. It is also noteworthy that the quadratic versions of the cues were included as additional cues, in that they did not replace the linear versions of the cues. This suggests that some of the cue levels violated the linearity of the factor levels. Small but significant quadratic effects could thus be observed in the data, mainly for the cues of *timbre, dynamics* and *register*, and these were specifically concerned with the emotions of scary and peacefulness. In the context of all of the cues and emotions, the overall contribution of these non-linear variants was modest at best (0–8% of added prediction rate) but nevertheless revealed that linearity cannot always be supported. Whether this observation relates to the chosen cue levels or to the actual nature of the cues remains open at present. The overarching conclusion is that many of the cue levels were successfully chosen and represented linear steps based on the production experiment (Bresin and Friberg, 2011). These selected levels predominantly communicated changes in emotional characteristics to the listeners in a linear fashion.
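The step-wise comparisons in this section rest on the standard F test for a change in R<sup>2</sup> between nested regression models. The sketch below shows the computation and the degrees-of-freedom bookkeeping: with 200 observations, nine predictors at Step 1 (five scalar cues, one dummy for mode, three for music structure) and five added quadratic terms at Step 2, the residual degrees of freedom are 200 − 14 − 1 = 185, matching the values reported above. The R<sup>2</sup> values in the example are hypothetical placeholders, not figures from Table 2.

```python
def f_change(r2_reduced, r2_full, n_obs, k_full, k_added):
    """F statistic for the R^2 increase when k_added predictors are added.

    r2_reduced -- R^2 of the smaller (nested) model
    r2_full    -- R^2 of the larger model
    n_obs      -- number of observations
    k_full     -- number of predictors in the larger model (without intercept)
    k_added    -- number of predictors added at this step
    Returns (F, numerator df, denominator df).
    """
    df_residual = n_obs - k_full - 1
    f_value = ((r2_full - r2_reduced) / k_added) / ((1.0 - r2_full) / df_residual)
    return f_value, k_added, df_residual

# Step 1: 9 linear/dummy predictors; Step 2: 5 quadratic terms added.
# The two R^2 values below are hypothetical and serve only as an example.
f_value, df_num, df_den = f_change(0.80, 0.83, n_obs=200, k_full=14, k_added=5)
print(round(f_value, 2), df_num, df_den)
```

The same function applies to Step 3 by passing the Step 2 fit as the reduced model and increasing `k_full` by the number of cubic terms added.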

#### **ADDITIVITY vs. INTERACTIVITY OF THE CUES**

Previous findings on the additivity or interactivity of musical cues are inconsistent. According to Juslin (1997c), Juslin and Lindström (2010), and Scherer and Oshinsky (1977), cue interactions are of minor importance (though not inconsequential), whereas others have stressed the importance of cue interactions (Hevner, 1936; Rigg, 1964; Schellenberg et al., 2000; Juslin and Lindström, 2003; Lindström, 2003, 2006; Webster and Weir, 2005). To evaluate the degree of cue interactions in the present data, a final set of regression analyses was carried out. In these analyses, each two-way interaction was tested separately for each emotion (21 tests for each emotion) using the mean ratings (*N* = 200). This analysis failed to uncover any interactions between the cues in any emotion after correcting for multiple testing (all 84 comparisons resulted in non-significant interactions, *p* > 0.315, *df* = 196). It must be noted that some of the interactions that would be significant without corrections for multiple testing (*register* and *mode*, and *mode* and *tempo* in happiness; *mode* and *tempo* in sadness) are classic interacting cues of musical expression (Scherer and Oshinsky, 1977; Dalla Bella et al., 2001; Webster and Weir, 2005), and could be subjected to more thorough, multi-level modeling with individual (non-averaged) data.

In conclusion, the results of the analysis of additivity vs. interactivity were consistent with the observations made by Scherer and Oshinsky (1977), Juslin (1997c), and Juslin and Lindström (2010) that the cue interactions are comparatively small or non-existent, and that additivity is a parsimonious way to explain the emotional effects of these musical cues.
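The number of interaction tests reported in this section follows directly from the design: seven cues give 21 unordered pairs, and four emotions give 84 tests in total. The sketch below reproduces that count and shows one common way of correcting for multiple testing. The text does not state which correction was applied, so the Bonferroni threshold here is an assumption made purely for illustration.

```python
from itertools import combinations

cues = ["register", "tempo", "dynamics", "articulation",
        "mode", "timbre", "music structure"]
emotions = ["happy", "sad", "peaceful", "scary"]

# Every unordered pair of cues defines one two-way interaction.
cue_pairs = list(combinations(cues, 2))
n_tests = len(cue_pairs) * len(emotions)

# Assumption: a Bonferroni correction at a family-wise alpha of 0.05.
alpha = 0.05
bonferroni_threshold = alpha / n_tests

def survives_correction(p_value):
    """True if an uncorrected p value stays significant after correction."""
    return p_value < bonferroni_threshold

print(len(cue_pairs), n_tests, bonferroni_threshold)
```

Under such a correction, an interaction with an uncorrected *p* of 0.01 would no longer count as significant, which is consistent with the observation above that a few classic interactions are significant only without correction.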

### **DISCUSSION**

The present study has continued and extended the tradition of manipulating important musical cues in a systematic fashion to evaluate, in detail, what aspects of music contribute to emotional expression. The main results brought out the ranked importance of the cues by regression analyses (cf. **Table 1**). The nominal cue, *mode*, was ranked as being of the highest importance, with the other cues ranked afterwards in order of importance as follows: *tempo, register, dynamics, articulation,* and *timbre*, although the ranking varied across the four emotions and music structures. Seventy-nine percent of the cue directions for each emotion were in line with physiological state theory (Scherer, 1986) and, simultaneously, in accordance with the previous results from studies on the cue directions in music (e.g., Hevner, 1936; Juslin, 1997c; Gabrielsson and Lindström, 2001; Juslin and Lindström, 2010). The second main result suggested that most cue levels contributed to the emotions in a linear fashion, explaining 77–89% of variance in the emotion ratings. Quadratic encoding of three cues (*timbre, register*, and *dynamics*) did lead to minor yet significant increases in model fit (0–8%). Finally, no significant interactions between the cues were found, suggesting that the cues operate in an additive fashion.

A plausible theoretical account of how these particular cue combinations communicate emotional expressions connects the cues to underlying physiological states. This idea, first proposed by Spencer in 1857, builds on the observation that different emotions cause physiological changes that alter vocal expression (e.g., increased adrenalin production in a frightened state tightens the vocal cords, producing a high-pitched voice). This physiological state explanation (Scherer, 1986) is typically invoked to explain emotions expressed in speech, since it accounts for the cross-cultural communication of emotions (Scherer et al., 2001) and assumes that these state-cue combinations have been adapted to common communicational use, even without the necessary underlying physiological states (e.g., Bachorowski et al., 2001). This theoretical framework has an impact on musically communicated emotions as well, because many of the cues (speech rate, mean *F*0, voice quality) that contribute to vocally expressed emotions have been observed to operate in an analogous fashion in music (e.g., Juslin and Laukka, 2003; Bowling et al., 2012). This theory enables direct predictions of the cue properties (importance and cue directions) that convey particular emotions. We have compiled the predictions from expressive vocal cues (Juslin and Scherer, 2005) and expressed emotions in music in **Table 3**. When we look at the summary of the cue directions from the present study, also inserted in **Table 3**, out of 24 predictions of cue directions based on vocal expression, 19 operated in the manner predicted by the physiological state theory (Scherer, 1986), three went against the predictions, and two were inconclusive (see **Tables 1**, **3**). Two of the aberrations from the theory were related to register, which is known to have varying predictions in vocal expression with respect to the type of anger (hot vs. cold anger, see Scherer, 2003). The third conflict with the theory concerns tempo. Previous studies of the musical expression of emotions have suggested *tempo* as the most important cue (Gundlach, 1935; Hevner, 1937; Rigg, 1964; Scherer and Oshinsky, 1977; Juslin and Lindström, 2010), whereas here *mode* takes the lead. We speculate that the nominal nature of *mode* led to higher effect sizes than linearly spaced levels of *tempo*, but this obviously warrants further research.

*L, Linear; C, Categorical; Pred, predictions; Res, results;* -*, high;* =*, moderate; , low;* -*, minor; , major—refers to not statistically significant, and letters in music structure refer to predictions based on Montreal Battery (F, Fearful; H, Happy; S, Sad; P, Peaceful). Predictions are based on Juslin and Scherer (2005), except Mode is based on Gabrielsson and Lindström (2010).*

We interpret these results as strengthening the view that the musical cues may have been adopted from vocal expression (see Bowling et al., 2012, for a similar argument). We also acknowledge the past empirical findings on the expressive properties of music [e.g., as summarized in Gabrielsson and Lindström (2010)], but since these largely overlap with the cues in vocal expression (Juslin and Laukka, 2003), we rely on vocal expression for the theoretical framework and use past empirical studies of music as supporting evidence. It must be noted that expressive speech has also been used as a source of cues that are normally deemed solely musical, such as mode (Curtis and Bharucha, 2010; Bowling et al., 2012).

A further challenge related to the reliable communication of emotions via cue combinations is that the same cue levels may have different contributions to different emotions (e.g., the physiological state of heightened arousal causes a high speech rate or musical tempo, which is the same cue for both fearfulness and happiness, or, as in the reverse situation, a low *F*<sup>0</sup> conveys boredom, sadness, and peacefulness). An elegant theoretical solution is provided by Brunswik's lens model (adapted to vocal emotions by Scherer in 1978), which details the process of communication from (a) the affective state expressed, through (b) acoustic cues and (c) the perceptual judgments of the cues, to (d) the integration of the cues. The lens model postulates that cues operate in a *probabilistic* fashion to stabilize the noise inherent in the communication (individual differences, contextual effects, environmental noise; the same cues may contribute to more than one emotion). Specifically, Brunswik coined the term *vicarious functioning* (1956, pp. 17–20) to describe how individual cues may be substituted by other cues in order to tolerate the noise in the communication. This *probabilistic functionalism* helps to form stable relationships between the emotion and the interpretation. In emotions expressed by music, Juslin has employed the lens model as a framework to clarify the way expressed emotions are communicated from performer to listener (Juslin, 1997a,b,c, 2000).

The cue substitution property of the lens model presumes that there are no significantly large interactions between the cues, because the substitution principle typically assumes an additive function for the cues (Stewart, 2001). Therefore, our third research question asked whether the cues in music contribute to emotions in an additive or interactive fashion. Significant interactions would hamper the substitution possibilities of the lens model. Empirical evidence on this question of expressed emotions in music is divided; some studies have found significant interactions (Hevner, 1935; Rigg, 1964; Schellenberg et al., 2000; Gabrielsson and Lindström, 2001, p. 243; Lindström, 2003, 2006; Webster and Weir, 2005) between the cues when the contribution of 3–5 cues of music has been studied, while other studies have failed to find substantial interactions in similar designs with a large number of cues (Scherer and Oshinsky, 1977; Juslin and Lindström, 2010). In the vocal expression of emotions, the importance of the interactions between the cues has typically been downplayed (Ladd et al., 1985). Our second research question probed whether the cues contribute to emotions in a linear fashion. Previous studies have predominantly explored cues with two levels, e.g., high-low (Scherer and Oshinsky, 1977; Juslin and Lindström, 2010), which does not permit inferences to be drawn about the exact manner (linear or non-linear) in which cue values contribute to given emotions (Stewart, 2001). Based on the physiological state explanation, we predicted a high degree of linearity within the levels of the cues, because the indicators of the underlying physiological states (corrugator muscle, skin-conductance level, startle response magnitude, heart rate) are characterized by linear changes with respect to emotions and their intensities (e.g., Mauss and Robinson, 2009).
The results confirmed both linearity and additivity of the cue contributions although non-linear effects were significant for some cues.
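
The distinction between linear, additive cue contributions and non-linear or interactive ones can be sketched with a small regression example. This is an illustration only, not the analysis used in the study: the two cue names, the three coded levels, and the rating values are made up, and the ratings are generated without noise from a purely additive rule so that the extra terms have nothing to explain.

```python
import numpy as np

# Hypothetical 3 x 3 factorial design: two cues, each coded at levels -1, 0, +1.
levels = np.array([-1.0, 0.0, 1.0])
tempo, register = [g.ravel() for g in np.meshgrid(levels, levels, indexing="ij")]

# Made-up ratings generated from a purely additive, linear rule (no noise).
ratings = 4.0 + 1.5 * tempo + 0.5 * register


def fit(columns, y):
    """Least-squares fit; returns the coefficients and the residual sum of squares."""
    X = np.column_stack([np.ones_like(y)] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss


# Additive linear model: intercept plus one slope per cue.
coef_add, rss_add = fit([tempo, register], ratings)

# Extended model: adds a quadratic term (non-linearity) and a product term (interaction).
coef_ext, rss_ext = fit([tempo, register, tempo ** 2, tempo * register], ratings)
```

When the data really are additive and linear, the quadratic and product coefficients in `coef_ext` come out at essentially zero and the extended model fits no better than the additive one. With real ratings, the size of those extra terms would have to be judged against the noise.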

Most cue levels represented in scalar steps did indeed contribute to emotion ratings in a linear fashion. The exceptions concerned mainly *timbre*, for which we had only three levels. These levels were determined using the single timbral characteristic of *brightness*, but the three instrument sounds used also possessed differences in other timbral characteristics. Nevertheless, the observed relationship between emotions and *timbre* was consistent with previous studies. However, the results of one particular observation proved the hypotheses drawn from past research wrong. *Dynamics* turned out to be of low importance both for the sad and happy emotions although it has previously been implicated as important for emotions in a number of studies using both emotion categories (Scherer and Oshinsky, 1977; Juslin, 1997c; Juslin and Madison, 1999; Juslin and Lindström, 2010) and emotion dimensions (Ilie and Thompson, 2006). It is unlikely that our results are due to insufficient differences in dynamics (±5 and ±10 dB) because ratings for the emotions peaceful and scary were nevertheless both heavily influenced by these changes. However, they might be related to the specific emotions, as this musical cue has been previously noted to be a source of discrepancy between speech and music (Juslin and Laukka, 2003). Our results are further vindicated by the fact that the emotions happy and sad have not exhibited large differences in dynamics in previous production studies (Juslin, 1997b, 2000).

Finally, the assumption inherent in the lens model that cues operate in additive fashion was validated. The interactions failed to reach statistical significance, consistent with comments made by previous surveys of emotional cues (Gabrielsson and Lindström, 2001, p. 243; Juslin and Laukka, 2004) and a number of studies (e.g., Juslin, 1997c; Juslin and Lindström, 2003). It should therefore be realistic to construct expressive models of emotions in music with linear, additive musical cues, and this construction greatly decreases the complexity of any such model. Whether this holds true for musical cues other than those studied here remains to be verified. This also provides support for the mainly additive model that is used for combining different performance cues in the Director Musices rule system, for example, for the rendering of different emotional expressions (Bresin and Friberg, 2000).

The strength of the current approach lies in the fact that the cues and their levels can be consistently compared since the study design capitalized on a previous production study of emotional expression in music (Bresin and Friberg, 2011) and the analyses were kept comparable to past studies of expressive cues of music (Scherer and Oshinsky, 1977; Juslin and Lindström, 2010). The present study allowed us to establish plausible ranges for the cue levels in each of the manipulations. The drawback of our scheme was that the optimal sampling did not contain all the possible cue combinations. This means that the prototype examples (**Figure 1**) could still be improved in terms of their emotional expression, but at least the factorial design was exhaustive enough to assess the main hypotheses about the cue levels and their interactions in general. Also, our decision to use alternate sets of emotions (tender vs. peaceful) in the two laboratories was a design weakness that failed to achieve the extension of the emotions covered.

In the context of musical expression, the ranking of the importance of the musical cues for emotions seems to converge across the studies (e.g., Hevner, 1936; Juslin, 1997c; Gabrielsson and Lindström, 2001; Juslin and Lindström, 2010), although the small number of studies and of cues studied within these studies prevents one from drawing extensive conclusions yet. We acknowledge that the choice of musical cues used for this study has, a priori, certainly excluded others from this ranking. Certain important musical cues such as *harmony, melodic contour*, or *dissonance* could be of equal relevance for attributing emotions to music and were included within the *music structure* of our design without any systematic manipulation. We also recognize that the variable contribution of the cues is a built-in feature of the Brunswikian lens model, according to which communication may be accurate using multiple cues although the relative contribution of the cues will depend on the context.

As per Hevner's cautionary remarks about the results of any music and emotion study (1936), any emotional evaluations are dependent on the context established by the musical materials in question. The present work differs in three material ways from the two previous studies (Scherer and Oshinsky, 1977; Juslin, 1997c; Juslin and Lindström, 2010) that also used extensive cue manipulations. Both Scherer and Oshinsky (1977) and Juslin (1997c) used just one synthetic, artificial melody as the basis for manipulations and 2–3 large differences between the cue levels. Juslin and Lindström (2010) also had four simple melodic progressions, all based on the same triadic, scalar, and rhythmic elements. The present experiment was built around four polyphonic, composed and validated musical examples that were initially chosen to represent four emotion categories in a maximally clear way. Additionally, the selection of cue range was grounded in past empirical work and combined both performance-related and compositional aspects of music.

The results of the present study offer links to the findings in expressive speech research because the hypotheses about the cue direction taken from expressive speech were largely supported (Scherer, 1986; Murray and Arnott, 1993; Juslin and Laukka, 2003; Scherer, 2003). In future, it would be important to combine the factorial manipulation approach with special populations, such as children, people from different cultures, or patients with particular neural pathologies and to use other measurement techniques than self-report to further isolate the musical cues in terms of the underlying mechanisms. These combinations would allow us to determine specifically what aspects of affect perception are mostly the products of learning, as well as gain a better idea of the underlying processes involved.

#### **ACKNOWLEDGMENTS**

The work was funded by the European Union (BrainTuning FP6-2004-NEST-PATH-028570) and the Academy of Finland (Finnish Center of Excellence in Interdisciplinary Music Research). We thank Alex Reed for proof-reading and Tuukka Tervo for collecting the data at the University of Jyväskylä.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/Emotion\_Science/10.3389/fpsyg.2013.00487/abstract

#### **REFERENCES**


18–49. doi: 10.1177/0305735610362821

for future research. *Psychol. Bull.* 99, 143–165. doi: 10.1037/0033-2909.99.2.143


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 March 2013; accepted: 11 July 2013; published online: 30 July 2013. Citation: Eerola T, Friberg A and Bresin R (2013) Emotional expression in music: contribution, linearity, and additivity of primary musical cues. Front. Psychol. 4:487. doi: 10.3389/fpsyg.2013.00487*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Eerola, Friberg and Bresin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Music, emotion, and time perception: the influence of subjective emotional valence and arousal?

#### *Sylvie Droit-Volet<sup>1</sup>\*, Danilo Ramos<sup>2</sup>, José L. O. Bueno<sup>3</sup> and Emmanuel Bigand<sup>4</sup>\**

*<sup>1</sup> Laboratoire de Psychologie Sociale et Cognitive, University Blaise Pascal, CNRS, Clermont-Ferrand, France*

*<sup>2</sup> Departamento de Música, Federal University of Paraná, Paraná, Brazil*

*<sup>3</sup> Faculdade de Filosofia, Ciências e Letras, University of São Paulo, São Paulo, Brazil*

*<sup>4</sup> Laboratoire d'étude de l'apprentissage et du développement, University of Burgundy, CNRS, Dijon, France*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Marion Noulhiane, UMR663 Paris Descartes University, France Steven R. Livingstone, Ryerson University, Canada*

#### *\*Correspondence:*

*Sylvie Droit-Volet, Laboratoire de Psychologie Sociale et cognitive (CNRS, UMR 6024), Université Blaise Pascal, 34 avenue Carnot, 73000 Clermont-Ferrand, France e-mail: sylvie.droit-volet@univ-bpclermont.fr; Emmanuel Bigand, Pôle AAFE-Esplanade Erasme, Université de Bourgogne, 34 avenue Carnot, BP 26513 21065, Dijon Cedex, France e-mail: emmanuel.bigand@u-bourgogne.fr*

The present study used a temporal bisection task with short (<2 s) and long (>2 s) stimulus durations to investigate the effect on time estimation of several musical parameters associated with emotional changes in affective valence and arousal. In order to manipulate the positive and negative valence of music, Experiments 1 and 2 contrasted the effect of musical structure with pieces played normally and backwards, which were judged to be pleasant and unpleasant, respectively. This effect of valence was combined with a subjective arousal effect by changing the tempo of the musical pieces (fast vs. slow) (Experiment 1) or their instrumentation (orchestral vs. piano pieces) (Experiment 2). The musical pieces were indeed judged more arousing with a fast than with a slow tempo and with an orchestral than with a piano timbre. In Experiment 3, affective valence was also tested by contrasting the effect of tonal (pleasant) vs. atonal (unpleasant) versions of the same musical pieces. The results showed that the effect of tempo in music, associated with a subjective arousal effect, was the major factor that produced time distortions with time being judged longer for fast than for slow tempi. When the tempo was held constant, no significant effect of timbre on the time judgment was found although the orchestral music was judged to be more arousing than the piano music. Nevertheless, emotional valence did modulate the tempo effect on time perception, the pleasant music being judged shorter than the unpleasant music.

#### **Keywords: time perception, music, emotion, valence, arousal**

Music is a powerful emotional stimulus that changes our relationship with time. Time does indeed seem to fly when listening to pleasant music. Music is therefore used in waiting rooms to reduce the subjective duration of time spent waiting or in supermarkets to encourage people to stay for longer and buy more. A number of studies have indeed shown that a period of waiting is judged shorter when there is accompanying music than when there is none (e.g., Stratton, 1992; North and Hargreaves, 1999; Roper and Manela, 2000; Guegen and Jacob, 2002) and that this subjective shortening of time appears to be greater when the subjects enjoy this accompanying music (Yalch and Spangenberg, 1990; Lopez and Malhotra, 1991; Kellaris and Kent, 1994; Cameron et al., 2003). These findings raise the question: What are the musical parameters that produce emotions and change our time judgments?

Music is a complex structure of sounds whose different parameters can affect the perception of time. Much of the published literature considers that the major cause of subjective time distortions in response to music is due to the temporal regularities of musical events. According to Jones and Boltz (1989), the effect of music on time estimation is due to the perceptual expectancies that listeners develop when they hear a piece of music. The way musical accents are patterned through time leads listeners to anticipate the timing and nature of incoming events. They thus judge time to be shorter when these events occur earlier in the piece than expected, and longer when they occur later. This finding highlights the influence exerted by musical structures (pitch and rhythmic structure) on attention during the estimation of musical time (see also Tillmann et al., 2007; Firmino and Bueno, 2008; Firmino et al., 2009).

However, without rejecting the important role of musical structure, other researchers mention the critical role of the emotional qualities of music *per se*. Indeed, music is remarkable in its ability to induce emotions in listeners (Juslin and Sloboda, 2001). Many studies conducted over the last decade have indeed demonstrated the consistency of emotional responses to music (e.g., Peretz et al., 1998; Bigand et al., 2005). However, the musical structure of a piece of music may also induce emotions in listeners, with the result that musical structure and emotional qualities cannot be easily dissociated. Quite surprisingly, only a small number of studies in the fields of music cognition and time perception have investigated the influence of musical structure and emotional qualities. The present study therefore focuses on the potential influence of the emotional qualities of musical pieces on time judgment.

As far as the emotional qualities of musical pieces are concerned, the musical mode has been found to have robust effects on perceived emotion, with pieces perceived as sounding happy when played in a major key and sad when played in a minor key (e.g., Crowder, 1984; Peretz et al., 1998; Fritz et al., 2009). Influences of mode on time estimation have been reported in studies using stimulus durations of several minutes (Kellaris and Kent, 1992; Bisson et al., 2009). For instance, Bisson et al. (2009) showed that the duration of a joyful musical piece (taken from Bach's Brandenburg Concertos) was overestimated compared to that of a sad piece (Barber's Adagio for Strings). However, given that the two emotions were instantiated only by two entirely different pieces, it is difficult to be sure that this difference in time estimation was not caused by other structural parameters (rhythm, meter, tempo) that are not necessarily directly related to emotion. Indeed, a piece of music in a major key that is judged happy is often associated with a fast tempo, whereas pieces written in a minor key tend to be played in a slow tempo. In such cases, the critical factor may thus be the musical rhythm rather than the mode *per se*. Moreover, two recent studies conducted using shorter stimulus durations and various temporal paradigms failed to find any significant effect of major *vs*. minor mode on time estimation. Using a retrospective time estimation paradigm, in which the participants were informed that they had to estimate time only after the presentation of the event, Bueno and Ramos (2007) did not observe any differences in time estimation between a musical piece (64.3 s) played in major and minor mode. Similarly, using a prospective time estimation paradigm (i.e., a temporal bisection task) in which the subjects were instructed that they would have to estimate time, Droit-Volet et al. (2010a) did not report a significant effect of mode on time judgments when the musical excerpts were matched on all parameters except for mode. Consequently, these authors concluded that the emotional valence of music may have little influence on time perception, at least when all other parameters, such as pitch structure, are held constant.

Finally, we can assume that it is the structure of musical pieces, which is indirectly responsible for inducing emotions, that affects the perception of time rather than the emotional valence *per se*. Using simple sequences of clicks, numerous studies on timing have shown that faster rhythms lead to longer time estimates than slower rhythms (e.g., Treisman et al., 1990, 1992; Penton-Voak et al., 1996; Droit-Volet and Wearden, 2002; Ortega and López, 2008). To explain these results, the various authors argue that the sequence of clicks increases the level of arousal that makes the internal clock run faster. According to the internal clock models (Treisman, 1963; Gibbon, 1977; Gibbon et al., 1984), the raw material for the representation of time consists of pulses that are emitted by a pacemaker-like system and accumulated in a counter during the presentation of the stimulus duration. Consequently, when the internal clock speeds up under the influence of clicks, more pulses are accumulated for a given duration, and time is judged longer. It therefore seems reasonable to consider that the critical factor in time distortions with music is the musical tempo that also seems to affect the emotional arousal. As explained in Droit-Volet and Meck (2007), an increase in the arousal level with emotional stimuli is associated with a speeding up of the internal clock, with the result that time is judged longer. According to psychophysiological studies that have used standardized emotional material (e.g., Greenwald et al., 1989; Lang et al., 1999), the arousal dimension of emotional stimuli corresponds to a subjective state ranging from calm-relaxed to excited-stimulated. An increase in arousal level is indeed associated with physiological activation of the autonomic nervous system (Juslin and Västfjäll, 2008). 
In addition, it has been demonstrated that physiological measures of arousal (heart rate or skin conductance) are correlated with self-assessment of arousal on the Self-Assessment Manikin Scale (SAM, Lang, 1980; Lang et al., 1999). Therefore, one aim of the present study was to examine the effect of different musical pieces on time estimation by comparing the effects of different tempi. Tempo, however, is thought to play a role in the subjective emotional arousal assessed by the SAM scale (Lang, 1980) and not in affective valence.
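
The pacemaker-accumulator account described above can be illustrated with a minimal sketch. The pacemaker rates (50 and 55 Hz) and the function names are hypothetical and chosen only for illustration; the cited models are not tied to these values, and the pacemaker is treated here as deterministic for simplicity.

```python
def accumulated_pulses(duration_s, pacemaker_rate_hz):
    """Pulses collected by the accumulator over one stimulus (deterministic pacemaker)."""
    return duration_s * pacemaker_rate_hz


def judged_duration(duration_s, pacemaker_rate_hz, baseline_rate_hz):
    """Read the pulse count out against the rate that applies under neutral arousal."""
    return accumulated_pulses(duration_s, pacemaker_rate_hz) / baseline_rate_hz


BASELINE_HZ = 50.0  # hypothetical pacemaker rate under neutral arousal
AROUSED_HZ = 55.0   # hypothetical 10% speed-up under high arousal

neutral = judged_duration(2.0, BASELINE_HZ, BASELINE_HZ)
aroused = judged_duration(2.0, AROUSED_HZ, BASELINE_HZ)
```

Under these assumptions the same 2-s stimulus yields more pulses when the pacemaker runs faster, so it is judged longer. Because the speed-up is multiplicative, the lengthening grows in proportion to the stimulus duration.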

In music, the concept of emotional valence may be understood in two different ways (Bigand et al., 2005). First, valence may be thought of in terms of an opposition between "sad" and "happy" music, that is to say, between negative and positive emotions (see also Juslin and Västfjäll, 2008). One effective way of implementing this opposition is to contrast music in major and minor keys. However, neither Bueno and Ramos (2007) nor Droit-Volet et al. (2010a) found any effect of mode on the perception of time. Second, valence may be viewed in terms of "pleasant" and "unpleasant" music. In this perspective, music qualified as "sad" could easily be experienced as very pleasant (Droit-Volet et al., 2010a). In a study run by Blood et al. (1999), extremely pleasant music was found to stimulate the reward circuit of the brain. Consequently, sad music can also bring about this rewarding effect. It is therefore possible that the valence of musical stimuli contributes differently to time estimation depending on whether the implemented contrast is between negative/positive emotions or pleasant/unpleasant emotions. In the present study, we manipulated this aspect of musical valence (pleasant vs. unpleasant) by inverting the amplitude envelope of the musical pieces. More precisely, the structure of the musical stimuli was changed by playing the sound wave either normally or backward. We expected this backward version to render the music unpleasant for two reasons: it destroys the musical relationships between tones and it modifies the amplitude envelope of each musical tone.

In sum, in a first experiment, the participants performed a temporal bisection task composed of a training and a testing phase (Allan and Gibbon, 1991; Wearden, 1991; Droit-Volet and Wearden, 2001). In the training phase, the participants were initially trained to respond "short" or "long" for a short and a long standard duration presented in the form of a white noise. In the testing phase, they were then presented with different comparison stimulus durations, equal to the short or the long standard duration, or of intermediate value. Their task was to judge whether each comparison duration was more similar to the short or to the long standard duration. However, in the testing phase, the comparison stimulus durations were not a white noise, but musical pieces whose tempo (fast vs. slow) and valence (normal vs. backward) were both manipulated. Our main hypothesis was that the psychometric function in bisection (proportion of long responses plotted against comparison durations) would be shifted toward the left for the musical pieces with a fast tempo compared to that for the musical pieces with a slow tempo, the participants responding more often long for the former. Using emotional scales similar to those employed in the SAM scale developed by Lang et al. (1999), we also verified whether tempo was associated with the subjective emotional arousal and the normal vs. backward opposition with the subjective emotional valence.

### **EXPERIMENT 1**

#### **METHOD**

#### *Participants*

Forty undergraduate students (27 women and 13 men, *mean age* = 19.2, *SD* = 1.02) at Burgundy University, France, participated in this experiment.

#### *Material*

The participants sat in a quiet laboratory room in front of a PC that controlled the experimental events and recorded the responses via E-prime. The participant's responses consisted of pressing the "D" or the "K" keys of the computer keyboard. The participants also listened to the stimuli through headphones which were connected to the computer. The stimuli to be timed consisted of musical sequences. Each excerpt was recorded using Cubase 4 musical software (Steinberg). A set of 5 different musical piano pieces was used as the stimuli to be timed. The same 5 musical pieces, with identical musical parameters, were subjected to two types of manipulation: one for the tempo and the other for the valence. As far as tempo is concerned, we changed the tempo from slow (72 beats per min) to fast (184 beats per min). To manipulate the valence, we changed the structure of the stimuli by playing the sound wave either normally or backward. Manipulating both the tempo (slow vs. fast) and the valence (original vs. backward) for the 5 musical pieces resulted in the generation of 20 musical sequences for use in this experiment.

#### *Procedure*

The participants performed a temporal bisection task composed of two phases: a training phase and a testing phase. In the training phase, the participants were presented with a short (*S*) and a long (*L*) standard duration presented in the form of a white noise. There were 16 trials, 8 for each standard duration, presented in a random order. In this phase, the participants were trained to respond "short" for *S* and "long" for *L*, by pressing the corresponding key. The button press order was counterbalanced across subjects. Only participants who obtained at least 70% correct responses were included in the testing phase. In this testing phase, the participants were presented with 7 comparison durations presented in the form of the musical pieces described above: one for each comparison duration similar to *S* or *L*, and one for each of the 5 intermediate comparison durations. For each musical piece, the participants had to respond whether its comparison duration was more similar to *S* or to *L*. The test phase consisted of 280 trials presented in 2 blocks of 140 trials each: 10 trials for the musical stimuli (2 × 5 different musical pieces) with two types of tempo (slow vs. fast) and two types of valence (normal vs. backward) for each of the 7 comparison durations. The trials were presented randomly within each block. In addition, the participants were divided into two groups as a function of the duration range used: 0.5/1.7 or 2.0/6.8 s. For the shorter duration range, *S* was 0.5 s and *L* 1.7 s. The comparison durations were 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, and 1.7 s. For the longer duration range, *S* and *L* were 2.0 and 6.8 s, and the comparison durations 2.0, 2.8, 3.6, 4.4, 5.2, 6.0, and 6.8 s. In each condition, the participants were instructed not to count the time (for the methods used to prevent counting, see Rattat and Droit-Volet, 2012).
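
The structure of the test phase can be checked with a short sketch that enumerates the trials for the shorter duration range. The piece names, the dictionary keys, and the assumption that each piece is presented twice per condition (giving the 2 × 5 = 10 trials per cell) are illustrative choices, not details taken from the experimental software.

```python
from itertools import product

PIECES = [f"piece_{i}" for i in range(1, 6)]  # placeholder names for the 5 piano pieces
TEMPI = ["slow", "fast"]                      # 72 vs. 184 beats per min
VERSIONS = ["normal", "backward"]
REPETITIONS = 2                               # assumed: each piece twice per cell (2 x 5 = 10)


def comparison_durations(short, long, n_steps=7):
    """Equally spaced comparison durations from the short to the long standard."""
    step = (long - short) / (n_steps - 1)
    return [round(short + k * step, 2) for k in range(n_steps)]


def build_trials(short, long):
    """One dict per test trial, crossing piece, tempo, version, duration and repetition."""
    durations = comparison_durations(short, long)
    return [
        {"piece": p, "tempo": t, "version": v, "duration": d}
        for p, t, v, d, _ in product(PIECES, TEMPI, VERSIONS, durations, range(REPETITIONS))
    ]


short_range_trials = build_trials(0.5, 1.7)
```

The list contains 280 trials, matching the count given above, and would be shuffled and split into two blocks of 140 before presentation.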

After the bisection task, the participants were asked to evaluate the emotional qualities of the musical stimuli. More precisely, they heard each musical stimulus and rated its affective valence from "unpleasant" to "pleasant" and its arousal dimension from "calm" to "exciting" on a 9-point scale (range 1–9) similar to that used in the SAM by Lang et al. (1999). The two emotional scales were randomly presented. The presentation duration of each musical stimulus was at the mid-point between the two standard durations employed in the bisection task. In the 0.5/1.7 and the 2.0/6.8 s duration conditions, the participants thus gave their emotional judgments for stimuli of 1.1 and 4.4 s, respectively.

#### **RESULTS AND DISCUSSION**

#### **EMOTIONAL EVALUATION OF MUSICAL STIMULI**

**Table 1** displays the emotional ratings for the music, presented for 1.1 and 4.4 s, as a function of the affective and arousal dimensions of each version of the pieces tested, when these were presented forward (original version) or backward and at a slow or fast tempo.

An ANOVA was run on each of the pleasantness and arousal ratings, with duration, backward version and tempo as within-subject factors. There was a significant main effect of both version, *F*(1, 40) = 168.16, *p* < 0.05, η<sup>2</sup> = 0.81, and tempo, *F*(1, 40) = 60.99, *p* < 0.05, η<sup>2</sup> = 0.60, on pleasantness. The main effect of duration, *F*(1, 40) = 0.10, *p* > 0.05, was not significant, thus indicating that the presentation duration of the music (short or long) did not affect pleasantness. There was no significant interaction involving these different factors (all *p* > 0.05). In line with our hypothesis, our results thus showed that the normal version of the music was clearly judged to be more pleasant (7.20) than the backward version (3.01). The fast tempo was also judged more pleasant than the slow tempo (5.63 vs. 4.57), although the ratings tended more toward a median value on the 9-point scale.

As far as the arousal ratings are concerned, the ANOVA showed a significant main effect of tempo, *F*(1, 40) = 234.50, *p* < 0.05, η<sup>2</sup> = 0.85, thus demonstrating that the music played at a fast tempo was judged more arousing than the music played at a slow tempo (7.11 vs. 3.5). There was, however, a significant interaction between the tempo and the backward version, *F*(1, 40) = 41.88, *p* < 0.05, η<sup>2</sup> = 0.51. Tempo did not significantly interact with any other factor (all *p* > 0.05). This significant interaction indicated that, at the fast tempo, the participants judged the music to be more arousing in its normal than in its backward version (7.77 vs. 6.44, *F*<sup>1</sup>(1, 41) = 18.22, *p* < 0.05, η<sup>2</sup> = 0.31). In contrast, at the slow tempo, there was no difference between the normal and the backward version (3.27 vs. 3.73, *F*(1, 41) = 1.83, *p* > 0.05). In addition, the ANOVA found a significant interaction between the backward version and the duration, *F*(1, 40) = 4.31, *p* < 0.05, η<sup>2</sup> = 0.10. The original music was judged more arousing than the backward music when the presentation duration was long (5.75 vs. 4.74, *F*(1, 20) = 12.93, *p* < 0.05, η<sup>2</sup> = 0.39), while both forms were judged to be similarly arousing when the duration was shorter (3.27 vs. 3.73, *F*(1, 20) = 0.11, *p* > 0.05). However, the arousal rating did not exceed 5.75 on the 9-point scale. No other significant effect was found. In summary, in line with our hypotheses, the results suggested that the type of presentation (original vs. backward) was the main factor affecting the assessment of the valence of the musical pieces, and the tempo the main factor affecting the level of arousal induced by music, although with the fast tempo, the subjective arousal increased more with the normal than with the backward version of musical pieces.

<sup>1</sup>Bonferroni corrections were applied for all comparisons.

**Table 1 | Mean and standard deviation of ratings of arousal and pleasantness (9-point scale) for musical excerpts presented in their original and backward version with a fast and a slow tempo for a 1.1 and a 4.4-s duration.**

**the 0.5–1.7 and the 2.0–6.8 s duration conditions.**

#### **TEMPORAL BISECTION**

**Figure 1** presents the proportion of long responses [*p*(long)] plotted against the comparison durations for the different types of musical pieces, which were judged to be high or low-arousing as a function of their tempo (fast vs. slow, respectively) and pleasant or unpleasant as a function of their version (original vs. backward). An examination of **Figure 1** reveals that the major factor that produced time distortions was the tempo. Indeed, the musical stimuli were systematically judged longer with a fast than a slow tempo. To examine the bisection performance in more detail, we calculated two indexes: the point of subjective equality, also called the bisection point (BP), and the Weber Ratio (WR) (**Table 2**). The former is the stimulus duration (t) that gives rise to *p*(long) = 0.50. The WR is an index of time sensitivity. It is the Difference Limen, (t[*p*(long) = 0.75] − t[*p*(long) = 0.25])/2, divided by the BP. The lower the WR value, the higher the sensitivity to time. The regression method originally used by Church and Deluty (1977) and subsequently employed by other authors (e.g., Wearden and Ferrara, 1996; Droit-Volet and Wearden, 2002) was used to calculate these 2 temporal indexes.
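
The two indexes can be sketched as follows. This is a simplified illustration rather than the exact procedure of the cited regression method: it assumes that a single straight line is fitted to all comparison durations, and the proportions of long responses are made-up values that lie exactly on a line. The function name `bisection_indexes` is hypothetical.

```python
import numpy as np


def bisection_indexes(durations, p_long):
    """Return (BP, WR) from a straight-line fit of p(long) against duration."""
    slope, intercept = np.polyfit(durations, p_long, deg=1)

    def t_at(p):
        # Duration at which the fitted line reaches the proportion p.
        return (p - intercept) / slope

    bp = t_at(0.50)                                   # bisection point
    difference_limen = (t_at(0.75) - t_at(0.25)) / 2
    return bp, difference_limen / bp                  # WR = DL / BP


durations = np.array([0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7])
p_long = np.array([0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95])  # made-up, exactly linear
bp, wr = bisection_indexes(durations, p_long)
```

With these illustrative values the fitted line crosses 0.50 at 1.1 s, the midpoint of the range. A leftward shift of the psychometric function, as predicted for the fast tempo, would appear as a lower BP.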

**Table 2 | Means and standard deviation of the Bisection Points and Weber Ratios for musical excerpts presented in their original and backward version with a fast and a slow tempo in the 0.5/1.7 and the 2.0/6.8-s duration condition.**


An ANCOVA was conducted on the BP with 2 within-subject factors (tempo, backward version) and 1 between-subjects factor (duration), with the arousal and the valence scores for each type of musical piece as covariates. This ANCOVA showed a main effect of duration, *F*(1, 25) = 362.72, *p* < 0.05, η<sup>2</sup> = 0.94, indicating that the BP was higher for the long than for the short anchor durations. No other factor significantly interacted with duration. More interestingly, there was a significant main effect of tempo, *F*(1, 25) = 8.37, *p* < 0.05, η<sup>2</sup> = 0.25. This main effect of tempo demonstrates that the BP was lower for the fast than for the slow tempo and therefore indicates that the music was judged longer when played at a faster tempo.

The main effect of backward version was not significant, *F*(1, 25) = 0.72, *p* > 0.05, and the backward version did not interact with any co-variables (all *ps* > 0.05). There was nevertheless a significant tempo × backward interaction, *F*(1, 33) = 5.63, *p* < 0.05, η<sup>2</sup> = 0.18. This revealed that the music with a fast tempo was judged longer than that with a slow tempo for both the original version, *F*(1, 35) = 60.01, *p* < 0.05, η<sup>2</sup> = 0.63, and the backward version, *F*(1, 39) = 10.34, *p* < 0.05, η<sup>2</sup> = 0.21. However, the difference in the lengthening effect between the fast and the slow tempo appeared to be larger for the original than for the backward version, *F*(1, 34) = 13.59, *p* < 0.05, η<sup>2</sup> = 0.29. In line with the results obtained for the assessment of the arousal and valence level of musical pieces, there was a significant interaction between the tempo and the arousal measures for the fast backward music, *F*(2, 25) = 4.39, *p* < 0.05, η<sup>2</sup> = 0.15, demonstrating that the tempo effect on the BP increased with the arousal scores: the higher the arousal scores, the longer the musical pieces were judged to be. There was also a significant interaction between the tempo and the valence measures, both for the fast and the slow backward version of the musical pieces, revealing that the difference in the lengthening effect between the slow and the fast tempo tended to decrease for the backward version as the pleasantness of the music increased. No other main effect or interaction involving the co-variables was found.

The overall ANOVA run on the WR with tempo, backward version and duration as factors did not reveal any significant effect (all *p* > 0.05). Therefore, the perception of the music distorted time without altering the fundamental ability to discriminate different durations.
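The two main-effect sizes reported above for the BP (duration and tempo) are consistent with partial eta squared recomputed from the *F* value and its degrees of freedom. The sketch below shows that calculation. Reading η<sup>2</sup> as partial eta squared is an assumption, since the text does not state which variant was reported, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effects reported for the BP in Experiment 1.
duration_effect = partial_eta_squared(362.72, 1, 25)
tempo_effect = partial_eta_squared(8.37, 1, 25)
print(round(duration_effect, 2), round(tempo_effect, 2))
```

Rounded to two decimals, these reproduce the reported values of 0.94 and 0.25.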

Experiment 1 showed a main effect of tempo on time judgment, revealing that the musical pieces with a fast tempo were judged longer than those with a slower tempo. There was nevertheless an interactive effect of the version (normal vs. backward) and the tempo of the musical stimuli on time judgment. This interaction indicated that the backward version of the music, which was rated as affecting the valence (pleasantness) of the musical pieces, modulated rather than reversed the effect of tempo on the timing of music. Indeed, whatever the stimulus duration range (shorter or longer than 2 s), the musical pieces were always judged longer at the fast than at the slow tempo. However, the magnitude of this lengthening effect due to tempo was larger for the original than for the backward version of the musical pieces. In other words, the version (original or backward), which affected the valence of the musical pieces, increased or decreased the difference in time judgment between the fast and the slow tempo, without eliminating or reversing the tempo effect.

Our Experiment 1 therefore demonstrates that musical tempo was the major factor affecting time judgments. A musical piece with a fast tempo was systematically judged longer than a musical piece with a slower tempo. Our study with musical pieces thus replicated the results of studies using simple click trains, which have shown that a faster click rate produces longer time estimates (e.g., Treisman et al., 1990, 1992). In addition, our results on the emotional evaluation of musical stimuli revealed that the fast pieces of music were systematically judged to be more arousing than the slower pieces. There was also a significant interaction between the tempo and the subjective arousal measures, which indicated that the lengthening effect obtained with the fast tempo, compared to the slow tempo, was related to the increase in the subjective arousal level of the musical pieces. Consequently, the increase in subjective arousal level associated with the fast tempo would be the source of the temporal lengthening effect observed in our study. Such a conclusion would be consistent with the results of numerous studies showing that high-arousing emotional stimuli (facial expressions, images, movies) produce a temporal lengthening effect whereas low-arousing emotional stimuli do not (e.g., Droit-Volet and Gil, 2009; Droit-Volet et al., 2010b, 2011; Gil and Droit-Volet, 2011; Tipples, 2008, 2011). However, the issue of whether the effect of tempo associated with arousal is due to tempo *per se* or to the arousing qualities of the music remains open. We therefore decided to run a second experiment similar to Experiment 1 but with a parameter other than tempo that is also thought to increase the subjective arousal level assessed by the SAM scale (Lang et al., 1999). More precisely, we manipulated the timbre of the musical pieces by playing them in a piano and an orchestral form. Previous studies have manipulated the timbre of musical sounds and demonstrated that the more complex the timbre, the greater the arousal (e.g., Behrens and Green, 1993; Balkwill and Thompson, 1999). Accordingly, piano versions were expected to induce lower arousal than orchestral versions of the same musical pieces. Our hypothesis was that, if arousal level *per se* is the cause of the temporal lengthening, we should observe a temporal lengthening effect for the orchestral versions similar to that produced by variations in tempo.

#### **EXPERIMENT 2**

#### **METHOD**

#### *Participants*

The sample consisted of forty new undergraduate students (24 women and 16 men, *mean age* = 21.3; *SD* = 1.54).

#### *Material and procedure*

The material was similar to that used in Experiment 1 with the exception of the musical stimuli to be timed. To manipulate the arousal induced by the musical stimuli, we changed their instrumentation. In the piano version, only the piano timbre was used. In the orchestral version, additional tracks performed by double bass, woodwind, brass and percussion were included. Increasing the number of virtual performers rendered the music livelier and thus more dynamic. The valence was manipulated in the same way as in Experiment 1 by playing the sound file either normally or backwards. The 5 musical pieces were consequently played either by piano only or with orchestral instrumentation and were run either normally or backwards.

The procedure was also identical to that used in Experiment 1, with a white noise being used for the standard durations presented in the training phase and the musical pieces for the comparison durations presented in the test phase. The test phase consisted of 280 trials presented in 2 blocks of 140 trials each: 10 musical stimulus trials (2 × 5 different musical pieces) for two types of instrumentation (piano vs. orchestral) and two types of valence (normal vs. backward) for each of the 7 comparison durations. As in Experiment 1, after the bisection task, the participants were again asked to evaluate the emotional qualities of the musical stimuli presented for 1.1 and 4.4 s (midpoint between *S* and *L*) on an affective valence scale ranging from "unpleasant" to "pleasant" and an arousal scale from "calm" to "exciting" (Lang et al., 1999).
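The trial count of the test phase follows from fully crossing the factors listed above. The short Python sketch below enumerates that crossing. The variable names are hypothetical and the comparison durations are represented only by their index, since their exact values are not needed for the count.

```python
from itertools import product

pieces = range(1, 6)                        # 5 musical pieces
repetitions = range(1, 3)                   # each piece appears twice per cell
instrumentations = ("piano", "orchestral")
versions = ("normal", "backward")
duration_steps = range(1, 8)                # 7 comparison durations (index only)

# One tuple per test trial: every combination of the five factors.
trials = list(product(pieces, repetitions, instrumentations,
                      versions, duration_steps))
trials_per_block = len(trials) // 2         # test phase split into 2 blocks

print(len(trials), trials_per_block)
```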

#### **RESULTS AND DISCUSSION**

#### **EMOTIONAL EVALUATION OF MUSICAL STIMULI**

**Table 3** shows the results of emotional ratings of the orchestral pieces and corresponding piano versions, presented either forward (normal) or backward. The results of the ANOVA on the pleasantness ratings showed a significant main effect of backward version, *F*(1, 36) = 315.07, *p* < 0.05, η<sup>2</sup> = 0.90, thus confirming that the normal music was judged pleasant (7.43) and its backward version unpleasant (2.91). In addition, there was a significant backward × duration interaction, *F*(1, 36) = 15.25, *p* < 0.05, η<sup>2</sup> = 0.30. This interaction revealed that the difference in affective assessment between a normal piece and its backward version was greater when the presentation duration of the music was long (4.4 s) than when it was short (1.1 s) (5.50 vs. 3.52, *F*(1, 36) = 15.25, *p* < 0.05, η<sup>2</sup> = 0.30). The ANOVA also showed that the main effect of orchestration did not reach significance on the pleasantness ratings, *F*(1, 33) = 3.28, *p* > 0.05. This suggests that instrumentation *per se* was not sufficient to modify the pleasant nature of the music. However, there was a significant backward × orchestration × duration interaction, *F*(1, 36) = 5.62, *p* < 0.05, η<sup>2</sup> = 0.14. For both the short and the long presentation durations, the backward version of the music was systematically judged to be less pleasant whatever its instrumentation (piano or orchestra) (all *p* < 0.05). The only difference in the pleasantness ratings between the piano and the orchestral music was found for the long presentation duration, with the backward version being judged more unpleasant with the orchestral than with the piano sound (2.14 vs. 2.79, *F*(1, 18) = 4.86, *p* < 0.05, η<sup>2</sup> = 0.21).

In accordance with our hypothesis, the ANOVA on the arousal ratings showed that the orchestral music was judged more arousing than the piano music, 6.95 vs. 4.66, *F*(1, 37) = 139.49, *p* < 0.05, η<sup>2</sup> = 0.79. In addition, the backward version of the music had no significant effect on subjective arousal, *F*(1, 37) = 0.02, *p* > 0.05. There was no other significant effect. To summarize, by varying the version and instrumentation, we achieved a largely orthogonal manipulation of the valence and the arousing qualities of the musical stimuli. Manipulating the orchestration did indeed selectively affect the arousing values of emotion, without a significant main effect on valence.

#### **TEMPORAL BISECTION**

**Figure 2** presents the psychophysical functions when the orchestral and piano pieces were played forward and backward in the short and the longer duration range. In contrast to Experiment 1 in which tempo was the major factor modifying time judgment, **Figure 2** suggests that the orchestration, although it was also associated with a higher subjective level of arousal, did not affect time judgment. This is confirmed by the results of the ANCOVA performed on the BP (**Table 4**) with the same factor design as that used in Experiment 1.

**Table 3 | Mean ratings and standard deviation of arousal and pleasantness (9-point scale) of musical excerpts in original × backward and orchestral × piano conditions for a 1.1 and a 4.4-s duration.**

**FIGURE 2 | Proportion of long responses plotted against stimulus duration for the original and the backward version of orchestral and piano music in the 0.5–1.7 and the 2.0–6.8 s duration conditions.**

As in Experiment 1, the ANCOVA run on the BP revealed a significant main effect of duration, *F*(1, 27) = 595.32, *p* < 0.05, η<sup>2</sup> = 0.96, with no significant interaction involving this factor. Consequently, the BP was higher in the long than in the short duration range. However, and more interestingly, there was neither a main effect of orchestration, *F*(1, 27) = 1.72, *p* > 0.05, nor a main effect of backward version, *F*(1, 27) = 0.18, *p* > 0.05. Furthermore, the arousal measures entered into the ANCOVA as covariates were not significant (all *p* > 0.05). The only significant effect was the interaction between the backward version and the valence measures for the original version of the orchestral music, *F*(1, 27) = 6.42, *p* < 0.05, η<sup>2</sup> = 0.19. This revealed that the BP increased with the positive valence of the music. In other words, the more pleasant the music was judged to be, the shorter the estimate of its duration.

The ANCOVA on the WR failed to reveal any significant effect, except for a significant interaction between the backward version, the orchestration and the arousal measures for the original version of the orchestral music, *F*(1, 27) = 4.67, *p* < 0.05, η<sup>2</sup> = 0.15. This interaction was due solely to the WR value for the original piano music, which increased significantly with the subjective valence level [*r*(39) = 0.36, *p* < 0.05]. In other words, sensitivity to time decreased as the pleasure expressed by the participants when they heard the piano music increased.

To summarize, although the orchestral music was rated as being more arousing than the piano music, our results did not reveal any difference in time perception induced by the musical timbre. Therefore, as we discuss below, both the variations in orchestration and in tempo modified the subjective level of arousal, but only the tempo significantly modified the judgment of time. Finally, when different orchestral pieces were used, only the backward version of the music that modified the affective valence of the music affected time judgments, with the duration of the musical pieces being judged shorter when their positive valence (pleasantness) increased.

**Table 4 | Means and standard deviation of the Bisection Points and Weber Ratios for original × backward and orchestral × piano music in the 0.5/1.7 and the 2.0/6.8 s duration condition.**

In sum, the backward version of musical pieces (original vs. backward) used in our studies to change the emotional valence of the music appeared to produce a shortening effect which, in the case of Experiment 1, modulated the tempo effect on time judgment. However, playing music backwards significantly alters the structure of the music, such that its emotional effect on the perception of time (i.e., temporal shortening) is perhaps specific to this manipulation of the musical pieces. Therefore, to further examine the effect of valence in the temporal judgment of music, we decided to run a third experiment involving the manipulation of other musical parameters that are considered to modify the emotional valence of music. In a recent study conducted using a temporal bisection task similar to that used in Experiments 1 and 2, Droit-Volet et al. (2010a) tested the emotional valence of musical pieces by presenting the same pieces in two variants: a major key for positive valence and a minor key for negative valence. However, as we explained in our Introduction, they did not report any significant effect of mode on the perception of time with different duration ranges. In Experiments 1 and 2, we manipulated the valence of the music by inverting the amplitude envelope of the musical pieces (forward vs. backward version). Another approach consists in contrasting tonal and atonal music. Using a retrospective temporal judgment paradigm, Kellaris and Kent (1992) took a pop song lasting 2.5 min, played in the major or the minor mode, and made it atonal by changing the pitch of the appropriate tones. The participants judged the piece played in the major mode (associated with happiness) as lasting longer (3.45 min) than that played in the minor mode (3.07 min) or in an atonal variant (2.95 min). The authors therefore concluded that the strongest valence effects were found when major and atonal versions of the same music were contrasted. Consequently, in Experiment 3, we used a temporal bisection task to examine the differences in time perception caused by tonal and atonal pieces of music.

#### **EXPERIMENT 3**

#### **METHOD**

#### *Participants*

Forty new undergraduate students (22 women and 18 men, *mean age* = 24.2, *SD* = 2.03) participated in this experiment.

#### *Material and Procedure*

The same 5 musical pieces as in Experiment 1 were used, but now in their tonal and atonal versions. The tonal and atonal versions of each piece had identical musical parameters such as rhythm, meter, and melodic contour. All the stimuli (tonal and atonal) were played at a fast tempo of 108 beats per min. They differed only in the fact that the atonal version contained pitches that did not belong to a unique key, thus creating dissonant intervals.

The procedure was again identical to that employed in the previous experiments, with a white noise being used to indicate the standard durations presented in the training phase and the pieces of music being used for the comparison durations presented in the test phase. However, in the test phase, only two types of music were used (atonal vs. tonal). The test phase thus consisted of 140 trials subdivided into 2 blocks of 70 trials each: 10 trials (5 musical pieces × 2) in their tonal and atonal versions for each of the 7 stimulus durations. After the bisection task, the participants were again asked to evaluate the emotional qualities of the stimuli on both an affective valence and an arousal scale.

#### **RESULTS AND DISCUSSION**

#### **EMOTIONAL EVALUATION OF MUSICAL STIMULI**

**Table 5** displays the average emotional ratings provided by the participants. Not surprisingly, tonal music was considered more pleasant than atonal music irrespective of stimulus duration. The analysis of variance (ANOVA) run on the pleasantness ratings showed a significant main effect of tonality, *F*(1, 28) = 156.57, *p* < 0.05, η<sup>2</sup> = 0.85, and no significant effect of duration, *F*(1, 28) = 1.55, *p* > 0.05, or significant duration × tonality interaction, *F*(1, 28) = 1.44, *p* > 0.05. By contrast, the ANOVA on the arousal ratings did not reveal any significant effect: Tonality, *F*(1, 29) = 0.01, Duration, *F*(1, 28) = 3.34, Tonality × Duration, *F*(1, 29) = 3.24, all *p* > 0.05. This finding suggests that the change in pitch structure affected only the valence of the pieces, with atonal music being judged more unpleasant than tonal music.

#### **TEMPORAL BISECTION**

**Figure 3** shows the psychophysical functions for the two types of music. This figure suggests that, in line with the results found in Experiment 1, there was a tonality effect for the long duration range (2.0/6.8-s), with the pleasant tonal music being perceived as lasting for less time than the unpleasant atonal music. However, no clear-cut effect of this type seems to be observed for the very short duration range (0.5/1.7-s).

**Table 6** presents the BP and WR calculated using the regression method as in Experiment 1. The ANCOVA was performed on the BP and the WR with duration as between-subjects factor, music as within-subjects factor, and arousal and valence scores as co-variables. The ANCOVA on the BP showed a significant main effect of duration, *F*(1, 24) = 585.96, *p* < 0.05, η<sup>2</sup> = 0.96, as in the previous experiments. There was also a significant main effect of tonality, *F*(1, 24) = 4.84, *p* < 0.05, η<sup>2</sup> = 0.17, as well as a significant tonality × valence interaction, *F*(1, 24) = 5.38, *p* < 0.05, η<sup>2</sup> = 0.18. The BP was thus significantly higher for the tonal music than for the atonal music, indicating that the duration of the tonal music was judged shorter than that of the atonal music. In addition, this shortening effect increased with emotional valence, i.e., as the assessment of the music as pleasant increased.

**Table 5 | Mean and standard deviation of ratings of arousal and pleasantness (on a 9-point scale) for musical excerpts in tonal and atonal conditions for a 1.1 and a 4.4-s duration.**

**Table 6 | Mean and standard deviation of the Bisection Points and Weber Ratios for tonal and atonal music in the 0.5/1.7 and the 2.0/6.8 s duration condition.**

The ANCOVA on the WR also found a main effect of emotional valence for the tonal music, *F*(1, 21) = 4.85, *p* < 0.05, η<sup>2</sup> = 0.19, indicating that sensitivity to time decreased with the increase in the positive valence of the music. The ANCOVA did not show any other significant effect (tonality, *F*(1, 24) = 0.03, tonality × duration, *F*(1, 24) = 0.10, duration, *F*(1, 39) = 0.004, all *p* > 0.05). This lack of significant effect for the WR involving duration in Experiment 3 as well as in Experiments 1 and 2 confirmed that Weber's law holds for the temporal judgment of music as well as for that of other stimuli (Wearden and Lejeune, 2008). In conclusion, the manipulation of physical properties of musical pieces produced time distortions without impairing the fundamental ability to discriminate different durations.

In sum, the results of Experiment 3 revealed that the stimulus durations were judged shorter with the tonal than with the atonal music. As the tonality affected the emotional valence, with the tonal music being judged more pleasant than the atonal music, our results demonstrated that hearing pleasant music produced a temporal shortening effect compared to unpleasant music. Consequently, modulating the emotional valence of music by changing its tonality or by inverting its amplitude envelope (backward version) produced a similar temporal shortening effect for different duration ranges.

#### **GENERAL DISCUSSION**

Numerous studies have addressed the influence of emotion on the perception of time (for reviews, see Droit-Volet and Meck, 2007; Droit-Volet, 2013; Droit-Volet et al., 2013). However, most of these have used emotional visual stimuli (i.e., emotional facial expressions, pictures from IAPS). Only two experiments, conducted by Noulhiane et al. (2007) and Mella et al. (2011), have been undertaken with sounds from the International Affective Digital Sounds (IADS, Bradley and Lang, 1999). The results of these 2 experiments showed that the emotional sounds were judged longer than the neutral sounds, and more so in the case of the negative compared to the positive sounds. These results were explained within the theoretical framework of the internal clock models (Treisman, 1963; Gibbon, 1977; Gibbon et al., 1984) in terms of arousal effects which speed up the internal clock rate. According to the internal clock models, when the speed of the internal clock increases, more temporal units (pulses) are accumulated and time is judged longer. As in most studies of time and emotion, Noulhiane et al. (2007) therefore concluded that "physiological activation is the predominant aspect of the influence of emotions on time perception, as all emotional stimuli regardless of their self-assessed valence are perceived as being longer than neutral ones" (p. 702).

However, emotional sounds differ from other emotional stimuli (visual) because they are dynamic stimuli involving different parameters that evolve through time. Without specific experimental manipulations of these different parameters, it is thus difficult to identify the real sources of temporal distortions in response to these sounds. For instance, musical pieces played in a major key at a fast tempo are judged happier than those played in a minor key at a slow tempo (e.g., Peretz et al., 1998; Fritz et al., 2009). More specifically, in the case of the perception of time, the tempo in itself must affect the speed of the internal clock independently of emotional effects. Many different studies have shown that a simple sequence of periodic stimuli (clicks, flickers) increases temporal estimates (for a review, see Wearden et al., 2009). Wearden et al. (2009) concluded that the click train effect on the perception of time due to a speeding up of the internal clock is one of the most robust effects to be observed in time psychology. However, the use of music provides an elegant way of manipulating two dimensions while keeping a number of other parameters constant. The present study addressed this issue by manipulating, in Experiments 1 and 2, two different dimensions of arousal (tempo and timbre) as well as a parameter associated with emotional valence (backward vs. forward music). Our results revealed that variations in tempo are indeed associated with different subjective levels of arousal, with music played at a faster tempo being judged as more arousing than music played at a slow tempo. In the same way, orchestration was found to affect arousal level, with orchestral music being judged to be more arousing than piano music when the tempo of these two types of music was held constant. Nevertheless, in our temporal bisection studies we found that, although these two musical parameters affected the subjective level of arousal, only the tempo significantly modified the perception of time. Indeed, in Experiment 1, the psychophysical functions were systematically shifted toward the left, with the BP being lower for the fast than for the slow music, thus indicating that the fast music was judged as lasting longer than the slow music. By contrast, in Experiment 2, no significant effect of timbre on the perception of time was observed although the orchestral music was judged to be more arousing than the piano music. In conclusion, as far as music is concerned, tempo is one of the major factors associated with the emotional arousal that leads to distortions in temporal judgments. In other words, the physical properties of music play a fundamental role in the time distortions associated with emotion.

In addition, Noulhiane et al. (2007) have suggested that, compared to physiological activation, the valence of emotional sounds has only a small influence on the perception of time. This idea finds support in the fact that a temporal lengthening effect, related to the physiological activation resulting from accelerated tempo, was systematically observed in our study whatever the emotional valence of the musical pieces and irrespective of their duration (shorter or longer than 2 s). However, the results of our study also revealed an effect of emotional valence on judgments of the duration of musical pieces, even when stimulus durations were particularly short. Indeed, regardless of the type of musical property that changed the emotional valence (the backward version, the tonality), our studies demonstrated that listening to music with a positive valence led to shorter time estimates. This finding is entirely consistent with the results of previous studies in which participants were asked to evaluate the duration of a long period of music (e.g., Yalch and Spangenberg, 1990; Kellaris and Kent, 1994). Finally, emotional valence rated in terms of pleasure (unpleasant vs. pleasant) seems to be a more sensitive index of emotional effects on time judgments than emotional valence rated in terms of mode (sad vs. happy music) (Bueno and Ramos, 2007; Droit-Volet et al., 2010a,b). As argued by Droit-Volet et al. (2010a), sad music can also be judged as pleasant.

The question that must now be asked is: Why did the emotional valence of the music produce a shortening effect on time judgments, whereas arousal produced a contrasting lengthening effect? As explained above, the lengthening effect obtained with arousal/tempo is probably due to an automatic speeding up of the internal clock. In contrast, the effect of valence (unpleasant vs. pleasant) might call on controlled attentional processes which are linked to the awareness of pleasure experienced when listening to pleasant music. According to attentional models of timing, the temporal and the non-temporal processors compete for the same pool of attentional resources (Thomas and Weaver, 1975; Zakay, 1989; Zakay and Block, 1996, 1998). Temporal units (pulses) that underpin the representation of time would be lost when attentional resources are distracted away from the processing of time, thus resulting in a shortening effect. This assumption, made by the attention-based models of timing, has been widely validated by the results of numerous studies that have used the dual-task paradigm (e.g., Fortin and Breton, 1995; Casini and Macar, 1997; Gautier and Droit-Volet, 2002; Coull et al., 2004). The results of our study, which showed that hearing musical pieces of positive valence shortened the passage of time, are thus consistent with this attentional assumption. Consequently, hearing pleasant music seems to divert attention away from time processing. In other words, time flies when subjects listen to pleasant music. In addition, our results in Experiment 1 revealed that this attention-related shortening effect was greater in the case of low-arousing music with a slow tempo. However, further experiments must be run to gain a better understanding of the effect of the interaction between the two emotional dimensions of the music (valence and arousal) on the timing of music.
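The two mechanisms contrasted above can be illustrated with a toy pacemaker-accumulator calculation, in which arousal multiplies the pulse rate and diverted attention discards a share of the pulses. This is only a sketch of the verbal argument: the function name, the baseline rate and the parameter values are hypothetical and are not estimated from the present data.

```python
def judged_duration(physical_s, clock_speedup=1.0, attention_to_time=1.0):
    """Toy estimate of judged duration (s) from accumulated clock pulses.

    clock_speedup     > 1 models an arousal-driven faster pacemaker.
    attention_to_time < 1 models pulses lost when attention is diverted.
    """
    if not 0.0 <= attention_to_time <= 1.0:
        raise ValueError("attention_to_time must lie between 0 and 1")
    base_rate = 10.0  # hypothetical pulses per second at rest
    pulses = physical_s * base_rate * clock_speedup * attention_to_time
    # Pulses are read out against the resting rate to give a duration.
    return pulses / base_rate


# Hypothetical parameter values for a 4.4-s stimulus.
fast_tempo = judged_duration(4.4, clock_speedup=1.2)          # lengthening
pleasant_music = judged_duration(4.4, attention_to_time=0.8)  # shortening
print(fast_tempo, pleasant_music)
```

In this sketch the two effects are multiplicative, which is one simple way for a valence-related shortening to modulate a tempo-related lengthening without reversing it.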

In conclusion, the originality of our study lies in the fact that it reveals that the arousal and valence-related properties of a musical stimulus have an interactive effect on time perception. However, our study also showed that the critical factor responsible for producing time distortions was the tempo of the music. In consequence, the emotional effect of music on the perception of time is intrinsically linked to the temporal dynamics of music, i.e., its musical tempo. It is therefore particularly important to continue our investigation of music in order to better understand the way emotions affect time perception, because emotional music has dynamic temporal properties which are not present in visual emotional stimuli.

#### **ACKNOWLEDGMENTS**

This study was supported by a CAPES-COFECUB Program (Brazil-France) to José L. O. Bueno, Emmanuel Bigand, and Sylvie Droit-Volet, and by grant ANR 11 EMOCO01201 from the French National Agency for Research (ANR), awarded to Sylvie Droit-Volet.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 25 April 2013; accepted: 19 June 2013; published online: 17 July 2013.*

*Citation: Droit-Volet S, Ramos D, Bueno JLO and Bigand E (2013) Music, emotion, and time perception: the influence of subjective emotional valence and arousal? Front. Psychol. 4:417. doi: 10.3389/fpsyg.2013.00417*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Droit-Volet, Ramos, Bueno and Bigand. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Preattentive processing of emotional musical tones: a multidimensional scaling and ERP study

#### *Katja N. Spreckelmeyer<sup>1</sup>, Eckart Altenmüller<sup>2</sup>, Hans Colonius<sup>3</sup> and Thomas F. Münte<sup>4</sup>\**

*<sup>1</sup> Department of Psychology, Stanford University, Stanford, CA, USA*

*<sup>2</sup> Institute of Music Physiology and Musicians' Medicine, University of Music, Drama, and Media, Hannover, Germany*

*<sup>3</sup> Department of Psychology, University of Oldenburg, Oldenburg, Germany*

*<sup>4</sup> Department of Neurology, University of Lübeck, Lübeck, Germany*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Clayton R. Critcher, University of California, Berkeley, USA*

*Lars Kuchinke, Ruhr Universität Bochum, Germany*

#### *\*Correspondence:*

*Thomas F. Münte, Department of Neurology, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany; e-mail: thomas.muente@neuro.uni-luebeck.de*

Musical emotion can be conveyed by subtle variations in timbre. Here, we investigated whether the brain is capable of discriminating tones differing in emotional expression by recording event-related potentials (ERPs) in an oddball paradigm under preattentive listening conditions. First, using multidimensional Fechnerian scaling, pairs of violin tones played with a happy or sad intonation were rated same or different by a group of non-musicians. Three happy and three sad tones were selected for the ERP experiment. The Fechnerian distances between tones within an emotion were in the same range as the distances between tones of different emotions. In two conditions, either 3 happy tones and 1 sad tone or 3 sad tones and 1 happy tone were presented in pseudo-random order. A mismatch negativity for the emotional deviant was observed, indicating that in spite of considerable perceptual differences between the three equiprobable tones of the standard emotion, a template was formed based on timbral cues against which the emotional deviant was compared. Based on Juslin's assumption of redundant code usage, we propose that tones were grouped together because they were identified as belonging to one emotional category based on different emotion-specific cues. These results indicate that the brain forms an emotional memory trace at a preattentive level and thus extend previous investigations in which emotional deviance was confounded with physical dissimilarity. Differences between sad and happy tones were observed which might be due to the fact that the happy emotion is mostly communicated by suprasegmental features.

**Keywords: preattentive processing, musical emotion, timbre, event-related potential, mismatch negativity, multidimensional scaling**

#### **INTRODUCTION**

Music, as well as language, can be used to transport emotional information and, from an evolutionary perspective, it does not come as a surprise that the way emotion is encoded in music is similar to the encoding of emotion in human or animal vocalizations. Interestingly, the emotional and semantic processing of speech has been shown to be supported by different brain systems by the method of double dissociation (e.g., Heilman et al., 1975). While six patients with right temporoparietal lesions and left unilateral neglect were demonstrated to have a deficit in the comprehension of affective speech, six patients with left temporoparietal lesions exhibited fluent aphasia, i.e., problems with the content of speech, but no problems with affective processing. Likewise, in music processing the Montreal group around Isabelle Peretz has described a patient that is selectively impaired in the deciphering of emotions from music while being unimpaired for the processing of other aspects of music (Peretz et al., 2001).

Researchers have tried to identify segmental and suprasegmental features that are used to encode emotional information in human speech, animal vocalizations, and music. With regard to animals, similar acoustic features are used by different species to communicate emotions (Owings and Morton, 1998). In humans, perceived emotion appears to be mainly driven by the mean level and the range of the fundamental frequency (F0) (Williams and Stevens, 1972; Scherer, 1988; Sloboda, 1990; Pihan et al., 2000) with low F0 being related to sadness and, conversely, high mean F0 level being related to happiness. In music, Hevner (1935, 1936, 1937) in her classical studies found that tempo and mode had the largest effects on listeners' judgments, followed by pitch level, harmony, and rhythm. According to Juslin (2001) musical features encoding sadness include slow mean tempo, legato articulation, small articulation variability, low sound level, dull timbre, large timing variations, soft duration contrasts, slow tone attacks, flat micro-intonation, slow vibrato, and final ritardando, whereas happiness is encoded by fast mean tempo, small tempo variability, staccato articulation, large articulation variability, fairly high sound level, little sound level variability, bright timbre, fast tone attacks, small timing variations, sharp duration contrasts, and rising micro-intonation.

While suprasegmental features are thought to be, at least in part, the result of a lifelong sociocultural conventionalization and therefore, maybe less hardwired (Sloboda, 1990), a considerable part of the emotional information is transmitted by segmental features concerning individual tones. For example, a single violin tone might be recognized as sad or happy with a rather high accuracy. Indeed, string and wind instruments which afford a high degree of control over the intonation can be used to mimic the segmental features also used by singers to convey emotional information.

Segmental emotional information can be encoded into a single tone by varying its timbre, which might be defined as reflecting the different quality of sounds aside from variations in pitch, loudness, and duration. In addition to different distributions of amplitudes of the harmonic components of a complex tone in a steady state (Helmholtz, 1885/1954), dynamic variations of the sound such as attack time and spectral flux (Grey, 1977; Grey and Moorer, 1977) are also important, particularly with regard to onset characteristics. Multidimensional scaling procedures on tones that differed in timbre because they were produced by different musical instruments showed that this aspect of a tone is determined by variations along three dimensions termed attack time, spectral centroid, and spectral flux (McAdams et al., 1995). Likewise, in a recent study using multidimensional scaling (MDS) procedures to investigate the emotional information transmitted by variations in timbre, Eerola et al. (2012) found that affect dimensions could be explained in terms of three kinds of acoustic features: spectral (= ratio of high-frequency to low-frequency energy), temporal (= attack slope), and spectro-temporal (= spectral flux).

From the discussion above, there is no question as to the importance of detection of emotional timbre in voice and—by extension—in music. The question that we ask here pertains to *when* in the auditory processing stream emotional timbre is differentially processed. Given the high evolutionary benefit that might be afforded by the rapid decoding of emotional information from single tones (or human calls), we hypothesize that such information might be processed "early" in the processing stream and in an *automatic* fashion. Indeed, there are a number of studies that have investigated rapid and preattentive classification of emotional sounds. In particular, our group presented normal non-musician participants with tone series comprising a frequent (standard) single violin tone played with a certain emotional connotation (happy or sad) and a rare (deviant) violin tone played with the "opposite" intonation (Goydke et al., 2004). In parallel to the tone series, the EEG was recorded with a focus on the mismatch negativity (MMN). The MMN has been shown to be an ideal tool to address the early, automatic stages of sound evaluation (Näätänen, 1992; Picton et al., 2000; Näätänen et al., 2001). It is a component of the auditory event related potential (ERP) which is elicited during *passive* listening by an infrequent change in a repetitive series of sounds. In the original incarnation of the MMN paradigm, it occurs in response to any stimulus which is physically deviant (in frequency, duration or intensity) to the standard tone. Importantly, the standard stimulus in typical MMN experiments is the same throughout the experiment. It has been shown, however, that the MMN can also be obtained to deviations within complex series of sounds (Picton et al., 2000; Näätänen et al., 2001), in which the memory trace is defined by some abstract property (e.g., ascending series of tones). 
Thus, it appears that the notion of a standard/memory trace can be extended such that the auditory system is capable of extracting systematic properties of sound series. Moreover, and important for Goydke et al. (2004) and the present study, the MMN is sensitive to changes in the spectral component of tonal timbre (Tervaniemi et al., 1997). The onset latency of the MMN varies according to the nature of the stimulus deviance. Whereas simple, physically deviant stimuli show an onset latency of the MMN of about 150 ms, much later MMNs have been seen with more complex forms of deviance. Finally, it is important to stress the fact that the analysis of the incoming stimulus as well as its encoding appears to take place automatically since the MMN typically occurs when the subjects do not attend to the eliciting stimuli, for example during engagement in a different task such as reading a book (Näätänen, 1992). Returning to the Goydke et al. (2004) study, deviant tones were associated with an MMN. The MMN scalp topography for the emotional deviant was similar to an MMN for a control pitch deviant tone. These results were taken to indicate that the brain can categorize tones preattentively on the basis of subtle cues related to the emotional status of the tone (Goydke et al., 2004). Studies applying a similar logic to emotionally voiced words (Schröder et al., 2006) or vocalizations (Bostanov and Kotchoubey, 2004) have revealed analogous findings. Further, investigating different timbral dimensions (attack time, spectral centroid, and spectrum fine structure) and their consequences for behavioral classification latencies and ERPs in preattentive (Caclin et al., 2006) and attentive (Caclin et al., 2008) listening conditions, Caclin and colleagues showed that these different timbral features are separately represented in sensory auditory memory.

One important aspect has been neglected by these studies, however. In the Goydke et al. (2004) study, a single (e.g., happy) tone was presented repeatedly as a standard and a single (e.g., sad) tone was presented repeatedly as the emotional deviant. Thus, it is possible that the MMN observed for the deviants in this study was driven by the physical differences between the standard and deviant stimuli rather than by the postulated preattentive emotional categorization of the stimulus. Indeed, different mechanisms of deviance detection (termed sensory and cognitive) have been demonstrated for other types of stimulus materials (Schröger and Wolff, 1996; Jääskeläinen et al., 2004; Opitz et al., 2005).

Therefore, to answer this question and extend our previous findings (Goydke et al., 2004), we conducted the present study. As pointed out before, segmental features encoding emotion seem to be varied. Thus, what makes the study of acoustical emotion difficult is that the set of features encoding the same emotion does not seem to be very well defined and that there is a great variance of feature combinations found within individual emotion categories. We modified the design of our previous MMN study to see whether affective expressions are preattentively categorized even when their acoustical structure differs. In other words, several (*n* = 3, probability of occurrence for each tone 25%) instances of sad (or happy) tones were defined as standards to which an equally probable deviant stimulus (25%) of the other emotion had to be compared preattentively. To the extent that the MMN reflects deviance in the sense of "being rare," an MMN under these circumstances would indicate that the standards have been grouped to define a single "emotional" entity.

To test whether the brain automatically builds up categories of basic emotions across tones of different (psycho)-acoustical structure, it was necessary to create two sets of tones, where tones within one set could clearly be categorized as happy and sad, respectively, but differed with respect to their acoustical structure. To this end, we first performed extensive studies to define the stimulus set for the MMN study using MDS methods. Two types of criteria were set for tones to be used as standards in the MMN study: first, each tone needed to be consistently categorized as happy or sad and, second, tones within one set as well as across sets needed to be perceived as different. The first point was addressed by performing affect ratings on a set of violin tones which only differed in emotional expression but not in pitch or instrumental timbre. To tackle the second point, pairwise same-different comparisons were collected for all tones and fed into a Fechnerian scaling procedure to assess the perceived similarity among the tones. We will first describe the scaling experiment and will then turn to the MMN experiment.

For the latter, we had a straightforward expectation: If the brain categorizes tones preattentively on the basis of an automatic emotional grouping, we should observe an MMN for emotional deviant stimuli regardless of the fact that these emotional deviants were as probable as each of the three different standard stimuli.

#### **SCALING EXPERIMENT**

Multidimensional Fechnerian scaling (Dzhafarov and Colonius, 1999, 2001) is a tool for studying the perceptual relationship among stimuli. The general aim of MDS is to arrange a set of stimuli in a low-dimensional (typically Euclidean) space such that the distances among the stimuli represent their subjective (dis)similarity as perceived by a group of judges. Judges generally perform their ratings in pairwise comparisons between all stimuli in question. Based on the dissimilarity data, an MDS procedure finds the best fitting spatial constellation by use of a function minimization algorithm that evaluates different configurations with the goal of maximizing the goodness-of-fit (Kruskal, 1964a,b). Though the dimensions found to span the scaling space can often be interpreted as psychologically meaningful attributes that underlie the judgment, no a priori assumptions have to be made about the nature of the dimensions. Thus, with MDS, perceptual similarity can be studied without the need to introduce predefined feature concepts (as labels for the dimensions) which might bias people's judgments.

Fechnerian scaling is a development of classical MDS which is more suitable to be used with psychophysical data. Dzhafarov and Colonius (2006) have pointed out that certain requirements for data to be used with classical MDS are usually violated in empirical data, namely the property of symmetry and the property of constant self-dissimilarity. The property of symmetry assumes that discrimination probability is independent of presentation order, and, thus, that the probability to judge a stimulus x as different from a stimulus y is the same no matter whether x or y is presented first [*p*(*x*; *y*) = *p*(*y*; *x*)]. It has been known since Fechner (1860) that this is not true. The property of constant self-dissimilarity expects that any given stimulus is never perceived as different from itself, thus, that the probability to judge stimulus x as different from itself is 0 [*p*(*x*; *x*) = *p*(*y*; *y*)]. However, it has been shown repeatedly that this is not the case in psychophysical data (e.g., Rothkopf, 1957). The only requirement made by Fechnerian scaling is that of regular minimality, requesting that the probability to judge a stimulus as different from itself needs to be lower than any other discrimination probability.
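The three properties discussed above can be checked mechanically on a matrix of discrimination probabilities. The following Python sketch is illustrative only: the function names and the 3 × 3 toy matrix are hypothetical (the values are invented and not taken from the present data), and regular minimality is implemented in the simplified form stated in the text, i.e., each self-comparison probability must be strictly smaller than every other entry in its row and its column.

```python
def is_symmetric(p, tol=1e-9):
    """Symmetry: p(x; y) equals p(y; x) for every pair."""
    n = len(p)
    return all(abs(p[i][j] - p[j][i]) <= tol
               for i in range(n) for j in range(n))


def has_constant_self_dissimilarity(p, tol=1e-9):
    """Constant self-dissimilarity: all diagonal entries p(x; x) are equal."""
    diagonal = [p[i][i] for i in range(len(p))]
    return max(diagonal) - min(diagonal) <= tol


def satisfies_regular_minimality(p):
    """Simplified regular minimality: p(x; x) is the strict minimum
    of row x and of column x (assumption: matrix in canonical form)."""
    n = len(p)
    for i in range(n):
        for j in range(n):
            if j != i and (p[i][j] <= p[i][i] or p[j][i] <= p[i][i]):
                return False
    return True


# Hypothetical data: rows = first tone, columns = second tone.
toy = [
    [0.10, 0.80, 0.60],
    [0.70, 0.20, 0.90],
    [0.65, 0.85, 0.15],
]

print(is_symmetric(toy))
print(has_constant_self_dissimilarity(toy))
print(satisfies_regular_minimality(toy))
```

A matrix of this kind violates the two requirements of classical MDS but still meets the weaker requirement of Fechnerian scaling, which is the situation the text describes as typical for psychophysical data.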

In the present experiment Fechnerian scaling is used to establish subjective distances for a set of tones where tones differ with respect to their emotional expression.

#### **MATERIALS AND METHODS**

#### **STIMULUS MATERIAL**

To generate the stimulus material, 9 female violinists (all students of the Hanover University for Music and Drama) were asked to play brief melodic phrases all ending on c-sharp. Melodies were to be played several times with happy, neutral, or sad expressions. Before each musician started with a new expression, she was shown a sequence of pictures from the IAPS (Lang et al., 2008) which depicted happy, neutral or sad scenes, to give her an idea of what was meant by happy, neutral, and sad. All violinists were recorded on the same day in the same room using the same recording technique: stereo (2 Neumann-microphones TLM127), 44.1 kHz sampling rate, 24 bit, distance from the instrument to the microphones was always 50 cm. Each musician filled out a form describing the changes in technique that she had applied to achieve the different expressions. From 200 melodic phrases the last tone (always c-sharp) was extracted using Adobe Audition. Only those tones were selected which were between 1450 and 1700 ms in length and had a pitch between 550 and 570 Hz. Tones from two violinists had to be discarded altogether because they were consistently below pitch level. The resulting pre-selection comprised 35 tones by 7 different violinists. To soften the tone onset a smooth fade-in envelope was created from 0 to 100 ms post-tone onset. The pre-selection was rated on a 5-point scale from very sad (1) to very happy (5) by 9 student subjects (mean age = 25.9 years, 5 males) naive to the purpose of the study and different from the participants taking part in the final experiment. Each tone was rated twice by each participant to test the raters' consistency. Tones were not amplitude-normalized, because it was found that differences in affective expression could not be differentiated properly in a normalized version. Based on the affect ratings and their consistency 10 tones were selected for the final stimulus set (**Table 1**).
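The selection criteria and the onset smoothing described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in Adobe Audition: the function names are hypothetical, the duration and pitch limits are treated as inclusive, and the shape of the fade-in (a raised-cosine ramp) is an assumption, since the text only specifies a smooth envelope from 0 to 100 ms.

```python
import math


def keep_tone(duration_ms, pitch_hz):
    """Selection criteria from the text (limits assumed to be inclusive)."""
    return 1450 <= duration_ms <= 1700 and 550 <= pitch_hz <= 570


def apply_fade_in(samples, sample_rate=44100, fade_ms=100):
    """Scale the first fade_ms of a tone with a raised-cosine ramp.

    The ramp shape is an assumption; only the 0-100 ms extent is given
    in the text. Samples after the ramp are left unchanged.
    """
    n_fade = min(int(round(sample_rate * fade_ms / 1000)), len(samples))
    out = list(samples)
    for i in range(n_fade):
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_fade))
        out[i] = samples[i] * gain
    return out


# Usage with a constant dummy signal of 200 ms at 44.1 kHz.
dummy = [1.0] * 8820
faded = apply_fade_in(dummy)
print(keep_tone(1500, 560), faded[0], faded[4410])
```

At a sampling rate of 44.1 kHz, the 100 ms ramp spans 4410 samples, so the gain rises from 0 at tone onset to its full value at sample 4410.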

**Table 1 | Features of the stimulus material.**


#### **DESIGN OF THE SAME-DIFFERENT FORCED-CHOICE EXPERIMENT**

Participants were 10 students (mean age = 25.4 years, 5 females) with no musical expertise who took part in two separate sessions. In session 1 they performed a same-different forced-choice task on the violin tones to provide data for MDS. In session 2 (approximately 1 week later) they were asked to rate the emotional expression of the tones on a five-point-scale.

For the forced-choice task, participants were tested individually while sitting in a comfortable chair 120 cm away from a 20-inch computer screen. All auditory stimuli were presented via closed headphones (Beyerdynamic DT 770 M) with a level ranging from 64 to 73 dB. Presentation software (Neurobehavioral Systems) was used to present trials and to record responses. All 10 tones were combined with each other including themselves, resulting in 10 × 10 = 100 pairs; all 100 pairs were presented ten times, each time in a different randomized order (resulting in 1000 trials altogether). The stimulus onset asynchrony (SOA) between the two tones of a pair was 3500 ms. Participants had to strike one of two keys to respond same or different (forced choice). To make sure that participants judged the psychoacoustical similarity of the tones without bias, they were not informed about the purpose of the experiment. Trial duration was about 6000 ms. The next trial was automatically started when one of the two buttons was pressed. Participants performed a short training to familiarize them with the procedure and were allowed to pause after each block of 25 trials. There were 40 blocks altogether. Participants could end the pause by pressing a button on the keyboard. The duration of the whole experiment was about 2 hours. Participants were verbally instructed to decide whether the two tones comprising a pair were same or different. For the data analysis, responses were recorded as 0 (same) and 1 (different). Mean values (discrimination probabilities) per pair of tones were calculated over all participants and all responses. The minimum number of responses per pair was 90. The resulting discrimination probabilities were transformed into Fechnerian distances using FSDOS (Fechnerian Analysis of Discrete Object Sets by Dzhafarov and Colonius, see http://www.psych.purdue.edu/~ehtibar/).
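To give an idea of how discrimination probabilities can be turned into Fechnerian distances, the following Python sketch follows our reading of the published procedure for discrete object sets: psychometric increments are formed by subtracting each self-comparison probability from its row, the oriented distance is the smallest summed increment over all chains connecting two stimuli, and the overall distance is the sum of the two oriented distances. It is not the FSDOS program itself; the function name and the toy matrix are hypothetical, and the matrix is assumed to satisfy regular minimality in canonical form.

```python
def fechnerian_distances(p):
    """Overall Fechnerian distances G[a][b] from a matrix p[a][b] of
    discrimination probabilities (sketch; assumes canonical form)."""
    n = len(p)
    # Psychometric increments of the first kind: p(a; b) - p(a; a).
    g1 = [[p[a][b] - p[a][a] for b in range(n)] for a in range(n)]
    # Oriented distances: cheapest chain from a to b (Floyd-Warshall).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if g1[i][k] + g1[k][j] < g1[i][j]:
                    g1[i][j] = g1[i][k] + g1[k][j]
    # Overall distance: there-and-back sum of oriented distances.
    return [[g1[a][b] + g1[b][a] for b in range(n)] for a in range(n)]


# Hypothetical probabilities (invented values, three stimuli only).
toy_p = [
    [0.10, 0.80, 0.60],
    [0.70, 0.20, 0.90],
    [0.65, 0.85, 0.15],
]

for row in fechnerian_distances(toy_p):
    print([round(value, 2) for value in row])
```

By construction the resulting matrix is symmetric with zeros on the diagonal, even though the input probabilities are not, which is what makes the distances usable as a measure of perceived dissimilarity.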

#### **AFFECT RATING**

In session 2, each participant from the scaling experiment performed an affect rating of each individual violin tone. All stimuli were presented twice with the order being randomized for each participant. Participants were asked to rate each tone on a 5-point scale ranging from very sad (1) to very happy (5) by pressing one of the keys from F1 to F5 on the keyboard. Emblematic faces illustrated the sad and the happy end of the scale.

#### **VALENCE AND AROUSAL RATING**

Stimulus material was also rated according to valence and arousal by two additional groups of participants. All stimuli were presented twice but the order was randomized for each participant. To give participants an idea of what was meant by the terms valence and arousal, they performed a short test trial on pictures taken from the IAPS. Group A (valence) (5 women, 5 men, mean age = 27.6) was asked to rate all 10 tones on a 5-point scale ranging from very negative (1) to very positive (5). Group B (arousal) (5 women, 5 men, mean age = 24.4) was asked to rate the 10 tones from very relaxed (German = "sehr entspannt") (1) to highly aroused (German = "sehr erregt") (5).

#### **RESULTS**

#### **SAME-DIFFERENT FORCED-CHOICE EXPERIMENT**

Discrimination probabilities for each pair of tones based on participants' same-different judgments are shown in **Table 2**. Fechnerian distances for each pair of tones calculated from discrimination probabilities are shown in **Table 3**. The values given reflect the relative distances between pairs of tones as perceived by the mean participant. For example, tone04 (abbreviated t.04 in the row) is perceived about 1.5 times more distant from tone05 than from tone07.

#### **AFFECT, AROUSAL, AND VALENCE RATING**

Results of the affect, arousal, and valence ratings are shown in **Table 4**, collapsed over the first and second presentation, which did not differ significantly. Please note that the affect rating was performed by the same group of participants that also took part in the same-different forced-choice experiment, whereas the arousal and valence ratings were performed by two different groups of subjects. Though stemming from different groups of participants, there was a high correlation between the affect and the arousal ratings [*r* = 0.937, *p* < 0.001]. In contrast, the correlation between valence and affect ratings was rather low [*r* = 0.651, *p* = 0.042].


#### **Table 2 | Discrimination probabilities.**

*Given are probabilities with which the mean perceiver judged the row tones to be different from the column tones.*

#### **Table 3 | Fechnerian distances.**


*Distances were calculated by FSDOS (the larger the value the more distant the tones).*

#### **Table 4 | Results of the affect, arousal, and valence ratings.**


*Each scale ranged from 1 to 5; last column gives the label of the tone for the MMN study.*

The low correlation between valence and affect ratings is surprising, because valence and affect were expected to be closely related. It has to be noted, though, that during the testing it became apparent that participants used different concepts for the valence dimension. While some understood positive-negative in the sense of pleasant-unpleasant, others linked the two ends of the positive-negative dimension to happy and sad. This problem is paralleled by a heterogeneous use of the term valence in the literature (see Russell and Barrett, 1999, for a discussion) and might serve as an explanation for the incongruous pattern. In the current experiment the valence ratings will therefore be interpreted with caution.

#### **SELECTION OF STIMULI FOR THE MMN EXPERIMENT**

Three sad tones [tone01 (sad01), tone02 (sad02), tone05 (sad03)] and 3 happy tones [tone07 (hap01), tone08 (hap02), tone09 (hap03)] were chosen from the data set based on their affect ratings. The happy tones had mean affect ratings of 3.45, 3.60, and 3.35; sad tones were rated 1.90, 1.95, and 2.20, respectively. Affect ratings of happy and sad tones were significantly different [*F*(9, 90) = 12.9, *p* < 0.001] and scaling procedures demonstrated that tones were perceived as different even when belonging to the same emotion category. Fechnerian distances between happy and sad tones fell between 1.44 and 1.67. Distances were 0.17, 1.52, and 1.44 among happy tones and 0.14 and 1.29 among sad tones.

#### **EVENT-RELATED POTENTIAL EXPERIMENT**

#### **METHODS**

#### *Participants*

Of a total of 19 participants three had to be excluded because of technical error (two) or too many blink artifacts in the ERP data (one). The remaining 16 participants (8 women) were aged between 21 and 29 years (mean = 24.9). None was a professional musician.

#### *Design*

Stimuli were the 6 different single violin tones chosen on the basis of the scaling experiment. Two conditions were set up in a modified oddball-design. In condition A 3 sad tones were presented in random order (standards) with 1 happy tone (deviant) randomly interspersed. In condition B 3 happy tones were presented as standards with 1 sad tone randomly interspersed as deviant tone. As deviants, the tones with the lowest and highest affect ratings were chosen. The probability of occurrence was 25% for each of the three standard tones and the deviant tone, resulting in an overall probability of 75% for the standard stimuli and 25% for the affective deviant. In both conditions each tone was presented 340 times resulting in a total of 1360 tones per condition. A randomization algorithm guaranteed that identical tones were never presented back-to-back. Both conditions were divided in two blocks of 680 tones. The order of blocks was ABAB or BABA. All four blocks were presented in one session with one pause between block 2 and 3. The total duration of the experiment was about 90 min.
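The text does not specify the randomization algorithm, so the following Python sketch shows only one possible way to obtain such a sequence. The function name and tone labels are hypothetical; the approach (weighted random draws that exclude the previous tone, with a restart whenever a dead end is reached) is an assumption chosen for simplicity.

```python
import random


def oddball_sequence(tones, n_per_tone, rng):
    """Return a random sequence in which each tone occurs n_per_tone
    times and identical tones never follow each other directly."""
    total = n_per_tone * len(tones)
    while True:  # restart whenever the draw runs into a dead end
        remaining = {tone: n_per_tone for tone in tones}
        sequence = []
        previous = None
        while len(sequence) < total:
            candidates = [t for t in tones
                          if t != previous and remaining[t] > 0]
            if not candidates:
                break  # only the previous tone is left: try again
            weights = [remaining[t] for t in candidates]
            choice = rng.choices(candidates, weights=weights, k=1)[0]
            sequence.append(choice)
            remaining[choice] -= 1
            previous = choice
        else:
            return sequence


# Condition A: three sad standards and one happy deviant, 340 each.
condition_a = oddball_sequence(
    ["sad01", "sad02", "sad03", "hap01"], 340, random.Random(0))
print(len(condition_a), condition_a[:8])
```

With four equiprobable tones presented 340 times each, the sequence contains 1360 tones, of which 75% are standards and 25% are deviants, matching the proportions given above.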

Tones were presented via insert ear phones used with Earlink ear-tips (Aearo Comp.). Stimulus onset asynchrony between two tones was 2000 ms. Mean sound pressure level of the presentation of all tones was 70 dB. To realize a non-attentive listening paradigm, participants were instructed to pay attention to cartoons (Tom and Jerry—The classical collection 1) presented silently on a computer screen in front of them. To control how well participants had attended the film a difficult post-test was performed after the experiment requiring participants to recognize selected scenes. On average, 85% of the scenes were classified correctly, indicating that the participants had indeed attended the film.

#### *ERP-recording*

The electroencephalogram (EEG) was recorded from 32 tin electrodes mounted in an elastic cap according to the 10–20 system. Electrode impedance was kept below 5 kΩ. The EEG was amplified (bandpass 0.1–40 Hz) and digitized continuously at 250 Hz. Electrodes were referenced on-line to the left mastoid. Subsequently, off-line re-referencing to an electrode placed on the nose-tip was performed. Electrodes placed at the outer canthus of each eye were used to monitor horizontal eye movements. Vertical eye movements and blinks were monitored by electrodes above and below the right eye. Averages were obtained for 1024 ms epochs including a 100 ms pre-stimulus baseline period. Trials contaminated by eye movements or amplifier blocking or other artifacts within the critical time window were rejected prior to averaging. For this, different artifact rejection thresholds were defined for the eye and EEG channels.

Separate averages were calculated for each tone in both conditions. ERPs were quantified by mean amplitude measures using the mean voltage of the 100 ms period preceding the onset of the stimulus as a reference. Time windows and electrode sites are specified at the appropriate places of the result section. Effects were tested for significance in separate ANOVAs, with stimulus type (standard or deviant) and electrode site as factors. The Huynh-Feldt epsilon correction (Huynh and Feldt, 1980) was used to correct for violations of the sphericity assumption. Reported are the original degrees of freedom and the corrected *p*-values. Significance level was set to *p* < 0.05.

#### **RESULTS**

The grand average waveforms to the standard and deviant tones (**Figure 1**) are characterized by an N1-P2 complex as typically found in auditory stimulation (Näätänen et al., 1988), followed by a long-duration negative component with a frontal maximum and a peak around 400–500 ms. The current design allows two different ways to assess emotional deviants. Firstly, deviants and standards collected in the same experimental blocks can be compared (i.e., happy standard vs. sad deviant or sad standard vs. happy deviant). These stimulus classes are emotionally as well as physically different. Secondly, the ERP to the deviant can be compared with the same tone when it was presented as standard in the other condition, such that the compared stimuli are physically identical but differ in their functional significance as standard and deviant (i.e., sad standard vs. sad deviant and happy standard vs. happy deviant, see **Table 5**). Time windows for the statistical analysis were set as follows: 100–200 ms (N1), 200–300 ms (P2), and 380–600 ms. Electrode sites included in the analysis were F3, F4, FC5, FC6, C3, C4, Fz, FCz, Cz.
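The mean amplitude measure used for these time windows can be illustrated with a short Python sketch. It assumes an averaged epoch stored as a plain list of voltages sampled at 250 Hz that starts 100 ms before stimulus onset, as described in the Methods; the function names, the half-open treatment of the window limits, and the dummy data are assumptions made for illustration.

```python
def time_to_index(t_ms, sample_rate_hz=250, prestim_ms=100):
    """Convert a latency relative to stimulus onset into a sample index."""
    return int(round((t_ms + prestim_ms) * sample_rate_hz / 1000))


def mean_amplitude(epoch, start_ms, end_ms,
                   sample_rate_hz=250, prestim_ms=100):
    """Mean voltage in [start_ms, end_ms) relative to the mean of the
    pre-stimulus baseline (window end treated as exclusive)."""
    n_baseline = time_to_index(0, sample_rate_hz, prestim_ms)
    baseline = sum(epoch[:n_baseline]) / n_baseline
    first = time_to_index(start_ms, sample_rate_hz, prestim_ms)
    last = time_to_index(end_ms, sample_rate_hz, prestim_ms)
    window = epoch[first:last]
    return sum(window) / len(window) - baseline


# Dummy epoch: 1024 ms at 250 Hz = 256 samples, constant offset of 2.0,
# with a negative deflection in the 380-600 ms window.
epoch = [2.0] * 256
for i in range(time_to_index(380), time_to_index(600)):
    epoch[i] = -1.0

print(mean_amplitude(epoch, 380, 600))
```

At 250 Hz one sample corresponds to 4 ms, so the 100 ms baseline covers the first 25 samples and the 380–600 ms window covers samples 120 to 174.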

In condition A, emotional (happy) deviants elicited a more negative waveform in a late latency range (from 380 ms), regardless of the comparison (**Figure 1**, top; **Table 5**). Thus, the mismatch response cannot be explained by the fact that physically different tones elicited the different ERP waveforms. To illustrate the scalp distribution of this effect, the difference happy deviant minus sad standards was computed and the mean amplitude of the difference waveform in the time-window 500–600 ms was used to create spline-interpolated isovoltage maps. The topographical distribution was typical for an MMN response. In particular, we observed a polarity inversion at temporobasal (mastoid) electrode sites (**Figure 2**). In condition B (**Figure 1**, bottom; **Table 5**), sad deviants, too, elicited a more negative waveform than the happy standards, though in an earlier latency range (P2, 200–300 ms). However, no difference was found when the ERPs to the sad tone were compared across conditions, suggesting that this effect was triggered by the structural difference of happy and sad tones rather than their functional significance as standard and deviant. To summarize the result: presenting a happy tone in a series of sad tones resulted in a late negativity that was larger in amplitude than the ERP to the same happy tone functioning as standard in the opposite condition. In contrast, no difference that could be related to its functional significance was found for the sad tone presented in a train of differing happy tones.

#### **DISCUSSION**


#### **Table 5 | Comparison of standard vs. deviant stimuli.**

*Given are the F-values (df = 1,15).*

*\*\*p < 0.01; \*p < 0.05.*

The affective deviant in condition A evoked a clear mismatch reaction. Though the latency was rather long, its topographic distribution, including the typical inversion of polarity over temporal regions (see **Figure 2**) in our nose-tip referenced data, suggests that it belongs to the MMN family. Indeed, MMN latency is known to increase with discrimination difficulty. In this regard, we would like to point to the predecessor study (Goydke et al., 2004), in which we obtained a rather long latency of the MMN response for emotional deviants, even though the latency was still shorter than in the present study. No doubt, discrimination was particularly difficult in the present experiment, because the difference in timbre was reduced to subtle changes in the expression of same-pitch and same-instrument tones. The mismatch reaction observed for condition A suggests that a happy tone was pre-attentively categorized as different from a group of different sad tones. An MMN reflects change detection in a previously established context (Näätänen, 1992). Thus, for it to occur, a context needs to be set up first. Consequently, the important question in the present experiment is not what is so particular about the happy tone. The question is what led to grouping the standard (sad) tones into one common category, so that the single happy tone was perceived as standing out. For the happy tone to be categorized as deviant, the sad tones, though different in structure, had to be perceived as belonging to the same context, i.e., category. The question thus arises: what led to the grouping of the sad tones? Three possibilities seem plausible:


#### **PERCEPTUAL SIMILARITY**

The results of the scaling experiment show that tones within the sad category were perceived as about as different from each other on a perceptual basis (e.g., sad01 and sad03: Fechnerian distance = 1.29) as the happy deviant was from the sad standards (e.g., sad03 vs. happy deviant: Fechnerian distance = 1.44). Relative distances are visualized in **Figure 3**. The arrangement of tones in a three-dimensional space results from feeding Fechnerian distance values into an MDS procedure (ALSCAL in SPSS), which finds the optimal constellation of stimuli in an *n*-dimensional space based on dissimilarity data. Three dimensions were found to explain 99% of the variance. Note that the orientation of the dimensions is arbitrary. Though the positions of sad01 and sad02 are relatively close, both are rather distant from sad03. Grouping, thus, cannot be explained by perceptual similarity alone.
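To make the scaling step concrete, the sketch below embeds a small dissimilarity matrix in three dimensions. It uses classical (Torgerson) MDS rather than the ALSCAL procedure reported in the text, so it illustrates the general idea only. Just two of the distances (1.29 and 1.44) come from the text; all other values in the matrix are hypothetical placeholders, and the function name is an assumption.

```python
import numpy as np


def classical_mds(dist, n_components=3):
    """Classical (Torgerson) MDS: coordinates whose Euclidean distances
    approximate the entries of the symmetric dissimilarity matrix `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)    # guard against tiny negatives
    coords = eigvecs[:, order] * np.sqrt(top_vals)
    # Share of the positive eigenvalue mass captured by each retained dimension.
    explained = top_vals / np.clip(eigvals, 0.0, None).sum()
    return coords, explained


labels = ["sad01", "sad02", "sad03", "happy"]
# Only sad01-sad03 (1.29) and sad03-happy (1.44) are reported in the text;
# the remaining entries are HYPOTHETICAL values chosen for illustration.
distances = np.array([
    [0.00, 0.40, 1.29, 1.50],
    [0.40, 0.00, 1.20, 1.45],
    [1.29, 1.20, 0.00, 1.44],
    [1.50, 1.45, 1.44, 0.00],
])

coords, explained = classical_mds(distances, n_components=3)
for label, point in zip(labels, coords):
    print(label, np.round(point, 2))
```

As in the text, the orientation of the resulting axes is arbitrary; only the relative distances between points are interpretable.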

#### **EMOTIONAL SIMILARITY**

Affect ratings (1.90, 1.95, and 2.20) indicate that the tones were perceived as about equally sad in expression. There is thus some support for the hypothesis that the tones were grouped together based on their emotional category. However, if it was the emotional expression that led to the automatic categorization, why did it not work in condition B? No indication was found of a mismatch reaction in response to a sad tone randomly interspersed in a train of different happy tones. Arguing along the same line as before, this (non-)finding implies that either no common standard memory trace was built for the happy tones or that this memory trace was considerably weaker for these tones. Since the affect ratings of the happy tones were just as homogeneous (3.35, 3.45, and 3.60) as those of the sad tones, the question arises whether the affect ratings gave a good enough representation of the emotion as it was decoded by the listeners. Given that decoding accuracy for acoustical emotion expressions has repeatedly been reported to be better for sadness than for happiness (Johnstone and Scherer, 2000; Elfenbein and Ambady, 2002; Juslin and Laukka, 2003), it might be necessary to take a second look at the stimulus material. Banse and Scherer (1996) found that if participants had the option to choose among many different emotional labels to rate an example of vocal expression, happiness was often confused with other emotions. In the present experiment, participants had given their ratings on bipolar dimensions ranging from happy to sad. It cannot be ruled out that the response format biased the outcome. It is, for example, possible that in some cases participants chose to rate a tone as happy because it was found to be definitely not sad, even if it was not perceived as being really happy either.

In an attempt to examine the perceived similarity of the tones with respect to the expressed emotion without pre-selected response categories, a similarity rating on emotional expression was performed *post hoc*. For that purpose, the same students who had participated in the first scaling experiment were asked to perform another same-different judgment on the same stimulus material, though this time with regard to the emotion expressed in the tone. The results are depicted in **Table 6** and show that sad tones (t.01, t.02, t.05) were perceived as considerably more similar to each other with respect to the emotion expressed than the happy tones (t.07, t.08, t.09). In fact, sad tones were judged half as dissimilar from each other as the happy tones (0.503 vs. 1.02). **Figure 4** shows the relation of same and different responses given for happy and sad tone pairs, respectively. Sad tones were considerably more often considered to belong to the same emotional category than happy tones (80% vs. 57% "same" responses). It can be assumed that in the MMN experiment, too, sad tones (in condition A) were perceived as belonging to one emotional category while happy tones (in condition B) were not. The difficulty of attributing the happy tones to the same "standard" category can explain why the sad tone did not evoke an MMN. It was not registered as deviant against a happy context, because no such context existed. Nevertheless, the hypothesis that the MMN reflects deviance detection based on emotional categorization can at least be maintained for condition A.

#### **EMOTION-SPECIFIC PERCEPTUAL SIMILARITY**

It was presupposed that emotion recognition in acoustical stimuli is based on certain acoustical cues coding the emotion intended to be expressed by the sender. To test whether the sad tones in the present experiment were similar with regard to prototypical cues for sadness, an acoustical analysis was performed on the stimulus set. Tones were analyzed on the parameters found to be relevant in the expression of emotion on single tones (Juslin, 2001). Using PRAAT (Boersma, 2001) and dBSonic, tones were assessed for the following features: high-frequency energy, attack, mean pitch, pitch contour, vibrato amplitude, vibrato rate, and sound level. For each feature, the range of values was divided into three categories (low, medium, high) and each tone was classified accordingly (**Table 7**). The acoustical analysis revealed that some, though not all, parameters were manipulated in the way that would have been expected based on previous findings. However, **Table 7** indicates that the cues were not used homogeneously. For example, mean pitch level was not a reliable cue. Moreover, vibrato was manipulated in individual ways by the musicians. Timbre, however, was well in line with expectations. All sad tones were characterized by little energy in the high-frequency spectrum. In contrast, more energy in high frequencies was found in the spectrum of the deviant happy tone. Based on the findings by Tervaniemi et al. (1994), it appears that a difference in spectral structure alone can trigger the MMN. That would mean that the sad tones were grouped together as standards based on their shared feature of attenuated higher partials. It has to be noted, though, that the high-frequency energy parameter is a very coarse means of describing timbre. Especially in natural tones [compared to synthesized tones as used by Tervaniemi et al. (1994)], the spectrum comprises a large number of frequencies with different relative intensities. As a consequence, the tones still have very individual spectra (and consequently sounds), even if they all display a relatively low high-frequency energy level. This fact is also reflected in the low perceptual similarity ratings. Moreover, if the spectral structure really was the major grouping principle, it should also have applied to the happy tones in condition B. Here, all happy tones were characterized by a high amount of energy in high frequencies, while the sad deviant was not. Nevertheless, no MMN was triggered. To conclude, though the possibility cannot be completely ruled out, it is not very likely that the grouping of the sad tones was based solely on similarities of timbre structure. Instead, the heterogeneity of parameters in **Table 7** provides support for Juslin's idea of redundant code usage in emotion communication (Juslin, 1997b, 2001). Evidently, expressive cues were combined differently in different sad tones.

Thus, though the sad tones did not display homogeneous patterns of emotion-specific cues, each tone was characterized by at least two prototypical cues for sadness expression. Based on the model assumption of redundant code usage, it seems likely that tones were grouped together because they were identified as belonging to one emotional category based on emotion-specific cues.

*Given are perceived distances of row tones and column tones with respect to their emotional expression; sad tones were t.01, t.02, and t.05, happy tones were t.07, t.08, and t.09.*

*Tested were parameters expected to be relevant cues to express emotion on single tones. Categorization as low, medium, and high was based on comparison with the "happy" tones.*
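The three-level classification of acoustic features mentioned in this section (low, medium, high) can be sketched as follows. The text does not state how the range of values was divided, so equal-width bins are an assumption here, and the vibrato-rate values are hypothetical illustration data rather than measurements from the study.

```python
import numpy as np

LEVELS = ("low", "medium", "high")


def three_level(values):
    """Label each value low/medium/high by splitting the observed range
    (min to max) into three equal-width bins. ASSUMPTION: equal widths."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    edges = lo + (hi - lo) * np.array([1 / 3, 2 / 3])  # two inner bin edges
    bin_index = np.digitize(values, edges)              # 0, 1, or 2
    return [LEVELS[i] for i in bin_index]


# HYPOTHETICAL vibrato rates (Hz) for six tones; the range 4.0 to 7.0 Hz
# gives inner bin edges at 5.0 and 6.0 Hz.
vibrato_rate_hz = [4.0, 4.6, 5.4, 7.0, 6.5, 4.2]
print(three_level(vibrato_rate_hz))
```

Applying the same function to each feature in turn would yield one row of category labels per feature, comparable in layout to a feature-by-tone table.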

What implication does this consideration have for the question of grouping principles in the MMN experiment? From what is known about the principles of the MMN, the results imply that the representation of the standard in memory included invariances across several different physical features. The invariances, however, needed to be in line with a certain template of how sadness is acoustically encoded. Several researchers have suggested the existence of such hard-wired templates for the rapid processing of emotional signals (Lazarus, 1991; LeDoux, 1991; Ekman, 1999; Scherer, 2001). It is assumed that, to allow for quick adaptational behavior, stimulus evaluation happens quickly and automatically. Incoming stimuli are expected to run through a matching process in which comparison with a number of schemes or templates takes place. Templates can be innate and/or formed by social learning (Ekman, 1999). The present study, while blind with respect to the origin of the template, provides some information as to how such a matching process might be performed on a pre-attentive level. Given the long latency of the MMN in the present experiment, it can be assumed that basic sensory processing has already taken place before the mismatch reaction occurs. Therefore, the MMN in the current experiment appears to reflect the mismatch between the pattern of acoustic cues identified as emotionally significant and the template for sad stimuli activated by the preceding standard tones. Our data are thus in line with considerations that the MMN does not only occur in response to basic acoustical feature processing. Several authors have suggested that the MMN can also reflect "holistic" (Gomes et al., 1997; Sussman et al., 1998) or "gestalt-like" (Lattner et al., 2005) perception. They assume that the representation of the "standard" in the auditory memory system is not merely built up based on the just-presented standard stimuli, but that it can be influenced by prototypical representations stored in other areas of the brain (Phillips et al., 2000). Evidence from a speech-specific phoneme processing task suggested that the MMN response does not only rely on matching processes in the transient memory store but that long-term representations for prototypical stimuli were accessed already at a pre-attentive level. For phonemes, Näätänen and Winkler (1999) assumed the existence of long-term memory traces serving as recognition patterns or templates in speech perception. They further posited that these can be activated by sounds "nearly matching with the phoneme-specific invariant codes" (p. 14). In another contribution, Näätänen et al. (2005) point out that the "mechanisms of generation of these more cognitive kinds of MMNs of course involve other, obviously higher-order, neural populations than those activated by a mere frequency change" (p. 27).

In the model of Schirmer and Kotz (2006), emotional-prosodic processing is conceptualized as a hierarchical process. Stage 1 comprises initial sensory processing of the auditory information before emotionally significant cues are integrated (stage 2) and cognitive evaluation processes (stage 3) take place. The MMN in response to emotional auditory stimuli might reflect the stage of integrating emotionally significant cues (Schirmer et al., 2005). The present data are compatible with the model, albeit in the area of nonverbal auditory emotion processing. The current data contribute to disentangling the processes underlying emotion recognition in the auditory domain. It has to be pointed out, though, that the present results can only give a first glimpse of the mechanisms underlying the processing of emotionally expressive tones. More studies with a larger set of tones characterized by different cues are needed to systematically examine the nature of the stimulus evaluation process.

Also, a critical issue for emotion recognition from musical sounds might be the time over which a listener can integrate the information. This might be the answer to the question of why the happy tones were perceived as less homogeneous than the sad tones. While all musicians had the intention to express happiness, it is possible that happiness simply cannot be expressed very well on single tones. Juslin (1997a), when looking for predictors of emotional ratings of musical performances, found that the best predictors for happiness were tempo and articulation. Both parameters are suprasegmental features and require a whole sequence of tones. In contrast, sadness ratings could be predicted by a number of cues, including segmental features such as sound level, spectrum, and attack.

#### **ACKNOWLEDGMENTS**

This research was supported by the Studienstiftung des Deutschen Volkes (Katja N. Spreckelmeyer) and the Deutsche Forschungsgemeinschaft (Katja N. Spreckelmeyer, Hans Colonius, Eckart Altenmüller, Thomas F. Münte). Hans Colonius and Thomas F. Münte were members of the SFB TR31 "Active Listening" during the time of the experiment.

#### **REFERENCES**

sounds. *Music Percept*. 30, 49–70. doi: 10.1525/mp.2012.30.1.49

music. *Am. J. Psychol*. 48, 246–268. doi: 10.2307/1415746

rehearsal system. A DC-potential study. *Brain* 123, 2338–2349. doi: 10.1093/brain/123.11.2338

W., Bangert, M., et al. (2006). Perception of emotional speech in Parkinson's disease. *Mov. Disord*. 21, 1774–1778. doi: 10.1002/mds.21038

Williams, C. E., and Stevens, K. N. (1972). Emotions and speech: some acoustical correlates. *J. Acoust. Soc. Am.* 52, 1238–1250. doi: 10.1121/1.1913238

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 25 April 2013; paper pending published: 27 May 2013; accepted: 03 September 2013; published online: 23 September 2013.*

*Citation: Spreckelmeyer KN, Altenmüller E, Colonius H and Münte TF (2013) Preattentive processing of emotional musical tones: a multidimensional scaling and ERP study. Front. Psychol. 4:656. doi: 10.3389/fpsyg.2013.00656*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Spreckelmeyer, Altenmüller, Colonius and Münte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Changing the tune: listeners like music that expresses a contrasting emotion

#### **E. Glenn Schellenberg<sup>1</sup>\*, Kathleen A. Corrigall<sup>1</sup>, Olivia Ladinig<sup>2</sup> and David Huron<sup>2</sup>**

<sup>1</sup> Department of Psychology, University of Toronto Mississauga, Mississauga, ON, Canada <sup>2</sup> School of Music, Ohio State University, Columbus, OH, USA

#### **Edited by:**

Anjali Bhatara, Université Paris Descartes, France

#### **Reviewed by:**

Nathalie Gosselin, Rivière-des-Prairies Hospital and University of Montreal, Canada Scott Parker, American University, USA

#### **\*Correspondence:**

E. Glenn Schellenberg, Department of Psychology, University of Toronto Mississauga, 3359 Mississauga Road North, Mississauga, ON, Canada L5L 1C6.

e-mail: g.schellenberg@utoronto.ca

Theories of esthetic appreciation propose that (1) a stimulus is liked because it is expected or familiar, (2) a stimulus is liked most when it is neither too familiar nor too novel, or (3) a novel stimulus is liked because it elicits an intensified emotional response. We tested the third hypothesis by examining liking for music as a function of whether the emotion it expressed contrasted with the emotion expressed by music heard previously. Stimuli were 30-s happy- or sad-sounding excerpts from recordings of classical piano music. On each trial, listeners heard a different excerpt and made liking and emotion-intensity ratings. The emotional character of consecutive excerpts was repeated with varying frequencies, followed by an excerpt that expressed a contrasting emotion. As the number of presentations of the background emotion increased, liking and intensity ratings became lower compared to those for the contrasting emotion. Consequently, when the emotional character of the music was relatively novel, listeners' responses intensified and their appreciation increased.

**Keywords: music, emotion, liking music, music preferences, contrast effect, hedonic ratings**

#### **INTRODUCTION**

A stimulus is perceived differently depending on whether it is presented in isolation or in context. In vision, for example, the same gray square looks lighter or darker depending on whether it is presented against a black or white background, respectively (i.e., White's illusion; White, 2010). In audition, the same tone is perceived to sound louder or softer when it is presented among softer or louder background tones, respectively (e.g., Melamed and Thurlow, 1971). Similarly, the temperature of the same tactile stimulus appears to increase or decrease after exposure to a relatively cold or warm stimulus, respectively (Locke, 2008). In general, then, the perceived magnitude of a stimulus along some continuous parameter (e.g., lightness, loudness, and temperature) shifts such that it is further from stimuli presented in the same context, a phenomenon known as the *contrast effect*. This phenomenon extends to higher-level evaluative processes, or *hedonic contrasts* (Parducci, 1995). For example, evaluations of pieces of music increase or decrease depending on whether previously heard pieces sounded bad or good, respectively (Parker et al., 2008). Similar hedonic-contrast effects are observed with tastes (Zellner et al., 2003), pictures of birds (Zellner et al., 2003), paintings (Dolese et al., 2005; Zellner et al., 2010), and the degree to which people are considered physically attractive (Kenrick and Gutierres, 1980).

Hedonic contrasts are especially relevant for responses to works of art and other stimuli that are evaluated esthetically. Emotional responding to art differs from responding to other stimuli because it occurs on two levels: one related to the emotion expressed by the work of art, the other to the perceiver's evaluation (Hunter and Schellenberg, 2010). Accordingly, perceivers can have a positive evaluation of a stimulus that expresses a negative emotion, such as when they like sad-sounding music (e.g., The Beatles' *Yesterday*) or paintings that portray distress (e.g., Munch's *The Scream*). Positive hedonic evaluations are important psychologically because they can lead to perceptual sensitization (Vanderplas and Blake, 1949). For example, when presented at a low amplitude, words are identified more successfully if they are evaluated favorably rather than unfavorably. In the case of music, pieces that are positively evaluated are remembered better than pieces with neutral or negative evaluations (Stalinski and Schellenberg, 2012).

In the present study, we were interested in emotional responses to esthetic stimuli – those that pose no immediate threat or benefit to survival. Our specific focus was on liking music, and how listeners evaluate excerpts of music as a function of whether the emotion they express contrasts with the emotion expressed by music heard previously. Theories about the psychology of esthetics speculate about contrasts in different ways, making different predictions. Esthetic appreciation may increase as a consequence of the predictability that comes from repetition, when contrast is minimized. The *prediction effect* posits specifically that fulfilled expectations (i.e., anticipatory successes) lead to positive feelings arising from the limbic reward system (Huron, 2006). From this view, because a contrasting stimulus is unexpected, it should be evaluated unfavorably. Other theorists (Berlyne, 1960, 1971; Eysenck, 1973) propose a trade-off between predictability and novelty as formalized in the *two-factor model* (Berlyne, 1970; Stang, 1974): A stimulus is liked as a function of its arousal potential, which can be too high (e.g., novel) or too low (e.g., predictable). Because the stimulus is evaluated most favorably when it is somewhat familiar but not overly familiar, one would expect increases in liking for music expressing the same emotion after a few exposures, but decreases after many exposures. Finally, high predictability arising from repeated exposure to music expressing the same emotion may lead to habituation or desensitization (i.e., boredom), such that a contrasting or novel stimulus is evaluated favorably (Schubert, 1996).

There are theoretical and empirical reasons for expecting that the third hypothesis could account for listeners' evaluations of a musical piece that expresses a contrasting emotion, regardless of whether the background (i.e., habituated) emotion is positive or negative. Although positive and negative emotions are typically linked to pleasure and displeasure, respectively, Schubert (1996) suggests that the link between negative emotion and displeasure is de-activated in esthetic contexts that have no consequences for survival, including but not limited to music listening. Any activation in these contexts – positive or negative – is linked to pleasure, such that the *intensity* of the emotional response predicts the degree of pleasure and, hence, the magnitude of the positive appraisal. Moreover, habituation to one type of emotional stimulus should lessen the listener's arousal level. After sustained or repeated exposure to a single emotion, the expression of another emotion will lead to heightened activation and, consequently, an increase in liking. The increase in liking for the contrasting emotion is primarily a consequence of decreases in liking (habituation) for the sustained or repeated emotion, with the contrasting emotion causing dishabituation and a restoration of activation levels and liking. In the present context, after hearing many, say, happy-sounding pieces of music, listeners should exhibit increased activation and increased liking for a piece of music that expresses a contrasting emotion such as sadness.

Empirical results are consistent with the proposal that increases in emotional activation are predictive of increases in liking for music. Many years ago, Gatewood (1927) observed that pleasure is linked to the intensity of a musical experience rather than the type of experience. In fact, she found that pleasantness ratings were correlated positively with ratings of a variety of different feelings, including sadness, love, longing, amusement, dignity, reverence, how restful the music made listeners feel, or how much the music stirred them. Pleasantness ratings were also correlated with the number of emotions the music activated, and with intensity ratings summed across the different emotions. In another study from the same era (Washburn and Dickinson, 1927), pleasantness ratings were higher for music that evoked feelings of excitement or calmness than for music that evoked a neutral response. In a review of hedonic responses to music and other art forms, Martindale (1984) concluded that esthetic pleasure is typically a positive, monotonic function of emotional activation. More recent research confirms that the intensity of listeners' emotional responding to music is correlated positively with hedonic ratings (Schubert, 2007a, 2010; Ladinig and Schellenberg, 2012; Vuoskoski et al., 2012).

In an extension of Parducci's (1995) theory of contextually determined happiness or pleasantness, Huron (2006) described *contrastive valence* as another source of musical pleasure and displeasure. Contrastive valence is based on a mismatch between a musical prediction and the actual outcome. If a positive event is expected, a negative outcome will feel overly unpleasant. By contrast, if a negative event is expected, a positive outcome will feel overly pleasant. In the present study, we sought to extend this line of reasoning to contrasting emotions expressed or evoked by music. If a listener is exposed to several happy-sounding (or sad-sounding) music excerpts in succession, the introduction of a sad-sounding (or happy-sounding) excerpt should sound especially sad (or happy) in contrast. Because listeners' emotional responses to music tend to parallel the emotions music conveys (Schubert, 2007a,b), especially for happiness and sadness (Hunter et al., 2010), an excerpt that sounds particularly happy or sad because of its contrasting status should evoke a particularly intense emotional response, and consequently greater liking.

Musical pieces differ on many dimensions, which can be continuous (e.g., slow-to-fast, quiet-to-loud) or dichotomous (e.g., major/minor, staccato/legato). Within a single piece, contrasts can occur on a small time scale, such as with alternating consonant or dissonant chords, or on a large time scale, such as with alternations of verse and chorus. Successive movements of a symphony or concerto, or the order of pieces in a concert program, represent contrasts on substantially longer time scales. In the present study, we focused on one particular contrast: happiness and sadness. Happy-sounding music tends to be fast in tempo and in major mode, whereas sad-sounding music tends to be slow and minor (for a review see Hunter and Schellenberg, 2010). Happiness and sadness are among the easiest emotions to convey musically (Gabrielsson and Juslin, 1996), particularly when they are contrasted with one another. In fact, young deaf children with cochlear implants – which provide poor spectral resolution and degraded perception of music – can distinguish happy- from sad-sounding music (Volkova et al., 2012).

On each trial in the present experiments, listeners heard a different excerpt of music that sounded unambiguously happy or sad. Results from multiple samples of listeners from the same university population – who listen primarily to dance-pop music (Stalinski and Schellenberg, 2012) – motivated the assumption that excerpts from the particular genre used here (i.e., classical piano pieces) would be unfamiliar to the present listeners. Their task was to rate how much they liked each excerpt and the intensity of their emotional response. Our focus was on responses to excerpts conveying an emotion (e.g., sadness) that contrasted with a background emotion that had been expressed repeatedly (e.g., happiness) with a varying number of presentations.

In Experiment 1, listeners made liking and emotion-intensity ratings in response to 16 different excerpts of music: 14 background excerpts that expressed either happiness or sadness and 2 excerpts that expressed the contrasting emotion. The emotional status of the excerpts had an ABAAAAAAAAAAAAAB order, with A corresponding to the background emotion and B to the contrasting emotion. Thus, the first B excerpt followed a single presentation of an A excerpt, whereas the second B excerpt followed 13 consecutive presentations of different excerpts expressing the A emotion. We predicted that liking and the intensity of listeners' emotional response would be greater for the second B excerpt than for the immediately preceding A excerpt, whereas responses to the initial A and B excerpts would be similar. This hypothesis applied equally to conditions in which A and B excerpts were happy and sad sounding, respectively, or vice versa.
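The trial structure just described can be checked with a short sketch that builds the A/B order and counts how many background (A) excerpts immediately precede each contrasting (B) excerpt. The function names are hypothetical and serve only to illustrate the design; they are not part of the authors' materials.

```python
def experiment1_order(n_trials=16, contrast_positions=(2, 16)):
    """Build the A/B sequence with B at the given 1-indexed trial positions."""
    return "".join(
        "B" if trial in contrast_positions else "A"
        for trial in range(1, n_trials + 1)
    )


def preceding_run_lengths(order):
    """Number of consecutive A excerpts heard immediately before each B."""
    runs, current_run = [], 0
    for label in order:
        if label == "A":
            current_run += 1
        else:
            runs.append(current_run)
            current_run = 0
    return runs


order = experiment1_order()
print(order)                              # 16-trial A/B sequence
print(order.count("A"), order.count("B")) # background vs. contrasting counts
print(preceding_run_lengths(order))       # A-run length before each B
```

The run lengths make the key comparison explicit: the first B follows a single A, whereas the second B follows a long uninterrupted run of A excerpts.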

In Experiment 2, we compared liking and emotion-intensity responses to background and contrasting music excerpts after listeners heard 1, 2, 4, or 8 excerpts that expressed the background emotion. We predicted that as presentation frequency of the background emotion increased, emotion-intensity and liking ratings for the contrasting excerpts would progressively exceed responses to the background excerpts. Because the association between emotional responding and frequency of stimulus presentation tends to be logarithmic (Zajonc, 1968; Harrison, 1977; Bornstein, 1989), we conducted trend analyses to examine effects of presentation frequency, which varied logarithmically.
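Because the presentation frequencies 1, 2, 4, and 8 are equally spaced on a logarithmic scale, standard polynomial trend contrasts can be applied to the four levels. The sketch below derives linear and quadratic weights from the log-transformed levels; it is one plausible way to set up such a trend analysis, not a description of the authors' actual procedure, and the condition means are hypothetical.

```python
import numpy as np

levels = np.array([1, 2, 4, 8])   # presentations of the background emotion
log_levels = np.log2(levels)      # 0, 1, 2, 3: equally spaced

# Linear weights: centered log levels. Quadratic weights: centered squares
# of the linear weights (orthogonal to linear for equally spaced levels).
linear = log_levels - log_levels.mean()
quadratic = linear ** 2 - (linear ** 2).mean()


def contrast_score(condition_means, weights):
    """Weighted sum of condition means for one trend contrast."""
    return float(np.dot(weights, condition_means))


# HYPOTHETICAL mean differences (contrasting minus background liking)
# at each presentation frequency, for illustration only.
liking_difference = np.array([0.0, 0.2, 0.4, 0.6])

linear_score = contrast_score(liking_difference, linear)
quadratic_score = contrast_score(liking_difference, quadratic)
print(linear, quadratic, linear_score, quadratic_score)
```

With means that grow by a constant amount per doubling of presentation frequency, as in this made-up example, only the linear contrast is nonzero, which is the pattern a logarithmic relation would produce.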

In both experiments, we predicted that emotional responding (i.e., liking and intensity) to the background excerpts would decrease as the number of presentations increased, whereas responding to the contrasting excerpts would be stable or increase. In other words, we predicted an interaction between emotion type (background or contrasting) and presentation frequency. For both experiments, we predicted that intensity and liking ratings would be positively correlated. Because liking is considered to be a consequence of increases in emotional intensity, we also expected that increases in liking due to emotional contrast would disappear when intensity ratings were held constant.

### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Listeners were undergraduate students enrolled in an introductory psychology course who participated for partial course credit. They were recruited without regard to music training. In Experiment 1, 46 listeners were tested; 29 had taken private music lessons for an average duration of 4.0 years (SD = 3.0 years). When asked about their music-listening habits, only three participants reported listening primarily to classical music, with an additional two indicating that they sometimes listened to classical music. In Experiment 2, 48 new listeners were tested; 26 had taken private music lessons for an average duration of 5.8 years (SD = 4.5 years). Only three participants reported listening primarily to classical music, with an additional three indicating that they sometimes listened to classical music. All participants provided informed written consent, and the experiments were approved by the Office of Research Ethics at the University of Toronto.

#### **STIMULI**

In Experiment 1, stimuli were 28 music excerpts taken from commercially available compact disks, each approximately 30 s in duration (**Table 1**). In Experiment 2, an additional two excerpts were added to the set. All stimuli were normalized in amplitude to minimize variability in perceived loudness. Stimuli were selected exclusively from nineteenth- and twentieth-century piano music without any vocals or other instruments. Half of the excerpts were selected to convey happiness (major mode, fast tempo); the others were selected to convey sadness (minor mode, slow tempo).

To verify that the excerpts conveyed the intended emotion, participants judged how happy and how sad each excerpt sounded on five-point scales (1 = *not at all*, 5 = *extremely*) at the end of the test session, after making their liking and emotion-intensity ratings. The excerpts were presented in a different random order for each listener. In Experiment 1, the 14 fast/major excerpts were deemed to sound more happy (*M* = 3.45, SD = 0.79) than sad (*M* = 1.27, SD = 0.36), *t*(45) = 17.70, *p* < 0.001, whereas the 14 slow/minor excerpts were deemed to sound more sad (*M* = 2.99, SD = 0.81) than happy (*M* = 1.54, SD = 0.49), *t*(45) = 10.34, *p* < 0.001. Similarly, in Experiment 2, participants rated the 15 fast/major excerpts as significantly more happy- (*M* = 3.25, SD = 0.66) than sad-sounding (*M* = 1.32, SD = 0.39), *t*(47) = 19.09, *p* < 0.001, and the 15 slow/minor excerpts as more sad- (*M* = 3.02, SD = 0.68) than happy-sounding (*M* = 1.45, SD = 0.47), *t*(47) = 13.04, *p* < 0.001. In both experiments, each individual fast/major excerpt was rated as more happy- than sad-sounding, and each slow/minor excerpt was rated as more sad- than happy-sounding (all *p*s < 0.001). Thus, the stimulus excerpts conveyed the intended emotions.

#### **PROCEDURE**

Listeners were tested individually. They were assigned randomly to one of two conditions, constrained so that happiness was the background emotion for half of them and sadness was the background emotion for the other half. In Experiment 1, each listener heard 16 of the 28 musical excerpts: all 14 that expressed one of the two background emotions, and 2 that expressed the contrasting emotion. In Experiment 2, each listener heard 19 excerpts: all 15 that expressed one of the two background emotions, and 4 that expressed the contrasting emotion. On each trial, listeners rated how much they liked each excerpt (1 = *not at all*, 5 = *extremely*) and the intensity of their emotional response (1 = *felt nothing*, 5 = *highly emotional*). Because the excerpts were selected to sound unambiguously happy or sad, no questions were asked about perceived or felt happiness or sadness until the end of the testing session.

In Experiment 1, the trials began with one presentation of the background emotion (selected randomly) followed by one presentation of the contrasting emotion (selected randomly), followed by 13 excerpts representing the background emotion (in random order) and a second excerpt representing the contrasting emotion (selected randomly). The critical trials of interest involved the two excerpts expressing the contrasting emotion and the two immediately preceding trials that expressed the background emotion (i.e., the first and last presentations of both emotions).
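The Experiment 1 trial order described above can be sketched in a few lines of Python. This is an illustrative reconstruction only, not the authors' software: the function name, the `rng` argument, and the placeholder excerpt labels are assumptions introduced here.

```python
import random


def experiment1_sequence(background, contrasting, rng):
    """Build one listener's 16 trials in ABAAAAAAAAAAAAAB order.

    background  -- the 14 excerpts expressing the background emotion (A)
    contrasting -- the pool of excerpts expressing the contrasting emotion (B)
    rng         -- a random.Random instance (hypothetical; allows reproducible orders)
    """
    bg = list(background)
    if len(bg) != 14 or len(contrasting) < 2:
        raise ValueError("expected 14 background and at least 2 contrasting excerpts")
    rng.shuffle(bg)                                   # random order of the A excerpts
    first_b, second_b = rng.sample(list(contrasting), 2)  # two different B excerpts
    trials = [("A", bg[0]), ("B", first_b)]           # first A, then first B
    trials += [("A", excerpt) for excerpt in bg[1:]]  # 13 further A excerpts
    trials.append(("B", second_b))                    # second B closes the session
    return trials


# Placeholder stimulus names (hypothetical, not the actual recordings).
happy = [f"happy_{i:02d}" for i in range(1, 15)]
sad = [f"sad_{i:02d}" for i in range(1, 15)]
trials = experiment1_sequence(happy, sad, random.Random(0))

# Critical trials: first A, first B, last A, last B.
critical = [trials[0], trials[1], trials[14], trials[15]]
```

Under these assumptions, the four critical trials are always at positions 1, 2, 15, and 16 of the session, regardless of which excerpts are drawn.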

In Experiment 2, excerpts expressing the contrasting emotion occurred after 1, 2, 4, and 8 excerpts that expressed the background emotion. All 24 (i.e., 4!) possible orders of the four presentation frequencies were used. Counterbalanced with condition (happiness or sadness as the background emotion), there were 48 unique presentation orders – one for each of the 48 participants. As in Experiment 1, stimulus selection and order were randomized separately for each listener. Responses to eight critical trials were analyzed: the four that conveyed the contrasting emotion and the four immediately preceding trials that conveyed the background emotion.
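The counterbalancing arithmetic for Experiment 2 can be checked with a short sketch. It assumes, as the description implies, that each run of 1, 2, 4, or 8 background excerpts is followed directly by one contrasting excerpt; the function names and the condition labels are hypothetical.

```python
from itertools import permutations

FREQUENCIES = (1, 2, 4, 8)  # background excerpts heard before each contrasting one


def presentation_orders():
    """All (condition, order) pairs: 2 background emotions x 4! frequency orders."""
    return [(condition, order)
            for condition in ("happy", "sad")
            for order in permutations(FREQUENCIES)]


def trial_labels(order):
    """Expand a frequency order into A (background) and B (contrasting) trial labels."""
    labels = []
    for n_background in order:
        labels.extend(["A"] * n_background)  # run of background excerpts
        labels.append("B")                   # followed by one contrasting excerpt
    return labels


orders = presentation_orders()
example = trial_labels((8, 1, 4, 2))
```

Every order yields 15 background and 4 contrasting trials (19 in total), and the 48 condition-by-order combinations match the 48 participants.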

#### **RESULTS**

Preliminary analyses confirmed that in both experiments, listeners in the two conditions (happiness vs. sadness as the background emotion) did not differ in terms of gender, age, or years of private music lessons. The principal analyses comprised two mixed-design analyses of variance (ANOVAs): one on liking ratings and the other on emotion-intensity ratings. Both analyses had one between-subjects factor: condition, and two repeated measures: emotion type (background or contrasting) and presentation frequency of the background emotion (Experiment 1: 1 or 13, Experiment 2: 1, 2, 4, or 8).
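Because the Experiment 2 frequencies (1, 2, 4, 8) are equally spaced on a logarithmic scale, a linear trend over them can be expressed with the usual equally spaced contrast weights. The sketch below shows one way to derive such weights and a per-listener trend score. It is an assumption-laden illustration: the article does not report the software or the exact weights used, and the function names and example ratings are hypothetical.

```python
import math


def linear_contrast_weights(levels):
    """Centered log2 values of the levels, usable as linear-trend contrast weights."""
    logs = [math.log2(level) for level in levels]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def trend_score(ratings, weights):
    """Weighted sum of one listener's ratings; negative means ratings decline."""
    return sum(w * r for w, r in zip(weights, ratings))


weights = linear_contrast_weights((1, 2, 4, 8))  # proportional to -3, -1, 1, 3

# Hypothetical ratings at frequencies 1, 2, 4, 8 (not data from the study).
declining = trend_score([4, 3, 3, 2], weights)
flat = trend_score([3, 3, 3, 3], weights)
```

A trend score computed separately for background and contrasting excerpts gives a simple way to picture the reported interaction between emotion type and the linear trend.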

#### **Table 1 | Piano recordings used in Experiments 1 and 2.**


\*Used only in Experiment 2.

#### **EXPERIMENT 1**

For liking ratings, there were no main effects or interactions involving condition. Descriptive statistics are illustrated in **Figure 1** (upper panel) as a function of emotion type and presentation frequency. A significant interaction between emotion type and frequency, *F*(1, 44) = 15.93, *p* < 0.001, partial η <sup>2</sup> = 0.27, motivated separate examination of the background and contrasting emotions. Whereas liking ratings for the background emotion declined from the first to the last presentation, *F*(1, 44) = 25.04, *p* < 0.001, partial η <sup>2</sup> = 0.36, liking ratings for the contrasting emotion were identical, *F* = 0. Moreover, liking ratings did not differ between the background and contrasting emotions during the first two trials, but they did during the final two, *F*(1, 44) = 8.32, *p* = 0.006, partial η <sup>2</sup> = 0.16, with greater liking for the contrasting emotion. Because there was no three-way interaction, response patterns were similar whether happiness or sadness was the background emotion.

Descriptive statistics for intensity ratings are illustrated in **Figure 1** (lower panel) as a function of emotion type and presentation frequency. As with liking ratings, there were no significant effects involving condition. In line with predictions, there was a significant interaction between emotion type and frequency, *F*(1, 44) = 19.42, *p* < 0.001, partial η <sup>2</sup> = 0.31. For the background emotion, intensity ratings declined from the first to the last presentation, *F*(1, 44) = 13.58, *p* = 0.001, partial η <sup>2</sup> = 0.24, whereas for the contrasting emotion, intensity ratings increased, *F*(1, 44) = 4.06, *p* = 0.050, partial η <sup>2</sup> = 0.08. Moreover, intensity ratings did not differ between the background and contrasting emotions during the first two trials, but they did during the final two, *F*(1, 44) = 23.50, *p* < 0.001, partial η <sup>2</sup> = 0.35, with higher ratings for excerpts expressing the contrasting emotion. As with liking ratings, the lack of a three-way interaction meant that response patterns for intensity ratings were similar whether happiness or sadness was the background emotion.
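The effect sizes reported for Experiment 1 can be recovered from the *F* statistics and their degrees of freedom with the standard identity partial η<sup>2</sup> = (*F* × df<sub>effect</sub>) / (*F* × df<sub>effect</sub> + df<sub>error</sub>). The sketch below applies that identity to the values quoted above; the function name is an assumption introduced for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, reported partial eta squared) pairs for Experiment 1, all with df = (1, 44).
reported = [
    (15.93, 0.27), (25.04, 0.36), (8.32, 0.16),                # liking ratings
    (19.42, 0.31), (13.58, 0.24), (4.06, 0.08), (23.50, 0.35), # intensity ratings
]

recomputed = [round(partial_eta_squared(f, 1, 44), 2) for f, _ in reported]
```

Each recomputed value agrees with the reported effect size to two decimal places.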

We calculated correlations between liking and intensity ratings separately for both emotion types (background and contrasting) and both presentation frequencies (1 or 13). Liking and intensity ratings were correlated positively in all four instances (see **Table 2**). Finally, we repeated the original analysis on liking ratings using multi-level modeling (unstructured covariance matrix) so that intensity ratings could be included as a covariate. Although the association between intensity and liking was highly significant, *F*(1, 151.45) = 122.67, *p* < 0.001, the interaction between emotion type and presentation frequency was eliminated.

#### **EXPERIMENT 2**

For liking ratings, there were no main effects or interactions involving condition. Descriptive statistics are illustrated in **Figure 2** (upper panel) as a function of emotion type and presentation frequency. As expected, the linear trend for presentation frequency interacted with emotion type, *F*(1, 46) = 9.04, *p* = 0.004, partial η <sup>2</sup> = 0.16. As the number of presentations increased, liking ratings for the background emotion decreased, *F*(1, 46) = 6.60, *p* = 0.014, partial η <sup>2</sup> = 0.13, but there was no linear trend for the contrasting emotion. There were no effects involving quadratic or cubic trends. Liking ratings did not differ between the background and the contrasting emotion after one or two presentations, but they approached significance after four presentations, *F*(1, 46) = 3.66, *p* = 0.062, partial η <sup>2</sup> = 0.07, and differed significantly after eight presentations, *F*(1, 46) = 5.17, *p* = 0.028, partial η <sup>2</sup> = 0.10.

**Table 2 | Correlations between liking and emotion-intensity ratings in Experiments 1 and 2 (all *p*s < 0.05).**

Back, background; Cont, contrasting.

**FIGURE 2 | Mean liking (upper panel) and emotion-intensity (lower panel) ratings in Experiment 2, illustrated as a function of emotion type (background or contrasting) and presentation frequency of the background emotion (1, 2, 4, or 8).** Error bars are SE.

For intensity ratings, there were again no significant effects involving condition. Descriptive statistics for intensity ratings are illustrated in **Figure 2** (lower panel) as a function of emotion type and presentation frequency. As with liking ratings, there was a significant interaction between emotion type and the linear trend for presentation frequency, *F*(1, 46) = 11.98, *p* = 0.001, partial η <sup>2</sup> = 0.21. As the number of presentations of the background emotion increased, there was a significant decrease in intensity ratings for the background emotion, *F*(1, 46) = 4.08, *p* = 0.049, partial η <sup>2</sup> = 0.08, but no linear trend for the contrasting emotion. There were no effects involving quadratic or cubic trends. Intensity ratings did not differ between the background and contrasting excerpts after one or two presentations of the background emotion, but they did after four presentations, *F*(1, 46) = 9.46, *p* = 0.004, partial η <sup>2</sup> = 0.17, and after eight presentations, *F*(1, 46) = 8.16, *p* = 0.006, partial η <sup>2</sup> = 0.15, with higher ratings for the contrasting excerpt.

As in Experiment 1, liking and intensity ratings were significantly correlated for the background and contrasting excerpts at each of the four presentation frequencies. The eight correlations are presented in **Table 2**. The final analysis used multi-level modeling on liking ratings, with the same independent variables as in the original mixed-design ANOVA, but with intensity ratings added as a covariate. Although there was a robust association between intensity and liking, *F*(1, 255.81) = 319.48, *p* < 0.001, the interaction between emotion type and the linear trend for presentation frequency disappeared.

#### **DISCUSSION**

The analyses revealed four main findings: (1) listeners reported greater appreciation and a more intense emotional response when the music contrasted in emotional status to that of music heard previously, (2) liking and intensity ratings were correlated positively, (3) the contrast effect for liking disappeared when the intensity of listeners' emotional responses was held constant, and (4) response patterns were similar whether the background emotion was happiness or sadness.

In line with predictions, both liking and emotion-intensity ratings decreased after hearing many different background excerpts that expressed the same emotion, such that liking and emotion-intensity ratings were larger in comparison for excerpts that expressed a contrasting emotion. Moreover, the results of Experiment 2 provided evidence of a dose-response association: As the frequency of presentation of the background excerpts increased, so did the observed contrast effect. In both experiments, liking and emotion-intensity ratings were correlated, and the contrast effect for liking ratings disappeared when emotional intensity was held constant. Separate randomization for each listener of both excerpt selection and stimulus order ensured that any intrinsic differences in the excerpts' likeability had no effect on response patterns. Moreover, no stimulus was ever repeated for any participant. Only the emotional character was repeated along with associated features such as mode and approximate tempo. In other words, the results revealed habituation for music on a more abstract level than simple repetition.

In line with Schubert (1996), the observed contrast effect was driven primarily by reductions in emotional responding to music expressing the background emotion as the number of presentations increased. Thus, the effect was mainly a consequence of habituation or desensitization to the background emotion rather than increases in emotional responding to the contrasting emotion. In general, responding to the contrasting emotion remained at baseline levels as presentation frequency of the background emotion increased. The one exception involved emotion-intensity ratings in Experiment 1, which increased above baseline levels for the contrasting emotion after listeners heard 13 different music excerpts that expressed the background emotion. Our documentation of habituation or desensitization to an abstract stimulus property such as emotional character parallels findings from studies of infants that report habituation and/or novelty preferences based on the number of items in a display (e.g., two vs. three; Starkey and Cooper, 1980), categories such as animals (dogs vs. cats; Quinn et al., 1993) or furniture (e.g., chairs vs. couches; Behl-Chadha, 1996), and rules of order with speech sounds (ABB vs. ABA; Marcus et al., 1999).

The present findings serve to inform and evaluate theories relevant to hedonic responding. For example, the two-factor model (Berlyne, 1970; Stang, 1974) fails to account for response patterns because there was no initial increase in liking for music excerpts that expressed the same (background) emotion. In Huron's (2006) theory of emotional responding to music, the prediction effect posits that pleasure arises from the occurrence of expected events in music, which can be a consequence of simple repetition or variations on a theme (i.e., repetition with subtle changes). Because listeners exhibit greater liking for previously unfamiliar music when they hear it repeatedly in the laboratory, at least up to a point (Meyer, 1903; Getz, 1966; Heingartner and Hall, 1974; Szpunar et al., 2004; Schellenberg et al., 2008), one might expect increases rather than decreases in liking for pieces of music presented sequentially when the pieces express the same emotion. Our results, however, point to *decreases* in liking. As such, the prediction effect may be limited to the positive experience of fulfilled expectancies while listening to a single piece of music, or to repeated presentations of the same piece. Moreover, both theories might be more applicable to a different genre of music (e.g., jazz), timbres other than piano, or to pieces that convey emotions in a more subtle manner.

In the present experiments, listeners' expectancies or predictions about the emotions expressed by the music excerpts could have worked in two ways. Expectancies for another repetition of the background emotion are consistent with the "hot hand" belief in non-randomness, but different from expectancies for change that are a hallmark of the "gambler's fallacy" – the false belief that random but independent events are influenced by past occurrences (Burns and Corpus, 2004). For example, after the initial two trials in which all listeners heard one happy- and one sad-sounding excerpt, they may have expected that on subsequent trials, happy- and sad-sounding excerpts would occur equally often, or that the particular emotion an excerpt expressed was determined randomly. Thus, when a contrasting excerpt was presented after a long series of background excerpts, it may have been "overdue" and highly expected or predicted, and therefore pleasurable.

Results from studies of infants show transitions from an initial preference based on stimulus familiarity to one based on novelty (Rose et al., 1982). In the present experiments with adults, relatively rapid habituation to a particular emotion may have been a consequence of the fact that emotions are processed rapidly and automatically (Zajonc, 1980), even when they are expressed musically (Bigand et al., 2005). For example, when an orienting task requires listeners to attend to the emotion expressed by a piece of music, liking for a piece of obviously happy- or sad-sounding music peaks after two exposures (Schellenberg et al., 2008). When the orienting task requires listeners to attend to the lead instrument and the piece is emotionally ambiguous, liking peaks after eight exposures (Szpunar et al., 2004). For different pieces that express the same emotion, only one dimension repeats on the level of the specific emotion (i.e., happiness or sadness). By contrast, a whole piece has many dimensions (e.g., changes in melody, rhythm, harmony, dynamics, and so on), which require more repetitions in order to remember the piece completely. Accordingly, listening to an unfamiliar piece of music initially increases liking for it (Gaudreau and Peretz, 1999), but after many repetitions, liking turns to disliking (Szpunar et al., 2004; Schellenberg et al., 2008).

Our data corroborate and extend Huron's (2006) notion of contrastive valence, which suggests that a listener's emotional response is intensified when a musical event contrasts with what is expected. Huron focuses primarily on experiences of pleasantness or unpleasantness that occur in response to unexpected positive and negative musical events, respectively, such that unexpectedness intensifies the listeners' hedonic evaluation. In the present investigation, listeners reported more intense responses to music whose emotional character contrasted with music heard previously, which, in turn, led to relatively positive evaluations whether the music was happy or sad sounding. Note that the effect size of the crucial interaction (i.e., between presentation frequency and emotion type) was larger for emotion-intensity than for liking ratings in both experiments (compare the upper and lower panels in **Figures 1** and **2**), an additional finding consistent with our hypothesis that the intensity of the emotional response would determine the evaluative response. Moreover, in Experiment 2, after four presentations of the background emotion, differences between the background and contrasting excerpts were significant for intensity ratings but only marginal for liking ratings.

Can we be certain that listeners were actually responding emotionally to the excerpts rather than simply perceiving the emotions conveyed? We know that music listeners reliably distinguish the two types of responses when asked to rate their feelings *and* perceptions (Kallinen and Ravaja, 2006; Schubert, 2007b; Evans and Schubert, 2008; Hunter et al., 2010). In the present study, listeners were told specifically to rate the intensity of their feelings, not the intensity of the emotions conveyed by the excerpts, and we have no reason to doubt that they followed instructions. In any event, because perception and feeling ratings in response to music tend to vary in tandem (Kallinen and Ravaja, 2006; Hunter et al., 2010), with feelings mediated by perceptions in some circumstances (Hunter et al., 2010), feelings are difficult to tease apart from perceptions, which almost certainly played a role in the observed response patterns. For example, if listeners had been required to rate the happiness and sadness expressed by the excerpts during (instead of after) the actual test phase, we are confident that a perceptual contrast effect would have emerged, as it has in previous studies of perceived lightness, loudness, or temperature.

Our findings are also consistent with Schubert's (1996) proposal that the intensity of the emotional response predicts the degree of pleasure and, consequently, the magnitude of the positive appraisal. Schubert's theory further suggests that music deemed sad is enjoyed because the link between negative emotions and displeasure is de-activated in esthetic contexts. Huron (2006) expanded on this suggestion by proposing that the mechanism for increased liking of a contrasting musical stimulus is (slow) cortical inhibition of (fast and automatic) subcortical responses. In the end, the cognitive appraisal inevitably concludes that nothing bad has occurred, and that one is simply listening to sad-sounding music. The results of our experiments contribute to a longstanding paradox that has intrigued esthetic philosophers as well as psychologists – why listeners often enjoy sad-sounding music (Robinson, 1994; Davies, 2003; Schellenberg et al., 2008; Garrido and Schubert, 2010, 2011; Hunter et al., 2011; Van den Tol and Edwards, 2011; Ladinig and Schellenberg, 2012; Vuoskoski and Eerola, 2012; Vuoskoski et al., 2012).

Although listeners tested in the laboratory generally prefer happy- over sad-sounding music (Thompson et al., 2001; Husain et al., 2002; Gosselin et al., 2005; Hunter et al., 2008, 2010), this preference can be eliminated when the listeners are fatigued (Schellenberg et al., 2008) or in a sad mood (Hunter et al., 2011). In other words, negative psychological states can motivate listening to sad-sounding music (Van den Tol and Edwards, 2011). Liking sad-sounding music is also correlated with individual differences in personality – positively with openness-to-experience, empathy, and absorption, but negatively with extraversion (Garrido and Schubert, 2011; Ladinig and Schellenberg, 2012; Vuoskoski et al., 2012). The present findings highlight another contextual factor associated with increased appreciation of sad-sounding music: repeated exposure to happy-sounding music. Our results also provide a cultural-level explanation for choosing to listen to sad-sounding music, or at least to sad-sounding classical music. Because the majority of such music sounds relatively happy (i.e., fast tempo and major mode; Post and Huron, 2009), listeners may enjoy sad-sounding music simply because of its relative rarity – and hence contrast – in a culture in which happy-sounding music is more prevalent.

Our findings raise additional questions that could be addressed in future research. For example, on each trial of the present experiments, listeners attended closely to the music because they were required to provide ratings of how much they liked each excerpt and the intensity of their emotional response. Although such focused listening is common in some contexts (e.g., while attending a concert), the majority of day-to-day listening involves music heard incidentally while listeners are performing some other task (Sloboda et al., 2001). Moreover, 32 presentations of incidental music lead to progressively higher liking ratings (Szpunar et al., 2004; Schellenberg et al., 2008), which raises the possibility that the contrast effects observed here would not extend to incidental listening. In principle, repetition of different excerpts expressing the same emotion could lead to *higher* liking ratings.

Another potential avenue for future research would be to substitute self-reports of emotion-intensity with measures of physiological changes in arousal (e.g., skin conductance or heart rate), which would provide objective indicators of the intensity of the listener's emotional response. Stimulus selection is also bound to play a role in the contrast effects we observed. The present studies made use of excerpts from classical piano music, a style of music unlikely to be favored by Canadian undergraduates. It remains unknown whether the contrast effect would be stronger or weaker with more familiar and/or well-liked styles of music. In one study, a preference for classical music was associated with more intense emotional responding to such music (Kreutz et al., 2008). The limits of the role of the intensity of the listener's emotional response could also be tested. An intense but negative emotional response (e.g., aversion evoked by misogynistic hip-hop lyrics or extremely dissonant music) is unlikely to be accompanied by increases in liking. Finally, interaction effects with mood are likely to be evident. Sad-sounding music evokes sad moods (Hunter et al., 2008, 2010; Vuoskoski and Eerola, 2012), and listeners in negative moods show increased liking for sad-sounding music (Schellenberg et al., 2008; Hunter et al., 2011; Van den Tol and Edwards, 2011). Thus, in some contexts, one might observe *increased* liking for a sad-sounding musical piece after listening to other sad-sounding pieces.

In summary, our results reveal that when listeners attend closely to different pieces of music, they progressively habituate to music that maintains the same emotional character. Hence, they show greater appreciation for music that conveys a contrasting emotion. Such contrast effects appear to occur because repeatedly conveying the same emotion dulls the listener's emotional response, whereas conveying a contrasting emotion intensifies the response. Music composers are likely to be aware of this contrast effect, either implicitly or explicitly, by using contrasting musical characteristics (e.g., tempo, mode, and dynamics) to increase the intensity of listeners' emotional response and their liking of different sections of a particular composition, or of successive compositions on an album. Moreover, similar contrast effects are likely to be evident in other art forms, such as dance, theater, and visual art. Our results highlight the importance of emotional responding in hedonic evaluations and raise new questions about the role of contrasts in esthetic appreciation.

#### **ACKNOWLEDGMENTS**

This research was supported by a grant awarded to E. Glenn Schellenberg from the Social Sciences and Humanities Research Council of Canada. J. Charles Millar and Patrick Redegeld assisted in testing participants and data entry.

#### **REFERENCES**

emotions in music. *Music. Sci.* 12, 75–99.

Gosselin, N., Peretz, I., Noulhiane, M., Hasboun, D., Beckett, C., Baulac, M., et al. (2005). Impaired recognition of scary music following unilateral temporal lobe excision. *Brain* 128, 628–640.

mood, and spatial abilities. *Music Percept.* 20, 149–169.

when feeling sad. *Psychol. Music.* doi:10.1177/0305735611430433. [Advance online publication].


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 August 2012; paper pending published: 08 November 2012; accepted: 05 December 2012; published online: 24 December 2012.*

*Citation: Schellenberg EG, Corrigall KA, Ladinig O and Huron D (2012) Changing the tune: listeners like music that expresses a contrasting emotion. Front. Psychology 3:574. doi: 10.3389/fpsyg.2012.00574*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2012 Schellenberg, Corrigall, Ladinig and Huron. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

### Effects of voice on emotional arousal

#### *Psyche Loui<sup>1,2</sup>\*, Justin P. Bachorik<sup>1</sup>, H. Charles Li<sup>1</sup> and Gottfried Schlaug<sup>1</sup>*

*<sup>1</sup> Department of Neurology, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, USA <sup>2</sup> Department of Psychology, Wesleyan University, Middletown, CT, USA*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Mireille Besson, Centre National de la Recherche Scientifique, France; E. Glenn Schellenberg, University of Toronto, Canada*

#### *\*Correspondence:*

*Psyche Loui, Department of Psychology, Wesleyan University, Judd Hall 104, 207 High Street, Middletown, 06459 CT, USA e-mail: ploui@wesleyan.edu*

Music is a powerful medium capable of eliciting a broad range of emotions. Although the relationship between language and music is well documented, relatively little is known about the effects of lyrics and the voice on the emotional processing of music and on listeners' preferences. In the present study, we investigated the effects of vocals in music on participants' perceived valence and arousal in songs. Participants (*N* = 50) made valence and arousal ratings for familiar songs that were presented with and without the voice. We observed robust effects of vocal content on perceived arousal. Furthermore, we found that the effect of the voice on enhancing arousal ratings is independent of familiarity of the song and differs across genders and age: females were more influenced by vocals than males; furthermore these gender effects were enhanced among older adults. Results highlight the effects of gender and aging in emotion perception and are discussed in terms of the social roles of music.

**Keywords: emotion, music, arousal, perception, gender, aging**

#### **INTRODUCTION**

The ability to detect emotion in speech and music is an important task in our daily lives. The power of the human voice to communicate emotion is well documented in verbal speech (Fairbanks and Pronovost, 1938; Scherer, 1995) as well as in non-verbal vocal sounds (Skinner, 1935), and the human voice is thought to convey emotional valence, arousal, and intensity (Laukka et al., 2005) via its modification of spectral and temporal signals (Fairbanks and Pronovost, 1938; Bachorowski and Owren, 1995). The use of the human voice to convey emotion is abundant and vital developmentally as in the case of infant-directed speech (Trainor et al., 2000), and can be accurately identified by people of different cultures (Bryant and Barrett, 2008), suggesting that emotion communication may be a universal function of the human voice. Furthermore, the inability to detect emotional signals in voices is associated with psychopathy (Bagley et al., 2009), thus highlighting the importance of emotional identification in the auditory modality in everyday human functioning.

Music is another form of sound communication that conveys emotional information. One model of emotion perception in music that has been validated by psychological and physiological studies is a two-dimensional space that treats affect in terms of two separable dimensions, valence and arousal (Russell, 1980). This valence-arousal model is well validated with musical stimuli (Balkwill and Thompson, 1999; Bigand et al., 2005; Ilie and Thompson, 2006; Steinbeis et al., 2006; Grewe et al., 2007). Studies investigating why and how music is able to influence its listeners' moods and emotions (Sloboda, 1991; Terwogt and van Grinsven, 1991; Balkwill and Thompson, 1999; Panksepp and Bernatzky, 2002; Gosselin et al., 2007) have identified ratings for musical stimuli that drive changes in each of these two factors independently. Arousal is a measure of perceived energy level, ranging from low (calming) to high (exciting) (Krumhansl, 1997; Gosselin et al., 2007; Sammler et al., 2007). Orthogonally, valence is the polarity of perceived emotions, and ranges from negative (sad) to positive (happy) (Krumhansl, 1997; Schubert, 1999; Dalla Bella et al., 2001). Multidimensional scaling (MDS) studies have verified that valence and arousal are separable measures that may be independently manipulated in experimental conditions (Bigand et al., 2005; Vines et al., 2005).

Given that music and the voice may both be strong modulators of emotions, vocal music could be a medium with emotional power. Several studies have investigated the cognition and perception of vocal lyrics in songs. Serafine et al. (1982) studied the effect of lyrics on participants' memory for songs. Results showed that melody recognition was near chance unless the melody's original words (i.e., words that were presented with the music during encoding) were present, suggesting that music and speech were combined into a single coherent object when encoded in the same stream. More recently, Weiss et al. (2012) examined the effect of timbre (including voice) on memory and preference for music. Results showed that melodies presented in the voice were better recognized than melodies presented in any instrumental timbre. The authors suggest that the biological significance of the human voice provides a greater depth of processing and enhanced memory.

Few studies have investigated the combination of music and speech in emotion perception. In an investigation of the effects of varying stimulus parameters in music and speech on perceived emotion, Ilie and Thompson (2006) showed that emotional ratings for music and speech concurred in most emotion ratings, except that manipulations of pitch height resulted in different directions of valence change for music and speech. Interaction effects between music and speech were again observed, suggesting that the combination of speech with music may result in complex and non-additive effects on emotion.

As music and speech are both auditory stimuli that vary over time, a fundamental question regarding emotion perception of these auditory sources concerns the time-course of emotional responses. Approaches that have been used to investigate the time-course of emotion perception in music include online responses made during the presentation of music, and offline responses made after hearing musical excerpts. Using both offline techniques of categorization and MDS (Perrot and Gjerdingen, 1999; Bigand et al., 2005), subjective emotional ratings performed after hearing short musical stimuli showed that a musical segment as short as 250 ms in duration is sufficient to elicit a reliable emotional response. However, these emotional ratings were influenced by the *post-hoc* cognitive appraisal of emotional content within music after their presentation, as well as the emotional experience elicited by music during its presentation. Using continuous emotional ratings in the two-dimensional space of valence and arousal maximizes the influence of emotion perceived online during the presentation of musical stimuli (Schubert, 2004). In previous work using the two-dimensional continuous paradigm (Bachorik et al., 2009), participants took an average of 8.3 s to initiate movement signifying an emotional judgment.

The present study adopts both continuous (online) and discrete (offline) subjective ratings to investigate effects of vocals on perception of arousal in music. In addition to exploring the effects of vocals on arousal in music in a temporally sensitive manner, further questions arise concerning the factors that moderate participants' emotional response to the presence of vocals in songs. As previous studies have shown that age and gender may contribute to personality characteristics, which in turn influence musical preference (Rentfrow and Gosling, 2003), we examined the interaction of arousal ratings with age and gender, while controlling for effects of familiarity on arousal ratings. Subjects were presented with excerpts from two versions of well-known songs, one with vocals and one without (with all other variables in the songs being the same), and made continuous as well as discrete ratings of perceived arousal, as well as familiarity ratings, for each version of each song.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Fifty participants (25 females and 25 males) were recruited from the greater Boston metropolitan area via advertisements in daily newspapers. Participants ranged from 19 to 83 years of age (median = 37), and were representative of the Boston metropolitan area in their ethnic distribution. All participants reported having no neurological and/or psychiatric disorders and had normal IQ as assessed by Shipley abstract scale scores (Shipley, 1940). Written informed consent, approved by the Institutional Review Board of the Beth Israel Deaconess Medical Center, was obtained from all participants. Each participant was reimbursed at an hourly rate for participating.

#### **STIMULI**

#### **Table 1 | Excerpts of song stimuli.**

The stimuli consisted of 32 unique musical excerpts, each 60 s long. Vocal and instrumental versions of 16 songs were chosen from commercially available songs (see **Table 1** for a list of all songs used). All excerpts were normalized for loudness and each excerpt was briefly faded in (0.5 s) at the beginning of the stimulus and out (0.5 s) at the end. The stimuli were divided into two blocks of 16 trials each; each block consisted of both versions (vocal/instrumental and instrumental only) of 8 songs. Excerpts ranged in tempo between 49 and 177 beats per minute.
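The 0.5 s fade-in and fade-out applied to each excerpt can be illustrated with a minimal sketch. The text does not state the shape of the fade, so a linear ramp is assumed here; the function name `apply_fades` and its parameters are hypothetical and not taken from the study's software.

```python
def apply_fades(samples, sample_rate, fade_s=0.5):
    """Return a copy of `samples` with a fade-in and a fade-out applied.

    Assumption: the ramp is linear (the article does not specify a shape).
    `samples` is a list of floats, `sample_rate` is in samples per second.
    """
    n_fade = int(round(fade_s * sample_rate))
    n = len(samples)
    if 2 * n_fade > n:
        raise ValueError("excerpt is shorter than the two fades combined")
    out = list(samples)
    for i in range(n_fade):
        gain = i / n_fade        # ramps from 0 up to just under 1
        out[i] *= gain           # fade-in at the start
        out[n - 1 - i] *= gain   # mirrored fade-out at the end
    return out
```

For a 60 s excerpt at 44.1 kHz this would touch only the first and last 22,050 samples and leave the rest unchanged.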

Experiments were conducted using an Apple Powerbook G4 with a 15.4-inch LCD screen using custom-made stimulus presentation software (Sourcetone, LLC). Audio was presented via Altec Lansing AHP-712 headphones, and participants used a mouse and a Flightstick Pro USB joystick to input their responses to the stimuli.

#### **PROCEDURE**

Over the course of two separate testing sessions, each participant completed two trial blocks. Order of trial block presentation was counterbalanced between subjects. Each of the 16 excerpts in each trial block was played in a randomized order, and for each stimulus presentation, the participant's task was the same: to use the joystick to respond, in real time, to the levels of emotional valence (defined as positive or negative emotion induced by the music) and arousal (defined as a stimulating or calming feeling induced by the music) of the music via an onscreen cursor in a two-dimensional grid. The joystick controlled the motion of the cursor in a 640 × 640 resolution grid, and data about the position of the joystick and the position of the cursor was sampled with a frequency of 10 Hz. Centering the joystick caused the cursor to stop moving but did not center the cursor in the grid onscreen.
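The behavior described above, in which centering the joystick stops the cursor without re-centering it, is what one would expect if joystick deflection sets cursor velocity rather than cursor position. The sketch below illustrates that reading. The grid size and 10 Hz sampling rate come from the text; the velocity mapping itself, the `gain` value, the starting position, and all function names are assumptions made for illustration.

```python
GRID = 640     # grid is 640 x 640 (from the text)
RATE_HZ = 10   # positions sampled at 10 Hz (from the text)


def step_cursor(pos, stick, gain=200.0):
    """Advance the cursor by one sample.

    pos   -- (x, y) cursor position in grid units
    stick -- (dx, dy) joystick deflection, each in [-1, 1]
    gain  -- grid units per second at full deflection (hypothetical value)
    """
    dt = 1.0 / RATE_HZ
    limit = float(GRID - 1)
    x = min(max(pos[0] + stick[0] * gain * dt, 0.0), limit)
    y = min(max(pos[1] + stick[1] * gain * dt, 0.0), limit)
    return (x, y)


def record_trace(stick_samples, start=(320.0, 320.0)):
    """Integrate a sequence of joystick samples into a cursor trace."""
    trace = []
    pos = start
    for stick in stick_samples:
        pos = step_cursor(pos, stick)
        trace.append(pos)
    return trace
```

Under this mapping a centered joystick, `(0.0, 0.0)`, leaves the cursor where it is, so a held rating persists until the participant deflects the stick again.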

After the end of each musical excerpt, subjects had additional tasks to rate the degree of valence and arousal perceived in each excerpt (on a scale of 0–4, where 4 is highest, 2 is neutral, and 0 is lowest). Participants also provided subjective ratings of familiarity (on a scale of 0–4, with 0 being "never heard" and 4 being "actively listen to; personally own song") after rating the degree of emotional arousal and valence.

#### **DATA ANALYSIS**

Continuous ratings for valence and arousal (X and Y axes on the two-dimensional rating space, respectively) were digitized and exported for each trial of each subject from the stimulus presentation program and analyzed using in-house software. Pairwise *t*-tests were conducted for each time point comparing subjects' valence and arousal ratings for vocal and instrumental versions of each song. A false-discovery rate *post-hoc* adjustment was used to minimize Type I error.
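Because one test is run per time point, the number of comparisons per song is large, which is why a false-discovery rate adjustment is needed. The text does not name the specific procedure; the sketch below assumes the widely used Benjamini-Hochberg step-up rule and takes the per-time-point *p*-values as given. The function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag which tests survive FDR control at level `alpha`.

    Assumption: Benjamini-Hochberg step-up procedure (the article only
    says "false-discovery rate adjustment").
    Returns a list of booleans in the same order as `p_values`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    # Every test ranked at or below the cutoff is declared significant.
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

A song would then count as showing an effect if any element of the returned list is `True`, which matches the "at least one time point" criterion used in the results.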

Discrete valence and arousal ratings were used as the dependent variable in a mixed design ANOVA with between-subject factors of age (two levels: old vs. young, with a median split at the age of 37) and gender (male vs. female) and the within-subject factor of song vocals (instrumental vs. vocals). Paired *t*-tests were run comparing music with and without vocals in familiarity, liking, chills, and intense emotional responses.

#### **RESULTS**

Continuous arousal ratings revealed that the vocal versions were more arousing overall. The average continuous ratings were higher in the vocal version than in the instrumental version in 15 out of 16 songs. This was confirmed using a pairwise *t*-test at every point in the time series comparing arousal ratings in the vocal and instrumental conditions, which indicated a significant difference at the FDR-corrected alpha level of 0.05 for at least one time point in 12 out of 16 songs. Among these 12 songs, 11 showed a significant arousal-enhancing effect of vocals, whereas only one song showed the opposite effect. In contrast to arousal ratings, continuous valence ratings only showed significantly higher valence ratings at the *p* < 0.05 (corrected) level for at least one point in 4 out of 16 songs, and significantly lower valence ratings for at least one point in two songs.

**Figure 1** shows the difference in average arousal rating between vocal and instrumental versions as a function of time for each of the 16 songs. Red line segments indicate a higher arousal rating in vocal versions compared to instrumental versions whereas blue line segments indicate the opposite effect. Bold lines indicate significant differences at the *p* < 0.05 (FDR-corrected) level and gray bars behind the graph indicate instrumental interludes within the vocal versions of each song.

Online ratings indicated that, as shown in **Figure 1**, the arousal-enhancing effect of vocals was more pronounced later within each piece. The trend toward higher arousal ratings in the vocal versions began at an average of 10 s after the onset of each song, although this varied by song (SEM = 2.6 s). The presence of instrumental interludes within each song was uncorrelated with the difference in arousal ratings. Songs that contained non-verbal vocal portions (Whitney Houston, Barbra Streisand, and Mr. Mister songs in the sample) showed a similar effect size to songs containing verbal vocals, suggesting that the presence of the human voice, rather than recognizable words, led to the increase in arousal.

The effect of vocals on arousal was confirmed in discrete as well as continuous arousal ratings. Using the discrete arousal rating as the dependent variable, the mean arousal rating for instrumental versions of the musical excerpts was 2.25 (SEM = 0.07) whereas the mean arousal rating for vocal versions was 2.60 (SEM = 0.06). A highly significant main effect of vocals on arousal was observed, *F*(1, 96) = 1389.5, *p* < 0.001, indicating that songs with vocals were rated as more highly arousing than their instrument-only counterparts (**Figure 2**). Participants also reported liking the vocal versions more than the instrumental versions, with a mean of 2.75 vs. 2.48, respectively [*t*(49) = −3.486, *p* < 0.001]. The same effect was not observed in discrete valence ratings [*F*(1, 96) = 1.17, n.s.].

Using discrete arousal ratings as the dependent variable, we next attempted to tease apart the groups of participants who were or were not susceptible to the effects of vocals on arousal by assessing the demographics (gender and age) of each participant and comparing the mean difference between vocal and instrumental versions across demographic groups. A significant main effect of gender was observed for all arousal ratings, with ratings by females being higher [*F*(1, 96) = 4.186, *p* = 0.04]. Furthermore, a significant interaction between vocals and gender was observed on arousal ratings [*F*(1, 96) = 11.9, *p* = 0.001], confirming that the positive effect of vocals on arousal ratings was stronger for females than for males (**Figure 2**). Although no significant main effect of age was present [*F*(1, 96) = 0.013, n.s.], a significant three-way interaction among vocals, gender, and age was observed on arousal ratings [*F*(1, 96) = 4.17, *p* = 0.04], with older females being more emotionally influenced by vocals than younger females, but older males being less influenced by vocals than younger males (**Figure 2**).

Familiarity ratings revealed that participants found songs with vocals to be significantly more familiar than the instrumental versions [mean ratings: vocals = 2.63, instrumental = 1.872; *t*(49) = −9.319, *p* < 0.001]. To investigate the effects of vocals on arousal while controlling for the effect of familiarity, a one-way ANCOVA was conducted on the dependent variable of discrete arousal rating with the factor of vocals (instrumentals vs. vocals), with the covariate of familiarity rating (0 through 4). Results showed a significant effect of vocals [*F*(1, 97) = 4.2, *p* = 0.043] even with a significant effect of familiarity [*F*(1, 97) = 6.3, *p* = 0.014], suggesting that the contribution of vocal stimuli to arousal was significant even after controlling for an increase in familiarity for vocal pieces.

#### **DISCUSSION**

Our results indicate that the presence of vocals generally enhances participants' arousal ratings, and that this effect was not limited to the effects of familiarity but was moderated by the gender and age of the participant. Vocal sounds and music engage multiple common resources in the brain, resulting in interactions between music and speech as assessed by tasks that tap into perception, cognition and emotion (Serafine et al., 1982; Besson et al., 1998; Ilie and Thompson, 2006). However, little research has investigated the time-course of the impact that vocals may have on arousal perception in music. Using a naturalistic and ecologically valid setting of popular songs with and without vocal content, the present study attempted to address the specific question concerning the relationship between vocals and perceived arousal in music. While the present study uses ecologically valid stimuli and identifies arousal differences attributable to the use of vocals within music, future research may tease apart the specific components of the vocals (e.g., words, timbre, sung melody) that most affect perceived arousal.

Based on continuous (online) and discrete (offline) subjective ratings of valence and arousal for identical musical excerpts with and without vocal content, we observed that the presence of vocals generally increases ratings of arousal but not of valence. The emotionally enhancing effect of vocals on arousal is shown in both online (continuous) and offline (discrete) ratings of subjective arousal, and is not limited to verbal lyrics but appears to generalize to non-verbal songs containing the human voice. Online ratings revealed that participants required an average of 10 s (SEM = 2.6 s) of music before differentiating vocal versions from instrumental versions; this was congruent with previous reports using a similar continuous ratings paradigm (Bachorik et al., 2009) showing that participants required an average of 8.3 s to initiate emotional ratings when listening in real time. Furthermore, the enhancing effect of vocals is not limited to familiarity, as shown by an ANCOVA revealing that effects of vocals were significant even after statistically controlling for the contribution of familiarity ratings.

It is interesting to speculate on why valence is less affected by vocals than arousal is. One possibility is that vocals affected valence both positively and negatively depending on the listener and depending on the song, resulting in increased variability. Another possibility is that valence is already largely determined by other structural features of music such as modality (major vs. minor keys) and melodic contour, leaving little room for the added vocals to change the perceived valence of each song. The relative impact of structural features of a piece on its perceived valence vs. arousal may be an avenue for future studies.

As music with vocals has additional components of timbre, melody, and words, the present experiment design could be followed up by assessing the effect of an additional lead instrument on arousal ratings in a non-vocal control condition. However, the selection of the most appropriate additional lead instrument in such a design is non-trivial, as only a highly systematic match in timbre between the voice and the chosen test instrument would provide a true test of the possible confound of voice timbre. Future experiments should seek to identify a timbral match of the voices used in these naturalistic song stimuli in order to define a timbre-matched control condition. Nevertheless, in the current analysis we identify song sections that do not include words as a possible means to de-confound the relationship between voice and lyrics, and as the increase in arousal ratings is observed even for sections of the songs that include non-verbal vocals, the results suggest that the use of vocals, rather than of lyrics within the music, may be driving the increase in arousal.

When offline ratings were compared by the demographic variables of gender and age, results revealed the types of participants who were most sensitive to the arousal-enhancing effect of vocals. Females were more inclined to report perceiving higher arousal in vocal songs compared to males. These effects are exaggerated among older participants. One possible explanation for the gender effect is that the need to detect emotional signals rapidly may be more evolutionarily advantageous for women. Supporting evidence along this possible evolutionary basis of gender-bias in selecting for emotion in vocal content comes from electrophysiological literature showing that the dishabituation of emotional voice content is more robust in females, and is furthermore regulated by estrogen levels (Schirmer et al., 2008). Regarding the three-way interaction between the effect of vocals with gender and with age, one possibility is that the song stimuli (popular songs ranging from the 1960s to the 1990s) chosen for this experiment are more familiar to older individuals than to younger ones. However, the fact that the effect of vocals on arousal was still significant after controlling for the contribution of familiarity suggests that the influence of vocals on arousal was above and beyond the influence of familiarity. Another possible explanation stems from how individuals of different ages identify with music, with possible sociological effects of changing standards of gender equality throughout the decades that may help explain the observed gender by age interaction. As young adults rely on musical preferences to communicate and understand each other's personality profiles (Rentfrow and Gosling, 2006), it would seem that younger individuals, especially females and individuals who rely on external feedback and social pressures for self-perception, may be more easily aroused by music that is representative of their own culture and the personality profile they wish to convey. Since most popular music is written with vocals, it stands to reason that younger listeners looking to identify themselves with popular taste would find music more arousing when presented with vocals. As the emotional content of songs is highly influenced by our identity as captured by demographic variables such as age and gender, future work should seek to refine our understanding of emotion perception in music and language by placing it in broader sociological and biological contexts.

The present results from continuous and discrete ratings, obtained during and after music listening, support the central notion that the combination of vocal and instrumental sounds in music could produce a more pronounced effect on emotional arousal, but not on valence, compared to instrumental music alone. The arousal-enhancing effect of vocals increases over the duration of most songs and is moderated by demographic factors such as age and gender. Results have implications for our understanding of the emotion and meaning of music, and will bear relevance for ongoing efforts to model and predict the emotional content of music (Nagel et al., 2007) for therapeutic as well as commercial applications.

#### **ACKNOWLEDGMENTS**

This work was supported by a research grant from Sourcetone, LLC, given to Beth Israel Deaconess Medical Center to support research on music and emotions. Psyche Loui also acknowledges support from the Grammy Foundation and the Templeton Foundation.

#### **REFERENCES**

converging findings across traditional and cluster analytic approaches to assessing the construct. *J. Abnorm. Psychol.* 118, 388–398. doi: 10.1037/a0015372

emotional responses to music. *Ann. N.Y. Acad. Sci.* 1060, 429–437. doi: 10.1196/annals.1360.036

319–330. doi: 10.1525/mp.2006.23.4.319

236–242. doi: 10.1111/j.1467-9280.2006.01691.x

of emotion in expressive musical performance. *Ann. N.Y. Acad. Sci.* 1060, 462–466. doi: 10.1196/annals.1360.052

Weiss, M. W., Trehub, S. E., and Schellenberg, E. G. (2012). Something in the way she sings: enhanced memory for vocal melodies. *Psychol. Sci.* 23, 1074–1078. doi: 10.1177/0956797612442552

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 25 June 2013; accepted: 07 September 2013; published online: 01 October 2013.*

*Citation: Loui P, Bachorik JP, Li HC and Schlaug G (2013) Effects of voice on emotional arousal. Front. Psychol. 4:675. doi: 10.3389/fpsyg.2013.00675*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Loui, Bachorik, Li and Schlaug. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Predicting musically induced emotions from physiological inputs: linear and neural network models

#### *Frank A. Russo<sup>1,2</sup>\*, Naresh N. Vempala<sup>1</sup> and Gillian M. Sandstrom<sup>3</sup>*

*<sup>1</sup> SMART Lab, Department of Psychology, Ryerson University, Toronto, ON, Canada*

*<sup>2</sup> Communication Team, Toronto Rehabilitation Institute, Toronto, ON, Canada*

*<sup>3</sup> Department of Psychology, University of British Columbia, Vancouver, BC, Canada*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Catherine (Kate) J. Stevens, University of Western Sydney, Australia*

*David Terburg, Universiteit Utrecht, Netherlands*

#### *\*Correspondence:*

*Frank A. Russo, Department of Psychology, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada e-mail: russo@ryerson.ca*

Listening to music often leads to physiological responses. Do these physiological responses contain sufficient information to infer emotion induced in the listener? The current study explores this question by attempting to predict judgments of "felt" emotion from physiological responses alone using linear and neural network models. We measured five channels of peripheral physiology from 20 participants—heart rate (HR), respiration, galvanic skin response, and activity in corrugator supercilii and zygomaticus major facial muscles. Using valence and arousal (VA) dimensions, participants rated their felt emotion after listening to each of 12 classical music excerpts. After extracting features from the five channels, we examined their correlation with VA ratings, and then performed multiple linear regression to see if a linear relationship between the physiological responses could account for the ratings. Although linear models predicted a significant amount of variance in arousal ratings, they were unable to do so with valence ratings. We then used a neural network to provide a non-linear account of the ratings. The network was trained on the mean ratings of eight of the 12 excerpts and tested on the remainder. Performance of the neural network confirms that physiological responses alone can be used to predict musically induced emotion. The non-linear model derived from the neural network was more accurate than linear models derived from multiple linear regression, particularly along the valence dimension. A secondary analysis allowed us to quantify the relative contributions of inputs to the non-linear model. The study represents a novel approach to understanding the complex relationship between physiological responses and musically induced emotion.

**Keywords: physiological responses, neural networks, music cognition, emotion, computational modeling**

#### **INTRODUCTION**

One of the principal motivations for listening to music is the emotional experience it affords. Although some have argued that this experience does not involve the induction of emotion so much as its perception (Meyer, 1956; Konečni, 2008), few would dispute that physiological change can be evoked while listening to music. Different mechanisms are likely responsible for these physiological changes, ranging from brainstem reflexes to the violation of top-down expectancies defined by culture and personal history (Gabrielsson, 2002; Juslin and Västfjäll, 2008). These physiological changes can be assessed non-invasively through continuous measurement of heart rate (HR), respiration, skin conductivity, facial muscle activity, and other peripheral measures. Because different types of felt emotion have been associated with different patterns of physiological change (Krumhansl, 1997; Nyklicek et al., 1997; Rainville et al., 2006; Lundqvist et al., 2009), it is reasonable to investigate the extent to which physiological responses to music can be used in and of themselves to predict felt emotion.

Both discrete and dimensional models of emotion have been used to conceptualize emotional responses to music<sup>1</sup>. Discrete models (e.g., Ekman, 1992, 1999) have the advantage of avoiding assumptions about the manner in which emotions may be related to one another, thus allowing for representation of mixed emotions (e.g., bitter-sweet as a combination of happiness and sadness). Dimensional models (e.g., Hevner, 1935, 1936; Russell, 1980) characterize emotions with respect to an n-dimensional space, thus enabling quantification of the psychological distance between any two emotions as well as characterization of the relationship between a set of emotions (e.g., bored has been conceptualized as a combination of sadness and fatigue).

Research on music and emotion over the last decade has tended to prefer dimensional models. In an effective demonstration of this approach, Bigand et al. (2005) identified a collection of music representing points across the entire surface of the two-dimensional grid constituted by the intersection of valence and arousal (henceforth valence-arousal grid; Russell, 1980; Schubert, 2004). Valence refers to the hedonic dimension of emotion, ranging from pleasant to unpleasant. Physiological correlates of the valence dimension in musically induced emotion include zygomaticus major and corrugator supercilii activity (e.g., Witvliet and Vrana, 2007; Lundqvist et al., 2009). Arousal refers to the mobilization of energy, ranging from calm to excited. Physiological correlates of the arousal dimension include autonomic measures such as HR, respiration, and galvanic skin response (e.g., Iwanaga et al., 1996; Krumhansl, 1997; Baumgartner et al., 2005; Etzel et al., 2006; Sandstrom and Russo, 2010). Three-dimensional models of emotion have also been applied to music (Ilie and Thompson, 2006) but the advantages of including a third dimension (e.g., tension-arousal) are unclear at this stage of research (Eerola and Vuoskoski, 2011).

<sup>1</sup>One further class of models that has elements of both discrete and dimensional approaches is the domain-specific model developed specifically for music by Zentner et al. (2008).

Past research on musically induced emotion and physiological responses has almost exclusively been limited to linear models. In contrast, similar research on subjective feelings in other contexts (e.g., video games) has begun to use non-linear computational models (Mandryk and Atkins, 2007; Fairclough, 2009). Non-linear computational models such as those generated by artificial neural networks have great potential for adding to the understanding of music and emotion as they allow for prediction of felt emotion without the artificiality that is introduced by requiring a listener to consciously reflect on their emotional experience. However, there are only a few studies of musically induced emotions that anticipate the computational approach taken here.

In a pioneering study, Kim and André (2008) trained an automatic musical emotion recognition system based on physiological data that was collected from three listeners. Their measures included HR, respiration, skin conductance (SCL), and electromyography of the trapezius muscle. They used an extended linear discriminant analysis to classify the emotion that listeners experienced as falling into one of the four quadrants of the valence-arousal grid. Although the model achieved a reasonable level of recognition accuracy (70%), the small number of listeners that were used to train the model greatly limits its generalizability. In addition, because the model was designed to classify a musical excerpt into one of four categories (the quadrants of the valence-arousal grid), it was unable to predict subtle differences within the same quadrant or to account for variation along a particular dimension of emotion (e.g., valence).

Coutinho and Cangelosi (2009) used a neural network approach to predict continuous variation along the valence and arousal dimensions of musically induced emotion. The continuity of measurement represents an important departure from Kim and André (2008). Input to the neural network model involved combinations of low-level psychoacoustic features (timbre, mean pitch, pitch variation, and dynamics). The model was trained on three excerpts and tested on an additional three excerpts. The model was effective in predicting moment-to-moment changes in felt emotion.

In a subsequent study by Coutinho and Cangelosi (2011), psychoacoustic as well as physiological features were incorporated into neural network models for predicting musically induced emotions. Psychoacoustic features included loudness, pitch level, pitch contour, tempo, texture, and sharpness. Physiological features included HR and SCL. Results showed that the physiological features were able to provide only a slight increase in explained variance beyond that accounted for by the psychoacoustic features alone. The addition of other physiological features such as those considered here may have helped to further increase explanatory power. However, as acknowledged by the authors, the models derived from psychoacoustic features were already quite powerful and the variable lag in different channels of physiological response complicates continuous prediction.

In the current study, five channels of physiological data were obtained while participants listened to music excerpts selected to represent each quadrant of the valence-arousal grid: high arousal, positive valence (*Happy*), high arousal, negative valence (*Agitated*), low arousal, negative valence (*Sad*), and low arousal, positive valence (*Peaceful*). All excerpts were drawn from the classical era so as to minimize variability in responses due to genre. Listeners provided global ratings of felt emotion (taking into account the entire excerpt). Linear regression and neural network models were developed using only physiological features as input and subjective appraisals of felt emotion as output. One promise of this particular approach that emphasizes physiological inputs is that it may inform the development of future models that are capable of predicting the appraisals of a particular listener, or a particular type of listener listening to a particular genre of music.
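The four quadrant labels used above follow directly from the signs of the two dimensions. As a small illustration, the mapping can be written as a function; it assumes ratings are expressed relative to a neutral midpoint and that a rating exactly at the midpoint falls on the low or negative side, neither of which is specified in the text. The function name is hypothetical.

```python
def quadrant(valence, arousal, neutral=0.0):
    """Map a (valence, arousal) pair to its quadrant label.

    Assumptions: both ratings share the same neutral midpoint, and a
    rating equal to `neutral` is treated as low arousal / negative valence.
    """
    high_arousal = arousal > neutral
    positive_valence = valence > neutral
    if high_arousal and positive_valence:
        return "Happy"
    if high_arousal:
        return "Agitated"
    if positive_valence:
        return "Peaceful"
    return "Sad"
```

A four-way label of this kind discards distance from the midpoint, which is the limitation of quadrant classification that motivates predicting continuous valence and arousal values instead.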

#### **METHODS**

#### **PARTICIPANTS**

We recruited 32 undergraduate students through our departmental participant pool. Twelve of the participants had some proportion of missing physiological data in one or more of the channels due to measurement error. The most common error was that our recordings of facial muscle activity were interrupted for a portion of the trial due to electrodes losing surface contact (mainly due to an accumulation of perspiration toward the end of the session). Our analyses only considered data from those 20 participants providing a complete data set (17 females, 1 male, 2 undeclared). On average these participants were 25 years of age (*SD* = 9.2) with 1.7 years of individual music training (*SD* = 2.9) and 2 years of group training (*SD* = 2.8).

#### **STIMULI AND APPARATUS**

Our stimuli consisted of 12 classical music excerpts (M1–M12) from 12 different composers, as shown in **Table 1**. Three excerpts were chosen to represent each of the four emotion quadrants of the valence-arousal grid: high arousal, positive valence (*Happy*); high arousal, negative valence (*Agitated*); low arousal, negative valence (*Sad*); and low arousal, positive valence (*Peaceful*). We used an excerpt of white noise, equated with the root-mean-square (RMS) level collapsed across the music tracks, as our baseline stimulus. A unique baseline was computed for each participant and trial. RMS-matched white noise provides a situational context that should be comparable to that of the music excerpts while remaining emotionally neutral, thus allowing us to isolate effects on physiology due to emotion (Nyklicek et al., 1997; Sokhadze, 2007). These excerpts were chosen based on previous work investigating emotional responses to music (Nyklicek et al., 1997; Bigand et al., 2005). All excerpts were 40 s in duration, normalized to a set RMS value, and presented at ∼75 dB SPL over Sennheiser HD 580 Precision headphones.

#### **Table 1 | Twelve music excerpts with composers, emotion quadrants, and mean valence/arousal ratings.**

Participants were tested in a double-walled sound attenuation chamber (Industrial Acoustics Company). Five simultaneous channels of physiological data were sampled at 1000 Hz using a Biopac MP100 data acquisition system (Biopac Systems, Santa Barbara, CA) under the control of a Mac mini computer running AcqKnowledge software (Biopac Systems), version 3.9.2 for Mac. Measurement details for each channel are provided below.

#### *Skin conductance (SCL)*

Isotonic conductant gel was applied to two TSD203 Ag-AgCl electrodes. The electrodes were attached to the distal phalanges of the index and ring fingers of the non-dominant hand using Velcro straps, and connected to the GSR100C amplifier to measure SCL.

#### *Heart rate (HR)*

One TSD200 photoplethysmogram transducer was attached by a Velcro strap to the distal phalanx of the middle finger of the non-dominant hand. This transducer was connected to the PPG100C amplifier to measure capillary expansion through an infrared sensor, and thus indirectly measure the HR.

#### *Respiration rate (Resp)*

One TSD201 respiration belt was comfortably tightened around the upper part of the abdomen and attached to the RSP100C amplifier to record changes in thoracic or abdominal circumference.

#### *Facial muscle activity (Zyg and Corr)*

Shielded 4 mm silver-silver chloride (Ag/AgCl) miniature surface electrodes (Biopac, EL 208 S) were filled with electrode gel. Two of the electrodes were placed on the zygomaticus major and two on corrugator supercilii muscle regions, both on the left of the face separated by a distance of 25 mm and attached over the ear to the EMG100C amplifier to measure muscle activity.

#### **PROCEDURE**

Participants heard all 12 music excerpts in one session. Each music excerpt was preceded by 30 s of white noise, and followed by 50 s of silence. The 12 music excerpts were arranged in four different random orders. Each participant was randomly assigned to one of the four orders.

Immediately after hearing each music excerpt, participants reported the valence and arousal of the felt emotion using the Self-Assessment Manikin (Bradley and Lang, 1994). This procedure incorporates pictures to clarify Likert-type ratings from 1 to 9 on valence (least pleasant/most pleasant) and arousal dimensions (least excited/most excited). In addition to valence and arousal, participants provided a score on a scale from 1 to 4 regarding their familiarity with the excerpt, where 1 corresponds to "I've never heard this song before," 2 corresponds to "I think I might have heard this song once or twice before," 3 corresponds to "I am somewhat familiar with this song," and 4 corresponds to "I am very familiar with this song." The mean familiarity ratings were generally quite low; all excerpts had a mean familiarity rating lower than 2.5, and the mean excerpt familiarity rating was 1.78 (*SD* = 0.30).

#### **DATA PREPARATION AND PRELIMINARY ANALYSES**

In order to test for effects of presentation order and music training (number of years), a preliminary analysis of covariance was run on each dimension of felt emotion. For each analysis, the within-subjects factor was music excerpt and the between-subjects factor was presentation order; music training was entered as the covariate. These analyses confirmed that the effects of presentation order and music training were non-significant, *F*s < 1, while the effects of music excerpt were significant, *F*s(11, 165) = 6.03 and 22.86, *p* < 0.001.

As seen in **Figure 1** and reported in **Table 1**, the mean valence and arousal ratings for excerpts were well-distributed across the valence-arousal grid, and they aligned in the expected manner according to the four quadrants (happy, agitated, sad, peaceful). The mean valence ratings ranged from 3.35 (*SD* = 1.84) for M2 (Shostakovitch) to 6.8 (*SD* = 1.94) for M6 (Strauss). The mean arousal ratings ranged from 2.55 (*SD* = 1.54) for M10 (Chopin) to 7.5 (*SD* = 1.19) for M6 (Strauss). The inter-subject variability was comparable between valence and arousal ratings (Mean *SD* = 1.82 and 1.80, respectively).

Signal processing of physiological data involved the application of high-pass (HP) and/or low-pass (LP) filters, and where applicable, rate conversion using a peak detection algorithm with minima and maxima: SCL (no filters), Resp (no filters; Min/Max = 5/180), HR (*LP* = 3 Hz; *HP* = 0.5 Hz; Min/Max = 40/180), EMG (*HP* = 1 Hz; *LP* = 500 Hz). The data from each channel were standardized independently for each participant (converted to z-scores). A single feature value was then determined for each excerpt by subtracting mean values obtained in the final 20 s of white noise (baseline) from the mean of standardized values obtained in the final 30 s of each excerpt (i.e., the first 10 s of baseline and music were excluded to avoid capturing a startle effect). Filtering, standardization, and baseline subtraction were completed in FeatureFinder (Andrews et al., 2011), a freely available Matlab toolbox for custom analysis of physiological signals.
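The standardization and baseline subtraction described above can be illustrated with a minimal Python sketch. It is not the FeatureFinder implementation: the function name `excerpt_feature`, its parameters, and the choice to standardize over the whole recording passed in are assumptions made for illustration, and filtering and rate conversion are omitted.

```python
import numpy as np

FS = 1000  # sampling rate in Hz, as reported for the acquisition system


def excerpt_feature(signal, noise_start_s, music_start_s,
                    noise_len_s=30, music_len_s=40, skip_s=10, fs=FS):
    """Baseline-corrected mean z-score for one trial of one channel.

    `signal` is a 1-D array holding one participant's recording for one
    channel (hypothetical layout; onsets are given in seconds).
    """
    # Standardize within participant and channel (z-scores).
    z = (signal - signal.mean()) / signal.std()

    def window_mean(start_s, length_s):
        # Skip the first `skip_s` seconds to avoid a startle response.
        a = int((start_s + skip_s) * fs)
        b = int((start_s + length_s) * fs)
        return z[a:b].mean()

    baseline = window_mean(noise_start_s, noise_len_s)  # final 20 s of noise
    music = window_mean(music_start_s, music_len_s)     # final 30 s of music
    return music - baseline
```

A positive value indicates that the channel was, on average, more active during the music than during the preceding white noise.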

#### **LINEAR CORRELATION AND MULTIPLE LINEAR REGRESSION**

As a first step toward capturing the patterns by which these five physiological features accounted for valence and arousal ratings, we checked to see if there was a correlation between the physiological features and the mean valence-arousal (VA) ratings for the 12 music excerpts. Valence ratings were not significantly correlated with any of the physiological variables, *p*'s > 0.1. Arousal ratings were correlated significantly with HR, *r*(10) = 0.88, *p* < 0.001, and marginally with Resp, *r*(10) = 0.53, *p* = 0.08, but not with Zyg, Corr, or SCL, *p*'s > 0.1.
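As a rough consistency check on the correlations above, an *r* value can be converted to a *t* statistic with *df* = *n* − 2 = 10. This is the standard conversion for a Pearson correlation rather than a computation reported in the study, and the values are rounded.

$$
t = \frac{r\sqrt{df}}{\sqrt{1 - r^{2}}}, \qquad
t_{\mathrm{HR}} = \frac{0.88\sqrt{10}}{\sqrt{1 - 0.88^{2}}} \approx 5.86, \qquad
t_{\mathrm{Resp}} = \frac{0.53\sqrt{10}}{\sqrt{1 - 0.53^{2}}} \approx 1.98
$$

With 10 degrees of freedom the two-tailed critical values are approximately 2.23 (α = 0.05) and 4.59 (α = 0.001), so the HR correlation clears the stricter threshold while the Resp correlation falls just short of the conventional one, in line with the reported *p* values.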

As a next step, we performed multiple linear regression with stepwise forward entry to determine whether there was a linear relationship between some combination of the physiological features and the VA ratings. The caveat here is that the models need to be interpreted with caution given that the ratio of sample size (number of excerpts) to predictors (physiological features) is smaller than accepted norms (Harrell, 2001). For valence, no significant model emerged. The best linear regression model for arousal included HR (*p* < 0.01) and Resp (*p* = 0.07), accounting for 85.2% of the variance, *F*(2, 11) = 25.8, *p* < 0.001. These results suggest that while a linear combination of the physiological features may account for arousal, no linear combination adequately accounts for valence.

#### **ARTIFICIAL NEURAL NETWORKS**

One way of exploring non-linear combinations of physiological features is through the use of artificial neural networks. Although artificial neural networks have been applied extensively for classification and detection tasks in domains such as object and speech recognition, they have been relatively underutilized in music cognition (see however, Bharucha, 1987; Stevens and Latimer, 1992; Krumhansl et al., 2000; Vempala and Maida, 2011). In the current study, we applied neural networks as a non-linear regression function to predict valence and arousal ratings using physiological features as inputs. Our implementation was a supervised feedforward neural network with backpropagation, also known as a multilayer perceptron (Rumelhart et al., 1986; Bishop, 1996; Haykin, 2008).

First, we defined the inputs and outputs for the network. From the 12 music excerpts, we arbitrarily chose two out of three from each quadrant for our training set: M1/M2 for *agitated*; M4/M5 for *happy*; M7/M8 for *peaceful*; and M10/M11 for *sad*. The test set consisted of the remaining four excerpts: M3 for *agitated*; M6 for *happy*; M9 for *peaceful*; and M12 for *sad*. The network's task was to predict the valence and arousal ratings based on the five physiological features. The training set consisted of eight input and output vectors. Each input vector had five values, one for each physiological feature, collapsed across participants. The corresponding output vector had arousal and valence values, again collapsed across participants. To maximize network learning (within and across channels), all of the physiological inputs were scaled to a value between 0 and 1 (Bishop, 1996). To avoid overfitting the network, we kept the number of hidden units equal to the number of input units. Thus, the network architecture consisted of five input units (one for each physiological feature), a single hidden layer with five units, and two output units as shown in **Figure 2**.

Next, we implemented the network in Matlab. The following procedure was used to train the network.


values were multiplied with the connection weights *Woh*, summed at each output unit, and passed through a sigmoidal function to arrive at the final output value.


We repeated this training procedure for 20 trials (i.e., 20 instances of fully trained networks). For each trial we re-initialized the network connection weights, repeated the training procedure on the same set of eight excerpts and tested the network on the remaining four.
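The architecture and the repeated-trial procedure can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' Matlab code: the learning rate, number of epochs, weight initialization range, sigmoidal hidden units, the mean-squared-error objective, and the rescaling of ratings to the 0–1 range are all assumptions, and the arrays below are random placeholders rather than the study data.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_mlp(X, Y, n_hidden=5, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer perceptron with batch backpropagation."""
    rng = np.random.default_rng(seed)
    # Fresh random weights for every trial (assumed range).
    W_hi = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))
    b_h = np.zeros(n_hidden)
    W_oh = rng.uniform(-0.5, 0.5, (Y.shape[1], n_hidden))
    b_o = np.zeros(Y.shape[1])
    for _ in range(epochs):
        H = sigmoid(X @ W_hi.T + b_h)        # hidden activations
        O = sigmoid(H @ W_oh.T + b_o)        # sigmoidal outputs
        d_o = (O - Y) * O * (1 - O)          # output-layer error signal
        d_h = (d_o @ W_oh) * H * (1 - H)     # error propagated to hidden layer
        W_oh -= lr * d_o.T @ H / len(X)
        b_o -= lr * d_o.mean(axis=0)
        W_hi -= lr * d_h.T @ X / len(X)
        b_h -= lr * d_h.mean(axis=0)

    def predict(X_new):
        return sigmoid(sigmoid(X_new @ W_hi.T + b_h) @ W_oh.T + b_o)

    return predict


# Placeholder data: 8 training and 4 test excerpts, 5 features scaled to
# [0, 1], and 2 outputs (valence, arousal) also scaled to [0, 1].
rng = np.random.default_rng(42)
X_train, X_test = rng.random((8, 5)), rng.random((4, 5))
Y_train = rng.random((8, 2))

# 20 trials: re-initialize, retrain on the same 8 excerpts, test on the other 4.
preds = np.stack([train_mlp(X_train, Y_train, seed=s)(X_test)
                  for s in range(20)])
mean_pred = preds.mean(axis=0)     # average prediction across the 20 networks
ratings = 1 + 8 * mean_pred        # map [0, 1] outputs back to the 1-9 scale
```

Averaging over independently initialized networks reduces the influence of any single random starting point on the reported predictions.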

**Figure 3** reports the average network performance for the four test excerpts in comparison with participant ratings. The network performed particularly well for M3 (Stravinsky) and M9 (Schumann). Predicted values for M6 (Strauss) were very close to the expected value on the arousal dimension and 1.6 scale units off on the valence dimension. M12 (Mozart) yielded the worst overall network performance, with an error of 1 scale unit on valence and 2 scale units on arousal.

To quantify the network's performance, we calculated the Euclidean distance between mean network-predicted outputs and mean participant ratings for valence and arousal. **Table 2** shows the network's performance for each selection and average performance across all four selections for valence and arousal. The network's mean performance error for valence was 0.82 (on a scale from 1 to 9), indicating that the network accuracy for valence was 89.75%. The network accuracy for arousal was 88.92%.
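The reported accuracy for valence is consistent with expressing the mean error as a fraction of the 8-unit span of the 1–9 rating scale and subtracting it from 1. The sketch below shows that arithmetic; the function name `accuracy_percent` is hypothetical, and the formula is inferred from the reported numbers rather than stated explicitly in the text.

```python
def accuracy_percent(mean_error, scale_min=1.0, scale_max=9.0):
    """Convert a mean error in scale units to a percentage accuracy.

    Assumes accuracy = 1 - error / (scale range), which reproduces the
    reported valence figure.
    """
    return 100.0 * (1.0 - mean_error / (scale_max - scale_min))


print(round(accuracy_percent(0.82), 2))  # 89.75
```

Under the same assumption, the reported arousal accuracy of 88.92% would correspond to a mean error of roughly 0.89 scale units.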

Having quantified the network's performance, we sought to determine whether the neural network approach yielded an improvement in emotion prediction over multiple linear regression. In order to derive comparable models, we computed regression models using stepwise forward entry based on data from the eight training excerpts (note that the regression models reported above had used all 12 excerpts). Given the small number of cases, it is not surprising that a significant model did not emerge. Nonetheless, to allow performance comparisons we computed the Euclidean distance between predicted outputs of the revised regression models and the mean participant ratings. As observed in **Table 2**, performance was extremely poor for the linear model of valence, with accuracy of 8.24%. Performance was somewhat better for the linear model of arousal, with accuracy of 67.36%. **Table 3** tells a similar story about the relative performance of the two approaches but from the perspective of RMSE and correlation between model outputs and mean valence/arousal ratings (*df* = 10). Collectively, these performance results confirm that a linear model of the five physiological features is inferior to a nonlinear model derived by an artificial neural network, particularly for the valence dimension.

**for the test set and corresponding mean neural network outputs.** *NN* indicates neural network output.

**Table 2 | Performance of neural network and linear regression models for each of the 4 test excerpts.**

Our next goal was to understand the importance of each physiological feature in terms of its contribution to the non-linear solution. To determine the relative contributions of each feature, we used a method derived by Milne (1995) that was designed for neural networks like ours with a single hidden layer. Milne's method is an improvement over a method first proposed by Garson (1991) that does not determine relative size of contributions in networks that include a combination of positive and negative connection weights. Another method proposed by Wong et al. (1995) allows a determination of relative size but the sign of the contribution is lost. In contrast, Milne's method allows for the determination of relative size and direction of each contribution.

Using Milne's method, we determined the relative size and direction of each feature's contribution for each of the 20 trials (see **Figure 4**). The relative size data were then subjected to separate analyses of variance for valence and arousal with physiological feature as the repeated measure. There was a significant effect of physiological feature on relative size of contributions for valence, *F*(4, 76) = 198.7, *p* < 0.001, and for arousal, *F*(4, 76) = 23.1, *p* < 0.001.<sup>2</sup>
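A signed weight-partitioning measure of this kind can be sketched in Python as follows. This is one reading of a Milne-style contribution measure for a single-hidden-layer network, not a verified transcription of the formula in Milne (1995): the exact normalization, the omission of bias terms, and the function name `signed_contributions` are assumptions. Each input's weight into a hidden unit is divided by the summed absolute input weights of that unit, multiplied by the hidden-to-output weight, summed over hidden units, and normalized so that absolute contributions sum to 1 for each output.

```python
import numpy as np


def signed_contributions(w_hi, w_oh):
    """Signed relative contribution of each input to each output.

    w_hi: array (n_hidden, n_input), input-to-hidden weights.
    w_oh: array (n_output, n_hidden), hidden-to-output weights.
    Returns an array (n_output, n_input); each row sums to 1 in absolute value.
    """
    # Signed share of each input within each hidden unit.
    share = w_hi / np.abs(w_hi).sum(axis=1, keepdims=True)
    # Route the shares through the hidden-to-output weights.
    raw = w_oh @ share
    # Normalize per output so magnitudes are comparable across features.
    return raw / np.abs(raw).sum(axis=1, keepdims=True)


# Example with random placeholder weights for a 5-5-2 network.
rng = np.random.default_rng(0)
contrib = signed_contributions(rng.normal(size=(5, 5)), rng.normal(size=(2, 5)))
null_threshold = 1 / 5  # equal contributions from five features
above_null = np.abs(contrib) > null_threshold
```

The sign of each entry gives the direction of the contribution, and its magnitude can be compared against the equal-contribution value for five features.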

#### **DISCUSSION**

The neural network models that we developed on the basis of eight training excerpts were highly accurate in their prediction of valence and arousal ratings for four test excerpts (89.75 and 88.92%, respectively). The predictive power of these non-linear computational models was better than linear models that we implemented, particularly for the valence dimension. On the basis of the current study, it seems that valence cannot be adequately predicted using linear regression of features derived from physiology measures alone, but valence can be predicted using non-linear functions such as those found in neural networks. Although the network architecture prevents us from fully dissecting its nonlinear function, we were able to assess the relative size and the direction of each physiological feature's contribution.

In order to test the relative size of contributions, we established a threshold corresponding to expected performance given the null hypothesis (i.e., equal contributions from each physiological feature; henceforth the null threshold). For valence, the relative size of contributions of SCL, Resp, and Corr were above the null threshold. Consistent with findings from prior studies, valence was negatively related to SCL and Corr (Krumhansl, 1997; Baumgartner et al., 2005) and positively related to Resp (Etzel et al., 2006; Witvliet and Vrana, 2007). For arousal, the relative size of contributions of HR and Resp were above the null threshold; the direction of both contributions was positive and thus consistent with prior research (Iwanaga et al., 1996; Krumhansl, 1997; Etzel et al., 2006; Sandstrom and Russo, 2010) as well as the results of our linear regression. Although the null threshold described above is somewhat arbitrary, we consider it non-trivial that the directionality of super-threshold contributions to felt emotion revealed in the neural networks is anticipated by prior research that employed linear modeling methods.

Our results suggest that estimates of felt emotion can be derived from neural network models that take input solely from peripheral physiological measures. While this might be considered a satisfactory outcome from a computational standpoint, it is important to ask what impact this might have for emotion science. In our view, the potential impact is greatest in the development of theory that seeks to explain the emotional trajectory of longer excerpts of music. If we have a fully validated model that does a good job of predicting subjective appraisals for an individual or a particular type of listener, then we avoid the problem of artificiality that is introduced by requiring the listener to consciously reflect on their emotional experience. Instead, we can ask the listener to experience the music as they would outside of the context of a laboratory, using the model to provide the output that the subjective appraisals are intended to provide. The output could then be explicated on the basis of acoustical, psycho-acoustical or musical factors abstracted from the music.

<sup>2</sup>In order to account for the fact that the networks from the 20 trials were not independent (although the weights were randomized prior to training, the outputs and type of neural network were the same on every trial), we used hierarchical linear modeling, using the HLM for Windows software (Version 6.08; Raudenbush et al., 2004). HLM allows the slopes of the relationship between channel and proportion contribution to be different for each network trial. We created four dummy codes at level 1 (within-network trial) to represent the different channels, using Zyg as the reference group. There were no level 2 (between-network trial) variables. Analyses revealed that the proportion contributions of Corr, SCL, Resp, and HR were all greater than the proportion contribution of Zyg for both valence (*p*'s < 0.03) and arousal (*p*'s < 0.02). These results confirm that channels differ in their proportion contributions. The variance component for the intercept was not significant for either valence or arousal (*p*'s < 0.5), indicating that the proportion contribution intercepts did not vary significantly across network trials.

There are several limitations to acknowledge in this study. First, the neural network model was trained on the basis of only eight music excerpts. Although these excerpts were selected so as to span the entire valence-arousal grid, the small number of excerpts greatly limits generalizability of the findings even for excerpts of the same genre. Related to this first point is the potential problem of overfitting that may have occurred because there are more connections than training excerpts. It is quite possible that our neural network would be less accurate in the face of new excerpts from the same genre that differ in their emotional tenor. The small number of excerpts also prevents us from making statistical inference on the predictive power of linear and neural network models. Second, our excerpts were homogeneous with regard to genre of music—they were all instrumental classical music, albeit from different stylistic traditions (e.g., Bartok vs. Beethoven). Third, the inputs and outputs to the model were derived from a group of listeners that were treated as members of a homogeneous population. The inputs and outputs to the neural network models were based on aggregate data (collapsing across participants). We assume that a new randomly selected sample would yield similar aggregate data. However, participants varied with respect to felt emotion and their physiological responses. In all likelihood, this variability was influenced by their music preferences (Rentfrow and Gosling, 2003; Salimpoor et al., 2009) and the extent to which they are absorbed by music (Sandstrom and Russo, 2013). Future work should test the neural network model trained in the current study on an independent group of participants. In addition, it will be important to develop new models on larger participant samples and larger collections of music. One direction will be to train a domain-general model that is capable of performing well with any type of listener or genre of music. 
Another, potentially more important, direction will be to train domain-specific models that are tailored to particular types of listeners and genres. The former should be robust across contexts but mediocre in its predictive power. The latter will have increased power so long as it is tested in contexts that are consistent with training.

Another important limitation of this study is that there was no representation of time in the models (cf. Coutinho and Cangelosi, 2011). The experience of emotion in music often follows a trajectory (e.g., tension-release), in which the emotional response to a section of music will depend in part on the emotional state of the observer in the preceding section (Dibben, 2004). One means of incorporating time is through a simple recurrent network (e.g., an Elman network), which uses context from the previous time-step as additional input for the current time-step (Elman, 1993). Although previous studies have used Elman networks to predict variability in subjective reports of felt arousal and valence (Coutinho and Cangelosi, 2009, 2011; Vempala and Russo, 2012), physiological features have only led to limited explanatory gains over psychoacoustic features. One reason for this shortcoming may be the variable time course of physiological features (e.g., facial responses tend to be faster than changes in galvanic skin response). One potential solution that sidesteps the problem is to use time-steps that are long enough to accommodate variability in the time course of physiological features (e.g., no shorter than 5 s).

A final limitation of this study is that we have no way of determining whether the predictive features derived from physiological measures were the cause or effect of subjective appraisals. While we have treated the features as inputs and the appraisals as outputs, we are not suggesting that the physiological responses necessarily give rise to the appraisals. It is also possible that the relations are bidirectional in some manner, contributing collectively to the overall experience of emotion (Gross and Barrett, 2011).

#### **CONCLUSION**

Our results demonstrate that computational methods may be used to predict musically induced emotion on the basis of physiological features alone. Neural networks led to stronger predictions than linear modeling approaches, particularly along the valence dimension. The results of this study contribute to our understanding of the powerful emotional experience that leads so many people to listen to music.

#### **ACKNOWLEDGMENTS**

This research was supported by a Mitacs Elevate postdoctoral fellowship to Naresh N. Vempala co-sponsored by Mitacs and WaveDNA, Inc. and an NSERC Discovery grant awarded to Frank A. Russo. We thank Christopher Lachine for assistance with data collection, Roger Dean for critical feedback on quantification of network performance, and Paolo Ammirante for comments on the manuscript.

#### **REFERENCES**

*Int. J. Psychophysiol.* 60, 34–43.

of emotional responses to music: the effect of musical expertise and of the duration of the excerpts. *Cogn. Emot.* 19, 1113–1139. doi: 10.1080/02699930500204250


the self-assessment manikin and the semantic differential. *J. Behav. Ther. Exp. Psychiatry* 25, 49–59. doi: 10.1016/0005-7916(94)90063-9

Coutinho, E., and Cangelosi, A. (2009). The use of spatio-temporal connectionist models in psychological studies of musical emotions. *Music Percept.* 29, 359–375.


a simulation of cohort theory. *Cogn. Syst. Res.* 12, 66–78. doi: 10.1016/j.cogsys.2010.07.003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 02 March 2013; accepted: 05 July 2013; published online: 08 August 2013.*

*Citation: Russo FA, Vempala NN and Sandstrom GM (2013) Predicting musically induced emotions from physiological inputs: linear and neural network models. Front. Psychol. 4:468. doi: 10.3389/fpsyg.2013.00468*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Russo, Vempala and Sandstrom. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Play it again, Sam: brain correlates of emotional music recognition

#### *Eckart Altenmüller<sup>1</sup>\*, Susann Siggel<sup>1</sup>, Bahram Mohammadi<sup>2,3</sup>, Amir Samii<sup>3</sup> and Thomas F. Münte<sup>2</sup>*

*<sup>1</sup> Institute of Music Physiology and Musicians' Medicine, University of Music, Drama and Media, Hannover, Germany*

*<sup>2</sup> Department of Neurology, University of Lübeck, Lübeck, Germany*

*<sup>3</sup> CNS Laboratory, International Neuroscience Institute, Hannover, Germany*

#### *Edited by:*

*Daniel J. Levitin, McGill University, Canada*

#### *Reviewed by:*

*Stefan Koelsch, Freie Universität Berlin, Germany; Psyche Loui, Wesleyan University, USA*

#### *\*Correspondence:*

*Eckart Altenmüller, Institute of Music Physiology and Musicians' Medicine, University of Music, Drama and Media, Emmichplatz 1, 30175 Hannover, Germany e-mail: eckart.altenmueller@hmtm-hannover.de*

**Background:** Music can elicit strong emotions and can be remembered in connection with these emotions even decades later. Yet, the brain correlates of episodic memory for highly emotional music compared with less emotional music have not been examined. We therefore used fMRI to investigate brain structures activated by emotional processing of short excerpts of film music successfully retrieved from episodic long-term memory.

**Methods:** Eighteen non-musician volunteers were exposed to 60 structurally similar pieces of film music of 10 s length with high arousal ratings and either less positive or very positive valence ratings. Two similar sets of 30 pieces were created. Each of these was presented to half of the participants during the encoding session outside of the scanner, while all stimuli were used during the second recognition session inside the MRI-scanner. During fMRI each stimulation period (10 s) was followed by a 20 s resting period during which participants pressed either the "old" or the "new" button to indicate whether they had heard the piece before.

**Results:** Musical stimuli vs. silence activated the bilateral superior temporal gyrus, right insula, right middle frontal gyrus, bilateral medial frontal gyrus and the left anterior cerebellum. Old pieces led to activation in the left medial dorsal thalamus and left midbrain compared to new pieces. For recognized vs. not recognized old pieces a focused activation in the right inferior frontal gyrus and the left cerebellum was found. Positive pieces activated the left medial frontal gyrus, the left precuneus, the right superior frontal gyrus, the left posterior cingulate, the bilateral middle temporal gyrus, and the left thalamus compared to less positive pieces.

**Conclusion:** Specific brain networks related to memory retrieval and emotional processing of symphonic film music were identified. The results imply that the valence of a music piece is important for memory performance and is recognized very fast.

**Keywords: musical memory, episodic memory, emotions, brain-processing**

#### **BACKGROUND**

Many people value music because of the emotional richness it adds to their lives (Panksepp, 1995). Music has the potential to elicit strong emotional responses, which frequently are perceived as highly pleasurable and linked to chill-sensations (for a review see Altenmüller et al., 2013). According to brain-imaging studies, such emotional arousal is linked to activation of the central nervous reward circuits and dopaminergic mechanisms, which in turn can influence cognitive abilities and memory formation (Salimpoor et al., 2011; Altenmüller and Schlaug, 2013). It therefore is not astonishing that music is often remembered and recognized for extended periods of time and linked to strong biographical memories. In the field of music psychology this phenomenon is frequently termed the "Play-it-again-Sam-Effect," alluding to the famous movie Casablanca (Gaver and Mandler, 1987). Here, a specific song triggers strong biographical memories dating back more than a decade linked to emotions of sadness, nostalgia and remorse.

There are only a few studies investigating brain mechanisms of musical long-term memory. At present, it is still under debate whether there is a specific memory store for music (e.g., Peretz, 1996; Ayotte et al., 2002; Peretz and Coltheart, 2003), or whether musical memories are represented in multiple stores depending on learning biography and context (Margulis et al., 2009). Recognition of familiar tunes engages the bilateral superior temporal regions and left inferior temporal and frontal areas (Ayotte et al., 2002; Plailly et al., 2007). Platel et al. (2003) have differentiated episodic and semantic musical memory. They observed frontal lobe activations in both semantic and episodic musical memory tasks. Specifically, comparison of the semantic and control tasks revealed predominantly left hemispheric activation, involving the inferior frontal regions and angular gyrus in addition to bilateral medial frontal activation. In contrast, comparison of episodic and control tasks revealed predominantly right-sided activation of bilateral middle frontal regions and precuneus. Comparison of the familiar episodic and control tasks revealed activation of the right precuneus and frontal gyrus only, while comparison of the unfamiliar episodic and control tasks showed activation of the superior and middle frontal gyri and medial frontal cortex bilaterally. Thus, both familiar and unfamiliar melody recognition during the episodic task elicited frontal lobe activation, which was either right lateralized or bilateral, respectively. Interestingly, another study from the same group used fMRI to contrast genuine musical memory with musical-semantic memory (retrieving the titles of musical excerpts) and found a dissociation: genuine musical memory was mainly related to an increase in BOLD response in the superior temporal lobe, whereas musical-semantic memory was more closely bound to activation in the middle and lower temporal gyrus (Groussard et al., 2010).
The situation is different when music is linked to strong autobiographical memories. Using an elegant paradigm, Janata (2009) assessed the salience of autobiographical memories linked to musical excerpts and found a clear dorsal medial prefrontal cortex activation co-varying with the degree of salience of the memories. Three other fMRI studies have examined the neural correlates of unfamiliar music recognition. Watanabe et al. (2008) found that successful retrieval of unfamiliar musical phrases was associated with activity in the right hippocampus, the left inferior frontal gyrus, bilateral lateral temporal regions as well as the left precuneus. Plailly et al. (2007) found that unfamiliar music elicited activation of the right superior frontal gyrus and superior middle gyrus, in addition to the left central and superior precentral sulci and left parietal operculum. Finally, and in contrast to the two above-mentioned studies, Klostermann et al. (2009) made a very interesting observation when presenting very short (1.8–2 s) musical clips and measuring fMRI on retrieval. They found a pronounced unilateral right posterior parietal activation related to successful retrieval of the musical clips. Furthermore, the right middle frontal gyrus contributed. Taken together, the data on retrieval of musical memories are contradictory.

With respect to the emotional aspects of musical appreciation, again, only a few studies have addressed this question. Very positive emotions measured as chill-intensity elicited by familiar music lead to an increase of blood flow in the left ventral striatum, the dorsomedial midbrain, the right orbitofrontal cortex, the bilateral insula, paralimbic regions, the anterior cingulate cortex, as well as the thalamus, and the bilateral cerebellum. A decrease in blood flow was found for the right amygdala, the left hippocampus, the precuneus and the ventromedial PFC (Blood and Zatorre, 2001). In another, more recent study by the same group, the neurochemical specificity of [(11)C]raclopride positron emission tomography scanning was used to assess dopamine release on the basis of the competition between endogenous dopamine and [(11)C]raclopride for binding to dopamine D2 receptors (Salimpoor et al., 2011). They combined dopamine-release measurements with psychophysiological measures of autonomic nervous system activity during listening to intensely pleasurable music and found endogenous dopamine release in the striatum at peak emotional arousal during music listening. To examine the time course of dopamine release, the authors used functional magnetic resonance imaging with the same stimuli and listeners, and found a functional dissociation: the caudate was more involved during the anticipation and the nucleus accumbens was more involved during the experience of peak emotional responses to music. These results indicate that intense pleasure in response to music can lead to dopamine release in the striatal system. Notably, the anticipation of an abstract reward can result in dopamine release in an anatomical pathway distinct from that associated with the peak pleasure itself. Such results may well help to explain why music is of such high value across all human societies.

Even if individuals do not have intense "chill experiences," music can evoke activity changes in the amygdala, the ventral striatum and the hippocampus. When subjects were exposed to pleasing music, functional and effective connectivity analyses showed that listening strongly modulated activity in a very similar network of mesolimbic structures involved in reward processing, including the dopaminergic nucleus accumbens, the ventral tegmental area, the hypothalamus and insula (Menon and Levitin, 2005; for a review see Koelsch, 2010). Koelsch et al. (2006) compared brain responses to joyful instrumental tunes to those evoked by electronically manipulated, permanently dissonant counterparts of these tunes. During the presentation of pleasant music, increases in brain activation were observed in the ventral striatum and the anterior insula. Dissonant music, by contrast, elicited increased brain activity in the amygdala, hippocampus, and parahippocampal gyrus, regions linked to the processing of negative affect and fear. When musically untrained subjects listened to unfamiliar music that they enjoyed (Brown et al., 2004), bilateral activations in limbic and paralimbic regions were found. These were stronger in the left hemisphere, which is consistent with hypotheses about positive emotions being more strongly processed on the left (Altenmüller et al., 2002).

Memory and emotions partly share the same limbic structures, and there are strong reciprocal interactions between parahippocampal and frontal regions. The right parahippocampal gyrus is not only involved in learning and memory, but also in emotional processing of unpleasant emotions in music (Blood et al., 1999). Studies using emotional words, pictures, and events, have yielded evidence that arousal plays an important role in memory consolidation and retrieval of emotional stimuli independent of valence (Kensinger and Corkin, 2003; Phelps, 2004). Arousal facilitates focusing and directing attention to a stimulus which then is elaborated more deeply (Lane and Nadel, 2000). Interestingly, studies investigating the influence of valence on recognition and recall found a better recognition performance for either negative or positive valence compared to neutral valence independently from arousal (Anderson et al., 2003; Kensinger, 2004; Kuchinke et al., 2006). These studies pointed to a prefrontal and orbitofrontal cortex-hippocampal network to be involved in (especially positive) valence processing (Erk et al., 2005; Kuchinke et al., 2006).

In previous experiments, we found a significant valence effect when examining the retrieval of emotional vs. non-emotional music from long-term memory (Eschrich et al., 2008). Musical excerpts of symphonic film music with very positive valence attribution were better recognized than less positive pieces. Surprisingly, we could not demonstrate an arousal effect, since there was only a non-significant trend for a better recognition with increasing arousal.

To identify brain structures involved in encoding and retrieval of emotional music we conducted the present brain imaging study. We hypothesized that pieces of positive valence would be remembered better and that retrieval of these pieces would lead to activation in left prefrontal, orbitofrontal, and cingulate cortex.

To ensure strong emotional responses, highly arousing musical excerpts from symphonic film music with valence ranging from less positive (neutral) to very positive were selected.

#### **METHODS**

The local ethics committee (Medical University Hannover) approved the experimental protocol, and the experiment was conducted according to the guidelines of the Declaration of Helsinki.

#### **PARTICIPANTS**

A group of 18 non-musicians (9 women, 2 left-handers according to Oldfield, mean age = 28.7 years, range = 22–49, *SD* = 8.7) gave informed consent to participate in the study for a small monetary compensation of 20 Euro. They were undergraduate and graduate students of the University of Hanover with normal hearing abilities or singers in a non-musician choir. All but three participants had learned to play an instrument or sing in a choir for at least 1–2 years to more than 10 years. Three participants had received only 1 year of musical training in primary school, learning recorder playing as foreseen in some German school curricula. The mean duration of musical training was 8.4 years (range = 1–15, *SD* = 5.7). Eight of the participants were still actively engaged in music making. All participants appreciated listening to music and said that music was important in their lives, in particular because of its emotional effects. Participants listened to music several times a day for altogether 1–5 h, with an estimated 80% deliberately chosen. They reported liking very different types of music, from folk music to classics. Most of them listened to music while doing housework or while eating, and they used music to stimulate themselves or to relax. The listening situation in the laboratory was very different from their usual listening habits, which might be one reason why most of them indicated that they had weaker emotional reactions to the music in the study than they usually have.

#### **STIMULI**

Sixty excerpts of 10 s length of little known symphonic film music (mostly from so-called Hollywood B-movies) were selected from a larger pool of 160 excerpts on the basis of valence and arousal rating results. We included stimuli of an earlier study (Eschrich et al., 2008) and added new excerpts, which were selected in a further rating study performed with 37 participants (14 men, mean age = 28.6, range = 19–49 years). This selection process identified 60 structurally similar pieces with identical high arousal ratings but varying valence ratings. The loudness of all musical excerpts was normalized. We calculated the power spectrum for each piece per channel (Hanning window: sample rate = 44100 Hz; FFT size = 16384; maximum frequency resolution = 2.692 Hz) resulting in the relative amplitude for every frequency per channel. After this normalization procedure, the amplitude peaks per frequency band did not show differences between positive and less positive valence pieces (Mann-Whitney U-Test).
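The spectral parameters quoted above are linked by simple arithmetic: the spacing between FFT bins is the sample rate divided by the FFT size. The following is a minimal sketch of that relationship, not the authors' analysis code; the function names are hypothetical, and the symmetric form of the Hann (Hanning) window is an assumption, since the text does not say which variant was used.

```python
import math


def frequency_resolution(sample_rate_hz: float, fft_size: int) -> float:
    """Spacing between adjacent FFT bins in Hz."""
    return sample_rate_hz / fft_size


def hann_window(n: int) -> list[float]:
    """Symmetric Hann window of length n (assumed variant, see lead-in)."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


# Values taken from the text: 44100 Hz sample rate, FFT size 16384.
resolution = frequency_resolution(44100, 16384)
n_bins = 16384 // 2 + 1  # non-redundant bins for a real-valued signal
print(f"{resolution:.3f} Hz per bin, {n_bins} bins up to {44100 / 2:.0f} Hz")
```

With these inputs the bin spacing comes out at about 2.692 Hz, which agrees with the frequency resolution stated in the text.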

Two sets of 30 pieces were created with a comparable distribution of emotional and structural features. Each of these was presented to half of the participants during the first session outside of the scanner (encoding phase), while all stimuli were used during the second session (recognition phase) which took place inside the MRI-scanner.

After the experiment, the two sets of items were compared according to the participants' ratings of arousal and valence as well as recognition performance. The two item sets did not differ significantly with respect to any of these variables (Mann-Whitney *U*-test for arousal, *p* = 0.9; valence, *p* = 0.17). Recognition performance of the participants did not differ between item sets (*d*′, *p* = 0.44).

#### **QUESTIONNAIRES**

We used self-developed questionnaires based on bipolar five-point rating scales. After each piece of music, arousal, valence, and emotional intensity had to be rated on a five-point rating scale (arousal: 1 = very relaxing/calming to 5 = very arousing; valence: 1 = less positive to 5 = very positive; for emotions felt and emotions perceived separately). We used "less positive" (in German: wenig positiv) instead of "negative," because in a pretest none of the music pieces received a "negative" rating. In a mood questionnaire, participants were asked to rate their present state of arousal and valence at the beginning of each session. At the end of the first session, participants filled out a questionnaire regarding demographic data, their musical knowledge, expertise, listening attitudes as well as music preferences and experience (expertise questionnaire).

#### **PROCEDURE**

During the first experimental session, participants sat in a comfortable chair with a computer keyboard on their knees, and listened to the stimuli via closed headphones (Beyerdynamic DT 770 PRO) and a USB sound card (Audiophile, M-Audio). Questions and answer options appeared on the computer screen. Answers were logged by keyboard presses. In both sessions, prior to the music rating, participants filled out a short mood questionnaire. After this, participants received written and oral instructions for the experiments. Prior to the experiment proper, three practice excerpts were given. During each trial, participants listened to a 10 s excerpt of a musical work. Subsequently, participants pressed a button to start the valence, arousal, and liking rating questions on the screen. Responses were not timed. After the last question there was a break of 10 s before the next excerpt started. Excerpts were presented in randomized order in two blocks of 15 pieces, which were separated by short breaks. The experiment was run using "Presentation."

During the encoding phase, participants were unaware of the subsequent recognition task in the second session. At the end of the first session participants filled out the expertise questionnaire. In the second session, on the next day, participants lay in the scanner and listened to the 30 old stimuli from the last session randomly inter-mixed with 30 new pieces. All participants had to make an old/new decision after each piece by pressing one of two buttons.

#### **DATA ANALYSIS**

Musical excerpts were categorized according to the pre-defined valence categories (less positive and very positive). For each participant, *d*′ was computed for the entire set of stimuli and separately for each valence category. The *d*′ values per category were compared using Friedman tests and Dunn's multiple comparison test as *post-hoc* test. For the analysis of the influence of musical structural features on the "recognizability" of the pieces, a regression tree analysis was used (Cart 6.0, Salford Systems, default adjustments). As dependent variable, *d*′ was calculated per musical piece rather than per participant to serve as a measure of the recognizability of a certain piece of music. For half of the participants a certain piece of music was a target piece (hits), for the other half of participants it was a distractor piece (false alarms). Thus, the recognizability measure was based on empirical data from the experiment. The least squares method was used to find the optimal tree.
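The sensitivity index used here (*d*′, printed as *d* in places) can be illustrated with a short sketch. The text does not spell out the formula, so the standard signal-detection definition, d′ = z(hit rate) − z(false-alarm rate), is assumed; the function name is hypothetical. The example uses the hit and false-alarm ratios that the Results section reports for one participant (0.5 and 0.7).

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    would need a correction that is not covered by this sketch.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Hit ratio at chance (0.5) combined with a high false-alarm ratio (0.7).
print(round(d_prime(0.5, 0.7), 2))
```

A negative value means that distractors were called "old" more often than targets, and a value near zero indicates recognition at chance level. The same function could be applied per piece rather than per participant by pooling hits and false alarms over the listeners for whom the piece was a target or a distractor.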

#### **fMRI PROCEDURE**

A slow event-related design was used for the stimulus presentation. Each stimulation period (10 s) was followed by a 20 s resting period during which participants pressed the answer button (one button for "old," the other for "new").

Stimuli were presented via fMRI compatible electrodynamic headphones integrated into earmuffs for reduction of residual background scanner noise (Baumgart et al., 1998). The sound level of stimuli was individually adjusted to good audibility.

#### **IMAGE ACQUISITION**

Magnetic-resonance images were acquired on a 3T Allegra Siemens Scanner equipped with a standard 8-channel head coil. A total of 650 T2\*-weighted volumes of the whole brain (*TR* = 2000 ms, *TE* = 30 ms, flip angle = 80°, FOV = 224 mm, matrix = 64<sup>2</sup>, 30 slices, slice thickness = 3.5 mm, interslice gap = 0.35 mm, one run of 907 volumes) near to standard bicommissural (ACPC) orientation were collected. After the functional measurement, T1-weighted images (*TR* = 1550 ms, *TE* = 7.3 ms, flip angle = 70°, FOV = 224 mm, and matrix = 256<sup>2</sup>) with slice orientation identical to the functional measurement were acquired to serve as a structural overlay. Additionally, a 3D high resolution T1-weighted volume for cortex surface reconstruction (FLASH, *TR* = 15 ms, *TE* = 4.9 ms, flip angle = 25°, matrix = 1.2 × 256<sup>2</sup>, 1 mm isovoxel) was recorded. The participant's head was fixed during the entire measurement to avoid head movements.
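Two quantities follow directly from the design and acquisition parameters given in the text and can be checked with a few lines of arithmetic. This is only an illustrative sketch with hypothetical function names. It assumes that every one of the 60 scanner trials consists of exactly 10 s of stimulation plus 20 s of rest, and that the field of view and matrix are square.

```python
def volumes_needed(n_trials: int, stimulus_s: float, rest_s: float, tr_s: float) -> float:
    """Number of volumes spanned by n_trials of stimulus plus rest at a given TR."""
    return n_trials * (stimulus_s + rest_s) / tr_s


def in_plane_voxel_mm(fov_mm: float, matrix_size: int) -> float:
    """In-plane voxel edge length for a square field of view and matrix."""
    return fov_mm / matrix_size


# 30 old + 30 new excerpts, 10 s stimulation + 20 s rest, TR = 2000 ms
trial_volumes = volumes_needed(n_trials=60, stimulus_s=10.0, rest_s=20.0, tr_s=2.0)
# FOV = 224 mm sampled on a 64 x 64 matrix
voxel_mm = in_plane_voxel_mm(fov_mm=224.0, matrix_size=64)
print(trial_volumes, voxel_mm)
```

Under these assumptions the trials alone span 900 volumes, which is of the same order as the run of 907 volumes mentioned in the text, and the in-plane voxel size is 3.5 mm, equal to the slice thickness.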

#### **fMRI DATA ANALYSIS**

First, the participant's head motion was detected using Brain Voyager QX software. All datasets were motion- and slice scan time corrected prior to further analysis. Additionally, linear trends and non-linear drifts were removed by temporal filtering using a high-pass filter of 128 s. Finally, after the co-registration with the structural data, a spatial transformation into the standard Talairach space (Talairach and Tournoux, 1988) was performed.

To identify possible regions of activity, group data were analyzed by multi-subject GLM in standard space. To emphasize spatially coherent activation patterns, functional data were additionally spatially smoothed with a Gaussian kernel of 8 mm full width at half maximum.
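A Gaussian smoothing kernel is usually specified by its full width at half maximum (FWHM), whereas many filter implementations expect a standard deviation. The conversion is sigma = FWHM / (2 · sqrt(2 · ln 2)). The sketch below shows this conversion for the 8 mm kernel; it is an illustration and not the Brain Voyager implementation. The function names are hypothetical, and the voxel size passed to the second helper is a free parameter because the text does not state the resampled voxel size in Talairach space.

```python
import math


def fwhm_to_sigma(fwhm: float) -> float:
    """Convert the FWHM of a Gaussian to its standard deviation (same units)."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def sigma_in_voxels(fwhm_mm: float, voxel_mm: float) -> float:
    """Standard deviation in voxel units for a given (assumed) voxel size."""
    return fwhm_to_sigma(fwhm_mm) / voxel_mm


sigma_mm = fwhm_to_sigma(8.0)
print(f"sigma = {sigma_mm:.3f} mm")
```

For the 8 mm kernel this gives a standard deviation of roughly 3.4 mm.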

Five different GLMs were defined. The first compared the stimuli with silence. The second compared less positive and positive pieces. The third contrasted old (target) with new (distractor) pieces. The fourth compared recognized with not recognized targets, while the last compared recognized positive with recognized less positive targets. These GLMs were used to disentangle memory retrieval effects as well as valence effects on recognition: contrasting neutral and positive valence trials (over all pieces and only for recognized targets) yields valence-related activations. The contrasts of old and new stimuli as well as of recognized and not recognized targets reflect memory (retrieval) effects. Statistical maps were created using a threshold of *p* < 0.001. With an FWE correction for multiple comparisons, the statistical results did not reveal significant differences; we therefore report results uncorrected for multiple comparisons. Results were considered significant at *p* < 0.05 (Bonferroni corrected, for the comparison between silence and music) with a cluster threshold of 20 voxels. Instead of a two-way ANOVA, we decided to provide the results of several GLMs, which comprise all contrasts that would have been examined in decomposing such an ANOVA and, in addition, a number of other contrasts. Please note that, because the results of some contrasts did not survive rigorous correction procedures, we decided to provide SPMs at less strict thresholds to allow descriptive data analysis in the sense of Abt (1987).
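The gap between a Bonferroni-corrected threshold and the uncorrected *p* < 0.001 threshold can be made concrete with a small sketch. The Bonferroni correction divides the desired family-wise alpha by the number of tests. The number of voxels used below is purely hypothetical, since the text does not report how many tests entered the correction, and the function name is likewise an assumption.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


n_voxels = 50_000  # hypothetical number of tested voxels, not taken from the study
corrected = bonferroni_threshold(0.05, n_voxels)
uncorrected = 0.001
print(f"corrected: p < {corrected:.1e}, uncorrected: p < {uncorrected}")
print(f"the uncorrected threshold is {uncorrected / corrected:.0f} times more lenient")
```

With any realistic number of voxels the corrected per-voxel threshold is several orders of magnitude stricter than 0.001, which is why weaker effects can appear in uncorrected maps but not survive correction.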

#### **RESULTS**

#### **OVERALL RECOGNITION PERFORMANCE AND VALENCE EFFECT**

The number of correctly recognized targets (*n* = 30) differed among participants from 9 to 22 with a median of 15. The *d*′ values ranged from −0.52 to 1.59 with a mean of 0.25. The participant with the lowest *d*′ had a hit ratio at chance level (0.5) but a very high false alarm ratio (0.7). It was verified that he had not mistaken the assignment of the keys.

No significant effect of valence on recognition performance was found (less positive *d*′ = 0.34; very positive *d*′ = 0.16). A floor effect might have prevented the detection of a valence effect on recognition because of the low overall recognition rate.

Participants' ratings in the first session confirmed that the pieces were perceived as arousing (eight pieces had a median of 2, 52 pieces had a median of 3 or above) and either less positive (26 of 30 pieces were rated as such) or very positive (29 of 30).

The selected music pieces were indeed unfamiliar to the participants, with one participant knowing 3, one participant knowing 2, and two participants knowing 1 piece from prior exposure. No piece was known by more than one participant. Even if a participant had indicated knowing a piece during encoding, he/she did not necessarily recognize this piece in the recognition session. We therefore decided to include all pieces of music in the analysis.

#### **IMAGING DATA**

Comparing silence with music (music > silence) yielded activation in the right and left superior temporal gyrus, the right insula, the right middle frontal gyrus as well as the bilateral medial frontal gyrus and the left anterior cerebellum (**Figure 1** and **Table 1**). These results confirm previous experiments showing the important role of the superior temporal gyrus, the middle frontal gyrus and the insula in hearing in general as well as music perception and detection (for a review see Peretz and Zatorre, 2005). There were no significant activations for silence > music. For the contrast old > new pieces activation in the medial dorsal nucleus of the left thalamus and in the left midbrain was found. Interestingly the reverse contrast (new > old) yielded activation in the right middle frontal gyrus (**Table 1**).

**FIGURE 1 |** […] music presentation. *p* < 0.05 (Bonferroni corrected).

Consistent with the findings for the old vs. new contrast there was a focused activation in the right inferior frontal gyrus and the left cerebellum for recognized vs. not recognized targets (**Figure 2**). For the reverse contrast no activations were seen (**Table 1**).

The contrast positive > less positive pieces yielded predominantly left-lateralized activations, in particular in the left medial frontal gyrus, the left precuneus, the left posterior cingulate, the left thalamus as well as the bilateral middle temporal gyrus, and the right superior frontal gyrus. There was also activation in the posterior cerebellum bilaterally. No activations were discovered for the contrast less positive > positive (**Table 1**).

The contrast recognized positive > recognized less positive yielded activation in the left superior and middle frontal gyrus, the bilateral medial frontal gyrus, the right superior temporal gyrus and the temporal pole, the left posterior cingulate and the left precuneus. Furthermore, activations were observed in the left precentral gyrus, the bilateral thalamus as well as in the bilateral anterior cerebellum and the right posterior cerebellum (**Figure 3**, **Table 1**).

#### **DISCUSSION**

This study addressed the neural basis of emotional musical long-term memory by means of fMRI in a recognition task.

Surprisingly, and in contrast to our previous study (Eschrich et al., 2008), we could not replicate the valence effect. Participants did not show better memory for the musical excerpts that they had rated as emotionally highly positive in the encoding phase. As the overall recognition performance was quite low, this may reflect a floor effect. It was rather difficult to find suitable stimuli for the recognition task that were structurally similar enough to prevent structural features of the music from having a bigger influence on recognition and fMRI activations than the emotional component. Yet, pieces had to differ in emotional effect and be different enough to be recognized. Due to constraints on the time that can be spent in the scanner, music excerpts were rather short (10 s), which could have further contributed to recognition problems. As stimuli varied only on the valence dimension with arousal on a high level for all pieces, it might have been difficult for the participants to differentiate between the pieces and to feel a clear emotional difference. Thus, although we had conducted an extensive rating study, the stimulus selection might not have been optimal. Additionally, scanner noise during the retrieval might have interfered with both recognition and emotion induction. It can be excluded that subjects suffering from an amusic disorder participated in the study, since we included only subjects who reported interest in music and we even assessed daily time of listening to music, which ranged between 0.5 and 5 h.

**Table 1 | Laterality (R, right, L, left), coordinates and** *t***-values for every contrast and active brain region.**

*The table shows only the significant contrasts (Stimuli* > *silence: p* < *0.05, Bonferroni corrected; for the other contrasts: p* < *0.001, not corrected).*

**FIGURE 2 | Contrast of recognized** *>* **not recognized target pieces over all participants.** The red-colored regions represent activation for the recognized target pieces. *p* < 0.001 (not corrected).

**FIGURE 3 | Contrast of recognized positive** *>* **recognized less positive target pieces over all participants.** The red-colored regions represent activation for the recognized positive target pieces. *p* < 0.001 (not corrected).

The low recognition rate might also explain why we only found thalamic and midbrain activity for the comparison of old to new music pieces and only activation in the right inferior frontal gyrus for the contrast of recognized vs. not recognized pieces. Retrieval processes from long-term representations of music tend to engage inferior frontal regions (Zatorre et al., 1996; Halpern and Zatorre, 1999; Zatorre and Halpern, 2005). Also studies in other domains show the importance of these brain regions for memory retrieval in general (e.g., Nyberg et al., 1996). Among the many functions assigned to the inferior frontal gyrus have been working memory (Zatorre et al., 1994; Holcomb et al., 1998), and the perceptual analysis of melodies (Fletcher and Henson, 2001). In particular, dorsolateral and inferior frontal areas are most often recruited when working memory load is high (Zatorre et al., 1994; Griffiths et al., 1999). However, according to other studies activation for musical memory retrieval would have been expected in inferior frontal and temporal regions as well as the superior temporal gyrus (Halpern and Zatorre, 1999; Platel et al., 2003; Rugg et al., 2003; Peretz and Zatorre, 2005). The activation in the left cerebellum might be due to hand motor control as the participants answered by button presses of the right hand (Platel et al., 2003). Our data indicate an involvement of the inferior frontal gyrus in the retrieval from musical long-term memory. However, further experiments examining the brain regions responsible for musical long-term memory are needed.

The hypothesized valence effect was confirmed concerning the left-lateralization of activation for the very positive stimuli and activity in frontal brain regions. The mainly left-sided activation of the frontal and temporal gyrus as well as the cingulate cortex confirms the role of these structures in emotion processing and corroborates earlier studies (Altenmüller et al., 2002; Davidson, 2003). The precuneus has been implicated in memory-related and selective attention processes and does not seem to be specific for emotions (Berthoz, 1994). Bilateral activity of the cerebellum when listening to emotional music has also been found in other studies (e.g., Blood and Zatorre, 2001), although it has to be acknowledged that its role is not well understood (Koelsch, 2010). There was no specific activity for the contrast of less positive > very positive stimuli, which might indicate that participants perceived the less positive pieces as emotionally neutral rather than negative, and thus none of the brain regions typically associated with negative affect were activated.

Concerning the contrast of recognized positive > recognized less positive stimuli, the role of the posterior cingulate gyrus in emotion control (Blood et al., 1999; Koelsch et al., 2005, 2006; Ochsner and Gross, 2005; Masaki et al., 2006) as well as the role of frontal regions in the processing of complex stimuli and their valence (Kensinger and Corkin, 2003; Kensinger, 2004) was mostly confirmed. The cingulate gyrus also seems to be involved in episodic memory processing (Critchley, 2005). The right temporal pole was found to be active in the processing of positively valenced stimuli (e.g., Piefke et al., 2003; Brown et al., 2004; Ethofer et al., 2006; Jatzko et al., 2006). Thus, the respective regions are most probably involved in emotion (positive valence) processing. However, it should also be mentioned that there are data that do not fit this scheme. In the study by Klostermann et al. (2009), right parietal and right middle frontal areas were related to memory retrieval and degree of pleasantness. Possibly this somewhat isolated result can be ascribed to the different nature of the stimuli, which were extremely short and were explicitly composed to contain novel sounds and timbres.

Surprisingly, and in contrast to many studies on brain correlates of emotional processing of music (Koelsch, 2010), no activation in the orbitofrontal and prefrontal cortex was found. We can only speculate about the reasons: The emotion variation in the different categories might not have been salient enough. Also, the short duration of the music pieces (10 s) and the presentation in the scanner might have precluded emotion induction.

As the right superior temporal gyrus and the middle and superior frontal gyrus were active for the contrast "recognized positive > recognized less positive" but not in the "positive > less positive" comparison, these regions seem to be involved in the recognition of emotional music.

This fMRI study was a first exploratory attempt to identify the neural underpinnings of emotional musical memory. Further experiments will be needed to clarify this issue in more detail. One way to make emotional information more salient would be to compare music pieces with low arousal and less positive valence with pieces with high arousal and very positive valence, although in this case it would not be possible to disentangle the influence of arousal and valence on recognition performance and on brain activity. Additionally, a sparse temporal sampling design (cf. Szycik et al., 2008) could be used to avoid the loud scanner noise and make the situation more appropriate for appreciating and recognizing the music.

#### **ACKNOWLEDGMENTS**

Eckart Altenmüller was supported by a grant from the German Research Foundation (Al 269/5-3). Susann Siggel was supported by a Lichtenberg-scholarship of the state of Lower Saxony. Thomas F. Münte is supported by the German Research Foundation (DFG) and a Grant from the German Federal Ministry of Research and Technology (BMBF).

#### **REFERENCES**


neural networks. *Neuroimage* 20, 244–256. doi: 10.1016/S1053-8119(03)00287-8


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 April 2013; accepted: 27 January 2014; published online: 18 February 2014.*

*Citation: Altenmüller E, Siggel S, Mohammadi B, Samii A and Münte TF (2014) Play it again, Sam: brain correlates of emotional music recognition. Front. Psychol. 5:114. doi: 10.3389/fpsyg.2014.00114*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2014 Altenmüller, Siggel, Mohammadi, Samii and Münte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

**REVIEW ARTICLE** published: 17 December 2013 doi: 10.3389/fpsyg.2013.00837

### Emotion felt by the listener and expressed by the music: literature review and theoretical perspectives

#### *Emery Schubert\**

*Empirical Musicology Group, School of the Arts and Media, University of New South Wales, Sydney, NSW, Australia*

#### *Edited by:*

*Daniel J. Levitin, McGill University, Canada*

#### *Reviewed by:*

*Bradley W. Vines, Nielsen, USA Gunter Kreutz, Carl von Ossietzky Universität Oldenburg, Germany*

#### *\*Correspondence:*

*Emery Schubert, Empirical Musicology Group, School of the Arts and Media, University of New South Wales, Street 2460, Sydney, NSW 2052, Australia e-mail: e.schubert@unsw.edu.au*

In his seminal paper, Gabrielsson (2002) distinguishes between emotion felt by the listener, here: "internal locus of emotion" (IL), and the emotion the music is expressing, here: "external locus of emotion" (EL). This paper tabulates 16 comparisons of felt versus expressed emotions in music published in the decade 2003–2012 consisting of 19 studies/experiments and provides some theoretical perspectives. The key findings were that (1) IL rating was frequently rated statistically the same or lower than the corresponding EL rating (e.g., lower felt happiness rating compared to the apparent happiness of the music), and that (2) self-selected and preferred music had a smaller gap across the emotion loci than experimenter-selected and disliked music. These key findings were explained by an "inhibited" emotional contagion mechanism, where the otherwise matching felt emotion may have been attenuated by some other factor such as social context. Matching between EL and IL for loved and self-selected pieces was explained by the activation of "contagion" circuits. Physiological arousal, personality and age, as well as musical features (tempo, mode, putative emotions) also influenced perceived and felt emotion distinctions. A variety of data collection formats were identified, but mostly using rating items. In conclusion, a more systematic use of terminology appears desirable. Two broad categories, namely matched and unmatched, are proposed as being sufficient to capture the relationships between EL and IL, instead of four categories as suggested by Gabrielsson.

**Keywords: expressed and felt emotion in music, emotion locus, contagion, normative dissociation, contrast effect, affect valence, literature review**

The distinction between emotion felt by a listener (internal locus of emotion) and emotion expressed by a piece of music (external locus of emotion) has become a firmly established part of research agenda of music psychologists in the last decade. Since the seminal work of Gabrielsson (2002) we have seen evidence that emotion felt in response to music (e.g., "the music makes me feel happy") is sometimes the same as ("the music is happy") and sometimes different from ("the music is sad") the emotion expressed by the music—so called "perceived emotion." This paper aims to push the debate further by examining the data in the literature published in the decade after Gabrielsson (2002) and explaining why the emotions between the two loci (felt vs. expressed) are sometimes systematically different and sometimes the same. The paper is structured as follows. First, (1) the inclusion criteria and limitations of the review are laid out, (2) some early music psychology research related to emotion locus is presented including an overview of Gabrielsson's paper, followed by (3) a collation of the target literature of this review. Then, (4) the theoretical implications of the literature are discussed, with (5) a proposed reworking of Gabrielsson's locus relationships to accommodate a developed understanding, and to highlight some of the key research questions emerging in the field.

#### **INCLUSION CRITERIA AND LIMITATIONS OF THIS REVIEW**

#### **INCLUSION CRITERIA**

The inclusion criteria for the research tabulated for the review are as follows: (1) studies which made a direct comparison between external locus emotion and internal locus emotion in connection with music listening; (2) studies which use the same response regime for both external locus and internal locus responses; and (3) studies appearing in peer-reviewed journals in the 10-year period of 2003–2012 (that is, the decade since Gabrielsson's publication). Nineteen studies met all three criteria. The reasons for excluding studies that have some connection with emotion locus and music are now explained.

#### **LIMITATION 1: EXPRESSED AND FELT EMOTION DATA COMPATIBILITY**

When discussion of locus is presented in the research literature on emotion in music, it is most frequently an acknowledgment that a locus distinction exists, but that the study limits the investigation to one locus or the other (internal or external), without comparing both. Sometimes the data in each locus are not directly comparable (e.g., a rating of felt emotion, but a categorical, *a priori* label for the emotion expressed by the stimulus), as will be discussed below. Such studies were not included in the tabulated literature of this review.

Thus, studies that could allow comparison of locus by rating of expressed emotion *a priori* (e.g., by a panel of experts, as in numerous mood induction studies), or comparison of physiological measures with one locus or the other (e.g., Krumhansl, 1997; Grewe et al., 2007; Nagel et al., 2007; for a critical review, see Konečni, 2008; Grewe et al., 2009, pp. 263–265) were not included. The bulk of the *a priori* rated expressed locus data are found in the well-established music mood induction literature<sup>1</sup>.

#### **LIMITATION 2: TERMINOLOGY TO DESCRIBE THE FELT/EXPRESSED DISTINCTION**

As discussed below, the terminology to describe locus of emotion is varied, making online keyword searching alone inappropriate for locating articles that fulfill the inclusion criteria<sup>2</sup>. Instead, a number of sources and databases were consulted—Google Scholar, PsycINFO, Scopus, and Web of Knowledge, as well as papers that cited Gabrielsson's (2002) paper.

#### **LIMITATION 3: EMOTION LOCUS RESEARCH IN NON-MUSIC RESEARCH**

Although an interest in comparing locus in other fields of research can be found—for example, in social reception (e.g., Jakobs et al., 2001; Bombari et al., 2013), cross-modal (e.g., Calder et al., 2000), film (e.g., Matsumoto and Kupperbusch, 2001; Wang and Cheong, 2006; Werner et al., 2007), literature (e.g., Oatley, 1995; Green, 2004; Miall, 2011), business (e.g., Pugh, 2001), and facial expression of emotion (e.g., Dimberg et al., 2000; Barthomeuf et al., 2012; Sato et al., 2013) research—it appears that most interest in the direct comparison of internal with external loci of emotion via empirical means is rooted in music perception research, making the transference of findings from other subdisciplines limited at this point in time. However, some relevant issues from non-music studies are mentioned in this review.

#### **LIMITATION 4: PHILOSOPHICAL ISSUES AND MOOD INDUCTION**

The review is limited to empirical data from music psychology research in which data from each emotion locus (felt by listener and expressed by music) are gathered and compared. It should be noted that music psychology has been influenced by ideas about emotion locus that were primarily in the realm of philosophy and aesthetics. It is those scholars who introduced the terms emotivism and cognitivism, often quoted in the music psychology literature (Baumgartner, 1992; Goldman, 1995; Scherer and Zentner, 2001; Rickard, 2004; Schubert, 2007a; Konečni, 2008; Konečni et al., 2008; Roy et al., 2008, 2009; Coutinho and Cangelosi, 2009; Lundqvist et al., 2009; Garrido and Schubert, 2011b; Hoeckner et al., 2011; Panagiotidi and Samartzi, 2012). However, I do not focus on these philosophical writings, not just because they fall outside the "empirical research" gamut, but because they generally have a different focus from that of music psychology: where music psychologists seek to understand the nature and relationship between emotion loci in music, philosophers of aesthetics are frequently concerned with identifying the *value* of music that evokes emotion (the emotivist perspective) vs. expresses emotion (the cognitivist perspective). For example, Scruton (1983) writes, "To describe a piece of music as expressive of melancholy is to give reason for listening to it; to describe it as arousing or evoking melancholy is to give a reason for avoiding it" (p. 49), and Kivy (1989) writes, "one substantial group of listeners who report that sad music makes them sad are simply (and understandably) mistaken" (p. 163).
This *value* perspective has relevance in music psychology, particularly when value is operationalized as a variable such as liking (some empirical research has provided evidence supporting the spirit of the just cited statements—see, e.g., Konečni et al., 2008; Hunter et al., 2010; Vuoskoski and Eerola, 2012); however, the literature review is limited to *understanding the relationships* between emotion loci from the perspective of listener self-reports (rather than judging the value of the music). Outside such circumstances, the more purely philosophical research will be used to inform the reviewed material, rather than be part of it. Hence, a limitation of the present review will be to examine exclusively music-psychology literature (for excellent aesthetician accounts, see Radford, 1989; Kivy, 1990, 1999, 2002; Davies, 1994, 2003; Robinson, 1994).

#### **LIMITATION 5: WITHIN-LOCUS DISTINCTIONS**

On a related matter, this review will not attempt to separate "within emotion locus" distinctions, such as the internal locus distinction between inducing mood vs. feeling an emotion (e.g., Weld, 1912; Diener et al., 1995; Lychner, 1998; Gray et al., 2001; Sloboda and Juslin, 2010). Another within-internal-locus distinction is between "truly," internally felt emotion vs. the emotion one displays in an intrapersonal, social, or work setting. This kind of distinction is covered in non-music research on emotional regulation, which includes protective buffering (Langer et al., 2007; Manne et al., 2007), display rules (Ekman, 1972; Matsumoto, 1990), and emotional dissonance and emotional labor (Bono and Vey, 2005; Mann, 2005; Bakker and Heuven, 2006). Such distinctions within internal locus are not reported here because they have not been cited in empirical music perception investigations that meet the inclusion criteria, and do not at first *seem* to be of relevance because one would imagine that knowing how to behave in front of a piece of music is not relevant in the way that knowing how to behave in front of other people is relevant. That is, it seems unlikely that we would need to distinguish between how we actually feel and the felt emotion that we display (for example, through facial, bodily or written/typed response) in the study of emotion in music. Although this internal locus distinction is not covered in the empirical data of the literature reviewed, it will be relevant in future research (for a discussion in non-music contexts, see Gross et al., 2000).

<sup>1</sup>Music is used in clinical and laboratory settings to induce mood. For a review, see Västfjäll (2001). When music is selected for such mood induction, it is inferred that the music will affect the listener because the music is expressing the corresponding emotion. This is borne out by some of the studies in mood induction used in fields of research such as music therapy. Although there is a substantial literature that documents these effects (e.g., Kenealy, 1988; Thaut and De L'etoile, 1993; Gendolla and Kruesken, 2001; Kreutz et al., 2002; Larcom and Isaacowitz, 2009; Harkness et al., 2010; Dyck et al., 2011), these research outputs are not reviewed here because the participant is not rating the emotion the music expresses, but (when required to) the way the music makes them feel.

<sup>2</sup>This issue and the relatively small number of comparable studies made it difficult to sufficiently fulfill the "Checklist of items to include when reporting a systematic review or meta-analysis" proposed by Moher et al. (2009). Relevant details (mean, *SD* and *N*) for this review are nevertheless tabulated when available.

Within-external locus distinctions, such as whether music is *portraying* emotion, *expressing* emotion, *trying* to portray/express emotion, or "is" emotional (the music *is* sad), are also not covered in this review, again because empirical data investigating these distinctions are rare and mostly concerned with semantics. Instead, in this review, the terminology used to discriminate external and internal locus (rather than its semantic/linguistic utility) is reported (see subsection "Locus Terminology", below), with the exception of one study that is included (Van Zijl and Sloboda, 2011) because it raises the matter of the "performer" locus of emotion, which is external from the listener perspective unless the listener *is* the performer. Again, future empirical studies will be needed to examine the various external locus possibilities, specifically the emotion that the composer(s), performer(s) or other (perhaps imaginary) listener(s) are thought to be experiencing (according to the perceiver), and whether a further distinction between these other people and the music should be made when considering external locus. For example, some recent research has considered emotion ratings that others would make about a piece of music as a way of managing possible bias in external locus response. That is, instead of being asked what emotion they believe the music expresses, a participant is asked, "How would normal people feel when listening to this musical stimulus?" (Kawakami et al., 2013).

#### **PREVIOUS PSYCHOLOGICAL ACCOUNTS LEADING UP TO GABRIELSSON'S PUBLICATION**

Pre-review period (before 2003) accounts by music psychologists demonstrating an awareness of the distinction between felt and expressed emotion are frequent (e.g., Weld, 1912; Valentine, 1962, p. 10; Swanwick, 1975; Payne, 1980; Thayer, 1986; Gaver and Mandler, 1987; Frances, 1988, p. 243; Sloboda, 1992; Scherer and Zentner, 2001; for a discussion of key pre-review period studies, see Gabrielsson, 2002, in particular pp. 124–127 and pp. 132–133), but almost none of these explicitly compare locus responses empirically (exceptions include Lee, 1932; Collins, 1989; Zentner, 2000).

From a historical point of view, and from a theory-building perspective, it is worth dwelling for a moment on the pioneering study by Lee (1932). A questionnaire was developed to investigate responses to music made by music lovers, to tease out the role of emotion (among other things) reported by participants, and to identify any skepticism about whether music was capable of stimulating human expression and emotion. Over 100 responses were collected, spanning a period of more than 25 years. One part of the analysis was to classify "listeners" who had aesthetic interest in music and "hearers" who were more focused on human-emotional interests in music listening. Lee asked explicitly about locus, but upon analysis she noticed that participants treated the three options of the questions as "one of three" rather than two of three (A or B—internal locus, and then C—external locus):


Lee's openness and regret are explicit as she acknowledges:

What I wanted to know was whether the Answerer merely recognised that a given piece of music had (i.e., might be described as having) a given emotional character, e.g., was sad or cheerful, or whether hearing that piece of music made him feel sad or cheerful when he had not been so before? This fairly simple query, clearly worded as "do you merely recognize without participation that music *represents* varieties of human emotion and mood?" was the real subject under examination and ought therefore to have been put first and foremost. Instead of that, and from a mistaken hope of additional clearness, it was put after the queries (intended to be supplementary to it) whether the Answerer's already existing mood could be altered or whether that mood was merely intensified by the emotion which was not merely recognized as characterizing the music, but actually participated in when hearing that music. As a result of this strategical blunder, the majority of Answerers did not notice the main question of Participation versus Recognition. (p. 203, italics in original)

Nevertheless, Lee identified participants who explicitly reported the link between felt and expressed emotion; for example, two participants, Bob and Lewis, respectively, wrote:

"Music generally substitutes new moods and emotions: if the emotion is tragic or tender, it seems that my mood becomes tragic or tender." (This is recognition producing a sort of sympathetic imitation.)

[...]

"It is not that the music expressed one's own feelings, but that the feelings or mood which the music expresses awaken these very feelings in oneself. Music never intensifies existing feelings, it either awakens feelings which I haven't got or merely represents them." (p. 204, Lee's annotation of Bob's comment is shown in the parenthetical)

Hence, Lee's work appears to be the first explicit attempt in English language music psychology research to collect empirical data on the distinction between loci of emotion, a pioneering effort that has received regrettably little attention in subsequent research. Her explanation of "sympathetic imitation" will be highly relevant in the theory development of this review. Aside from the data and analysis reported by Lee, it was not until 2002 that Gabrielsson acknowledged in explicit terms the possible relationships between loci (Gabrielsson is also an important researcher, as was Lee, in compiling a large body of data on individuals' self-reported aesthetic/emotional reactions to music: Gabrielsson, 2011).

Gabrielsson's (2002) publication *was* noticed and paved the way for a new era of research, by laying out the possible relationships between felt and expressed emotions. Gabrielsson provided a cautious, broad definition of emotion, which is used as a starting point in the present review: "Not to get trapped in ... terminological confusion already from the beginning, I will use 'emotion' and 'feeling' in a generic and broad-minded sense, often involving cognitive components; 'mood' and 'affect' will be used when employed by authors referred to in the text" (pp. 123–124). He then presented a detailed argument explaining that participants can only logically make distinctions between felt and perceived emotion in music through verbal report, as distinct from, say, physiological measurement.

Gabrielsson discusses some of the reasons for the absence of direct emotion locus comparisons in music perception. For example, he recounts the belief that there were different listener types, some who focused on the music, and others who focused on their own feelings when listening to music (Lee, discussed above, provides a case in point). Music was seen by many researchers and lay people as "an object for perception and reflection" and the emotional response as a listener's reaction. As a result, Gabrielsson concluded that the distinction between emotion loci in music "is not always clearly observed, neither in everyday conversation about emotions, nor in scientific papers" (Gabrielsson, 2002, p. 124). Another reason for the neglect is the influence of aestheticians upon music psychologists, as mentioned in Limitation 4, above, where the interest was in whether music that expresses emotion should be more valued than music that evokes emotion in the listener. This perspective suggests that each locus has a distinct aesthetic function, providing one reason why researchers have been distracted from the psychological *relationships* between the emotion loci.

According to Gabrielsson, in broad terms, felt and expressed emotions could be related through a "positive" relationship (felt emotion is the same as expressed emotion—feeling sad when hearing sad music), or it could be "negative" (i.e., opposite, e.g., feeling angry when hearing happy music). Furthermore, felt and expressed emotions could exhibit "no systematic relationship" (e.g., feeling various emotions when hearing calm music), or have no relationship at all (such as feeling no emotion, or identifying no emotion in the music). Gabrielsson traces back the presence of positive relationships to the ancient Greeks where music was thought to be able to directly affect the listener, in a manner similar to more contemporary research on mood management and regulation theory (Knobloch and Zillmann, 2002; Saarikallio and Erkkilä, 2007; Wilhelm et al., 2013), through a process that would later be called contagion by Juslin and Västfjäll (2008). That is, contagion explains how music can express an emotion that then "infects" its listener.

Opposite relationships, Gabrielsson explains (and subsequent research confirms), are more idiosyncratic (e.g., when an event of a contrary emotion happens in one's life that becomes associated with music heard at the same time), or a sad piece of music makes the listener happy because it is relaxing, cathartic and pleasurable (Schubert, 1996; Matsumoto, 2002; Huron, 2011; Vuoskoski and Eerola, 2012; Vuoskoski et al., 2012; Schubert, 2013), or the listener, perhaps in attempting to improve mood, actually makes their mood worse, perhaps due to some complicating circumstances such as a mood disorder (Garrido and Schubert, 2011b, 2013).

In the case of no systematic relationship, the listener might not be affected by the music at all, but be able to observe the music as expressing some emotion—Gabrielsson characterizes this with the "analytic listener"—or that different emotions are evoked in the listener at different occasions—the "zero correlation" relationship. The final category, of "no relationship at all," is characterized by the absence or unreliable presence of emotions in one or both loci, such as the internal locus (human) emotions as identified by Scherer and Zentner (2001) that music cannot express with reasonable agreement, which is likely to include gratitude, fascination, disgust, jealousy, safety, warmth and humility. Gabrielsson's relationships are not intended as clear-cut, and polychotomous: The differentiation between internal and external locus can be blurred, and the experience felt vs. the perceived emotion in the music may not even be distinguishable or meaningful to some. Gabrielsson writes, "We may think of them as opposite extremes on a continuum from 'pure' emotion-free perception at the one end to intense emotional reaction at the other end. Rather than being at any of these extremes, in most situations listeners are probably somewhere along this continuum, depending on many circumstances" (p. 124).

In the years following Gabrielsson's influential paper, research on explicitly collected self-report felt and expressed emotion in music grew, and 16 peer-reviewed publications that met the inclusion criteria were located. Some of the publications reported more than one study that was concerned with emotion locus and music, and two reported locus data from previously published sources (Schubert, 2007a; Ilie and Thompson, 2011), bringing the total number of included, unique data sets (studies) comparing emotion loci in music to 19. Fifteen of the studies were experimental (Schubert, 2007a is not added to this count because it is a reanalysis of Schubert, 2007b), three were survey based (no music played), and one was a qualitative study (Van Zijl and Sloboda, 2011).

#### **REVIEW OF THE CONTEMPORARY LITERATURE: METHODOLOGICAL ISSUES**

A summary of the included studies is tabulated in **Supplementary Table 1**. The reviewed papers are discussed according to terminologies and methodologies, followed by key results.

#### **LOCUS TERMINOLOGY**

The labels used to denote internal and external locus varied considerably across the tabulated studies. Naming the locus *variable* produced several alternatives: instruction condition (Vieillard et al., 2008: "instructed to report own emotion or instructed to describe the music"), "response" (Hunter et al., 2010), "mode" (Dibben, 2004), "modality" (Zentner et al., 2008), "type of rating" (Ali and Peynircioğlu, 2010), "point of view" (Kallinen and Ravaja, 2006), and "locus" (Schubert, 2007b).

The labeling of each of the two levels of the variable demonstrates a rich variety of ways of understanding the phenomena in question. Lee's pairing ("participate" for internal locus and "recognize" for external) is absent in all the contemporary literature reviewed, although Vieillard et al. (2008) use the term "recognize" in their recognized-experienced level labels. Collins (1989)—not part of the tabulated review—provides a rather detailed distinction between internal and external loci: "own emotional response—emotional content of the music" and "describe music—describe human emotion," as do Kallinen and Ravaja in asking their participants to respond to the music from two different "points of view":

Participants were first asked to evaluate the emotions the music aroused in them during listening (i.e., emotion felt; "How did you feel when you listened [to] the music?"), and second, evaluate the emotional quality of the music regardless of the experiences it aroused in them (i.e., emotion perceived; "What is the [more objective] emotional nature of music, regardless of your personal reactions to it?"). (Kallinen and Ravaja, 2006, p. 200)

Other variants were expressed-felt (Dibben, 2004); perceiving-feeling (Hunter et al., 2010); expressed-induced (Zentner et al., 2008); perception-induction; perceive-experience (Juslin and Laukka, 2004); conveyed-elicited (Ali and Peynircioğlu, 2010); and "participant believed the composer was intending to convey" vs. "felt in response to the musical excerpt" (Salimpoor et al., 2009). Van Zijl and Sloboda (2011) refer to musical emotion—own emotion. This study is interesting from the perspective that it tracks the responses of the listeners who are also the performers of the music. Such an approach may be incorporated into the "locus" nomenclature used by Schubert, by extending the meaning of external locus beyond the "perspective" of the listener. That is, if the listener is asked to judge the emotion that the performer is experiencing while playing, or the emotion the composer was experiencing while composing, or even the emotion experienced by any other listener (or an imagined agent), the locus would still be external, but focussed on another person, rather than on the music (a detailed investigation of the appreciation of the performer/composer emotion is beyond the scope of this review, but see, for example, Juslin, 2000; Juslin et al., 2001; Kreutz et al., 2008b).

In summary, no single pair of designations for each locus level was used consistently across the studies. This reflects a richness in the descriptions, subtle distinctions, but also lack of systematic classification. The recommendation of this review is to label two levels of the variable as deemed appropriate in each study: felt, induced (although this may be related to mood according to some researchers), evoked, internal locus vs. expressed, portrayed, "in the music," and external locus. "Perceived" is frequently used to describe external locus; however, the term could be confused because a participant can perceive many things, including their own feelings in some cases (Konečni, 2008). Context and grammar are important here: to make the locus distinction clear from the *listener's* perspective, and when the listener is the subject of the sentence, she or he feels an emotion, but *perceives* it (rather than expresses it) in the music—"express" does not designate external locus when the listener is the subject of the sentence ("the listener expresses happiness"). When the *music* is the subject of the sentence, *expressed* emotion is the external locus designation (obviously not "perceived," because music is not an agent that can literally perceive).

Using the term "locus" to describe the variable frees up labels that are commonly used to describe other variables, such as type and mode. "Perspective" is another possible term that could be used to describe the variable but was not cited with any regularity in the tabulated literature.

**Table 1** attempts to organize all of the terms used to describe each of the two levels of loci located in the literature. It should be understood, then, that the unambiguous use of the terminology depends on the explicit and implied grammatical subject (that is, if stated from the perspective of the listener or of the music). Any of the internal-external pairings are satisfactory but need to have the subject (music or listener) explicitly stated or made clear. The term "communicates" is listed under internal locus from the music perspective because it suggests a context: "the music communicates to me." But since the term "communication" refers to the *transmission* of information from one source to another, it may lead to some ambiguity (e.g., "the music is communicating an emotion"—external locus?).

**Table 1 | Terminology of emotion locus levels by grammatical subject (perspective).**

#### **PARTICIPANTS**

All included experimental studies in this review had over 25 participants in each experiment, with the Konečni et al. (2008) study having 144 participants. The two questionnaire-based studies had 262 (Zentner et al., 2008) and 141 (Juslin and Laukka, 2004) participants. The open-ended study of the musicians' perspectives used 8 performers covering a variety of musical instruments (Van Zijl and Sloboda, 2011). Participants had a range of ages across studies, with one study examining results of older participants separately (Schubert, 2007b; Experiment 3). A wide range of musical experiences were reported, though none of the studies treated musical experience (e.g., high vs. low) as an experimental variable.

#### **TYPES OF MUSIC USED**

The bulk of music used in the studies tabulated comes from the common practice period (CPP) of Western art music, which is sometimes referred to as "classical" music. Twelve publications using experimenter-selected pieces used exclusively such music. Classical music, and in particular music from the romantic art music period, is considered particularly good for expressing and evoking emotions (Romantic(ism), 2013).

Since studies by Panksepp (1995), Blood and Zatorre (2001), and Rickard (2004), it has become evident that another effective way to evoke strong emotional responses is to use music that the participant, rather than the experimenter, selects. The Rickard study comes near the halfway mark of the review sample chronology, and it is after this date that we start to see self-selected pieces being used for locus of emotion in music studies. The locus studies reviewed first commenced using self-selected music from 2007 (Schubert, 2007a). Salimpoor et al. (2009) directly compared participant-selected and experimenter-selected pieces in their study.

Some research deliberately selected music that is unfamiliar. In the two experiments investigating emotion locus by Dibben (2004), she verified that the music was not familiar to any of the participants. Hunter et al. (2010) also selected unfamiliar music by Bach—eight selections—which were manipulated by tempo and mode to generate the 30 stimuli presented via MIDI playback. Thus, music of the common practice period presented a great range of choices for a Western-enculturated participant. Vieillard et al. (2008) is the only study in the tabulated literature where specially composed music was created for the purpose of exerting a high degree of control over the stimuli. This was balanced with some more familiar film music excerpts, used at the beginning of the procedure. The study by Van Zijl and Sloboda (2011) used self-selected pieces, but these were for the purpose of learning to play—so that participants would be getting to know their stimuli in a highly intimate way.

Two publications reviewed included no explicit musical stimuli (Juslin and Laukka, 2004; Zentner et al., 2008) because data about music in general were collected via questionnaire. The questionnaires provided critical data directly addressing the question of locus. In this respect, they fulfilled the inclusion criteria of this literature review because the participants were asked to think about music in general, or a favorite piece (see also Evans and Schubert, 2008) or genre of music. In addition, Juslin and Laukka (2004) was one of the first published studies (along with Dibben) to directly address the question of locus in the post-Gabrielsson period.

When musical examples were selected, they were usually chosen on the basis of which emotion they would express or evoke, with the intention commonly being to produce a range of emotions, whether based on the theory of basic emotions—such as "joy, sadness, anger, and fear" (Kallinen and Ravaja, 2006) or a sample from each quadrant of an emotion space (e.g., Dibben, 2004; see also Collins, 1989, though not tabulated for the present review). For the experimental studies, the typical length of the excerpts used ranged from 1 to 3 minutes. Ali and Peynircioğlu (2010) used stimuli thought to be unfamiliar, each lasting approximately 20 seconds, but in one condition, participants were familiarized by listening to an excerpt five times in succession. In one of the two Ilie and Thompson (2006) studies the stimuli used had an average duration of 6 seconds.

#### **DESIGN AND PROCEDURES**

Many of the tabulated studies ask the participant to perform tasks in a laboratory setting and often in groups. However, the group-listening laboratory setting does not necessarily reflect the typical day-to-day listening experience or environment (for a discussion of the matter of naturalistic vs. experimental research, see Mitchell, 2012). Since the emerging ubiquity of personal, private online computer facilities, tablets and mobile phones (Krause and Hargreaves, 2012), it has become possible to collect sophisticated data outside the laboratory and in a private, individual environment (e.g., see Reid et al., 2009). Such technological advances will allow locus data to be collected in a variety of settings. For example, one of the studies outside the review epoch (Schubert, 2013) used an online survey to collect locus data, requesting participants to use YouTube or some other online streaming resource to listen to their self-selected pieces, and provide information about the piece and the URL (that is, the participants pasted the link they used to access the selected music to allow later inspection by the researcher). The survey could be completed in private. Further developments of this approach will be able to better ascertain the reliability and validity of collecting data in such a way, as compared to group settings with a researcher present and with predetermined ordering of stimulus presentation.

Juslin and Laukka (2004) used a survey with no musical stimuli to gather data on a range of issues regarding musical experiences and beliefs, including emotion perception and emotion induction. The survey included questions about the utility of certain words for describing emotion felt and emotion expressed. This kind of task is easy to administer and produces a rich source of data about emotion locus. Zentner et al. (2008) also used a survey-based approach with a direct rating for each of the two loci for a series of 146 "feeling terms" for a variety of musical styles. Participants first rated the terms as emotions felt for a selected (favorite) musical style, and then again for the emotion perceived in that style.

Indeed, one concern in the literature is the timing of the locus tasks. Should they both be completed immediately after hearing the musical stimulation, one after the other, or should only one be completed (internal only *or* external only) in case one rating influences the rating of the other locus? Completing the two rating items in immediate sequence has the advantage of requiring only a single pass of the music stimulus to obtain both ratings, but has a drawback if the participant (consciously or otherwise) responds to the second rating item under the influence of the first, through what is known as a "contrast" effect (Cacioppo and Gardner, 1999; Schwarz and Strack, 1999; Cheng, 2004), where the second response is made relative to the first, while the first rating item response is (probably) not. For example, a high rating of felt emotion might be exaggerated (rated even higher) if the external locus rating is made immediately before but is also rated as high.

Kallinen and Ravaja (2006) asked participants to rate felt emotion first so that the felt emotion would not be diluted by the delay that rating expressed emotion first would introduce. This ordering was used in several other tabulated experiments (e.g., Dibben, 2004; Schubert, 2007b, Studies 2 and 3; Evans and Schubert, 2008). However, Konečni et al. (2008) argued that by rating external locus first (i.e., the emotion in the music), the participant will be more cognizant in distinguishing their own internal locus response and not confuse it with the expressed emotion. They resolved the matter by counterbalancing: half the participants made their internal locus rating first, while the other half made their external locus rating first. Dibben (2004; Experiment 2), Vieillard et al. (2008), and Ali and Peynircioğlu (2010) each had one internal locus group and another external locus group, each group completing questions for one locus only as a between-subjects design. Dibben (2004) concluded that when participants make a judgment in one locus alone, they do not differentiate between emotion and locus as well as they do when loci are presented together (p. 111), suggesting that some "contrasting" of rating items may be methodologically beneficial, as small differences are amplified. Schubert's (2007b) findings were different but led to a similar conclusion: a contrast effect was not observed, and in fact, locus tasks performed together produced some "interference" (Gabrielsson, 2002, p. 127; Dibben, 2004, p. 95) leading to a more blended response, compared to performing one locus task at a time.
But the same *trend* in response was noted when compared to the "locus-separate" condition (recording response to one locus only, and then the other on a second hearing of the stimulus), leading Schubert to conclude that responding to both loci in sequence at the same time was more efficient (almost halving experiment time, or "doubling" the data pool, and as a result increasing statistical power), perhaps compensating for the possible disadvantages. The recommendation of this review, then, is that it is more efficient to collect both locus ratings together, but to counterbalance the order of loci questions when possible, as did Konečni et al. (2008). Most of the tabulated studies, when presenting the loci responses together, requested internal locus rating first. It is not always clear in the literature whether participants could change their answers, though it seems that no effort was made to prevent or withhold the option of checking or changing responses. Explicit investigation of this counterbalancing is recommended.
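The counterbalancing recommended above can be sketched as a simple allocation routine. This is a minimal illustration and not the procedure of any tabulated study: the function name `assign_rating_orders`, the participant labels, and the fixed random seed are all assumptions introduced here.

```python
import random

ORDERS = ("internal_first", "external_first")

def assign_rating_orders(participant_ids, seed=0):
    """Randomly split participants so each rating order is used equally often."""
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed only so the allocation is reproducible
    rng.shuffle(ids)
    # Alternate orders over the shuffled list; group sizes differ by at most one.
    return {pid: ORDERS[i % 2] for i, pid in enumerate(ids)}

if __name__ == "__main__":
    # Hypothetical participant labels P01..P20.
    allocation = assign_rating_orders([f"P{n:02d}" for n in range(1, 21)])
    for pid in sorted(allocation):
        print(pid, allocation[pid])
```

Shuffling before alternating keeps the two order groups balanced while avoiding any systematic link between recruitment order and rating order.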

#### **RESPONSE FORMAT**

Fifteen of the nineteen tabulated studies were defined here as "experimental," and for several of those, as well as the three survey studies (see **Supplementary Table 1**, rows 3, 10, and 12), a wide range of emotions were presented as entities to be rated by the participant after listening to an extract of music, once for internal and once for external locus. This subsection therefore examines these item-rating response formats according to (1) the emotions rated, (2) the number of steps available on each rating item, (3) the number of items rated in each locus for each piece of music, and (4) unipolar versus bipolar item labeling issues.

#### *Emotions Rated*

The most frequently used pole labels (whether the rating item was bipolar or unipolar) were happy (including happiness and very happy) and sad (including sadness and very sad), both used in at least<sup>3</sup> eight studies (**Supplementary Table 1**, rows 1, 2, 4, 9, 11, 13, 14, and 16. Note: All subsequent parenthetical references to row numbers in this subsection refer to this table, with focus on the Measure column). In (at least) five studies an item for rating happiness and another for rating sadness were presented (rows 1, 2, 11, 14, and 16), but on three occasions terms related to happy and sad were presented together at the opposite poles of a single, bipolar rating item (rows 4, 9, and 13), such as very happy to very sad, as used by Konečni et al. (2008, row 9). Wording related to "arousal/aroused" was located in four studies (rows 4, 5, 13, and 18, one of which had poles labeled "energetic/peppy" and "bored/vegetated": Kallinen and Ravaja, 2006, row 4). If we group together the remaining labels according to similarity of meanings, the next most frequently used rating item labels in the tabulated literature are five occurrences related to anger ("angry", "fear", and "scary"<sup>4</sup>: rows 1, 2, 4, 11, and 16) and four related to calm ("calmness", "relax", and "peacefulness": rows 4, 11, 16, and 18).

Typically, the rating items were combined and manipulated to create dependent variables for statistical analysis and hypothesis testing. This transformation process is summarized in the Measure column at "DV" for each of the relevant tabulated studies in **Supplementary Table 1**. The most frequently used labels for dependent variables were related to "valence" (positive, negative, and/or valence: rows 1, 2, 4, and 18) and "arousal" (rows 4, 8, 13, and 18), although untransformed rating items also received these labels in some studies (e.g., row 5). Thus, the emotion constructs of valence and arousal have endured as methods of operationalizing emotion locus response.

#### *Number of Steps Per Rating Item*

When making a response via a rating item, participants are given a number of graded steps (points) along a continuum from which they are to make a single selection. The range of the available steps per rating item across the tabulated studies was from 4 to 13, with 5-point rating items used most frequently (seven studies, rows 1, 2, 4, 10, 14, 16, and 18). Researchers have had to balance the coarseness of rating items with fewer points against the greater difficulty for the participant in making a selection when a larger number of steps is offered (Alwin, 1997; Viswanathan et al., 2004; Dawes, 2008). While there is no "magic number" of rating scale steps to use, in the tabulated literature the number of steps is generally informed by precedent (e.g., using a rating item previously published) and the number of items to be rated: a large number of items is usually associated with a smaller number of steps per item, as discussed next.

#### *Number of Emotions Rated*

Konečni et al. (2008, row 9) deliberately used a single rating item for each stimulus/locus combination, arguing that a large number of rating items might be unrealistic to recall and impractical to complete in response to an excerpt of music (in accord with Viswanathan et al., 2004). Typically 4–6 emotion item ratings were requested per piece, per locus across the tabulated studies. In the open-ended responses, Van Zijl and Sloboda (2011, row 19) asked performers explicitly about emotions they felt when preparing a piece and the emotions that the music was expressing, meaning that there was no explicit limit to the number of emotions that could be reported. Two of the survey studies (Juslin and Laukka, 2004; Zentner et al., 2008, rows 3 and 12) requested participants to rate how frequently a word from a list of emotion terms was appropriate to describe music, on a four-step rating item ranging from never to always. The checklist approach pioneered by Hevner (1936, 1937), which lists a relatively large number of emotion words from which the participant can select, is completely absent in the experimental studies tabulated, suggesting utility and ease of statistical analysis of the highly prevalent rating items.

#### *Unipolar vs. Bipolar Rating Items*

In the experimental studies, response items were most frequently presented in a unipolar format (9 studies: rows 1, 2, 8, 11, 13, 14, 15, 16, and 18), whereas bipolar labels were used for rating items in six studies (rows 4, 5, 6, 9, 13, and 15). Dependent variables were generated from combinations of the rated items in 9 studies, of which 4 were unipolar (rows 1, 2, 11, and 16) and 4 were bipolar (rows 4, 9, 15, and 18), with one study generating an angular variable from the combination of arousal and valence ratings (row 8). Some researchers converted unipolar item ratings into bipolar dependent variables, such as Ilie and Thompson (row 18), who took differences between pairs of unipolar item ratings (e.g., "pleasant" and "unpleasant") to generate the dependent variable score (e.g., "valence"). That study is also interesting from the point of view that it is one of the few to apply a more contemporary model of emotion dimensions to the data, with energy-arousal and tension-arousal scores generated, instead of the more common "arousal" dimension alone (Schimmack and Rainer, 2002).

<sup>3</sup>"At least" is indicated because use of the same labels in subsequent studies with an author who had already used these labels is not, in general, added to the count. Exceptions are made when the author has used substantially different labels (e.g., Dibben, 2004: compare Measure column for her Experiments 1 and 2, rows 1 and 2 of **Supplementary Table 1**; both of these experiments are added to the count). Furthermore, when the term happy, sad, etc. is used to *exemplify* a label (such as "happy, calm, joy" to illustrate the meaning of the label "positive emotion"), it is not included in this count. The same applies for all summary data presented in this subsection. Where a term was presented in a language other than English, the English term, as reported in the target article, is used for tallying.

<sup>4</sup>Caution is urged in making such a grouping since "fear" and "scary" are importantly different along a dominance dimension (angry is a dominant emotion; fear/scared are submissive). The grouping is justified only in terms of similarity on typical valence-arousal semantic spaces (Russell, 1980).

The use of bipolar vs. unipolar rating items presents some interesting challenges (Yorke, 2001). A bipolar rating item is labeled at one pole with a term that is opposite in meaning to the other pole to the extent possible (such as "happy" at one end and "sad" at the other). But some researchers have found that supposedly opposite constructs such as happy and sad or excited and calm do not traverse from one pole to the other in a linear, unique, proportionally exclusive manner—that is, they are not exact opposites, do not refer to the identically opposite semantic construct, and do not transition from one to the other in a mutually exclusive manner (it is possible to feel happy and sad at the same time). While space does not permit the discussion of the important question of the distinction between response item formats (see Cacioppo et al., 1997; Larsen et al., 2001; Yorke, 2001), the semantics of rating item labels may interact with interpretations of magnitude when conclusions about locus distinctions are made. This is a crucial matter given that negative emotion rating items (e.g., "Rate how sad" vs. "Rate along a happy-sad bipolar continuum") and responses to emotions with putative negative emotions (e.g., a piece of *a priori* angry music) are routine design matters in investigations of emotion locus. The increase in experimental efficiency of collecting a single bipolar rating item response (instead of two unipolar rating item responses) weighed against the methodological challenges of using bipolar ratings is an issue that has not been systematically addressed in the emotion locus literature but may have important ramifications.
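The information that a bipolar format can lose may be illustrated with the difference-score conversion described earlier for Ilie and Thompson (row 18). The sketch below is an illustration only: the function name, the 1 to 7 rating range, and the two response profiles are hypothetical and are not data from any tabulated study.

```python
def bipolar_from_unipolar(positive_rating, negative_rating):
    """Difference score: positive pole minus negative pole (e.g., happy - sad)."""
    return positive_rating - negative_rating

# Hypothetical unipolar ratings on a 1-7 scale (illustrative values only).
mixed = {"happy": 6, "sad": 6}      # strongly happy AND strongly sad
neutral = {"happy": 2, "sad": 2}    # weakly happy and weakly sad

mixed_score = bipolar_from_unipolar(mixed["happy"], mixed["sad"])
neutral_score = bipolar_from_unipolar(neutral["happy"], neutral["sad"])

# Both profiles collapse to the same bipolar value, so the mixed response
# can no longer be told apart from the near-neutral one.
print(mixed_score, neutral_score)
```

Under these assumptions, two unipolar items preserve the co-occurrence of happiness and sadness, whereas a single bipolar score does not.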

#### **KEY FINDING**

In nine of the tabulated entries, it was possible to perform a simple count of the number of times the mean internal locus rating was greater than the corresponding mean external locus rating (e.g., rating of internal locus sadness vs. rating of external locus sadness): this is shown as a fraction in the first entry in the Main Findings column of **Supplementary Table 1**. The counts are based on the cell pairs found in the tables and figures of the publications where direct comparisons of locus pairs were presented. Forty-five out of 178 cell-pair comparisons were higher for internal locus means compared to the external locus mean. If there was no trend, 89 (50%) was the expected count, and so a significantly *small* number of cases had internal locus rated higher [χ<sup>2</sup>(1, *N* = 178) = 43.51, *p* < 0.001]. Furthermore, where significant main effects of locus were reported, external locus ratings were rated higher than corresponding internal locus ratings in eight analyses, and vice-versa in two analyses. On other occasions, the difference in mean locus ratings was not significant or was not reported.
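The reported statistic can be checked with a two-category goodness-of-fit calculation against an even split. The following is a minimal sketch using only the counts given above (45 of 178); the function name is an assumption introduced here, and the closed-form tail probability used applies only to one degree of freedom.

```python
import math

def chi_square_gof_two_cells(observed_a, total):
    """Goodness-of-fit test of a 50/50 split across two categories (df = 1)."""
    expected = total / 2
    observed_b = total - observed_a
    chi2 = ((observed_a - expected) ** 2 + (observed_b - expected) ** 2) / expected
    # For df = 1 the chi-square upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_square_gof_two_cells(observed_a=45, total=178)
print(f"chi2(1, N = 178) = {chi2:.2f}, p < 0.001: {p < 0.001}")
```

The calculation treats the 178 cell pairs as independent observations, which is a simplification, since several pairs come from the same study and the same participants.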

The data from the two Ilie and Thompson studies were collated and compared for the current review because those data were designed to allow such comparison. The results reflect the overall findings across the tabulated literature, and so are summarized here. One group of participants (*n* = 27) rated external locus responses in one experiment (Ilie and Thompson, 2006), and another group (*n* = 64) in a separate experiment rated internal locus emotions (Ilie and Thompson, 2011; Experiment 1). The same design and procedure as well as similar stimulus manipulations were used for both experiments with the exception of the musical repertoire used: 5–7 seconds long baroque and classical excerpts for the 2006 study, and a single Mozart piece lasting 7 minutes for the 2011 study. The stimuli were all digitally processed to produce manipulation of pitch, tempo and intensity (two levels for each). The comparisons of mean loci data are summarized in **Figure 1**. The figure demonstrates the trend found throughout the literature that felt emotions tend to be rated the same or lower than expressed emotions for the corresponding emotion rating and independent variable level combinations.

The use of different stimuli between locus conditions, as is the case in this example, may raise some concerns about the validity of such a comparison, but in that case, one might expect that either a systematic variable has led to the identified trend, or that the number of felt mean scores would be less than the perceived mean score in 50% of pairs (that is, it would be distributed according to chance). As shown in **Figure 1**, two comparisons have higher felt means than expressed mean ratings: valence rating for the soft, slow, high pitch condition—7.13 vs. 7.01, and tension rating for the loud, fast, low pitch condition—5.25 vs. 5.22. But for each of these pairs, the (1SE) error bars overlap.

While the tabulated literature did not allow a complete, direct statistical comparison of the relative mean magnitude of felt and expressed locus responses, a crude comparison using error bar overlapping was conducted, shown in the magnitude column of **Supplementary Table 1** when available or extractable (see note for that column for more details and limitations). A count of these crude "significance" tests revealed that overall there were 99 cases where mean felt emotions were rated as lower in magnitude (regardless of the emotion rated) than the mean expressed rating (for the corresponding emotion rating item), but only nine occasions where the reverse was the case (mean expressed magnitude lower than mean felt magnitude). In 77 tests, error bars between mean locus pairs overlapped, suggesting a relatively high proportion of cases where participants rank emotions as being well matched in magnitude across locus.
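The crude error-bar comparison described above can be sketched as follows. This is one plausible reading of the procedure, not the exact rule used to compile the table: the function name and the example means and standard errors are hypothetical, and only the final tally (99, 9, and 77) comes from the text.

```python
def crude_locus_comparison(felt_mean, felt_se, expressed_mean, expressed_se):
    """Classify a felt/expressed pair by whether their 1 SE error bars overlap."""
    if felt_mean + felt_se < expressed_mean - expressed_se:
        return "felt lower"
    if expressed_mean + expressed_se < felt_mean - felt_se:
        return "expressed lower"
    return "overlap"

# Hypothetical means and standard errors, for illustration only.
separated = crude_locus_comparison(4.0, 0.3, 6.0, 0.3)
overlapping = crude_locus_comparison(5.1, 0.4, 5.3, 0.4)
print(separated, "|", overlapping)

# Tally reported in the text: 99 felt lower, 9 expressed lower, 77 overlapping.
tally = {"felt lower": 99, "expressed lower": 9, "overlap": 77}
total = sum(tally.values())
for label, count in tally.items():
    print(f"{label}: {count}/{total} = {count / total:.1%}")
```

Non-overlapping 1 SE bars are a much weaker criterion than a formal significance test, which is why the comparison is described as crude.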

In one of the two studies where mean internal locus was rated as higher than mean external locus, it was valence that exhibited a main effect, with mean felt emotions rated as above zero, and mean expressed emotions as below zero (Kallinen and Ravaja, 2006). In the same study, a second main effect was reported where negative activation ratings were also rated in this order across loci. Furthermore, an interaction was identified for the valence score, where pieces that were expected to represent negative emotions (fear and sadness) were rated with higher felt negative emotions than expressed negative emotions (see Main Findings column in **Supplementary Table 1**, row 4). The second study where mean internal locus was rated as higher than the mean external locus (Vieillard et al., 2008) produced an overall main effect of higher rating for felt than expressed emotion, using the "best label" score approach.

Apart from these two studies, the overwhelming evidence presented in **Supplementary Table 1** is that when emotion loci are not statistically the same, emotion felt is rated lower than emotion expressed by the music: for example, higher positive expressed emotion than positive felt emotion (Dibben, 2004; Schubert, 2007b; Konečni et al., 2008; Zentner et al., 2008; Ali and Peynircioğlu, 2010; Hunter et al., 2010). The study by Salimpoor et al. (2009) provides evidence that when there is a difference in mean locus rating, the difference is constituted as felt emotion being lower, rather than expressed emotion being higher. This evidence is drawn from the comparison of music selection condition results: in the self-selected condition both felt and expressed emotions were rated with a statistically equal mean of about 7 on a pleasure scale of 1 (neutral) to 10 (extremely pleasurable), but in the experimenter music-selected condition, felt emotion ratings dropped to four while expressed emotion ratings remained at around seven.

Across the tabulated literature a number of interacting variables were investigated to see what other influences bear on locus response, including physiological state, personality, age, music-selection responsibility (participant or experimenter), musical features and, as already described above, the putative emotional connotation expected in the music (for example, when the music is expected to represent negative emotion). Some studies reported relationships between interacting variables and the gap across emotion loci ("GAEL," the difference between the internal and external locus scores, usually reported as an absolute value). With regard to personality differences, participants exhibiting high scores in trait neuroticism-anxiety and on a scale measuring the behavioral inhibition system ("BIS", the system that specializes in dealing with aversive signals) produced a statistically larger GAEL score than did those scoring low in neuroticism-anxiety and BIS (Kallinen and Ravaja, 2006). Kallinen and Ravaja argued that high neuroticism-anxiety participants suppressed emotional *experiences* more than their extraverted counterparts (p. 195). In the Salimpoor et al. study, the method of stimulus selection mattered, with participant-selected music producing more equal ratings across loci (smaller GAEL) than experimenter-selected music (Salimpoor et al., 2009). Furthermore, statistically equal ratings between felt and expressed emotion are more likely to be found when liked music is used than when disliked music is used (Schubert, 2007a, 2010). This is consistent with Salimpoor et al. because the self-selected pieces were presumably liked more than the experimenter-selected pieces in that study.
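As a worked illustration of the GAEL measure defined above, the sketch below applies it to the approximate Salimpoor et al. (2009) means mentioned earlier (felt and expressed both about 7 for self-selected music; felt about 4 and expressed about 7 for experimenter-selected music). The function name and the rounding of those means to whole numbers are assumptions made here for illustration.

```python
def gael(internal_mean, external_mean):
    """Gap across emotion loci: absolute difference between the two locus scores."""
    return abs(internal_mean - external_mean)

# Approximate means as described in the text for Salimpoor et al. (2009),
# on a pleasure scale from 1 (neutral) to 10 (extremely pleasurable).
conditions = {
    "self-selected": {"felt": 7.0, "expressed": 7.0},
    "experimenter-selected": {"felt": 4.0, "expressed": 7.0},
}

for name, means in conditions.items():
    print(name, gael(means["felt"], means["expressed"]))
```

Because GAEL is an absolute value, it records the size of the gap but not its direction, so the sign of the underlying difference has to be reported separately.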

Hunter et al. (2010) identified interactions between loci and each of the musical feature manipulations investigated (tempo and mode) for their specially manipulated stimuli. While they found overall lower emotion ratings for internal compared to external locus conditions, happiness ratings amplified this difference when the tempo was fast, and sadness ratings amplified this GAEL when tempo was slow (for a summary of other interactions in this study, see Main Findings column of **Supplementary Table 1**, row 14). Schubert (2007a) reported systematic differences in locus due to preference, with loved music producing a close match between locus scores (small GAEL), a result that was replicated (Schubert, 2010), albeit with fairly marginal significance (*p* = 0.045; see **Supplementary Table 1**, row 15).

Apart from a main effect of internal locus emotions being rated lower than external locus emotions under many conditions, few of the additional independent variables investigated in the tabulated literature have produced consistent responses, most importantly because of a lack of replication. The most commonly repeated investigated interaction across the tabulated studies is the putative emotion of the musical stimuli. But, as mentioned above, the results are not altogether consistent across studies, and furthermore, studies differ in how emotion ratings are collected (e.g., different ways of labeling the rating items; see discussion under Response Format, above), in how those data are transformed into dependent variables, and in the nature of the musical stimuli.

The clearest, new trend, then, that has emerged in the literature since Gabrielsson's review is that when there is a mismatch between felt and expressed emotion, *and* when these data are gathered via rating items (as distinct from selection of discrete emotions: see Eerola and Vuoskoski, 2011), results rarely show that felt emotions are rated as statistically higher than expressed emotions. As a consequence, the next section of this review focuses on building theory that explains this finding about emotion locus in music.

#### **THEORETICAL CONSIDERATIONS**

#### **FELT EMOTION LESS THAN EXPRESSED: EMOTIONAL CONTAGION THEORY**

As an extension of Gabrielsson's thinking described in the opening of this paper, it could be assumed that there is no differentiation between the conceptualization of internal and external loci of emotion: it is all part of the same emotional response, and being asked to make the distinction may even be considered artificial. But the findings of the literature reviewed, and in particular the explicit investigation by Juslin and Laukka (2004), demonstrate that a majority of people can make distinctions between these perspectives, understanding that music can appear to express an emotion and that the emotion can be felt. External and internal loci do not necessarily meld into the one experience, at least not for the wide majority of the participants in the Juslin and Laukka study. Even if this distinction is a purely artificial or cultural one, it appears to be widely present and is in need of theoretical understanding.

However, the reason for the quantitatively different rating between loci found in the tabulated literature, when it occurs, could be something as basic as the instructions. Konečni (2008) demonstrated that the wording and detail of the task can impose a secondary influence on the results. He revealed different amounts of detail about a publication on music and emotion and asked the participants to determine which locus the paper was referring to at the different stages. The locus indicated in the article was external and revealed in the title ("... perceived intensity of emotion")<sup>5</sup>. In Konečni's study, only 25% of participants identified the correct locus, suggesting they tend to think in terms of their own feelings when there is "ambiguity" in wording. The Konečni study demonstrates the importance of clarity of communication. But a theoretically interesting question is why such bias in interpretation may be present.

To explain locus relationships, Evans and Schubert (2008) drew on the distinction made between absolutism and referentialism in the experience of music, as proposed by Meyer (1956) and reinterpreted by Schubert and McPherson (2006). Referentialism suggests that connections between music, emotions, and other situations/events are made by association primarily as a result of life experiences and cultural knowledge but also through highly individual and even idiosyncratic connections. A mismatch in musical emotion and felt emotion can be explained by these kinds of idiosyncratic, arbitrary pairings, as Gabrielsson (2002) points out. Schubert and McPherson then proposed that meaning can also be encoded more directly into the music (as according to Meyer's account of absolutism, or "absolute-expressionism"), where emotions are directly decoded by the listener [an idea found in the "lens model" proposed by Juslin (1997); Juslin and Lindström (2010)] through an act of mimicry, and neurophysiologically via the mirror neuron mechanism (Schubert, 2007b). An influential, related explanation was proposed by Juslin and Västfjäll (2008), who labeled this kind of process as "emotion contagion." In other words, emotional contagion is the direct influence upon the listener of the emotion that the music portrays, in the absence of outside "interference" through, for example, idiosyncratic connections—such as the unhappy break-up with a partner when otherwise happy music is playing ("referentialism" according to Evans and Schubert, and "episodic memory" according to Juslin and Västfjäll). Emotional contagion may then be taken as one theoretical position for understanding relationships between emotion loci in music.

Put simply, emotional contagion in music refers to the transmission of an emotion via the auditory sense alone. Bharucha et al. (2006) explain:

Unlike other types of contagions, the germs of emotion transmitted by music seem to require no social interaction—musical emotions are airborne contagions. The social contagion of emotions is thought to stem from the tendency to automatically mimic the social cues of others, such as body posture, movement, facial expressions, and vocal expressions. It is perhaps the latter that leads to social contagion in music. (p. 156)

Several of the reviewed studies (Gabrielsson, 2002; Dibben, 2004; Kallinen and Ravaja, 2006; Ali and Peynircioğlu, 2010; Schubert, 2010) refer to the possible role of social inhibition and the laboratory setting. Under such circumstances, strong emotional outbursts can be considered inappropriate, leading participants to suppress felt emotional response relative to external locus rating. By adopting an emotional contagion framework, what this means theoretically is that inhibition of experienced emotion can take place when making internal locus responses. Evidence from non-music literature about social inhibition and its influence on felt emotion can be found in the within-internal-locus distinctions identified in social psychology where public displays of emotions do not identically map onto actually felt emotions (Bono and Vey, 2005; Mann, 2005; Bakker and Heuven, 2006; Langer et al., 2007; Manne et al., 2007; see also discussion in Limitation 5, above). However, further investigation will be required to examine whether this kind of social, contextual adjustment of felt emotion occurs in response to music.

<sup>5</sup>Reading only the title of the article produced the greatest confusion in participant belief about the target locus of the article. However, this might not be so unreasonable given the ambiguity of the term "perceived", particularly in the absence of the grammatical subject of the sentence, discussed earlier, and summarized in **Table 1**.

If we apply contagion theory to explain emotion locus relationships in music perception, such that expressed emotions are transmitted to (or infect) the listener, then inhibition becomes a plausible explanation for reduced felt ratings, and the inhibition may be a product of social context. Evidence from a study where internal locus emotions were rated in response to various emotion-expressing film excerpts (Jakobs et al., 2001) suggests there is an influence of social context. When a film extract was viewed alone, felt sadness ratings were higher than when viewed with another person (see also Raghunathan and Corfman, 2006; and for a similar design, but using music stimuli, see Liljeström et al., 2012).

#### **FELT EMOTION MORE VARIED THAN EXPRESSED: DECODING THEORY AND THE LENS MODEL**

A further complication of the locus relationship debate is the variability in responses to either locus. If the properties of the artwork (in this case, the relationships among musical features over time) are consistent upon repeated exposures, then it may seem logical to assume that the emotion expressed by that stimulus is also stable, and it is the internal locus that might be more variable, depending, for example, on the mood of the listener/perceiver, as Gabrielsson points out in the "zero correlation" case of his "no-systematic relation" (Gabrielsson, 2002, p. 136). Juslin's lens model, discussed above, can be used to help interpret this situation. If a performer and/or composer encodes particular emotions into a piece of music, the listener's decoding will to some extent be a statistical process, meaning that decoding will not necessarily be the same as the encoded emotion. Felt emotion may be characterized as "encoded emotion plus noise." Schubert (2007b; see **Supplementary Table 1**, row 5) tested this explicitly by comparing the variance for each emotion rating pair across loci to assess the "stability" of the loci, arguing that if one locus had a lower variance than the other, it was more stable. Six *F*-tests out of 20 (4 emotions rated × 5 pieces of music) were significant at *p* = 0.05, with felt emotions demonstrating larger variance than expressed emotions, with three of these being in response to one piece, "Jupiter" from *The Planets* by Holst (for ratings of emotional strength, arousal and valence). One out of 20 may have been significant by chance alone, and so the study concluded that expressed emotions are overall more stable than felt. It suggests that when all things are equal, internal locus equals external locus plus noise, a claim that requires more research (Schubert, submitted).
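The chance argument above (one significant result expected among 20 tests at the 0.05 level, against six observed) can be made explicit with a binomial calculation. This is a sketch under the assumption that the 20 tests are independent, which the original study does not claim and which is unlikely to hold exactly because the same pieces and raters recur across tests; the function name is introduced here for illustration.

```python
from math import comb

def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha): k or more false positives in n tests."""
    return sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k, n + 1))

n_tests, alpha, n_significant = 20, 0.05, 6

expected_by_chance = n_tests * alpha  # the "one out of 20" referred to in the text
p_six_or_more = prob_at_least(n_significant, n_tests, alpha)

print(f"expected false positives: {expected_by_chance:.1f}")
print(f"P(at least {n_significant} of {n_tests} significant by chance): {p_six_or_more:.5f}")
```

Under the independence assumption, six or more significant results would be very unlikely to arise by chance alone, which is in line with the study's conclusion that expressed emotion ratings are the more stable of the two.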

Thus, the theoretical underpinning of emotional contagion has not yet been fully addressed in the current literature of emotion locus for music. Further studies will need to falsify the idea that felt emotion is lower in absolute magnitude than expressed emotion because of inhibited contagion, and whether there is a systematic difference in variance between the two loci. Cases of felt emotion being greater than expressed will need to be better explained from a theoretical stance before the inhibited contagion account can be fully supported. I will examine one further theoretical position that is able to explain some of the results identified in the tabulated studies.

#### **MIXED RESPONSES: DISSOCIATION THEORY**

I am not explicitly concerned in this review with research on music expressing conflicting emotions, such as happy and sad emotions, at the same time (Hunter et al., 2008, 2010). But peculiar to the locus debate is when a complex combination of emotion matches and non-matches occur between and within loci, such as music *expressing* fear but the listener reacts with feelings of embarrassment *and* joy. Let us suppose now that the joyful reaction was due to the memory that the music evokes about something quite personal and private (as per the "episodic memory mechanism" proposed by Juslin and Västfjäll, 2008), and that upon realizing the response was possibly inappropriate in the current setting (e.g., the music was heard in a concert setting where the other audience members were quiet and calm), the listener became embarrassed. Thus, three potentially mismatching emotions are at play here—one the fear expressed by the music, and the two felt emotions (embarrassment and joy).

After the work of Charland and Colombetti (Charland, 2005; Colombetti, 2005), I (Schubert, 2012, 2013) proposed a solution to the conundrum of mixed emotions in music by arguing that there are two qualitatively different kinds of "feeling" (to use the term in a way similar to Zentner et al., 2008) experiences: emotion valence and affect valence. Emotion valence is specific to emotional contemplation, without any necessary approach or withdrawal action readiness (Frijda et al., 1989). Affect valence, on the other hand, is concerned with the action/evaluative response qualities, which can generally be thought of as preferences (including enjoyment, liking and attraction, or lack thereof). Affect valence is the outcome of the music-listening activity and therefore also encompasses the more powerful aesthetic responses to music, such as awe, spirituality and being moved (Kivy, 1990, 1999; Konečni, 2005). In the example, the listener was experiencing the positive *emotion valence* of joy (internal locus) but then had a negative *affect* evaluation of embarrassment. The Van Zijl and Sloboda study (2011, see quote in **Supplementary Table 1**, row 19, Main Findings) further exemplifies this separation through the felt emotion in response to the utilitarian task of learning the piece (affect valence: frustration, remain calm) and the *emotion* valence experienced in response to the music (e.g., again from the quote in **Supplementary Table 1**, peaceful, happy).

The separation of affect valence and emotion valence, although at times non-trivial, is proposed as a way of resolving previous confusion in the literature about some "mixed" emotional responses (an additional example is provided in **Table 2**). Simple preference (liking, loving, hating) is a typical example of affect valence found in the literature.

The affect/emotion valence distinction is explained from the cognitive theoretical standpoint of dissociation theory, according to which, when listening to music, we are usually in a state in which we "switch off" pain circuits<sup>6</sup>, meaning that we can enjoy negative emotions without the unpleasant negative affect valence (Schubert, 1996, 2009–2010). In the example above, the embarrassed (negative *affect* valence) response meant that the individual was not in a dissociated state and could, as a result, not (or no longer) enjoy the music that, in another context, he or she may have liked very much (positive affect valence). Recent classifications of descriptive adjectives have started to separate groups of terms in ways compatible with dissociation theory. For example, Juslin and Laukka (2004) propose that some emotions are more suitable for inducing in the listener, while others are more apt for being expressed by the music. Being *moved*, *amazed*, or *enchanted* are presented as examples of induced (but not expressed) emotions. By revising the way affect qualities are conceptualized, these may be understood as unique to induction (rather than expression) *because* they are affect valences (rather than emotion valences). Being moved might be a result of feeling sad, or happy (emotion valence), or some other experience(s) which led to the affective response of being moved.

<sup>6</sup>The term "circuits" is used here in a cognitive psychology, metaphorical context, rather than in a literal neuroscience way.

My point is that by differentiating (dissociating) between emotion valence and affect valence, affect/emotion blends can be more simply understood than the otherwise complex responses we appear to have to music. "The music makes me sad, and that gives me pleasure" suggests a negative *emotion* valence of sadness, but a positive *affect* valence of pleasure. There is no need to view sadness and pleasure as conflicting. From a theoretical stance, the pleasure indicates that the listener is in a dissociated state, meaning that negative valence affects are inhibited, and so all emotions (negative and positive) can be enjoyed. Zentner et al. (2008) present a statistically determined grouping of music evoked adjectives, producing nine clusters of word groups, two of which—"wonder" and "transcendence"—actually fit well with the affect valence concept (with terms such as "amazed", "moved", feeling of "spirituality"), while the other clusters indicate adjectives more representative of emotion valence. However, terms such as "irritated" (part of the "tension" cluster) are more typically concerned with affect valence—being irritated by a piece of music is a reason that the listener might stop listening, rather than experience as an emotion that she or he can contemplate (Schubert, 2013). Dissociation theory may provide a solution to one of the enduring debates on emotion locus and whether there exists a special set of aesthetic or musical emotions that are activated only when an artistic (musical) activity or thought takes place, distinct from utilitarian emotions experienced in everyday life (for further discussion, see Kivy, 1989; Krumhansl, 1997; Pouivet, 2000; Khalfa et al., 2002; Krumhansl, 2002; Scherer, 2005; Silvia and Brown, 2007; Silvia, 2009; Barrett et al., 2010; Peretz, 2010; Perlovsky, 2010; Juslin, 2011; Juslin et al., 2011; Chan et al., 2013; Juslin, 2013).

#### **REVISION OF GABRIELSSON'S LOCUS RELATIONSHIPS**

Gabrielsson's categories of emotion locus relationships have provided an important framework for encouraging direct engagement with and awareness of the question of emotion locus in music. The current review suggests that the categories may be reworked and calibrated to reflect the current distinctions between loci of which the research community is now aware. **Table 2** summarizes the revised relationships and reports possible explanatory mechanisms (Juslin, 1997; Juslin and Västfjäll, 2008; Schubert, 2007b, 2009–2010). Reworded are the terms "positive" vs. "negative" (or "opposite") relationships—now "matched" vs. "unmatched," respectively to reserve the former terms for the conventional use of emotion and affect valence (positive/negative). "No systematic relationship" has been absorbed into "unmatched" to reflect the non-matching nature of emotion pairs that are neither "opposite" nor "positive" (referred to as contrapositive by Evans and Schubert, 2006), such as sad and excited, not just those that are directly "opposite" on an emotion-space geometry (such as sad and happy).

If we eliminate the no-systematic relationship category in those cases when it is due to instability of responses over time, or those that do not concern emotional relationships between expressed and felt emotions on any occasions (no systematic relationship, and no relationship), we end up with two broad categories: matched vs. unmatched relationships. The omission of the non-systematic and no relationships is justified by two findings: (1) when no emotion is reported in both the felt and expressed loci, the categorization becomes irrelevant (see, e.g., Sloboda and Juslin, 2010, p. 83), because the participant may be having formalist, cognitive or no responses to the music, but without emotion; and (2) no emotion in one locus and some emotion, or even none, in another can be subsumed by the unmatched emotion locus relationship.

The proposed scheme attempts to revise Gabrielsson's set of relationships to bring them into line with the current state of research on emotion locus in music. The theoretical organizing principle of the revision is based on our understanding, assumptions, and limitations of the emotional contagion processes, and its interactions with a dissociation mechanism and the Lens model inspired decoding theory, discussed above.

Thus, in the revised format two main categories of relationships are proposed: matched and unmatched, referring to whether the expressed emotion is reflected in the felt emotion. A third main category of relationships is included, which is referred to as complex/mixed relationships to account for the possibility of both matched and/or unmatched relationships occurring at the same time. This category is reducible to matched and/or unmatched relationships occurring multiple times and/or simultaneously, and so the complex/mixed relationship is provided for completeness rather than necessity. Furthermore, this category allows for convenient discussion of the interesting area of research concerning mixed emotions portrayed and evoked by music (Evans and Schubert, 2008; Hunter et al., 2008, 2010; Barrett et al., 2010; Juslin, 2011; Juslin et al., 2011).

Subcategories can be attached to the two main categories based on the valence of the relations (for matched, positive [expressed] to positive [felt] or negative to negative, and for unmatched, positive to negative and vice versa; see **Table 2**). The key issue here is the position of the boundary delineating matched vs. unmatched emotion pairs. For magnitude comparisons of the same emotion variable (e.g., rating happiness on a 1–10 continuum for felt and expressed emotion), conventional inferential statistical procedures can be used to determine matched (no difference) or unmatched (different) loci. However, for discrete emotion words, or when emotions are plotted on an affect grid (Russell, 1980; Russell et al., 1989b), the analysis can be more involved. For example, in some contexts it might be sufficient to refer to a calm emotion expressed as being matched with a happy emotion felt. The current taxonomy does not explicitly dictate where the boundary between a matched vs. unmatched emotion pair lies (e.g., whether "happy" and "calm" are matched emotions or not). Furthermore, for discrete emotion words the boundary may be fuzzy or ill-defined. Evans and Schubert (2008) developed a criterion using Euclidean distance between valence (x-axis) and arousal (y-axis) (assuming the two variables to be orthogonal as per Russell, 1979, 1980), and selected a (more or less arbitrary, but conservative) angle in the space about which to identify whether a pair of emotions are matched vs. unmatched. After some experimentation, they selected an angle of 45° in the polar coordinate system within which the emotions were classified as matched. A "within same quadrant" (in a two-dimensional emotion space) approach is also a plausible, though less conservative, approach.
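The two geometric criteria described above can be illustrated with a minimal sketch. It is not the procedure of Evans and Schubert (2008), whose exact computation is not given here. It assumes that each emotion is a point (valence, arousal) with the neutral point at the origin, and that a pair is "matched" when the angular separation of the two points is at most 45°. The function names and the example coordinates are hypothetical.

```python
import math

Point = tuple[float, float]  # (valence, arousal), neutral point assumed at (0, 0)


def angular_separation_deg(a: Point, b: Point) -> float:
    """Smallest angle, in degrees, between the polar directions of a and b."""
    angle_a = math.degrees(math.atan2(a[1], a[0]))
    angle_b = math.degrees(math.atan2(b[1], b[0]))
    diff = abs(angle_a - angle_b) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def matched_by_angle(expressed: Point, felt: Point, limit_deg: float = 45.0) -> bool:
    """Angle criterion: matched if the separation does not exceed limit_deg.

    Treating 45 degrees as the maximum separation is an assumption of this sketch.
    """
    return angular_separation_deg(expressed, felt) <= limit_deg


def matched_by_quadrant(expressed: Point, felt: Point) -> bool:
    """Quadrant criterion: matched if valence and arousal share the same signs.

    Points lying exactly on an axis are counted as non-negative here.
    """
    return (expressed[0] >= 0) == (felt[0] >= 0) and (expressed[1] >= 0) == (felt[1] >= 0)


# Hypothetical coordinates, for illustration only.
happy, excited, calm = (0.8, 0.5), (0.6, 0.8), (0.7, -0.5)

for label, pair in {"happy/excited": (happy, excited), "happy/calm": (happy, calm)}.items():
    print(label, matched_by_angle(*pair), matched_by_quadrant(*pair))
```

With these illustrative coordinates the two criteria can disagree: a pair such as (0.9, 0.1) and (0.1, 0.9) shares a quadrant but is separated by more than 45°, which is the sense in which the quadrant approach is the less conservative of the two.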

If this geometric approach to differentiating matched vs. unmatched emotions continues to be adopted, then it would be interesting to see if this empirically determined boundary angle could be reproduced in such a way as to group qualitatively similar emotions together, such as applying inferential statistical tests to subtended angles (Mardia and Jupp, 2000). In practice, it is unlikely that such boundaries will be fixed and stable given that a single emotion does not necessarily occupy the same location on an emotion-space, because there exists some fluidity of meaning within and across individuals and cultures (Russell et al., 1989a; Schubert, 1999). Furthermore, investigation of locus relationships using discrete emotions is in need of further attention, because nearly all of the tabulated studies used rating items, reflecting an interest in magnitude of emotion rather than the semantic distinctions of Russell's "circumplex" model of emotion.
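Statistical treatment of subtended angles requires directional rather than linear statistics, because angles wrap around at 360°. As a minimal sketch of the kind of quantity such analyses build on (not an analysis reported in the reviewed studies), the following computes the circular mean and the mean resultant length of a set of angles. The function names and the example angles are hypothetical.

```python
import math


def circular_mean_deg(angles_deg: list[float]) -> float:
    """Mean direction of the angles, in degrees within [0, 360]."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def mean_resultant_length(angles_deg: list[float]) -> float:
    """Concentration of the angles: near 1 when tightly clustered, near 0 when dispersed."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.hypot(s, c) / len(angles_deg)


# Hypothetical angles on either side of the 0/360 degree boundary.
angles = [350.0, 10.0]
print(circular_mean_deg(angles), mean_resultant_length(angles))
```

An arithmetic mean of 350° and 10° would give 180°, the opposite side of the emotion space, whereas the circular mean lies at the 0°/360° boundary between the two angles.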

#### **DISCUSSION AND CONCLUSIONS**

Gabrielsson's rediscovery and systematic organization of the relationships between felt and perceived emotion during a musical experience can be considered significantly responsible for opening up a new topic of music psychology and in regard to emotion research in general because it brought to focus the distinction between the locus of emotion when the reactions are not between two sentient beings, but between one sentient being and a piece of music. This review identified the key issues that have emerged in subsequent years and proposed three cognitive based theoretical frameworks to help focus further research. The theories of emotional contagion, decoding and dissociation were presented.

Emotional contagion explains why we tend to feel the emotion that music is conveying. Through the inhibition of this felt emotion, we feel emotion at a lower magnitude than the corresponding expressed emotion. This "inhibited-contagion" account will need more research to determine why there would then be any situations when felt emotions were rated as stronger (in magnitude) than expressed emotions. According to the inhibited contagion account, these situations should not occur, but they do (albeit relatively infrequently in the studies reviewed). Furthermore, is inhibited-contagion a satisfactory explanation for unmatched emotions across loci when the emotions are discrete (such as happy, sad, angry, calm)? And does the related idea of decoding theory mean that internal locus will have more variability than external locus? Such a view needs to account for the question of whether the judgment of an observed object (or piece of music) is really sensed in a fixed, stable way by the perceiver (for philosophical views on this matter, see Townsend, 1987). That is, contagion theory has the implication that it is possible to objectively "know" the external locus emotions, for example through examination of musical features. However, external locus of emotion must also be simulated by the perceiver—for example, illusions demonstrate that the thought-to-be-observed object is not always isomorphic with the physical stimulus (Coren and Girgus, 1978)—and this is a view that has not been tested in the tabulated literature, even though it is important in philosophical aesthetics and is attracting interest in neuroscientific research (e.g., Sevdalis and Keller, 2009; Novembre et al., 2012).

Dissociation theory attempts to address confusion in simultaneously differential responses to music within locus—such as experiencing high preference for music that makes that listener *feel* sad (both internal-locus responses). This theory allows researchers to understand that different, apparently conflicting feelings can be experienced at the same time, without having to dogmatically attribute them to one or the other locus of emotion (e.g., the music is sad, and I like that). Contagion and dissociation principles may work hand in hand in the perception of music: under normal, real-life, day-to-day circumstances it might be undesirable and even dangerous to acquire emotion solely through contagion transfer (e.g., feeling angry when someone else is angry—experiencing fear might be more adaptive. See Preston and De Waal, 2002, who also provide a neuroscientific explanation of inhibited contagion). That is, the individual experiences "utilitarian" rather than "aesthetic" emotions (Scherer, 2004). The day-to-day circumstance is one where emotion valence and affect valence are not dissociated, allowing adaptive responses (the anger leads to a negative affect valence). In music listening and other aesthetic contexts, dissociation of emotion valence from affect valence allows contagion to operate unhindered by day-to-day, utilitarian circumstances and can be enjoyed. In other words, according to dissociation theory, affect valence is always positive in an aesthetic context. The context is the cause of the dissociation between emotion valence and affect valence.

Preference being higher when internal and external loci are matched also requires further investigation. It could be that preference can be implicitly measured by how well the two loci are matched—that is, the smaller that gap across emotion loci, the greater the preference. But determining the causal chain will be of interest, too—whether the preference causes the locus to be different, whether the difference in locus causes preference to change, or whether some other variables are involved. Another interesting research question is which mechanism can explain this finding. Dissociation theory and contagion theory can explain the preference through the activation (which is pleasurable) of contagion circuits. Contagion circuits are the same or related to mirror circuits (e.g., Koelsch et al., 2006) and to the positive affects of empathy. But the nature of those circuits, as applied to the proposed cognitive theoretical frameworks, is in need of further investigation. For example, no studies have explicitly attempted to examine whether neural pathways for processing felt emotion might be different from the neural pathways for processing expressed emotions (e.g., see Blood and Zatorre, 2001; Peretz et al., 2004; Menon and Levitin, 2005; Brattico et al., 2009). Contagion-circuit theory assumes both loci have shared pathways (Preston and De Waal, 2002), but none of the data reported explicitly aims to test this assertion.

Trait and personality effects, including behavioral inhibition/activation systems (Kallinen and Ravaja, 2006), absorption (Kreutz et al., 2008a; Garrido and Schubert, 2011a; Herbert, 2011), rumination (Garrido and Schubert, 2013; Wilhelm et al., 2013) and so on (e.g., Rentfrow and Gosling, 2003; Rentfrow et al., 2011), may each impact the way individuals differ in their felt and perceived judgments of music, but apart from Kallinen and Ravaja, little attention has been given to the effect of trait upon emotion locus. Psychological, physical and physiological states may also influence emotion loci relationships in music, but only a single study has examined this in the tabulated literature (Dibben, 2004, see **Supplementary Table 1**, rows 1 and 2).

The wide range of task wordings may be in need of some standardization, with the name of the locus variable producing little agreement across studies and therefore unnecessary confusion (for example, when performing a database search on the topic). While the wording of the levels within the locus variable can be quite flexible, Konečni (2008) demonstrated that the wording and detail of the task can also affect the results. Furthermore, the reader needs to be clear on the grammatical subject of the locus level (the *music* expresses and the *listener* perceives external locus emotion). More consistency in the labeling used to describe the variable in question is urged. In this review, the term "locus of emotion" or "emotion locus" has been adopted.

The present review calls for a firmer theoretical stance to help direct future research. Interestingly, one of the rarely cited, early writers on the locus relationship had a premonition of a useful theoretical framework for understanding emotion locus in music, with Vernon Lee's idea of "sympathetic imitation," which in contemporary literature resembles emotional contagion—the dominant theoretical framework that provides a basis for explaining a large portion of the results of the literature investigated in this review.

Finally, based on an examination of the literature published in the decade after Gabrielsson's seminal work, some rearrangements and suggestions have been made that may assist future researchers investigating relationships between emotion expressed by music and emotion felt by the listener in response to music. The data examined and newly arising findings were used to rearrange Gabrielsson's categorization. Reflecting the growing interest in explaining why emotion ratings are different across locus in certain situations, the simplifying nomenclature of "matched" vs. "unmatched" emotion pairs across loci was used in this review as the basis of formulating locus relationships. There is much work to be done in understanding locus relationships, and this review, if successful, should soon become an interim report on the state of the art in the locus of emotion in music.

#### **ACKNOWLEDGMENTS**

I am grateful for the detailed and helpful suggestions made by the reviewers. This research was supported by an Australian Research Council Future Fellowship, FT120100053.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fpsyg.2013.00837/abstract

**Supplementary Table 1 | Summary of emotion locus in music studies reviewed.**

#### **REFERENCES**


Kivy, P. (1989). *Sound Sentiment: An Essay on the Musical Emotions, Including the Complete Text of The Corded Shell.* Philadelphia, PA: Temple University Press.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 April 2013; paper pending published: 03 July 2013; accepted: 22 October 2013; published online: 17 December 2013.*

*Citation: Schubert E (2013) Emotion felt by the listener and expressed by the music: literature review and theoretical perspectives. Front. Psychol. 4:837. doi: 10.3389/fpsyg.2013.00837*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Schubert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Dynamic musical communication of core affect

#### *Nicole K. Flaig and Edward W. Large\**

*Music Dynamics Lab, Department of Psychology, University of Connecticut, Storrs, CT, USA*

#### *Edited by:*

*Daniel J. Levitin, McGill University, Canada*

#### *Reviewed by:*

*Sarah Creel, University of California at San Diego, USA Silke Anders, Universität zu Lübeck, Germany Emery Schubert, University of New South Wales, Australia*

#### *\*Correspondence:*

*Edward W. Large, Music Dynamics Lab, Department of Psychology, University of Connecticut, 406 Babbidge Road, Unit 1020, Storrs, CT 06269-1020, USA e-mail: edward.large@uconn.edu*

Is there something special about the way music communicates feelings? Theorists since Meyer (1956) have attempted to explain how music could stimulate varied and subtle affective experiences by violating learned expectancies, or by mimicking other forms of social interaction. Our proposal is that music speaks to the brain in its own language; it need not imitate any other form of communication. We review recent theoretical and empirical literature, which suggests that all conscious processes consist of dynamic neural events, produced by spatially dispersed processes in the physical brain. Intentional thought and affective experience arise as dynamical aspects of neural events taking place in multiple brain areas simultaneously. At any given moment, this content comprises a unified "scene" that is integrated into a dynamic core through synchrony of neuronal oscillations. We propose that (1) neurodynamic synchrony with musical stimuli gives rise to musical qualia including tonal and temporal expectancies, and that (2) music-synchronous responses couple into core neurodynamics, enabling music to directly modulate core affect. Expressive music performance, for example, may recruit rhythm-synchronous neural responses to support affective communication. We suggest that the dynamic relationship between musical expression and the experience of affect presents a unique opportunity for the study of emotional experience. This may help elucidate the neural mechanisms underlying arousal and valence, and offer a new approach to exploring the complex dynamics of the how and why of emotional experience.

**Keywords: neurodynamics, consciousness, affect, emotion, musical expectancy, oscillation, synchrony**

#### **INTRODUCTION**

Every known civilization creates unique and sophisticated musical forms to communicate affect, and humans consciously seek out musical experiences because of the feelings they evoke. There is widespread agreement that music induces emotional experiences, and that at least some aspects of this phenomenon are universal. But what is the nature of musical feelings and what is the relationship between musical feelings and emotions? Over the years, competing theories have been developed to address these issues, and sophisticated experimental paradigms have been devised to investigate them. However, incommensurate claims and variable findings leave many open questions. Are affective responses to music accidents of evolution? Is there something special about musical communication? Can the study of emotion teach us anything about the nature of music? And what, if anything, can music teach us about the nature of emotional experience?

Since Meyer's (1956) pioneering work in music and emotion, theorists have struggled to explain how music could stimulate emotional experiences by somehow triggering basic, evolutionarily ancient psychological processes tied to survival. Empirical approaches have sought specific behavioral and/or physiological responses to music (Juslin and Sloboda, 2001). One specific problem that arises is the question of how an object-directed emotion can be evoked by music with no external referent. Another is how a sophisticated, non-referential form of cultural expression, learned over many years of exposure, could engender innate, survival-related responses.

The goal of this paper is to stake out some new territory in the debate on musical emotion. First, we will ask whether emotional experiences are really "basic" (Allport, 1924; Izard, 1971), or whether they are psychologically constructed from domain-general processes (Barrett, 2009a). Next, we will explore the idea that all conscious systems consist of multiple neural processes, produced by spatially dispersed events in the physical brain and integrated into a seamless neurodynamic whole through synchrony of neural oscillations (Edelman, 2003). Core affect is thought to be one primitive, domain-general aspect of the dynamic core of consciousness (Russell, 2003) especially relevant to emotional experience (Barrett, 2009a). Then, we will review recent evidence that neuronal synchrony with music gives rise to musical qualia including tonal and temporal expectancies (Large and Jones, 1999; Lerdahl, 2001; Huron, 2005; Large, 2010a). Finally, we will argue that music-synchronous responses couple into the dynamic core of consciousness, directly modulating core affect.

#### **THEORIES OF MUSIC AND EMOTION**

Music does not have obvious survival value (cf. Pinker, 1997) and yet is able to elicit strong emotional reactions. Many biological perspectives consider the primary function of emotion as a response to behavioral demands that may require mobilization for action; they evolved to prepare an individual to deal with situations that were significant for survival and reproduction (Darwin, 1872/1965; Cosmides and Tooby, 2000; Huron, 2005, 2006). Darwin (1872/1965) proposed that the origin of the musical communication of emotion was to be found in the evolutionary process of sexual selection.

Psychological approaches to musical emotions have been heavily influenced by the theory of basic emotion. Basic emotion theorists have sought to identify categories of emotions that share distinct collections of properties such as patterns of autonomic nervous system activity and behavioral responses or action tendencies (Allport, 1924; Izard, 1971; Ekman, 1972; Panksepp, 1998). Other approaches consider emotions to be constructed from more primitive processes, including affect (Irons, 1987; Ortony et al., 1988; Russell, 2003; Barrett, 2006, 2009b; Duncan and Barrett, 2007). Mechanisms of musically induced emotion have been explored at great length with varying causations, interpretations, and results.

Meyer's (1956) approach to musical emotion has been highly influential in part because he was the first to seriously take into account both philosophical works (e.g., Langer, 1951) and psychological theories of emotion (cf. Dewey, 1895; MacCurdy, 1925; Angier, 1927; Rapaport, 1950). Meyer observed that namable emotions are – unlike music – event-directed, and that emotional experiences are much more subtle than the "crude and standardized words we use to denote them." He also observed that emotional responses are not innate, that they are highly variable, and that they depend on learning and enculturation (Meyer, 1956). He therefore concluded that, for the most part, music does not communicate genuine emotions. "That which we wish to consider," wrote Meyer, "is that which is most vital and essential in emotional experience, the feeling-tone accompanying emotional experience, that is, the affect" (Meyer, 1956, p. 12).

Meyer was also the first to suggest that what is now called statistical learning applies to music and determines musical feelings. For example, a passage of tonal music leads to the feeling that some pitches are more stable than others. More stable pitches are felt as points of repose, and less stable pitches are felt to point toward, or be attracted to the more stable ones (Lerdahl, 2001). Such relationships are reflected in naming conventions of many musical cultures, including Western, Indian and Chinese (Meyer, 1956). Based on the failure of earlier attempts to account for musical communication based on vibrations, ratios of intervals, and so on, he argued that feelings of stability and attraction are learned through experience with the music of a particular culture. Moreover, the associations of musical moods, such as happy and sad, with major or minor harmonies, or the affective qualities associated with ragas in North Indian tonal systems are conventional designations, having little to do with the sound itself (Meyer, 1956).

Finally, Meyer (1956) argued that the frustration of expectancy is the basis for affective responses to music. He believed that affect is aroused when an action tendency is inhibited. Music, unlike other emotional stimuli, is not referential; it both creates and inhibits expectancies thereby providing meaningful and relevant resolutions within itself. Music communicates affect through violations and resolutions of learned expectancies.

The latter two points were taken up by modern empiricists and theorists, who studied musical expectancy and statistical learning in a variety of musical domains, and from various points of view (e.g., Krumhansl, 1990; Narmour, 1990; Large and Jones, 1999; Tillmann et al., 2000; Lerdahl, 2001; Huron, 2005; Temperley, 2007). Perhaps the most comprehensive attempt to extend Meyer's expectancy theory of musical emotion is Huron's "Sweet Anticipation" (2006). Huron argues that a fundamental job of the brain is to make predictions about the world, and successful predictions are rewarded. Within tonal context, the most stable pitches are experienced as most pleasant; within a metrical context, events that occur at expected times are more pleasurable. Thus, Huron argues, music is fundamentally a hedonic experience.

Huron emphasizes that tonal and temporal expectancies in music are learned. Musical events evoke distinctive musical qualia, and Huron reviews the body of empirical evidence showing that qualia such as stability and attraction ("scale degree qualia") correlate with statistical properties of music (Huron, 2006). He agrees with Meyer that the associations of major and minor modes with happy and sad qualia are learned associations.

In a key break from Meyer, however, Huron argues that expectancy evokes emotions, not merely affect. He adopts a two-process approach, which posits a fast time-scale *reaction* and a slow time-scale *appraisal* (LeDoux, 1996). Specific emotional responses involve primitive circuits that are conserved throughout mammalian evolution, and function relatively independently of cognitive circuits (LeDoux, 2000). He hypothesizes, for example, that unexpected events in music activate the neural circuitry for fear, leading to the feeling of surprise. He goes so far as to suggest that basic survival-related responses, including fight, flight and freezing, lead to the specific subjective musical experiences of frisson, laughter, and awe, respectively.

Juslin and Vastfjall (2008) consider emotions to be affective responses that involve subjective feelings, physiological arousal, expressions, and action tendencies. They, too, reject Meyer's claim that music does not induce genuine emotions, because musical responses can display all these features. They endorse the notion that emotions involve intentionality<sup>1</sup>; emotions are "about" something. However, they claim that music induces a wide range of both basic and complex emotions because music triggers a variety of psychological mechanisms beyond expectancy. They go on to describe how brain stem reflexes, evaluative conditioning, visual imagery, episodic memory, and emotional contagion can lead to genuine emotional responses. Brain stem reflexes trigger emotional responses because acoustic characteristics are taken by the brain stem to signal an urgent event. Evaluative conditioning is a special kind of classical conditioning in which a stimulus without an emotional meaning, e.g., music, is consistently paired with an emotional experience, eventually coming to trigger the emotional response. Presumably, the learned pairing of major with happy and minor with sad (Meyer, 1956; Huron, 2006) would be an example of this phenomenon. Triggering of visual imagery as well as episodic memories (Janata et al., 2007) can also lead to emotional experiences (Sloboda and O'Neill, 2001). In all of these cases the emotional responses are intentional – they are about something.

<sup>1</sup> We use the word intentionality to refer to any mental phenomena that have referential content (see glossary).

Emotional contagion, in Juslin and Vastfjall's (2008) view, is a process in which a listener perceives the emotional expression of the music, and then "mimics" this expression internally, as with other forms of interpersonal interaction like bodily gestures, facial gestures, and speech (Juslin and Vastfjall, 2008). In their conceptualization, music evokes basic emotions with distinct nonverbal expressions (Juslin and Laukka, 2003; Laird and Strout, 2007), and this process operates similarly to emotional contagion via facial and vocal expressions of emotion (Tomkins, 1962; Ekman, 1993). Emotional contagion is linked to activation of the so-called mirror neuron system (Rizzolatti and Craighero, 2004), and Juslin (2001) suggests that music can operate in this way because in some sense it imitates other forms of social interaction. Below we will offer a somewhat different view of contagion.

If we consider the full spectrum of phenomena discussed by Juslin and Vastfjall (2008), it seems clear that a wide range of emotional responses can be triggered by music. However, from a musical point of view, some of these mechanisms are more interesting than others. Loud unexpected sounds can frighten us, and auditory stimuli can trigger conditioned responses. But in these examples, music serves merely as a trigger. Many other kinds of stimuli can trigger such responses equally well; they need not be musical or even auditory (e.g., LeDoux, 2000). Episodic memory and visual imagery likely account for a significant proportion of the emotional responses people experience on a day-to-day basis (Janata et al., 2007). Moreover, there are many reasons to believe that music is especially effective at eliciting episodic memories (Janata et al., 2007; Eschrich et al., 2008). If we could understand the fundamental mechanisms of musical communication, this might help us to understand why episodic memories are so effectively evoked by music. Here, we take a different approach to understanding the ability of music to elicit feelings, one that does not treat music merely as a trigger but rather focuses on fundamental dynamic mechanisms of affective communication.

The remainder of this article is concerned primarily with musical expectancy and contagion, as these are mechanisms that seem to us to be most inherently musical. Expectancies arise in response to complex, explicitly musical structures such as tonality and meter. Contagion is a kind of empathic resonance (Molnar-Szakacs and Overy, 2006; Chapin et al., 2010) that enables music to function as a type of interpersonal communication. Our approach will be to link expectancy and contagion with the dynamics of the physical brain. This involves addressing several basic questions. What is the nature of emotional experience: are emotions basic, evolutionarily adapted and cross-cultural, or are emotions constructed from more fundamental psychological ingredients? Are musical qualia based solely on learned contingencies, or do they arise from intrinsic neurodynamics? And what is the nature of the relationship, such that music is able to elicit affective experiences?

#### **EMOTION, AFFECT, AND CONSCIOUSNESS**

Influenced by Darwin's (1872/1965) theory of pan-cultural emotions, Tomkins (1963) and Ekman et al. (1987) argued that emotions are genetically determined products of evolution. Basic emotions are discrete, and each category shares a distinctive collection of properties including patterns of autonomic nervous system activity, behavioral responses or action tendencies, and a set of emotion-specific brain structures that are thought to mediate these particular "basic" emotions. Each basic emotion derives from a particular causal mechanism: an evolutionarily preserved module in the brain (Tomkins, 1963; Ekman, 1992; Panksepp, 1998). Ekman (1984) proposed that the natural boundaries between types of emotion could be determined by differences in facial expression. Huron's (2005) approach to musical emotion and Juslin and Vastfjall's (2008) multiple mechanisms theory tend to endorse the basic emotion view. On this view, basic emotions are cross-cultural and non-basic emotions are specific to cultural upbringing. However, there is little agreement about which emotions are basic, how many emotions are basic, and how basic emotions are defined.

Recent behavioral, psychophysiological, and neural findings (e.g., Barrett, 2006; Pessoa, 2008; Lindquist et al., 2012) have led a number of emotion theorists to question the basic emotion view (Ortony and Turner, 1990; Russell and Barrett, 1999; Duncan and Barrett, 2007). An alternative approach holds that diverse human emotions result from the interplay of more fundamental domain-general processes (Russell, 2003; Pessoa, 2008; Barrett, 2009b). Psychological constructionists argue that emotions are culturally relative, learned, and, though they are a result of evolution, they are not biologically basic (Russell and Barrett, 1999; Duncan and Barrett, 2007). Emotions are the combination of psychologically primitive processes that encompass both affective and intentional components. A specific emotion is not the invariable result of activation in a particular brain area; neural circuitry realizes more basic processes across emotion categories (Pessoa, 2008; Wilson-Mendenhall et al., 2013). Meyer's approach to the musical communication of affect is consistent with this view.

Contemporary neurodynamic approaches hypothesize that all conscious states are a multimodal process entailed by physical events occurring in the brain (Tononi and Edelman, 1998; Engel and Singer, 2001; Searle, 2001; Seth et al., 2006; Pessoa, 2008). The neural structures and mechanisms underlying consciousness contribute domain-general processes to many psychological phenomena. Importantly, when spatially distinct areas contribute to the contents of consciousness, they enter into a unified neurodynamic core (e.g., Edelman, 2003). Neurodynamic theories of consciousness propose that the synchronous activations of the thalamocortical system give rise to the unity of conscious experience (Edelman and Tononi, 2000; Varela et al., 2001; Cosmelli et al., 2006). Binding of spatially distinct processes is thought to occur through enhanced synchrony in gamma and beta band rhythms (Engel and Singer, 2001; Fujioka et al., 2012), and high frequency activity is modulated by slower rhythms such as delta and theta (Lakatos et al., 2005; Buzsáki, 2006; Canolty et al., 2006).

Intentionality and affect are fundamental properties of conscious experience (Searle, 2001). Conscious processes point to or are about something (Brentano, 1973; Searle, 2000), and they possess a valence and a level of activation (Barrett, 1998, 2006; Searle, 2000). Searle's theory of consciousness (Searle, 1992, 2004), Edelman's dynamic core theory (Edelman, 1987; Edelman and Tononi, 2000) and Damasio's somatic marker hypothesis (cf. Damasio, 1999) all emphasize dynamic processes that encompass both intentionality (e.g., appraisal, see Scherer, 2001; Smith and Kirby, 2001) and affect (Russell and Barrett, 1999; Davidson, 2000). An emotional experience includes affect as one important ingredient, but intentional psychological processes – perception, cognition, attention, and behavior – are also necessary (Pessoa, 2008; Barrett, 2009b). To a great extent, the difference between an emotion and a cognition depends on the level of attention paid to the core affect (Russell, 2003).

Affect can be characterized as a fluctuating level of valence (pleasure/displeasure) and arousal (activation/deactivation; Wundt, 1897; Russell, 2003; Barrett and Bliss-Moreau, 2009). It is the most elementary consciously accessible sensation evident in moods and emotions (Russell, 2003). Core affect is so called because it is thought to arise in the core of the body or in neural representations of body state change (Russell and Barrett, 1999; Russell, 2003). It has been observed in subjective reports (Barrett, 2004), in peripheral nervous system activation (Cacioppo et al., 2000), and in facial and vocal expression (Cacioppo et al., 2000; Russell, 2003). The experience of core affect is thought to be present in infants (Lewis, 2000) and psychologically universal (Russell, 1991; Mesquita, 2003).

Intentional thought and affective experience are thought to arise as dynamic aspects of spatially distinct dynamic processes, integrated through synchrony of neural oscillations (Tononi and Edelman, 1998; Searle, 2001; Seth et al., 2006). Let us attempt to illustrate this idea, emphasizing dynamic over spatial aspects, by integrating over neural location. The result is an average, summarizing the activity of multiple brain areas, as shown in **Figure 1**. The dynamic properties of this pattern are the critical features; intentionality and affect correspond to dynamic aspects of the integrated neural activity. In this illustration, affective aspects correspond to changes in higher frequency activity, while intentional aspects take place at lower frequencies, and appear as amplitude modulations. This is only a visual aid, of course; we do not know enough to speculate about which frequency bands or dynamic features might correspond to intentionality and affect. Here we oversimplify to illustrate the point that relevant aspects of experience may correspond to dynamical aspects of integrated neural processes. If this approach is on the right track, however, then this way of thinking about core affect may lead to a better understanding of why music is such an especially effective means of affective communication.

#### **MUSICAL NEURODYNAMICS**

At any given moment, a unified neurodynamic process is shaped by exogenous sensory input such as sights or sounds, input from the body such as vestibular sensations, endogenous constructs such as autobiographical memories, and by communication sounds such as music and speech. It seems likely that if intentionality and affect are different dynamic aspects of these spatiotemporal patterns, then different kinds of communication sounds may couple into different aspects of the dynamics. Of course, it is well established that different modes of auditory communication, i.e., music and speech, convey more of one aspect or the other. Speech primarily communicates intentionality; it is "about" events in the external world. Nevertheless, certain aspects of speech, such as prosody, directly communicate affect. Music, on the other hand, communicates primarily affect; it is most often not "about" anything. However, music can signify objects or events, and it can evoke memories and images. Thus, both types of signals can induce emotions, although in different ways. What we want to suggest is that music may couple directly into affective dynamics because it causes the brain to resonate in certain ways.

Nonlinear oscillation and resonant responses to acoustic signals are found at multiple time scales in the nervous system, from thousands of Hertz in the auditory nerve and brainstem, to cortical oscillations in delta, theta, beta, and gamma ranges. The relative timescales of these processes are illustrated in **Figure 2**. From the earliest stages of the auditory system, volleys of action potentials time-lock to dynamic features of acoustic waves (Joris et al., 2004; Laudanski et al., 2010). Time-locked brainstem responses are thought to be important in the perception of pitch, which is observed from 30 Hz (Pressnitzer et al., 2001) up to about 4000 Hz (Plack and Oxenham, 2005). In auditory cortex, endogenous cortical oscillations entrain to low frequency rhythms of acoustic stimuli (Lakatos et al., 2008; Nozaradan et al., 2011). Cortical entrainment is thought to be important in the perception of rhythm, which extends from about 8 Hz (Repp, 2005a) to ultra low cortical frequencies (Buzsáki, 2006; Large, 2008). Between the timescales of pitch and rhythm lie the frequencies thought to be important in binding neural processes into unified conscious scenes (Engel and Singer, 2001; Seth et al., 2006; Fujioka et al., 2012).

#### **PITCH AND TONALITY**

In central auditory circuits, action potentials phase- and mode-lock to the fine time structure and the temporal envelope modulations of auditory stimuli at many different neural levels (Langner, 1992; Large et al., 1998; Joris et al., 2004; Laudanski et al., 2010). Neural synchrony is thought to be important in pitch perception (Cariani and Delgutte, 1996; Hartmann, 1996), consonance (Ebeling, 2008; Shapira Lots and Stone, 2008), and musical tonality (Tramo et al., 2001; Large, 2010a). While phase-locking is well established, mode-locked spiking patterns have recently been reported in the mammalian auditory system (Laudanski et al., 2010) and may explain the highly nonlinear responses to musical intervals that can be measured in the human auditory brainstem response (Lee et al., 2009; Large and Almonte, 2012; Lerud et al., 2013).

Mode-locking implies binding between neural frequencies that display particular frequency relationships (Hoppensteadt and Izhikevich, 1997). In this form of synchrony a periodic stimulus interacts with intrinsic neural dynamics, causing *m* cycles of the oscillation to lock to *k* cycles of the stimulus. Mode-locking leads to neural resonance at harmonics (*k*∗*f1*), subharmonics (*f1*/*m*), summation frequencies (e.g., *f1* + *f2*), difference frequencies (e.g., *f2* − *f1*), and integer ratios (e.g., *k*∗*f1*/*m*)<sup>2</sup>. This implies feature binding based on harmonicity (Bregman, 1990), and suggests a role for mode-locking in the perception of pitch (cf. Cartwright et al., 1999). This also predicts a significant cross-cultural musical invariant (Burns, 1999) because octave frequency relationships (2:1 and 1:2) are the most stable, followed by fifths (3:2), and fourths (4:3). Mode-locking may provide a neurodynamic explanation for musical consonance and dissonance (Shapira Lots and Stone, 2008) that does not depend on interference (e.g., Plomp and Levelt, 1965).
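To make the families of resonant frequencies listed above concrete, the following minimal sketch simply enumerates them for two pure-tone frequencies. It is an illustration of the arithmetic only, not a model of neural dynamics: the function name `mode_locked_resonances`, the `max_order` cutoff, and the choice of integer frequencies in Hz are all assumptions introduced here.

```python
from fractions import Fraction

def mode_locked_resonances(f1, f2, max_order=3):
    """List the frequency families named in the text for tones f1 and f2 (Hz).

    f1 and f2 are assumed to be integers so that ratios stay exact.
    max_order is a hypothetical cutoff on the integers k and m.
    """
    orders = range(1, max_order + 1)
    return {
        # k * f1 for k = 1..max_order
        "harmonics": [k * f1 for k in orders],
        # f1 / m for m = 1..max_order
        "subharmonics": [Fraction(f1) / m for m in orders],
        # f1 + f2
        "summation": f1 + f2,
        # f2 - f1
        "difference": f2 - f1,
        # k * f1 / m for all k, m up to max_order, duplicates removed
        "integer_ratios": sorted(
            {Fraction(k, m) * f1 for k in orders for m in orders}
        ),
    }

if __name__ == "__main__":
    # Example: 200 Hz and 300 Hz, which stand in a 3:2 (fifth) relationship.
    for family, values in mode_locked_resonances(200, 300).items():
        print(family, values)
```

With `f1 = 200` and `f2 = 300`, the second tone itself appears among the integer ratios of the first (3 × 200 / 2), which is one way of seeing why a 3:2 interval counts as a simple mode-locked relationship in this scheme.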

Perhaps most relevant to the current discussion is the issue of scale degree qualia (Huron, 2006), which has important implications for understanding musical expectancy (Meyer, 1956; Zuckerkandl, 1956). Scale degree qualia differentiate musical sound sequences from arbitrary sound sequences, and are thought to enable non-referential sound patterns to carry meaning. Most discussions of expectancy and emotion assume scale degree qualia to be learned based on the statistics of tonal sequences (e.g., Meyer, 1956; Krumhansl and Kessler, 1982; Lerdahl, 2001; Huron, 2006), and therefore culture-dependent. However, recent dynamical analyses have shown that mode-locking provides a better explanation for quantitative measurements of stability in both Western and North Indian tonal systems (Krumhansl and Kessler, 1982; Castellano et al., 1984; Large, 2010a; Large and Almonte, 2011). Thus, scale degree qualia likely depend on the interaction of the stimulus sequence with intrinsic neurodynamic properties of the physical brain.

#### **RHYTHM AND METER**

At the timescale of rhythm and meter, relationships between musical and neural rhythms are equally striking (Musacchia et al., 2013). In auditory cortex, brain rhythms nest hierarchically; for example, delta phase modulates theta amplitude, and theta phase modulates gamma amplitude (Lakatos et al., 2005). Like neural rhythms, musical rhythms nest hierarchically, such that faster metrical frequencies subdivide the pulse (London, 2004). Pulse perception provides a good match for the delta band (0.5–4 Hz, see London, 2004) while fast metrical frequencies occupy theta (4–8 Hz, see e.g., Repp, 2005b; Large, 2008). Importantly, acoustic stimulation in the pulse range synchronizes auditory cortical rhythms in the delta-band (Will and Berg, 2007; Stefanics et al., 2010; Nozaradan et al., 2011) and modulates the amplitude of higher frequency beta and gamma rhythms (Snyder and Large, 2005; Iversen et al., 2006; Fujioka et al., 2012). Models of synchronization to acoustic rhythms (see e.g., Large, 2008) have successfully predicted a wide range of behavioral observations in time perception (Jones, 1990; McAuley, 1995), meter perception (Large and Kolen, 1994; Large, 2000), attention allocation (Large and Jones, 1999; Stefanics et al., 2010), and motor coordination (Kelso et al., 1990; Repp, 2005c). Moreover, musical qualia including metrical expectancy (Huron, 2005), syncopation (London, 2004), and groove (Tomic and Janata, 2008; Janata et al., 2012), have all been linked to synchronization of cortical rhythms and/or bodily movements. In addition, synchronization of rhythmic movements to music (Burger et al., 2013) and synchronization between individuals (e.g., Hove and Risen, 2009) have been linked to affective responses.

<sup>2</sup> *f1* and *f2* denote frequencies of pure tones, and *k* and *m* are positive integers.
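The correspondence between metrical levels and cortical frequency bands described above can be worked through numerically. The sketch below converts a tempo into a pulse frequency, subdivides it, and labels each level with the band ranges quoted in the text (delta 0.5–4 Hz, theta 4–8 Hz). The function names, the default subdivisions, and the treatment of 4 Hz as belonging to theta are assumptions made for illustration.

```python
# Band ranges in Hz as quoted in the text; boundaries are treated as
# half-open intervals [low, high), which is an assumption of this sketch.
BANDS_HZ = {"delta": (0.5, 4.0), "theta": (4.0, 8.0)}

def band_of(freq_hz):
    """Return the name of the band containing freq_hz, or None if outside both."""
    for name, (low, high) in BANDS_HZ.items():
        if low <= freq_hz < high:
            return name
    return None

def metrical_levels(tempo_bpm, subdivisions=(1, 2, 4)):
    """Map a tempo in beats per minute to (subdivision, frequency, band) triples.

    A subdivision of 1 is the pulse itself; 2 and 4 are hypothetical faster
    metrical levels that divide each beat into two or four events.
    """
    pulse_hz = tempo_bpm / 60.0
    return [(n, pulse_hz * n, band_of(pulse_hz * n)) for n in subdivisions]

if __name__ == "__main__":
    for n, freq, band in metrical_levels(90):
        print(f"subdivision x{n}: {freq:.2f} Hz -> {band}")
```

At 90 beats per minute, for instance, the pulse (1.5 Hz) and its duple subdivision (3 Hz) fall in the delta range, while a fourfold subdivision (6 Hz) falls in theta.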

The perception of rhythm also provides an example of synchronous time-locked patterns of activity integrating the function of multiple brain regions. When people listen to musical rhythms that have a pulse or basic beat, multiple brain regions are activated, including auditory cortices, cerebellum, basal ganglia, premotor cortex, and the supplementary motor cortex (Zatorre et al., 2007; Chen et al., 2008; Grahn and Rowe, 2009). In these areas, the amplitude of beta band activity waxes and wanes with the pulse of the acoustic stimulation (Snyder and Large, 2005; Iversen et al., 2006; Fujioka et al., 2012). The specific neural structures involved depend on the tempo of the stimulus, and it appears that the synchrony of beta band processes is what binds the neural activity (Fujioka et al., 2012). This suggests that perhaps it is not the areas *per se*, but the integrated neural activity that corresponds to the experience of pulse.

#### **MUSICAL COMMUNICATION AS NEURODYNAMIC RESONANCE**

We can summarize the above discussion by saying that music taps into brain dynamics at the right time scales to cause both brain and body to resonate to the patterns. This causes the formation of spatiotemporal patterns of activity on multiple temporal and spatial scales within the nervous system. The dynamical characteristics of such spatiotemporal patterns – oscillations, bifurcations, stability, attraction, and responses to perturbations – predict perceptual, attentional, and behavioral responses to music, as well as musical qualia including tonal and rhythmic expectations. Conceptualization of consciousness in similar neurodynamic terms leads to a new way to think about how music may communicate affective content. Neurodynamic responses that give rise to musical qualia also resonate with affective circuits, enabling music to directly engage the sorts of feelings that are associated with emotional experiences. In this section we ask, how might affective resonance take place, do musical qualia arise from intrinsic neurodynamics, and what exactly is communicated?

#### **AFFECTIVE RESONANCE**

We begin with an example of affective resonance to rhythm. Expressive piano performance is a kind of social interaction in which correlated fluctuations in timing and intensity transfer emotional information from the performer to the listener (Bhatara et al., 2010). Expressive tempo fluctuations display *1/f* structure (Rankin et al., 2009; Hennig et al., 2011), and listeners predict such tempo changes when entraining to musical performances (Rankin et al., 2009; Rankin, 2010). A recent study compared BOLD responses to an *expressive* performance and a *mechanical* performance, in which the piece was "performed" by computer, with no fluctuations in timing and intensity. Greater activations were found in emotion and reward related areas for the expressive performance, consistent with transfer of affective information. Tempo fluctuations, BOLD activations and real-time ratings of valence and arousal were also compared for the expressive performance. Over the 3.5 min performance, fluctuations in timing correlated with BOLD changes in motor networks known to be involved in rhythmic entrainment, and in a network consistent with the human "mirror neuron" system (Chapin et al., 2010). As tempo increased, activation in these regions increased. Tempo fluctuations also correlated with real-time reports of affective arousal.

Despite the fact that the tempo-correlated activations were observed in so-called mirror neuron areas, this was not motor mirroring; half the participants were not musicians, and none were familiar with the piece. Could listener responses arise from a more general form of contagion in which the perception of affective expression directly induces the same emotion in the perceiver (Carr et al., 2003; Rizzolatti and Craighero, 2004; Molnar-Szakacs and Overy, 2006)? Based on what is known about neural responses to rhythm (Nozaradan et al., 2011; Fujioka et al., 2012), we propose a simple, if somewhat speculative, interpretation. Activation in mirror regions reflects resonance of endogenous cortical rhythms to exogenous musical rhythms. Activation increases as tempo increases because, as this neural circuit entrains to the musical rhythm, it tracks the tempo (i.e., frequency modulations) of the performance (Herrmann et al., 2013). The frequency modulations themselves would represent violations of temporal expectancy (Large and Jones, 1999). The expressive performance also led to emotion and reward related neural activations (when compared with a mechanical performance that precisely controlled for melody, harmony, and rhythm, see Chapin et al., 2010). We hypothesize that the frequency modulation of mirror regions led to these activations (Molnar-Szakacs and Overy, 2006; Chapin et al., 2010). Thus, perhaps music directly couples into affective circuitry by exploiting resonant modes of cortical function, thereby creating the basis for affective communication.

#### **INTRINSIC DYNAMICS, MUSICAL QUALIA AND COGNITIVE DEVELOPMENT**

The preceding discussion suggests that at least some aspects of affective responses to music are deeply rooted in the intrinsic physics of the brain and body. If this is true, then neurodynamic investigations may ultimately explain how musical rhythms couple into neural circuits and modulate affective responses. But could the neurodynamic approach explain musical qualia more generally? Consider the fundamental qualitative difference between pitch and rhythm. A simple acoustic click, repeated at 5 ms intervals, generates a pitch percept at 200 Hz. Increase the interval to 500 ms and the percept is that of a series of discrete events, with a pulse rate of 2 Hz. From a dynamical systems point of view, it makes perfect sense that the neural mechanisms brought to bear on the two stimuli may be similar; the difference is merely one of timescale. Yet from a phenomenological point of view, the two are fundamentally different: a single continuous event versus a rhythmic sequence. Why the difference in qualia? Perhaps it is because the timescale at which distinct neural events are bound together into unified conscious scenes lies between these timescales of pitch and rhythm. Perhaps the difference in qualia lies in the timescale relationship, not in the mechanism *per se*. If so, perhaps neural oscillation explains not only rhythm related responses, but also pitch related responses, such as stability and attraction.
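The click-train example rests on a simple reciprocal relationship between inter-click interval and repetition rate. As a rough sketch, the code below computes that rate and compares it with the approximate ranges given earlier in this article for pitch (about 30 to 4000 Hz) and rhythm (below about 8 Hz). The function names and the hard thresholds are illustrative assumptions; the perceptual boundaries themselves are gradual rather than sharp.

```python
def repetition_rate_hz(interval_ms):
    """Convert an inter-click interval in milliseconds to a rate in Hz."""
    return 1000.0 / interval_ms

def timescale(rate_hz):
    """Label a repetition rate using the approximate ranges cited in the text.

    The cutoffs (8 Hz, 30 Hz, 4000 Hz) are treated as hard limits here
    purely for illustration.
    """
    if 30.0 <= rate_hz <= 4000.0:
        return "pitch"
    if rate_hz <= 8.0:
        return "rhythm"
    return "intermediate"

if __name__ == "__main__":
    for interval_ms in (5, 500):
        rate = repetition_rate_hz(interval_ms)
        print(f"{interval_ms} ms -> {rate:g} Hz -> {timescale(rate)}")
```

Rates that fall between the two ranges are labeled "intermediate" here; this is the region in which, on the argument above, the binding of neural events into unified conscious scenes is thought to occur.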

We have argued elsewhere that the terms stability and attraction, used by theorists to describe scale degree qualia (Meyer, 1956; Zuckerkandl, 1956; Lerdahl and Jackendoff, 1983; Lerdahl, 2001), are not metaphorical. These refer to real, dynamical stability and attraction relationships in a neural field stimulated by external frequencies (Large, 2010a; Large and Almonte, 2011). In other words, scale degree qualia are simply what it feels like when our brains resonate to tonal sequences. This approach can explain the perception of tonal stability and attraction in Western modes (Krumhansl and Kessler, 1982; Large, 2010a), and North Indian raga (Castellano et al., 1984; Large and Almonte, 2011). It may also shed light on the development of statistical regularities in tonal melodies, implying that certain pitches occur more frequently because they have greater dynamical stability in underlying neural networks.

There is now a great deal of evidence regarding development of basic music structure cognition, including meter (Hannon and Trehub, 2005; Kirschner and Tomasello, 2009; Winkler et al., 2009) and tonality (Trainor and Trehub, 1992; Schellenberg and Trehub, 1996; Trehub et al., 1999). Such results reveal developmental trajectories that occur over the first several years of life, as well as perceptual invariants that are consistent with intrinsic neurodynamics (Large, 2010b) tuned with Hebbian plasticity (Hoppensteadt and Izhikevich, 1996; Large, 2010a). Dalla Bella et al. (2001) asked if children can determine whether music is happy or sad. 3- to 4-year-olds failed to distinguish happy from sad above chance, 5-year-olds' responses were affected by tempo, while 6- to 8-year-old children used both tempo and mode. Thus, children begin to use tempo at about the same time the ability to synchronize movements emerges, and they begin to use mode at about the same time that sensitivity to key emerges (Trainor and Trehub, 1992; Schellenberg and Trehub, 1999; McAuley et al., 2006). The fact that the development of the two main musical dimensions – rhythm and tonality – has the same time course as their affective correlates strongly suggests a link between the development of neurodynamic responses and music-induced affective experience.

We do not claim that musical qualia are hard-wired; however, our argument does suggest that substantive aspects of musical expectancy and musical contagion may be explainable directly in neurodynamic terms, linking "high-level" perception with "low-level" neurodynamics. In combination with Hebbian plasticity, intrinsic neurodynamic constraints could explain the sensitivity of infant listeners to musical invariants, as well as the ability to acquire sophisticated musical knowledge. Moreover, this explanation suggests that association of affective responses with the musical modes of diverse cultures may not be due entirely to convention, as has been speculated previously (Meyer, 1956; Huron, 2005). Indeed, cross-cultural studies in the perception of Western music suggest that happiness and sadness are communicated, at least in part, based on mode (Balkwill et al., 2004; Fritz et al., 2009). Moreover, unencultured Western listeners may be able to understand the moods intended by Indian raga performances (Balkwill and Thompson, 1999; Chordia and Rae, 2008). At the very least, these cross-cultural findings suggest that associations of mood with mode have been prematurely dismissed as conventional, and these relationships deserve to be reevaluated.

#### **WHAT IS COMMUNICATED – BASIC EMOTION OR CORE AFFECT?**

Juslin and Vastfjall (2008) propose that emotional contagion operates similarly to facial expression of basic emotions (Juslin and Laukka, 2003; Laird and Strout, 2007). However, because the theory of basic emotions has recently been called into question, it makes sense to review the body of evidence that pertains to music. In communication studies, both performances and listener judgments of intended emotion have been linked to specific musical features, including tempo, articulation, intensity, and timbre (Gabrielsson and Juslin, 1996; Peretz et al., 1998; Juslin, 2000; Juslin and Laukka, 2003). However, these studies also show that, at least for Western music, happiness and sadness are the most reliably communicated emotions (Kreutz et al., 2002; Lindström et al., 2002; Kallinen, 2005; Kreutz et al., 2008; Mohn et al., 2011), while other "basic" emotions are more often confused (Gabrielsson and Juslin, 1996; Peretz et al., 1998; Juslin, 2000; Juslin and Laukka, 2003; Kreutz et al., 2008; Mohn et al., 2011). Interestingly, Ekman (1993) and Izard et al. (2000) have both questioned the theory of facial expression of basic emotion based on variability and confusability. Moreover, analyses of physiological responses to music show that while musical stimuli elicit significant responses, physiological measures do not generally match listener self-reports using emotion terms (Krumhansl, 1997).

Basic emotion theory has been linked to an approach in which music is supposed to somehow imitate or mimic more biologically relevant stimuli, such as speech or mother-infant interactions (Juslin and Laukka, 2003), leading to the direct perception of emotion. Such discussions generally assume that musical communication is not evolutionarily selected, but needs to piggyback on more fundamental mechanisms. Our proposal is that music speaks to the brain in its own language; it need not imitate any other form of communication. In this sense, other forms of communication may be seen to induce or modulate emotions more indirectly, i.e., the effect is more cognitive (cf. Langer, 1951). Thus, the study of music may provide a unique window into the fundamental nature of affective communication, which might explain, for example, why music has the ability to evoke emotional memories (Janata et al., 2007).

It is tempting to try to unify core affect with basic emotion (Juslin, 2001) by assuming that each emotion category is associated with a specific core affective state (e.g., fear is unpleasant and highly arousing, sadness is unpleasant and less arousing, etc.). However, the mapping of emotion to affect is not unique; core affective states experienced during two different episodes of a given, nameable emotion (e.g., fear) typically differ depending on the situation (Meyer, 1956; Barrett, 2009b; Wilson-Mendenhall et al., 2013). Moreover, musical variables such as melodic contour, tempo, loudness, texture, and timbral sharpness, predict real-time listener ratings of arousal and valence well (Schubert, 1999, 2001, 2004), and correlate with BOLD responses in a number of brain regions (Chapin et al., 2010). Neuroimaging studies have also revealed BOLD responses to parametric manipulation of pleasantness (Blood et al., 1999; Koelsch et al., 2006), and these overlap with responses to intensely pleasurable music (Blood and Zatorre, 2001; Salimpoor et al., 2010).

#### **CONCLUSION**

In summary, we believe that a coherent picture is developing, based on recent findings of nonlinear resonant responses to acoustic stimulation at multiple timescales (Ruggero, 1992; Joris et al., 2004; Lakatos et al., 2005; Lee et al., 2009; Laudanski et al., 2010; Nozaradan et al., 2011) and theoretical analyses that show how such processes could underlie complex cognitive computations as well as phenomenal and affective aspects of our musical experiences (Baldi and Meir, 1990; Hoppensteadt and Izhikevich, 1997; Izhikevich, 2002; Shapira Lots and Stone, 2008; Large, 2010b). Such results and analyses suggest that neurodynamics provides an appropriate level at which to understand not only perceptual and cognitive responses to music, but ultimately affective and emotional responses as well. We suggest that, to support affective communication, music need not mimic some other type of social interaction; it need only engage the nervous system at the appropriate timescales. Indeed, music may be a unique type of stimulus that engages the brain in ways that no other stimulus can.

Thus, we suggest that there is something special about the way music communicates emotion. Our approach recasts musical expectancy and affective contagion as nonlinear resonance to musical patterns. Resonance occurs simultaneously on multiple timescales, leading to stable or metastable patterns of neural responses. Such patterns are inherently spatiotemporal; however, temporal aspects of the stimulus determine at any specific point which neural structures are involved. Violations of expectancy, such as the occurrence of a strong rhythmic event on a weak beat (a syncopation) or the prolongation of an unstable tone where a stable tone is expected (an appoggiatura), would correspond to disruptions, or perturbations, of the ongoing pattern. Implication and realization would correspond to relaxation toward, and reestablishment of, a stable orbit. Stable and unstable, in this context, are determined by the intrinsic neurodynamics of the brain networks involved, which depend, in part, on tuning of the dynamics via synaptic plasticity. In this way, music may modulate affective neurodynamics directly by coupling into those aspects of the dynamic core of consciousness that govern our subjective feelings from moment to moment.

#### **GLOSSARY**

**Affect/Core affect:** The most elementary consciously accessible sensation evident in moods and emotions. Affect can be characterized as a fluctuating level of valence (pleasure/displeasure) and arousal (activation/deactivation; Barrett and Bliss-Moreau, 2009). Core affect is so called because it is thought to arise in the core of the body or in neural representations of body state change (Russell, 2003).

**Basic emotions:** A few privileged emotion kinds (e.g., anger, sadness, fear, and happiness), each of which is thought to derive from an evolutionarily preserved brain module. Basic emotions are discrete, and each category shares a distinctive collection of properties, including patterns of autonomic nervous system activity, behavioral responses, and action tendencies. A set of emotion-specific brain structures is thought to mediate these particular "basic" emotions (Allport, 1924; Izard, 1971; Ekman, 1972; Panksepp, 1998).

**Dynamic core:** Functional clusters of neuronal groups in the thalamocortical system that are hypothesized to underlie consciousness. Distinct neuronal groups contribute to the contents of consciousness through enhanced synchrony of neural rhythms. The boundaries of this core are suggested to shift over time, with transitions occurring under the influence of internal and external stimulation (Seth, 2007).

**Empathy:** A feeling that arises when the perception of an emotional gesture in another person directly induces the same emotion in the perceiver without any appraisal process (see Juslin and Vastfjall, 2008).

**Emotional contagion/Affective contagion:** A process that occurs between individuals in which emotional or affective information is transferred from one individual to another. The idea that people may "catch" the emotions of others when seeing their facial expressions, hearing their vocal expressions, or hearing their musical performances (see Juslin and Vastfjall, 2008).

**Emotion:** Affective responses to situations that usually involve a number of sub-components – subjective feeling, physiological arousal, thought, expression, action tendency, and regulation – which are more or less synchronized (Juslin and Vastfjall, 2008). Emotions are intentional; they are about an object or event.

**Feelings:** The subjective phenomenal character of an experience, used informally to refer to qualia or affect.

**Intentionality:** The power of minds to be about, to represent, or to stand for, things, properties and states of affairs (Jacob, 2010).

**Psychological constructionism:** The theory that emotions result from the combination of psychologically primitive processes, which encompass both affective and intentional components. A specific emotion is not the invariable result of activation in a particular brain area; neural circuitry realizes more basic processes across emotion categories. Psychological constructionists argue that emotions are culturally relative and learned (see Russell and Barrett, 1999; Barrett and Bliss-Moreau, 2009; Barrett, 2009a,b).

**Qualia/musical qualia:** The distinctive subjective character of a mental state; what it is *like* to experience each state; the introspectively accessible, phenomenal aspects of our mental lives (Tye, 2013). Musical qualia refers to the subjective character of specific musical events, experienced within a tonal and/or temporal context.

#### **ACKNOWLEDGMENTS**

This work was supported by NSF grant BCS-1027761 to Edward W. Large. We wish to thank Dan Levitin and all of the reviewers (with special thanks to reviewer 3) for their careful reading and excellent suggestions for improving this manuscript.

#### **REFERENCES**

Allport, F. (1924). *Social Psychology*. New York: Houghton Mifflin.


Brentano, F. (1973). *Psychology from an Empirical Standpoint*. New York: Routledge.


*sounds*, ed. D. Swaminathan (Berlin: Springer), 110–124. doi: 10.1007/978-3-540-85035-9\_7


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 30 April 2013; paper pending published: 01 July 2013; accepted: 19 January 2014; published online: 17 March 2014.*

*Citation: Flaig NK and Large EW (2014) Dynamic musical communication of core affect. Front. Psychol. 5:72. doi: 10.3389/fpsyg.2014.00072*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2014 Flaig and Large. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### The same, only different: what can responses to music in autism tell us about the nature of musical emotions?

#### *Rory Allen<sup>1</sup>\*, Reubs Walsh<sup>2</sup> and Nick Zangwill<sup>3</sup>*

*<sup>1</sup> Department of Psychology, Goldsmiths, University of London, London, UK*

*<sup>2</sup> Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK*

*<sup>3</sup> Department of Philosophy, University of Durham, Durham, UK*

*\*Correspondence: r.allen@gold.ac.uk*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Anjali Bhatara, Université Paris Descartes, France*

We propose addressing the theme of this special issue by examining the affective responses that music evokes in the individual. The logical first step is to enquire how far these responses resemble naturalistic emotions, i.e., those that are not specifically musical, but have ordinary non-musical content. The literature is ambivalent on this. Many authors suggest that whilst certain emotions are exclusive to music (Scherer and Zentner, 2008), there is considerable overlap between "musical" and "naturalistic" emotions (Zentner et al., 2008); others deny that musically induced emotions are naturalistic (Konecni, 2005, 2008), a view elaborated by the nineteenth-century critic Eduard Hanslick (Hanslick, 1986; see also Kivy, 2001, 2009; Zangwill, 2004, 2007, 2011).

Perhaps consideration of music's origins might clarify the issue. If the universality of music in human society were the consequence of biological selection (Huron, 2001; Mithen, 2009), this would support the naturalistic interpretation. If music is, literally, in our DNA, then human responses to music will form part of the normal repertoire of emotions. However, Patel (2008) has given reasons for rejecting the evolutionary theory in favor of the idea of music as "transformative technology," implying that it is the outcome of cultural, not biological evolution. An important recent study has provided perhaps the first experimental evidence for this. Inspired by cultural transmission theory (Boyd and Richerson, 1985), MacCallum et al. (2012) demonstrated the effectiveness of consumer selection in generating music out of noise in a Darwinian model of cultural evolution. This suggests that music has evolved to satisfy human aesthetic criteria, not vice versa.

Moreover, the universality of music goes hand in hand with an extraordinary diversity, as MacCallum et al. point out (the need to explain this being one of the drivers for their study). Language's acknowledged genetic basis is associated with deep structural similarities between human languages. Yet the "languages" of music have little in common cross-culturally. Javanese gamelan music uses two scales, both totally different from the 12-note Western scale (Patel, 2008, p. 19). West African drum music employs a feature, the "time line," unknown to Western tradition (Agawu, 2006) and a rhythmic structure which has "a richness and subtlety found in no other music" (Temperley, 2000, p. 79). Japanese hogaku is "a music erected upon such a different foundation and animated by so different an aesthetic" that it has essentially nothing in common with the Western classical corpus (Dean, 1985, p. 147).

Taking the longitudinal view, and focusing on just one musical sub-culture, that of Western music, its development during the past millennium from Gregorian plainchant to modern electronic music illustrates that the evolution of music operates several orders of magnitude faster than human evolution. Languages, by comparison, show continuities in deep structure that can be traced back 10,000 years and even beyond (Dunn et al., 2005). The conclusion seems inescapable. The evolutionary boot is on the other foot: it is music that has evolved to fit humans, not vice versa. But if human responses to music are not the result of biological evolution, the *a priori* argument that these affective responses must correspond to naturalistic emotions falls away.

On the other hand, Patel's model not only explains the speed and diversity of musical development, but also suggests an answer to our initial question. Borrowing Patel's analogy, humans migrating out of Africa into Europe would have found the warmth of fire a life-giving substitute for the warmth of tropical sunshine, and the fact that it was not in all respects identical with sunshine did not detract from its value. For some purposes the warmth of fire would have proved superior: it does not induce sunburn, and one can cook with it. Similarly, the emotional warmth induced by music need not be identical with that provided by any natural emotion, and in some respects the differences may enhance its value: what we call the "sadness" in music may strike us as so pleasant partly because it does *not* induce real sadness in listeners.

The outcome of the study reported in Allen et al. (2013), though not designed with this purpose, turns out to have a bearing on the question. Matched adult autism and control groups were compared on the autonomic and cognitive components of their emotional responses to a standard list of music items (Quintin et al., 2011). Whereas the groups responded similarly at an autonomic level, they differed at a cognitive level precisely as would be expected if autonomic arousal causes cognitive arousal. Regression analysis suggested that the causal chain was mediated by levels of type II alexithymia, or the cognitive inability to interpret and verbalize the lower level autonomic or visceral aspects of emotion (this has high comorbidity with autism: Berthoz and Hill, 2005). The mediation interpretation was robust: levels of alexithymia, and verbalization of emotion, also correlated significantly within the control group.
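The logic of a regression-based mediation analysis of this kind can be illustrated with a minimal sketch. The code below is not the analysis reported in Allen et al. (2013); it is a generic decomposition of a total effect into a direct and an indirect (mediated) part. The variable names (`group`, `alexithymia`, `verbalization`), their mapping onto the study's measures, and all numbers are hypothetical, chosen only so that the mediator carries the whole effect.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of the simple least-squares regression y ~ x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def ols_two_predictors(x, m, y):
    """Slopes (on x, on m) of y ~ x + m, from the normal equations on centered data."""
    mx, mm, my = mean(x), mean(m), mean(y)
    xc = [a - mx for a in x]
    mc = [a - mm for a in m]
    yc = [a - my for a in y]
    sxx = sum(a * a for a in xc)
    smm = sum(a * a for a in mc)
    sxm = sum(a * b for a, b in zip(xc, mc))
    sxy = sum(a * b for a, b in zip(xc, yc))
    smy = sum(a * b for a, b in zip(mc, yc))
    det = sxx * smm - sxm ** 2
    slope_x = (smm * sxy - sxm * smy) / det
    slope_m = (sxx * smy - sxm * sxy) / det
    return slope_x, slope_m


def mediation(x, m, y):
    """Decompose the effect of x on y into direct and indirect (via m) parts."""
    total = ols_slope(x, y)                   # path c: x -> y
    a = ols_slope(x, m)                       # path a: x -> mediator
    direct, b = ols_two_predictors(x, m, y)   # paths c' and b
    return {"total": total, "a": a, "b": b,
            "direct": direct, "indirect": a * b}


# Hypothetical toy data (NOT from the study): 0 = control, 1 = autism group.
group = [0, 0, 0, 0, 1, 1, 1, 1]
alexithymia = [1, 2, 3, 2, 4, 5, 6, 5]             # assumed mediator scores
verbalization = [10 - s for s in alexithymia]      # outcome driven by mediator only

result = mediation(group, alexithymia, verbalization)
print(result)
```

With these toy values the direct effect vanishes and the indirect effect equals the total effect, which is the pattern a full-mediation account would predict. Significance testing, which any real analysis would require, is omitted here.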

How should we interpret this result? If we assume as a working hypothesis that affective responses to music are naturalistic, it follows that they should be activated via one or both of the two principal routes to emotion induction, the "fast" (thalamic) and "slow" (cortical) routes (LeDoux, 2000). The fast route rapidly alerts the autonomic nervous system, priming the body to take immediate action to cope with a potential emergency, or opportunity; only subsequently are the higher cognitive functions recruited, to verify, elaborate, and possibly revoke the fast track responses to an alert. With the slow route, incoming sensory signals are first appraised by the higher centers of the brain; if found emotionally relevant, they induce autonomic and bodily arousal. In both cases, however, for the arousal to be considered a naturalistic emotion, autonomic and cognitive components must both eventually be activated, and should be congruent with one another.

The results from Allen et al. (2013), in particular the mediation analysis, appeared inconsistent with a naturalistic slow track route for the induction of musical emotions in our study. It might be argued that this was because the individual musical extracts were short (30 s): we have no difficulty with accepting that higher level cognitive processes, including mechanisms such as the ITPRA sequence described by Huron (2006), are important for the aesthetic aspects of musical appreciation in extended listening, though many cognitive effects may occur in as brief a period as three seconds (Plazak and Huron, 2011). The pleasure induced by music activates normal dopamine reward and anticipation circuits (Salimpoor et al., 2011). However, this is irrelevant since pleasure in general, and aesthetic pleasure in particular, is not an emotion.

Juslin and Västfjäll (2008) propose six mechanisms for emotion induction by music (they exclude cognitive appraisal from the outset, we think correctly). These are brain stem reflexes, evaluative conditioning, emotional contagion, visual imagery, episodic memory, and musical expectancy. Of these, the first and last have been discussed above (in the "fast route," and the ITPRA mechanisms respectively). Emotional contagion involves no appraisal process and we would include it in our fast route mechanism. Evaluative conditioning and episodic memory both rely on arbitrary associations, essentially independent of any properties of the music. In the case of visual imagery, Juslin and Västfjäll cite no clear experimental evidence of any consistent causal link between musical structure and particular visual images, let alone between music and any emotions induced through that mechanism.

We conclude that musical emotions, if they are emotions at all in the conventional sense, are fast track emotions. With naturalistic fast track emotions, the autonomic arousal component should be complemented by the appropriate cognitive counterpart. According to Huron (2011), this is not the case, at least for negative autonomic responses such as sadness. Huron considers that the autonomic system responds automatically to such music with a kind of "sham pain," but we enjoy "sad" music because the conscious brain realizes that the situation is not threatening, and responds with relief, aided by the liberation of prolactin, so that the net effect is pleasurable. If we accept Huron's ITPRA model, we may plausibly speculate that a further effect might be due to the combination in music of a high degree of order and pattern, and sufficient variety to make it unpredictable. This acts as intellectual catnip to the pattern detection and prediction aspects of executive functioning. The analysis of these patterns may be sufficiently interesting to, and demanding of, the higher brain centers that they are distracted from their normal role in monitoring autonomic arousal, thus permitting the arousal induced by the fast track mechanism to persist in defiance of its lack of congruence with reality. These processes would enable the generation of patchwork emotion states comprising activation of combinations of different brain circuits not found in naturalistic emotions. We might call these states "chimerical," after the composite lion/goat/serpent creature of Greek myth. They would be sufficiently rewarding to make us wish to repeat the listening experience, and this could be a driver, in a model such as that of MacCallum et al. (2012), for music to evolve ways of generating ever more desirable chimerical combinations.

Two questions suggest themselves. Firstly, if musical emotions are indeed not naturalistic emotions, how can we account for the stubbornly persistent illusion that they are? Secondly, how is it that music has the ability to induce such powerful affective states, if indeed they are unnatural? On the first question, we have long known that experiencing the physiological counterpart of an emotion can lead to the brain's attributing the state to a naturalistic emotion even when the cognitive counterpart is not present (Schachter and Singer, 1962). We suggest that this kind of unconscious confabulation may be happening when a listener is asked to describe their emotional experiences, especially if there is social pressure to feel the "appropriate" emotions. Moreover, as pointed out in Zentner et al. (2008), p. 494, some experimental protocols embody a theory of emotion developed outside music that compels the use of standard emotion words.

As to the second question, Juslin (2000, 2001) has argued that musical instruments act as "superexpressive voices," which enhance and exaggerate the emotionally expressive components of the human voice. This theory is a perfect counterpart, for music, to that of Ramachandran and Hirstein (1999) in the visual arts, with their notion of "supernormal stimuli" (though we should note that due priority should be accorded to Aldous Huxley: see his 1956 book "Heaven and Hell"; also Allen and Heaton, 2010, p. 255). However, composers have an important advantage over visual artists, who cannot precisely control how viewers scan an artwork, whereas listeners cannot avoid hearing the notes in the intended order: this allows for all the sophisticated mechanisms for generating tension and satisfying, or violating expectation as described in Huron's ITPRA model (Huron, 2006).

Concluding on a constructive note, though affective responses to music may lack validity as naturalistic emotions, they are not for that reason valueless. Music can undoubtedly influence mood; indeed, we know that mood management is one of the main reasons people give for listening to music (North et al., 2004). It is plausible that music has the ability to vary mood states in a positive way along both axes of the two-dimensional arousal space described by Thayer (1978), leading to satisfying alterations of mood from tense to calm, and from dull to excited. Such uses were clearly described by participants with autism in Allen et al. (2009): see also Bhatara et al. (2010). Incidentally, the lack of sophisticated emotion descriptors cited in Allen et al. (2009) suggests, in the light of the present paper, that our participants were actually *more* insightful into the true nature of their affective responses to music than typically developing individuals, a nice reversal of the usual representation of autism as a syndrome of deficits.

It was proposed in Allen and Heaton (2010) that the apparent preservation of affective responses to music in neurodevelopmental disorders such as autism might be used as a means to repair the link between autonomic and cognitive components of emotion where this link is damaged or underdeveloped. The suggestion originated from the personal experience of the second author, who had found that autism, with its associated difficulties in learning about emotions via the usual route of social interaction, did not prevent the induction by music of intense affective states, in an unthreatening context, which led to a better understanding of naturalistic emotions. A pilot study under the auspices of the Baily Thomas Charitable Fund currently being conducted by the first author is exploring whether associative learning can be used to help people with type II alexithymia by teaching them, via musical extracts, to attach cognitive labels to their autonomic arousal states. Very preliminary results suggest that our procedure does produce measurable benefits (pending formal publication, some further details of the study can be found online: Allen et al., 2012). Musical emotions possess some of the characteristics of naturalistic emotions and lack others, and we suggest that it is this dual nature which may make them useful in treating conditions where emotional processing is partially preserved, and partially disrupted. If this viewpoint is correct, then their value in this context is precisely because, like individuals with autism, they are both the same and different.

#### **ACKNOWLEDGMENTS**

We gratefully acknowledge the role of the Baily Thomas Charitable Fund in making possible the research cited in the final paragraph.

#### **REFERENCES**


*Received: 28 February 2013; accepted: 13 March 2013; published online: 04 April 2013.*

*Citation: Allen R, Walsh R and Zangwill N (2013) The same, only different: what can responses to music in autism tell us about the nature of musical emotions? Front. Psychol. 4:156. doi: 10.3389/fpsyg.2013.00156*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Allen, Walsh and Zangwill. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

### Valence, arousal, and task effects in emotional prosody processing

#### *Silke Paulmann<sup>1</sup>\*, Martin Bleichner<sup>2</sup> and Sonja A. Kotz<sup>3,4</sup>\**

*<sup>1</sup> Department of Psychology and Centre for Brain Science, University of Essex, Colchester, UK*

*<sup>2</sup> Department of Neurology and Neurosurgery, Rudolf Magnus Institute of Neuroscience, University Medical Center Utrecht, Utrecht, Netherlands*

*<sup>3</sup> Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany*

*<sup>4</sup> School of Psychological Sciences, University of Manchester, Manchester, UK*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Lorena Gianotti, University of Basel, Switzerland*

*Lucia Alba-Ferrara, University of South Florida, USA*

#### *\*Correspondence:*

*Silke Paulmann, Department of Psychology, Centre for Brain Science, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK*

*e-mail: paulmann@essex.ac.uk;*

*Sonja A. Kotz, Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1a, 04103 Leipzig, Germany*

*e-mail: kotz@cbs.mpg.de*

Previous research suggests that emotional prosody processing is a highly rapid and complex process. In particular, it has been shown that different basic emotions can be differentiated in an early event-related brain potential (ERP) component, the P200. Often, the P200 is followed by later long lasting ERPs such as the late positive complex. The current experiment set out to explore to what extent emotionality and arousal can modulate these previously reported ERP components. In addition, we also investigated the influence of task demands (implicit vs. explicit evaluation of stimuli). Participants listened to pseudo-sentences (sentences with no lexical content) spoken in six different emotions or in a neutral tone of voice while they either rated the arousal level of the speaker or their own arousal level. Results confirm that different emotional intonations can first be differentiated in the P200 component, reflecting a first emotional encoding of the stimulus possibly including a valence tagging process. A marginally significant arousal effect was also found in this time-window, with high arousing stimuli eliciting a stronger P200 than low arousing stimuli. The P200 component was followed by a long lasting positive ERP between 400 and 750 ms. In this late time-window, both emotion and arousal effects were found. No effects of task were observed in either time-window. Taken together, results suggest that emotion relevant details are robustly decoded during early processing and late processing stages while arousal information is only reliably taken into consideration at a later stage of processing.

#### **Keywords: P200, LPC, ERPs, arousal, task demands, emotion, prosody**

#### **INTRODUCTION**

There is a recent increase in studies informing about the complexity and diversity of how the brain processes emotional information from the voice. Much progress has been made in depicting which brain structures are implicated during emotional prosody processing, that is, the variation of acoustic cues such as fundamental frequency (F0), amplitude (or intensity), timing, and voice quality (energy distribution) during speech (see e.g., Kotz and Paulmann, 2011 for recent review). In addition, electrophysiological studies have investigated the time-course or speed with which emotional prosodic information is processed to ensure appropriate social behavior (e.g., Pihan et al., 1997; Schirmer et al., 2002, 2005b; Schirmer and Kotz, 2003; Bostanov and Kotchoubey, 2004; Wambacq et al., 2004; Kotz and Paulmann, 2007; Paulmann and Kotz, 2008; Paulmann and Pell, 2010). However, although most researchers would agree that emotional information as conveyed by the voice (or other non-verbal channels such as face or body posture) can be described in a two-dimensional space, that is, with regard to valence (pleasant – unpleasant) and arousal (sometimes referred to as activation: high – low; see e.g., Feldman-Barrett et al., 2007 for review of emotion theories), most electrophysiological research on vocal emotion processing has concentrated on exploring when *emotional* or *valence* attributes are processed, thereby ignoring the possible contribution of *arousal* during emotional prosody processing. Thus, the present investigation aims to start filling this gap in the literature by studying how and when these two dimensions impact on emotional prosody processing.

Event-related brain potentials (ERPs) have been widely used to define the temporal processes involved in emotional prosody processing. For instance, early studies on vocal emotion processing have focused on assessing when stimuli of different valences can be distinguished from one another (e.g., Wambacq and Jerger, 2004; Schirmer et al., 2005a). Later studies have explored when language stimuli expressing so-called basic emotions (anger, fear, disgust, sadness, surprise, happiness) can be differentiated from neutral stimuli and/or each other (e.g., Paulmann and Kotz, 2008; Paulmann et al., 2011). Generally speaking, ERP findings support the notion that valence information is detected and analyzed rapidly (within the first 200 ms after stimulus encounter) from prosody (e.g., Schirmer et al., 2005a, 2013; Paulmann and Kotz, 2008; Garrido-Vásquez et al., in press), irrespective of speaker voice (Paulmann and Kotz, 2008; Paulmann et al., 2008b), and even when information is not task-relevant (Wambacq et al., 2004; Kotz and Paulmann, 2007), or when it is processed preattentively (Schirmer et al., 2005a). Under attentive processing conditions, the process of rapid emotional salience detection has repeatedly been linked to the P200 component, a fronto-centrally distributed positivity reaching its peak approx. 200 ms after stimulus onset.

While the early P200 component is assumed to reflect enhanced attention to emotional stimuli so that they can be preferentially processed if need be, (concurrent) later ERP components are often linked to more in-depth processing mechanisms (e.g., meaning evaluation, access to memory representation). Specifically, late emotional prosody effects have been observed in several late ERP components including the P300 (e.g., Wambacq and Jerger, 2004), N300 (e.g., Bostanov and Kotchoubey, 2004), N400 (e.g., Schirmer et al., 2002, 2005a; Schirmer and Kotz, 2003; Paulmann and Pell, 2010), and a late positive complex (LPC; Kanske and Kotz, 2007; Schirmer et al., 2013), depending on stimuli, tasks, and experimental designs used. Thus, a growing body of literature suggests that emotion signaling features such as valence or even emotional category knowledge are rapidly extracted and analyzed during emotional prosody processing. However, next to nothing is known about additional emotion relevant parameters that could potentially influence this early evaluation process. Specifically, so-called circumplex models of emotion propose that both valence and arousal dimensions are crucial when describing how someone feels (e.g., Feldman-Barrett and Russell, 1998; Feldman-Barrett, 2006), that is, both dimensions should modulate how emotions are perceived from speech.

While previous electrophysiological research on *vocal* emotional language processing has either ignored the dimension of arousal altogether or has tried to control for arousal by keeping activation attributes of stimuli similar, ERP research on *visual* emotional language processing has already started to explore the (combined) influence of valence and arousal on processing affective word or sentence stimuli. For instance, Hinojosa et al. (2009) presented positive prime-target word pairs which were either congruent or incongruent with regard to their arousal level. Participants were instructed to identify whether the target word was of either high or low (relaxing) arousal. The authors report reduced LPC amplitudes for high-arousal congruent target words when compared to high-arousal incongruent target words. This priming effect occurred between 450 and 550 ms after target word onset and was interpreted to reflect reduced attentional resources needed to process highly arousing stimuli when preceded by stimuli of the same arousal level (Hinojosa et al., 2009). Similarly, Bayer et al. (2010) report a short negativity between 330 and 430 ms after stimulus onset for sentences containing negative high-arousal target words when compared to sentences with negative low-arousal target words while participants engaged in a semantic judgment task (does the target word fit the preceding context). Combined, their results are in line with the view that arousal relevant details about word stimuli are processed at a rather "late" (cognitive) processing stage compared to valence or emotion relevant details, which have been reported to be processed in earlier processing stages (e.g., Gianotti et al., 2008). 
In other words, findings from studies exploring visual emotional language processing suggest that arousal influences allocation of attentional resources and later sustained stimulus evaluation processes while valence or emotion attributes of stimuli can impact early, initial evaluation of stimuli which ensures that potentially relevant stimuli are preferentially processed over irrelevant (c.f., Hinojosa et al., 2009). This view has also received support from studies using non-language emotional stimuli such as pictures (see Olofsson et al., 2008 for review). It should not go unmentioned that there is also some sparse evidence that arousal of language stimuli can modulate early ERP components: Hofmann et al. (2009) reported that high-arousal negative words elicit an increased early negative ERP between 80 and 120 ms after word onset in contrast to neutral and low-arousal negative words when participants performed a lexical decision task. This greater ERP negativity was linked to an early effect of arousal on lexical access processing. Specifically, the authors interpreted the ERP amplitude differences between high-arousal negative and neutral words to reflect early facilitative lexical access for arousing negative stimuli suggesting an early influence of arousal on affective word processing. The same effect, however, was not found for positive word stimuli, that is, the general influence of arousal on early emotional word processing mechanisms remains to be further investigated. Given that different studies applied different tasks, it can also not be excluded that varying task demands (explicit emotional/arousal focus, implicit emotional/arousal focus) could partly account for the equivocal time-course findings in the literature.

We are unaware of electrophysiological studies exploring the influence of arousal and valence on emotional prosody processing in a combined experimental design. Thus, the current study tested the influence of arousal and valence on both early (P200) and late (LPC) ERP components by using pseudo-sentence stimuli intoned in six distinct emotional tones (anger, disgust, fear, sadness, surprise, happiness). Given that emotions can be expressed with either high or low arousal (e.g., one can say "stop" in a calm but firm, or in a shriek voice; both times expressing anger), stimuli were also grouped according to arousal level of the speakers, who intoned the sentences so that each emotional category contained sentences that were rated as either low or high arousing. To test for the influence of task focus, half of the participants were asked to rate the arousal level of the speaker who intoned the sentence they had just heard, while the other half was asked to rate how aroused they felt after listening to the sentence. Thus, task demands (e.g., processing effort) are comparable as in both instances, participants made use of a nine-point Likert scale; however, task focus is different in that one group focused on the arousal level of the presented stimuli (explicit task), while the other group focused on their own arousal level (implicit task). In view of previous findings from emotional visual language processing (see above), we hypothesized that stimuli expressing different emotions (and valences) would elicit differently modulated P200 amplitudes (rapid emotional salience detection) as well as differently modulated LPC amplitudes (sustained emotional evaluation). In contrast, arousal effects should only modulate ERPs in a later time-window (LPC) if true that emotional relevant attributes (e.g., saliency, category knowledge) are processed before arousal relevant attributes (i.e., determine the calmness/excitation of a stimulus). 
However, in light of findings which suggest that arousal effects might be modulated by task focus (explicit vs. implicit), a potential influence of arousal on early processing mechanisms could not be completely ruled out.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Forty right-handed native speakers of German (21 female, mean age: 25 years, range: 20–30 years) participated in the study. Data from one participant had to be excluded due to excessive muscle movements during the electroencephalogram (EEG) recording. None of the participants reported any hearing impairments, and all had normal or corrected-to-normal vision. Participants gave their written informed consent and the experiment was approved by the Ethics Committee of the Max Planck Institute (CBS, Leipzig). All participants were compensated financially for their participation.

#### **STIMULUS MATERIAL**

Emotional portrayals were elicited from two native German actors (one male, one female). Recordings were made with a digital camcorder connected to a high-quality clip-on microphone. During the recording session, actors produced pseudo-sentences, that is, sentences that contain prosodic information but no semantic content, belonging to one of six basic emotional categories (happiness, pleasant surprise, anger, disgust, fear, sadness) or to a neutral category. Stimuli were phonotactically and morpho-syntactically legal in German (example: Mon set die Brelle nogeferst and ingerafen). We presented a total of 360 emotional sentences (30 sentences per emotional category, each spoken by a male and a female speaker) and 50 different neutral filler sentences (again, each spoken by both speakers). Each neutral sentence was repeated three to four times (per speaker) throughout the experiment to ensure that an equal number of emotional and neutral stimuli were presented (360 sentences each). Given that neutral sentences lack the dimension of arousal (high vs. low), they were not included in the analysis and solely served as filler material. All sentences were rated for their emotional tone of voice by 24 participants (12 female, none of the raters participated in the present study) in a forced-choice paradigm. The mean percentage agreement for the sentences selected for the present study was: 90.66% for anger, 68.18% for disgust, 60.78% for fear, 68.17% for sadness, 57.55% for happiness, 54.34% for pleasant surprise, and the mean percentage correct for neutral was 90.09%. Further rating details can be found in Pell et al. (2009). See **Table 1** for results of acoustical analyses of stimuli. Comparable to the majority of previous studies exploring emotional prosody processing, stimuli were not artificially matched for amplitude, pitch, or tempo across emotional categories to ensure natural-like material. 
In addition, arousal ratings for stimuli were obtained from participants of the current study who rated materials for arousal level of the speaker (explicit task, see below). For each emotional category, sentences were grouped according to the arousal level as expressed by the speaker: for each speaker, we selected the 10 sentences that were ranked most highly to count as high arousing stimuli, and 10 sentences that were ranked lowest as low arousing stimuli. Thus, for each emotional category, 20 sentences were categorized as low and 20 sentences were categorized as high arousing stimuli. The 10 sentences rated as "medium" arousal were not included in the ERP analysis.
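
The grouping step described above can be illustrated with a minimal sketch. It assumes that each sentence has a single mean arousal rating and that ranking is done separately for each emotion and speaker; the function name `group_by_arousal`, the tuple layout, and the toy ratings are hypothetical and do not come from the study.

```python
from collections import defaultdict


def group_by_arousal(ratings, n_extreme=10):
    """Split sentences into high/low arousal sets per emotion.

    ratings: iterable of (sentence_id, emotion, speaker, mean_arousal).
    Ranking is done within each (emotion, speaker) cell, as an assumption
    about how the per-speaker selection in the text was carried out.
    """
    cells = defaultdict(list)
    for sentence_id, emotion, speaker, mean_arousal in ratings:
        cells[(emotion, speaker)].append((mean_arousal, sentence_id))

    groups = defaultdict(lambda: {"high": [], "low": []})
    for (emotion, _speaker), items in cells.items():
        if len(items) < 2 * n_extreme:
            raise ValueError("too few sentences for non-overlapping groups")
        ranked = sorted(items)  # ascending by mean arousal rating
        groups[emotion]["low"] += [sid for _, sid in ranked[:n_extreme]]
        groups[emotion]["high"] += [sid for _, sid in ranked[-n_extreme:]]
    return dict(groups)


# Toy data: 30 sentences per speaker for one emotion, with made-up ratings
# spread evenly over a nine-point scale.
toy = [
    (f"{speaker}_{i:02d}", "anger", speaker, 1 + (i * 8) / 29)
    for speaker in ("female", "male")
    for i in range(30)
]
grouped = group_by_arousal(toy)
print(len(grouped["anger"]["low"]), len(grouped["anger"]["high"]))
```

With 30 sentences per speaker and 10 selected at each extreme, every emotional category ends up with 20 low and 20 high arousing sentences, and the middle 10 per speaker are left out.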

#### **PROCEDURE**

After preparation for EEG recordings, participants were seated in an electrically shielded chamber at a distance of approx. 115 cm in front of a monitor. Auditory stimuli were presented via loudspeakers positioned directly to the left and right side of the monitor. Stimuli were pseudo-randomized and presented to the participant split into 10 blocks of 72 trials each. Half of the participants carried out an "implicit task" ("How aroused do you feel when listening to the sentence?") and the other half carried out an "explicit task" ("How aroused did the speaker feel when uttering the sentence?"). Task distribution was counterbalanced across participants. Each trial proceeded as follows: before the onset of each auditory stimulus, a fixation cross was presented in the center of the screen for 200 ms. This was immediately followed by the stimulus presentation (maximum sentence duration was 3000 ms). Following this, a nine-point arousal scale appeared on the screen for 200 ms, prompting the participant to respond. After the response, an inter-stimulus interval (ISI) of 1500 ms followed, before the next stimulus was presented. After each block, the participant paused for a self-determined duration before proceeding.

#### **ERP RECORDING**

The EEG was recorded from 49 Ag–AgCl electrodes mounted on a custom-made cap (Electro-Cap International) according to the modified expanded 10–20 system (Nomenclature of the American Electroencephalographic Society, 1991). Signals were recorded continuously with a band pass between DC and 70 Hz and digitized at a sampling rate of 500 Hz (Xrefa amplifier). The reference electrode was placed on the left mastoid. Bipolar horizontal and vertical EOGs were recorded for artifact rejection purposes. Electrode resistance was kept below 5 kΩ. Data were re-referenced offline to linked mastoids. The data were inspected visually in order to exclude trials containing extreme artifacts and drifts, and all trials containing EOG-artifacts above 30.00 µV were rejected automatically. In total, approximately 16% of the data was rejected. Trials were averaged over a time range of 200 ms before stimulus onset to 1000 ms after stimulus onset.
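
As a rough illustration of the epoching and automatic rejection step, the sketch below cuts epochs from 200 ms before to 1000 ms after stimulus onset at 500 Hz, discards epochs whose EOG exceeds 30 µV, and averages the rest. The rejection rule (peak absolute EOG amplitude within the epoch), the half-open epoch boundaries, and all function names are assumptions made for illustration; the text does not specify them, and visual inspection and re-referencing are not modeled.

```python
SAMPLING_RATE_HZ = 500   # from the text
EPOCH_START_MS = -200    # from the text
EPOCH_END_MS = 1000      # from the text
EOG_THRESHOLD_UV = 30.0  # from the text


def ms_to_samples(ms, rate_hz=SAMPLING_RATE_HZ):
    """Convert a latency in milliseconds to a sample offset."""
    return int(round(ms * rate_hz / 1000))


def extract_epoch(signal, onset_idx):
    """Return samples from -200 ms to 1000 ms around onset (half-open)."""
    start = onset_idx + ms_to_samples(EPOCH_START_MS)
    stop = onset_idx + ms_to_samples(EPOCH_END_MS)
    if start < 0 or stop > len(signal):
        return None  # epoch would run past the recording
    return signal[start:stop]


def average_clean_epochs(eeg, eog, onsets):
    """Average EEG epochs whose EOG stays within the threshold.

    Assumed criterion: reject when the peak absolute EOG value
    in the epoch exceeds EOG_THRESHOLD_UV.
    """
    kept = []
    for onset in onsets:
        eeg_epoch = extract_epoch(eeg, onset)
        eog_epoch = extract_epoch(eog, onset)
        if eeg_epoch is None or eog_epoch is None:
            continue
        if max(abs(v) for v in eog_epoch) > EOG_THRESHOLD_UV:
            continue
        kept.append(eeg_epoch)
    if not kept:
        return [], 0
    n = len(kept)
    erp = [sum(samples) / n for samples in zip(*kept)]
    return erp, n


# Toy single-channel recording with a simulated blink in the second epoch.
eeg = [1.0] * 2000
eog = [0.0] * 2000
eog[1300] = 45.0
erp, n_kept = average_clean_epochs(eeg, eog, onsets=[200, 1000])
print(len(erp), n_kept)
```

At 500 Hz one sample spans 2 ms, so under the half-open convention used here each epoch holds 600 samples.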

#### **DATA ANALYSIS**

For the ERP analysis, the electrodes were grouped according to regions of interest (ROIs). Left frontal electrode-sites: F5, F3, FC5, FC3; left central sites: C5, C3, CP5, CP3; left posterior sites: P5, P3, PO7, PO3; right frontal sites: F6, F4, FC6, FC4; right central sites: C6, C4, CP6, CP4; and right posterior sites: P6, P4, PO8, PO4. Based on visual inspection and previous evidence, an early time-window from 170 to 230 ms (P200 component, Paulmann et al., 2010) and a later time-window from 450 to 750 ms after sentence onset (LPC component, Lazlo and Federmeier, 2009) were selected for analysis of mean amplitudes.
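
The ROI and time-window definitions above translate directly into a mean-amplitude computation. The following sketch assumes an averaged ERP per electrode stored as a list of 600 samples (500 Hz, epoch starting 200 ms before onset) and treats each window as half-open; the dictionary keys and function names are hypothetical.

```python
ROIS = {
    "left_frontal": ["F5", "F3", "FC5", "FC3"],
    "left_central": ["C5", "C3", "CP5", "CP3"],
    "left_posterior": ["P5", "P3", "PO7", "PO3"],
    "right_frontal": ["F6", "F4", "FC6", "FC4"],
    "right_central": ["C6", "C4", "CP6", "CP4"],
    "right_posterior": ["P6", "P4", "PO8", "PO4"],
}
WINDOWS_MS = {"P200": (170, 230), "LPC": (450, 750)}


def window_indices(start_ms, end_ms, epoch_start_ms=-200, rate_hz=500):
    """Map a latency window to sample indices within the epoch."""
    first = (start_ms - epoch_start_ms) * rate_hz // 1000
    last = (end_ms - epoch_start_ms) * rate_hz // 1000
    return first, last  # used as a half-open slice (assumption)


def roi_mean_amplitude(erp_by_electrode, roi, window):
    """Mean amplitude over all electrodes of an ROI within a window."""
    first, last = window_indices(*WINDOWS_MS[window])
    values = [
        v
        for electrode in ROIS[roi]
        for v in erp_by_electrode[electrode][first:last]
    ]
    return sum(values) / len(values)


# Toy ERPs: flat at 1.0, with a 3.0 plateau in the P200 window
# at left frontal electrodes only.
toy_erp = {name: [1.0] * 600 for names in ROIS.values() for name in names}
first, last = window_indices(*WINDOWS_MS["P200"])
for name in ROIS["left_frontal"]:
    for i in range(first, last):
        toy_erp[name][i] = 3.0

print(roi_mean_amplitude(toy_erp, "left_frontal", "P200"))
print(roi_mean_amplitude(toy_erp, "left_frontal", "LPC"))
```

One such value per participant, condition, ROI, and window would then form the input to the ANOVA.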

Mean amplitudes were entered into a repeated-measures ANOVA using the within-subject factors arousal (high, low), emotion (anger, disgust, fear, sadness, happiness, pleasant surprise), region of interest [six ROIs: left/right frontal (LF/RF), left/right central (LM/RM), left/right posterior (LP/RP) electrode-sites], and the between-subjects factor task (implicit/explicit stimulus evaluation). Customized tests of hypotheses (*post hoc* tests) were carried out using a modified Bonferroni correction for multiple comparisons when appropriate (see Keppel, 1991). Therefore, in cases where all emotions were contrasted with one another (15 contrasts in total), the alpha level for significance testing was set at *p* < 0.017 and not at *p* < 0.05. Comparisons with more than one degree of freedom in the numerator were corrected for non-sphericity using the Greenhouse–Geisser correction (Greenhouse and Geisser, 1959). The graphs displayed were filtered with a 7 Hz low-pass filter.
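
The adjusted alpha level can be reproduced with a short calculation. The sketch assumes the common reading of the modified Bonferroni procedure, in which the per-comparison alpha equals the degrees of freedom of the effect times the nominal alpha, divided by the number of contrasts; the text only cites Keppel (1991) and gives the resulting threshold, so the formula and the function name are assumptions.

```python
from math import comb


def modified_bonferroni_alpha(n_levels, n_contrasts, familywise_alpha=0.05):
    """Per-comparison alpha under the assumed modified Bonferroni rule.

    alpha_per_test = (n_levels - 1) * familywise_alpha / n_contrasts,
    capped at the nominal alpha.
    """
    df_effect = n_levels - 1
    return min(familywise_alpha, df_effect * familywise_alpha / n_contrasts)


n_emotions = 6
n_pairwise = comb(n_emotions, 2)  # all pairwise emotion contrasts
alpha = modified_bonferroni_alpha(n_emotions, n_pairwise)
print(n_pairwise, round(alpha, 3))
```

Six emotional categories yield 15 pairwise contrasts, and the assumed rule gives 5 × 0.05 / 15 ≈ 0.017, which matches the threshold reported above.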

#### **RESULTS**

For the ease of reading, only significant main effects and interactions involving the critical factors *emotion, arousal,* and/or *task* are reported.

#### **P200 MEAN AMPLITUDES (170–230 ms)**

In the early time-window, a significant effect of *emotion* [*F*(5, 185) = 3.25, *p* = 0.01] was found, revealing differently modulated amplitudes for sentences spoken in the different tones of voice. This main effect was qualified by a significant two-way interaction between *emotion* and *ROI* [*F*(25, 925) = 2.38, *p* = 0.01]. *Post hoc* contrasts at each ROI revealed the following patterns. At LF electrode-sites, sentences intoned in an angry tone differed significantly from sentences intoned in a fearful voice [*F*(1, 37) = 10.21, *p* < 0.01], as well as in a sad tone [*F*(1, 37) = 22.43, *p* < 0.0001]. At this ROI, there was also a marginal difference between pleasant surprise and sad sentences [*F*(1, 37) = 4.42, *p* < 0.05]. At LM sites, angry sentences could again be distinguished from fearful [*F*(1, 37) = 10.38, *p* < 0.01] and sad [*F*(1, 37) = 4.42, *p* = 0.019] sentences. Additionally, ERPs for disgust sentences differed marginally from fearful sentences [*F*(1, 37) = 4.82, *p* < 0.05]. Fearful sentences also differed significantly from happy [*F*(1, 37) = 8.38, *p* < 0.01] and marginally from pleasant surprise sentences [*F*(1, 37) = 4.33, *p* < 0.05]. ERPs in response to happy and sad sentences also differed marginally [*F*(1, 37) = 4.54, *p* < 0.05] at LM sites, as well as at LP sites [*F*(1, 37) = 4.45, *p* < 0.05]. At RF sites, ERPs in response to angry sentences differed from ERPs in response to fearful [*F*(1, 37) = 10.37, *p* < 0.01] and sad [*F*(1, 37) = 19.71, *p* < 0.0001] sentences. ERPs to fearful sentences also differed from pleasant surprise [*F*(1, 37) = 8.53, *p* < 0.01] and marginally from happy [*F*(1, 37) = 5.02, *p* = 0.03] sentences. The same was found for the contrasts between ERPs in response to sad sentences and pleasant surprise [*F*(1, 37) = 11.71, *p* < 0.01] and sad and happy [*F*(1, 37) = 6.13, *p* = 0.018] sentences. 
At RM sites, a similar pattern emerged: ERPs in response to angry sentences differed significantly from fearful [*F*(1, 37) = 10.90, *p* < 0.01] and sad sentences [*F*(1, 37) = 17.10, *p* < 0.001], and marginally from disgust sentences [*F*(1, 37) = 4.33, *p* < 0.05]. Moreover, ERPs in response to disgust and happy sentences differed [*F*(1, 37) = 6.35, *p* < 0.017] as did ERPs in response to fearful and happy [*F*(1, 37) = 7.58, *p* < 0.01] and pleasant surprise [*F*(1, 37) = 8.08, *p* < 0.01]. ERPs in response to sad sentences differed significantly from happy [*F*(1, 37) = 7.83, *p* < 0.01] and pleasant surprise [*F*(1, 37) = 9.13, *p* < 0.01] sentences. No significant differences were found at RP sites.

Finally, there was also a marginally significant main effect of *arousal* [*F*(1, 37) = 3.28, *p* = 0.078], revealing a stronger positivity for high arousing stimuli when compared to low arousing stimuli. No other main effects or interactions turned out to be significant. See **Figures 1** and **3** for visualization of effects.

In summary, data analysis confirms a significant *emotion* effect revealing early differentiation of vocal emotional expressions in the P200 amplitude though individual contrasts between specific emotional tones seem to vary as a function of distribution. Significant differentiation effects are primarily found at frontal and central electrode-sites. Also, the analysis revealed a marginally significant effect of *arousal* with high arousing stimuli eliciting more positive P200 amplitudes than low arousing stimuli. Finally, there was no indication that task instructions influenced P200 amplitude modulation.

#### **LPC MEAN AMPLITUDES (450–750 ms)**

In the later time-window, a significant effect of *emotion* was found [*F*(5, 185) = 7.22, *p* < 0.0001], revealing differently modulated LPC amplitudes for the different emotional sentences. All *post hoc* contrasts comparing each emotion with one another turned out to be significant (all *F*'s > 6.5; all *p*'s < 0.017). The main effect of *emotion* also interacted with *ROI* [*F*(25, 925) = 4.48, *p* < 0.0001], suggesting distribution differences for the *emotion* effect. *Post hoc* contrasts at LM sites revealed significant differences between ERPs in response to disgust and angry [*F*(1, 37) = 13.08, *p* < 0.001], fearful [*F*(1, 37) = 6.91, *p* < 0.017], happy [*F*(1, 37) = 12.58, *p* < 0.001], and pleasant surprise [*F*(1, 37) = 13.66, *p* < 0.001] sentences. At this ROI, the contrast between happy and sad sentences also turned out to be marginally significant [*F*(1, 37) = 4.68, *p* < 0.05]. At LP sites, the following contrasts reached significance: anger vs. disgust [*F*(1, 37) = 31.08, *p* < 0.0001]; anger vs. fear [*F*(1, 37) = 12.69, *p* < 0.001]; anger vs. sadness [*F*(1, 37) = 22.32, *p* < 0.0001]; disgust vs. happiness [*F*(1, 37) = 17.10, *p* < 0.001]; disgust vs. pleasant surprise [*F*(1, 37) = 26.24, *p* < 0.0001]; fear vs. happiness [*F*(1, 37) = 9.03, *p* < 0.01]; fear vs. pleasant surprise [*F*(1, 37) = 14.30, *p* < 0.001]; happiness vs. sadness [*F*(1, 37) = 38.29, *p* < 0.0001]; and pleasant surprise vs. sadness [*F*(1, 37) = 23.14, *p* < 0.0001]. At RF electrode-sites, ERPs in response to disgust sentences differed significantly from fearful [*F*(1, 37) = 13.10, *p* < 0.001], happy [*F*(1, 37) = 10.89, *p* < 0.01], and pleasant surprise [*F*(1, 37) = 7.69, *p* < 0.01] sentences. The contrast between disgust and sad sentences almost reached significance [*F*(1, 37) = 4.20, *p* < 0.05], as did the contrast between happy and sad sentences [*F*(1, 37) = 5.65, *p* < 0.03]. 
At RM electrode-sites, results revealed (marginally) significant differences between LPC amplitudes for angry and disgust [*F*(1, 37) = 19.37, *p* < 0.0001] and angry and sad [*F*(1, 37) = 5.50, *p* < 0.03] sentences. Disgust sentences were found to differ from all other emotional sentences except for sad stimuli (all *F*'s > 17.0 and all *p*'s < 0.001). LPCs for sad and happy sentences also differed [*F*(1, 37) = 19.26, *p* < 0.0001] as did sad and pleasant surprise sentences [*F*(1, 37) = 7.12, *p* < 0.01]. Finally, at RP sites, ERPs to angry sentences differed from ERPs to disgust [*F*(1, 37) = 31.56, *p* < 0.0001], fearful [*F*(1, 37) = 5.90, *p* = 0.02], and sad [*F*(1, 37) = 14.78, *p* < 0.001] sentences. Similar to RM sites, disgust sentences were again found to differ from all other emotional sentences except for sad sentences (all *F*'s > 10.36 and all *p*'s < 0.001). In addition, ERPs in response to fearful sentences were significantly different from ERPs in response to pleasant surprise [*F*(1, 37) = 8.45, *p* < 0.001] and marginally different from ERPs in response to happy [*F*(1, 37) = 8.45, *p* = 0.02] sentences. Comparable to LP sites, sad sentences also elicited different LPC amplitudes to happy [*F*(1, 37) = 26.47, *p* < 0.0001] and pleasant surprise [*F*(1, 37) = 15.32, *p* < 0.001] sentences at RP sites.

The analysis also revealed a marginally significant main effect of *arousal* [*F*(1, 37) = 3.29, *p* = 0.08] as well as a significant interaction between *arousal* and *ROI* [*F*(5, 185) = 3.20, *p* < 0.05]. Arousal effects were significant at LM, LP, and RP ROIs (all *F*'s > 4.94 and all *p*'s < 0.05). In all instances, high arousing stimuli elicited more positive-going amplitudes than low arousing stimuli. Last, there was a significant three-way interaction *emotion* × *arousal* × *ROI* [*F*(25, 925) = 1.98, *p* < 0.05], but step-down analyses by *ROI* revealed no further significant effects. See **Figures 2** and **3** for visualization of effects.

In sum, analyses for LPC amplitudes revealed that different emotional prosodies can be distinguished from one another in this late time-window. In addition, arousal effects turned out to be significant. Again, there was no indication that task instructions influence this differentiation in the present data.

#### **DISCUSSION**

To the best of our knowledge, this is the first ERP study simultaneously investigating the temporal dynamics of emotion and arousal effects on early (P200) and late (LPC) ERP components when processing affective information from prosodic speech materials. We report an early differentiation of six basic emotions as reflected in differently modulated P200 amplitudes at fronto-central electrode-sites. In addition, high arousing stimuli elicited slightly stronger P200 amplitudes than low arousing stimuli. The P200 effect was followed by an LPC in which the different emotions could again be differentiated from each other. Also, high arousing stimuli elicited larger LPCs than low arousing stimuli. No interaction between the two factors nor an influence of task focus was found in either time-window. Taken together, the results are thus in line with reports from visual affective language and picture processing which suggest that emotion or valence relevant information is extracted before arousal relevant information (e.g., Keil et al., 2002; Gianotti et al., 2008). Below, we will outline how the current results contribute to our understanding of affective prosody processing.

#### **P200**

Differently modulated P200 amplitudes in response to emotional speech materials have been repeatedly reported in the literature (e.g., Paulmann and Kotz, 2008; Paulmann et al., 2010; Schirmer et al., 2013; Garrido-Vásquez et al., in press). However, previously, authors only tested whether emotional materials could be differentiated from neutral materials. Here, we extend these findings by reporting that different emotional categories can also be distinguished from one another in this early time-window. This is in line with an earlier tentative suggestion that specific emotional categories can be inferred from rather short stimulus durations (e.g., Paulmann and Pell, 2010), i.e., within 200 ms of stimulus onset. We have previously theorized that early emotional detection as reflected in the P200 is primarily based on the integration of emotionally relevant salient acoustic features including pitch, tempo, voice quality, and loudness. Some authors have claimed that the sensitivity of the P200 to physical stimulus attributes undermines the interpretation that it can reflect early emotional decoding (see Schirmer et al., 2013). However, the fact that stimuli with a similar acoustic profile (e.g., fear and disgust, see **Table 1**) can still be differentiated in the P200 makes this criticism less severe. In fact, more systematic P200 variations should be found if the early effect were only driven by a single acoustic parameter (i.e., stimuli with the same intensity or same pitch should elicit non-differentiable P200 amplitudes). Moreover, there is evidence from Stekelenburg and Vroomen (2007, 2012) which shows dissociations between N1 effects that were linked to processing general visual/auditory physical stimulus characteristics and P2 effects which were linked to processing of phonetic, semantic, or associative information. Taken together, it thus seems reasonable to suggest that P200 amplitude differences reflect emotional salience detection rather than sensory processing only. Crucially, researchers exploring emotional prosody perception have previously argued that matching acoustical attributes across stimuli from different categories would result in a serious reduction of the emotionality conveyed by a specific stimulus (e.g., Wiethoff et al., 2008) given that emotions are conveyed through a specific combination of different acoustic features (e.g., Banse and Scherer, 1996; Paulmann et al., 2008a). Artificially changing or removing these features results in ecologically less valid stimuli. Finally, in the neuro-imaging literature, some authors (e.g., Alba-Ferrara et al., 2011) have tried to statistically control for the influence of primary acoustic features such as pitch. Generally, similar brain activation patterns were found for stimuli that differed with regard to specific acoustical features (e.g., pitch), once more suggesting that emotional prosody evaluation is not driven by a single parameter. Rather, specific acoustic configuration patterns seem to convey emotionality through the voice. Future research should thus aim to explore which combination of acoustic parameters drives early emotional evaluation.

**FIGURE 2 | The illustration shows mean LPC amplitudes (in mV) for each emotional category at left/right central and left/right parietal electrode-sites.**

**selected electrodes for high/low arousing stimuli (A) and the six different emotional categories tested (B).** The left panel shows average waveforms for high (solid) and low (dotted) arousing stimuli from 200 ms side, average waveforms for different emotional prosodies are displayed from 200 ms before sentence onset up to 800 ms after stimulus onset.

The present findings also revealed a marginally significant P200 effect of arousal irrespective of the emotional category tested. While most previous studies report late arousal effects for visually presented emotional materials (e.g., Herbert et al., 2006; Schupp et al., 2007; Hinojosa et al., 2009), there are some indications that arousal information can be extracted at an early stage of emotional processing, too (Hofmann et al., 2009; Feng et al., 2012). For instance, Hofmann and colleagues report an early EPN effect of arousal when processing negative (but not positive) word stimuli while participants carried out a lexical decision task suggesting that arousal characteristics can facilitate lexical processing. Feng et al. (2012) describe that the P2b component was influenced by arousal in their implicit picture viewing task (participants had to identify the color of the picture frame). High arousing pictures elicited larger P2 amplitudes at posterior electrode-sites than low arousing pictures. The discrepancy between studies reporting only late arousal effects and those reporting early arousal effects is often linked to task differences. Both Hofmann et al. (2009) and Feng et al. (2012) used tasks in which participants were not required to focus on emotional *or* arousal attributes while both of our task instructions focused on arousal attributes. Future studies should thus explore whether early effects only robustly arise if emotional or arousal evaluation is not in task focus.

Alternatively, discrepancies in the literature with regard to the temporal dynamics of effects could result from differences in stimulus duration. Early arousal effects for non-language stimuli have usually been reported for stimuli that were only *briefly* presented (e.g., 300 ms in Feng et al., 2012, or 120 ms in Schupp et al., 2004). The explanation seems less likely to apply to language-relevant stimuli, though, given that early arousal effects are reported for words that were presented for 1000 ms (Hofmann et al., 2009). Together with the marginal effect reported here, this suggests that for language stimuli, stimulus duration is not crucially influencing arousal effects.

In sum, the current findings suggest that emotionality detection seems to be more relevant to listeners than extraction of arousal information at an early processing stage. However, arousal characteristics of stimuli do not seem to go completely unnoticed. We thus theorize that the P200 is robustly modulated by emotional significance of an affective prosodic stimulus independent of task focus as no task differences were found in the present or in previous studies. Though less robust, the P200 can also be modulated by arousal features of stimuli suggesting that arousal level of speakers can impact on the way they produce emotional prosody. Hence, we propose that the P200 reflects early facilitated processing of motivationally or emotionally relevant stimuli. These intrinsic relevant features are transmitted through a combination of different acoustic parameters thus leaving open the possibility that part of this early emotional detection mechanism is influenced by sensory processing.

#### **LATE POSITIVE COMPLEX**

Next to assessing whether arousal and emotionality of stimuli can influence early processing mechanisms, the present study also set out to investigate to what extent the later LPC can be influenced by these two factors. Results showed that all emotional expressions elicited differently modulated LPCs at central-posterior electrode-sites (bilaterally). In addition, LPC amplitudes were differently modulated for high as opposed to low arousing stimuli irrespective of which emotion they belonged to. This effect was slightly more prominent at left centro-parietal electrodes than at their right-lateralized counterparts. No influence of task focus was observed in the present study, and we also failed to find an interaction between emotion and arousal attributes of stimuli.

The finding that different emotions elicit differently modulated and differently distributed LPC amplitudes fits well with observations from the imaging literature on emotional language processing, which revealed a diversified bilateral brain network of cortical and sub-cortical brain structures underlying emotion processing in speech (e.g., Kotz et al., 2003; Grandjean et al., 2005; Wildgruber et al., 2005; Ethofer et al., 2009; and see e.g., Kotz and Paulmann, 2011 for review). Moreover, imaging studies that explored both arousal and valence, seem to suggest that two distinct neural systems underlie the processing of these two dimensions. In these studies, arousal processing has predominantly been linked to sub-cortical brain structures (e.g., amygdala) while emotion processing has been linked to frontal cortex activity (e.g., Lewis et al., 2007). The present data show a similar neural dissociation as distribution of arousal effects clearly differed from the distribution of emotion effects (e.g., the arousal effect was primarily visible over left hemisphere electrode-sites). It thus seems sensible to suggest that emotion and arousal processing relies at least partially on differing neural mechanisms. However, given that ERPs lack the accurate spatial resolution of other imaging techniques this interpretation of distribution differences remains tentative.

As for a functional interpretation of the LPC, previous visual emotion studies have interpreted the component as reflecting enhanced or continuous analysis of emotionally relevant visual stimuli (e.g., Cuthbert et al., 2000; Kanske and Kotz, 2007; Hinojosa et al., 2009; Bayer et al., 2010; Leite et al., 2012). Here, we propose to extend this interpretation to stimuli that convey emotionality or arousal only through the tone of voice that they are uttered in. In line with multi-step processing models of affective prosody (e.g., Schirmer and Kotz, 2006; Kotz and Paulmann, 2011), the present findings confirm that early emotional salience detection is followed by more elaborate processing of stimuli. Specifically, we suggest that larger LPC effects for high arousing stimuli reflect sustained processing of salient affective information which might ultimately lead to preferential processing of emotionally relevant stimuli, similar to reports of other previously observed later ERP components (e.g., late negativity in Paulmann et al., 2011). In a recent emotional prosody processing study, Schirmer et al. (2013) present findings that modulations in the early P200 component can predict evaluation differences in the subsequently observed LPC component. While the influence of the P200 on the concurrent LPC was not directly tested here, it seems reasonable to assume that stimuli which have been identified as potentially relevant (e.g., due to their emotionality or arousal level) need to be thoroughly processed and analyzed to ensure appropriate subsequent social behavior (e.g., fight/flight). While arousal effects were only marginally significant in the P200 component, the LPC seems to be robustly modulated by both the arousal and emotion dimension, though no interaction between the two factors was observed (cf. Leite et al., 2012 for a similar finding when participants had to view pictures). 
That is, the present findings are in line with the view that the LPC might simply reflect enhanced processing of stimuli that carry potentially relevant affective information (e.g., Cuthbert et al., 2000; Bayer et al., 2010; Leite et al., 2012). This processing step seems to be unrelated to arousal level of participants (i.e., how much they potentially engage with the stimulus) as we find significant LPC effects under both task instructions tested (c.f., Bayer et al., 2010 for similar interpretation of LPC effects for visual sentence processing).

#### **THE INFLUENCE OF TASK INSTRUCTIONS ON THE P200 AND LPC**

The present experimental design also allowed us to test for the influence of task instructions on the P200 and LPC components. In the "implicit" task condition, participants were asked to evaluate their own arousal level after listening to the stimulus, while in the "explicit" task condition they were required to evaluate the arousal level of the speaker. Thus, the only difference between the two tasks was the level of attention that participants had to pay to our stimuli. No influence of task instructions was observed, suggesting that both early and subsequent, more enhanced affective analyses are largely independent of participants' task focus. The lack of task influence for early emotional decoding (P200 component) has previously been documented (Garrido-Vásquez et al., in press). Here, we extend previous findings by reporting evidence which suggests that a possible early evaluation of arousal attributes is also not dependent on task instructions. Hence, the P200 component seems to be robustly elicited irrespective of how much participants need to attend to the stimulus.

In contrast, the LPC is reported to be more vulnerable to task demands. For instance, Schacht and Sommer (2009) report enhanced LPC amplitudes to emotional words only when participants engaged in lexical or semantic task evaluations, but not when participants had to report whether they saw an italicized letter (structural task). Here, participants had to attend to the affective attributes of stimuli in some way, which could explain why LPC amplitudes did not differ between our two tasks. Future studies will have to shed further light on the impact task effects can have on the LPC when task foci are very different. For now, we propose that the LPC in response to affective auditory stimuli is not heavily influenced by task focus as long as participants pay at least some attention to the affective properties of stimuli. This idea is in line with results from the visual emotion literature (e.g., Feng et al., 2012) showing that emotion and arousal can affect affective processing stages even when participants engage in implicit tasks and do not have a "task-related motivation" to analyze stimuli.

#### **CONCLUSION**

This study set out to explore the influence of emotion and arousal on early and later ERP components. In line with findings from the literature on visual emotion processing, our results suggest that emotion-relevant details are robustly decoded during early (P200) and late (LPC) processing stages, while arousal information is only reliably taken into consideration at later stages of processing. Given the lack of an interaction between the two factors of interest, the results also suggest that the two dimensions are largely independent of each other (cf. Russell, 1980), at least when stimuli are attended to and somewhat task-relevant.

#### **ACKNOWLEDGMENTS**

The authors would like to thank Andrea Gast-Sandmann for help with graphical presentation and Kristiane Werrmann for help with data acquisition. Funding: German Research Foundation (DFGFOR-499 to Sonja A. Kotz).

#### **REFERENCES**

standard electrode position nomenclature. *J. Clin. Neurophysiol.* 8, 200–202. doi: 10.1097/00004691-199104000-00007

words within sentences: the impact of arousal and valence on event-related potentials. *Int. J. Psychophysiol.* 78, 299–307. doi: 10.1016/j.ijpsycho.2010.09.004

Bostanov, V., and Kotchoubey, B. (2004). Recognition of affective prosody: continuous wavelet measures of event-related brain potentials to emotional exclamations. *Psychophysiology* 41, 259–268. doi: 10.1111/j.1469-8986.2003.00142.x

Cuthbert, B. N., Schupp, H. T., Bradley, M. M., Birbaumer, N., and Lang, P. J. (2000). Brain potentials in affective picture processing: covariation with autonomic arousal and affective report. *Biol. Psychol.* 52, 95–111. doi: 10.1016/S0301-0511(99)00044-7

M. A. (2009). Arousal contributions to affective priming: electrophysiological correlates. *Emotion* 9, 164–171. doi: 10.1037/a0014680

memory: neural correlates and interindividual differences. *Cogn. Affect. Behav. Neurosci.* 13, 80–93. doi: 10.3758/s13415-012-0132-8

39, 885–893. doi: 10.1016/j.neuroimage.2007.09.028

Wildgruber, D., Riecker, A., Hertrich, I., Erb, M., Grodd, W., Ethofer, T., et al. (2005). Identification of emotional intonation evaluated by fMRI. *Neuroimage* 24, 1233–1241. doi: 10.1016/j.neuroimage.2004.10.034

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 08 April 2013; accepted: 28 May 2013; published online: 21 June 2013.*

*Citation: Paulmann S, Bleichner M and Kotz SA (2013) Valence, arousal, and task effects in emotional prosody processing. Front. Psychol. 4:345. doi: 10.3389/fpsyg.2013.00345*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Paulmann, Bleichner and Kotz. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition

#### *Simon Rigoulot 1,2\*, Eugen Wassiliwizky1,3 and Marc D. Pell 1,2*

*<sup>1</sup> Faculty of Medicine, School of Communication Sciences and Disorders, McGill University, Montreal, QC, Canada*

*<sup>2</sup> McGill Centre for Research on Brain, Language and Music, Montreal, QC, Canada*

*<sup>3</sup> Cluster of Excellence "Languages of Emotion", Freie Universität Berlin, Berlin, Germany*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*David V. Becker, Arizona State University, USA*

*Emiel Krahmer, Tilburg University, Netherlands*

#### *\*Correspondence:*

*Simon Rigoulot, Faculty of Medicine, School of Communication Sciences and Disorders, McGill University, 1266 Avenue des Pins Ouest, Montreal, QC H3G 1A8, Canada e-mail: simon.rigoulot@mail.mcgill.ca*

Recent studies suggest that the time course for recognizing vocal expressions of basic emotion in speech varies significantly by emotion type, implying that listeners uncover acoustic evidence about emotions at different rates in speech (e.g., *fear* is recognized most quickly whereas *happiness* and *disgust* are recognized relatively slowly; Pell and Kotz, 2011). To investigate whether vocal emotion recognition is largely dictated by the amount of time listeners are exposed to speech or the position of critical emotional cues in the utterance, 40 English participants judged the meaning of emotionally-inflected pseudo-utterances presented in a gating paradigm, where utterances were gated as a function of their syllable structure in segments of increasing duration from the *end* of the utterance (i.e., gated syllable-by-syllable from the *offset* rather than the onset of the stimulus). Accuracy for detecting six target emotions in each gate condition and the mean identification point for each emotion in milliseconds were analyzed and compared to results from Pell and Kotz (2011). We again found significant emotion-specific differences in the time needed to accurately recognize emotions from speech prosody, and new evidence that utterance-final syllables tended to facilitate listeners' accuracy in many conditions when compared to utterance-initial syllables. The time needed to recognize *fear*, *anger*, *sadness*, and *neutral* from speech cues was not influenced by how utterances were gated, although *happiness* and *disgust* were recognized significantly faster when listeners heard the end of utterances first. Our data provide new clues about the relative time course for recognizing vocally-expressed emotions within the 400–1200 ms time window, while highlighting that emotion recognition from prosody can be shaped by the temporal properties of speech.

#### **Keywords: vocal emotions, prosody, speech perception, auditory gating, acoustics**

#### **INTRODUCTION**

Emotional events, and more specifically social displays of emotion—the expression of a face, the tone of a speaker's voice, and/or their body posture and movements—must be decoded successfully and *quickly* to avoid negative outcomes and to promote individual goals. Emotional expressions vary according to many factors, such as their mode of expression (auditory/visual), valence (positive/negative), power to arouse (low/high), antecedents, and potential outcomes (see Scherer, 2009 for a discussion). As early as the seventeenth century, these differences raised the question of the *specificity* of emotions; in his Traité *"Les Passions de l'Ame,"* the French philosopher Descartes proposed the existence of six "primary" emotions from which all other emotions are derived. In recent decades, studies demonstrating accurate pan-cultural recognition of emotional faces (Izard, 1971; Ekman, 1972) and distinct patterns of autonomic nervous system activity in response to certain emotions (e.g., Ekman et al., 1983; Levenson, 1992) have served to fuel the idea of a fixed set of discrete and hypothetically "basic" emotions, typically *anger, fear, disgust, sadness,* and *happiness*, although opinions vary (see Ekman, 1992; Sauter et al., 2010). Within this theoretical framework, expressions of basic emotion possess unique physical characteristics that render them discrete in communication when conveyed in the face as well as in the voice (Ekman, 1992), although the vast majority of this work has focused on communication in the facial channel.

The structure of *vocal* emotion expressions embedded in spoken language, or *emotional prosody*, is now being investigated systematically from different perspectives. Perceptual-acoustic studies show that basic emotions can be reliably identified and differentiated at high accuracy levels from prosodic cues alone, and that these expressions are marked by distinct acoustic patterns characterized by differences in perceived duration, speech rate, intensity, pitch register and variation, and other speech parameters (among many others, Cosmides, 1983; Scherer et al., 1991; Banse and Scherer, 1996; Sobin and Alpert, 1999; Johnstone and Scherer, 2000; Juslin and Laukka, 2003; Laukka et al., 2005; Pell et al., 2009). For example, speech rate tends to decrease when speakers are sad and increase when speakers experience fear; at the same time, differences in relative pitch height, variation, and other cue configurations serve to differentiate these (and other) emotional meanings (see Juslin and Laukka, 2003 for a comprehensive review). Similar to observations in the visual modality, cross-cultural studies on the identification of vocal emotions show that *anger*, *fear*, *sadness*, *happiness*, and *disgust* can be recognized by listeners at levels significantly above chance when they hear semantically-meaningless "pseudo-utterances" or utterances spoken in a foreign language (Scherer et al., 2001; Thompson and Balkwill, 2006; Pell et al., 2009; Sauter et al., 2010). These data argue that basic emotions conveyed by speech prosody exhibit a core set of unique physical/acoustic properties that are emotion-specific and seemingly shared across languages (Scherer et al., 2001; Pell et al., 2009).

A critical process that has been underestimated in the characterization of how vocal emotions are communicated is the *time course* for recognizing basic emotions in speech. In the visual modality, the time course for recognizing emotional facial expressions has been investigated by presenting static displays of facial expressions (Tracy and Robins, 2008) or animated face stimuli (Becker et al., 2012). In this latter study, the authors used a morphed continuum running from a neutral exemplar to either a happy or an angry expression and found that happy faces were recognized faster than angry faces, suggesting temporal specificities in the process for recognizing basic emotions in the visual modality (see Palermo and Coltheart, 2004). Since emotional meanings encoded by prosody can *only* be accessed from their temporal acoustic structure, it is surprising that comparative data on the time course for recognizing basic emotions from prosody remain sparse.

Recently, two studies (Cornew et al., 2010; Pell and Kotz, 2011) examined the temporal processing of vocal emotion expressions using a modified version of Grosjean's (1980) gating paradigm. The auditory gating procedure, originally designed to pinpoint how much acoustic information is needed for lexical access and word recognition, consists of artificially constructing "gates" as a function of specific time increments or of relevant linguistic units of spoken language; the gated stimuli are judged by listeners in blocks of increasing gate duration, typically starting at the onset of the relevant stimulus, where the last gate presented usually corresponds to the entire stimulus event (see Grosjean, 1996 for a discussion of methodological variables). An emotional variant of this paradigm considers how much acoustic information is needed for vocal emotions to be registered and consciously accessed for explicit recognition, using a forced-choice emotion-labeling paradigm. Given the hypothesis that acoustic patterns reflect "natural codes" that progressively activate stored conceptual information about basic emotions (e.g., Schirmer and Kotz, 2006; Wilson and Wharton, 2006), this emotional gating procedure allows inferences about the time course of emotion processing in the specific context of speech, and whether the time needed varies as a function of the emotional signal being transmitted.

In the first study, Cornew and colleagues (2010) presented English-like pseudo-utterances spoken in a *happy*, *angry*, or *neutral* prosody to English listeners, spliced into 250 millisecond (ms) gates of increasing duration. Following each stimulus, participants made a three-choice forced response to identify the meaning conveyed. The authors found that listeners required less time (i.e., exposure to acoustic information) to identify *neutral* sentences when compared to *angry* and *happy* sentences, suggesting that vocal emotion expressions unfold at different rates (an effect the authors attributed to a *neutral* bias in perception). The idea that vocal emotions unfold at different rates was replicated by Pell and Kotz (2011), who gated English-like pseudo-utterances as a function of their *syllable structure* as opposed to specific time increments. Forty-eight English participants listened to seven-syllable utterances conveying one of five basic emotions (anger, disgust, fear, sadness, happiness) or neutral prosody, beginning with presentation of only the first syllable of the utterance, the first two syllables, and so forth until the full sentence was presented (a six-choice forced response was recorded). Emotion identification times were then calculated by converting the number of syllables needed to accurately identify the target emotion of each utterance (without further changes in the participant's response at longer gate intervals) into the actual duration needed for recognition.

Results showed that there were important emotion-specific differences in the accuracy and time course for recognizing vocal emotions, with specific evidence that *fear*, *sadness*, *neutral*, and *anger* were recognized from significantly less acoustic information than *happiness* or *disgust*, from otherwise identical pseudo-utterances. Prosodic cues conveying *neutral*, *fear*, *sadness*, and *anger* could be detected from utterances lasting approximately 500–700 ms (*M* = 510, 517, 576, and 710 ms, respectively), whereas *happiness* (*M* = 977 ms) and *disgust* (*M* = 1486 ms) required substantially longer stimulus analysis. Despite the fact that Cornew et al. (2010) focused on a restricted set of emotions when compared to Pell and Kotz (3-choice vs. 6-choice task), and gated their stimuli in a different manner (250 ms increments vs. syllables), there were notable similarities between the two studies in the average times needed to identify neutral (444 vs. 510 ms), angry (723 vs. 710 ms), and happy expressions (802 vs. 977 ms, respectively), although Pell and Kotz's (2011) results show that this does not reflect a bias for recognizing neutral prosody as initially proposed (Cornew et al., 2010). Together, these studies establish that the time course of vocal emotion recognition in speech varies significantly according to the emotional meaning being conveyed, in line with results demonstrating emotion-specificity in facial emotion recognition (Becker et al., 2012), although the relative pattern of emotion-specific differences observed in the auditory vs. visual modality appears to be quite different, as noted elsewhere in the literature using different experimental paradigms (e.g., Wallbott and Scherer, 1986; Paulmann and Pell, 2011).

Of interest here, closer inspection of Pell and Kotz's (2011) data reveals that recognition of *happiness* and *disgust*, in contrast to other basic emotions, improved at relatively long utterance durations (5–7 syllables); in fact, when full sentences were presented, recognition of *happy* prosody was comparable in accuracy to *sadness, anger,* and *fear* despite the fact that these latter emotions were recognized much more accurately than happiness following brief stimulus exposure. Some emotions such as *happiness* and *fear* seemed to be particularly salient when the last syllables were presented, leading to significant increases in recognition accuracy at the end of utterances in that study. These results imply that the amount of time needed to identify basic emotions from prosody depends partly on the *position* of salient acoustic properties in speech, at least for certain emotions. Interestingly, Pell (2001) reported that *happy* utterances exhibit unique acoustic differences in sentence-final position when compared to linguistically identical *angry*, *sad*, and *neutral* utterances, arguing that the position of acoustic cues, and not just time, is a key factor in communicating vocal emotions in speech. Other data underscore that the ability to recognize basic emotions varies significantly depending on the channel of expression (i.e., whether conveyed by facial expressions, vocal expressions, or linguistic content; Paulmann and Pell, 2011), with evidence that *fear, sadness, anger,* and *neutral* are effectively conveyed by speech prosody, whereas other emotions such as *happiness* or *disgust* are much more salient in other channels (Paulmann and Pell, 2011). 
These findings raise the possibility that when basic emotions are preferentially communicated in channels other than the voice, vocal concomitants of these emotions are encoded and recognized somewhat differently; for example, they could be partly marked by local variations in acoustic cues that signal the interpersonal function or social relevance of these cues to the listener at the end of a discourse, similar to how the smile may reflect *happiness* or may serve social functions such as appeasement or dominance (Hess et al., 2002).

Further investigations are clearly needed to understand the time course of vocal emotion recognition in speech and to inform whether temporal specificities documented by initial studies (Cornew et al., 2010; Pell and Kotz, 2011) are solely dictated by the *amount of time* listeners require to identify vocal emotions, or whether linguistic structure plays a role for identifying some emotions. We tested this question using the same gating paradigm and emotionally-inflected utterances as Pell and Kotz (2011), although here we presented pseudo-utterances gated syllable-by-syllable from the *offset* rather than the onset of the stimulus (i.e., in a "backwards" or reverse direction) to test whether recognition times depend on how utterances are presented. If the critical factor for recognizing certain basic emotions in the voice is the unfolding of acoustic evidence over a set period of time, we expected similar outcomes/emotion identification times as those reported by Pell and Kotz (2011) irrespective of how utterances were gated; this result would establish that modal acoustic properties for understanding emotions tend to permeate the speech signal (perhaps due to their association with distinct physiological "push effects," e.g., Scherer, 1986, 2009) and are decoded according to a standard time course. However, if important acoustic cues for recognizing vocal emotions are differentially encoded within an utterance, we should witness significantly different emotion identification times here when utterances are gated from their offset when compared to when they are presented from their onset (Pell and Kotz, 2011). This result could supply evidence that some emotions are "socialized" to a greater extent in the context of speech prosody through functionally distinct encoding processes.

#### **METHODS**

#### **PARTICIPANTS**

Forty native English speakers recruited through campus advertisements (20 men/20 women, mean age: 25 ± 5 years) took part in the study. All participants were right-handed and reported normal hearing and normal or corrected-to-normal vision. Informed written consent was obtained from each participant prior to the study which was ethically approved by the Faculty of Medicine Institutional Review Board at McGill University (Montréal, Canada). Before the experiment, each participant completed a questionnaire to establish basic demographic information (age, education, language skills).

#### **STIMULI**

As described by Pell and Kotz (2011), the stimuli were emotionally-inflected pseudo-utterances (e.g., *The placter jabored the tozz*) selected from an existing database of recorded exemplars, validated and successfully used in previous work (e.g., Pell et al., 2009; Paulmann and Pell, 2010; Rigoulot and Pell, 2012). Pseudo-utterances mimic the phonotactic and morpho-syntactic properties of the target language but lack meaningful lexical-semantic cues about emotion, allowing researchers to study the isolated effects of emotional prosody in speech (see Scherer et al., 1991; Pell and Baum, 1997 for earlier examples). The selected utterances were digitally recorded by two male and two female speakers in a sound-attenuated booth, saved as individual audio files, and perceptually validated by a group of 24 native listeners using a seven-choice forced emotion recognition task (see Pell et al., 2009, for full details). For this study we selected a subset of 120 pseudo-utterances that reliably conveyed *anger, disgust, fear, happiness, sadness* and *neutral* expressions to listeners (20 exemplars per emotion). Thirteen unique pseudo-utterance phrases produced by the four speakers to convey each emotion were repeated throughout the experiment (see Appendix). These sentences were the same in their (pseudo) linguistic content as those presented by Pell and Kotz (2011), although the precise recordings selected here were sometimes different because some phrases were emotionally expressed by a different speaker (75% of the chosen recordings were identical to those presented by Pell and Kotz, 2011). For all emotions, the target meaning encoded by prosody for these items was recognized at very high accuracy levels based on data from the validation study (anger = 86%; disgust = 76%; fear = 91%; happiness = 84%; sadness = 93%; neutral = 83%, where chance in the validation study was approximately 14%). 
Pseudo-utterances conveying each emotion were produced in equal numbers by two male and two female speakers and were all seven syllables in length prior to gate construction.

#### **GATE CONSTRUCTION**

Each utterance was deconstructed into seven gates according to the syllable structure of the sentence using Praat speech analysis software (Boersma and Weenink, 2012). As we were interested in the time course of emotion recognition when utterances were presented from their end to their beginning, the first gate (Gate\_7) of each stimulus consisted of only the last syllable of the utterance, the second gate (Gate\_6-7) consisted of the last two syllables, and so on to Gate\_1-7 (presentation of the full utterance). For each of the 120 items, this procedure produced seven gated stimuli (Gate\_7, Gate\_6-7, Gate\_5-7, Gate\_4-7, Gate\_3-7, Gate\_2-7, Gate\_1-7) each composed of a different number of syllables (120 × 7 = 840 unique items). Note that since the onset of most gated stimuli occurred at a syllable break *within* the utterance (with the exception of Gate\_1-7), these stimuli gave the impression of being "chopped off" at the beginning and starting abruptly. As shown in **Table 1**, the duration of items presented in each gate condition differed by emotion type due to well-documented temporal differences in the specification of vocal emotion expressions (Juslin and Laukka, 2003; Pell and Kotz, 2011).
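
The gate definitions above can be illustrated with a short sketch. It assumes that each utterance comes with a list of syllable boundary times (for example, taken from a manual annotation); the function name `offset_gates`, the boundary values, and the data layout are hypothetical and are not part of the original procedure, which was carried out in Praat.

```python
from typing import List, Tuple


def offset_gates(boundaries: List[float]) -> List[Tuple[str, float, float]]:
    """Return (label, start, end) for gates built from the utterance offset.

    `boundaries` holds n + 1 increasing time points (in seconds) marking the
    edges of n syllables. This layout is an assumption of the sketch.
    """
    n = len(boundaries) - 1
    end = boundaries[-1]
    gates = []
    # The first gate holds only the last syllable; each later gate adds the
    # preceding syllable, so every gate ends at the utterance offset.
    for first in range(n, 0, -1):
        label = f"Gate_{first}" if first == n else f"Gate_{first}-{n}"
        gates.append((label, boundaries[first - 1], end))
    return gates


# Hypothetical syllable boundaries for one seven-syllable utterance.
example = [0.00, 0.15, 0.42, 0.60, 0.85, 1.05, 1.30, 1.62]
gates = offset_gates(example)
for label, start, stop in gates:
    print(f"{label}: {start:.2f}-{stop:.2f} s (duration {stop - start:.2f} s)")

# Seven gates per utterance across 120 utterances.
total_stimuli = len(gates) * 120
print(total_stimuli)
```

Because every gate shares the same end point, gate duration grows only by moving the start point earlier, which is why all gates except the full utterance begin at a syllable break inside the sentence.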

#### **EXPERIMENTAL DESIGN/PROCEDURE**

Participants were invited to take part in a study of "communication and emotion"; they were seated in a quiet, dimly lit room at a 75 cm distance from a laptop screen. SuperLab 4.0 software (Cedrus, USA) was used to present auditory stimuli played over volume-adjustable, high-quality headphones.

Seven presentation blocks were built, each containing 120 items with the same gate duration (i.e., number of syllables) presented successively in blocks of increasing syllable duration. The first block contained all Gate\_7 stimuli (tokens with only the last syllable), the second block contained all Gate\_6-7 stimuli (last two syllables), and so on until the Gate\_1-7 block containing the full utterances was presented. As in Pell and Kotz (2011), this block design was chosen to mitigate potential artifacts such as response perseveration (Grosjean, 1996). Individual stimuli were randomized within blocks, and participants were instructed to identify the emotion expressed by the speaker as accurately and quickly as possible from six alternatives presented on the computer screen (*anger, disgust, fear, sadness, happiness, neutral*). Responses were recorded by a mouse click on the corresponding emotion label. Following the emotion response, a new screen appeared asking participants to rate how confident they were about their emotional decision along a 7-point scale, where 1 indicated they were "very unsure" and 7 meant that they were "very sure" about their judgment. After recording the confidence rating, a gap of 2 s separated the onset of the next trial.
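
As a rough illustration of this block structure, the sketch below builds seven blocks of 120 trials each, ordered from the shortest to the longest gate and shuffled within each block. The function `build_blocks`, the integer item identifiers, and the fixed seed are assumptions made for the example; the actual experiment was run in SuperLab, whose randomization routine is not described in the text.

```python
import random
from typing import List, Sequence, Tuple

GATE_LABELS = ["Gate_7", "Gate_6-7", "Gate_5-7", "Gate_4-7",
               "Gate_3-7", "Gate_2-7", "Gate_1-7"]


def build_blocks(item_ids: Sequence[int],
                 gate_labels: Sequence[str],
                 seed: int = 0) -> List[List[Tuple[int, str]]]:
    """One block per gate, in order of increasing duration; trials are
    shuffled within each block only."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    blocks = []
    for gate in gate_labels:
        trials = [(item, gate) for item in item_ids]
        rng.shuffle(trials)
        blocks.append(trials)
    return blocks


blocks = build_blocks(range(120), GATE_LABELS)

# Split across the two testing sessions described in the text.
session_1, session_2 = blocks[:3], blocks[3:]
print(len(blocks), len(session_1), len(session_2))
```

Keeping each gate duration in its own block means a listener never hears a longer version of an item before a shorter one.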

Participants completed ten practice trials at the beginning of the testing session and additional practice trials prior to each block to become familiar with stimuli representing each gate duration condition. Participants were allowed to adjust the volume during the first practice block of each session. Since the volume of our stimuli was homogenized, only one adjustment at the beginning was necessary to meet the participants' individual preferences. The full experiment was administered during two separate 60-min sessions (session 1 = first three gate conditions, session 2 = last four gate conditions) to reduce fatigue and familiarity with the stimuli. Participants received \$25 CAD compensation for their involvement.

#### **STATISTICAL ANALYSES**

Participants' ability to identify emotional target meanings (% correct) and their associated confidence ratings (7-pt scale) were each analyzed. From the uncorrected accuracy (hit) rates of each participant, Hu-scores were computed for each gate and emotion to adjust for individual response biases when several emotion categories are used (see Wagner, 1993). The computation of Hu-scores takes into account how many stimulus categories and answer possibilities are given in the forced choice task. If only two stimulus categories and two answer possibilities are used (e.g., neutral and anger), the Hu-score for the correct identification of one category, say anger, would be computed as follows: $Hu = \frac{a}{a + b} \times \frac{a}{a + c}$. Here *a* is the number of correctly identified stimuli (anger was recognized as anger), *b* is the number of misidentifications in which anger was incorrectly labeled as neutral, and *c* is the number of misidentifications in which neutral was incorrectly labeled as anger. Wagner (1993) describes the Hu-scores as "[...] the joint probability that a stimulus category is correctly identified given that it is presented at all and that a response is correctly used given that it is used at all."
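
The two-category formula generalizes to any number of response options: the squared number of hits is divided by the product of the number of stimuli presented in that category and the number of times the corresponding response was used. The sketch below applies this to a confusion matrix. The function name `hu_scores`, the example counts, and the convention of returning 0 when a response is never used are assumptions made for illustration.

```python
from typing import List


def hu_scores(confusion: List[List[int]]) -> List[float]:
    """Unbiased hit rate (Hu) per category from a square confusion matrix.

    confusion[i][j] is the number of trials on which stimulus category i
    received response j, so Hu_i = hits_i**2 / (row_total_i * column_total_i).
    """
    k = len(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    scores = []
    for i in range(k):
        hits = confusion[i][i]
        denom = row_totals[i] * col_totals[i]
        # Assumed convention: a category that was never presented or a
        # response that was never used yields a score of 0.
        scores.append(hits * hits / denom if denom else 0.0)
    return scores


# Hypothetical counts for the two-category case in the text:
# rows = stimulus (anger, neutral), columns = response (anger, neutral).
# a = 16 (anger -> anger), b = 4 (anger -> neutral), c = 2 (neutral -> anger).
confusion = [[16, 4],
             [2, 18]]
scores = hu_scores(confusion)
print([round(s, 3) for s in scores])
```

With these counts the anger score equals (16/20) × (16/18), matching the formula with *a* = 16, *b* = 4, and *c* = 2.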

**Table 1 | Duration of the stimuli presented in the experiment in each gate duration condition as a function of emotion.**

*Pseudo-utterances were always gated at syllable boundaries from the offset of the utterance in gates of increasing syllable duration.*

Hu-scores and confidence scores were submitted to separate 7 × 6 ANOVAs with repeated measures of gate duration (seven levels) and emotion (*anger, disgust, fear, happiness, sadness, neutral*). To infer how much time participants required to correctly identify emotions, we computed the "emotion identification point" for each of the 120 pseudo-utterances by determining the gate condition where a participant identified the target emotion without subsequent changes at longer gate durations of the same stimulus. The emotion identification points were then transformed into "emotion identification times" by converting the number of syllables needed to identify the target into the exact speech duration in milliseconds, which was then averaged across items for each participant (see Pell and Kotz, 2011 for detailed procedures). Of the 4800 possible identification points (20 items × 6 emotions × 40 participants), 419 items that were not correctly identified by a participant even when the full utterance was presented were labeled as "errors" and omitted from the calculation of emotion identification times (a total of 4381 data points were included). Mean emotion identification times were submitted to a one-way ANOVA with repeated measures on emotion (*anger, disgust, fear, happiness, sadness, neutral*).
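
The identification-point rule described above can be sketched as follows. This is one reading of the procedure, under the assumption that a participant's responses to the seven gated versions of an item are stored in order of increasing gate duration; the function names, the example responses, and the gate durations are hypothetical. Detailed procedures are given by Pell and Kotz (2011).

```python
from typing import List, Optional


def identification_point(responses: List[str], target: str) -> Optional[int]:
    """Index of the earliest gate from which the response matches `target`
    and stays unchanged at all longer gates. Returns None when the full
    utterance (last gate) was not identified correctly (an "error")."""
    if not responses or responses[-1] != target:
        return None
    point = len(responses) - 1
    while point > 0 and responses[point - 1] == target:
        point -= 1
    return point


def identification_time_ms(responses: List[str], target: str,
                           gate_durations_ms: List[float]) -> Optional[float]:
    """Convert the identification point into the duration of that gate."""
    point = identification_point(responses, target)
    return None if point is None else gate_durations_ms[point]


def mean_time_ms(times: List[Optional[float]]) -> Optional[float]:
    """Average over items, omitting errors (None)."""
    valid = [t for t in times if t is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical responses of one participant to one fear item, gates 1 to 7.
responses = ["neutral", "fear", "sadness", "fear", "fear", "fear", "fear"]
durations = [210.0, 430.0, 640.0, 850.0, 1080.0, 1290.0, 1550.0]
point = identification_point(responses, "fear")
time_ms = identification_time_ms(responses, "fear", durations)
print(point, time_ms)

# Data points retained after removing errors, as reported in the text.
possible = 20 * 6 * 40
retained = possible - 419
print(possible, retained)
```

In this example the correct response at the second gate does not count, because the participant changed the response at the third gate; the identification point is the fourth gate.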

Since the stimuli, procedures, and analyses adopted here were virtually identical to those of Pell and Kotz (2011), our experiment allows unprecedented comparisons of how recognition of emotional prosody evolves over time as a function of the gating *direction*, shedding light on how the position of acoustic patterns for detecting emotions influences recognition processes. For each of our three dependent measures (accuracy scores, confidence ratings, emotion identification times), we therefore performed a second analysis to directly compare the current results to those of Pell and Kotz (2011) by entering the between-groups factor of Presentation Direction (gating from offset vs. onset). Separate *t*-tests first compared the age and education (in years) of the current participant group (*n* = 40) with participants studied by Pell and Kotz (2011, *n* = 48); there was no difference in the formal education of the two samples [17 vs. 16 years, respectively; *t*(86) = 1.548; *p* = 0.125], although participants in the present study were older on average [25 vs. 22 years; *t*(86) = 2.578; *p* = 0.012]. Given the age difference, we entered age as a covariate in separate mixed ANCOVAs on the Hu-scores, confidence ratings, and emotion identification times as described above with the additional grouping variable of presentation Direction (onset, offset) of key theoretical interest in these analyses. For all statistical analyses, a significance level of 5% (two-sided) was selected and *post-hoc* comparisons (Tukey's HSD, *p* < 0.05) were applied whenever a significant main or interactive effect was observed.

#### **RESULTS**

#### **ACCURACY (HU-SCORES) AND CONFIDENCE RATINGS**

#### *Effects of backwards gating on accuracy and confidence scores*

**Table 2** shows the mean accuracy of participants (% correct target recognition) in each emotion and gate condition when utterances were presented from their offset, prior to correcting these scores for participant response bias. A 7 (Gate) × 6 (Emotion) ANOVA performed on the *unbiased* emotion recognition rates (i.e., calculated Hu-scores) revealed a main effect of Gate duration [*F*(6, 228) = 390.48; *p* < 0.001], Emotion [*F*(5, 190) = 142.57; *p* < 0.001], and a significant interaction of these factors [*F*(30, 1140) = 10.684; *p* < 0.001]. Post hoc (Tukey's) tests of the interaction first considered how the recognition of each emotion evolved as a function of gate duration when sentences were gated from their offset. As shown in **Figure 1**, the recognition of *fear*, *anger*, and *sadness* significantly improved over the course of hearing the first three gates (i.e., the last three syllables of the utterance, *p*s < 0.003) with no further accuracy gains by the fourth gate condition (Gate\_4-7, *p*s > 0.115). In contrast, accurate recognition of *neutral*, *happiness*, and *disgust* each significantly improved over a longer time frame corresponding to the first four gate conditions (Gate\_7 to Gate\_4-7, *p*s < 0.001) without further changes after this point (*p*s > 0.087).

Further inspection of the interaction then looked at emotional differences on accuracy at each gate condition. When listeners heard only the utterance-final syllable (Gate\_7), *fear* and *anger* prosody were recognized significantly better than all other emotional voices (*p*s < 0.006), and *fear* was also recognized significantly better than *anger* (*p* < 0.001). After fear and anger, *sad* expressions were identified significantly better from the last syllable than *happy* and *neutral* expressions (*p*s < 0.001), which did not differ (*p* = 1.000), followed by *disgust* which was recognized more poorly than any other emotion (*p*s < 0.046). This pattern was similar for stimuli composed of the last two and the last three syllables (Gate\_6-7 and Gate\_5-7, respectively) but changed somewhat as stimulus duration increased. After presenting the last four syllables (Gate\_4-7), *fear* continued to exhibit the highest accuracy score (this was true in all gate conditions; *p*s < 0.017) but recognition of *anger* and *sad* expressions was equivalent (*p* = 1.0), followed by *happiness* which was recognized significantly better than *disgust* (*p* < 0.001). After the last five syllables were presented (Gate\_3-7), *angry*, *sad* and *happy* sentences were recognized at a similar rate (*p*s > 0.555), surpassing *neutral* and *disgust* (*p*s < 0.001). In the two longest gate conditions (Gate\_2-7, Gate\_1-7), accuracy scores for *anger*, *sad*, *happy* and *neutral* sentences were not statistically different (*p*s > 0.407) while vocal expressions of *fear* and *disgust* were respectively the best and worst recognized from speech prosody (*p*s < 0.017).

The analysis of associated confidence ratings (on a scale of 1–7) was restricted to trials in which the emotional target of the prosody was correctly identified. Two male participants who failed to recognize any of the *disgust* expressions (producing an empty cell) were excluded from this analysis. The ANOVA on the confidence scores revealed a main effect of gate duration [*F*(6, 192) = 48.653; *p* < 0.001], a main effect of emotional prosody [*F*(5, 160) = 46.991; *p* < 0.001] and a significant interaction of Gate × Emotion [*F*(30, 960) = 3.814; *p* < 0.001]. Confidence scores tended to increase with stimulus/gate duration, although there were differences across emotions as a function of gate duration. After listening to the final one or two syllables, participants were significantly more confident about their detection of *fear* and *anger* (*p*s < 0.001) and least confident when they correctly recognized *neutral* and *disgust* (*p*s < 0.001). Confidence ratings for *happiness* and *sadness* were between those extremes, differing significantly from the other two emotion sets (*p*s < 0.048). By the third gate condition (Gate\_5-7), confidence about *neutral* prosody began to increase over *disgust* (*p* < 0.001), and by the fourth gate condition and when exposed to longer stimuli, confidence ratings for *fear*, *anger*, *happiness*, and *sadness* were all comparable, although confidence about *disgust* remained significantly lower even when full utterances were presented (Gate\_1-7).

**Table 2 | Mean accuracy (% target recognition) of the 40 listeners who judged pseudo-utterances conveying each emotion according to the gate duration, when utterances were gated from the offset of the sentence.**

*Standard deviations are shown in parentheses.*

#### *Impact of gating direction on accuracy and confidence scores*

The 2 × 7 × 6 ANCOVA on Hu-scores gathered here and by Pell and Kotz (2011) showed a significant three-way interaction of Direction, Gate duration, and Emotion [*F*(30, 2550) = 12.636; *p* < 0.001]. This interaction allowed us to explore the influence of presentation direction (onset vs. offset) on the accuracy of emotional prosody recognition as additional syllables revealed acoustic evidence about each emotion; these relationships are demonstrated for each emotion in **Figure 2**. Step-down analyses (2 × 7 ANOVAs) showed that the interaction of Direction × Gate duration was significant for *anger* [*F*(6, 516) = 14.218; *p* < 0.001], *fear* [*F*(6, 516) = 33.096; *p* < 0.001], *disgust* [*F*(6, 516) = 10.851; *p* < 0.001], *sadness* [*F*(6, 516) = 11.846; *p* < 0.001], and *happiness* [*F*(6, 516) = 9.663; *p* < 0.001]. For each of these emotions, recognition always improved when the *end* of utterances was heard first (i.e., when gated from their offset vs. onset), although the temporal region where accuracy improved within the utterance varied by emotion type. *Post-hoc* comparisons showed that *anger* and *fear* were recognized significantly better in the offset presentation condition even when little acoustic evidence was available; listeners detected *anger* better over the course of the first to third syllable in the offset vs. onset condition, and over the course of the first to sixth syllables for *fear* (*p*s < 0.001). *Happiness* showed an advantage in the offset condition beginning at the second up to the fourth gate (*p*s = 0.027), *disgust* showed a similar advantage beginning at the third to the fifth gate (*p* < 0.049), and *sadness* displayed the offset advantage beginning at the third up to the sixth gate (*p*s < 0.031). Interestingly, there was no effect of the direction of utterance presentation on the recognition of *neutral* prosody [*F*(6, 516) = 0.409; *p* = 0.873].

The ANCOVA on confidence ratings between studies yielded a significant three-way interaction of Direction, Gate duration and Emotion [*F*(30, 2370) = 4.337; *p* < 0.001]. Step-down analyses (2 × 7 ANOVAs) run separately by emotion showed that the interaction of Direction × Gate duration was significant for *anger* [*F*(6, 516) = 35.800; *p* < 0.001], *fear* [*F*(6, 516) = 19.656; *p* < 0.001], *happiness* [*F*(6, 504) = 18.783; *p* < 0.001], and *sadness* [*F*(6, 516) = 10.898; *p* < 0.001]. Listeners were more confident that they had correctly identified these four emotions only when one syllable was presented in isolation (i.e., at the first gate duration, *p*s < 0.049), with increased confidence when they heard the sentence-final as opposed to the sentence-initial syllable. For *disgust* and *neutral*, the two-way interaction was also significant [*F*(6, 492) = 7.522; *p* < 0.001; *F*(6, 516) = 7.618; *p* < 0.001, respectively] but *post hoc* tests revealed only minor differences in the pattern of confidence ratings in each presentation condition with no differences in listener confidence at specific gates (*p*s > 0.618). These patterns are illustrated for each emotion in **Figure 3**.

#### **EMOTION IDENTIFICATION TIMES**

#### *Effects of backwards gating on the time course of vocal emotion recognition*

As described earlier, emotion identification times were computed by identifying the gate condition from sentence offset where the target emotion was correctly recognized for each item and participant, which was then converted into the precise time value of the gated syllables in milliseconds. A one-way ANOVA performed on the mean emotion identification times with repeated measures of emotion type (*anger*, *disgust*, *fear*, *happiness*, *sadness* and *neutral*) revealed a highly significant effect of emotion [*F*(5, 190) = 113.68; *p* < 0.001]. As can be seen in **Figure 3**, *fearful* voices were correctly identified at the shortest presentation times (*M* = 427 ms), significantly faster than *sadness* (*M* = 612 ms), *neutral* (*M* = 654 ms) and *anger* (*M* = 672 ms), which did not differ significantly from one another. These emotions required significantly less time to identify than *happiness* (*M* = 811 ms), which in turn took significantly less time than *disgust* (*M* = 1197 ms), which required the longest stimulus exposure for accurate recognition (all *p*s < 0.001).
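The conversion from an identification gate to a time value can be sketched as follows. This is only an illustration under stated assumptions: the identification gate is taken to be the first gate from which the listener's response is correct and stays correct at all longer gates (the exact criterion is the one described in the Methods), and the names `identification_gate`, `identification_time_ms`, and the syllable durations are hypothetical.

```python
def identification_gate(responses, target):
    """Return the 1-based gate at which the target is first recognized and
    remains recognized at every longer gate, or None if that never happens.

    responses: labels given at Gate_7, Gate_6-7, ..., Gate_1-7 (shortest first).
    """
    gate = None
    for index, response in enumerate(responses, start=1):
        if response == target:
            if gate is None:
                gate = index  # candidate identification point
        else:
            gate = None  # a later error cancels the earlier candidate
    return gate


def identification_time_ms(responses, target, syllable_durations_ms):
    """Cumulative duration (ms) of the syllables heard at the identification gate.

    syllable_durations_ms is ordered from the utterance-final syllable backwards,
    matching the order in which syllables are added when gating from the offset.
    """
    gate = identification_gate(responses, target)
    if gate is None:
        return None
    return sum(syllable_durations_ms[:gate])


# Hypothetical item: the listener settles on "fear" from the fourth gate onward.
responses = ["anger", "fear", "anger", "fear", "fear", "fear", "fear"]
durations = [380, 210, 190, 170, 160, 150, 140]  # made-up syllable durations
print(identification_gate(responses, "fear"))
print(identification_time_ms(responses, "fear", durations))
```

Averaging such per-item values within each emotion and participant would give the kind of mean identification times entered into the one-way ANOVA.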

#### *Impact of gating direction on emotion identification times*

Finally, a 2 × 6 (Direction × Emotion) mixed ANCOVA was performed on the emotion identification times to compare the present results to those of Pell and Kotz (2011); this analysis revealed a significant interaction of presentation Direction and Emotion [*F*(5, 425) = 13.235; *p* < 0.001] as also shown in **Figure 4**. The average time listeners required to correctly identify emotional prosody was significantly reduced when syllables were presented from the offset vs. onset of utterances, but only for *disgust* (*p* < 0.001) and *happiness* (*p* = 0.050). In contrast to accuracy and confidence ratings, the manner in which utterances were gated had no significant impact on the amount of time listeners needed to recognize *fear*, *sadness*, *anger*, or *neutral* prosody (all *p*s > 0.157).

#### **DISCUSSION**

Following recent work (Cornew et al., 2010; Pell and Kotz, 2011), this experiment sought a clearer understanding of how vocal expressions of basic emotion reveal their meanings in speech using a modified version of the gating paradigm, where emotionally-inflected pseudo-utterances were truncated and presented in excerpts of increasing syllable duration from the *end* of an utterance. While the current manner for presenting our stimuli might bear no immediate resemblance to how emotional speech is encountered in structured conversations, especially because our stimuli were only auditory and not spontaneously produced (see Barkhuysen et al., 2010 for a discussion on this topic), our performance measures may help to understand some processes involved when listeners "walk in" on an emotional conversation, or have their attention directed to emotional speech in the environment that is already in progress, an experience that is common to everyday life. Critically, our design allowed important hypotheses to be tested concerning the evolution and associated time course of emotional prosody recognition (in English) as listeners are progressively exposed to representative acoustic cue configurations. In line with past findings, we found that listeners tended to be most accurate at recognizing vocal expressions of *fear* (Levitt, 1964; Zuckerman et al., 1975; Paulmann and Pell, 2011; Pell and Kotz, 2011) and least accurate for *disgust* (e.g., Scherer et al., 1991; Banse and Scherer, 1996) irrespective of how many syllables/gates were presented. Expressions of *fear* were also recognized from the shortest stimulus duration, implying that listeners need minimal input to recognize this emotion in speech (Pell and Kotz, 2011). Interestingly, emotion identification times were significantly reduced for certain emotions (*happiness*, *disgust*) when sentences were presented from their offset rather than their onset, and there were other apparent "advantages" to recognizing emotion prosody when listeners were first exposed to the *end* of utterances. These effects and their implications are discussed in detail below.

#### **EFFECTS OF GATING DIRECTION AND CUE LOCATION ON VOCAL EMOTION RECOGNITION**

Our data show that recognition of vocal emotions generally improves with the number of syllables presented, even when listeners hear utterance fragments in reverse order, but reaches a plateau for all emotions after hearing the last three to four syllables of the utterance. When viewed broadly, these findings suggest that "prototypical" acoustic properties for accessing knowledge about basic emotions from speech (Laukka, 2005; Pell et al., 2009) are decoded and consciously recognized at peak accuracy levels after processing three to four spoken syllables, approximating a mean stimulus duration of 600–1200 ms, depending on the emotion in question (review **Table 1**). This broad conclusion fits with observations of two previous gating studies that gated emotional utterances in syllabic units (Pell and Kotz, 2011) or in 250 ms increments (Cornew et al., 2010). However, there were notable emotion-specific recognition patterns as a function of gate duration; when stimuli were very short (i.e., only the final one or two syllables were presented) there was a marked advantage for detecting *fear* and *anger* when compared to the other expression types, and listeners were significantly more confident that they had correctly identified these two emotions based solely on the utterance-final syllable. As the gate duration gradually increased to five syllables (Gate\_3-7), no further differences were observed in the ability to recognize *anger*, *sadness*, and *happiness*, although participants remained significantly more accurate for *fear* and significantly less accurate for *disgust* at all stimulus durations.

The observation that *fear*, and to a lesser extent *anger*, were highly salient to listeners at the end of utterances even when minimal acoustic information was present (i.e., the final syllable) is noteworthy. Leinonen and colleagues (1997) presented two-syllable emotional utterances in Finnish (the word [saara]) and reported higher recognition scores and distinct acoustic attributes of productions conveying *fear* and *anger* when compared to eight other emotional-motivational states, suggesting that these emotions are highly salient to listeners in acoustic stimuli of brief duration. Similarly, Pell and Kotz (2011) reported that recognition of most emotions improved over the *full course* of the utterance when they were gated from sentence onset and that certain emotions, such as *happiness* and *fear*, demonstrated clear gains in that study when listeners processed the last two syllables of the utterance. When combined with our current findings, this implies that syllables located towards the *end* of an utterance provide especially powerful cues for identifying basic emotions encoded in spoken language. This argument is supported by our direct statistical comparisons of the two data sets when utterances were gated from their onset vs. offset; we found that presentation *direction* had a significant impact on the accuracy and confidence levels of English listeners, with improved recognition of all emotions except *neutral* when participants heard utterances commencing with the last syllable. Gating utterances from their offset also reduced mean emotion identification times for some emotions (*happiness*, *disgust*) as elaborated below. In contrast, there was no evidence in our data that listeners were at an advantage to recognize emotional prosody when utterances were gated from their onset, with the possible exception of accuracy rates for *sadness* that were somewhat higher in the onset condition at very short gate intervals.

Why would natural, presumably biologically-specified codes for signaling emotions in the voice (e.g., Ekman, 1992; Wilson and Wharton, 2006) bear an important relationship to the temporal features of spoken language? This phenomenon, which has been highlighted at different times (Cosmides, 1983; Scherer, 1988), could be explained by the accent structure of utterances we presented for emotion recognition and by natural processes of speech production, factors which both contribute to the "socialization" or shaping of vocal emotion expressions in the context of spoken language. It is well known that the accent/phrase structure of speech, or the relative pattern of weak vs. strong syllables (or segments) in a language, can be altered when speakers experience and convey vocal emotions (Ladd, 1996). For example, speakers may increase or decrease the relative prominence of stressed syllables (through local changes in duration and pitch variation) and/or shift the location or frequency of syllables that are typically accented in a language, which may serve as an important perceptual correlate of vocal emotion expressions (Bolinger, 1972; Cosmides, 1983). Related to the notion of local prominence, there is a well-documented propensity for speakers to lengthen syllables located in word- or phrase-final position ("sentence-final lengthening," Oller, 1973; Pell, 2001), sometimes on the penultimate syllable of certain languages (Bolinger, 1978), and other evidence that speakers modulate their pitch in final positions to encode gradient acoustic cues that refer directly to their emotional state (Pell, 2001). Together, these tendencies could give the final position of sentences a special impact on the identification of the emotional quality of the voice.

The observation here that cues located toward the end of an utterance facilitated accurate recognition of most emotions in English likely re-asserts the importance of accent structure during vocal emotion processing (Cosmides, 1983; Ladd et al., 1985). More specifically, it implies that sentence-final syllables in many languages could act as a vehicle for reinforcing the speaker's emotion state *vis-à-vis* the listener in an unambiguous and highly differentiated manner during discourse (especially for *fear* and *anger*). Inspection of the mean syllable durations of gated stimuli presented here and by Pell and Kotz (2011) confirms that while there were natural temporal variations across emotions, the duration of utterance-final syllables (*M* = 386 ms, range = 329–481) was more than double that of utterance-initial syllables (*M* = 165 ms, range = 119–198), the latter of which were always unstressed in our study. In comparison, differences in the cumulative duration of gates composed of two syllables (*M* = 600 vs. 516 ms in the offset vs. onset conditions, respectively) or three syllables (*M* = 779 vs. 711 ms) were relatively modest between the two studies, and these stimulus durations were always composed of both weak and stressed syllables. This observed difference in duration is in line with the above-described propensity of speakers to lengthen syllables located in sentence-final position. Also, given the structure of the pseudo-utterances (see Appendix), it should be noted that the forward presentation of pseudo-utterances might differ from the backward presentation in terms of the expectations of the participants. In Pell and Kotz (2011), the first gate was always a pronoun or a determiner and was always followed by the first syllable of a pseudo-verb, whereas in the present experiment, the first two gates were always the two final syllables of a pseudo-word. It is difficult to know whether participants may have developed some expectations about the following syllable and to what extent these expectations could have impacted the identification of the prosody. We cannot exclude that these expectations could have been more difficult to make in the backward condition, when the gates were presented in reverse order, altering how participants focused on the emotional prosody of the sentences. However, such an interpretation would not explain why the direction of presentation did not influence the performance of participants when sentences were uttered in a neutral tone and why this influence was limited to some specific gates when the sentences were spoken in an emotional way.
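The duration comparison in this paragraph can be checked with a few lines of arithmetic. The sketch below only re-uses the mean durations reported above (offset vs. onset conditions, in ms); the dictionary layout and variable names are illustrative assumptions.

```python
# Mean cumulative gate durations in ms, keyed by number of syllables heard:
# (gated from offset, gated from onset), as reported in the text.
mean_durations_ms = {1: (386, 165), 2: (600, 516), 3: (779, 711)}

# Ratio of offset to onset duration for each gate length.
ratios = {
    n_syllables: offset / onset
    for n_syllables, (offset, onset) in mean_durations_ms.items()
}

for n_syllables, ratio in ratios.items():
    print(f"{n_syllables} syllable(s): offset/onset = {ratio:.2f}")
```

The single-syllable ratio is above 2, whereas the two- and three-syllable ratios stay below 1.2, which is the contrast the argument about sentence-final lengthening relies on.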

Nevertheless, these results suggest that there is a certain alignment in how speakers realize acoustic targets that refer to semantically-dictated stress patterns and emotional meanings in speech, demonstrating that recognition of vocal emotional expressions is shaped to some extent by differences in the temporal (accent) structure of language *and* that emotional cues are probably not equally salient throughout the speech signal. Further studies that compare our findings with data from other languages will clearly be needed to advance specific hypotheses about how vocal emotion expressions may have become "domesticated" in the context of spoken language. For example, we could replicate forward and backward gating experiments in another stressed-language like German, and see if critical cues in the identification of some emotions could be located at different places of a sentence. We could also compare forward and backward presentation of pseudo-sentences in a language that does not use accentuated stress such as French, which supposedly would lead to similar results in the time needed to identify emotional prosody irrespective of the direction of presentation of the sentences.

#### *Further reflections on the time course of vocal emotion recognition*

While our data show that the position of emotionally meaningful cues plays a role in how vocal emotions are revealed to listeners, they simultaneously argue that the average *time* needed to accurately decode most basic emotions in speech is relatively constant irrespective of gating method (syllables vs. 250 ms increments) or stimulus set (Cornew et al., 2010; Pell and Kotz, 2011). When mean emotion identification times were computed here, *fear* required the least amount of stimulus exposure to recognize (*M* = 427 ms), followed by *sadness* (*M* = 612 ms), *neutral* (*M* = 654 ms), *anger* (*M* = 677 ms), *happiness* (*M* = 811 ms), and *disgust* (*M* = 1197 ms). With the exception of *neutral* which took slightly (although not significantly) longer to detect when utterances were gated in reverse, this emotion-specific pattern precisely mirrors the one reported by Pell and Kotz (2011) for the same six emotions and replicates Cornew et al.'s (2010) data for *neutral*, *anger*, and *happy* expressions when utterances were gated in 250 ms units. When the mean emotion identification times recorded here are compared to those reported by Pell and Kotz (2011) and Cornew et al. (2010), it can be said that recognition of *fear* occurs approximately in the range of 425–525 ms (427, 517 ms), *sadness* in the range of 600 ms (612, 576 ms), *anger* in the range of 700 ms (677, 710, 723 ms), *happiness* in the range of 800–900 ms (811, 977, 802 ms), and *disgust* requires analysis of at least 1200 ms of speech (1197, 1486 ms). 
As pointed out by Pell and Kotz (2011), the time needed to identify basic emotions from their underlying acoustic cues does not simply reflect characteristic differences in articulation rate across emotions (e.g., Banse and Scherer, 1996; Pell et al., 2009), since expressions of *sadness* are routinely slower and often twice the duration of comparable *fear* expressions, and yet these two emotions are accurately recognized from speech stimuli of the shortest duration. Rather, it can be claimed that prototypical cues for understanding vocal emotions are decoded and consciously retrievable over slightly different epochs in the 400–1200 ms time window, or after hearing roughly 2–4 syllables in speech. The idea that emotional meanings begin to be differentiated after hearing around 400 ms of speech fits with recent priming data using behavioral paradigms (Pell and Skorup, 2008) and event-related potentials (ERPs, Paulmann and Pell, 2010) as well as recent neuro-cognitive models on the time course and cognitive processing structure of vocal emotion processing (Schirmer and Kotz, 2006).

Evidence that vocal expressions of certain negative emotions, such as *fear, sadness,* or *anger*, require systematically less auditory input to decode accurately, whereas expressions of *happiness* and *disgust* take much longer, may be partly explained by the evolutionary prevalence and significance of negative emotions over positive emotions (e.g., Cacioppo and Gardner, 1999). Expressions that signal threat or loss must be decoded rapidly to avoid detrimental outcomes of great urgency to the organism; this negativity bias has been observed elsewhere in response to facial (Carretié et al., 2001) and vocal expressions of fear and anger (Calder et al., 2001, 2004), and would explain why *fear* prosody was recognized more accurately and *faster* than any other emotional expression in the voice (Levitt, 1964). The biological importance of rapidly differentiating negative vocal signals (e.g., Scherer, 1986) potentially explains why the *amount* of temporal acoustic information, and not the position of critical cues, appears to be the key factor governing the time course of recognizing *fear, anger,* and *sadness*, since we found no significant differences in emotion identification times for these emotions between our two studies.

In contrast, *happy* and *disgust* took significantly longer to identify and were the only emotions for which recognition times varied significantly as a function of gating direction (with a reduction in emotion recognition times of approximately 200 ms and 300 ms between studies, respectively). Difficulties recognizing *disgust* from prosody are well documented in the literature (Scherer, 1986; Scherer et al., 1991; Jaywant and Pell, 2012) and are sometimes attributed to the fact that *disgust* in the auditory modality is more typical in the form of affective bursts such as "yuck" or "eeeew" (Scherer, 1988; Simon-Thomas et al., 2009). It is possible that identifying disgust from running speech, as required here and by Pell and Kotz (2011), activates additional social meanings that take more time to analyze and infer than the decoding of pure biological signals such as *fear, sadness*, and *anger*. For example, it has been suggested that there are qualitatively different expressions of disgust in the visual (Rozin et al., 1994) and auditory (Calder et al., 2010) modality, including a variant related to violations of moral standards that is often conveyed in running speech, as opposed to physical/visceral expressions of disgust which are better conveyed through exclamations (yuck!). If presentation of disgust utterances engendered processes for inferring a speaker's social or moral attitude from vocal cues, a more symbolic function of prosody, one might expect a much slower time course as witnessed here. A similar argument may apply to our results for *happiness*; although this emotion is typically the quickest emotion to be recognized in the visual modality (Tracy and Robins, 2008; Palermo and Coltheart, 2004; Calvo and Nummenmaa, 2009), it exhibits a systematically slower time course in spoken language (Cornew et al., 2010; Pell and Kotz, 2011). 
Like disgust, *happiness* may also be communicated in a more rapid and reliable manner by other types of vocal cues that accompany speech, such as laughter (e.g., Szameitat et al., 2010). In addition, there is probably a need to differentiate between different types of vocal expressions of happiness which yield different rates of perceptual recognition (Sauter and Scott, 2007). Nonetheless, our results strongly imply that speakers use prosody to signal *happiness*, particularly towards the end of an utterance, as a conventionalized social cue directed to the listener for communicating this emotion state (Pell, 2001; Pell and Kotz, 2011), perhaps as a form of self-presentation and inter-personal expression of social affiliation. Further inquiry will be needed to test why *disgust* and *happiness* appear to be more socially mediated than other basic emotions, influencing the time course of their recognition in speech, and to define the *contexts* that produce variations in these expressions.

Interestingly, the recognition of *neutral* prosody was uniquely unaffected by the manner in which acoustic information was unveiled in the utterance, with no significant effects of presentation direction on accuracy, confidence ratings, or emotion identification times between studies. This tentatively suggests that the identification of neutrality, or a lack of emotionality in the voice, can be reliably inferred following a relatively standard amount of time in the range of 400–650 ms of stimulus exposure (Cornew et al., 2010; Pell and Kotz, 2011). Since our measures of recognition include conscious interpretative (naming) processes and are biased somewhat by the gating method, our data on the time course for *neutral* prosody are not inconsistent with results showing the *on-line* differentiation of neutrality/emotionality in the voice at around 200 ms after speech onset, as inferred from amplitude differences in the P200 ERP component when German utterances were presented to listeners (Paulmann et al., 2008). One can speculate that listeners use a heuristic or default process for recognizing *neutral* voices whenever on-line analysis of prosody does not uncover evidence of emotionally meaningful cue configurations; presumably, this process for rejecting the presence of known acoustic patterns referring to emotions, like the process for decoding known patterns, is accomplished over a relatively stable time interval. To test these possibilities, it would be interesting to modify neutral sentences by inserting local variations in emotionally-meaningful acoustic features at critical junctures in time to determine if this "resets the clock" for inferring the presence or absence of emotion in speech.

#### **CONCLUSION**

Following recent on-line (ERP) studies demonstrating that vocal emotions are distinguished from neutral voices after 200 ms of speech processing (Paulmann and Kotz, 2008), and that emotion-specific differences begin to be detected in the 200–400 ms time window (Alter et al., 2003; Paulmann and Pell, 2010), our data shed critical light on the time interval where different emotion-specific meanings of vocal expressions are fully recognized and available for conscious retrieval. While it seems likely that the phrase structure of language governs local opportunities for speakers to encode emotionally-meaningful cues that are highly salient to the listener, at least in certain contexts, there are remarkable consistencies in the *amount* of time listeners must monitor vocal cue configurations to decode emotional (particularly threatening) meanings. As such, the idea that there are systematic differences in the time course for arriving at vocal emotional meanings is confirmed. To gather further information on how social factors influence the communication of vocal emotional meanings, future studies using the gating paradigm could present emotional utterances to listeners in their native vs. a foreign language; this could reveal whether specificities in the time course for recognizing emotions manifest in a similar way for native speakers of different languages, while testing the hypothesis that accurate decoding of vocal emotions in a foreign language is systematically delayed due to interference at the phonological level (Van Bezooijen et al., 1983; Pell and Skorup, 2008).

#### **ACKNOWLEDGMENTS**

This research was financially supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN 203708-11 to Marc D. Pell). Assistance from the McGill University Faculty of Medicine (McLaughlin Postdoctoral Fellowship to Simon Rigoulot) and the Konrad-Adenauer-Foundation (to Eugen Wassiliwizky) is also gratefully acknowledged.

#### **REFERENCES**

*Speech Commun.* 54, 1–10. doi: 10.1016/j.specom.2011.05.011


emotions. *Psychol. Sci.* 3, 23–27. doi: 10.1111/j.1467-9280.1992.tb00251.x


architecture. *Philos. Trans. R. Soc. B* 364, 3459–3474. doi: 10.1098/rstb.2009.0141


*Cult. Psychol.* 14, 387–406. doi: 10.1177/0022002183014004001


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 22 February 2013; paper pending published: 27 March 2013; accepted: 04 June 2013; published online: 24 June 2013.*

*Citation: Rigoulot S, Wassiliwizky E and Pell MD (2013) Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition. Front. Psychol. 4:367. doi: 10.3389/fpsyg.2013.00367*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Rigoulot, Wassiliwizky and Pell. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

### **APPENDIX**

A list of pseudo-utterances produced to convey each target emotion that were gated for presentation in the experiment.


### The siren song of vocal fundamental frequency for romantic relationships

"fpsyg-04-00439" — 2013/7/12 — 10:54 — page 1 — #1

### *Sarah Weusthoff<sup>1</sup>\*, Brian R. Baucom<sup>2</sup> and Kurt Hahlweg<sup>1</sup>*

*<sup>1</sup> Clinical Psychology, Psychotherapy, and Assessment, Department of Psychology, Technische Universität Braunschweig, Braunschweig, Germany <sup>2</sup> Department of Psychology, University of Utah, Salt Lake City, UT, USA*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Alexandra Suppes, Columbia University, USA Cheryl Carmichael, City University of New York, USA*

#### *\*Correspondence:*

*Sarah Weusthoff, Clinical Psychology, Psychotherapy, and Assessment, Department of Psychology, Technische Universität Braunschweig, Humboldtstrasse 33, 38106 Braunschweig, Germany e-mail: s.weusthoff@tu-bs.de*

A multitude of factors contribute to why and how romantic relationships are formed as well as whether they ultimately succeed or fail. Drawing on evolutionary models of attraction and speech production as well as integrative models of relationship functioning, this review argues that paralinguistic cues (more specifically the fundamental frequency of the voice) that are initially a strong source of attraction also increase couples' risk for relationship failure. Conceptual similarities and differences between the multiple operationalizations and interpretations of vocal fundamental frequency are discussed and guidelines are presented for understanding both convergent and non-convergent findings. Implications for clinical practice and future research are discussed.

**Keywords: fundamental frequency, attraction, emotional arousal, romantic relationships, couples, communication, conflict**

#### **INTRODUCTION**

Across cultures, adults rate marriage and long-term monogamous relationships as the most important ones in their lives (Buss, 2005). Despite this, divorce rates in industrialized countries range from 40 to 55% (Hahlweg et al., 2010). A tremendous amount of effort has accordingly been devoted to identifying and understanding risk factors for relationship dissolution (Gottman and Notarius, 2002), with a particular emphasis on identifying risk factors that are present from the outset of a relationship (Caughlin and Huston, 2002). The polarization model of romantic relationships (Jacobson and Christensen, 1998; Baucom and Atkins, 2013) suggests that one of the most important kinds of risk factors originates in variables that are initially associated with attraction and desire but later become sources of distress and interpersonal friction. A growing body of empirical evidence suggests that the fundamental frequency of romantic partners' voices may represent precisely this kind of risk. Vocal qualities that are found to be attractive during the early stages of relationship formation are also associated with increased levels of dysfunction, increased risk for divorce, and decreased likelihood of benefitting from couple therapy in later stages of a relationship.

Evolutionary models suggest that perceptions of health status and likelihood of reproductive success are among the primary determinants of attractiveness and mate selection. Physical aspects of attractiveness like facial symmetry, body height, and the voice are indicative of levels of sexual hormones, and thus of the likelihood of producing healthy offspring. Generally, indicators of higher levels of masculinity in males (based on higher levels of testosterone) and higher levels of femininity in females (based on higher levels of estrogen) are perceived as more attractive and are linked to better reproductive success. This leads to a dimorphism in attractiveness ratings: males tend to prefer female partners with higher f0 voices and smaller body sizes, while females tend to prefer males with lower f0 voices and larger body sizes (Puts et al., 2012). Though these sex-related differences are important determinants of attraction, the polarization model of relationship distress suggests that they may become associated with dysfunction and increased risk for divorce as relationships mature. A change from attraction to distress is often accompanied by a change in attribution for the difference itself. Differences that are seen as enriching and complementary are often experienced as desirable (Aron and Aron, 1997), while differences that are seen as shortcomings and faults in the other commonly contribute to a cycle of distress. When differences are seen as faults in the other, spouses typically alternate between criticizing and blaming one another for the problems in their relationship and defending themselves from their partner's attacks (i.e., they engage in a strong demand/withdraw cycle of conflict).
As this process unfolds, the relationship becomes polarized, the three core aspects of interaction (communication, perception, and physiology) become imbalanced (Burman and Margolin, 1992; Gottman, 1993), conflict ensues, and even routine interaction with the spouse becomes highly aversive. The aversive nature of this cycle typically results in increasing polarization over time and makes it very difficult for spouses to resolve even minor problems. Being stuck in the polarization cycle is highly distressing for spouses, and high levels of aversive arousal during interaction are one of the most robust predictors of risk for relationship distress and dissolution (Gottman and Levenson, 1992). Aversive arousal has most often been captured via well-established physiological indices like heart rate (HR), blood pressure (BP), or skin conductance (Larsen et al., 2008) as well as via endocrine measures like cortisol or epinephrine (Robles and Kiecolt-Glaser, 2003; Ditzen et al., 2011). In addition to well-replicated findings based on physiological measures, a growing body of evidence shows that higher levels of vocally encoded emotional arousal, which are reflected in higher levels of f0, are similarly related to increased risk for a wide range of negative relationship outcomes.

#### **FUNDAMENTAL FREQUENCY (f0)**

As a mathematical quantity, fundamental frequency (f0) refers to the lowest frequency harmonic of the speech sound wave. This frequency is biologically determined by the pattern of vibration created by the vocal folds during phonation, the phase of speech production where the outward flow of air from the lungs is regulated by the larynx. Higher rates of opening and closing of the vocal folds across the glottis are associated with higher f0 values (measured in cycles per second or Hertz [Hz]). Perceptually, f0 is highly correlated with pitch, where higher f0 values correspond to higher pitch (Juslin and Scherer, 2005). F0 can be easily calculated using Praat, a freely available, multi-platform software package (Boersma and Weenink, 2013; www.praat.org), to analyze existing audio recordings of speech of adequate quality. F0 is assessed continuously during human speech and can change rapidly. Different parameters like mean f0, f0 range, minimum f0, or maximum f0 can be calculated and used in empirical research. All of those can be calculated either at a very small scale (such as for each talk turn) or averaged across the entirety of a conversation. As is true of much behavioral research, there is greater agreement about how to calculate f0 than there is about how to understand what f0 represents. F0 has been variously interpreted as an index of vocally encoded emotional arousal (see Juslin and Scherer, 2005 for a review; Weusthoff et al., 2013) and dominance (Puts et al., 2006; Borkowska and Pawlowski, 2011), and additional work demonstrates that f0 correlates with other factors such as age and pubertal status (Hollien and Shipp, 1972; Brown et al., 1991; Hollien et al., 1994, 1997) and the phonemic and syntactic structure of speech (Whalen and Levitt, 1995). Recent work examining f0 during social interaction provides a framework for integrating the various interpretations of f0.
In Weusthoff et al.'s (2013) examination of simultaneous associations between f0, biological sex, physiological indices of arousal, and social behavior, biological sex was the largest predictor of individual differences in f0, while physiological arousal and social behaviors were the best predictors of variance in f0 attributable to a specific social interaction. These results suggest that f0 can be understood as conveying information about both traits (such as biological sex) and states (momentary physiological arousal and social behavior) and highlight the need for careful analysis of f0 to allow for specific interpretations.
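To make the notion of f0 as the repetition rate of the speech waveform more concrete, the following minimal Python sketch estimates f0 for a single voiced frame by locating the strongest autocorrelation peak within a plausible range of periods. It is an illustration only and is not the procedure implemented in Praat or used in the studies reviewed here; the function names, the synthetic test signal, the 8 kHz sampling rate, and the 75–300 Hz search range are assumptions made for the example.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=300.0):
    """Return a rough f0 estimate (Hz) for one voiced frame, or None."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]  # remove any constant offset

    # Candidate periods, in samples, for the assumed 75-300 Hz search range.
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)

    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation: similarity of the frame to itself shifted by `lag`.
        score = sum(x[i] * x[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    # No positive peak (e.g., silence) is treated as "no f0 found".
    return sample_rate / best_lag if best_lag else None


def synthetic_vowel(f0_hz, sample_rate=8000, n_samples=320):
    """Hypothetical test signal: a fundamental plus two weaker harmonics."""
    return [
        math.sin(2 * math.pi * f0_hz * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * i / sample_rate)
        + 0.25 * math.sin(2 * math.pi * 3 * f0_hz * i / sample_rate)
        for i in range(n_samples)
    ]


frame = synthetic_vowel(200.0)
print(estimate_f0(frame, 8000))  # expected to be close to 200 Hz
```

Real speech would additionally require voicing detection, windowing, and protection against octave errors, which is why dedicated software is typically used in practice.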

#### **METHOD OF REVIEW**

In contrast to the nascent body of research examining f0 during marital interactions, f0 has been intensively studied using different research paradigms across multiple disciplines (e.g., single conversations recorded during naturally occurring stressful events like the New York City blackout in 1977, Streeter et al., 1982, or emotion portrayals by professional actors, Banse and Scherer, 1996; see Scherer, 2003 for a review). For example, a key-word based search on SCOPUS (using the search terms "social interaction" AND "verbal" OR "non-verbal" in "Keywords, Abstract, or Title") yielded more than 2000 results. Here, we focus on research with a specific focus on relationship formation and maintenance. The body of work on f0 and attraction is robust and well-developed. Thus, we seek to provide a representative sampling of the main findings in this area. In contrast, there are many fewer studies of naturalistic social interaction in general, and fewer still of marital interaction specifically. As others have noted, studies using naturalistic vocal expressions of affect with sufficient intensity of the emotions displayed are needed in vocal expression research in order to obtain speech that offers sufficient levels of "ecological validity" (Juslin and Scherer, 2005). Communication in close relationships seems to fulfill these criteria, but research has only recently begun to take f0 into account here.

Research with human participants can be divided into two further groups, depending on whether or not interlocutors had some form of relationship with each other before interacting. In the case of pre-existing relationships, four main kinds of relationships have been covered: those between parents and infants (see Irwin, 2003 for a review), between psychotherapists and clients (see Greenberg and Pascual-Leone, 2006 for a review), between physicians and patients (see Hassan et al., 2007; Hulsman et al., 2011 for a review), and between spouses in intimate relationships. Independent of targets, f0 has been found to be associated with arousal. As it is beyond the scope of this review to cover all of this work, the interested reader is referred to the corresponding reviews. In couples' interactions, however, the role of vocally encoded emotional arousal has not been investigated closely. This review aims at looking into f0 and its associations with couple functioning more closely. The authors identified a total of *N* = 5 studies for this task. All analyses were based on dyadic communication settings (conflict discussions) between spouses from long-term or married heterosexual couples. Instructions for the tasks were standardized, and the interactions were videotaped during assessment sessions conducted in research laboratories. Participants were native speakers of either English (Baucom et al., 2009, 2011) or German (Baucom et al., 2012b; Kliem et al., 2012; Weusthoff et al., 2013) (for a detailed description of the studies included in this review, please see **Table 1**).

**Table 1 | Methodological details of interaction studies reviewed.**

As also described in Weusthoff et al. (2013), f0 was obtained continuously from the speech of each person during the problem discussion at the assessment point(s). Conversations were segmented per speaker using audio editing programs like Adobe Premiere Pro, resulting in data per speaker that contained only those parts of the conversation during which he or she talked, without any other human or background sound being present. Bandpass filtering was applied to all voice samples prior to further analyses (calculating f0 values) in order to restrict f0 values to the normal range of emotional adult speech. The typical range is 75–150 Hz for male speakers and 150–300 Hz for female speakers, though wider limits can be observed during highly aroused emotional states (Owren and Bachorowski, 2007). Common f0 mean values are around 225 Hz for women and about 120 Hz for men. Across the lifespan, women's mean f0 decreases, while male speakers' mean f0 scores also decrease at first but then start to rise again. Changes in adults are most prominent for both sexes between age 50 and 60, with women additionally experiencing hormonal influences associated with the onset of the menopause (Hollien and Shipp, 1972; Brown et al., 1991; Hollien et al., 1997). Though it is possible for human speakers to generate sound outside the typical filtering limits of 75 and 300 Hz, f0 scores outside of this range very likely result from background machine or electronic noise. Minimum, maximum, and mean f0 values were generated separately for each spouse by analyzing their respective segmented audio recordings using Praat, a free multi-platform program (Boersma and Weenink, 2013). F0 range was calculated separately for each partner at each assessment by subtracting the partner's averaged minimum f0 from the partner's averaged maximum f0 across the whole conversation. In Baucom et al.'s (2011) study, f0 mean was derived from raw f0 scores across the whole conversation. For a more detailed description of potential f0 range indices, please see **Table 1** in Baucom et al. (2012a).
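The summary indices described above can be sketched in a few lines of Python. The sketch assumes that an f0 track (one value per analysis frame, in Hz) has already been extracted for each talk turn of one speaker, for instance with Praat. The function name `summarize_f0`, the per-talk-turn input format, and the choice to average minima and maxima across talk turns are assumptions made for illustration. In addition, values outside 75–300 Hz are simply discarded after extraction here, which is a simplification of the bandpass filtering applied to the audio in the reviewed studies.

```python
from statistics import mean

F0_FLOOR_HZ = 75.0     # lower filtering limit named in the text
F0_CEILING_HZ = 300.0  # upper filtering limit named in the text


def summarize_f0(talk_turns, floor=F0_FLOOR_HZ, ceiling=F0_CEILING_HZ):
    """Summarize one speaker's f0 across a conversation.

    talk_turns: list of lists of f0 values in Hz, one inner list per talk turn
    (a hypothetical input format chosen for this example).
    """
    # Drop values outside the plausible range (likely machine or electronic noise).
    cleaned = [[v for v in turn if floor <= v <= ceiling] for turn in talk_turns]
    cleaned = [turn for turn in cleaned if turn]  # skip turns left empty
    if not cleaned:
        raise ValueError("no f0 values inside the plausible range")

    avg_min = mean(min(turn) for turn in cleaned)  # averaged minimum f0
    avg_max = mean(max(turn) for turn in cleaned)  # averaged maximum f0
    all_values = [v for turn in cleaned for v in turn]

    return {
        "f0_mean": mean(all_values),    # mean of raw f0 scores
        "f0_min": avg_min,
        "f0_max": avg_max,
        "f0_range": avg_max - avg_min,  # averaged maximum minus averaged minimum
    }


# Made-up values for two talk turns; 40 Hz and 900 Hz fall outside the limits.
example_turns = [[180.0, 220.0, 260.0, 40.0], [200.0, 240.0, 900.0]]
print(summarize_f0(example_turns))
```

Because the range is computed as a difference within one speaker, a constant offset in that speaker's habitual pitch cancels out, which is one reason f0 range is easier to compare across speakers than f0 mean.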

#### **f0 IN INTIMATE RELATIONSHIP RESEARCH**

It is important to further investigate the role of f0 in intimate relationships for three reasons. First, relationship deterioration and/or separation are associated with tremendous costs for affected individuals and society as a whole, both concurrently and in the long run. Dysfunctional relationships lead to poorer mental and physical health in spouses (Kiecolt-Glaser and Newton, 2001), a higher likelihood for divorce, and reduced levels of social support (Amato, 2010). Even a decreased life expectancy has been observed in affected children, due to higher levels of unhealthy behaviors like smoking (Larson and Halfon, 2013). Second, couple-relationship education (CRE) programs, e.g., *EPL – Ein Partnerschaftliches Lernprogramm* ("A Learning Program for Couples"; Kaiser et al., 1998), aim at reducing the risk of relationship dissolution by teaching communication skills and ways of dealing with aspects of couples' emotional lives, especially during conflict discussions. These programs are known to be highly effective in doing so both in the short and in the long run (Hawkins et al., 2008; Hahlweg and Richter, 2010). However, the mechanisms of change in CRE and how emotional arousal affects the work in and outcome of CRE are still unknown. Communication is assumed to be the central mechanism of change; however, empirical studies using the paradigm of couples having videotaped conflict discussions that are later rated and/or analyzed with regard to different aspects of behavior have not consistently supported this supposition. One reason for this might be the rather simplistic ways of operationalizing a complex behavior like human communication as single elements like positive or negative utterances (Christensen, 2010). Therefore, research should consider several aspects of communication simultaneously in a multi-channel approach (Gottman and Notarius, 2002).
Third, though emotional arousal is known to play a crucial role in relationship dissolution (Gottman and Levenson, 1992), the specific pathways and interactions with other important variables like communication behavior or relationship satisfaction are still unknown (Weusthoff et al., 2013). As future research in this field should focus on how to "modify emotion-driven, dysfunctional, and destructive interactional behavior," and to "elicit avoided, emotion-based" behavior (Christensen, 2010, pp. 36-37), there is an additional need for an objective tool for detecting and explaining emotional arousal in the context of couple interactions and intimate relationships.

During courtship, f0 is one of the evolutionarily most important signals for non-visual gender discrimination (which is important for successful identification of potential mates; Junger et al., 2013), and for judging the attractiveness of a potential spouse (Borkowska and Pawlowski, 2011). Very masculine voices in males (perceived as low voice pitch, and thus low f0), and very feminine voices in females (perceived as high voice pitch, and thus high f0) are perceived as particularly attractive (DeBruine et al., 2006; Feinberg et al., 2006, 2012; Borkowska and Pawlowski, 2011; O'Connor et al., 2012). These perceptions are attenuated in both men and women during high-fertile phases of the female menstrual cycle (Puts, 2005; Pipitone and Gallup, 2008; Hodges-Simeon et al., 2010). High vocal masculinity in males is linked to high levels of testosterone and associated with a higher likelihood of producing healthy offspring (O'Connor et al., 2012). However, in the long run, males with higher levels of vocal masculinity report a larger number of different sexual partners, and are less likely to engage in relationship-maintaining and parental behaviors (O'Connor et al., 2011, 2012). It also seems plausible that f0 during relationship initiation stages displays high levels of arousal that are taken by the partner as an indicator of higher engagement in a relationship, also hinting at a process similar to polarization.

During maintenance, less research on f0's role has been conducted so far. F0 has most often been studied in couples' conflict interactions, where it seems to index levels of emotional arousal. F0 during couple conflict has been demonstrated to be significantly positively associated with multiple cardiac and endocrine indices of arousal, and linked to more negative and less positive communication behavior. F0 is thought to simultaneously display autonomic physiological as well as socially learned reactions in one signal (Weusthoff et al., 2013). Furthermore, it has been shown that spouses influence each other in their levels of arousal. If one partner is highly aroused, it becomes more difficult for the other to remain at a functional level of arousal, thus leading to polarization processes with regard to f0 (Baucom and Atkins, 2013).

As also noted elsewhere (Weusthoff et al., 2013), f0 offers a number of methodological and conceptual benefits with regard to being further used in research on romantic relationships. Its non-invasive nature and lack of need for additional equipment (only a good audio recording device is needed) make the assessment of f0 an excellent candidate for data collection in situations like conflict interactions that are most often videotaped, or for post-hoc analyses (e.g., spousal discussions, or therapy sessions). Additionally, f0 can be computed and analyzed from the videotapes even after conducting a study in which it was not of primary interest. Furthermore, f0 is more closely associated with individual psychological than physical distress (Johannes et al., 2007), making it more sensitive to changes in psychological load and less prone to artifacts that influence physiological measures of arousal like HR or skin conductance (e.g., movements, Sloan and Kring, 2007). Being based on physical properties of speech, f0 can be analyzed objectively and independent of the researcher's native language and culture (Weusthoff et al., 2013). Perhaps most importantly, f0 is directly involved in and available during the process of communication and can be perceived by a listener (Baucom et al., 2011). There is good evidence that partners in communication directly respond to each other's f0 without being aware of this fact (Gregory and Webster, 1996). Similar to cardiovascular indices of arousal (e.g., HR), f0 seems to encode aspects of emotional arousal that are part of an autonomic process not entirely controllable by a speaker, similar to conditioned emotional responses (Kliem et al., 2012).

Merging these theories, the expression of emotions toward one's spouse can be considered a central part of well-functioning intimate relationships; the role of underlying emotional arousal, however, seems to be somewhat different. Endocrine and psychophysiological indices of heightened levels of emotional arousal have consistently been linked to unwanted relationship outcomes like dysfunctional communication behavior, higher risk for divorce and separation, or lower levels of relationship satisfaction (Gottman and Levenson, 1992; Kiecolt-Glaser and Newton, 2001; Gottman and Notarius, 2002). As emotions and their underlying processes like arousal are essential parts of social interactions (Juslin and Laukka, 2003) and can influence them heavily (Scherer, 2005), they constitute important information not only for researchers in the social interaction area but also for people's everyday lives.

#### **STUDIES REVIEWED**

#### **SUMMARY OF FINDINGS**

Across studies, languages, and gender, f0 has been treated as an index of emotional arousal. Like other indices of arousal, f0 has been associated with a number of different negative aspects of couple functioning both concurrently and in the long run. Significant associations have been found in conflict interactions of distressed couples participating either in couples therapy or in CRE. More specifically, higher levels of a person's physiological indicators of emotional arousal like HR, BP, or salivary cortisol were associated with higher levels of one's own f0. Higher levels of a spouse's f0 have also been linked to higher levels of negative communication behavior, both observed and self-reported, and to lower levels of observed non-verbal positive communication behavior. In highly distressed couples, the likelihood of long-term success of couple therapy (being in the recovered range 2 years after treatment termination) was higher when wives displayed lower levels of f0 in a conflict discussion prior to treatment. Elevated levels of f0 in conflict discussions prior to participation in CRE were also linked to a higher likelihood of separation and divorce, and to a smaller number of communication skills remembered 11 years after the CRE.

#### **DISCUSSION**

This review aimed at examining research on vocally encoded emotional arousal, namely f0, in close personal relationship communication (interaction between spouses). Significant associations were found with psychophysiological and endocrine indices of emotional arousal, observed and self-reported communication behavior, and relationship functioning (stability and satisfaction). Furthermore, f0 has been found to be related to the remembrance of skills taught in CRE. Significant results were found for concurrent as well as longitudinal links between f0 and variables of interest, and for a time frame as long as 11 years. Significant gender differences emerged in the associations of f0 range with some variables of interest, but not in those of f0 mean.

#### **FINDINGS CONSIDERED FROM AN EVOLUTIONARY PERSPECTIVE**

Across studies, f0 was analyzed as an index of emotional arousal thought to display information about the internal state of the speaker to an interaction partner via the human voice. Stable associations between f0 and different forms of communication behavior (positive, negative, observed, self-reported, verbal, and non-verbal) emerged across studies. Social signaling theories assume evolutionary reasons for this: in order to heighten chances of survival in ancient hunter-gatherer societies, emotions were adaptive developments helping to avoid danger and to cooperate with others (Buss, 2005). Vocally encoded emotional arousal enables individuals to communicate emotions non-verbally from one person to another, independent of words and language. Vocal communication of emotion is more likely to appear in goal-relevant behavior (Juslin and Laukka, 2003). F0 scores indexing vocally encoded emotional arousal should therefore be higher during goal-relevant behavior. What is considered goal-relevant behavior depends on situational aspects: couple and family conflict is conceptually thought of as a chronic stressor (Doss et al., 2004) in which a wide range of negative communication behavior (like demand/withdraw, DW) is often displayed in order to achieve change in one's spouse (Christensen et al., 2006) and high levels of emotional arousal are displayed via multiple channels (Kiecolt-Glaser and Newton, 2001). Positive associations between f0 and these variables, as found in the reviewed studies investigating conflict, could be indexing goal-achieving strategies by spouses.

With regard to the theoretical foundation of emotional arousal, coordinated responses in multiple physiological systems stem from the same source, namely the cognitive appraisal in a given situation and the subsequent changes in autonomic and somatic nervous system activity (Scherer, 2009). These changes were assessed via multiple channels of arousal in Weusthoff et al.'s (2013) study, with the associations between f0 and psychophysiological indices of arousal being in the expected directions. The periaqueductal grey (PAG), a brainstem-based (and thus evolutionarily old) area in humans, is involved in both vocal and cardiovascular responses to different kinds of stress. The PAG is thought to integrate emotional aspects of stress responses into the autonomic nervous system responses in different channels (Linnman et al., 2012). These findings suggest that f0 as an index of emotional arousal, and thus of the intensity of the emotional reaction (Weusthoff et al., 2013), is influenced by both basic biological processes and socially learned communication behaviors stemming from a similar evolutionary basis (Juslin and Laukka, 2003).

Different brain regions are found to be involved in processing male and female voices (Sokhi et al., 2005), and discriminatory performance for human voices was better for opposite-sex stimuli than for same-sex stimuli (faster identification; Junger et al., 2013). F0 seems to enable stronger attention toward stimuli higher in social significance like potential mates (Feinberg, 2008), which could be considered an important aspect of relationship functioning not only in initiation but also in maintenance phases. The studies covered in this review seem to support this interpretation: during later phases of a relationship, and especially in "rough" times, spouses seem to use f0 in particular as a non-verbal signal for relationship-relevant information such as the expression of distress.

#### **FINDINGS WITH REGARD TO GENDER DIFFERENCES**

Across age and interaction settings, gender differences were found in a number of studies. Biologically speaking, f0 is based on vocal cord length and tension. Given the physiological differences in body size, and thus throat length, between men and women, gender differences in f0 emerge after puberty, with females on average having significantly higher f0 scores than males (Titze, 1989). Two studies in this review (Baucom et al., 2012b; Weusthoff et al., 2013) explicitly investigated gender differences and, as expected, found higher f0 range scores for female than for male speakers. Except for Baucom et al. (2011) and Weusthoff et al. (2013), all studies investigating adult speakers included in this review found variables of interest to have significant associations with f0 range for female speakers only, or found different associations for men and women. Given that female partners are more likely to seek help with regard to problems and/or conflict in close personal relationships (O'Brien, 1988), it also seems plausible that gender differences could stem from different goals and resulting goal-oriented behavior in males and females (especially as no associations emerged that were significant for men only). Finding the best partner to produce healthy offspring seems to be an important goal during selection of a potential mate and sex partner during courtship. Significant associations between perceived and self-rated attractiveness and health, and f0 for both sexes during sexual selection (though due to endocrine reasons in different directions, with higher attractiveness and perceived health being associated with higher f0 scores in women and lower f0 scores in men; Puts, 2010) also hint at f0 being an important aspect in goal-relevant behavior.

#### **LIMITATIONS**

The work reviewed in this manuscript has taken into account only three indicators of vocally encoded emotional arousal out of a wide variety of potential ones (Juslin and Scherer, 2005, pp. 103–104): f0 range, f0 mean, and f0-time-to-peak as a time-varying aspect of f0. This has happened for a number of reasons. As earlier work on vocally encoded emotional arousal has focused on f0 mean in a number of different research settings (i.e., f0 in speech of psychiatric patients, Tolkmitt et al., 1982; or emotion portrayal by professional actors, Banse and Scherer, 1996), it was chosen as the parameter of interest in the first study conducted on f0 in close personal relationships (Baucom et al., 2011) in order to enable comparisons to these published empirical findings. However, among the different f0 indices that can be calculated, f0 range seems to offer a number of advantages in research on close personal relationships (see Juslin and Scherer, 2005 for a review), and was therefore chosen for the studies conducted later. F0 range is considered to be the cleanest vocal indicator of emotional arousal, and to carry the greatest amount of information on emotional arousal (Juslin and Scherer, 2005; Busso et al., 2009). Furthermore, it facilitates the interpretation of gender differences as it adjusts for potential individual differences by way of its calculation (subtracting the individual's minimum f0 score from the individual's maximum f0 score; Weusthoff et al., 2013). Nonetheless, it is possible that other indicators of vocally encoded emotional arousal (e.g., jitter) might contain additional information on the nature of emotional arousal in close personal relationship communication.

Though biological reasons for differences in male and female voices are well-documented and have been discussed in this review, it remains unclear why expected sex differences for f0 indices and their associations with different variables of interest did not emerge consistently across studies. In particular, none of the studies using f0 mean (in contrast to studies using f0 range) found significant gender differences in its associations with variables of interest. Future research should regard gender as a covariate in f0 research in close personal relationships, but more detailed investigations of gender differences in f0 indices are needed.

Investigating emotional arousal in human speech using f0 has the drawback that participants are able to deliberately influence and control f0 to a certain degree. Empirical evidence has shown that conscious manipulations can result in both elevated and lowered pitch levels that are identifiable by a communication partner (Kuenzel, 2000). However, given the arousing, stressful, and cognitively demanding situation of couple conflict, it is quite unlikely that participants in the studies reviewed made use of this hypothetical possibility.

#### **IMPLICATIONS FOR CLINICAL WORK AND FUTURE RESEARCH**

As f0 seems to be able to display an individual's internal levels of emotional arousal, other dyadic settings in which emotional arousal is considered to be important information could also benefit from research on vocally encoded arousal. During social support interactions between spouses, the main goal is to show the interaction partner positive and helpful behaviors that help him or her to deal with stressful situations. Social support is considered to consist of different positive and helpful behaviors that are observable in communication between interaction partners and that lead to changes in individual cardiovascular and endocrine functioning, which in turn influence individual health outcomes. Consistent with this model, lower levels of social support behavior have been empirically linked to higher levels of physiological indices of arousal (most often BP), and to poorer health outcomes in affected individuals (Uchino et al., 1996). Given the associations of BP and HR with f0 in couple conflict, vocally encoded emotional arousal also seems to be important during social support interactions between spouses. However, in comparison to conflict discussions, more positive and less emotionally arousing behaviors seem to be goal-relevant in social support and should therefore be displayed more often between interaction partners (Verhofstadt et al., 2005). Links between f0 and variables of interest in social support behavior could therefore differ in direction and magnitude from the ones found in conflict discussions.

Communication Accommodation Theory (CAT; Giles and Coupland, 1991) describes changes in partners' communication styles during a conversation that lead to more or less similarity between the partners (convergence and divergence). Convergence can occur on different levels (e.g., pitch pattern, speech rate, or emotional expression), and has empirically been shown to occur in various dyads. Convergence seems to happen somewhat naturally but is associated with different outcomes depending on the interaction setting and other external factors in intimate relationships. Convergence seems to be a beneficial process during interactions that call for social support between partners. For example, high levels of emotional convergence (meaning similarity in emotional expressions between spouses) while sharing emotional events of one's day are related to higher levels of concurrent and longitudinal relationship satisfaction, and higher relationship stability (Anderson et al., 2003). During couple conflict, however, convergence in emotional expression and behavior, as in the DW pattern, has negative impacts on relationship functioning and outcomes (Lee et al., 2012; Baucom and Atkins, 2013). A closer look at potential reasons for these differential effects of convergence in couples' communication could be a fruitful avenue for future research. Are the associations due to levels of relationship satisfaction (happy vs. unhappy couples), do they depend on the context of the interaction (social support vs. conflict), or do both factors and/or a third one contribute?

Vocal characteristics of talk show hosts and guests have also been demonstrated to converge over the course of an interview. However, across different dyads, status and/or power seem to be an important factor influencing convergence: the conversation partner lower in status and power (having a greater "need for social approval"; Giles and Coupland, 1991, p. 73) exhibits more change in vocal and emotional expressions, moving in the direction of the more powerful partner (Gregory and Webster, 1996).

Power processes are also influential in couples' communication (Baucom et al., 2011), and are interdependently linked to each other. For example, f0 in distressed spouses' problem-solving discussions is known to converge in terms of magnitude: if one spouse is highly aroused, the other partner is also more likely to be highly aroused. Furthermore, this covariation seems to lead to problems in partners' ability to regulate their own state of arousal to a comfortable level (Baucom and Atkins, 2013).

Given these findings, it seems likely that spouses' f0 scores could also be associated with each other across speakers. Gottman and Notarius (2002, p. 185) explicitly state that there is a "need for continued focus on sequences or patterns of interaction" in order to identify beneficial and harmful aspects of spousal communication. Sequential analyses of f0 could shed light on the presence, magnitude, and direction of interdependent and cyclical aspects of emotional arousal in couple communication.

#### **ACKNOWLEDGMENTS**

This research was supported by grants from the Deutsche Forschungsgemeinschaft (DFG; DFG Fe 263/5-1, Ha 1400/4-1, Ha 1400/16-1, and Ha 1400/16-2) awarded to Kurt Hahlweg and from the National Institute of Child Health and Human Development (F32 HD060410) and the University of Utah awarded to Brian R. Baucom.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 March 2013; accepted: 25 June 2013; published online: 15 July 2013.*

*Citation: Weusthoff S, Baucom BR and Hahlweg K (2013) The Siren song of vocal fundamental frequency for romantic relationships. Front. Psychol. 4:439. doi: 10.3389/fpsyg.2013.00439*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Weusthoff, Baucom and Hahlweg. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Voice quality in affect cueing: does loudness matter?

#### *Irena Yanushevskaya\*, Christer Gobl and Ailbhe Ní Chasaide*

*Phonetics and Speech Laboratory, Centre for Language and Communication Studies, School of Linguistic, Speech and Communication Sciences, Trinity College Dublin, Ireland*

#### *Edited by:*

*Petri Laukka, Stockholm University, Sweden*

#### *Reviewed by:*

*Tanja Bänziger, Högskolan i Gävle, Sweden*

*Sona Patel, Northwestern University, USA*

#### *\*Correspondence:*

*Irena Yanushevskaya, Phonetics and Speech Laboratory, Centre for Language and Communication Studies, School of Linguistic, Speech and Communication Sciences, Arts Block, Trinity College Dublin, Dublin 2, Ireland e-mail: yanushei@tcd.ie*

In emotional speech research, it has been suggested that loudness, along with other prosodic features, may be an important cue in communicating high activation affects. In earlier studies, we found different voice quality stimuli to be consistently associated with certain affective states. In these stimuli, as in typical human productions, the different voice qualities entailed differences in loudness. To examine the extent to which the loudness differences among these voice qualities might influence the affective coloring they impart, two experiments were conducted with the synthesized stimuli, in which loudness was systematically manipulated. Experiment 1 used stimuli with distinct voice quality features including intrinsic loudness variations and stimuli where voice quality (modal voice) was kept constant, but loudness was modified to match the non-modal qualities. If loudness is the principal determinant in affect cueing for different voice qualities, there should be little or no difference in the responses to the two sets of stimuli. In Experiment 2, the stimuli included distinct voice quality features but all had equal loudness to test the hypothesis that equalizing the perceived loudness of different voice quality stimuli will have relatively little impact on affective ratings. The results suggest that loudness variation on its own is relatively ineffective whereas variation in voice quality is essential to the expression of affect. In Experiment 1, stimuli incorporating distinct voice quality features consistently obtained higher ratings than the modal voice stimuli with varied loudness. In Experiment 2, non-modal voice quality stimuli proved potent in affect cueing even with loudness differences equalized. Although loudness *per se* does not seem to be the major determinant of perceived affect, it can contribute positively to affect cueing: when combined with a tense or modal voice quality, increased loudness can enhance signaling of high activation states.

#### **Keywords: voice quality, loudness, intensity, perception, emotion, affect**

#### **INTRODUCTION**

Expressive, affectively colored speech is characterized by dynamic variation of the voice source. Prosodic features of the voice play a fundamental role in conveying emotions and attitudes in human communication. Specific affective states are expressed and recognized in terms of tone-of-voice, which entails features of voice quality (including perceived loudness of the voice) and pitch as well as temporal factors such as speaking rate. Experiments reported in Gobl and Ní Chasaide (2003), Gobl (2003), Gobl et al. (2002), Yanushevskaya et al. (2011) explored the mapping of voice quality to affect. They used an utterance synthesized with different voice qualities to examine how changes in voice quality can alter its perceived affective coloring. Results in repeated experiments showed a clear mapping between voice quality and affect. The voice qualities synthesized in those experiments are discussed below.

This paper is prompted by questions arising out of these earlier studies and explores, in two related experiments, the role that loudness plays in the way that differences in voice quality can cue affect. In these earlier studies, in changing the glottal pulse shape to synthesize the different voice qualities, the loudness is concomitantly altered. This parallels what happens in typical human productions. Different voice qualities tend to be characterized as having differences in loudness, e.g., tense and harsh voice will most likely be perceived as louder than whispery or breathy voice.

However, despite the tendency for specific voice qualities to be produced with differences in loudness, there is no absolute linkage: while one may tend to produce tense voice more loudly than modal, one *can* produce a relatively quieter tense voice quality. Similarly, while whispery voice does tend to be produced with a lower loudness level than modal voice, it can be produced over a range of loudness levels.

Thus, while our earlier experiments reported distinct affective associations with particular voice qualities, the question arose as to the extent to which the effect might be due to the intrinsic loudness differences in the stimuli. The question can be framed in terms of two opposing hypotheses. On the one hand, one could hypothesize (Hypothesis A) that the loudness level is entirely responsible for the affective coloring achieved in these earlier experiments. Or, to take the opposing view (Hypothesis B), it could be that the differences in inherent loudness among the stimuli were irrelevant to the affective coloring they impart. A further hypothesis (Hypothesis C) is perhaps more likely: that loudness contributes somewhat to the affect cueing. This hypothesis would be consistent with the suggestion (Schröder, 2004) that manipulating loudness of a synthesized stimulus while keeping voice quality constant should have a less prominent impact on the stimulus perception than varying the voice quality and keeping absolute loudness unchanged. A more recent study on the interdependencies among voice source parameters in emotional speech (Sundberg et al., 2011) showed the importance of accounting for loudness variation in the analysis of affectively colored speech and further prompts investigation into the relative contribution of voice quality and loudness in vocal expression of affect.

Experiment 1 used stimuli incorporating distinct voice quality features including intrinsic loudness variations and stimuli where voice quality (modal voice) was kept constant, but in which loudness was systematically modified to match the loudness level of the non-modal qualities. If loudness is the principal determinant in affect cueing for different voice qualities, there should be little or no difference in the responses to the two sets of stimuli, and they should both signal affect in a way that is similar to our earlier reported experiments. In Experiment 2, three series of stimuli were presented to listeners. In the first series, the stimuli incorporated distinct voice quality features but all had equal loudness (they were normalized to the loudness level of the original modal voice stimulus). In the other two series, the intensity levels of the first series were either increased or decreased by 2 dB. If loudness is the main determinant of affect cueing the responses should be little differentiated within any one of these series, but one would expect to see differences across the three series.

#### **LOUDNESS AND AFFECT**

Although, broadly speaking, the role of voice quality in communicating affect has been relatively little studied, there is an extensive literature on the affect signaling correlates of pitch and intensity variation in speech, and it has often been suggested that there are affects that are expressed loudly and others for which a low intensity is typical. Acoustic profiles of emotional expressions (Scherer, 1986, 2003; Sundberg et al., 2011) suggest that anger and happiness are signaled by increased pitch, increased loudness, and a faster rate of speech, whereas boredom and grief are characterized by low pitch and a slow speaking rate. As summarized in Frick (1985), contempt is loud and grief and boredom are soft. Siegman and Boyle (1993) showed that an increase in speech rate and loudness when speaking about fear- and anxiety-arousing events was associated with a corresponding increase in listeners' perception of fear and anxiety. A similar correlation was found between sadness and depression and the decrease in speech rate and loudness. Certain negative emotions and signs of aggression are characterized by increased speech intensity (and consequently by increased perceived loudness) (Scherer, 2003). Voice quality variations related to the vocal effort of the speaker and the intensity (and its perceptual correlate, loudness) of affectively colored vocalizations are therefore often suggested to be important factors in the encoding and recognition of high activation affective states.

In speech communication research, loudness has been studied primarily as a perceptual correlate of linguistic prominence and stress using acoustic measures related to the overall intensity of speech signal, spectral properties of the signal (spectral slope, spectral balance, or spectral emphasis) as well as through the studies of vocal effort (e.g., Sluijter and van Heuven, 1996; Traunmüller and Eriksson, 2000; Heldner, 2001; Kochanski et al., 2005).

The term loudness has been used somewhat differently across studies and perceived loudness is not infrequently (though inaccurately) treated as synonymous with intensity. In psychoacoustics, loudness is defined as the perceived magnitude of the sound (Scharf, 1978; Plack and Carlyon, 1995; Zwicker and Fastl, 1999; Moore, 2003). Because perceived loudness is a subjective auditory sensation, inferences about it have to be based on the results of listening tests using psychoacoustic procedures such as magnitude estimation and magnitude production. Objective methods of estimation of perceived loudness include the use of loudness models and loudness meters (Skovenborg and Nielsen, 2004). Loudness can be expressed in sones (perceived loudness) or phons (loudness level).

As would be expected, perceived loudness is mainly determined by the sound intensity, but the relationship between sound intensity and loudness is complex. For instance, two sounds perceived as equally loud may have very different sound intensities (and vice versa) depending on their spectral characteristics and/or bandwidth. The reason for this complex relationship is linked to how sound is processed in the cochlea, i.e., whether the acoustic energy is spread over many or only one or a few critical bands (Moore, 2003). It furthermore depends on such factors as the properties of the signal (spectral content and bandwidth or duration and intermittency of sound) and the conditions in which the sound is presented to the listener (for example, the background). There also exists an important interaction between the properties of the signal and the listener. As pointed out by Scharf (1978, p. 188), "loudness resides in the listener, not in the stimulus." Perceived loudness will depend to varying degrees on factors such as stimulus presentation (binaural or monaural), whether the listener has been exposed to noise, whether the listener has a hearing impairment, and to what extent listening is a conscious process (Scharf, 1978). The study of the perception of loudness and the way it is related to the temporal and spectral properties of a signal is fundamental to the understanding of the way in which sounds are represented in the auditory system (Moore, 2003).

#### **VOICE-QUALITY-VARYING STIMULI USED IN EARLIER STUDIES**

As the experiments reported here follow on earlier experimental studies (Gobl et al., 2002; Gobl and Ní Chasaide, 2003; Yanushevskaya et al., 2011), and use as a starting point the same synthetic stimuli varying in terms of their voice qualities, we will summarize briefly here how these were generated.

The set of voice-quality-varying stimuli includes modal voice, whispery voice, breathy voice, lax-creaky voice, harsh voice, and tense voice. These stimuli represent a range of voice qualities according to the classification system in Laver (1980), with the exception of lax-creaky voice, which is conceptually an extension of the Laver framework. The stimuli were based on a recording of a Swedish utterance "ja adjö" [ˈjɑ aˈjø], produced with modal voice by a male speaker. The utterance was inverse filtered using a manual interactive software system (Gobl and Ní Chasaide, 1999), and the voice source parameterization data were obtained by matching the Liljencrants-Fant (LF) model (Fant et al., 1985) to the estimated glottal flow signal using the same system. The utterance was subsequently re-synthesized using the LF model implementation incorporated in the KLSYN88a formant synthesizer (Klatt and Klatt, 1990). Based on the modal utterance, whispery, breathy, lax-creaky, tense, and harsh voice were generated by manipulating a set of the KLSYN88a parameters. The synthesis was guided by the earlier analytic studies (Gobl, 1988, 1989) as well as by the broader literature on voice quality (see review in Ní Chasaide and Gobl, 1997; Gobl and Ní Chasaide, 2010). A detailed description of the stimuli is given in Gobl and Ní Chasaide (2003).

#### **PRELIMINARY EXPERIMENT: LOUDNESS MATCHING**

For the loudness-related manipulations of perceptual Experiments 1 and 2 below, a preliminary loudness matching experiment was first carried out. For Experiment 1, we aimed to generate a series of stimuli with modal voice quality, but with loudness levels matched to those of the voice-quality-varying stimuli. In Experiment 2, we aimed to neutralize the inherent loudness differences among our voice-quality-varying stimuli by equalizing them to the loudness of our original modal stimulus.

Given that, as discussed above, loudness is defined as the subjective magnitude of a sound, simple intensity normalization was considered unsatisfactory in generating stimuli matching in loudness. Even though intensity manipulations were used as the method of generating the stimuli, the best loudness match had to be obtained in the course of a preliminary auditory experiment. Thus for example, a modal stimulus matching the loudness of a voice quality stimulus would not necessarily have the same sound intensity but should be perceived as equally loud.

A listening test was therefore carried out in order to find modal stimuli that would best match in terms of perceived loudness each of the original voice quality stimuli. The test used our original modal voice quality as the basic stimulus, and varied the loudness systematically. Its intensity level was increased/lowered in relatively fine steps of 1 dB to provide a selection of sample sounds, which could then be compared to the original voice quality stimuli in the course of auditory tests. The procedure was similar to the loudness matching experiments common in psychoacoustic research. However, rather than letting the listeners regulate the gain control continuously to adjust the loudness of the test stimuli to match the reference stimulus, the listeners could choose the best match from a set of discrete stimuli, differing in relatively fine loudness steps.

A set of 24 stimuli was thus generated using the GoldWave v.4.26 software. Each stimulus was given a numeric value corresponding to the change in intensity in dB so that the "quietest" stimulus (Stimulus −12) had an intensity level that was 12 dB less than that of the original modal voice stimulus and the "loudest" stimulus (Stimulus +12) had an intensity level that was 12 dB higher than that of the modal voice. The original modal voice stimulus (Stimulus 0) was also included in the set. The total number of stimuli was 25.

To obtain the required intensity values, the amplitude of the original modal stimulus was multiplied by scaling factors corresponding to an increase/decrease of the intensity level by 1 dB [scaling factor $= 10^{(\mathrm{dB\_value}/20)}$]. The resulting stimuli were arranged according to increasing intensity from the lowest intensity (Stimulus −12) to the highest intensity (Stimulus +12), with the modal voice in the middle of the range. This order was kept constant as the range of stimuli was presented to the listeners. The listeners were informed of this arrangement of the stimuli prior to the experiment.
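The amplitude scaling described above can be sketched in a few lines of Python. This is only an illustration of the dB-to-amplitude relation given in the text, not the procedure actually run in GoldWave; the function names and the representation of a stimulus as a plain list of sample values are assumptions.

```python
def db_to_scaling_factor(db_change):
    """Amplitude scaling factor for a change in intensity level of db_change dB."""
    return 10.0 ** (db_change / 20.0)


def build_ladder(max_step_db=12):
    """Map each stimulus number (-12 ... +12, in 1 dB steps) to its scaling factor."""
    return {step: db_to_scaling_factor(step)
            for step in range(-max_step_db, max_step_db + 1)}


def scale_samples(samples, db_change):
    """Multiply every sample of a stimulus by the factor for db_change dB."""
    factor = db_to_scaling_factor(db_change)
    return [s * factor for s in samples]


ladder = build_ladder()
print(len(ladder))          # number of stimuli, including Stimulus 0
print(round(ladder[6], 3))  # a +6 dB step roughly doubles the amplitude
```

Stimulus 0 has a factor of exactly 1, so the original modal stimulus is left unchanged, and the factors for +n and −n dB are reciprocals of each other.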

Sixteen native speakers of Irish-English participated in the preliminary listening test. They were instructed to listen in turn to each of the five original voice quality stimuli described above (whispery, breathy, lax-creaky, harsh, and tense, presented five times in random order) and to select for each sound the best loudness match out of the range of 25 modal voice stimuli of varying loudness level. The participants were allowed to listen to the stimuli as many times as they needed to make a decision, and then to mark the responses on an answer sheet.

For each of the original voice qualities, the numbers of the best matching stimuli were averaged across the responses of the 16 participants (a total of 16 × 5 = 80 responses). The average measure Intraclass Correlation Coefficient (ICC) (R) calculated to test the overall consistency of the participants in the ratings of the stimuli was found to be high at 0.99.

As the stimulus numbers corresponded to the dB change in amplitude level of the original modal stimulus to bring it to the loudness level of a particular voice quality stimulus, the mean values represent the required change in dB to match our original modal stimulus to the loudness of the original voice qualities. These values and their standard deviations (in brackets) are shown in **Table 1**. The corresponding scaling factors which were applied to the modal stimulus to generate the matched series of stimuli are also shown in **Table 1**.

In the second experiment, a similar loudness adjustment was made insofar as the loudness levels of the original voice-quality-varying stimuli were equalized to the loudness level of the original modal voice stimulus. The scaling factors derived here (shown in **Table 1**) were also used for this purpose, except that this time the sample values were divided by (rather than multiplied by) the scaling factor.
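The two uses of the scaling factors (multiplying the modal stimulus for Experiment 1, dividing the voice quality stimulus for Experiment 2) can be sketched as follows. The listener choices and sample values below are hypothetical placeholders, not the values from Table 1, and the function names are illustrative assumptions.

```python
from statistics import mean


def db_to_scaling_factor(db_change):
    """Amplitude scaling factor for a change in intensity level of db_change dB."""
    return 10.0 ** (db_change / 20.0)


def match_modal_to_quality(modal_samples, mean_db):
    """Experiment 1: scale the modal stimulus to the loudness of a voice quality."""
    factor = db_to_scaling_factor(mean_db)
    return [s * factor for s in modal_samples]


def equalize_quality_to_modal(quality_samples, mean_db):
    """Experiment 2: scale a voice quality stimulus to the loudness of modal voice."""
    factor = db_to_scaling_factor(mean_db)
    return [s / factor for s in quality_samples]


# Hypothetical best-match choices (stimulus numbers = dB change) for one quality.
hypothetical_choices_db = [-6, -5, -7, -6]
mean_db = mean(hypothetical_choices_db)

modal = [0.2, -0.4, 0.1]  # hypothetical sample values
matched = match_modal_to_quality(modal, mean_db)
restored = equalize_quality_to_modal(matched, mean_db)
print(mean_db, matched, restored)
```

Because the second operation divides by the same factor that the first multiplies by, applying both in sequence returns the original sample values, which is why a single set of factors can serve both experiments.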

#### **EXPERIMENT 1: TESTING AFFECT CORRELATES OF LOUDNESS-VARYING MODAL STIMULI**

Experiment 1 tests the impact of loudness on the perception of affect using two types of synthesized stimuli: (1) stimuli of a constant voice quality (modal voice) in which loudness was systematically modified to match the loudness levels of the voice-quality-varying stimuli, and (2) the original series of voice-quality-varying stimuli whose intrinsic loudness varies correspondingly.

Returning to the hypotheses stated earlier, if Hypothesis A is correct (loudness is the main determinant of observed affective associations), results for the two series of stimuli should be identical and both series should impart affective coloring akin to what was found in the earlier mentioned studies. If Hypothesis B is correct (loudness is irrelevant to the observed affective associations), results for the two series should be markedly different: the voice-quality-varying stimuli should replicate the affective colorings of our earlier studies, while the modal stimuli (varying in loudness) should not. According to Hypothesis C, the modal series with loudness variation should yield some degree of affective ratings in the direction of those ratings obtained in the voice-quality-varying series. In the event of results pointing toward Hypothesis C, the results of this test might further give some idea of how important the contribution of loudness might be.

**Table 1 | The scaling factors and the difference in dB between the modal stimulus and the stimuli selected as best loudness matches for the voice-quality-varying stimuli.**

#### **METHODS**

#### *Stimuli for Experiment 1*

There were two series of stimuli. The first six, the voice-quality-varying stimuli, included modal voice, whispery voice, breathy voice, lax-creaky voice, harsh voice, and tense voice. They have been briefly described in Section "Voice-Quality-Varying Stimuli Used in Earlier Studies" above and a detailed description is given in Gobl and Ní Chasaide (2003). As explained above, the second series consisted of five stimuli, each of which had modal voice quality, but whose loudness levels were matched to those of the voice-quality-varying stimuli. These were generated by simply scaling the sample data of the original modal voice stimulus with each of the five scaling factors shown in **Table 1** (see Preliminary Experiment: Loudness Matching). Overall, this yielded 11 stimuli: the original modal voice stimulus and five pairs of loudness-matched stimuli. Each pair consists of a specific voice quality and a loudness-matched modal version (e.g., a breathy voice stimulus and a modal stimulus with the loudness level of breathy voice).

#### *Listening test*

The 11 stimuli (breathy voice, modal voice with loudness matching that of breathy voice, whispery voice, modal voice with loudness matching that of whispery voice, lax-creaky voice, modal voice with loudness matching that of lax-creaky voice, tense voice, modal voice with loudness matching that of tense voice, harsh voice, modal voice with loudness matching that of harsh voice, and modal voice) were presented to 16 female participants, all native speakers of Irish-English.

The perception test was conducted as a series of six subtests following the procedure in Gobl and Ní Chasaide (2003). In any one subtest, the 11 stimuli were presented to the participants in random order 10 times. The participants were asked to judge the stimuli on a bipolar scale, defined with contrastive adjectives (e.g., *intimate–formal*) at each end. For each stimulus, participants indicated whether the speaker sounded more *intimate* or *formal*, and marked their responses on the answer sheet. The ratings were interpreted as a seven-point scale ranging from −3 to +3, where 0 corresponded to "no affect perceived," and ±1, 2, or 3 to mild, moderate, and strong presence of an affect (either *intimate* or *formal*) respectively. This kind of semantic differential scale is commonly used in the study of attitude (Heise, 1970; Osgood et al., 1975; Russell and Carroll, 1999; Streiner and Norman, 2008) and allows one to measure directionality of reaction (e.g., sad vs. happy) as well as intensity (slight to extreme). The scale is usually interpreted as a seven-point scale where the neutral attitude (or in our case, "no affective coloring") is assigned the value of zero (Heise, 1970, p. 235). The same use of scale for measuring attitude in intonation contours is found, for example, in Uldall (1964). Following the description in Gamst et al. (2008, p. 10), this scale is a summative response scale, and the data obtained with it can be analyzed statistically using a general linear model (e.g., ANOVA).

The affective labels defining the opposite ends of each of the six scales have been chosen to cover a fairly broad range of emotions and milder affective states such as attitudes and interpersonal stances. The pairs of affective attributes are among those most frequently found in the literature and in the lists of emotion-related words (Juslin and Laukka, 2003; Scherer, 2005; Douglas-Cowie et al., 2006; Baron-Cohen, 2007). The specific pairs used include *apologetic–indignant, bored–interested, intimate–formal, relaxed–stressed, sad–happy,* and *scared–fearless.* As these were largely the same as those used in our previous experiments (e.g., Gobl and Ní Chasaide, 2003; Yanushevskaya et al., 2011), their use here allows comparison with results of these earlier studies. Note that the pairs of affects differed in terms of high vs. low activation (e.g., *apologetic, bored, relaxed* have low activation in comparison to *indignant, interested, stressed*); the low activation affect was placed on the negative end of the rating scale in each case.

#### *Statistical analysis*

The material in Experiment 1 comprises, along with the original modal voice stimulus, five pairs of loudness-matched stimuli, whose loudness was equalized but which differed in terms of voice quality.

To compare the effect of loudness and voice quality on the strength of affective ratings, a 2 × 5 factorial design was used in the statistical analysis, with 2 within-subjects factors: *Voice* (2 levels: non-modal voice quality and modal voice quality) and *Loudness* (5 levels: loudness of whispery, breathy, lax-creaky, harsh, and tense voice). The dependent variable was the rating score for each stimulus averaged across 10 randomizations for each participant. The reader should note that the *Loudness* factor strictly subsumes differences in voice quality and so results in this test for *Loudness* cannot be taken as an independent contribution of loudness. The independent contribution of loudness is tested in a separate analysis (one-way ANOVA).
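As an illustration of how the dependent variable is formed, the sketch below averages the 10 repeated ratings of each stimulus for each participant and files the result under the corresponding cell of the 2 × 5 design. The data layout, participant label, and ratings are hypothetical and serve only to make the aggregation step concrete.

```python
from statistics import mean

VOICE_LEVELS = ("non-modal", "modal")
LOUDNESS_LEVELS = ("whispery", "breathy", "lax-creaky", "harsh", "tense")


def cell_means(ratings):
    """Average repeated ratings per (participant, voice, loudness) cell.

    ratings maps (participant, voice, loudness) to a list of ratings on the
    -3 ... +3 scale, one per randomized presentation.
    """
    averaged = {}
    for (participant, voice, loudness), reps in ratings.items():
        if voice not in VOICE_LEVELS or loudness not in LOUDNESS_LEVELS:
            raise ValueError(f"unknown design cell: {voice}, {loudness}")
        if any(r < -3 or r > 3 for r in reps):
            raise ValueError("rating outside the -3 ... +3 scale")
        averaged[(participant, voice, loudness)] = mean(reps)
    return averaged


# Hypothetical ratings from one participant on one subtest (10 presentations each).
hypothetical_ratings = {
    ("P01", "non-modal", "tense"): [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
    ("P01", "modal", "tense"): [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}
scores = cell_means(hypothetical_ratings)
print(scores)
```

Each participant thus contributes one score per cell, which is the form of input a repeated measures ANOVA expects.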

Initial inspection of results (**Figure 1**) revealed that whispery voice, breathy voice, lax-creaky voice, and their counterparts from the loudness-matched modal stimuli set are consistently rated toward the low activation end of the scale in the different tests. On the other hand, harsh voice and tense voice and their loudness-matched modal counterparts are consistently associated with high activation affective labels. Therefore, a two-way repeated measures ANOVA test was conducted in two parts, separately for the "lax" voices (whispery, breathy, lax-creaky, and their loudness-matched modal counterparts) and for the "tense" voices (harsh, tense, and their loudness-matched modal counterparts).

The two-way repeated measures ANOVA was done for each of the six affective subtests separately. The alpha level was set to 0.05.

The Mauchly test indicated that the data did not meet the assumption of sphericity, and therefore a Greenhouse-Geisser correction was applied to the degrees of freedom in the analysis. Bonferroni corrections were further applied to account for multiple comparisons in the *post hoc* tests. The results for the two-way ANOVA are shown in **Table 2**. As the factorial design did not include modal voice, a series of simple contrasts was conducted in a separate test in which modal voice was compared to the other 10 stimuli and the results for these tests are found in **Table 3**.
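The two corrections mentioned above can be illustrated with a minimal sketch. It is not the authors' analysis script: the function names are hypothetical, the epsilon value is an assumed number chosen only for illustration, and the listener count of 16 for Experiment 1 is an assumption based on the description of the participants in Experiment 2.

```python
def greenhouse_geisser_df(levels, participants, epsilon):
    """Scale one-way repeated-measures ANOVA degrees of freedom by epsilon."""
    df_effect = levels - 1
    df_error = (levels - 1) * (participants - 1)
    return epsilon * df_effect, epsilon * df_error


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumed example: 6 stimuli, 16 listeners, epsilon = 0.445 (illustrative).
df1, df2 = greenhouse_geisser_df(levels=6, participants=16, epsilon=0.445)
print(round(df1, 1), round(df2, 1))  # prints: 2.2 33.4

# Hypothetical uncorrected p-values from three post hoc comparisons.
adjusted = bonferroni([0.004, 0.03, 0.2])
print(adjusted)
```

With an epsilon below 1 the corrected degrees of freedom become fractional, which is why the one-way ANOVA results in this section are reported with non-integer values.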

To assess an independent contribution of loudness to affective ratings, it was necessary to look more closely at the modal series whose loudness was varied to match that of the voice quality stimuli. In a separate one-way repeated measures ANOVA, ratings for these stimuli and the original modal voice were tested, with Loudness as an independent factor.

#### **Table 2 | Results of the two-way repeated measures ANOVA in Experiment 1 for the six subtests.**

*The abbreviations in the left column are as follows: A-I, apologetic-indignant; B-I, bored-interested; I-F, intimate–formal; R-S, relaxed-stressed; S-H, sad-happy; S-F, scared-fearless. Part 1: breathy, whispery, lax-creaky voice; part 2: harsh and tense voice.*

*\*p* < *0.05; \*\*p* < *0.01; \*\*\*p* < *0.001.*

To establish whether the listeners rated voice qualities in a coherent fashion, listeners' agreement/consistency in ratings for each subtest was measured using single measures and average measures intraclass correlation coefficients (ICCs) (Landis and Koch, 1977; Shrout and Fleiss, 1979; Yaffe, 1998) calculated for each subtest. Since the stimuli used here represent only a sample of possible voice qualities, and since the listener judges were randomly selected from a larger population, the two-way random model was used (McGraw and Wong, 1996; Yaffe, 1998). As it is of interest to establish whether we can assume that the judgment of one rater is similar to that of the others, the single measures ICC (*r*) rather than the average measures ICC (*R*) will be mostly considered here as an indicator of raters' consistency. Following Landis and Koch (1977), ICC of 0.40–0.59 will be interpreted here as moderate interrater agreement, ICC of 0.60–0.79 will be interpreted as substantial interrater agreement, and ICC of 0.80 and above – as outstanding interrater agreement.
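As an illustration of the agreement measures described above, the following sketch computes single measures and average measures ICCs for a two-way random model from a stimuli × raters matrix. It assumes the absolute agreement form, ICC(2,1) and ICC(2,k) in the notation of Shrout and Fleiss (1979); the text does not state which form was used, and the function names and the toy ratings are hypothetical rather than data from this study.

```python
def icc_two_way_random(ratings):
    """ratings[i][j] is the score given to stimulus i by rater j.

    Returns (single measures ICC, average measures ICC) for a two-way
    random model with absolute agreement (an assumption of this sketch).
    """
    n = len(ratings)       # number of rated stimuli
    k = len(ratings[0])    # number of raters
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-stimuli mean square
    jms = ss_cols / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square

    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average


def interpret(icc):
    """Agreement labels using the cut-offs quoted in the text."""
    if icc >= 0.80:
        return "outstanding"
    if icc >= 0.60:
        return "substantial"
    if icc >= 0.40:
        return "moderate"
    return "below moderate"  # label for values under 0.40 is an assumption


# Hypothetical toy data: 6 stimuli (rows) rated by 4 listeners (columns).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
r, R = icc_two_way_random(toy)
print(round(r, 2), interpret(r))
print(round(R, 2), interpret(R))
```

The average measures value is always at least as high as the single measures value for the same data, which is why the single measures ICC is the more conservative indicator of whether one rater's judgment resembles that of the others.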

#### **RESULTS AND DISCUSSION**

The results of the two-way ANOVA are given in **Table 2**. Pairwise comparisons are summarized in **Table 3**. The data on the raters' agreement were obtained as ICCs and are given in **Table 4**.

The effects of voice and ∗loudness∗ and the interactions of these factors were found to be different for "lax" (Part 1 of the two-way ANOVAs) and "tense" (Part 2 of the ANOVAs) voices (**Table 2**).

"Lax voices" (Part 1 of the two-way ANOVAs): in five subtests out of six (*apologetic-indignant, bored-interested, intimate–formal, relaxed-stressed, sad-happy*) a highly significant effect of voice was found. This suggests a substantial contribution of the voice quality factor to the difference in the strength of affective ratings between the two series of stimuli. With the exception of *intimate–formal*, the effect of ∗loudness∗ in these tests was also significant, although (as shown by partial eta squared values) the magnitude of the effect of ∗loudness∗ was substantially smaller. A significant voice and ∗loudness∗ interaction was found in four of these tests (the exception being *apologetic-indignant*, where there was no significant interaction effect of voice and ∗loudness∗).


**Table 3 | Pairwise comparisons in Experiment 1 (using Bonferroni adjustment for multiple comparisons; the mean difference is significant at the 0.05 level).**

*Mod(L), loudness-matched modal stimuli; VQ, voice-quality-varying stimuli.*

*\*p* < *0.05; \*\*p* < *0.01; \*\*\*p* < *0.001.*

**Table 4 | Intraclass correlation coefficients (***r, R***) in the six subtests in Experiment 1 and their interpretation as the raters' agreement following Landis and Koch (1977).**


"Tense voices" (Part 2 of the two-way ANOVAs): in *apologetic-indignant* and *relaxed-stressed*, only the effect of voice was found significant, but there was no effect of ∗loudness∗, nor any interaction effect of voice and ∗loudness∗. The effect of ∗loudness∗ was significant only in the *bored-interested* and *intimate–formal* subtests. In *sad-happy* and *scared-fearless*, no effects were found statistically significant.

As mentioned above, to assess the independent contribution of loudness to the affective ratings, a separate one-way repeated measures ANOVA was conducted on the selected six stimuli incorporating modal voice (the five loudness-matched modal stimuli and modal voice), with Loudness as an independent factor. The results of this one-way ANOVA showed a significant effect of loudness in all the six subtests: *apologetic-indignant* (*F*(2.2, 33.4) = 44.3, *p* < 0.01; $\eta_p^2$ = 0.75); *bored-interested* (*F*(3.2, 47.5) = 77.9, *p* < 0.01; $\eta_p^2$ = 0.84); *intimate–formal* (*F*(2.9, 43.3) = 119.5, *p* < 0.01; $\eta_p^2$ = 0.89); *relaxed-stressed* (*F*(1.6, 25) = 64.1, *p* < 0.01; $\eta_p^2$ = 0.81); *sad-happy* (*F*(2.2, 33.1) = 41.1, *p* < 0.01; $\eta_p^2$ = 0.73); *scared-fearless* (*F*(1.2, 18.6) = 7.1, *p* = 0.01; $\eta_p^2$ = 0.32).
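For readers who wish to relate the effect sizes to the test statistics, partial eta squared can be recovered from an *F* value and its degrees of freedom as $\eta_p^2 = F \cdot df_1 / (F \cdot df_1 + df_2)$. The sketch below applies this standard identity to the values reported above; the function name is hypothetical, and small discrepancies are expected because the published figures are rounded.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, df1, df2, reported partial eta squared) as given in the text.
reported = {
    "apologetic-indignant": (44.3, 2.2, 33.4, 0.75),
    "bored-interested": (77.9, 3.2, 47.5, 0.84),
    "intimate-formal": (119.5, 2.9, 43.3, 0.89),
    "relaxed-stressed": (64.1, 1.6, 25.0, 0.81),
    "sad-happy": (41.1, 2.2, 33.1, 0.73),
    "scared-fearless": (7.1, 1.2, 18.6, 0.32),
}

recomputed = {
    subtest: partial_eta_squared(f, df1, df2)
    for subtest, (f, df1, df2, _) in reported.items()
}
for subtest, value in recomputed.items():
    print(subtest, round(value, 2), "reported:", reported[subtest][3])
```

The recomputed values agree with the reported ones to within about 0.01, which is consistent with rounding of the degrees of freedom and *F* values.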

The data on raters' agreement summarized in **Table 4** suggest that overall the raters were highly consistent in voice-to-affect associations, with the exception of the *fearless-scared* subtest, where the raters' agreement was poor.

The results of Experiment 1 are further presented in **Figure 1**, which shows mean ratings for all the stimuli tested along with the 95% confidence intervals. This figure shows for each of the subtests what affective ratings were yielded by each of the stimuli sets. Ratings for the voice-quality-varying stimuli are shown in red, ratings for the loudness-matched modal stimuli are shown in black. As a reference, the ratings obtained for the modal voice stimulus are also shown (white data points joined by a fine black line). The rating scales follow those used in the tests, with the higher activation affects shown on the positive side of the scale. As with our earlier experiments, the discussion of results will primarily focus on ratings above ±1 where one can be reasonably confident of a distinct affective contribution. This threshold is admittedly arbitrary and is not intended to claim that the ratings below this threshold are necessarily of no importance (and indeed statistically significant differences can be found between ratings that are quite low). Rather, by examining and discussing the ratings above one we are more likely to focus on more robust and consistent voice-to-affect associations. In **Figure 1**, this area of weak affective attribution is shaded in gray.

In **Figure 1**, by comparing the ratings for the loudness-matched modal stimuli to ratings for the original modal voice stimulus we get an indication of the potency of loudness variation alone in conjuring affect. It is clear that by varying the loudness one can alter the affective rating, and as the results of the one-way ANOVA (above) and of the pairwise comparisons in **Table 3** [Modal vs. Mod(L)] show, in the majority of cases the difference between the modal voice and loudness-matched modal stimuli is statistically significant. However, it is worth noting that most ratings of the loudness-matched modal stimuli remain within the −1 to +1 range (=weak affective signaling). The only affect clearly signaled by the loudness increase appears to be *formal* (when modal has the loudness of tense or harsh voice), while the loudest stimulus (matching the loudness of the tense voice quality) is associated also with a degree of *stressed* and *interested* affective coloring. It is quite striking also that bringing the loudness level of modal voice to that of "quieter" voices (whispery and breathy) shifts the affective ratings from the high activation end of the rating scale (e.g., *formal*) toward the low activation end (e.g., *intimate*).

**Figure 1** allows us to compare the ratings obtained for the voice-quality-varying stimuli (red) to ratings for the loudness-matched modal stimuli (black) and to get some sense of the extent to which the loudness factor contributes. At a glance, one sees that the affective ratings of the voice-quality-varying stimuli are higher overall. With the exception of *happy, scared,* and *fearless,* each affect appears to be well signaled by one or other voice quality, while (as mentioned above) the loudness-matched modal stimuli tend in general to have relatively weak affective signaling.

Not surprisingly, the quieter stimuli (with or without voice quality variation) are associated with low activation states, while the louder stimuli tend to signal high activation states. It is clear that for the quieter stimuli (lax voice qualities and their loudness-matched modal counterparts) the difference between the two stimulus sets is virtually always significant [see **Table 3**, VQ vs. Mod(L)]. For many affects (*bored, relaxed, sad, intimate*) the lax-creaky voice quality achieves very high ratings, while the ratings for the loudness-matched modal counterpart are at, or close to, zero, and the difference is statistically significant (**Table 3**). Note that high ratings for these affects were also reported with lax-creaky voice in a number of our earlier studies (Gobl et al., 2002; Gobl and Ní Chasaide, 2003; Ní Chasaide and Gobl, 2005). For *apologetic* the breathy and whispery voice qualities achieved high ratings, and these qualities were also highly rated for *intimate* and for *sad*. The ratings for breathy and whispery voice for these affects were significantly higher than for their loudness-matched modal counterparts (**Table 3**). The fact that different (though related) voice qualities can signal a particular affect has also been noted before.

At the positive (high activation) end of the scale, the difference between the voice quality stimuli and the loudness-matched stimuli is statistically significant for *indignant, stressed* and is not statistically significant for *formal, interested, happy* (**Table 3**). In the case of *formal*, it is clear that it is sufficiently cued by loudness alone and that the addition of harsh and tense voice qualities adds nothing (the ratings for harsh and tense voice and their loudness-matched modal counterparts in **Figure 1** are identical). In the case of *interested* and *happy* the affective ratings are relatively weak for both sets of stimuli. Note that the *fearless-scared* test yielded little affective signaling from either of these stimulus sets and the same is true for the *happy* affect in the *sad-happy* test, which was a trend in our earlier results as well.

Returning to the hypotheses stated earlier, it is clear that Hypotheses A and B are not supported. Loudness variation does not account for the affective ratings yielded by voice-quality-varying stimuli (Hypothesis A), but on the other hand, it is not irrelevant (Hypothesis B). Clearly, Hypothesis C receives the most support: loudness variation does contribute to affective signaling, and the contribution is nearly always significant (**Table 2**). However, the magnitude of the effect, as indicated by the affective ratings, is generally considerably less than what is achieved when voice quality is also varied (remaining mostly in the weak affect signaling region). Increases to loudness are important in the signaling of some high activation states, particularly *formal.* Decreasing loudness of the modal voice to the level of whispery voice or breathy voice shifts the affective ratings to the low activation end of the scale. For effective cueing of these low activation states, however, the voice quality component appears to be crucial. This is particularly striking in the case of lax-creaky voice.

#### **EXPERIMENT 2: TESTING AFFECT CORRELATES OF LOUDNESS-EQUALIZED VOICE QUALITY STIMULI**

Experiment 1 demonstrated that stimuli in which the loudness level of modal voice was manipulated to match the loudness level of voice-quality-varying stimuli are rather ineffective in cueing affect (low activation states in particular) compared to the stimuli incorporating voice quality variation. However, this does not necessarily mean that the intrinsic loudness differences which tend to be correlated with particular voice qualities are irrelevant. Given that in human speech production there is a natural tendency toward covariation of voice quality and loudness, it could be the case that loudness differences do play an important role – but only when these loudness variations occur with the appropriate voice quality. Experiment 2 reported below tests the hypothesis that the intrinsic loudness of the voice-quality-varying stimuli is *not* the main determinant of their affective signaling effect. This is assessed by effectively equalizing the loudness of these stimuli and testing their affect cueing ability in a series of perception tests.

In the second experiment, the loudness of the original voice-quality-varying stimuli was equalized to that of the original modal stimulus (Series M): thus they retained the differences in voice quality but without inherent loudness variation. From Series M two further series were generated: one with increased intensity (Series L) and one with decreased intensity (Series Q). If loudness is the main determinant of affect cueing, within each of the series the difference in the voice qualities of the stimuli should have little impact on affective ratings, but one should see differences between the three series. Our basic hypothesis is that the loudness variation is not, *per se,* a major determinant of cued affect. Consequently, one would predict that results across the three series of stimuli Q → M → L would be very similar, showing only a slight enhancement of affective rating due to loudness differences. Concomitantly, we hypothesize that removing the loudness variation within any one series (e.g., M) will not have a major detrimental effect on the spread of affective ratings obtained.

#### **METHODS**

#### *Stimuli for Experiment 2*

The construction of the new set of stimuli for the perception tests of Experiment 2 consisted of two steps. The first step involved an increase or a decrease of the loudness of all the original non-modal voice quality stimuli (whispery voice, breathy voice, lax-creaky voice, and tense voice) to match them to the loudness of the modal voice stimulus (Series M). Note that as the results for harsh voice in Experiment 1 were very similar to those of tense voice, harsh voice quality was not included in Experiment 2. In the second step, for each of the new loudness-normalized voice quality stimuli a "louder" version (Series L) and a "quieter" version (Series Q) were generated. The difference between the "loudness" versions was set to ±2 dB with the purpose of capturing moderate, but plausible, loudness variation in each of the voice qualities. Thus, for example, there was a stimulus with whispery voice quality whose loudness was equalized to that of the modal voice, and its "louder" (+2 dB) and "quieter" (−2 dB) versions.

*Step 1: setting perceived loudness of non-modal voice qualities to that of the modal voice.* To generate the new voice quality stimuli with the loudness matching that of the modal stimulus (Series M), the waveform samples of the original voice quality signals were multiplied by the reciprocal of the corresponding scaling factors used in Experiment 1, as follows:


*Step 2: generating "louder" and "quieter" versions.* From the Series M stimuli, two more stimulus series were generated, Series L ("louder" versions) in which the intensity level of all stimuli was increased by 2 dB, and Series Q ("quieter" versions) in which the intensity level of all stimuli was reduced by 2 dB. Since the aim was to compare the perception of stimuli differing in voice quality characteristics but having the same loudness as well as to test the effect of any perceivable difference in loudness, a 2 dB difference between the intensity level of the stimuli in the three groups was considered sufficient. Note that neither the "louder" (+2 dB) nor the "quieter" (−2 dB) versions of the new stimuli would match the loudness of any of the original voice quality stimuli. For example, the intensity level of the new "louder" tense voice stimulus was 0.6 dB lower than that of the original tense voice quality. There were 15 stimuli in total: 3 series (Q, M, and L) × 5 stimuli (whispery voice, breathy voice, lax-creaky voice, modal voice, tense voice).
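The two steps described above amount to simple amplitude scaling of the waveform samples. The sketch below is only illustrative: the function names, the toy waveform, and the Experiment 1 scaling factor of 0.5 are hypothetical, and it assumes that a change of *x* dB in intensity level corresponds to multiplying the sample amplitudes by $10^{x/20}$.

```python
def db_to_gain(level_change_db):
    """Amplitude factor that changes the intensity level by the given dB."""
    return 10 ** (level_change_db / 20)


def scale(samples, gain):
    """Multiply every waveform sample by the same gain."""
    return [s * gain for s in samples]


def make_series(original, exp1_scaling_factor, step_db=2.0):
    """Return the Q, M and L versions of one voice quality stimulus.

    Step 1: undo the loudness difference by applying the reciprocal of the
    scaling factor used for the matching modal stimulus in Experiment 1.
    Step 2: derive the quieter and louder versions from Series M.
    """
    series_m = scale(original, 1.0 / exp1_scaling_factor)
    return {
        "Q": scale(series_m, db_to_gain(-step_db)),
        "M": series_m,
        "L": scale(series_m, db_to_gain(+step_db)),
    }


# Hypothetical four-sample waveform and scaling factor, for illustration only.
waveform = [0.0, 0.25, -0.5, 0.1]
series = make_series(waveform, exp1_scaling_factor=0.5)
print(series["M"])
print(round(db_to_gain(2.0), 4), round(db_to_gain(-2.0), 4))
```

Under this assumption, a ±2 dB step corresponds to amplitude factors of roughly 1.26 and 0.79, so Series L and Series Q differ from each other by 4 dB.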

#### *Listening tests*

The 15 stimuli were randomized 10 times and presented to the participants in a series of six subtests following the same procedure as in Experiment 1 and using the same pairs of affective attributes: *apologetic-indignant, bored-interested, intimate–formal, relaxed-stressed, sad-happy,* and *scared-fearless.* The participants in the experiment were also 16 female native speakers of Irish-English. The stimuli were presented to the participants over a high quality speaker in a quiet room.

#### *Statistical analysis*

In the statistical analysis, a 5 × 3 factorial design was used. The two within-subject factors were "Voice" (five levels: whispery, breathy, lax-creaky, modal, tense) and "Loudness" [three levels: "Q" ("quieter" version, −2 dB), "M" (loudness of the original modal voice) and "L" ("louder" version, +2 dB)]. The dependent variable was the mean rating score for each stimulus averaged across 10 randomizations for each subject. The two-way repeated measures ANOVA was conducted for each of the six subtests separately. The alpha level was set to 0.05. As the data did not meet the sphericity assumptions as indicated by the Mauchly test in most cases, a Greenhouse–Geisser correction was applied to the degrees of freedom in the analysis. As in Experiment 1 above, the raters' agreement was measured using ICC (*r*, *R*).
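To make the structure of the dependent variable concrete, the following sketch builds the 15 cells of the 5 × 3 design and averages the repeated presentations of each stimulus for each participant. The participant label, the rating values, and the function name are hypothetical; only the factor levels and the averaging over 10 randomizations come from the description above.

```python
from itertools import product
from statistics import mean

VOICES = ["whispery", "breathy", "lax-creaky", "modal", "tense"]
SERIES = ["Q", "M", "L"]
CELLS = list(product(VOICES, SERIES))  # the 15 stimuli of the 5 x 3 design


def cell_means(trials):
    """Average repeated ratings per participant and stimulus.

    trials: iterable of (participant, voice, series, rating) tuples,
    one tuple per presentation of a stimulus.
    """
    grouped = {}
    for participant, voice, series, rating in trials:
        grouped.setdefault((participant, voice, series), []).append(rating)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical ratings from one listener for two stimuli, 10 presentations each.
trials = [("P01", "tense", "L", r) for r in [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]]
trials += [("P01", "whispery", "Q", r) for r in [-2, -1, -2, -1, -2, -1, -2, -1, -2, -1]]
means = cell_means(trials)
print(len(CELLS), means[("P01", "tense", "L")], means[("P01", "whispery", "Q")])
```

Each participant thus contributes one mean score per cell, and these per-cell means are the values that would enter the two-way repeated measures ANOVA for a given subtest.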

#### **RESULTS AND DISCUSSION**

The results of the two-way ANOVA are shown in **Table 5**. Pairwise comparisons of the affective ratings of the stimuli are given in **Tables 6A,B**. The data on the raters' agreement are presented in **Table 7**.



*The abbreviations in the left column are as follows: A-I, apologetic-indignant; B-I, bored-interested; I-F, intimate–formal; R-S, relaxed-stressed; S-H, sad-happy; S-F, scared-fearless.*

*\*p* < *0.05; \*\*p* < *0.01; \*\*\*p* < *0.001.*

As seen in **Table 5**, in all subtests, with the exception of *scared-fearless,* highly significant effects of voice and loudness were found as well as significant (though much weaker) interaction effects. The raters' agreement was moderate to substantial in all subtests, again with the exception of the *scared-fearless* subtest, which yielded poor interrater agreement. The main interaction effects are further shown in **Figure 2** which compares the affective ratings obtained for the loudness-equalized voice quality stimuli in the Q, M, and L Series.

#### *Series M: how are affective ratings affected by removal of the loudness component?*

The ratings of Series M plotted in **Figure 2** (black) show the effect of loudness normalization (increase or decrease of the intrinsic loudness level in the non-modal voice qualities to that of the modal voice) on voice-to-affect association. In other words, the ratings of the stimuli from Series M in **Figure 2** allow us to ascertain what the voice quality can achieve when the intrinsic loudness differences are neutralized. (Note that the intensity level of tense voice was lowered by 3 dB, the intensity level of lax-creaky voice was increased by 2.8 dB, and the intensity level of whispery voice was increased by about 7 dB.)

It is clear that even with loudness normalization, non-modal voice qualities are still effective in affect cueing. Looking at the M results in **Figure 2**, note that each non-modal voice quality is associated with clear affective signaling. Ratings were substantially above the ±1 threshold for at least one affect for all the non-modal voice qualities, with the exception of breathy voice. As in our earlier studies, the tense voice quality was the quality clearly linked to high activation states such as *indignant, interested, formal, stressed, happy*. The lax-creaky voice emerged as the most potent quality for signaling low activation states (*bored, intimate, relaxed, sad*) though *apologetic* was more highly rated for whispery voice, which also yielded high ratings for *intimate.* The ratings of modal voice are particularly conspicuous compared to the non-modal voice qualities: in no case did it achieve high ratings for any of the affects tested, although, as in our earlier experiments, it was rated somewhat in the positive direction (high activation). Given that the loudness differences between modal and tense voice are neutralized, it is clear that a tenser voice quality is favored by the raters as cueing high activation affects.

For the test *fearless-scared,* none of the stimuli from the M Series yielded high ratings. Indeed, this test yielded little result for any of the three series. It is worth noting that the raters' agreement here was rather poor (see **Table 7**) and the range of ratings for all stimuli in this subtest is broad, bringing the average values close to zero. This largely reflects the listeners' uncertainty in voice-to-affect association in this subtest. Note that this lack of effect is consistent with our earlier studies and with the results of Experiment 1. Therefore, this particular test will not be further discussed here.

While it is clear that the different non-modal voice qualities retain affect cueing potential even when the loudness feature is removed/normalized, it is nonetheless the case that the overall strength of the ratings appears to be reduced, compared to results obtained for the same stimuli where loudness differences are retained. We do not have a direct comparison of the original voice quality stimuli and the present Series M which have the intrinsic loudness differences removed, but if we broadly compare results in **Figure 2** with those for the non-modal qualities in **Figure 1** we note that there is a reduced range of affective ratings where the loudness-normalized stimuli are concerned. It could be concluded that although non-modal voice qualities are still potent in affect signaling, changing their intrinsic loudness level to that of modal voice does influence their potential in communicating affect, as the ratings are somewhat lower.

For both Series Q (white data points in **Figure 2**) and L (red), we note rather similar observations in terms of affective ratings: there is a clear linkage of particular voice qualities to affect and some reduction in the overall strength of affective ratings compared to when the loudness variable is retained, as in **Figure 1**.

#### *Comparing series Q, M, and L: what is the contribution of loudness variation?*

The changes in loudness moving from the Q → M → L Series appear to be correlated with a slight shifting in the Low activation → High activation direction (see **Figure 2**; **Table 6A**). The effects are not fully symmetrical. A shift from M to Q may sometimes slightly enhance the ratings for those stimuli (whispery, breathy, and lax-creaky voice) which signal the low activation states, although, as evident from **Table 6A**, the differences are rarely statistically significant. As a corollary, a shift from M to L for these stimuli undermines their affective signaling role.

For those stimuli (tense, modal) which tend to signal high activation, a shift from M to L tends to have a somewhat larger effect in enhancing the ratings, and this enhancement is statistically significant in most cases (**Table 6A**). The shift from M to Q similarly somewhat reduces the affective ratings, though this effect is generally not significant.

**Table 6 | Pairwise comparisons in Experiment 2 (using Bonferroni adjustment for multiple comparisons; the mean difference is significant at the 0.05 level): (A) compared are the ratings for each voice quality, for the three loudness series (Q, M, and L) and (B) compared are the ratings for different voice quality within each series.**


*The abbreviations in the first row are as follows: A-I, apologetic-indignant; B-I, bored-interested; I-F, intimate–formal; R-S, relaxed-stressed; S-H, sad-happy; S-F, scared-fearless.*

*\*p* < *0.05; \*\*p* < *0.01; \*\*\*p* < *0.001.*

**Table 7 | Intraclass correlation coefficients (***r, R***) in the six subtests in Experiment 2 and their interpretation as the raters' agreement following Landis and Koch (1977).**


An increase or decrease of loudness appears therefore to only result in the increase of affective ratings for certain voice qualities and certain affects. On the other hand, when the loudness is not set to extreme values but to that of modal voice, voice quality alone proves to be sufficient for successful affect cueing. The effect of the loudness variation among the three series is limited compared to the large differences in ratings yielded by differences in voice quality. It is also striking that the change in ratings for the shift M → L is greater than that yielded by the change M → Q.

Looking more closely at the data represented in **Figure 2**, note that not all voice quality stimuli are equally affected by shifts in intensity levels. In particular, changing the loudness of lax-creaky voice does not appear to have much impact on its affective rating, and the difference between the lax-creaky voice stimuli from different loudness series virtually never reaches statistical significance (**Table 6A**). This is interesting given that lax-creaky voice is the most potent signaler of low activation affects.

The effect of loudness manipulation is particularly striking for modal voice. Simply increasing its intensity level by 2 dB results in quite a dramatic statistically significant increase in affective ratings for *indignant, interested, formal, stressed,* and *happy,* even though in no case does the Series L modal stimulus achieve ratings higher than those of the Series L tense stimulus.

In summary, our hypothesis is supported: the loudness variation is not, *per se*, *the* major determinant of cued affect. Non-modal voice qualities in which the loudness differences have been equalized are still potent in signaling affect. Compared to the cueing power of changes to the voice quality, differences in loudness appear to make a relatively more modest contribution to the cueing of affect. Nonetheless, loudness differences are important, particularly in the cueing of the high activation affects, and these differences can be highly significant (see **Tables 5** and **6A**). Furthermore, there are indications in these results that the loudness contribution is not the same for each voice quality (**Table 6A**). Similar to the findings in Experiment 1, increasing loudness of breathy and whispery voice significantly weakens the potency of these voice qualities to signal low activation states or even (as in the case of *bored-interested*) shifts the ratings toward the high activation end of the scale. Similarly, lowering the loudness level of modal or tense voice results in significant lowering in affective rating for high activation states that these voice qualities can achieve.

We can conclude that, while voice quality variations remain potent in affect cueing even where loudness cues have been eliminated, nonetheless, loudness can play an important role, particularly with tense or modal voice in the signaling of high activation states. Furthermore, when the loudness is closer to the intrinsic loudness of a particular voice quality, it is, perhaps unsurprisingly, at its most effective.

#### **CONCLUSION**

The study focused on the role of loudness and voice quality in affect cueing. Two experiments were conducted with synthesized stimuli in which loudness was systematically manipulated. In Experiment 1, stimuli incorporating voice quality features including intrinsic loudness variations were compared to stimuli where voice quality was kept constant (modal) but in which loudness was systematically modified. In Experiment 2, three series of stimuli were compared differing in loudness levels. Within each series there were distinct voice qualities represented, but all had equal loudness.

In Experiment 1, stimuli incorporating voice quality variations consistently obtained higher ratings than the loudness-matched modal stimuli. The results of Experiment 2 suggest that even with loudness differences equalized, non-modal voice quality stimuli are potent in affect cueing. Even if loudness *per se* is not *the* major determinant of affect, it nonetheless plays a significant role: when combined with tense or modal voice quality, it can enhance the signaling of high activation states, such as *formal, indignant, interested, stressed, happy.* Furthermore, increasing the loudness of intrinsically "quiet" voice qualities (breathy, whispery) or decreasing the loudness of intrinsically "loud" voice qualities (tense) has a detrimental effect on these voice qualities' potency to cue affect.

The situation is different in the case of low activation states, such as *apologetic, bored, intimate, relaxed, sad,* where it would appear that loudness plays a lesser (if still sometimes significant) role. Specific voice qualities are essential in signaling these affects, and lax-creaky voice emerges as a particularly potent quality whose loudness level seems to matter little.

The studies support our initial hypothesis that affective cueing found in our earlier studies was not simply a consequence of the loudness variation in these voice quality stimuli. Rather, loudness appears to play a role in affect cueing in conjunction with the variations in voice quality. The contribution of loudness is not a single symmetrical effect but varies depending on the voice quality and affect in question. There are indications that loudness variation (increase) may be particularly important in some cases, e.g., in the signaling of formality. It also appears to be the case that for some voice qualities, such as lax-creaky voice, the affect cueing does not seem to be influenced by the loudness level. Note, however, that even where the contribution of loudness to the cueing of affect appears to be relatively small, it can still reach statistical significance.

This paper illustrates the complex interplay between voice dimensions in affect cueing. It further highlights the need for a feature such as loudness to be investigated in the context of other complex voice parameters that are involved in the signaling of affect.

It must be pointed out that the selection of voice qualities investigated in these experiments is not comprehensive. Other phonation types such as falsetto might usefully have been included. Furthermore, in constructing the stimuli, extreme qualities were avoided: it would be possible to get a more extreme version of tense voice, etc. These factors must be borne in mind when considering particularly the cases where affect is not clearly signaled. So, for example, in the test *scared-fearless,* we cannot necessarily say that these affects are not signaled by voice quality, but rather that they are not signaled by the particular qualities (and/or ranges of qualities) used in these experiments. A similar point holds for the range of loudness levels used in the present experiments. As with the voice qualities, the differences are not very extreme. Thus the findings reported have to be understood in terms of the voice quality and loudness ranges examined here.

In future work we plan further experiments that directly compare the loudness-normalized (voice quality) series with the original voice quality stimuli, which have intrinsic loudness variation. This would enable us to quantify more precisely the contribution of loudness variation to affect perception.

#### **ACKNOWLEDGMENTS**

This work was partly funded by the FP6 IST HUMAINE Network of Excellence and was further supported as part of the FASTNET project – Focus on Action in Social Talk: Network Enabling Technology funded by Science Foundation Ireland (SFI) 09/IN.1/I2631.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 March 2013; accepted: 22 May 2013; published online: 18 June 2013.*

*Citation: Yanushevskaya I, Gobl C and Ní Chasaide A (2013) Voice quality in affect cueing: does loudness matter? Front. Psychol. 4:335. doi: 10.3389/fpsyg.2013.00335*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Yanushevskaya, Gobl and Ní Chasaide. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Encoding conditions affect recognition of vocally expressed emotions across cultures

#### **Rebecca Jürgens\*‡, Matthis Drolet‡, Ralph Pirow†, Elisabeth Scheiner† and Julia Fischer**

Cognitive Ethology Laboratory, German Primate Center, Göttingen, Germany

#### **Edited by:**

Petri Laukka, Stockholm University, Sweden

#### **Reviewed by:**

Olivier Piguet, Neuroscience Research Australia, Australia

Gary J. McKeown, Queen's University Belfast, UK

#### **\*Correspondence:**

Rebecca Jürgens, Cognitive Ethology Lab, German Primate Center, Kellnerweg 4, 37077 Göttingen, Germany. e-mail: rjuergens@dpz.eu

**†Present address:**

Ralph Pirow, Institut für Zoophysiologie, Universität Münster, Münster, Germany; Elisabeth Scheiner, Ecole d'Humanité, Goldern, Switzerland.

‡Rebecca Jürgens and Matthis Drolet have contributed equally to this work.

Although the expression of emotions in humans is considered to be largely universal, cultural effects contribute to both emotion expression and recognition. To disentangle the interplay between these factors, play-acted and authentic (non-instructed) vocal expressions of emotions were used, on the assumption that cultural effects may contribute differentially to the recognition of staged and spontaneous emotions. Speech tokens depicting four emotions (anger, sadness, joy, fear) were obtained from German radio archives and re-enacted by professional actors, and presented to 120 participants from Germany, Romania, and Indonesia. Participants in all three countries were poor at distinguishing between play-acted and spontaneous emotional utterances (58.73% correct on average with only marginal cultural differences). Nevertheless, authenticity influenced emotion recognition: across cultures, anger was recognized more accurately when play-acted (z = 15.06, p < 0.001) and sadness when authentic (z = 6.63, p < 0.001), replicating previous findings from German populations. German subjects revealed a slight advantage in recognizing emotions, indicating a moderate in-group advantage. There was no difference between Romanian and Indonesian subjects in the overall emotion recognition. Differential cultural effects became particularly apparent in terms of differential biases in emotion attribution. While all participants labeled play-acted expressions as anger more frequently than expected, German participants exhibited a further bias toward choosing anger for spontaneous stimuli. In contrast to the German sample, Romanian and Indonesian participants were biased toward choosing sadness. These results support the view that emotion recognition rests on a complex interaction of human universals and cultural specificities. Whether and in which way the observed biases are linked to cultural differences in self-construal remains an issue for further investigation.

**Keywords: acoustics, culture, emotion, play-acting, recognition, speech, vocalization**

#### **INTRODUCTION**

Emotions are an important part of human social life. They mediate between the internal state and external world and they prepare the organism for subsequent actions and interactions. Although there is an ongoing debate about the definition of emotions (see for example Mason and Capitanio, 2012; Mulligan and Scherer, 2012; Scarantino, 2012), there is a growing consensus among theorists that emotion needs to be viewed as a multi-component phenomenon (Scherer, 1984; Frijda, 1986; Lazarus, 1991). The three major components of emotions are neurophysiological response patterns in the central and autonomic nervous systems; motor expression in face, voice, and gesture; and subjective feelings. Many theorists also include the evaluation or appraisal of the antecedent event and the action tendencies generated by the emotion as additional components of the emotional process (Scherer, 1984; Smith and Ellsworth, 1985; Frijda, 1986; Lazarus, 1991).

Different theoretical frameworks have been put forward as to whether emotions are universal and evolved adaptations (Darwin, 1872) or whether they are socially constructed and vary across cultures (Averill, 1980). The two approaches are, however, not mutually exclusive, and it has recently been argued that the dichotomy between nature and nurture should be abandoned (Prinz, 2004; Juslin, 2012; Mason and Capitanio, 2012). Matsumoto (1989), for example, argued that although emotions are biologically programed, cultural factors have a strong influence on the control of emotional expression and perception.

Scherer and Wallbott (1994) conducted a series of cross-cultural questionnaire studies in 37 countries to investigate the influence of culture on the experience of emotions and found strong evidence for both universality and cultural specificity in emotional experience, including both psychological and physiological responses to emotions. Ekman and colleagues (Ekman et al., 1969; Ekman and Friesen, 1971; Ekman and Oster, 1979) tested the universality of facial expressions and demonstrated that a standardized set of photographs depicting different emotion expressions was correctly judged by members of different, partly preliterate, cultures. At the same time, recognition accuracy was higher for members of the cultural background from which the facial expressions were obtained. Thus, facial expressions are considered to be largely universal (but, see Jack et al., 2012), while cultural differences are observed in the types of situations that elicit emotions (Matsumoto and Hwang, 2011), in small dialectic-like differences (Elfenbein et al., 2007), and in the culture-specific display rules that alter facial expressions (Ekman and Friesen, 1969; Matsumoto et al., 2008).

The human voice is also an important modality in the transmission of emotional information, both through verbal and nonverbal utterances (Banse and Scherer, 1996; Juslin and Laukka, 2003; Hammerschmidt and Jürgens, 2007; Sauter et al., 2010). Expression of emotion in the voice occurs via modifications of voice quality (Gobl and Ni Chasaide, 2003) and prosody in general (Scherer, 1986). Initial research on vocal emotion recognition indicated that the patterns in prosodic recognition were largely universal (Frick, 1985), which paralleled the results from facial expressions (Elfenbein and Ambady, 2002). Ratings of vocalizations by listeners showed that they were able to infer vocally expressed emotions at rates higher than chance (Banse and Scherer, 1996; Juslin and Laukka, 2003). In a classic study, Scherer et al. (2001) compared judgments by Germans and members of eight other cultures on expressions of emotions by German actors. They found that with increasing geographical distance from the speakers the recognition accuracy for emotional expressions decreased. Additionally, recognition accuracy was greater for foreign judges whose own language was closer to the Germanic language family. A meta-analysis on emotion recognition within and across cultures revealed that the in-group advantage found by Scherer et al. (2001) for German judges is a typical finding in cross-cultural emotion recognition studies (Elfenbein and Ambady, 2002). This meta-analysis included studies that used different types of stimuli, from facial and whole-body photographs to voice samples and video clips. Emotions were universally recognized at better-than-chance levels. However, there was also a consistent in-group advantage: accuracy was higher when emotions were both expressed and recognized by members of the same national, ethnic, or regional group. This advantage was smaller for cultural groups with greater exposure to one another, measured in terms of living in the same nation, physical proximity, and telephone communication (Elfenbein and Ambady, 2002).

Cultural variations in emotion recognition can be explained not only by differences in emotion encoding but also by response biases on the part of the recipient due to culture-dependent decoding rules (Matsumoto, 1989; Elfenbein et al., 2002). For example, having found that Japanese participants were less accurate in recognizing anger, fear, disgust, and sadness, Matsumoto (1992) suggested that a bias against negative emotions in collectivistic societies is an important factor in maintaining group stability (but, see Elfenbein et al., 2002 for divergent results).

Much of the research cited above has been performed on stereotypical and controlled expressions of emotions often produced by actors. Though actors spend many years perfecting the authenticity and clarity of their portrayals of human behavior and emotions (Goldstein and Bloom, 2011), acted emotional expressions may still be more stereotyped and more intense than spontaneous expressions (Wilting et al., 2006; Laukka et al., 2012, but, see Jürgens et al., 2011; Scherer, 2013), and are thought to be more strongly bound by social codes (Hunt, 1941; Matsumoto et al., 2009). In addition, preselected, stereotypical expressions might conceal possible effects of response biases in cross-cultural studies due to their clear and unmistakable expression patterns (Wagner, 1993; Elfenbein et al., 2002).

In a series of previous studies we presented listeners with emotional speech tokens produced without external instruction ("authentic") obtained from radio archives, as well as corresponding tokens re-enacted by professional actors ("play-acted"). We found that (German) listeners were poor at distinguishing between authentic and play-acted emotions. Intriguingly, the recording conditions nevertheless had a significant effect on emotion recognition. Anger was recognized best when play-acted, while sadness was recognized best when authentic (Drolet et al., 2012). Moreover, using an fMRI approach, we found that both explicit recognition of the source of the recording, i.e., whether it was authentic or play-acted (compared to the recognition of emotion), and authentic stimuli (versus play-acted) led to an up-regulation in the theory of mind (ToM) network (medial prefrontal, retrosplenial, and temporoparietal cortices). In addition, acoustic analyses revealed significant differences in the F0 contour, with a higher variability in F0 modulation in play-acted than authentic stimuli (Jürgens et al., 2011).

Based on these findings, we here aim to expand our understanding of the recognition of play-acted and authentic stimuli and biases in emotion recognition. By testing participants from different cultures we intended to gain insights into the influence culture has on our findings. We selected Romanian and Indonesian participants because they differ in terms of the distance to the German sample, with a higher degree of overlap between the Romanian and German cultures than between Indonesian and German. Moreover, Romania and Indonesia have been described as collectivistic societies in contrast to the individualistic German society (Hofstede, 1980, 1996; Trimbitas et al., 2007), which allows a comparison of listeners' culture-dependent response biases on non-instructed, more ambivalent speech tokens (Matsumoto, 1992; Elfenbein et al., 2002). If the observed interaction between emotion recognition and recording condition is based on universal processes in emotion recognition, we would predict a similar pattern across the three cultures. Specifically, more stereotyped displays should be recognized more easily across cultures (Elfenbein et al., 2007). If, in contrast, acting reflects a socially learned code, then the higher recognition of play-acted anger should disappear in the other two cultures (Hunt, 1941; Matsumoto et al., 2009), with a stronger effect in Indonesian than Romanian participants, due to cultural distance. If collectivistic societies foster a response bias against negative emotions, Romanian and Indonesian participants should reveal a bias against judging an emotion as anger, fear, or sadness in contrast to the German participants (Matsumoto, 1992; Elfenbein et al., 2002). This effect should be increased in cases in which the stimulus material is less clear and less stereotypical (Wagner, 1993; Elfenbein et al., 2002).

#### **MATERIALS AND METHODS**

#### **RECORDINGS**

We focused on four emotions that differ in terms of valence, dominance, and intensity: anger, fear, joy, and sadness (de Vignemont and Singer, 2006; Bryant and Barrett, 2008; Ethofer et al., 2009). These are the most commonly used emotions in this field of research (Sobin and Alpert, 1999; Scherer et al., 2001; Juslin and Laukka, 2003) and were accessible in the radio interviews used for stimulus material. Neutral prosody, while interesting for comparative reasons, is rare and hard to control in real-life settings. News anchors, whose voices are characterized by neutral prosody, are one possible source, but their way of speaking is unfortunately more closely related to acting than to natural speech. We compared emotional expressions that were obtained during radio interviews to re-enacted versions of the same stimuli. The authentic speech recordings were selected from the database of a radio station and consisted of German expressions of fear, anger, joy, or sadness. The recordings were made during interviews with individuals talking in an emotional fashion about a highly charged ongoing or recollected event (e.g., parents speaking about the death of their children, people winning in a lottery, being enraged about current or past injustice, or threatened by a current danger). Emotions were ascertained through the content of the text spoken by the individuals, as well as the context. While the possibility of social acting can never be completely excluded, we aimed to minimize this effect by excluding clearly staged settings (e.g., talk-shows). Stimuli were saved in wave format with 44.1 kHz sample rate and 16 bit sampling depth. Only recordings of good quality and low background noise were selected. Prior to the experiment, we asked 64 naïve participants to rate the transcripts for emotional content to ensure that the stimulus material was free of verbal content that could reveal the emotion. Text segments that were assigned to a particular emotion above chance level were shortened or deleted from the stimulus set. Thus, the stimuli that were used in the experiment did not contain any keywords that could allow inference of the expressed emotion. For example, "I have known him for 43 years" (translation; original German: "Ich kenn ihn 43 Jahr") was used as a sad stimulus, and "up to the window crossbar" (German: "bis zum Fensterkreuz") as a fear stimulus. Of the chosen 80 speech tokens, 35 were recorded outdoors and varied in their noise surroundings. The final stimulus set consisted of 20 samples each of joy and sadness, 22 samples of anger, and 18 samples of fear, half of which were recorded from female speakers, resulting in a total of 80 recordings made by 78 different speakers. Segments had a mean length of 1.9 s (SD: 1.2 s). These wave files represent the so-called authentic stimuli. An information sheet was prepared for each authentic stimulus, which indicated the gender of the speaker, the context of the situation described, and a transcription of the spoken text surrounding and including the respective selection of text.

The play-acted stimuli were produced by 21 male and 21 female actors (incl. 31 professional actors, 10 drama students, and 1 professional singer) recruited in Berlin, Hanover, and Göttingen, Germany. Actors were asked to reproduce two to three of the authentic recordings. Using the recording information sheet, the actors were told to express the respective text and emotion in their own way, using only the text, identified context, and emotion (the segment to be used as stimulus was not indicated and the actors never heard the original recording). Each actor could practice as long as needed, could repeat the acted reproduction as often as they required, and the recording selected for experimental use was the repetition each actor denoted as their first choice. To reduce any category effects between authentic and play-acted stimuli, the environment for the play-acted recordings was varied and 30 out of 80 randomly selected re-enactments were recorded outside. Nevertheless, care was taken to avoid excessive background noise. The relevant play-acted recordings (wave format, 44.1 kHz, 16 bit sampling depth) were then edited so they contained the same segment of spoken text as the authentic recordings. The average amplitude of all stimuli was equalized with Avisoft-SASLab Pro Recorder v4.40 (Avisoft Bioacoustics, Berlin, Germany).
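The text does not specify how the average amplitude was equalized in the Avisoft software. As a rough illustration only, the sketch below assumes that "average amplitude" means root-mean-square (RMS) amplitude and rescales every stimulus to a common target level; the function names, the toy sample values, and the target of 0.1 are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def equalize_rms(stimuli, target_rms):
    """Scale each stimulus so that its RMS amplitude equals target_rms.

    Assumes no stimulus is completely silent (an RMS of 0 would divide by zero).
    """
    equalized = []
    for samples in stimuli:
        gain = target_rms / rms(samples)
        equalized.append([s * gain for s in samples])
    return equalized


# Two toy "recordings" with very different levels (hypothetical values).
quiet = [0.01, -0.02, 0.015, -0.01]
loud = [0.4, -0.5, 0.45, -0.35]

matched = equalize_rms([quiet, loud], target_rms=0.1)
print([round(rms(m), 6) for m in matched])
```

With real 16 bit audio, the scaled samples would additionally have to be checked against the clipping limit before being written back to a wave file.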

#### **ETHICS**

It was not possible to obtain informed consent from the people whose radio statements were used, as these were not individually identified. The brevity of the speech samples also precluded individual identification; we thus deemed the use of these samples as ethically acceptable. Actors gave verbal informed consent and were paid €20; experimental participants gave written informed consent and were paid €5 for their participation. Both actors and participants were informed afterward about the purpose of the study.

#### **PROCEDURE**

Due to the unequal numbers of speakers in the two conditions, we split the dataset in two and presented the two sets (playback A and playback B) to different groups of listeners. This also served to avoid participant exhaustion. Each set contained five authentic and five corresponding play-acted duplicates per speaker gender and intended emotion, resulting in a total of 80 stimuli (40 authentic, 40 play-acted) per set. Apart from three exceptions, the playbacks were prepared in such a way that each actor was present in one set only once. Related recordings (authentic versus play-acted) were presented in a pseudo-randomized fashion, with the stipulation that the two members of a speech token pair were never played immediately one after the other, to make direct comparisons between recording pairs unlikely.
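The constraint on presentation order can be illustrated with a simple rejection-sampling shuffle. This is a minimal sketch, not the procedure actually used to build the playbacks (which is not described); the function name, the pair identifiers, and the random seed are assumptions.

```python
import random


def pseudo_randomize(pair_ids, rng, max_tries=10_000):
    """Return a shuffled stimulus order in which the authentic and the
    play-acted version of the same speech token are never adjacent.

    Reshuffles the whole list until the constraint holds.
    """
    stimuli = [(pid, cond) for pid in pair_ids
               for cond in ("authentic", "play-acted")]
    for _ in range(max_tries):
        rng.shuffle(stimuli)
        if all(a[0] != b[0] for a, b in zip(stimuli, stimuli[1:])):
            return list(stimuli)
    raise RuntimeError("no valid order found within max_tries")


# One playback set: 40 token pairs, i.e., 80 stimuli.
order = pseudo_randomize(range(40), random.Random(0))
print(order[:4])
```

For 40 pairs, a fair share of random shuffles already satisfies the constraint, so plain reshuffling is adequate here; a constructive algorithm would only be needed for much tighter constraints.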

Each of the two sets of stimuli was presented to 20 listeners (10 female and 10 male) per country, resulting in 40 participants per country. In Germany, all participants were native German speakers recruited at the Georg-August University, Göttingen. Thirty-six were students, three were Ph.D. students, and one was an assistant lecturer. The age of German listeners varied between 20 and 33 years, the average age was *M* = 24.4, SD = 2.8 years for the listeners of playback A and *M* = 25.1, SD = 3.0 years for the listeners of playback B. The 40 Romanian listeners were recruited at the Lucian-Blaga-University of Sibiu, Romania. All of them were students. The age of Romanian listeners varied between 18 and 22 years, the mean age was *M* = 20.0, SD = 1.2 years for the listeners of playback A and *M* = 19.5, SD = 0.7 years for the listeners of playback B. The 40 Indonesian listeners were recruited at the Jakarta University, Indonesia. All Indonesian participants were students aged 18–31 years. The mean age was *M* = 20.7, SD = 2.8 years for the listeners of playback A and *M* = 20.5, SD = 1.9 years for the listeners of playback B. Neither the Romanian nor the Indonesian participants spoke any German. Romanian participants were, however, more familiar with German due to a large German community in the town of Sibiu. We did not collect any information about the emotional state of the participants before or during the experiments.

The stimuli were played back using a laptop (Toshiba Satellite with a Realtek AC97 Soundcard) via a program called Emosurvey (developed by Martin Schmeisser). Participants heard the stimuli via earphones (Sennheiser HD 497). They could activate the playback of the stimuli themselves and each stimulus could be activated a maximum of three times. The ratings were made via mouse clicks on the screen. When all questions were answered, the next stimulus could be activated. The listeners' ratings were automatically saved in a log file, which could afterward be transferred to other software packages for analysis. In a forced-choice design, participants were asked to determine, for each stimulus, the emotion expressed (emotion rating: joy, fear, anger, sadness), and whether the emotion was authentic or play-acted (dichotomous authenticity rating: authentic, play-acted).

#### **STATISTICAL ANALYSIS**

All models were implemented in the R statistical computing environment (R Development Core Team, 2008). We analyzed the authenticity ratings as well as the emotion ratings with generalized linear mixed models (GLMM) using the glmer function from the lme4 package for binomial data (Bates, 2005). The responses for correct authenticity rating and for correct emotion rating were tested with the predictor variables Country, Intended emotion, Stimulus authenticity, as well as their interactions, and the random factors Participant and Text stimulus (model formulation: correct recognition ∼ Country × Emotion × Authenticity + Random factor Text stimulus + Random factor Participant). Both models (Authenticity rating and Emotion rating) were compared to their respective null models (including only the intercept and the random factors, model formulation: correct recognition ∼ 1 + Random factor Text stimulus + Random factor Participant) using a likelihood ratio test (function anova with the test argument "Chisq"). This comparison revealed differences, such that each of the full models accounted for more variance than the null models. Based on the chosen model we specified a set of experimental hypotheses that we tested *post hoc* using the glht function from the multcomp package (Hothorn et al., 2008), adjusting the *p*-values for multiple testing via the single-step method.
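In conventional notation (the symbols are introduced here for illustration and do not appear in the original), the full model can be read as a logistic mixed model with crossed random intercepts for participant $i$ and text stimulus $j$:

$$
\operatorname{logit}\,\Pr(y_{ij} = 1) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + v_j, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad v_j \sim \mathcal{N}(0, \sigma_v^2),
$$

where $y_{ij} = 1$ denotes a correct response and $\mathbf{x}_{ij}$ codes Country, Emotion, Authenticity, and their interactions. The null model retains only the intercept and the two random intercepts. The likelihood ratio test then compares the maximized log-likelihoods $\ell$ of the two models:

$$
\Lambda = 2\,\bigl(\ell_{\text{full}} - \ell_{\text{null}}\bigr) \;\overset{\text{approx.}}{\sim}\; \chi^2_{k},
$$

with $k$ the number of additional fixed-effect parameters in the full model. Assuming a fully crossed design with all interaction terms estimated, this would be $k = 3 \cdot 4 \cdot 2 - 1 = 23$.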

Assessing recognition accuracy by simply counting hit rates, without addressing potential false alarms or biases (a strong preference toward one response), can be misleading (Wagner, 1993). For instance, if participants have a strong preference for rating stimuli as "authentic," then one would obtain high hit rates for "authentic" speech tokens, but also many wrongly classified play-acted ones (called false alarms). Although the mean recognition rate in this case is quite high, the true ability to recognize authenticity is low. This example shows the importance of calculating biases for understanding rating behavior. A standardized method for analyzing the true discrimination ability for two response options was first introduced as Signal Detection Theory (SDT; Tanner et al., 1954). This technique offers both a measure of discriminatory ability *d*′ (also called sensitivity), which is the true ability to discern one stimulus from another, and a measure of the response bias toward one category, which is independent of sensitivity (criterion *c*). As the emotion recognition task in our study included four response options (four emotions), we analyzed the ratings using Choice Theory (Luce, 1959, 1963; Smith, 1982). Choice theory is a logit-model analog to SDT, which allows the analysis of more than two discrete response categories. A Choice Theory analysis provides (1) the participants' relative bias (*b*), which is the equivalent of criterion *c*, and (2) dissimilarity values (α), which are equivalent to the discriminatory ability *d*′.
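The two SDT quantities can be illustrated with the standard equal-variance Gaussian formulas, *d*′ = z(H) − z(F) and *c* = −[z(H) + z(F)]/2, where H is the hit rate and F the false-alarm rate. The sketch below is not the analysis reported in this paper (which uses choice theory and mixed models); as an assumption for illustration, it treats "authentic" as the signal category and plugs in the pooled recognition rates reported in the Results.

```python
from statistics import NormalDist


def sdt_measures(hit_rate, false_alarm_rate):
    """Equal-variance Gaussian SDT: return (d_prime, criterion_c).

    Rates must lie strictly between 0 and 1, otherwise the inverse
    normal CDF is undefined.
    """
    z = NormalDist().inv_cdf
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa               # sensitivity
    criterion = -0.5 * (z_hit + z_fa)    # negative: bias toward "signal" responses
    return d_prime, criterion


# Illustrative pooled rates (assumption: "authentic" is the signal category).
hit_rate = 0.6781              # authentic tokens rated "authentic"
false_alarm_rate = 1 - 0.4958  # play-acted tokens wrongly rated "authentic"

d_prime, criterion = sdt_measures(hit_rate, false_alarm_rate)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")
```

A small positive *d*′ together with a negative *c* corresponds to the situation described above: a fairly high hit rate for authentic tokens that is driven largely by a preference for the response "authentic" rather than by good discrimination.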

We implemented the choice theory analysis as a baseline-category logit-model (Agresti, 2007). We used the fitted intercept and slope coefficients to derive the bias and similarity parameters of choice theory. The binomial "mixed" model for authenticity recognition (binomial due to the two response options "authentic" and "play-acted") was calculated in R using the glmer function of the lme4 package (Bates, 2005). The multinomial "mixed" model for emotion recognition was programed in WinBUGS (Lunn et al., 2000) using the R2WinBUGS interface package (Sturtz et al., 2005) to account for the four response options ("anger," "fear," "sadness," and "joy").
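In a baseline-category logit model, each non-baseline response category k has a linear predictor that equals the log-odds of choosing k rather than the baseline, and the response probabilities follow by normalizing the exponentiated predictors. The sketch below shows only this mapping for the four response options; the choice of "joy" as baseline and the numerical log-odds are hypothetical, and the random effects and the Bayesian fitting in WinBUGS are not reproduced.

```python
import math

CATEGORIES = ("anger", "fear", "sadness", "joy")


def response_probabilities(log_odds, baseline="joy"):
    """Convert baseline-category log-odds into response probabilities.

    log_odds maps every non-baseline category k to
    log(P(k) / P(baseline)); the baseline's own predictor is fixed at 0.
    """
    etas = {cat: 0.0 if cat == baseline else log_odds[cat]
            for cat in CATEGORIES}
    denom = sum(math.exp(eta) for eta in etas.values())
    return {cat: math.exp(eta) / denom for cat, eta in etas.items()}


# Hypothetical log-odds relative to "joy" for one stimulus type.
probs = response_probabilities({"anger": 1.0, "fear": -0.5, "sadness": 0.3})
for category, p in probs.items():
    print(f"{category}: {p:.3f}")
```

The four probabilities sum to one by construction, and the log of any probability ratio against the baseline recovers the corresponding linear predictor.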

#### **RESULTS**

#### **AUTHENTICITY RECOGNITION**

Across cultures, recognition accuracy for authenticity was only slightly above chance (*M* = 58.73%, SD = 8.84%), with a higher recognition rate for authentic (*M* = 67.81%, SD = 12.37) than for play-acted speech tokens (*M* = 49.58%, SD = 16.78). *Post hoc* tests confirmed this difference in recognition rates (*z* = 18.39, *p* < 0.001; **Figure 1**). German raters, correct in 62.43% of cases, were, on average, more accurate in their authenticity ratings than either Romanian (57.20%) or Indonesian raters (56.67%; German – Romanian *z* = 2.99, *p* = 0.028; German – Indonesian *z* = 2.95, *p* = 0.031).

The analysis of ratings using choice theory revealed that participants had a strong bias toward choosing the response "authentic" in the authenticity ratings (**Figure 2**), which may explain the higher recognition accuracy for authentic speech tokens. The *post hoc* pair-wise comparisons between the participants of the different countries revealed a significantly greater bias in Romanians than Germans (*z* = 2.64, *p* = 0.045; **Figure 2**).

The overall mean dissimilarity of 0.40 implies a generally low discriminatory capability between authentic and play-acted vocal expressions of emotions (MacMillan and Creelman, 2005). *Post hoc* tests revealed that German participants had a higher dissimilarity value and thus a better discriminatory ability than Romanian and Indonesian participants (German – Romanian: *z* = 4.535, *p* < 0.001; German – Indonesian: *z* = 4.590, *p* < 0.001).

#### **EMOTION RECOGNITION**

In total, the correct response rate in emotion ratings was 40.65% (SD = 6.41%), which is higher than a chance response rate of 25% resulting from a random selection of one of the four emotions. The emotion recognition ratings in general showed similar patterns in the three countries (**Figure 3**). The GLMM analysis revealed that the rate of correct emotion recognition was influenced by Intended emotion, Stimulus authenticity, and Country (see **Table 1** for the results of the *post hoc* analysis). Play-acted stimuli were recognized more accurately (42.78%) than authentic stimuli (38.52%). Specifically, play-acted anger was recognized more frequently than authentic anger and authentic sadness more than play-acted sadness. Authenticity did not significantly influence the emotion recognition rates for fear and joy. Concerning the four emotion categories, anger and sadness were on average recognized significantly more frequently than fear, and sadness was recognized more frequently than joy. Finally, emotion recognition rates were significantly higher for German participants in comparison to Romanian and Indonesian participants, but not for Romanian participants in comparison to Indonesian participants (**Table 1**).

**FIGURE 1 | Probability of correct authenticity recognition by intended emotion (A – anger, F – fear, J – joy, S – sadness) and stimulus authenticity (authentic or play-acted).** The data are split by cultural affiliation (G – Germany, R – Romania, I – Indonesia). Given are means and 95% confidence intervals. The probability of correct authenticity recognition by chance is 0.5 as indicated by the dashed horizontal lines.

The response bias for emotion judgments was calculated with respect to cultural affiliation and stimulus authenticity. In all three countries participants showed a bias toward rating play-acted stimuli as angry (**Figure 4**). This bias was higher for German than for Romanian or Indonesian participants. German participants were also biased toward rating authentic stimuli as angry, while Romanian and Indonesian participants preferentially chose "sadness" and were additionally biased against choosing "anger" when rating authentic stimuli. There was no effect of authenticity or country of origin with respect to the responses "joy" and "fear." Indonesian participants, whose bias against "joy" was less distinct than for Romanian or German participants, were the only exception.

**FIGURE 3 |** […] probability of correct emotion recognition with respect to the intended emotion (A – anger, F – fear, J – joy, S – sadness) and stimulus authenticity (authentic or play-acted). The data are split by cultural […] recognition by chance is 0.25 as indicated by the dashed horizontal lines.

**Table 1 | Post hoc tests of cultural affiliation, and stimulus-specific factors (stimulus authenticity, intended emotion) on the probability of correct emotion recognition.**

The p-values are adjusted for multiple testing. Auth – non-instructed; play – instructed; A – anger; F – fear; J – joy; S – sadness. \*p < 0.05; \*\*p < 0.01; \*\*\*p < 0.001.

The outcome of the calculation of the dissimilarity values for all possible stimulus-response pairs during emotion ratings (including effects of country and stimulus authenticity) is shown in **Figure 5**. There were few differences between authentic and play-acted emotional expressions and between the participants of the three countries. High dissimilarity values were found between anger and sadness, which indicates that these emotions could be distinguished easily. The very low dissimilarity values for the stimulus "fear" (see row "F" in the matrix plot in **Figure 5**) indicate high confusion with the other emotion categories and reflect the low recognition rates for fear.

#### **DISCUSSION**

Participants in all three cultures had difficulties distinguishing between authentic (spontaneous) and play-acted (instructed) emotional expressions. The recognition of the expressed emotion also showed relatively low rates, but varied with respect to the emotion category and listener country of origin. Notably, the stimulus origin (authentic versus play-acted) had a clear impact on the recognition of vocal expressions of anger and sadness across all three cultures: anger was recognized more frequently when play-acted and sadness was recognized at higher rates when authentic, bolstering earlier findings for an independent German population (Drolet et al., 2012). While these results are significant, it remains unclear what leads to this effect. It may be that play-acted anger is more exaggerated than spontaneously expressed anger, while sadness, in contrast, is more difficult to play-act. On the other hand, it may be that, overall, some stimulus feature makes play-acted stimuli more likely to be perceived as anger and spontaneous stimuli as sadness.

With regard to our initial hypotheses, we found support for the conjecture that play-acted anger was recognized with higher accuracy than authentic anger across cultures, possibly because of its stereotypical nature. For the other three categories, acting does not necessarily appear to be connected with a more exaggerated expression, which is contrary to previous results (Barkhuysen et al., 2007; Laukka et al., 2012). According to our results, play-acted expressions do not represent a socially learned code (Matsumoto et al., 2009). Considering the similar interaction of emotion recognition and stimulus authenticity across the three cultures, our findings lend further support for the notion that emotion recognition is underpinned by human universals.

**FIGURE 4 | Analysis of emotion recognition data by choice theory.** Given is the log-transformed response bias for each of the four possible choices (anger, fear, joy, sadness) with respect to cultural affiliation (G – Germany, R – Romania, I – Indonesia). The filled and open symbols indicate the response bias for authentic and play-acted stimuli. Data are given as means and 95% uncertainty interval. In the absence of any bias, all four log-transformed bias values would be zero. Positive values indicate a bias toward choosing the response named in the headline, whereas a value below zero indicates a bias against choosing the respective response.

The fact that listeners of all three cultures were poor at discriminating between authentic and play-acted vocalizations shows that previous findings (Drolet et al., 2012) are applicable cross-culturally. If emotional expressions are indicators for underlying states that may require behavioral responses by the observer (for a controversial discussion, see Russell et al., 2003; Barrett, 2011), the ability to detect fake emotional expressions should be important and evolutionarily adaptive (Schmidt and Cohn, 2001; Mehu and Scherer, 2012). The inability to distinguish between play-acted and spontaneous expressions is, therefore, counter-intuitive, but has also been found in previous studies (for corresponding results, see Ekman and O'Sullivan, 1991; Audibert et al., 2008). People tend to believe in the truthfulness of a statement rather than mistrust it (Zuckerman et al., 1984; Levine et al., 1999). This effect, labeled as "truth bias," is reflected in our participants' bias to choose the answer "authentic" when asked about the encoding condition of the emotional expression. It may be that the social cost of ignoring an emotion in others (miss) or wrongly considering others to be deceivers (false alarm) makes a bias toward believing in the authenticity of social signals advantageous (Ekman, 1996).

In addition to the well-documented in-group effect for German participants (Scherer et al., 2001; Elfenbein and Ambady, 2002) in both emotion and authenticity recognition, cultural effects mainly became apparent in rating biases of emotions and not in recognition accuracy or dissimilarity. This has also been demonstrated by Sneddon et al. (2011), who showed that emotional stimuli were recognized similarly across different cultures, although the intensity ratings varied. Our initial hypothesis that Indonesian and Romanian participants exhibit a bias against negative emotions was, however, only partially supported. They had, in accordance with our hypothesis, a clear bias against selecting "anger," but only for authentic stimuli. When listening to the spontaneous speech tokens, Indonesian and Romanian participants preferentially chose "sadness." No cultural difference was found for the selection of "fear." German participants showed a bias toward selecting "anger" for both authentic and play-acted stimuli. According to the hypothesis that individualistic cultures are expected to reinforce the expression of negative emotions, German participants may have expected a higher likelihood of being confronted with expressions of anger based on their everyday experiences, regardless of the stimulus type presented. Conversely, the more collectivistic Romanian and Indonesian participants may have expected expressions of sadness to be more likely (see Matsumoto, 1989 for similar results). Thus, sadness seems to rank differently compared to anger, and the lumping of all negative emotions in the context of response bias seems to be an over-simplification, which might also explain the absence of clear bias effects in previous studies (Elfenbein et al., 2002; Sneddon et al., 2011).
Interestingly, the expected response bias against "anger" for the Romanian and Indonesian participants is only present for authentic stimuli, which can be explained by stimulus-inherent features of the play-acted speech tokens overriding the response bias (Wagner, 1993; Elfenbein et al., 2002). The link between putative cultural biases requires stronger empirical investigations before firm conclusions can be drawn, in particular regarding limitations on the number and types of countries examined (with respect to language and cultural distance). However, our results demonstrate that the implicit effects of authenticity clearly derive from a complex interaction between stimulus-inherent features and cultural expectations about the likelihood of specific emotional expressions.

Due to the use of spontaneous emotional expressions taken from anonymous radio interviews, our study did not allow for a within-speaker design. We thus could not explicitly test whether individual differences in speaker expressivity affected the results. However, the large number of radio speakers and actors involved (more than generally seen in comparable studies) allowed us to minimize the influences of such effects. Additionally, the recognition rates of fear and joy were quite low compared to previous studies on vocal expressions of emotions (e.g., Van Bezooijen et al., 1983; Scherer et al., 2001; Pell and Kotz, 2011). This is interesting, taking into account that not only the spontaneous emotions, for which a low recognition would have been predicted, but also the play-acted ones, revealed recognition rates near chance levels. In contrast to standard methodology, we did not use exaggerated emotional expressions, preselected speech tokens, or emotional outbursts in a word or two (Van Bezooijen et al., 1983; Scherer et al., 2001; Pell et al., 2009). Actors were provided with longer transcripts (several sentences) to portray emotionally to ensure situations as similar to the authentic recordings as possible. It seems unlikely that specifically these professional actors were unable to encode joy or fear, considering that this has been done by laymen and inexperienced actors before (Van Bezooijen et al., 1983; Pell et al., 2009). In particular, the low recognition rates for joy and fear at or close to chance levels might reveal interesting facts about emotional expressions in general. The inability to recognize fear may indicate that fear is less clear in segments of longer speech samples than previously thought. In fact, we believe that the low recognition rates overall are what made the discovery of the interaction with authenticity, as well as the differences in the response bias, possible.
It is clear that further work in this direction is needed to understand the relevance of emotion recognition research to day-to-day life. Nevertheless, the cross-cultural results revealed that spontaneous and play-acted emotional expressions are recognized similarly across cultures, indicating that the recognition of both play-acted and spontaneous emotional expressions rests on a similar universal basis. Furthermore, our results emphasize the importance of rating response biases, especially regarding more ambiguous expressions such as those taken from spontaneous situations.

#### **CONCLUSION**

Combining all results, this study supports the view that emotion recognition rests on a complex interplay between human universals and cultural specificities. On the one hand, we found the same pattern of recognition and the same implicit effects of encoding conditions across cultures; on the other hand, cultural differences became evident in distinct biases. In addition, although the low recognition of encoding conditions would appear to argue for acted stimuli in vocal research, the implicit effects on emotion recognition seen here indicate that the design of future studies on vocal emotion recognition must take this variation in stimulus characteristics into account.

#### **ACKNOWLEDGMENTS**

This research was funded by the German BMBF (Bundesministerium für Bildung und Forschung) within the collaborative research group "Interdisziplinäre Anthropologie." We thank Jeanette Freynik for aid with conducting the experiments and Annika Grass for valuable comments on the manuscript.

#### **REFERENCES**


differences. *Front. Psychol.* 2:180. doi:10.3389/fpsyg.2011.00180


versus collectivism. *J. Cross Cult. Psychol.* 39, 55–74.


universal characteristics. *J. Cross Cult. Psychol.* 14, 387–406.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 December 2012; accepted: 18 February 2013; published online: 13 March 2013.*

*Citation: Jürgens R, Drolet M, Pirow R, Scheiner E and Fischer J (2013) Encoding conditions affect recognition of vocally expressed emotions across cultures. Front. Psychol. 4:111. doi: 10.3389/fpsyg.2013.00111*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Jürgens, Drolet, Pirow, Scheiner and Fischer. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Perception of emotionally loaded vocal expressions and its connection to responses to music. A cross-cultural investigation: Estonia, Finland, Sweden, Russia, and the USA

#### *Teija Waaramaa<sup>1</sup>\* and Timo Leisiö<sup>2</sup>*

*<sup>1</sup> School of Communication Media and Theatre, University of Tampere, Tampere, Finland*

*<sup>2</sup> School of Social Sciences and Humanities, University of Tampere, Tampere, Finland*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Michihiko Koeda, University of Glasgow, UK*

*Åsa Abelin, University of Gothenburg, Sweden*

#### *\*Correspondence:*

*Teija Waaramaa, School of Communication Media and Theatre, University of Tampere, Kalevantie 4, Tampere 33014, Finland e-mail: teija.waaramaa@uta.fi*

The present study focused on voice quality and the perception of the basic emotions from speech samples in cross-cultural conditions. It was examined whether voice quality, cultural or language background, age, or gender was related to the identification of the emotions. Professional actors (*n* = 2) and actresses (*n* = 2) produced non-sense sentences (*n* = 32) and protracted vowels (*n* = 8) expressing the six basic emotions, interest, and a neutral emotional state. The impact of musical interests on the ability to distinguish between emotions or valence (on an axis positivity – neutrality – negativity) from voice samples was studied. Listening tests were conducted on location in five countries: Estonia, Finland, Russia, Sweden, and the USA with 50 randomly chosen participants (25 males and 25 females) in each country. The participants (total *N* = 250) completed a questionnaire eliciting their background information and musical interests. The responses in the listening test and the questionnaires were statistically analyzed. Voice quality parameters and the share of the emotions and valence identified correlated significantly with each other for both genders. The percentage of emotions and valence identified was clearly above the chance level in each of the five countries studied; however, the countries differed significantly from each other for the identified emotions and the gender of the speaker. The samples produced by females were identified significantly better than those produced by males. Listener's age was a significant variable. Only minor gender differences were found for the identification. Perceptual confusion in the listening test between emotions seemed to be dependent on their similar voice production types. Musical interests tended to have a positive effect on the identification of the emotions.
The results also suggest that identifying emotions from speech samples may be easier for those listeners who share a similar language or cultural background with the speaker.

**Keywords: voice quality, expression, perception of emotions, valence, musical interests, cross-cultural**

#### **INTRODUCTION**

Basic emotions are thought to be universal in their manifestation since they are considered to be phylogenetic, evolutionary survival-related affects (Izard, 2007). The vocal expression and perception of these emotions tend to be based firstly on genetically inherited, and secondly on culturally learnt elements (Matsumoto et al., 2002). Also, the expression and perception of emotions expressed by music tends to be affected by both inherited characteristics and by cultural learning (Morrison and Demorest, 2009), and even by individual preferences, e.g., a piece of music may emotionally move one person but not another (Cross, 2001). In this paper it is hypothesized that the origin of speech and temporal experiences such as emotional and musical expressions are linked together in evolution (Juslin and Laukka, 2003a). According to Richman (2001) "in the beginning speech and music making were one and the same: they were collective, real-time repetitions of formulaic sequences." Moreover, Thompson et al. (2004) have suggested that "it seems unlikely that human evolution led to duplicate mechanisms for associating pitch and temporal cues with emotions."

In voice research, voice quality is traditionally defined as the coloring of the speaker's voice (Laver, 1980), and in a narrower sense, as a combination of voice source (the air flow and vocal fold vibration), and filter functions (the vocal tract and formant frequencies) (Fant, 1970). The amount of subglottal air pressure and adduction of the vocal folds in the glottis determine the phonation type, whether it is hyperfunctional or hypofunctional. In a hyperfunctional phonation type the spectral slope is flatter and there is more energy and stronger overtones in the high frequency area than in a hypofunctional phonation type, where the slope is steeper and the overtones are weaker (Gauffin and Sundberg, 1989). Hyperfunctional phonation type is perceived as pressed voice quality and hypofunctional as breathy voice quality. Perceptual interpretations of the voice quality may either clarify or blur the meaning of the message, or change the whole information sent by a speaker.

Similarly to music, vocal expressions always have a fundamental frequency (F0) (excluding whisper), intensity (sound pressure level, SPL), and duration. These are the traditional parameters studied in the voice quality of emotional expressions. As sound is transmitted via vibrating objects there is no music without movement (Cross, 2001; Levitin and Tirovolas, 2009), and this connection between sound and movement tends to be evolutionarily based (Liberman, 1981; Liberman and Mattingly, 1985; Rizzolatti et al., 1996). Likewise, in voice production the air pressure from the lungs makes the vocal folds vibrate, and without this action there is no vocal sound. According to the motor control theory and also the more recent theory of the mirror neurons, speech is said to be understood in terms of its production rather than from the characteristics of the acoustic cues (Liberman, 1981; Liberman and Mattingly, 1985; Gentilucci and Corballis, 2006). In turn, the acoustic cues are connected to the physiological principles, and are the carriers of the emotional content of speech (see e.g., Juslin and Laukka, 2003b).

Human vocal communication inevitably conveys emotional messages – whether intended or not. Cultural differences do occur in humans in spite of the genetically based similarities in the expression and perception of the basic emotions (Matsumoto et al., 2002; Abelin, 2004). The cultural differentiation in music seems to occur by the end of the first year of life (Hannon and Trehub, 2005; Belin et al., 2011), and the cultural conventions of the music are learnt by the age of five (Trehub, 2003; Hannon and Trehub, 2005).

Typical of music, always based on harmonic relations between tones, are the rules (syntaxes) which govern the ways a tune is allowed to be composed. These rules are local and they deal with various alternating combinations of 1, 2, 3, 4, 5, or 6 tones (Leisiö and Ebeling, 2010). Typicality creates expectations and predictions of the characteristics of the musical sounds in a particular culture (Levitin and Tirovolas, 2009). However, the three basic elements of musical expression, frequency, intensity, and duration are not culture-specific as such.

There also appear to be similarities in the musical emotional expressions between cultures, e.g., emotional content of happy, sad, and fearful Western music has been reported to be recognized clearly above chance level by African listeners (Fritz et al., 2009). Balkwill and Thompson (1999) studied the perception of emotions in Western and Indian music and suggested that listeners are sensitive to unfamiliar tonal systems.

However, recognition of the emotions is more demanding in the absence of the familiar perceptual cues. This was also verified by Scherer et al. (2001), who conducted an extensive research project on the perception of vocal emotional utterances in seven countries, in Europe, Indonesia, and the USA. The vocal language-free portrayals used were produced by German professional actors, who expressed four emotions and a neutral emotional state. The emotions were perceived with 66% accuracy across countries. However, as the dissimilarities between the languages increased the accuracy of the perception decreased. As a result, the researchers stated that culture and language specific patterns may have an influence on the decoding processes of emotional vocal portrayals.

Sauter et al. (2009) studied perception of English and Himba non-verbal vocalizations representing basic emotions. Their results showed that listeners from both groups could identify the emotions; however, better accuracy was achieved when the producer and the listener were from the same culture.

Similar results were reported by Koeda et al. (2013) in a recent investigation of non-verbal "ah" affect bursts. The vocalizations were produced by French-Canadian actors. Canadian and Japanese participants served as listeners. It was found that the Canadian listeners recognized the emotions expressed, both positive and negative, more accurately than did the Japanese listeners.

Thompson et al. (2004) and Lima and Castro (2011) investigated whether music training assists speech prosody decoding. The researchers concluded that music training may facilitate the recognition of the emotional content of speech. Trimmer and Cuddy (2008) came to a somewhat opposite conclusion. They reported that music training does not seem to be linked to the ability to recognize emotional speech prosody. Instead, emotional intelligence may predict sensitivity to emotion recognition from speech prosody, and this tends to require different processes than those required in musical or acoustical sensitivity. Strait et al. (2009) have stated that subcortical mechanisms are involved in the auditory processing of emotions, and musical training enhances these processes: training when younger than 7 years facilitates pitch and timbre perception, and duration of training impacts processing of temporal features.

The present study was concerned with whether the voice quality of emotional speech samples affects the identification of emotions and emotional valence (on the axis positivity – neutrality – negativity). The second aim was to investigate cross-cultural perception, whether it is dependent on language or cultural background, age, or gender. Thirdly, whether the ability to recognize emotional states is related to musical interests was studied (Thompson et al., 2004; Trimmer and Cuddy, 2008; Levitin and Tirovolas, 2009; Strait et al., 2009). Therefore, the participants of the listening tests were asked on a questionnaire about their subjective musical interests. Listening tests for 250 randomly chosen, volunteer participants were conducted on location in five countries: Estonia, Finland, Russia, Sweden, and the USA.

### **MATERIALS AND METHODS**

#### **ACOUSTIC AND STATISTICAL ANALYSES**

Emotionally loaded sentences (*n* = 32) and protracted vowels [a:], [i:], [u:] (*n* = 8) were produced by Finnish professional actors (*n* = 2) and actresses (*n* = 2). They read aloud a non-sense text (*Elki neiku ko:tsa, fonta tegoa vi:fif:i askepan:a æspa. Fis:afi: te:ki sta:ku porkas talu.*) expressing six basic emotions, namely anger, disgust, fear, joy, sadness, surprise, and a neutral emotional state. These emotions were chosen since 4–6 of them (depending on the source) are thought to be universal (Murray and Arnott, 1993; Juslin and Laukka, 2003a; Mithen, 2006). Interest is sometimes also listed as one of the basic emotions since it is seen as the principal force in organizing consciousness and focusing attention (Izard, 2007, see also Scherer and Ellgring, 2007). Based on this definition, interest was included in the present investigation. The recordings were made with the Sony Sound Forge 9.0 recording and editing system and a Rode NTK microphone at the professional recording studio MediaBeat in Tampere, Finland. The speakers' distance from the microphone was 40 cm. In the tests the listeners used Sennheiser HD 598 headphones.

Acoustic parameters were measured with Praat Software, version 5.2.18. A frequency range of 0–5 kHz and cross-correlation were used. F0, maximum pitch, SPL, filter characteristics (formant frequencies F1, F2, F3, F4), duration, mean harmonics-to-noise ratio (HNR, dB), number of pulses, and number and degree of voice breaks were measured. HNR measures perturbation in the voice signal. The number of voice breaks is the number of distances between consecutive pulses that are longer than 1.25 divided by the pitch floor. Degree of voice breaks is the ratio between the total duration of the non-voiced breaks and the duration of the signal (http://www.fon.hum.uva.nl/praat/manual/Voice\_1\_\_Voice\_breaks.html). The vowels were replayed consecutively to the participants in the listening tests. As the stress is always on the first syllable in the Finnish language and thus carries the main communicational information, the acoustic parameters were studied only for the first [a:] vowel. Alpha ratio was calculated by subtracting the SPL in the range 50 Hz–1 kHz from the SPL in the range 1–5 kHz (Frøkjær-Jensen and Prytz, 1973). Alpha ratio is used to characterize the spectral energy distribution.
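The alpha ratio definition above can be illustrated with a minimal sketch. It is not the procedure used in Praat or in this study: the function names `band_level_db` and `alpha_ratio_db` are hypothetical, the band levels are taken from a naive discrete Fourier transform with an arbitrary dB reference, and the test signal is synthetic.

```python
import cmath
import math


def band_level_db(samples, sample_rate, f_lo, f_hi):
    """Level in dB (arbitrary reference) of the spectral energy in [f_lo, f_hi) Hz.

    Assumption: a naive DFT over the whole signal stands in for whatever
    spectrum estimate an acoustic analysis program would use.
    """
    n = len(samples)
    energy = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)
            )
            energy += abs(coeff) ** 2
    return 10.0 * math.log10(energy) if energy > 0.0 else float("-inf")


def alpha_ratio_db(samples, sample_rate):
    """Level in 1-5 kHz minus level in 50 Hz-1 kHz, as described in the text."""
    high = band_level_db(samples, sample_rate, 1000.0, 5000.0)
    low = band_level_db(samples, sample_rate, 50.0, 1000.0)
    return high - low


# Synthetic example: a strong 200 Hz component plus a 2 kHz component
# with one tenth of its amplitude, sampled at 16 kHz for 50 ms.
SAMPLE_RATE = 16000
signal = [
    math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
    + 0.1 * math.sin(2 * math.pi * 2000 * t / SAMPLE_RATE)
    for t in range(800)
]
print(round(alpha_ratio_db(signal, SAMPLE_RATE), 2))  # close to -20 dB
```

A steeper spectral slope (hypofunctional phonation) gives a more negative value, and a flatter slope (hyperfunctional phonation) gives a value closer to zero.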

Emotional valence was coded by the researcher: positive valence (interest, joy and surprise) = 1, a neutral emotional state = 0, negative valence (anger, disgust, fear and sadness) = −1.
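The valence coding can be written down as a small lookup table. The sketch below also shows one plausible way to turn listener responses into shares of identified emotions and valence. The scoring rule is an assumption (a response counts as a valence hit when its code equals the code of the intended emotion), and the names `VALENCE` and `share_identified` as well as the example trials are hypothetical.

```python
# Valence codes as given in the text: positive = 1, neutral = 0, negative = -1.
VALENCE = {
    "interest": 1,
    "joy": 1,
    "surprise": 1,
    "neutral": 0,
    "anger": -1,
    "disgust": -1,
    "fear": -1,
    "sadness": -1,
}


def share_identified(trials):
    """Return (share of emotions identified, share of valence identified).

    trials is a list of (intended, perceived) emotion labels.
    Assumption: valence is scored as correct when the perceived emotion
    has the same valence code as the intended one.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    emotion_hits = sum(intended == perceived for intended, perceived in trials)
    valence_hits = sum(
        VALENCE[intended] == VALENCE[perceived] for intended, perceived in trials
    )
    return emotion_hits / len(trials), valence_hits / len(trials)


# Illustrative responses only, not data from the study.
example_trials = [
    ("joy", "joy"),
    ("joy", "surprise"),   # wrong emotion, same (positive) valence
    ("anger", "sadness"),  # wrong emotion, same (negative) valence
    ("fear", "neutral"),   # wrong emotion, wrong valence
]
print(share_identified(example_trials))
```

Under this rule the valence share can never be lower than the emotion share, which is consistent with valence being identified more often than the specific emotion.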

Statistical analyses were conducted using Excel and IBM SPSS Statistics 19 to investigate whether the voice parameters measured correlated with the identification of the emotions or valence and whether the perception of emotions differed by country, age, gender, or self-reported musical interests.

#### **QUESTIONNAIRE AND LISTENING TESTS**

Listening tests were conducted on location in five countries with different cultural and/or language backgrounds: Estonia, Finland, Russia, Sweden, and the USA. American English, Russian, and Swedish are related as members of the Indo-European linguistic family while Estonian and Finnish belong to the same Finno-Ugric language genus. As Nordic countries Finland and Sweden share a similar cultural background.

Fifty randomly chosen listeners in each country (25 males and 25 females × 5 countries = 250 listeners) participated in the perception test. The only criteria for participation were that the listeners were native speakers of the specific main language in each country, i.e., Estonian in Estonia, Finnish in Finland, Russian in Russia, Swedish in Sweden, and American English in the USA, and that the participants had lived most of their lives in the country. In Sweden, some of the listeners had one parent from another country, and one listener was adopted to Sweden as a baby; however, every listener spoke Swedish as their first language. The listeners were adults (18+ years old), mean age 33 years (Finland 47.5 years, Russia 34.5 years, Estonia 32 years, Sweden 27 years, and the USA 23 years).

The contact universities in the countries studied published the research project and called for volunteers to participate in the listening tests. Neither personal data registers nor invasive methods were used. All participants' anonymity was ensured. Consequently, no permission of the ethics committee was needed. The participants recruited in the USA were offered a course credit for participating.

The listening tests were conducted one by one with the listeners in an office (Finland and partly the USA), or in normal classroom conditions (Estonia, Russia, and partly the USA) or in a soundproof studio (Sweden). The researcher was alone with the listener in the test, except when a translator was needed in Russia. Listening tests are traditionally conducted in soundproof studio conditions. In the present study this was not required so as to be able to conduct the research independently using the facilities the universities in different countries were able to offer a visiting researcher. Furthermore, it was of interest to replicate the conditions of a normal social situation where people talk to each other having some random sounds around them, and nevertheless, focusing on listening to the speech and the voice of their interlocutors.

The participants completed a questionnaire eliciting background information, and responded to the following statements concerning their musical activities: (1) I like to listen to music. (2) It is easy for me to respond to music. (3) I am interested in singing. (4) I play a musical instrument. (5) I am interested in dancing. (6) It is easy for me to dance in the correct rhythm. (7) It is easy for me to learn a new melody. (8) Music may affect my mood. (9) Music may cause me physical reactions. The idea was to study the participants' subjective opinion about their relation to music, not to measure their activity or education in music.

The questionnaire and the emotion labels were translated by university teachers, either native speakers of the language (Estonian and English) or Finnish teachers in Swedish and Russian.

In the perception test the listener first heard four two-sentence non-sense samples, one from each speaker, and then one example of each emotion expressed by the four speakers. The researcher named the samples by the emotion before replaying them one by one in order to familiarize the listener with the speakers' voices and the vocal variation the speakers used in the emotional expressions. Next, the researcher replayed the 32 emotional nonsense sentences one by one (eight emotions × four speakers), and the listener reported orally which emotion he/she perceived. The researcher wrote down the answers given. Finally, the listener heard eight simple protracted vowel samples, two emotions from each speaker, and chose his/her answer again from the list of the eight emotions expressed. Free choice was not used. The test took about 35 min for each listener.

All the samples were replayed in the same random order from the researcher's computer to the participants. The listeners did not have to use any equipment while listening and answering. In unclear cases the participants were instructed to choose the nearest emotion to what they assumed to be the target. They were asked to choose neutral only when they thought there was no particular emotion expressed. The participants were instructed to answer as briefly as possible. On the other hand they were allowed to listen to a sample as many times as they felt they needed to (usually 1–2 times). They were also allowed to listen to the previous samples again so as to avoid possible order effects.
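Because the listeners chose among eight fixed alternatives, a guessing baseline for the 32 sentences can be worked out. The sketch below is an illustration under a stated assumption, namely that a guessing listener picks each of the eight labels with equal probability; the source does not say how its chance level was defined. The name `chance_levels` is hypothetical.

```python
from fractions import Fraction

EMOTIONS = [
    "anger", "disgust", "fear", "interest",
    "joy", "neutral", "sadness", "surprise",
]
# Valence codes from the text: positive = 1, neutral = 0, negative = -1.
VALENCE = {
    "interest": 1, "joy": 1, "surprise": 1,
    "neutral": 0,
    "anger": -1, "disgust": -1, "fear": -1, "sadness": -1,
}


def chance_levels(stimuli):
    """Expected hit rates for uniform random guessing among the eight labels.

    Returns (emotion chance, valence chance) as exact fractions.
    Assumption: every label is guessed with probability 1/8.
    """
    n_choices = len(EMOTIONS)
    emotion_chance = Fraction(1, n_choices)
    valence_total = Fraction(0)
    for intended in stimuli:
        # Number of response labels that share the intended valence code.
        matching = sum(
            1 for label in EMOTIONS if VALENCE[label] == VALENCE[intended]
        )
        valence_total += Fraction(matching, n_choices)
    return emotion_chance, valence_total / len(stimuli)


# Eight emotions from each of four speakers, as in the sentence test.
sentence_set = EMOTIONS * 4
print(chance_levels(sentence_set))
```

Under this assumption the baseline for valence is higher than for emotions, because several response labels share each valence code.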

### **RESULTS**

#### **VOICE QUALITY**

In vowel [a:] alpha ratio correlated significantly negatively with duration in both genders. In the sentences alpha ratio and SPL correlated significantly positively. Alpha ratio and SPL have been shown to vary together (Nordenberg and Sundberg, 2003; Sundberg and Nordenberg, 2006). Duration correlated negatively with F0. These results suggest that in hypofunctionally produced samples duration is longer than in hyperfunctionally produced samples.

Significant correlations between the share of identified *emotions* and voice parameters were found in both genders for mean HNR, number of voice breaks and SPL, and in females also for maximum pitch and number of pulses. Significant correlations between the share of identified *valence* and voice parameters were found in both genders for number of pulses and number of voice breaks, and in males also for duration (**Table 1**).

Number of voice breaks was highest for sadness and lowest for anger, and degree of voice breaks was highest for fear and lowest for joy. The voice production type in sadness and fear tends to be more hypofunctional than in anger and joy, thus having less energy in the higher frequency area of the spectrum.

The mean duration of the sentence samples was 9652 ms, and that of the vowels 930 ms. Anger in males and joy in females had the shortest durations for the sentences. Sadness, followed by fear, had the longest durations in both genders.

#### **QUESTIONNAIRE**

Degree of tiredness or mood tended to be non-significant features in relation to the identification accuracy of the emotional samples. Seventeen participants reported impaired hearing (Estonia 1, Finland 8, Russia 2, Sweden 5, and the USA 1).

The results of the Student's *t* test showed that those who reported impaired hearing did not identify the emotions less successfully (69% identified) than those with normal hearing (70% identified).

The listeners were divided into two groups, under 40 years and 40+ years in order to study perceptual age differences. The younger group identified emotions with 70% accuracy and valence with 91% accuracy, and the older group emotions with 68% and valence with 90% accuracy. When Pearson correlation was studied by country, a slight negative correlation between age and the identification of the emotions was found for Finland, Russia, and the USA (**Table 2**).
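As a minimal sketch of the correlation analysis described above, Pearson's r between listener age and the share of identified emotions could be computed as follows. The function name `pearson_r` and the sample values are hypothetical illustrations, not data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical listeners: age in years and share of emotions identified.
ages = [22, 31, 45, 58, 63]
shares = [0.74, 0.71, 0.69, 0.66, 0.62]
r = pearson_r(ages, shares)  # negative when accuracy falls with age
```

A negative value of `r` would correspond to the direction of the age effect reported for Finland, Russia, and the USA.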

The first statement in the questionnaire was "I like to listen to music." By this statement the idea was to measure the degree of consumption of music. The results showed that the degree of consumption by listening to the music did not seem to be associated with the emotions or valence identified in the vocal samples (**Table 3**).

The other statements concerning musical interests were statistically significantly associated with the emotions and valence identified. Those participants who reported engaging in musical interests and responding to music were compared to those who did not have a clear response to these activities. It was found that the listeners reportedly engaging in music differed significantly in the share of the identified emotions and valence from the listeners who did not report musical interests or sensitive response to music (**Table 3**).

Females reported significantly more often than males being interested in singing while males reported playing a musical instrument significantly more often than females. When studied by country, those who were interested in singing and who played a musical instrument were most often Swedish listeners. "I am interested in dancing" was most often answered "Yes" by the US listeners.

Emotional states of fear, interest, and joy were most frequently associated with musical interests. Neutrality was not associated with any of the musical interests. "It is easy for me to learn a new melody" and "I am interested in singing." were the statements which seemed to be engaged with most of the identified emotions. The statement "It is easy for me to dance in the correct rhythm" was not emotion specific and was not associated with any particular emotion (**Table 4**).

#### **LISTENING TESTS**

Cronbach's alpha for the listening test by country was: Finland 0.945, Estonia 0.929, Sweden 0.905, the USA 0.874, and Russia 0.871. The results showed that the percentage of emotions and valence identified was clearly above the chance level in each of the five countries with different language and/or cultural backgrounds. A confusion matrix in percentages and numbers for the emotions identified is shown in **Table 5**. Sadness and fear were the most frequently chosen emotional states for an answer, followed by neutrality. Anger was the most rarely chosen answer (**Figure 1**).
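For readers unfamiliar with the reliability coefficient reported above, the following is a minimal sketch of the standard Cronbach's alpha formula. How the study arranged its data is not stated in the text, so the layout used here (one row per respondent, one column per item, 1 = identified, 0 = not identified) and the function name `cronbach_alpha` are assumptions, and the scores are hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: five respondents, four items.
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
alpha = cronbach_alpha(example_scores)
```

Values close to 1, such as those reported for the five countries, indicate high internal consistency.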

For the first four samples the percentage of identified emotions was 59% and valence 87%, for the sentences 70% and valence 90%, and for the vowels 69 and 90% respectively. The result for the first four samples was from the 233 participants since the first 17 Finnish listeners missed these samples at the beginning of the present research project. As the accuracy percentage of identification was higher for the sentences than for the first four samples it may be assumed that familiarizing the listeners with the variations of the speakers' voices may have improved their recognition of the target emotions. The familiarizing did not seem to affect the recognition as much of emotional valence which was fairly high already before the familiarizing. Negative emotions were identified slightly more accurately than positive ones.

**Table 1 | Significant results for Pearson correlation between voice quality parameters and the share of identified emotions and valence (*p* < 0.05).**

*The results are presented for the first four samples, sentences and vowels (There were no female listeners 40+ years in the USA).*

*\*Significant negative correlation with age: Finland r = −0.333, Russia r = −0.350, USA r = −0.302.*

The younger listeners identified sadness significantly better than the older listeners (*p* = 0.036), who identified joy (*p* = 0.021), surprise (*p* = 0.002), and neutrality (*p* = 0.024) significantly better than the younger ones.

The binomial test conducted on the samples showed that 10 samples were identified with under 50% accuracy: two from the first four samples, disgust (24%) and fear (45%); from the sentences, two samples of anger (13%, 31%), disgust (38%), joy (26%), interest (44%), and surprise (43%); and from the vowels, joy (36%) and surprise (42%). Seven of these samples were produced by male speakers.
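The following is a minimal sketch of an exact one-sided binomial test of the kind referred to above. The text does not state the null proportion that was used, so the chance level of 1/8 (eight response options) is an assumption, as is the function name `binom_tail_at_least`; the count of 60 correct answers out of 250 listeners is a hypothetical value chosen to match an accuracy of 24%.

```python
from math import comb

def binom_tail_at_least(k, n, p):
    """Exact probability of observing k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumption: eight response options, so guessing succeeds with p = 1/8.
chance_level = 1 / 8
n_listeners = 250   # hypothetical number of answers for one sample
n_correct = 60      # hypothetical: 24% of 250
p_value = binom_tail_at_least(n_correct, n_listeners, chance_level)
```

Under these assumptions, a small `p_value` would indicate that even a poorly identified sample was recognized more often than guessing alone would predict.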

A number of confusions of the emotions perceived occurred in the listening test. Hypofunctionally produced emotions of sadness and fear were frequently confused with one another, likewise the hyperfunctionally produced negative emotions of anger and disgust. On the other hand, disgust was also confused with sadness by the listeners in Russia, Sweden, and the USA but not in Estonia or Finland. Positive emotions of joy, surprise, and interest were confused with one another, and thus the percentage for their identification was relatively low.
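The two percentage lines of a confusion matrix such as Table 5 can be derived from the raw counts as sketched below. The function name `confusion_percentages` and the counts are hypothetical illustrations, not values from the study; the sketch assumes that every row and column total is non-zero.

```python
def confusion_percentages(counts, labels):
    """Return (% within emotion expressed, % within emotion perceived).

    counts[expressed][perceived] is the number of answers `perceived`
    given to samples whose target emotion was `expressed`.
    """
    within_expressed = {}
    for expressed in labels:
        row_total = sum(counts[expressed][p] for p in labels)
        within_expressed[expressed] = {
            p: 100 * counts[expressed][p] / row_total for p in labels
        }
    within_perceived = {}
    for perceived in labels:
        col_total = sum(counts[e][perceived] for e in labels)
        within_perceived[perceived] = {
            e: 100 * counts[e][perceived] / col_total for e in labels
        }
    return within_expressed, within_perceived

# Hypothetical counts for two frequently confused emotions.
labels = ["sadness", "fear"]
counts = {
    "sadness": {"sadness": 80, "fear": 20},
    "fear": {"sadness": 40, "fear": 60},
}
within_expressed, within_perceived = confusion_percentages(counts, labels)
```

Here `within_expressed["sadness"]["sadness"]` is the identification rate for sadness, while `within_perceived["sadness"]["sadness"]` is the share of all "sadness" answers that were given to samples actually expressing sadness.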

There was a tendency in the perception test that the more similar the listeners' language or cultural background was to those of the speakers', the more accurate the emotion recognition was, and conversely, the more different the language or cultural background was the less accurate the emotion recognition was. The quartiles studied by country showed that 1/4 of the listeners, e.g., in Estonia identified 55%, 1/2 identified 68%, and 1/4 at least 77% of the emotion samples. Variation was widest for Finland. The percentages fall into the quartiles roughly similarly for Estonia and Sweden, and for Russia and the USA. Finnish listeners were most accurate in the identification (**Table 6**).

The logistic regression model (Response = Emotion identified/not identified) showed that the five countries perceived the emotions expressed significantly differently. The identification was connected to the age of the listener. The interaction effect of the speaker gender, country and emotion expressed was significant. The greatest difference between the emotion identification and the gender of the speaker was found for Estonia and Russia. There, most of the non-identified samples were produced by males. Listener's gender was non-significant (**Table 7**).

When studied by country, gender differences were found for only two countries: Estonian males recognized the valence of the first four samples significantly better than did the Estonian females. Swedish males recognized emotions from the sentences significantly better than did Swedish females. However, the differences between genders did not vary significantly among all five countries (**Table 8**).

The emotions produced by males were perceived with 62% accuracy and valence with 87% accuracy; those produced by females were perceived with 74% and 94% accuracy, respectively. The difference was statistically significant (**Table 9**).

### **DISCUSSION**

#### **VOICE QUALITY**

Identification of valence in both genders appeared to be connected to the number of pulses and number of voice breaks. In a hyper-functional voice quality (e.g., in joy and anger) number of pulses is higher per time-domain than in a hypofunctional voice production type (e.g., Waaramaa et al., 2006). Highest number of voice breaks was found for sadness, and highest degree of voice breaks for fear which were both hypofunctionally produced utterances. Voice breaks and perturbation of voice signal tended to be discriminating features connected to the pressed/breathy voice quality in the emotional utterances.

The results suggest that valence is more important in the perception process of the vocal expressions and is therefore of greater communicative importance than the actual emotions. It was shown in a recent study by Waaramaa and Kankare (2012) that statistically significant differences between valences were already found on micro level emotional expressions which were calculated from the electroglottogram (EGG) signal. EGG was used to measure the contact quotient (CQEGG) of the vocal folds. When the vocal folds were 25% closed (25% threshold level) significant differences were already found between valences for the CQEGG. Significant gender differences have been found at the 55% threshold level (Higgins and Schulte, 2002). Consequently, differences between emotions may occur only on higher threshold levels, i.e., later in the expression. Glottal behavior has likewise been reported to affect valence perception by Laukkanen et al. (1997) and Waaramaa et al. (2008, 2010). Thus, from the communicative perspective, expression of valence seems to precede the expression of gender or the actual emotion in speech samples.

*The percentages are the "Yes" answers to the statements. "Yes" answers and the identification of the emotions and valence were significantly associated (excluding the first statement). Significance of the relationship appears on the far right.*

*Statements: \*\*\*p < 0.001, \*\*p < 0.01, \*p < 0.05; ns, non-significant in independent samples. Student's T test for equality of means.*

#### **Table 4 | Emotions significantly associated with musical interests.**

*Statements: \*\*\*p < 0.001, \*\*p < 0.01, \*p < 0.05; ns, non-significant in independent samples. Student's T test for equality of means.*

Formant frequencies measured in vowel [a:] did not show significant differences between emotions in the present material. Nor was it expected for F1 and F2, since they are determined by the vowel expressed. Instead, in earlier investigations F3 and F4 have shown higher frequencies in positive emotions than in negative ones (Waaramaa-Mäki-Kulmala, 2009). This was also the case in the present material, but not significant. Waaramaa et al. (2006) studied synthesized vowel [a:] samples with raised, lowered and removed third formant frequency (F3) and valence perception from the samples. The results showed that the raised F3 frequency was perceived more often as positive than the other samples. It was concluded that samples with sufficient energy in the high frequency area of F3 may affect perception of positive valence from a signal.

**Table 5 | The line "Count" in the confusion matrix of the emotions expressed and emotions perceived shows the numbers of answers given.**

*The integers presented in bold face are the emotions identified in numbers. The other integers on the "Count" line show the numbers of confusions with the other emotions. The line "% within emotion perceived" shows the percentage of the answers given for each emotion. The line "% within emotion expressed" shows the percentage for the identification of the emotion in question.*

However, it has been suggested by Laukkanen et al. (2008) that at least valence – if not actual emotions – can be perceived from emotional expressions even with several vocal cues eliminated (see also Waaramaa et al., 2006). This concurs with the idea of motor control and mirror neuron theory that speech can be understood rather in terms of its production than from the characteristics of the acoustic cues (see Introduction in this paper). Thus, general acoustic patterns for emotions can be only roughly presented.

#### **QUESTIONNAIRE**

Language differences emerged when the original Finnish questionnaire was translated into Russian and Swedish. It turned out that the statement "It is easy for me to respond to music." was translated into Russian in such a way that the grammatical subject (me) was changed into the object (on me): "Music has a strong effect on me." It can be speculated whether this affected the answers to this statement, since the percentage of "Yes" answers was about 50% lower in Russia than in the other countries. Another problem with the translation occurred when the Finnish word for "anger," *viha*, was translated into Swedish as *hat,* "hate," instead of its correct equivalent *ilska,* "ill temper," "anger." This problem was explained to the last 1/4 of the participants in the listening test in Sweden.

The statements "It is easy for me to learn a new melody" and "I am interested in singing" were connected to most of the identified emotions. This may partly refer to the underlying intonation of the speech (melody recognition) and partly to the similarities of vocally produced utterances recognized by those who were interested in singing, which is also a form of vocal expression.

#### **Table 6 | The quartiles for the shares of identified emotions studied by country.**


Most of those listeners who were interested in singing and who played a musical instrument were Swedish listeners. This result may be affected by the fact that the listening tests were conducted with help from the Music Acoustics Group at KTH, Royal Institute of Technology in Stockholm. Thus, many of the participants were involved with music through their professions, studies or hobbies. In this respect, the participants in the other countries studied may have been more heterogeneous than those in Sweden.

#### **LISTENING TEST**

The results of the present study showed that the percentage of the emotion identification and valence was clearly above the chance level in each of the five countries with different language and/or cultural backgrounds. Gender had no role in the perception of emotions or valence between the five countries studied. This result concurs with the findings by Koeda et al. (2013). Yet individual differences may be significant.

The speakers of the voice samples spoke Finnish as their native language, hence they read the non-sense text aloud using the Finnish prosody. This may be the reason why the Finnish listeners scored highest on the identified samples. A similar result was reported by Scherer et al. (2001) and by Abelin and Allwood (2000). Matsumoto et al. (2002) and Abelin (2004) have suggested that interpretation of prosody is easier for native speakers of the language in question. Abelin (2004) also has stated that the prosody of emotional expression is always related to the particular language spoken, and never occurs in isolation (see also Iversen et al., 2008). Thus, the Finnish listeners were at an advantage in the perception test as they obviously recognized the prosody more easily than the other listeners in the other countries, and could connect the prosody to the linguistic expressions even without meaningful words used. Finnish listeners perceived most rarely neutrality and most frequently joy and interest, but also disgust, when compared to the other countries.

**Table 7 | Test of model effect in the logistic regression model of the combined effects on the identification of the emotions.**

In their earlier study Schirmer and Kotz (2002) used event-related potentials (ERP) to study how their participants judged the valence of the prosody of a German verb and the emotional meaning of the word.

Interaction between emotional prosody and word meaning was found in females but not in males. Males appeared to process the meaning and the emotional prosody independently of each other. The researchers also argued that females are faster and more accurate in judging emotional information than males (Schirmer and Kotz, 2002; Schirmer et al., 2002, see also Besson et al., 2002; Imaizumi et al., 2004; Fecteau et al., 2005; Schirmer and Simpson, 2008). In the present investigation non-sense utterances were used. Thus there was no meaning in the words. However, gender differences were not studied here by ERP. Consequently, it can only be stated that no gender differences in the accuracy of the emotion or valence perception were found. This concurs with the findings combining brain evolution, gender differences, and music (Falk, 2000).

**Table 8 | Accuracy of the identification of the emotions and valence in percentages when studied by country and gender.**

**Table 9 | Results of the emotionally loaded samples identified in percentages by gender of speakers and listeners.**

The perceptual confusion of the three positive emotions interest, joy, and surprise may indicate that from the evolutionary-survival perspective it may not have been crucial to distinguish between these emotions. The emotional state of joy was poorly recognized. Scherer et al. (2001) have reported similar results for joy. Sauter et al. (2009) have stated that communication of positive emotions may be restricted to the members of the same social or cultural group and function as consolidation of that group.

Identification of anger was not particularly accurate in the present study. This may be in part due to the chosen expression types by the speakers. They tended to express more cold anger than hot anger or rage. Hot anger is undoubtedly easier to identify than cold anger. One reason for not using hot anger was that the expressions had to meet the quality criteria set by the software programs in order to conduct the acoustic analyses. Further, perception of anger (Ekman, 2004; Abelin, 2008a,b) and disgust (Banse and Scherer, 1996) may be more dominated by the visual than auditive information. However, the negative emotions of anger and disgust have been reported to be confused in visual perception tests as well (Matsumoto et al., 2002). Matsumoto et al. have suggested that the semantics of these emotions is similar and they share the elicitors of the emotion. Also, it may be easier to distinguish between positive and negative emotions (i.e., to identify valence) than between emotions which share the same valence, e.g., two negative emotions (Thompson et al., 2004). Moreover, Koeda et al. (2013) have reported significant cross-cultural differences in the perception of anger, disgust, and fear.

In the present study, the emotional state of fear tended to be well recognized from the auditive characteristics (see also, Abelin, 2008a,b). However, fear was frequently confused with sadness, obviously due to the similarities in their acoustic cues and the large number of voice breaks they shared. These negative emotions tended to be more irregularly expressed than the positive emotions (see also Juslin and Laukka, 2003a). Accordingly, Kotlyar and Morozov (1976; see also Scherer, 1995) have reported longer pauses between syllables and shorter syllable duration for fear than for the other emotions in the European opera singing tradition they studied. The confusion of sadness and fear concurs with the results of an earlier study by Scherer et al. (2001). Nevertheless, sadness and fear were well recognized: the two emotions together yielded 82% accuracy and valence 94% accuracy.

Laukka and Juslin (2007) and Lima and Castro (2011) have stated that recognition, especially of negative emotions, tends to begin to change during middle age. In the present study a negative correlation was found between age and the emotions identified for Finland, Russia, and the USA. Young listeners have been reported to be more accurate than older listeners at recognizing disgust, fear and anger from speech samples (ibid.). This was also seen in the present results. The negative emotion of sadness was significantly better recognized by young listeners, and the positive emotions of joy and surprise, and additionally neutrality, were significantly better recognized by old listeners. Moreover, the US participants were the youngest listeners and they chose disgust most frequently as an answer to the sentences.

From the evolutionary-survival and reproduction viewpoints it may be important for young people to be able to recognize negative emotions. Additionally, sadness may be an emotion which strengthens the bond between the members of the community. An accurate identification of positive emotions may imply older people's higher tolerance or understanding for the less serious features.

As some of the US listeners were offered course credit for participating in the present test, it may be speculated whether they were completely volunteers or not, and if on the one hand the willingness, or on the other hand the advantage gained, was the "real" motive for participating. Either way, it may have had an effect on the US results.

Even though the speakers were professionals, significant differences occurred in the perception of emotions expressed. It must be stressed that the samples produced by one actress were easiest to recognize throughout the countries, and this may explain the bias in the results of the perception. Coincidentally, somewhat problematic differences in the vocal samples used have also been reported previously (Scherer et al., 1991). Speaker gender has previously been reported to have a significant effect on the identification of emotions (Koeda et al., 2013). Several studies of the vocal characteristics of emotional expressions have also shown that individual differences are significant (e.g., Ladd et al., 1985).

Whether actor portrayals should or should not be used in emotion research has frequently been discussed. Utterances produced by actors are claimed to be stereotypical and controlled, not genuine expressions. However, in such claims genuine is never defined. This raises another question about how genuine (or pure) our emotions are in "real life" as they are mixed in our minds with other ongoing emotions quite randomly and individually (see Izard, 2007). Do we know how a pure single emotion always needs to be manifested by all humans? However, the emotional samples of the present study were fairly well recognized by the listeners. Thus, there must have been some cues, either universal or cultural, which the listeners thought they recognized as expressing the specific emotional states. A number of authorities, cultural, and social systems control and regulate our social and emotional behavior, competence, and skills (Banse and Scherer, 1996; Sauter et al., 2009). To have social competence or skills requires subjective control. Thus, it does not seem reasonable to claim that in "real life" emotions are uncontrolled and hence, "genuine." It seems rather that in "real social life" emotional expressions are restricted and socialized to fit the commonly accepted norms, rules, and limits of the particular society. Consequently, it may sometimes be difficult to interpret the emotional message if the verbal and non-verbal signals are ambiguous. The expressions produced by an actor may thus be more simple and clear as he only uses those vocal cues which are necessary to convey the target emotion. This in turn, may lack realistic situational constraints (Scherer and Ellgring, 2007).

#### **VOCAL EMOTIONS AND MUSIC**

Humans tend to remember better the general structure of the melody line, i.e., the contour, than the exact sizes of individual intervals between tones (Levitin and Tirovolas, 2010). The prosodic contour of an utterance may underlie the significance of a musical phrase or proto-musical behavior (Cross, 2001). According to Panksepp (2009/2010) it is possible that without prosodic pre-adaptations from evolving humans music might never have emerged. Juslin and Laukka (2003a) have suggested that the emotional expressiveness of music is based on the similarities of the emotional acoustic cues in vocal expressions. Hence, emotional music and speech may engage the same neural processes (Juslin and Västfjäll, 2008).

In the present investigation, the positive emotions were expressed with fewer voice breaks and in a more rhythmical manner than the negative emotions. Speaking in a friendly manner has been shown to carry more melodic characteristics than speaking in an unfriendly way (Fónagy, 1981). Motherese, the speech directed to babies, is also melodic and rhythmic (Trehub, 2003). Melodicity has been suggested to be a third dimension apart from pitch and time. Melodicity is defined as "the perceptual response to the higher or lower degree of regularity/continuity/predictability of the fundamental frequency curve within each syllable" (Fónagy, 1981). Melodicity can also be used as a means of identifying the emotion. One male listener in the present study explained how he perceived the emotional samples as melodies and based on the melody he decided which emotion he heard. His identification was exceptionally accurate.

#### **CONCLUSION**

Identification of emotions from speech samples tended to be affected by voice quality and by a similar language and/or cultural background. Hence, vocal non-verbal communication affects interpretation of emotions even in the absence of language. It tends to be interpreted differently by speakers of different languages. Musical interests facilitate distinguishing between emotions.

Finally, it has to be stated that all the five countries studied are culturally relatively close to each other. In a future study a clearly different culture representing a totally different language background should be included in the comparison of the countries. This culture and language will be Arabic in Egypt.

#### **ACKNOWLEDGMENTS**

First of all the authors express their special gratitude to the participants in the listening tests in Finland, Estonia, Sweden, USA, and Russia, and the contact persons who made the listening tests possible: Director, Dr. Pille Pruulmann-Vengerfeldt and the staff, Institute of Journalism and Communication, University of Tartu, Tartu, Estonia; Professor Sten Ternström and his students Ragnar Schön and Evert Lagerberg, the Music Acoustics group, KTH, Royal Institute of Technology, Stockholm, Sweden; Assistant Professor Graham D. Bodie and Dr. Christopher C. Gearhart, Department of Communication Studies, Louisiana State University, LA, USA; and Director, Dr. Pavel Skrelin and Tatiana Chukaeva, Department of Phonetics, Saint Petersburg State University, Saint Petersburg, Russia. The authors would also like to thank Hanna-Mari Puuska M.Sc. for statistical analyses, Virginia Mattila M. A. for language correction of the manuscript, and the translators for translating the questionnaire. This study was supported by the Academy of Finland (grant no. 1139321).

#### **REFERENCES**


9, 235–248. doi:10.1016/S0892-1997(05)80231-0


music training, predicts recognition of emotional speech prosody. *Emotion* 8, 838–849. doi:10.1037/a0014080


*in the Vocal Expression of Emotions.* Academic dissertation. Tampere University Press. Tampere.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 February 2013; accepted: 27 May 2013; published online: 21 June 2013.*

*Citation: Waaramaa T and Leisiö T (2013) Perception of emotionally loaded vocal expressions and its connection to responses to music. A cross-cultural investigation: Estonia, Finland, Sweden, Russia, and the USA. Front. Psychol. 4:344. doi: 10.3389/fpsyg.2013.00344*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Waaramaa and Leisiö. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

#### **APPENDIX**


#### **PLEASE CHOOSE FROM THESE EMOTIONS WHICH OF THEM YOU HEAR:**


## Cross-cultural differences in the processing of non-verbal affective vocalizations by Japanese and Canadian listeners

#### **Michihiko Koeda<sup>1,2</sup>\*, Pascal Belin<sup>2</sup>, Tomoko Hama<sup>3</sup>, Tadashi Masuda<sup>4</sup>, Masato Matsuura<sup>3</sup> and Yoshiro Okubo<sup>1</sup>**

<sup>1</sup> Department of Neuropsychiatry, Nippon Medical School, Tokyo, Japan

<sup>2</sup> Voice Neurocognition Laboratory, Institute of Neuroscience and Psychology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK

<sup>3</sup> Department of Biofunctional Informatics, Tokyo Medical and Dental University, Tokyo, Japan

<sup>4</sup> Division of Human Support System, Faculty of Symbiotic Systems Science, Fukushima University, Fukushima, Japan

#### **Edited by:**

Anjali Bhatara, Université Paris Descartes, France

#### **Reviewed by:**

Jan Van Den Stock, Katholieke Universiteit Leuven, Belgium

Keiko Ishii, Kobe University, Japan

#### **\*Correspondence:**

Michihiko Koeda, Department of Neuropsychiatry, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8603, Japan. e-mail: mkoeda@nms.ac.jp

The Montreal Affective Voices (MAVs) consist of a database of non-verbal affect bursts portrayed by Canadian actors, and high recognition accuracies were observed in Canadian listeners. Whether listeners from other cultures would be as accurate is unclear. We tested for cross-cultural differences in perception of the MAVs: Japanese listeners were asked to rate the MAVs on several affective dimensions and ratings were compared to those obtained by Canadian listeners. Significant Group × Emotion interactions were observed for ratings of Intensity, Valence, and Arousal. Whereas Intensity and Valence ratings did not differ across cultural groups for sad and happy vocalizations, they were significantly less intense and less negative in Japanese listeners for angry, disgusted, and fearful vocalizations. Similarly, pleased vocalizations were rated as less intense and less positive by Japanese listeners. These results demonstrate important cross-cultural differences in affective perception not just of non-verbal vocalizations expressing positive affect (Sauter et al., 2010), but also of vocalizations expressing basic negative emotions.

**Keywords: Montreal Affective Voices, emotion, voice, cross-cultural differences, social cognition**

#### **INTRODUCTION**

Vocal affective processing, like facial affective processing, has an important role in ensuring smooth communication during human social interaction. Facial expressions are generally recognized as the universal language of emotion (Ekman and Friesen, 1971; Ekman et al., 1987; Ekman, 1994; Izard, 1994; Jack et al., 2012); however, several studies have demonstrated cross-cultural differences in facial expression between Western and Eastern groups (Ekman and Friesen, 1971; Ekman et al., 1987; Matsumoto and Ekman, 1989; Izard, 1994; Yrizarry et al., 1998; Elfenbein and Ambady, 2002; Jack et al., 2009, 2012). Whether such cross-cultural differences also exist in the recognition of emotional vocalizations is not clear.

Most previous cross-cultural studies of auditory perception have investigated the processing of emotional Valence using word stimuli (Scherer and Wallbott, 1994; Kitayama and Ishii, 2002; Ishii et al., 2003; Min and Schirmer, 2011). One important study demonstrated cross-cultural differences in the rating of Intensity when subjects recognized the meaning of words expressing major emotions such as joy, fear, anger, sadness, and disgust (Scherer and Wallbott, 1994). Another study examined cross-cultural differences in the perception of emotional words (Kitayama and Ishii, 2002). This study indicated that native English speakers spontaneously pay more attention to verbal content than to vocal tone when they recognize emotional words, whereas native Japanese speakers spontaneously attend more to vocal tone than to verbal content. A further study has shown that Japanese participants are more sensitive to vocal tone than Dutch participants in an experiment on the multisensory perception of emotion (Tanaka et al., 2010). Further, one other study demonstrated cross-cultural differences in semantic processing of emotional words (Min and Schirmer, 2011), but found no difference in the processing of emotional prosody between native and non-native listeners. These studies suggest cross-cultural differences in auditory recognition of emotional words.

Studies of affective perception in speech prosody are made complex, in particular, by the potential interactions between the affective and the linguistic contents of speech (Scherer et al., 1984; Murray and Arnott, 1993; Banse and Scherer, 1996; Juslin and Laukka, 2003). To avoid this interaction, some studies have controlled the processing of semantic content using pseudo-words (Murray and Arnott, 1993; Schirmer et al., 2005) or pseudosentences (Ekman and Friesen, 1971; Pannekamp et al., 2005; Schirmer et al., 2005). The other previous study has employed a set of low-pass filtered vocal stimuli to select the final set of emotional utterances (Ishii et al., 2003), i.e., non-verbal vocalizations often accompanying strong emotional states such as laughs or screams of fear. Non-verbal affective vocalizations are ideally suited to investigations of cross-cultural differences in the perception of affective information in the voice since they eliminate the need to account for language differences between groups.

A recent study compared the perception of such non-verbal affective vocalizations by listeners from two highly different cultures: Westerners vs. inhabitants of remote Namibian villages. Non-verbal vocalizations expressing negative emotions could be recognized by the other culture much better than those expressing positive emotions, which led the authors to propose that a number of primarily negative emotions have vocalizations that can be recognized across cultures while most positive emotions are communicated with culture-specific signals (Sauter et al., 2010). However, this difference could be specific to English vs. Namibian groups, reflecting for instance different amounts of exposure to vocalizations through media or social interactions, and might not generalize to other cultures.

In the present experiment, we tested for cross-cultural differences in the perception of affective vocalizations between two cultures much more comparable in socio-economic status and exposure to vocalizations: Canadian vs. Japanese participants. Stimuli consisted of the Montreal Affective Voices (MAVs; Belin et al., 2008), a set of 90 non-verbal affect bursts produced by 10 actors and corresponding to emotions of Anger, Disgust, Fear, Pain, Sadness, Surprise, Happiness, and Pleasure. The MAVs have been validated in a sample of Canadian listeners and showed high inter-participant reliability in judgments of emotional Intensity, Valence, and Arousal as well as high hit rates in emotional recognition (Belin et al., 2008). Here, we collected affective ratings from Japanese listeners using similar procedures and compared them with the ratings obtained from the Canadian listeners. Before the experiment, we predicted that ratings of negative emotions would be culturally universal, whereas cross-cultural differences would exist in ratings of positive emotions.

#### **MATERIALS AND METHODS**

#### **SUBJECTS**

Thirty Japanese subjects (15 male, 15 female) participated in this study. Their average age was 22.3 ± 1.4 years, and they had 14.1 ± 0.3 years of education. The data of the Japanese subjects were compared with those of 29 Canadian subjects (14 male, 15 female; average age: 23.3 ± 1.5 years; Belin et al., 2008). Both the Japanese and the Canadian samples consisted exclusively of undergraduate students.

After a thorough explanation of the study, written informed consent was obtained from all subjects, and the study was approved by the Ethics Committee of Nippon Medical School.

#### **VOICE MATERIALS**

The MAVs: 10 French-Canadian actors expressed specific emotional vocalizations and non-emotional vocalizations (neutral sounds) using "ah" sounds. The eight emotional vocalizations were angry, disgusted, fearful, painful, sad, surprised, happy, and pleased. The simple "ah" sounds were used to control the influence of lexical-semantic processing. Since each of the eight emotional vocalizations and the neutral vocalization were spoken by 10 actors, the total number of MAVs sounds was 90. The MAVs are available at: http://vnl.psy.gla.ac.uk/

#### **EVALUATION SCALE**

Each emotional vocalization was evaluated using three criteria: perceived emotional Intensity in each of the eight Emotions, perceived Valence, and perceived Arousal. Each scale had a range from 0 to 100.

The Valence scale represented the extent of positive or negative emotion expressed by the vocalization: 0 was extremely negative, and 100 was extremely positive. The Arousal scale represented the extent of excitement expressed by the vocalization: 0 was extremely calm, and 100 was extremely excited. The Intensity scale represented the Intensity of a given emotion expressed by the vocalization: 0 was not at all intense, and 100 was extremely intense. The Intensity scale was used for eight emotions: Anger, Disgust, Fear, Pain, Sadness, Surprise, Happiness, and Pleasure.

#### **METHODS OF EVALUATION BY PARTICIPANTS**

The MAVs vocalizations were played on a computer in a pseudorandom order. The subjects listened with headphones at a comfortable hearing level, and they evaluated each emotional vocalization for perceived Intensity, Valence, and Arousal using a visual analog scale in English on a computer (10 ratings per vocalization: 8 Intensity ratings, 1 Valence rating, 1 Arousal rating). Simultaneously, participants were given a printed Japanese translation of the scale labels, and by referring to this Japanese sheet, the test was performed using exactly the same procedure as in the Canadian study (Belin et al., 2008). All Japanese participants performed the experiment using a translation sheet with emotional words translated from English to Japanese. Based on previous studies (Scherer and Wallbott, 1994), the Japanese translation of English emotional labels was independently assessed by three clinical psychologists. Through their discussion, the appropriate emotional labels were determined.

#### **STATISTICAL ANALYSIS**

Statistical calculations were made using SPSS (Statistical Package for Social Science) Version 19.0. The Japanese data and the published Canadian data, obtained with permission for verification, were statistically analyzed. A previous study demonstrated gender effects in Canadian participants using the MAVs (Belin et al., 2008). Using the same methods to reveal gender effects, an ANOVA with Emotion, Actor gender, and Participant gender as factors was calculated for ratings by the Japanese listeners. Further, to clarify the cross-cultural effect between Japanese and Canadian participants, three mixed two-way ANOVAs were calculated on ratings of Intensity, Valence, and Arousal. For each mixed ANOVA, Mauchly's test of sphericity was used to verify the equality of the variances of the differences between Emotions. If sphericity could not be assumed according to Mauchly's test, the Greenhouse–Geisser correction was applied.
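As a rough illustration of the sphericity correction mentioned above, the following sketch computes the Greenhouse–Geisser epsilon from a covariance matrix of the repeated measures (here, the Emotion conditions). The function name and the toy matrices are illustrative assumptions and are not taken from the study, which ran its analysis in SPSS; in a mixed design the covariance matrix would normally be pooled within listener groups.

```python
def gg_epsilon_from_cov(cov):
    """Greenhouse-Geisser epsilon for a k x k covariance matrix (list of lists).

    epsilon = trace(Sc)^2 / ((k - 1) * sum of squared elements of Sc),
    where Sc is the double-centered covariance matrix.
    """
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    col_means = [sum(cov[i][j] for i in range(k)) / k for j in range(k)]
    grand_mean = sum(row_means) / k

    # Double-center: subtract row and column means, add back the grand mean.
    centered = [
        [cov[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(value * value for row in centered for value in row)
    return trace ** 2 / ((k - 1) * sum_sq)


# Toy covariance matrices (invented for illustration only).
spherical = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
degenerate = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print(gg_epsilon_from_cov(spherical))   # upper bound: sphericity holds
print(gg_epsilon_from_cov(degenerate))  # lower bound: 1 / (k - 1)
```

Epsilon ranges from 1/(k − 1) to 1; the effect and error degrees of freedom of the repeated-measures F test are multiplied by it, which is why corrected degrees of freedom are generally not integers.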

#### **RELIABILITY AND ACCURACY**

First, we analyzed the inter-subject reliability of the ratings using Cronbach's alpha. Next, following the previous report (Belin et al., 2008), we examined the accuracy of emotional recognition from the Intensity ratings using measures of sensitivity (hit rate, by Emotion) and specificity (correct rejection rate, by rating scale). For each vocalization, participants rated the perceived emotional Intensity along each of eight different scales (Anger, Disgust, Fear, Pain, Sadness, Surprise, Happiness, and Pleasure). To calculate sensitivity, for a given portrayed emotion, a maximum Intensity rating in the corresponding scale (i.e., if the Intensity rating of Anger was highest when the subject listened to an angry vocalization) was taken as a hit; otherwise, as a miss. In other words, emotions with high hit rates are those that are well recognized, i.e., that scored highest on the scale of the intended emotion. Conversely, specificity relates to the extent to which the rating scale measures what it is intended to measure. To calculate specificity for a given rating scale, if the maximum score was obtained for the corresponding portrayed emotion across the eight vocalizations from one actor (i.e., when the subject listened to the vocalizations by actor 1, if the rating of Disgust was highest for the disgusted vocalization among the eight emotional items), it was taken as a correct rejection; otherwise, as a false alarm. A highly specific rating scale is one for which the corresponding vocalization obtains the highest score. In other words, specificity is a measure of how specific a rating scale is to an emotion.
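The two accuracy measures described above can be sketched as follows for one participant and one actor. This is a minimal sketch under stated assumptions: the data layout (`ratings[portrayed][scale]`), the function names, and the rule that ties count as a miss or false alarm are all hypothetical, and the numbers are invented toy data rather than values from the study.

```python
def is_hit(ratings, portrayed):
    """Hit: the scale matching the portrayed emotion has the strictly highest
    rating among all scales for that vocalization (row-wise maximum)."""
    row = ratings[portrayed]
    target = row[portrayed]
    return all(target > value for scale, value in row.items() if scale != portrayed)


def is_correct_rejection(ratings, scale):
    """Correct rejection: on a given scale, the matching vocalization has the
    strictly highest rating among all vocalizations (column-wise maximum)."""
    target = ratings[scale][scale]
    return all(target > ratings[emotion][scale] for emotion in ratings if emotion != scale)


def rate(flags):
    """Proportion of True values in a dict of booleans."""
    return sum(flags.values()) / len(flags)


# Toy ratings for three emotions only (invented numbers, 0-100 scale).
toy = {
    "Fear":     {"Fear": 40, "Surprise": 55, "Anger": 10},
    "Surprise": {"Fear": 20, "Surprise": 70, "Anger": 5},
    "Anger":    {"Fear": 15, "Surprise": 30, "Anger": 60},
}

hits = {emotion: is_hit(toy, emotion) for emotion in toy}
rejections = {scale: is_correct_rejection(toy, scale) for scale in toy}
print(hits, rate(hits))
print(rejections, rate(rejections))
```

In the toy data the fearful vocalization is a miss, because it is rated higher on Surprise than on Fear, yet the Fear scale still scores a correct rejection, because no other vocalization is rated higher on Fear. This shows why the two measures can diverge.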

#### **RESULTS**

#### **AFFECTIVE RATING**

Inter-participant (30 participants) reliability across the 90 items [10 rating scales (Valence, Arousal, eight emotional Intensities) × 9 emotional sounds] was analyzed: Cronbach's alpha for the Japanese participants was 0.941 [*F*(89, 299) = 230.6, *p* < 0.001]. Since this reliability for 30 subjects is very high, the ratings of the 10 actors' vocalizations were averaged across all 30 Japanese participants. [Canadian participants had an inter-participant reliability of 0.978 (Belin et al., 2008).] **Table 1** shows the averaged ratings of Intensity, Valence, and Arousal for the present sample of Japanese participants and the Canadian participants in the study of Belin et al. (2008). **Figure 1** shows the distribution (average ± 2 SD) of ratings of 1-1. Intensity, 1-2. Valence, and 1-3. Arousal in Japanese and Canadian participants.
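For readers unfamiliar with the statistic, the following minimal sketch shows how Cronbach's alpha can be computed when participants are treated as the "items" and the rated stimuli as the cases. The function name and the toy ratings are illustrative assumptions, not the study's data or its SPSS procedure.

```python
from statistics import variance


def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha for a list of raters, each a list of ratings
    over the same stimuli (same order, same length)."""
    k = len(ratings_by_rater)
    n_stimuli = len(ratings_by_rater[0])
    # Sum of ratings across raters for each stimulus.
    totals = [sum(rater[i] for rater in ratings_by_rater) for i in range(n_stimuli)]
    # Sum of each rater's own variance across stimuli.
    sum_rater_var = sum(variance(rater) for rater in ratings_by_rater)
    return k / (k - 1) * (1 - sum_rater_var / variance(totals))


# Toy example (invented numbers): three raters who agree up to a constant shift.
toy_raters = [
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [10, 20, 30, 40],
]
print(cronbach_alpha(toy_raters))
```

Raters who rank the stimuli identically give an alpha at or near 1, whereas unrelated ratings push it toward 0.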

#### **INTENSITY**

A mixed two-way ANOVA with listeners' Group (Japanese, Canadian) and Emotion (*n* = 8) as factors was calculated on Intensity scores. A significant main effect was revealed between listener's Groups [*F*(1, 57) = 20.828, *p* < 0.001] as well as among the Emotions [*F*(5.5, 313.5) = 40.520, *p* < 0.001; Greenhouse–Geisser's test]. Crucially, a significant interaction between Group and Emotion was observed, *F*(5.5, 313.5) = 9.137, *p* < 0.001, (**Figure 1A**) indicating that rating differences between the two groups varied with the specific Emotion considered. *Post hoc* tests showed that Intensity ratings from Japanese listeners were significantly lower than ratings from Caucasian listeners for Anger, Disgust, Fear, Surprise, and Pleasure (*t*-test, *p* < 0.05/8: Anger, *t* = −4.358; Disgust, *t* = −4.756; Fear, *t* = −3.073; Surprise, *t* = −2.851; Pleasure, *t* = −6.737: **Table 1**; **Figure 1A**).
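The corrected threshold and degrees of freedom reported above can be checked with simple arithmetic. The sketch below assumes the usual relation for a mixed design, in which the Greenhouse–Geisser corrected error degrees of freedom equal the corrected effect degrees of freedom multiplied by the between-subject error degrees of freedom (57 here); the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_tests


def implied_epsilon(corrected_effect_df, n_levels):
    """Sphericity epsilon implied by a corrected effect df for n_levels conditions."""
    return corrected_effect_df / (n_levels - 1)


def corrected_error_df(corrected_effect_df, between_error_df):
    """Corrected error df = epsilon * (k - 1) * between-subject error df."""
    return corrected_effect_df * between_error_df


# Eight Intensity comparisons at an overall alpha of 0.05.
print(bonferroni_threshold(0.05, 8))
# Intensity ANOVA: reported effect df of 5.5 for eight Emotions, error df of 57.
print(implied_epsilon(5.5, 8))
print(corrected_error_df(5.5, 57))
```

Under these assumptions the per-comparison threshold is 0.00625, and 5.5 × 57 = 313.5 agrees with the reported *F*(5.5, 313.5).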

#### **VALENCE**

A mixed two-way ANOVA with listeners' Group (Japanese, Canadian) and Emotion (*n* = 9) as factors was calculated on Valence scores. There was a significant main effect of listeners' Group: *F*(1, 57) = 5.920, *p* < 0.018, as well as a significant main effect of Emotion *F*(4.3, 244.3) = 224.926, *p* < 0.001 (Greenhouse–Geisser's test). Crucially, a significant interaction between Group and Emotion was observed: *F*(4.3, 244.3) = 25.101, *p* < 0.001 (**Figure 1B**) indicating that rating differences between the two groups varied with the specific Emotion considered. *Post hoc* tests showed that Valence ratings from Japanese listeners were significantly higher than ratings from Caucasian listeners for Anger, Disgust, and Fear (*t*-test, *p* < 0.05/9: Anger, *t* = 6.696; Disgust, *t* = 3.608; Fear, *t* = 3.232: **Table 1**; **Figure 1B**), whereas the Valence rating from Japanese listeners was significantly lower than ratings from Caucasian listeners for Pleasure (*t*-test, *p* < 0.05/9; Pleasure, *t* = −8.121; **Table 1**, **Figure 1B**).

**FIGURE 1 | Distribution of ratings (error bar: mean ± SD) for each emotional sound judged by 30 Japanese and 30 Canadian participants for (A) Intensity, (B) Valence, and (C) Arousal.** Each horizontal axis represents each rating score (0–100). Each vertical axis shows categories

#### **AROUSAL**

A mixed two-way ANOVA with listeners' Group (Japanese, Canadian) and Emotion (*n* = 9) as factors was calculated on Arousal scores. There was no significant main effect of Group: *F*(1, 57) = 2.099, *p* > 0.05, whereas there was a significant main effect of Emotion *F*(4.4, 250.5) = 158.524, *p* < 0.001 (Greenhouse– Geisser's test). Crucially, a significant interaction between Group and Emotion was observed: *F*(4.4, 250.5) = 8.955, *p* < 0.001 (**Figure 1C**), indicating that rating differences between the two groups varied with the specific Emotion considered. *Post hoc* tests showed that the Arousal ratings from Japanese listeners were significantly higher than ratings from Caucasian listeners for sad vocalizations (*t*-test, *p* < 0.05/9: sad, *t* = 4.334: **Table 1**; **Figure 1C**), whereas the other Emotions were not significantly different between Japanese and Canadian participants (*t*-test, *p* > 0.05/9: **Table 1**; **Figure 1C**).

#### **SENSITIVITY AND SPECIFICITY**

We evaluated the Intensity ratings for their sensitivity (hit rate, by Emotion) and specificity (correct rejection rate, by rating scale). A maximum Intensity rating in the scale corresponding to the portrayed emotion was considered as a hit; otherwise, as a miss. **Table 2** shows the Intensity ratings of portrayed emotions for Japanese and Canadian participants: means of hit rates by participants and means of correct rejection rates by participants.

Mixed two-way ANOVAs with listener's Group and Emotion (*n* = 8) as factors were calculated on the scores of sensitivity and specificity, respectively. In both sensitivity and specificity, a significant main effect of Group was observed [sensitivity: *F*(1, 57) = 51.6, *p* < 0.001; specificity: *F*(1, 57) = 44.8, *p* < 0.001] as well as main effects of Emotion [sensitivity: *F*(5.4, 310) = 38.0, *p* < 0.001; specificity: *F*(5.6, 320) = 41.5, *p* < 0.001, Greenhouse–Geisser's test]. Interaction effects (Group × Emotion) for sensitivity and specificity were also observed [sensitivity: *F*(5.4, 310) = 9.0, *p* < 0.001; specificity: *F*(5.6, 320) = 11.0, *p* < 0.001], indicating that rating differences between the two Groups varied with the specific Emotion considered.

There were significant differences in hit rates between Japanese and Canadian participants for angry, disgusted, fearful, painful, and pleased actors' vocalizations (*p* < 0.05/8, *t*-test): hit rates for these emotions were all lower in Japanese participants. In correct rejection rate, there were significant differences between Japanese and Canadian participants for Disgust and Fear ratings scales, with lower correct rejection rates in Japanese listeners (*p* < 0.05/8).

In Japanese participants, hit rates for each Emotion varied greatly, from 25% for fearful to 79% for sad. Hit rates and correct rejection rate to happy, sad, and surprised vocalizations were relatively high (more than 50%), whereas hit rates and correct rejection rate to angry, disgusted, fearful, painful, and pleased vocalizations were lower (less than 50%).

In **Table 2**, the maximum Intensity rating for each portrayed emotion is shown in bold. For fearful vocalizations only, the Emotion with a maximum score by Japanese participants was different from the portrayed emotion. Japanese listeners on average gave higher Intensity rating in the Surprise scale (66%) than the Fear scale (54%) in response to fearful vocalizations. For all other Emotions, Japanese participants gave the maximum ratings in the scale corresponding to the portrayed emotion, as did the Canadian listeners.

#### **GENDER DIFFERENCES OF ACTOR AND PARTICIPANT**

We examined the effects of participant's and actor's gender on hit rates in Japanese participants (**Figure 2**). A three-way mixed ANOVA was calculated with the factors of actor's gender and participant's gender as well as Emotion in Japanese participants. In addition to a significant effect of the emotion [*F*(1, 56) = 70.285, *p* < 0.001], a significant effect of actor's gender [*F*(1, 56) = 4.003, *p* ≤ 0.05] was observed, whereas no significant effect was revealed in participant's gender [*F*(1, 56) = 3.727, *p* > 0.05] or interaction effect: emotion × actor's gender [*F*(1, 56) < 1, *p* > 0.05], emotion × participant's gender [*F*(1, 56) = 2.496, *p* > 0.05], and emotion × actor's gender × participant's gender [*F*(1, 56) < 1, *p* > 0.05]. Hit rates were higher for vocalizations portrayed by the female actors irrespective of participant's gender (**Figure 2**).

**Table 2 | Intensity ratings (0–100) averaged across all actors for each portrayed emotion and Intensity ratings scale in Japanese and Canadian participants.**

<sup>a</sup>*p* < 0.001, <sup>b</sup>*p* < 0.05: strongest rating on the scale corresponding to the portrayed emotion (columns). <sup>c</sup>*p* < 0.001, <sup>d</sup>*p* < 0.05: strongest rating for the portrayed emotion corresponding to the rating scale (rows; Fisher's protected least significance test). \**p* < 0.05/8, *t*-test.

Further, we investigated the cultural effect on hit rates across Japanese and Canadian participants. A three-way ANOVA was calculated with the factors of listener's Group, actor's gender, and participant's gender. Significant main effects were observed for listener's Group, *F*(1, 110) = 83.211, *p* < 0.001; actor's gender, *F*(1, 110) = 11.675, *p* < 0.001; and participant's gender, *F*(1, 110) = 8.396, *p* = 0.005. No significant interactions were observed: listener's Group × participant's gender, *F*(1, 110) = 0.054, *p* > 0.05; listener's Group × actor's gender, *F*(1, 110) = 0.428, *p* > 0.05; actor's gender × participant's gender, *F*(1, 110) = 0.804, *p* > 0.05; and listener's Group × actor's gender × participant's gender, *F*(1, 110) = 0.071, *p* > 0.05. These results indicate that the effect of actor's gender on hit rates exists regardless of culture.

Gender differences were analyzed on ratings of Intensity, Valence, Arousal, and correct rejection rates as well as hit rates. A significant effect of actor's gender was observed in Intensity: *F*(1, 55) = 136.712, *p* < 0.001; Valence: *F*(1, 55) = 14.551, *p* < 0.001; Arousal: *F*(1, 55) = 182.899, *p* < 0.001; correct rejection rates: *F*(1, 55) = 23.131, *p* < 0.001. There was no significant effect of participant's gender in Intensity: *F*(1, 55) = 0.002, *p* > 0.05; Valence: *F*(1, 55) = 1.289, *p* > 0.05; Arousal: *F*(1, 55) = 0.655, *p* > 0.05. In correct rejection rate, a significant effect of participant's gender was observed: *F*(1, 55) = 6.343, *p* = 0.015. No interaction between actor's gender and participant's gender was observed [Intensity: *F*(1, 55) = 1.459, *p* > 0.05; Valence: *F*(1, 55) = 0.316, *p* > 0.05; Arousal: *F*(1, 55) = 2.191, *p* > 0.05; correct rejection rate: *F*(1, 55) = 0.797, *p* > 0.05].

#### **DISCUSSION**

We investigated cross-cultural differences between Japanese and Canadian participants in their perception of non-verbal affective vocalizations using the MAVs. The most intriguing finding is that significant Group × Emotion interactions were observed for all emotional ratings (Intensity, Valence, and Arousal). Ratings of Intensity and Valence for happy and sad vocalizations were not significantly different between Japanese and Canadian participants, whereas ratings for angry and pleased vocalizations were significantly different. In particular, Japanese subjects rated angry vocalizations as less negative in Valence than Canadian subjects did. Further, Japanese subjects rated pleased vocalizations as less positive in Valence than Canadian subjects did.

#### **CROSS-CULTURAL EFFECT FOR POSITIVE EMOTION**

Correct rejection rates (validity) of Happiness and Pleasure were not significantly different between Caucasian and Japanese subjects (**Table 2**: Happiness: Canadian 76% vs. Japanese 56%, Pleasure: Canadian 39% vs. Japanese 29%). These findings suggest that these two items are valid across cultures. In our study, there was a significant difference in the ratings (Intensity and Valence) for pleased vocalizations between Japanese and Canadian participants, whereas no significant difference was observed in the ratings for happy vocalizations. Although Happiness (laughter) was well recognized across cultures, there were apparent cultural differences in the perception of Pleasure.

A recent study between Western participants and Namibian participants demonstrated that the positive vocalizations of achievement, amusement, sensual pleasure, and relief were recognized as culture-specific signals although happy vocalizations were recognized cross-culturally (Sauter et al., 2010). Our present result is similar to the findings of this previous study. Further, in accordance with our results, recent studies of facial expression have shown that happy facial expression is not cross-culturally different between Caucasian and Asian participants (Shioiri et al., 1999; Jack et al., 2009, 2012). Our results suggest that the happy emotion is universal in vocal recognition as well as facial recognition. On the other hand, in the vocal recognition, other positive emotions such as Pleasure can show culture-specific biases.

#### **CROSS-CULTURAL EFFECT FOR NEGATIVE EMOTION**

Correct rejection rates (validity) of Anger, Pain, Sadness, and Surprise were not significantly different between Caucasian and Japanese subjects (**Table 2**). These findings suggest that these four items are valid across cultures. On the other hand, correct rejection rates of Disgust and Fear were significantly different between Caucasian and Japanese subjects (**Table 2**). These findings indicate that it is very difficult for Japanese listeners to identify these two emotions when listening to the MAVs.

A recent cross-cultural study between Western participants and Namibian participants suggested that primary basic negative emotions such as Anger, Disgust, Fear, Sadness, and Surprise can be recognized in both cultures (Sauter et al., 2010). We predicted that ratings of negative emotion are culturally universal. However, our results did not accord with that previous study, and we also observed cross-cultural differences in the recognition of Anger, Disgust, and Fear. **Figure 1** and **Table 1** show that Intensity ratings for angry, disgusted, fearful, and surprised vocalizations were significantly higher in the Canadian Group than in the Japanese Group. Valence ratings were higher in Japanese than in Canadians regarding some negative emotions (i.e., anger, disgust, and fear). These differences are consistent as higher perceived Intensity of a negative emotion is typically associated with lower (more negative) perceived Valence. These findings could reflect cross-cultural features of Intensity and Valence in negative emotion. Previous studies of facial expression have demonstrated that cross-cultural differences exist in the recognition of angry, disgusted, and fearful face (Shioiri et al., 1999; Jack et al., 2012). In agreement with these results, the recognition of Anger, Disgust, and Fear may reflect cross-cultural differences between Caucasian and Asian participants. On the other hand, the recognition of sad vocalizations (cries) was not significantly different, in agreement with Sauter et al. (2010). Previous studies of facial expression have shown cross-cultural differences in the recognition of sad expressions (Shioiri et al., 1999; Jack et al., 2012). This finding could reflect the fact that the recognition of sad vocalization could be more similar across cultures in comparison with the facial recognition. A previous study indicated that Japanese are severely affected by the meaning of words in recognition of Japanese emotions (Kitayama and Ishii, 2002). 
Another reason why Japanese listeners find it difficult to differentiate negative emotional vocalizations may be that they need more contextual information than Canadians to recognize emotions.

Concerning ratings of negative vocalizations, **Table 2** shows that hit rates (accuracy) and specificity were lower in Japanese participants than in Canadian participants for ratings of angry, disgusted, fearful, and painful vocalizations. In particular, the strongest pattern of confusion in Japanese participants was observed between fearful and surprised vocalizations. This is a typical pattern of confusion in Caucasian listeners as well (Belin et al., 2008). For both Japanese and Canadian participants, when listening to fearful vocalizations, the Intensity ratings for Surprise were high (Canadian: fearful 68 ± 2.5 vs. surprised 57 ± 3.0; Japanese: fearful 54 ± 5.9 vs. surprised 66 ± 5.2). These results suggest that it was difficult for Japanese participants to discriminate between fearful and surprised vocalizations. The hit rate of fearful vocalizations in Japanese participants was significantly lower than that in Canadian participants. In contrast, the hit rate of surprised vocalizations was not significantly different between Japanese and Canadian participants. This finding suggests that Japanese listeners tend to have difficulty identifying the emotional Intensity of fearful vocalizations from the MAVs.

A recent cross-cultural study between Japanese and Dutch participants demonstrated congruency effects displayed by happy face/voice and angry face/voice (Tanaka et al., 2010). This study indicated that, when listening to angry voices produced by Dutch speakers, Japanese participants were significantly less accurate than Dutch participants. In agreement with this result, our study showed that angry vocalizations were rated as significantly less intense and less negative in Valence by Japanese than by Canadian listeners.

#### **THE EFFECTS OF PARTICIPANT'S AND ACTOR'S GENDER IN JAPANESE**

Our present study has demonstrated a significant gender effect by actor in accordance with a previous Canadian study (Belin et al., 2008), and hit rates for female vocalizations are higher than for male vocalizations (**Figure 2**). In general, women are believed to be more emotionally expressive than are men (Fischer, 1993). A previous study of facial recognition also revealed that females had a higher rate of correct classification in comparison with males (Thayer and Johnsen, 2000). Our results suggest that Japanese as well as Canadians are also more accurate at recognizing female vocalizations.

A previous study demonstrated an effect of listener's gender in Canadian participants (Belin et al., 2008). In line with the previous study, in the analysis including Japanese and Canadian participants, the effect of participant's gender was replicated.

Our present study has at least two important limitations. First, stimuli consisted of acted vocalizations, not genuine expressions of emotion. Ideally, research on emotional perception would only use naturalistic stimuli. However, collecting genuine emotional expressions across different actors in comparable settings and for different emotions is very difficult and presents ethical problems. Second, in the present study, cross-cultural differences between Canadian and Japanese listeners were confirmed in the recognition of some emotional vocalizations. In the future, it will be necessary to develop a set of stimuli to increase cross-cultural validity.

In summary, we tested for cross-cultural differences between Japanese and Canadian listeners in the perception of non-verbal affective vocalizations using the MAVs. Significant Group × Emotion interactions were observed for all ratings of Intensity, Valence, and Arousal when comparing Japanese and Canadian participants. Although ratings did not differ across cultural groups for Pain, Surprise, and Happiness, they markedly differed for the angry, disgusted, and fearful vocalizations, which were rated by Japanese listeners as significantly less intense and less negative than by Canadian listeners; similarly, pleased vocalizations were rated as less intense and less positive by Japanese listeners. These results suggest, in line with Sauter et al. (2010), that there are cross-cultural differences in the perception of emotions through non-verbal vocalizations, and our findings further suggest that these differences are not necessarily only observed for positive emotions.

#### **ACKNOWLEDGMENTS**

We gratefully acknowledge the staff of Nippon Medical School Hospital; Section of Biofunctional Informatics, Tokyo Medical and Dental University; and Voice Neurocognition Laboratory, University of Glasgow. This work was supported by a Health and Labor Sciences Research Grant for Research on Psychiatric and Neurological Diseases and Mental Health (H22-seishin-ippan-002) from the Japanese Ministry of Health, Labor and Welfare.

#### **REFERENCES**

processing: an event-related brain potential study. *J. Cogn. Neurosci.* 17, 407–421.

facial affect: a multivariate analysis of recognition errors. *Scand. J. Psychol.* 41, 243–246.

Yrizarry, N., Matsumoto, D., and Wilson-Cohn, C. (1998). American-Japanese differences in multiscalar intensity ratings of universal facial expressions of emotion. *Motiv. Emot.* 22, 315–327.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 November 2012; accepted: 14 February 2013; published online: 19 March 2013.*

*Citation: Koeda M, Belin P, Hama T, Masuda T, Matsuura M and Okubo Y (2013) Cross-cultural differences in the processing of non-verbal affective vocalizations by Japanese and Canadian listeners. Front. Psychol. 4:105. doi: 10.3389/fpsyg.2013.00105*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Koeda, Belin, Hama, Masuda, Matsuura and Okubo. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Cross-cultural decoding of positive and negative non-linguistic emotion vocalizations

#### *Petri Laukka<sup>1</sup>\*, Hillary Anger Elfenbein<sup>2</sup>, Nela Söder<sup>1</sup>, Henrik Nordström<sup>1</sup>, Jean Althoff<sup>3</sup>, Wanda Chui<sup>4</sup>, Frederick K. Iraki<sup>5</sup>, Thomas Rockstuhl<sup>6</sup> and Nutankumar S. Thingujam<sup>7</sup>*

*<sup>1</sup> Department of Psychology, Stockholm University, Stockholm, Sweden*

*<sup>2</sup> Olin Business School, Washington University, St. Louis, MO, USA*

*<sup>3</sup> UQ Business School, University of Queensland, Brisbane, QLD, Australia*

*<sup>4</sup> Haas School of Business, University of California, Berkeley, CA, USA*

*<sup>5</sup> United States International University, Nairobi, Kenya*

*<sup>6</sup> Nanyang Business School, Nanyang Technological University, Singapore*

*<sup>7</sup> Department of Psychology, Sikkim University, Gangtok, India*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Pascal Belin, University of Glasgow, UK*

*Bill Thompson, Macquarie University, Australia*

#### *\*Correspondence:*

*Petri Laukka, Department of Psychology, Stockholm University, 106 91 Stockholm, Sweden e-mail: petri.laukka@psychology.su.se*

Which emotions are associated with universally recognized non-verbal signals? We address this issue by examining how reliably non-linguistic vocalizations (affect bursts) can convey emotions across cultures. Actors from India, Kenya, Singapore, and USA were instructed to produce vocalizations that would convey nine positive and nine negative emotions to listeners. The vocalizations were judged by Swedish listeners using a within-valence forced-choice procedure, where positive and negative emotions were judged in separate experiments. Results showed that listeners could recognize a wide range of positive and negative emotions with accuracy above chance. For positive emotions, we observed the highest recognition rates for relief, followed by lust, interest, serenity and positive surprise, with affection and pride receiving the lowest recognition rates. Anger, disgust, fear, sadness, and negative surprise received the highest recognition rates for negative emotions, with the lowest rates observed for guilt and shame. By way of summary, results showed that the voice can reveal both basic emotions and several positive emotions other than happiness across cultures, but self-conscious emotions such as guilt, pride, and shame seem not to be well recognized from non-linguistic vocalizations.

**Keywords: affect bursts, cross-cultural, emotion recognition, non-verbal behavior, positive emotions, vocalizations**

#### **INTRODUCTION**

Studies of non-verbal emotion expression have provided crucial input to many of the central debates in emotion science. Controversies ranging from the universality of emotions (e.g., Ekman, 1993; Russell, 1994; Elfenbein, 2013), to the wider issue of how emotions should be conceptualized (e.g., Scherer, 1986; Ekman, 1992; Barrett, 2006), have all been fueled by data from studies of emotion expression. Here, we address a fundamental question raised in these debates—namely which emotions can be communicated across cultures—by examining how reliably the voice can convey cross-culturally a wide range of both positive and negative emotions.

The human voice is a rich source of emotional information, and non-verbal vocal expressions come in two main forms: modifications of prosody (tone of voice) during speech (i.e., *prosodic expressions*), and non-speech vocal sounds such as breathing sounds, crying, hums, grunts, laughter, shrieks, and sighs (i.e., *non-linguistic vocalizations*). Extensive reviews have established that prosodic expressions of basic emotions such as anger, fear, happiness, and sadness are conveyed by acoustic patterns of cues related to pitch, intensity, voice quality, and durations (Juslin and Laukka, 2003; Scherer, 2003). Several studies have further shown that decoders are able to infer the emotional content of prosodic expressions across languages and cultural boundaries with accuracy above chance (e.g., Kramer, 1964; Beier and Zautra, 1972; Albas et al., 1976; van Bezooijen et al., 1983; Graham et al., 2001; Scherer et al., 2001; Thompson and Balkwill, 2006; Bryant and Barrett, 2008; Pell et al., 2009). These studies suggest that perception of prosodic expressions has a universal component, although meta-analyses have also shown that communication is more accurate when judges rate expressions from their own culture compared with unfamiliar cultures (Elfenbein and Ambady, 2002; Juslin and Laukka, 2003).

Non-linguistic vocalizations [sometimes also referred to as affect bursts; see Scherer (1994)] differ from prosodic expressions in important ways. For example, speech requires highly precise and coordinated movement of the articulators (e.g., lips, tongue, and larynx) in order to transmit linguistic information, whereas non-linguistic vocalizations are not constrained by linguistic codes and thus do not require such precise articulations (Scott et al., 2009). This entails that non-linguistic vocalizations can exhibit larger ranges for many acoustic features than prosodic expressions, as is evident when comparing, for example, pitch ranges in laughter vs. speech (Bachorowski et al., 2001). Compared to prosodic expressions, non-linguistic vocalizations may also be more strongly affected by physiological alterations (e.g., autonomic activation) in response to the appraisal of emotional situations and their effects on the vocal apparatus. Because vocal expressions are hypothesized to largely result from such emotion-related somatic alterations (see Scherer, 1986), non-linguistic vocalizations may be particularly suited for emotive communication.

However, compared to the large number of studies on prosodic expressions, relatively few studies have investigated emotion recognition from non-linguistic vocalizations (Schröder, 2003; Sauter and Scott, 2007; Belin et al., 2008; Hawk et al., 2009; Simon-Thomas et al., 2009; Sauter et al., 2010a; Lima et al., 2013). These studies show that decoders are generally accurate when judging basic emotions from non-linguistic vocalizations, often reaching higher recognition rates than for prosodic stimuli (e.g., Hawk et al., 2009). Some studies on vocalizations have also extended their coverage of emotions to include several emotions not generally viewed as basic. In particular, findings suggest that non-linguistic vocalizations may convey a wider palette of positive emotional states compared to facial expressions (Sauter and Scott, 2007; Simon-Thomas et al., 2009), as hypothesized by Ekman (1992). This suggests that different modalities of expression, such as facial and vocal expression, and perhaps also different varieties of expression within each modality, such as prosodic expressions and non-linguistic vocalizations, may be preferentially suited for expressing different emotions (see also Hawk et al., 2009; App et al., 2011).

It would seem that non-linguistic vocalizations, being unconstrained by conventions of language, would provide ideal stimuli for cross-cultural studies, but we are aware of very few previous studies on this topic. Sauter et al. (2010b) examined recognition of nine emotions, including basic emotions and additional positive emotions, across European English-speaking individuals and individuals from remote, culturally isolated Namibian villages. They reported successful communication of basic emotions across cultural barriers, whereas recognition of positive emotions reached accuracy above chance mainly in within-group conditions. Koeda et al. (2013), in turn, let individuals from Canada and Japan rate Canadian vocalizations of basic emotions with regard to perceived levels of activation, valence and intensity, and reported some group differences in ratings of valence and intensity for both positive and negative emotions. Previous research has thus provided initial findings of both cultural similarities and differences, but further research is needed to establish the degree of cross-cultural variance and invariance of non-linguistic vocalizations.

In the present study we double the number of included emotions compared to previous studies and examine recognition of 18 emotions in cross-cultural conditions. By including the widest selection of emotions to date in a cross-cultural study, we aim to examine the limits of what non-linguistic vocalizations can reveal about emotion in a cross-cultural context. Notably, our selection of emotions includes equally many positive (affection, amusement, happiness, interest, sexual lust, peacefulness/serenity, pride, relief, and positive surprise) and negative (anger, contempt, disgust, distress, fear, guilt, sadness, shame, and negative surprise) emotions. Very few previous cross-cultural studies—regardless of expression modality—have examined recognition of positive emotional states beyond happiness, and our study will therefore provide novel clues about the universality of positive emotion expressions.

### **STUDY 1—DECODING OF POSITIVE NON-LINGUISTIC VOCALIZATIONS**

#### **MATERIALS AND METHODS**

#### *Vocal stimuli*

We utilized non-linguistic vocalizations from the VENEC corpus, which is a large cross-cultural database of vocal emotion expressions portrayed by 100 professional actors (Laukka et al., 2010). The majority of stimuli in the VENEC corpus consist of prosodic expressions, but a subset of the actors also provided non-linguistic vocalizations, or affect bursts, and these stimuli are used in the present study. Actors from India, Kenya, Singapore, and USA were instructed to convey nine positive emotions (affection, amusement, happiness, interest, sexual lust, peacefulness/serenity, pride, relief, and positive surprise) by means of non-linguistic vocalizations. All vocalizations were intended to convey expressions with medium (moderately high) emotion intensity. Emotionally neutral vocalizations were also recorded, but these are not included in the current study.

The actors were instructed to express the emotions as convincingly as possible and in a similar way as in real emotional situations. To achieve this, the actors were first provided with scenarios describing typical situations in which each emotion may be elicited, based on current research on emotion appraisals (e.g., Ortony et al., 1988; Lazarus, 1991; Ellsworth and Scherer, 2003), and were then instructed to try to enact finding themselves in similar situations. As a further aid for producing convincing portrayals, they were also told to try to remember similar situations that they had experienced personally and that had evoked the specified emotions, and if possible to try to put themselves into the same emotional state of mind. Scherer and Bänziger (2010) have argued that a combination of scenarios and induction methods is likely to increase the authenticity and believability of the resulting portrayals because it discourages the use of stereotypical expressions.

The actors were free to choose whatever kinds of human sounds they thought fit for the purpose (e.g., breathing sounds, crying, hums, grunts, laughter, shrieks, and sighs). They were, however, told to avoid actual words (e.g., "heaven," "no," "yes") and vocalizations with conventionalized semantic meaning (e.g., "yuck," "ouch"), although non-linguistic interjections (e.g., "ah," "er," "hm," "oh") were allowed. Some actors nevertheless used words, and these stimuli were excluded in an initial screening of the stimuli. Non-linguistic vocalizations were not recorded for every actor, and the number of emotions for which each actor provided vocalizations also varied. In total, our selection included 213 positive non-linguistic vocalizations from 41 actors (India, *N* = 9; Kenya, *N* = 11; Singapore, *N* = 7; and USA, *N* = 14), and contained approximately equally many portrayals of each emotion from each culture (see **Table 1**). The selection further included approximately the same number of stimuli by female and male actors in each condition.

**Table 1 | Number of non-linguistic vocalizations for each emotion and culture.**

Recordings were conducted on location in each country (Pune, India; Nairobi, Kenya; Singapore, Singapore; and Berkeley, CA, USA), and the vocalizations were recorded directly onto a computer with 44 kHz sampling frequency using a high-quality microphone (sE Electronics USB2200A, Shanghai, China). The loudness of the stimuli varied widely, literally ranging from whispers to screams, and the amplitude of each stimulus was therefore peak normalized using *Adobe Audition* software (Adobe Systems Inc., San Jose, CA, USA). The normalization procedure controlled for differences in recording level between actors and softened the contrast between stimuli which would otherwise have been disturbingly loud or inaudibly quiet.
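As an illustration of what peak normalization does, the following minimal Python sketch rescales each stimulus so that its largest absolute sample value reaches a common target level. It is not the procedure implemented in *Adobe Audition*; the function name `peak_normalize`, the target level of 1.0, and the toy sample values are assumptions made for this example only.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    target_peak=1.0 is an assumed full-scale level, not a value from the study.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty signal: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# Hypothetical toy signals: a very quiet and a very loud recording.
whisper = [0.01, -0.02, 0.015]
scream = [0.6, -0.9, 0.75]

whisper_norm = peak_normalize(whisper)
scream_norm = peak_normalize(scream)

# After normalization both signals share the same peak amplitude.
print(max(abs(s) for s in whisper_norm), max(abs(s) for s in scream_norm))
```

Because a single gain factor is applied to each stimulus, the shape of the waveform within a stimulus is preserved; only the level differences between stimuli are reduced.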

#### *Participants and procedure*

Twenty-nine Swedish individuals, mainly university students, took part in the study (20 women; mean age = 31 years). Participants judged the vocalizations of positive emotions by choosing the one label that best represented the expression conveyed by each stimulus; the alternatives they could choose from were the same as the nine intended expressions (affection, amusement, happiness, interest, lust, peacefulness, pride, relief, and positive surprise). All participants were provided with dictionary definitions of each emotion, and also received the same emotion scenarios as did the actors, to make sure that they understood all of the included emotion labels.

Responses were scored as correct if the response matched the intended expressions of the emotion portrayals. Experiments were computerized and conducted individually using *MediaLab* software (Jarvis, 2008). Stimuli were presented in random order, and the participants were only allowed to listen to each stimulus once. The participants listened to stimuli through high-quality headphones, with the sound level kept constant across participants. Sessions lasted for ∼40 min, and participants received course credits or a movie ticket voucher as compensation for their participation.

#### **RESULTS**

**Table 2** shows the recognition rates and confusion patterns for positive emotions. The overall recognition rate was 39%, which is 3.5 times higher than the proportion expected by chance guessing (the chance level in a 9-alternative forced choice task is 11%; 1 out of 9). All emotions were recognized with accuracy above chance in at least some cultural conditions, as indicated by binomial tests. This suggests that a wide range of positive vocalizations were conveyed across cultures. Vocalizations of relief (mean recognition rate = 70%) were most accurately perceived, followed by lust (45%), interest (44%), serenity (43%), and positive surprise (42%). These emotions were not frequently confused with other states, although interest was sometimes confused with positive surprise, and serenity with relief.
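The chance level and the logic of a binomial test against chance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis script: the text does not spell out the exact unit of analysis, so the count of correct responses (12) and the number of tests used for the Bonferroni correction (36, assuming nine emotions by four cultures) are hypothetical, as is the function name `binomial_upper_tail`. Only the 9-alternative chance level, the 39% overall rate, and the 29 listeners come from the text.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact one-sided p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


chance = 1 / 9                    # 9-alternative forced choice, about 11%
ratio = round(0.39 / chance, 1)   # overall recognition rate relative to chance

n_listeners = 29                  # from the text (Study 1)
k_correct = 12                    # hypothetical number of correct responses
n_tests = 36                      # assumption: 9 emotions x 4 cultures
alpha_corrected = 0.05 / n_tests  # Bonferroni-corrected threshold

p_value = binomial_upper_tail(k_correct, n_listeners, chance)
print(ratio, p_value < alpha_corrected)
```

The one-sided upper tail is used here because the question is whether recognition exceeds chance, not whether it differs from chance in either direction.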

Happiness (36%) and amusement (32%) were symmetrically confused with each other at a rate equal to the proportion of accurate decoding, which suggests that vocalizations of these states are not easy to separate. Given the conceptual similarity between these states, this was hardly a surprising finding, and a combined happiness/amusement category received 60% accuracy. At the bottom end of recognizability, we found pride (22%) and affection (20%). Although recognized with above-chance accuracy in some conditions, these emotions were frequently misclassified, and vocalizations of both pride and affection were most commonly confused with interest.
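To make the notions of confusion pattern and combined category concrete, the sketch below tabulates forced-choice judgments into row-wise proportions and then scores a merged category. It shows one plausible way to compute a combined accuracy (a response counts as correct if both the intended and the chosen label fall in the merged set); the text does not state the exact computation used. The function names and the toy judgments are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def confusion_rates(judgments):
    """Map each intended label to the proportion of each chosen label."""
    counts = defaultdict(Counter)
    for intended, chosen in judgments:
        counts[intended][chosen] += 1
    return {
        intended: {chosen: n / sum(row.values()) for chosen, n in row.items()}
        for intended, row in counts.items()
    }


def merged_accuracy(judgments, merged):
    """Accuracy when all labels in `merged` are treated as one category."""
    relevant = [(i, c) for i, c in judgments if i in merged]
    hits = sum(1 for _, c in relevant if c in merged)
    return hits / len(relevant)


# Hypothetical toy judgments as (intended, chosen) pairs.
toy = (
    [("happiness", "happiness")] * 4
    + [("happiness", "amusement")] * 3
    + [("happiness", "relief")] * 3
    + [("amusement", "amusement")] * 3
    + [("amusement", "happiness")] * 3
    + [("amusement", "interest")] * 4
)

rates = confusion_rates(toy)
combined = merged_accuracy(toy, {"happiness", "amusement"})
print(rates["happiness"], combined)
```

In this toy example the two labels are confused with each other about as often as they are identified correctly, so merging them raises accuracy well above either single-label rate, which is the pattern the text describes for happiness and amusement.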

Inspection of the recognition rates as a function of speaker culture further revealed that both recognition and confusion patterns were similar across all four cultures (see **Table 2**). This suggests cross-cultural consistency with regard to which emotions were easy or hard to recognize, and which emotions were confused with each other and which were not. Nevertheless, some emotions were only recognized in some, but not in other, cultural conditions. For example, Swedish listeners did not accurately perceive amusement vocalizations from Indian stimuli, but instead judged them as surprised sounding. However, it is difficult to interpret such group differences, because they may result from group effects not having to do with culture *per se* (e.g., the Indian actors may simply not have been as successful in portraying amusement compared to actors from other cultures).

#### **STUDY 2—DECODING OF NEGATIVE NON-LINGUISTIC VOCALIZATIONS**

#### **MATERIALS AND METHODS**

#### *Vocal stimuli*

Non-linguistic vocalizations of nine negative emotions (anger, contempt, disgust, distress/pain, fear, guilt, sadness, shame, and negative surprise) portrayed by professional actors from India, Kenya, Singapore, and USA served as stimuli in Study 2. The vocalizations were selected from the VENEC corpus (Laukka et al., 2010) and were collected using the same methods as described for Study 1. In total, the selection contained 214 negative emotional vocalizations from 40 actors (India, *N* = 8; Kenya, *N* = 10; Singapore, *N* = 7; and USA, *N* = 15), see **Table 1** for details.

#### *Participants and procedure*

We used the same judgment procedures in Study 2, as previously described for Study 1, except that we presented negative vocalizations and response alternatives. Twenty-eight Swedish individuals (18 women; mean age = 31 years) judged the expressed emotion of each presented stimulus, by choosing one from nine alternatives (anger, contempt, disgust, distress, fear, guilt, sadness, shame, and negative surprise). Four of the participants had previously taken part in Study 1.

**Table 2 | Recognition rates and confusion patterns for non-linguistic vocalizations of nine positive emotions from four cultures.**

*Note: The recognition rates (percentage accuracy) for which the expression portrayed is the same as the expression judged are shown in the diagonal cells (marked in bold typeface). Asterisks denote recognition rates higher than what would be expected by chance guessing (11%), as indicated by binomial tests (ps* < *0.05, Bonferroni corrected; ps* < *0.001, uncorrected). Blank cells indicate misclassification rates of less than 10%.*

#### **RESULTS**

Recognition rates and confusion patterns for negative emotions are presented in **Table 3**. For negative emotions, the overall recognition rate was approximately four times higher than chance at 45%. Similar to Study 1, we conducted binomial tests to test whether the proportion of participants who chose the correct response alternative for each emotion was higher than the proportion that would be expected by chance guessing. All emotions were recognized with accuracy above chance in at least some conditions, which suggests that a wide range of negative emotions can be expressed cross-culturally through the voice. Disgust (mean recognition rate = 63%) was the best recognized emotion, followed by anger (57%), fear (57%), sadness (56%), negative surprise (53%), and contempt (44%). These emotions were seldom confused with other states, although contempt was sometimes confused with negative surprise.

**Table 3 | Recognition rates and confusion patterns for non-linguistic vocalizations of nine negative emotions from four cultures.**

*Note. The recognition rates (percentage accuracy) for which the expression portrayed is the same as the expression judged are shown in the diagonal cells (marked in bold typeface). Asterisks denote recognition rates higher than what would be expected by chance guessing (11%), as indicated by binomial tests (ps* < *0.05, Bonferroni corrected; ps* < *0.001, uncorrected). Blank cells indicate misclassification rates of less than 10%.*

Distress (33%) was frequently confused with both fear and sadness, which suggests that distress vocalizations may show some overlap with these emotions. The most frequently observed confusions occurred symmetrically between shame (mean recognition rate = 21%) and guilt (mean recognition rate = 20%). A joint shame/guilt category indeed received 40% accuracy, which could be interpreted as evidence for the notion that the voice can reveal a negative self-conscious emotion category. However, both shame and guilt were frequently confused also with other emotions, such as distress and negative surprise, which instead indicates that they may not be associated with distinct vocal signals.

**Table 3** further displays recognition rates as a function of speaker culture, and inspection revealed substantial cross-cultural consistency with regard to both recognition and confusion patterns. However, some cultural variability could also be observed. For example, Swedish listeners frequently confused distress vocalizations from India and Kenya with sadness, whereas distress vocalizations from Singapore and USA were instead confused with fear. However, as previously explained, we cannot know if group differences are caused by cultural factors or factors unrelated to culture.

#### **DISCUSSION**

The present results establish non-linguistic vocalizations as a rich and nuanced source of emotional signals. Across two studies, our results suggest that the voice can convey a wide range of positive (Study 1) as well as negative (Study 2) emotions across cultures. More specifically, we observed above-chance cross-cultural recognition of basic emotions such as anger, contempt, disgust, fear, happiness, sadness, and surprise. Notably, we also observed for the first time above-chance recognition of several positive emotions other than happiness—such as interest, lust, relief, and serenity—in a cross-cultural context. However, not all emotions were equally recognizable across cultures and we observed only modest recognition rates for affection, guilt, pride, and shame. The implications of these findings are discussed below in relation to the larger issue about which emotions are associated with universally recognized expressions.

Findings of universality in emotion expression are traditionally interpreted as support for the proposition that emotion expressions are based on biologically driven evolved mechanisms (e.g., Ekman, 1992), although this view also has its critics (e.g., Barrett, 2006). Non-linguistic vocalizations are often considered an especially "primitive" form of human emotion signaling that is functional already at the time of birth and that in many ways resembles animal expressions more than human speech (Owren et al., 2010; Briefer, 2012). Thus, it may be hypothesized that cross-culturally communicable vocalizations may, at least to a certain extent, be based on evolved biologically driven mechanisms (e.g., Ekman, 1992), such as physiological effects of emotion appraisals on the voice production apparatus (Scherer, 1986). Our observation of above-chance recognition of basic emotions corroborates findings from the sole previous cross-cultural study on non-linguistic vocalizations by Sauter et al. (2010b), as well as previous studies on prosodic and facial expressions (e.g., Elfenbein and Ambady, 2002; Juslin and Laukka, 2003), and suggests that basic emotion vocalizations have a universal component.

We included a wide selection of positive emotions, and our observation of above-chance recognition of positive states other than happiness expands upon previous studies conducted in a within-cultural context (e.g., Simon-Thomas et al., 2009). The finding of a universal component to positive emotion vocalizations may appear contrary to the previous findings of Sauter et al. (2010b), who reported largely non-significant cross-cultural recognition for positive emotions. However, the distinctions between different positive emotions are not well understood, and as a consequence different studies have included different positive states. Between our study and Sauter et al. (2010b), the only common positive emotions were amusement and relief. Whereas Sauter et al. (2010b) observed cross-cultural recognition for amusement (which they viewed as a basic emotion) but not relief, we instead observed above-chance recognition for both emotions (although amusement was frequently confused with happiness). The main difference between studies thus concerns recognition of relief only, and may have been caused by idiosyncratic differences in the sets of expressive stimuli used in the respective studies. Despite the fact that expressors and perceivers in our study came from different continents, it also remains a possibility that the cultural distances may have been larger in Sauter et al. (2010b) compared to our study.

Similar to our observations, previous within-cultural studies have also reported modest recognition rates for affection, guilt, pride, and shame (Hawk et al., 2009; Simon-Thomas et al., 2009). Taken together, current evidence thus suggests that these emotions may not be associated with highly distinct vocalizations. Guilt, pride, and shame involve reflection upon and evaluation of the self (Tangney and Tracy, 2012), which makes these emotions more dependent on complex cognitive skills compared to basic emotions. Cultures vary regarding how the self is conceptualized (Markus and Kitayama, 1991), and this may lead to culture-specific interpretations of situations particularly relevant for self-conscious emotions such as pride and shame (Imada and Ellsworth, 2011). There is thus a possibility that cultural variance may be especially salient for expressions of self-conscious emotions. Although we cannot draw this conclusion based on our current data, because we did not assess emotion recognition in both within- and cross-cultural conditions, this remains an interesting question for future studies. However, evidence also suggests that pride and shame are expressed in a similar fashion cross-culturally through facial and bodily cues (Tracy and Matsumoto, 2008), which leaves open the possibility that they may have distinct expressions through other modalities than the voice.

Comparing our results to previous studies on prosodic expressions, we note that disgust vocalizations received high accuracy rates in our study (as well as in most previous vocalization studies; e.g., Schröder, 2003), whereas disgust is often poorly recognized from prosodic stimuli (e.g., Banse and Scherer, 1996). This suggests that some emotions may be better decoded from vocalizations versus emotional prosody, and future studies could perform direct comparisons to establish which emotions are preferentially recognized from which type of expression. Hawk et al. (2009) reported higher accuracy for vocalizations compared to prosodic expressions for a range of mainly negative emotions, but comparisons for positive emotions are currently missing. Such studies could also include other expression channels (such as facial, bodily, olfactory, and tactile cues) in order to provide a foundation for understanding of which emotions are preferentially expressed through which modalities (e.g., App et al., 2011).

Our investigation also has several limitations which merit consideration. Recent cross-cultural studies on decoding of facial (Elfenbein et al., 2007), musical (Laukka et al., 2013), and vocal (Sauter et al., 2010b) expressions have reported evidence for an in-group advantage to the effect that decoders perform better for expressions from a familiar versus an unfamiliar culture. However, we only assessed decoding in cross-cultural conditions, which precluded investigation of an in-group advantage in the current study. The lack of a within-cultural baseline rate, together with the small number of stimuli in each emotion × culture cell, also prevents a meaningful comparison of recognition rates between cultures, because differences may have been caused by group effects other than culture. We further assessed positive and negative emotions in separate forced-choice experiments in order to avoid fatigue in the participants and to keep the number of response options at a manageable level. However, this design prevented us from investigating possible confusions between positive and negative expressions. The use of a forced-choice format has also been criticized on the grounds that it may lead to inflated recognition rates by enabling judges to use informed guessing strategies to a certain extent (e.g., Russell, 1994). Finally, we used portrayed rather than spontaneous vocalizations, whereas some previous studies have reported that acted expressions may be more prototypical and intense than spontaneous expressions (e.g., Laukka et al., 2012). We are addressing the question of a possible in-group advantage in ongoing cross-cultural judgment studies, and would welcome future studies that consider effects of the format of the judgment task and type of expressive stimuli on cross-cultural emotion decoding (e.g., Jürgens et al., 2013).

Non-linguistic vocalizations are heterogeneous and contain many different types of human sounds, and our sample can only represent a limited subset of all possible vocalizations. We instructed the actors to avoid the use of vocalizations with conventionalized semantic meaning, because the production and recognition of emblematic affect expressions is hypothesized to be strongly culture-dependent (see Scherer, 1994). However, it remains a possibility that some of our vocal stimuli nevertheless contained such culture-dependent information, and this may have reduced recognition accuracy for some emotion × culture combinations. Our study was limited to decoding, but future studies could also investigate how different emotions are encoded in the acoustic properties (such as pitch, intensity, voice quality, and durations; Sauter et al., 2010a; Lima et al., 2013) and in the segmental-phonemic structure (Schröder, 2003) of non-linguistic vocalizations. Currently, cross-cultural studies linking encoding and decoding are missing, but such studies have the potential to reveal which aspects of non-linguistic emotion vocalizations are culturally invariant and which rely on culture-dependent templates.

To conclude, our results show that non-linguistic vocalizations can convey detailed emotional information (not limited to the usual basic emotions, or activation and valence dimensions) to listeners across cultures. We therefore propose that vocalizations may provide ideal stimuli for theory development and applied research in emotion science. Compared to negative emotions, positive emotions have received much less attention, and as a result knowledge about the cognitive appraisals underlying different positive states, and their effects on physiology, is limited. Because vocalizations seem to convey a particularly wide range of positive states, we suggest that studies on non-linguistic vocalizations provide a promising avenue for investigating the distinctiveness of positive emotions.

#### **ACKNOWLEDGMENTS**

We acknowledge the following grants: Swedish Research Council 2006-1360 to Petri Laukka and US National Science Foundation BCS-0617624 to Hillary Anger Elfenbein.

#### **REFERENCES**


Bachorowski, J.-A., Smoski, M. J., and Owren, M. J. (2001). The acoustic features of human laughter. *J. Acoust. Soc. Am.* 110, 1581–1597. doi: 10.1121/1.1391244


28–58. doi: 10.1111/j.1745-6916.2006.00003.x


40, 531–539. doi: 10.3758/BRM.40.2.531


*Cogn. Emot.* 6, 169–200. doi: 10.1080/02699939208411068


differences in the processing of nonverbal affective vocalizations by Japanese and Canadian listeners. *Front. Psychol.* 4:105. doi: 10.3389/fpsyg.2013.00105


a comparison of four languages. *J. Phon.* 37, 417–435. doi: 10.1016/j.wocn.2009.07.005


M. Brudzynski (Oxford: Academic Press), 187–198.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 March 2013; paper pending published: 23 April 2013; accepted: 30 May 2013; published online: 30 July 2013.*

*Citation: Laukka P, Elfenbein HA, Söder N, Nordström H, Althoff J, Chui W, Iraki FK, Rockstuhl T and Thingujam NS (2013) Cross-cultural decoding of positive and negative non-linguistic emotion vocalizations. Front. Psychol. 4:353. doi: 10.3389/fpsyg.2013.00353*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Laukka, Elfenbein, Söder, Nordström, Althoff, Chui, Iraki, Rockstuhl and Thingujam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## The role of motivation and cultural dialects in the in-group advantage for emotional vocalizations

#### *Disa A. Sauter\**

*Department of Social Psychology, University of Amsterdam, Amsterdam, Netherlands*

#### *Edited by:*

*Petri Laukka, Stockholm University, Sweden*

#### *Reviewed by:*

*Marc D. Pell, McGill University, Canada*

*Hillary A. Elfenbein, Washington University in St. Louis, USA*

*\*Correspondence: Disa A. Sauter, Department of Social Psychology, University of Amsterdam, Weesperplein 4, 1018 XA Amsterdam, Netherlands e-mail: d.a.sauter@uva.nl*

It is well-established that non-verbal emotional communication via both facial and vocal information is more accurate when expresser and perceiver are from the same cultural group. Two accounts have been put forward to explain this finding: According to the dialect theory, culture-specific learning modulates the largely cross-culturally consistent expressions of emotions. Consequently, within-group signaling benefits from a better match of the "emotion dialect" of the expresser and perceiver. However, it has been proposed that the in-group advantage in emotion recognition could instead arise from motivational differences in the perceiver, with perceivers being more motivated when decoding signals from members of their own group. Two experiments addressed predictions from these accounts. Experiment 1 tested whether perceivers' ability to accurately judge the origin of emotional signals predicts the in-group advantage. For perceived group membership to affect the perceivers' motivation, they must be able to detect whether the signal is coming from an in-group or out-group member. Although an in-group advantage was found for in-group compared to out-group vocalizations, listeners were unable to reliably infer the group membership of the vocalizer. This result indicates that improved recognition of in-group signals can occur also when the perceiver is unable to judge whether signals were produced by in- or out-group members. Experiment 2 examined the effects of expected and actual group membership of signals on emotion recognition by manipulating both orthogonally. The actual origin of the stimulus was found to significantly affect emotion recognition, but the believed origin of the stimulus did not. Together these results support the notion that the in-group advantage is caused by culture-specific modulations of non-verbal expressions of emotions, rather than motivational factors.

**Keywords: emotion, in-group advantage, vocalizations, non-verbal communication, motivation**

#### **INTRODUCTION**

#### **THE IN-GROUP ADVANTAGE**

Emotional signals are largely shared across cultural groups. However, a consistent finding in cross-cultural research on nonverbal emotional communication is that recognition accuracy is higher when expresser and perceiver are from the same cultural group (see Elfenbein and Ambady, 2002a for a meta-analysis). This pattern of in-group advantage has been found for visual cues in the form of both facial expressions (e.g., Ekman et al., 1969; Haidt and Keltner, 1999) and postural cues (Tracy and Robins, 2008). The in-group advantage has also been found for auditory signals, specifically speech prosody (Scherer et al., 2001) and nonverbal vocalizations (Sauter et al., 2010b). Two mechanisms have been proposed to explain the in-group advantage: Perceivers may be more motivated when judging in-group signals, or the physical expressions may be modulated by cultural learning, which can lead to a disadvantage when encoder and perceiver are from different cultures.

#### **THE MOTIVATION ACCOUNT**

The in-group advantage in emotion recognition could arise from motivational differences in the perceiver when judging in- vs. out-group expressions (e.g., Thibault et al., 2006). According to this view, the extent to which perceivers are motivated to attempt to take the expresser's perspective, and thus decode their emotional state, depends on the extent to which they identify with the expresser. This builds on the findings that others who are perceived to be in-group members are attended to more, and are also typically evaluated more positively, than out-group members (see Tajfel and Billig, 1974). In the context of emotion communication, Thibault et al. (2006) suggest that observers may engage in more challenging strategies when decoding in-group expressions. According to this view, ethnic cultural groups constitute a subset of social groups, which depend on group identification. An in-group advantage would thus be expected for emotional communication between social groups of any kind, including groups differentiated by culture and/or ethnicity.

Thibault et al. provided empirical support for the motivational account in a study where basketball players and non-basketball players judged the emotional facial expressions of individuals who they were told were either basketball players or non-basketball players (Thibault et al., 2006). Participants who themselves played basketball were expected to consider other basketball players as their in-group, and hence be more accurate in their judgments of their emotional expressions. The authors found that recognition accuracy was affected by the group membership of the judge as well as the perceived group membership of the target, and concluded that "group identification influences decoding accuracy" (p. 682).

A study by Young and Hugenberg (2010) found further support for a motivational account of the in-group advantage for facial expressions. They elicited an in-group advantage while holding the culture of the expresser and perceiver constant, by creating a minimal-group paradigm using fake feedback from a personality test. They found that in-group faces were processed more configurally than out-group faces. Given that configural processing is beneficial for the decoding of emotional facial expressions (Calder et al., 2000), the authors suggest that this processing bias may underlie the advantage for in-group judgments of facial expressions. They further argue that the processing bias is motivationally driven, based on the fact that both the in-group advantage and the difference in configural processing disappear under reduced exposure time. To what extent this mechanism would apply to emotional signals other than facial expressions is unclear, given that this kind of configural processing has primarily been studied with faces.

Two studies of facial mimicry have also found support for a motivational account. In a study related to that by Thibault et al. (2006), using the same participants and stimuli, Bourgeois and Hess (2008; Experiment 2) found that perceivers displayed more mimicry for in-group as compared to out-group displays of sadness, although no effect was found for displays of anger or happiness. The authors concluded that the level of facial mimicry varies as a function of group membership, at least for some emotional states. Similarly, van der Schalk et al. (2011) examined facial mimicry of in-group and out-group expressions of emotions, presenting expressers either as a student of psychology (in-group) or as a student of economics (out-group). In that study, mimicry of anger and fear facial expressions, but not happiness, was found to be affected by group membership, with more mimicry occurring in response to in-group displays.

Together, these studies suggest that, at least for emotional facial expressions, the perception of signals may be affected by the extent to which perceivers identify with the expresser and judge them to belong to their own group.

#### **THE DIALECT ACCOUNT**

An alternative explanation of the in-group advantage is that culture-specific learning modulates non-verbal expressions of emotions. This is the account advanced by proponents of the dialect theory (Elfenbein and Ambady, 2002a). According to this view, within-group signaling benefits from a better match of the "emotion dialect" of the expresser and perceiver and hence results in improved accuracy.

One study that tested this account directly was conducted by Elfenbein et al. (2007). They asked individuals from Canada and Gabon to try to communicate a range of emotions to a friend using facial expressions. Analyzing the facial expressions using the Facial Action Coding System (FACS), they found that the muscle movements in the two groups largely converged on the expressions posited to be universal prototypes (Ekman and Friesen, 1978). In addition, and in support of the dialect account, reliable cultural differences emerged, which went beyond idiosyncratic differences of individual posers. The same expressions were then used in an emotion recognition task with participants from Canada and Gabon. Greater accuracy was found for judgments of in-group expressions, and although this pattern was consistent across emotions, a larger in-group advantage was found for those emotional states that exhibited greater differences in muscle movements. In fact, no in-group advantage was found when stimulus materials from the different groups were constrained to have an identical appearance (see Elfenbein and Ambady, 2002b; Matsumoto, 2002 for a discussion of this issue). The authors concluded that cross-cultural differences in expressive style underlie the in-group advantage for emotional expressions, rather than motivational or other factors.

#### **THE CURRENT STUDY**

Two experiments are presented which were designed to test predictions derived from the motivational (Experiments 1 and 2) and dialect (Experiment 2) accounts. Experiment 1 sought to examine the relationship between the in-group advantage and perceivers' ability to judge whether an emotional expression was produced by an in- or out-group member. For group-based motivational mechanisms to work, perceivers must be able to reliably judge whether a signal was produced by an in- or out-group member. Experiment 1 also examined whether individual differences in the ability to identify the group membership of emotional expressions would predict the extent to which perceivers show an in-group advantage. Experiment 2 aimed to test the relative contributions of motivation and cultural dialect to the in-group advantage for emotional vocalizations in a design that orthogonally manipulated the believed and actual cultural origin of the stimuli.

Research investigating the underlying mechanisms of this phenomenon has focused on facial expressions, although the in-group advantage has been found for facial, postural, and vocal signals (Elfenbein and Ambady, 2002a). There has been a call for the inclusion of non-verbal channels of communication other than the face, such as the voice (Elfenbein et al., 2007). A number of studies have found evidence for an in-group advantage also in emotional speech prosody (Scherer et al., 2001; Thompson and Balkwill, 2006; Pell et al., 2009).

The stimuli employed in these experiments were non-verbal vocalizations of emotions, such as cheers, laughs, and sighs. Group differences based on ethnicity and language are often obvious from facial or speech features, but non-verbal vocalizations offer a type of signal from which group membership may not be easily inferred. Both experiments used ethnic cultural groups to define in- and out-groups, since this is the level at which cultural dialects would be likely to work most extensively.

#### **EXPERIMENT 1**

Experiment 1 firstly sought to replicate previous findings of an in-group advantage for non-verbal vocalizations of emotions (Sauter and Scott, 2007; Sauter et al., 2010b). It was expected that degree of in- or out-group membership would vary with cultural distance. The perceivers consisted of a group of Dutch listeners, and so Dutch vocalizations were in-group stimuli. A British set of sounds were close out-group stimuli, and a set of Namibian vocalizations were distant out-group stimuli. The experiment then tested two predictions derived from the motivational account. The first hypothesis was that perceivers should be able to reliably judge whether non-verbal vocalizations of emotions were produced by in- or out-group members, assuming that an in-group advantage was found. The second hypothesis was that individuals who were better at judging the group membership of the vocalizations would show a larger in-group advantage.

#### **METHODS**

#### *Stimuli*

The stimuli were taken from sets of non-verbal vocalizations of positive emotions (Dutch: Sauter et al., 2010a; British: Sauter and Scott, 2007; Namibian: Sauter et al., 2010b). Each stimulus set consisted of six vocalizations per emotion, for the four emotions triumph, relief, amusement and sensual pleasure, resulting in 24 sounds per group and a total of 72 stimuli. Stimulus sex was balanced within each condition, with equal numbers of male and female tokens of each emotion for Dutch, British, and Namibian sounds, respectively. No exact age range was specified for the individuals producing vocalizations (given that the Namibian sample do not count age), but children, adolescents, and elderly adults were not included. The entire stimulus set was normalized for peak amplitude and was digitized at 41 kHz.

#### *Participants*

Thirty students (14 males; mean age 20.42 years, range 19–25 years) from the University of Amsterdam participated in the experiment in exchange for research credits. All participants were Dutch and reported having normal hearing.

#### *Design and procedure*

Participants were tested individually and completed two task blocks in one session. The first task tested emotion recognition, and the second block examined judgments of group origin of the same stimuli. Participants were informed that the experiment consisted of two parts and that they would receive the instructions for the second part after completing the first part. This was to ensure that participants would complete the emotion recognition task without knowing that the sounds were produced by individuals from different cultural groups. Thus, any in-group advantage would be independent of listeners' conscious awareness that the stimuli they heard originated from several different cultural groups. Sounds were delivered via headphones using the Psychophysics toolbox (Brainard, 1997) for MATLAB (Mathworks Inc., Natick, MA) running on a MacBook Pro laptop. Before each block, participants were given written instructions and had the opportunity to ask questions. After completing both tasks, participants were debriefed about the purpose of the study. The project was approved by the University of Amsterdam Department of Psychology ethics committee, and informed consent was obtained from all participants.

#### *Emotion recognition task*

The emotion recognition block consisted of a forced-choice categorization task with four response alternatives: triumph (in Dutch: *success*), relief (in Dutch: *opluchting*), amusement (in Dutch: *vermaak*), and sensual pleasure (in Dutch: *genot*). The written instructions for the emotion recognition task included a scenario for each of the four emotions (see Sauter et al., 2010a). The stimuli consisted of six tokens of each of these emotions from each of the three cultural groups (Dutch, British, and Namibian), resulting in a total of 72 sounds. The stimuli were played in a random order, and participants responded using key presses to select between response alternatives displayed on the screen in Dutch alphabetical order.

#### *Group identification task*

The group identification task consisted of a forced-choice task in which participants were asked to judge where the person producing each stimulus was from. The three response options were "the Netherlands," "a different country in Europe," and "a different country outside Europe." The same stimuli as in the emotion recognition task were used. Again, the stimuli were played in a random order, and participants responded using key presses. Response alternatives were displayed in order of proximity from the Netherlands.

#### **RESULTS**

In forced-choice tasks with multiple response alternatives, performance for a particular category can be artificially inflated by the disproportionate use of that response. Unbiased hit rates (Hu scores, see Wagner, 1993) were calculated to control for this bias in both tasks. Hu scores are calculated separately for each participant for each condition (see **Table A1** for results for individual emotions); a score of one denotes perfect performance and a score of zero denotes no correct responses. As Hu scores are proportions, they were arcsine transformed before use in statistical tests.
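As an illustration of the measure, the following is a minimal sketch of how an unbiased hit rate could be computed from one participant's confusion matrix, following Wagner's (1993) definition: the squared number of hits divided by the product of the number of stimuli in that category and the number of times that response was used. The function names, the example counts, and the specific form of the arcsine transform (arcsine of the square root, a common choice for proportions) are assumptions for illustration and are not taken from the analysis scripts used in the study.

```python
import math


def unbiased_hit_rates(confusion):
    """Return an unbiased hit rate (Hu) for each category.

    `confusion[stimulus][response]` is the number of trials on which a
    stimulus of one category received a given response.
    """
    categories = list(confusion)
    hu = {}
    for cat in categories:
        hits = confusion[cat].get(cat, 0)
        n_stimuli = sum(confusion[cat].values())
        n_responses = sum(confusion[s].get(cat, 0) for s in categories)
        if n_stimuli == 0 or n_responses == 0:
            hu[cat] = 0.0  # category never presented or response never used
        else:
            hu[cat] = hits ** 2 / (n_stimuli * n_responses)
    return hu


def arcsine_transform(proportion):
    """Arcsine square-root transform (assumed form of the transform)."""
    return math.asin(math.sqrt(proportion))


# Hypothetical responses of one listener to six tokens per emotion.
example = {
    "triumph": {"triumph": 6},
    "relief": {"relief": 4, "sensual pleasure": 2},
    "amusement": {"amusement": 6},
    "sensual pleasure": {"sensual pleasure": 3, "relief": 3},
}

hu_scores = unbiased_hit_rates(example)
transformed = {cat: arcsine_transform(h) for cat, h in hu_scores.items()}
```

In this hypothetical example the listener gives the response "relief" seven times but is correct on only four of them, so the Hu score for relief (16/42) is lower than the raw hit rate (4/6).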

#### *Is there an in-group advantage in emotion recognition from vocalizations?*

In order to confirm whether listeners performed better with in-group as compared to out-group vocalizations on the emotion recognition task, the Hu scores from block 1 were compared using paired-sample *t*-tests, with separate tests to contrast performance with Dutch vocalizations to that with British and Namibian sounds, respectively. Dutch sounds were significantly better recognized than Namibian sounds [*t*(29) = 9.61, *p* < 0.001], see **Figure 1A**. However, no difference was found in recognition levels between Dutch and British sounds [*t*(29) = 0.82, *p* = 0.42], see **Figure 1A**. These results indicate that an in-group advantage is present in the recognition of emotional vocalizations from distantly-related, but not very closely-related, cultural groups.
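For readers less familiar with the test, a paired-sample *t* statistic of this kind can be sketched as follows. The function name and the example scores are hypothetical, and the sketch returns only the statistic and its degrees of freedom, not a *p*-value.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (t, df) for a paired-sample t-test on two equal-length lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical transformed Hu scores for four listeners.
in_group = [1.2, 1.1, 1.3, 1.0]
out_group = [0.9, 0.8, 1.1, 0.9]
t_value, df = paired_t(in_group, out_group)
```

With one pair of scores per listener, the degrees of freedom equal the number of listeners minus one, which is why a sample of 30 listeners yields *t*(29).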

#### *Can listeners identify group membership from sounds?*

To test whether participants were able to identify group membership at better-than-chance levels, Hu-scores from block 2 were compared to chance scores in paired-sample *t*-tests, separately for the Dutch, British, and Namibian vocalizations. Performance was not better than chance for any set of vocalizations (see **Figure 1B**).

In fact, performance was significantly worse than chance for both Dutch [*t*(29) = −10.78, *p* < 0.001] and British [*t*(29) = −10.44, *p* < 0.001] vocalizations, likely due to the particularly high rate of confusions between these two groups. Performance was no different than chance for Namibian sounds [*t*(29) = −0.52, *p* = 0.61]. These results show that listeners are not able, on a group level, to reliably judge group membership from emotional vocalizations.
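The chance scores used in this comparison are not spelled out above. One common approach, following Wagner (1993), is to take the chance level for a category as the joint probability of a stimulus of that category being presented and the corresponding response being used. The sketch below assumes that approach; the function name and the response counts are hypothetical.

```python
def chance_hu(confusion):
    """Return the chance-level Hu score for each category.

    Assumes chance is the product of the proportion of stimuli belonging
    to a category and the proportion of responses given to that category.
    """
    categories = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    chance = {}
    for cat in categories:
        n_stimuli = sum(confusion[cat].values())
        n_responses = sum(confusion[s].get(cat, 0) for s in categories)
        chance[cat] = (n_stimuli / total) * (n_responses / total)
    return chance


# Hypothetical group judgments by one listener, 24 sounds per group.
judgments = {
    "Dutch": {"Dutch": 8, "British": 12, "Namibian": 4},
    "British": {"Dutch": 12, "British": 8, "Namibian": 4},
    "Namibian": {"Dutch": 4, "British": 6, "Namibian": 14},
}
chance_scores = chance_hu(judgments)
```

Because the chance level depends on how often each listener used each response, it is computed per participant, which is what allows observed and chance scores to be compared in a paired test.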

#### *Does group identification predict the in-group advantage?*

To examine whether individual listeners' ability to identify the expresser's group membership would predict the in-group advantage they displayed, a linear regression was performed. The independent measure was performance on the group identification task, using Hu-scores. The dependent measure was the in-group advantage, calculated as performance on the emotion recognition task for Dutch as compared to Namibian stimuli, given that no significant difference was found for performance with Dutch as compared to British stimuli. The regression analysis was not statistically significant (*r*<sup>2</sup> = 0.01, *p* = 0.68; see **Figure 2**), indicating that the ability to judge whether sounds were produced by in- or out-group individuals did not predict the advantage displayed for recognizing in-group vocalizations.
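A simple regression of this form, with one predictor and one outcome per listener, can be sketched as below. The function name and the data are hypothetical; the sketch returns the slope, the intercept, and the proportion of variance explained.

```python
def simple_regression(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r2)."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples of at least 3 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Hypothetical data: group identification score (x) and
# in-group advantage (y) for four listeners.
identification = [1.0, 2.0, 3.0, 4.0]
advantage = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2 = simple_regression(identification, advantage)
```

An *r*<sup>2</sup> close to zero, as reported above, means that almost none of the variance in the in-group advantage is accounted for by group identification performance.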

#### **DISCUSSION**

The results of this experiment showed that perceivers were not able to reliably judge whether non-verbal vocalizations of emotions were produced by in- or out-group members. Other nonverbal signals of emotions, such as facial expressions, typically allow perceivers to infer group membership even of visually similar cultural groups, such as Australians and Americans (Marsh et al., 2007), and Japanese and Japanese-Americans (Marsh et al., 2003, but see also Matsumoto, 2007). Perceivers are also able to infer group identity from speech segments, as shown in a study by Walton and Orlikoff (1994). Using sustained vowel sounds, they found that listeners could identify the speakers' race correctly 60% of the time in a two-way forced choice task. The contrast between the current study and studies of facial expressions and speech may suggest that non-verbal vocalizations are an unusual class of communicative signal in that they do not carry information about group identity.

The fact that an in-group advantage was found in the current study even though listeners were unable to tell whether sounds were produced by in- or out-group members demonstrates that motivational mechanisms are not necessary for an in-group advantage to occur. This does not rule out the possibility that motivational mechanisms contribute to the in-group advantage in cases where perceivers can accurately judge, or are explicitly told about, the group membership of the encoder. However, the current results also found no relationship between individuals' ability to judge group membership and the size of the in-group advantage that they displayed, further supporting the notion that the in-group advantage for non-verbal vocalizations of emotions does not rely on motivational factors in perceivers.

Consistent with previous research and similar to other types of non-verbal signals (Elfenbein and Ambady, 2002a; Sauter et al., 2010b), the current results show that there is an in-group advantage for non-verbal vocalizations of emotions. However, the current study found no in-group advantage for Dutch as compared to British stimuli. This is somewhat inconsistent with a previous study in which British listeners performed better than Swedish listeners with British stimuli (Sauter and Scott, 2007). The difference between the two sets of results may be partly due to the fact that the current experiment employed a within-subject design. The current study also had fewer response alternatives as it left out the least well recognized stimulus type used by Sauter and Scott, resulting in near-ceiling accuracy for both Dutch and British sounds. Another possibility is that Swedish vocalizations are more similar to British ones than are Dutch ones. Further studies are needed to examine similarities between the vocalizations of different cultural groups in terms of physical cues, and the relationship of these to listeners' perception (see also Pell et al., 2009 for a discussion of cultural out-group distance in the context of speech prosody).

#### **EXPERIMENT 2**

Experiment 1 showed that listeners cannot reliably judge whether non-verbal vocalizations of emotions were produced by members of their own cultural group or not. In Experiment 2, participants' belief about the origins of the sounds was manipulated in order to allow for a test of the relative contributions of motivation and cultural dialects to the in-group advantage. In a 2 × 2 design, Dutch participants heard emotional vocalizations in two blocks, each of which they were told contained either Dutch or foreign stimuli. In each block, the actual cultural origin of the stimuli was mixed so that an equal number of Dutch and foreign stimuli were presented. According to the dialect account, an in-group advantage should be found based on the actual origin of the stimuli, while according to the motivation account an advantage should be found for the recognition of stimuli that the participants believed were produced by in-group members.

#### **METHODS**

#### *Stimuli*

The stimuli were taken from sets of non-verbal vocalizations of emotions, using only Dutch and Namibian sounds (Dutch: Sauter et al., 2010a; Namibian: Sauter et al., 2010b). Each stimulus set consisted of six vocalizations per emotion, for the eight emotions anger, fear, triumph, relief, amusement, disgust, sadness, and sensual pleasure, resulting in 48 sounds per group and a total of 96 stimuli. The British stimuli were excluded in Experiment 2, because no in-group advantage was found for Dutch as compared to British stimuli in Experiment 1. In addition, the set of emotions was expanded from that used in Experiment 1. This was to get a more nuanced measure of emotion recognition which could also detect small effects. Stimulus sex was balanced within each condition, with equal numbers of male and female tokens of each emotion for Dutch and Namibian sounds, respectively. The entire stimulus set was normalized for peak amplitude and was digitized at 41 kHz.

#### *Participants*

Thirty students participated in the experiment. The participants received no reward for taking part. All participants were Dutch and reported having normal hearing. One participant was excluded because she did not have Dutch parents, resulting in a final sample of 29 participants (15 males; mean age 20.39 years, range 18–24 years).

#### *Design and procedure*

Participants were tested individually and completed two task blocks in one session. In one block participants were informed that they would hear vocalizations produced by Dutch individuals, and in the other block they were told that they would hear sounds by foreign individuals. Block order was counter-balanced across participants. In actual fact, within each block, half of the stimuli were Dutch and the other half of the stimuli were Namibian, with each set balanced for stimulus emotion and sex. Before each block, participants were given written instructions and had the opportunity to ask questions. The written instructions for the emotion recognition task included a scenario for each of the emotions (see Sauter et al., 2010a). In both blocks participants performed a forced-choice categorization task with eight response alternatives: triumph (in Dutch: *success*), anger (in Dutch: *woede*), relief (in Dutch: *opluchting*), fear (in Dutch: *angst*), amusement (in Dutch: *vermaak*), sadness (in Dutch: *verdriet*), sensual pleasure (in Dutch: *genot*), and disgust (in Dutch: *walging*). The stimuli were played in a random order, and participants responded using key presses, with response alternatives displayed on the screen in Dutch alphabetical order. Sounds were delivered via headphones using the Psychophysics toolbox (Brainard, 1997) for MATLAB (Mathworks Inc., Natick, MA) running on a MacBook laptop. After completing both tasks, participants were asked what they thought the purpose of the study was, in order to examine whether they were aware that the instructions they had received were untrue. None of the participants had seen through the deception. The project was approved by the University of Amsterdam Department of Psychology ethics committee, and informed consent was obtained from all participants.

#### **RESULTS**

Hu scores were calculated to yield an accuracy measure controlling for any response biases, and scores were arcsine transformed before use in statistical tests (see Wagner, 1993).

#### *What are the effects of expected and actual group belonging on emotion recognition?*

How listeners' performance was affected by the expected and actual group belonging of the stimuli was tested in an ANOVA with the two within-subjects factors expected stimulus group (Dutch vs. foreign) and actual stimulus group (Dutch vs. Namibian). A significant main effect was found for actual stimulus group [*F*(1, 28) = 281.20, *p* < 0.0001], with Dutch stimuli being recognized more accurately than Namibian sounds (see **Figure 3**). No main effect was found for expected stimulus group [*F*(1, 28) = 0.39, *p* = 0.54] and no interaction was found between the two factors [*F*(1, 28) = 0.01, *p* = 0.92].
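In a 2 × 2 design in which both factors are within-subjects, each main effect and the interaction has one numerator degree of freedom, and its *F* value equals the squared one-sample *t* statistic of the corresponding per-participant contrast. The sketch below illustrates this equivalence with hypothetical scores; the function name, the contrast weights, and the cell ordering are assumptions for illustration and do not reproduce the analysis software used in the study.

```python
import math
import statistics

# Cell order within each participant's row (assumed for this sketch):
# (expected Dutch, actual Dutch), (expected Dutch, actual Namibian),
# (expected foreign, actual Dutch), (expected foreign, actual Namibian)
CONTRASTS = {
    "actual": (1, -1, 1, -1),
    "expected": (1, 1, -1, -1),
    "interaction": (1, -1, -1, 1),
}


def within_subject_f(rows, weights):
    """Return (F, (df1, df2)) for a one-df within-subjects contrast."""
    values = [sum(w * s for w, s in zip(weights, row)) for row in rows]
    n = len(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("contrast does not vary across participants")
    t = statistics.mean(values) / (sd / math.sqrt(n))
    return t ** 2, (1, n - 1)


# Hypothetical transformed Hu scores for four listeners.
rows = [
    (1.0, 0.6, 1.1, 0.5),
    (0.9, 0.5, 0.9, 0.6),
    (1.2, 0.7, 1.1, 0.7),
    (1.0, 0.4, 0.9, 0.5),
]
f_actual, df_actual = within_subject_f(rows, CONTRASTS["actual"])
f_expected, df_expected = within_subject_f(rows, CONTRASTS["expected"])
```

With 29 participants, every effect in such a design is tested on 1 and 28 degrees of freedom.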

#### **DISCUSSION**

Experiment 2 shows that, for non-verbal vocalizations of emotion, in-group sounds are more accurately understood, regardless of what listeners believe the cultural origin of the sounds to be. This pattern of results lends support to the dialect account (Elfenbein and Ambady, 2002a), in that it demonstrates that differences in the actual cultural origin of emotional signals produce differences in recognition rates. This is in line with the idea of small physical differences in the affective signals of different groups, which are transmitted via cultural learning. For facial expressions, such differences could involve variations in the muscles used to signal particular states, as well as the intensity and dynamics in the activation of those muscles. These are likely acquired via implicit norms surrounding the regulation and display of emotions, including display rules (see Mesquita and Frijda, 1992, for a discussion). For vocalizations, differences may be expressed in amplitude, spectral, or pitch cues, and for vocal signals the native language of the producer may additionally influence the phonetic properties of the expressions (see Pell et al., 2009).

The results of this study fail to support a motivational explanation for the in-group advantage of emotional vocalizations. However, motivation has previously been found to have an effect in designs where stimuli were equivalent (Thibault et al., 2006; Young and Hugenberg, 2010). One difference between the current and previous studies may be the use of ethnic cultural groups in the current experiments, as opposed to social groups based on other aspects of identity (playing basketball or results on a personality test). Another difference is the use of vocal stimuli in the current study, in contrast to facial displays in previous research. Group membership, at least with regards to ethnicity, may be a less salient cue in vocalizations than in other non-verbal displays, such as facial expressions. From faces, perceivers are able to make reliable group judgments even for individuals who are visually similar (Marsh et al., 2003, 2007). Furthermore, in the current experiment, group membership was not emphasized to participants, and it cannot be ruled out that believed group membership may have affected recognition accuracy if it had been made more salient and/or relevant to participants. Nevertheless, the results of this study show that actual, but not believed, group belonging of non-verbal emotional vocalizations underlies the in-group advantage in a cross-cultural paradigm.

#### **GENERAL DISCUSSION**

The current set of experiments set out to test predictions derived from two accounts of the in-group advantage for emotional signals. Experiment 1 tested predictions made on the basis of the motivational account, and Experiment 2 tested predictions of both the motivational and dialect theories. Experiment 1 failed to find any relationship between the in-group advantage and perceivers' ability to judge whether an emotional expression was produced by an in- or out-group member, as measured by individual differences. Furthermore, an in-group advantage was found despite listeners being unable to accurately judge whether stimuli were produced by in- or out-group members. This indicates that motivational factors are not necessary for an in-group advantage to emerge for emotional stimuli, given that perceivers must be able to infer whether a signal was produced by an in-group or out-group member in order for motivational mechanisms to work.

In Experiment 2, the believed and actual cultural origin of the stimuli were orthogonally manipulated in order to examine the relative contributions of motivation and cultural dialect to the in-group advantage. As would be predicted by the dialect account, a strong effect was found for the actual group membership of the stimuli. The dialect account is consistent with co-evolutionary accounts of signal production and perception. According to this view, mechanisms of production and perception have evolved under reciprocal pressure toward a shared set of communication features (Gentner and Margoliash, 2003; see also Smith et al., 2005 for supporting evidence for emotional signals).

The believed group of the stimuli did not affect recognition rates, in contrast to the prediction derived from the motivational account. Thus, across two experiments, the current study failed to find support for a motivational mechanism underlying the in-group advantage for emotional communication. It is worth noting that the current study did not employ a balanced design, which is the most robust way of studying the in-group advantage (Matsumoto, 2002). The development of new, validated corpora from several cultural groups at varying distance from each other would facilitate the use of balanced designs. This is particularly important given that an in-group advantage in uni-directional designs may occur due to a difference in decoding-ease of the stimuli (Matsumoto, 2002). This is, however, unlikely to be the case for the stimuli used in the current study, as a bi-directional in-group advantage has already been demonstrated for a Namibian-British comparison of those sounds (Sauter et al., 2010b). The fact that Namibian listeners recognized Namibian stimuli better than British sounds suggests that, rather than the Namibian stimuli being of inferior quality to the Dutch and British ones, how easily the sounds are decoded depends on the cultural background of the listener.

The findings from the current study raise some questions about the constraints under which a motivational mechanism may apply. One possibility is that motivation may primarily affect perception in the context of groups that do not differ in their physical signals. Previous studies that have found support for a motivation explanation have contrasted students from different disciplines (van der Schalk et al., 2011), basketball players with non-basketball players (Thibault et al., 2006), or people with different personality test scores (Young and Hugenberg, 2010). In cross-cultural contexts, motivation may thus play a less pronounced role. However, the question of whether motivational mechanisms operate as an out-group bias, rather than eliciting an in-group advantage, has recently been raised (Elfenbein, 2013). An out-group bias could be expected to occur in cross-cultural settings with foreign groups that are perceived, for example, as of low status.

The current results also indicate that emotion recognition may work differently to emotional mimicry. Two studies to date have found more mimicry for believed in-group as compared to believed out-group displays of emotions, while controlling for the physical features of the stimuli (Bourgeois and Hess, 2008; van der Schalk et al., 2011). It is worth noting that the current study used auditory stimuli, while most previous studies have used facial expressions of emotions. However, van der Schalk et al. also note that the theoretical implications of their results are not clear, given the small effect sizes and the inconsistency of the effect across emotions.

One way to further explore predictions from these accounts may be the use of neurocomputational models, such as that developed by Dailey and colleagues (Dailey et al., 2010). Their model was trained in a Japanese or an American cultural context, and then tested with facial stimuli from both groups. The results of the model replicated the human in-group advantage for emotional facial expressions, lending support to an explanation emphasizing physical differences in the expressions of cultural groups. Recent evidence suggests that computer models may also be able to identify ethnic groups from speech segments (Hanani et al., 2013). Further computational models may add yet more to our understanding of how differences in emotional communication might arise in different cultural learning environments by incorporating possible motivational mechanisms.

#### **ACKNOWLEDGMENTS**

This work was funded by an NWO Veni grant (275-70-033) to Disa A. Sauter. The author would like to thank Frank Eisner for his helpful comments. The author would also like to thank Yazmin Daruvalla, Dennis Huijzer, Nadine Komduur, Djinder Verduyn Lunel, and Josefien Mooij for data collection for Experiment 1, and Nora Holster, Kaylee van de Meent, Bren Meijer, Carina Thönnes, and Marloes van der Wel for data collection for Experiment 2.

#### **REFERENCES**


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 March 2013; accepted: 14 October 2013; published online: 30 October 2013.*

*Citation: Sauter DA (2013) The role of motivation and cultural dialects in the ingroup advantage for emotional vocalizations. Front. Psychol. 4:814. doi: 10.3389/fpsyg.2013.00814*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Sauter. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### **APPENDIX**

#### **Table A1 | Performance on the emotion recognition task in Experiment 1, with accuracy for Dutch, British, and Namibian stimuli shown for each emotion separately.**


*Scores denote raw Hu scores, with standard deviations in brackets.*
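
As a reading aid for the Hu scores reported here, the following is a minimal sketch of how an unbiased hit rate is usually computed from a confusion matrix. It assumes the common definition (squared number of hits, divided by the product of the number of stimuli in a category and the number of times that response was used). The function name and the toy counts are hypothetical and are not taken from the study.

```python
def unbiased_hit_rates(confusion):
    """Return one Hu score per category.

    confusion[i][j] is the number of times a stimulus of category i
    received response j (square matrix, same category order on both axes).
    """
    n = len(confusion)
    stimulus_totals = [sum(row) for row in confusion]
    response_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    scores = []
    for k in range(n):
        denominator = stimulus_totals[k] * response_totals[k]
        # Hu is defined as 0 when a category was never presented or never chosen.
        scores.append(confusion[k][k] ** 2 / denominator if denominator else 0.0)
    return scores


# Hypothetical counts for one listener, two categories (not real data).
toy_confusion = [
    [8, 2],  # 10 "anger" stimuli: 8 labelled anger, 2 labelled fear
    [4, 6],  # 10 "fear" stimuli: 4 labelled anger, 6 labelled fear
]
hu_scores = unbiased_hit_rates(toy_confusion)
```

In this toy case the raw hit rate for anger is 0.8, but the Hu score is lower because the "anger" response was also used for fear stimuli. Correcting for that kind of response bias is the purpose of the measure.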

## What does music express? Basic emotions and beyond

#### *Patrik N. Juslin\**

*Department of Psychology, Uppsala University, Uppsala, Sweden*

#### *Edited by:*

*Daniel J. Levitin, McGill University, Canada*

#### *Reviewed by:*

*Sébastien Paquette, University of Montréal, Canada; Peter Pfordresher, University at Buffalo, State University of New York, USA*

*\*Correspondence:*

*Patrik N. Juslin, Department of Psychology, Uppsala University, PO Box 1225, SE–751 42 Uppsala, Sweden; e-mail: patrik.juslin@psyk.uu.se*

Numerous studies have investigated whether music can reliably convey emotions to listeners, and—if so—what musical parameters might carry this information. Far less attention has been devoted to the actual *contents* of the communicative process. The goal of this article is thus to consider what types of emotional content are possible to convey in music. I will argue that the content is mainly constrained by the type of coding involved, and that distinct types of content are related to different types of coding. Based on these premises, I suggest a conceptualization in terms of "multiple layers" of musical expression of emotions. The "core" layer is constituted by iconically-coded *basic emotions*. I attempt to clarify the meaning of this concept, dispel the myths that surround it, and provide examples of how it can be heuristic in explaining findings in this domain. However, I also propose that this "core" layer may be extended, qualified, and even modified by additional layers of expression that involve intrinsic and associative coding. These layers enable listeners to perceive more complex emotions—though the expressions are less cross-culturally invariant and more dependent on the social context and/or the individual listener. This multiple-layer conceptualization of expression in music can help to explain both similarities and differences between vocal and musical expression of emotions.

**Keywords: music, emotion, expression, communication, categories, dimensions**

### **INTRODUCTION**

Few scholars would dispute that music is often heard as expressive of emotions by listeners. Indeed, emotional expression has been regarded as one of the most important criteria for the aesthetic value of music (Juslin, 2013). Music has even been described as a "language of the emotions" by some authors (Cooke, 1959). It is not surprising, then, that a number of studies have investigated whether music can reliably convey emotions to listeners, and if so—what musical features may carry this information. Far less attention has been devoted to the actual *contents* of the communicative process. The goal of this article is thus to take a closer look at the emotional contents of music. To be clear, the focus is on the *expression* and *perception* of emotions, rather than on the *arousal* of emotions (Gabrielsson, 2002).

In one sense, the term "emotional expression" is slightly misleading: it is only sometimes that musicians are truly expressing their own emotions in a composition or performance. What is usually meant by the term emotional expression is that listeners perceive *emotional meaning* in music. Yet the term "emotional expression" is widely established and will thus be retained in the present essay. The fact that people like to use the term "expression" suggests that music somehow reminds them of the ways humans express their states of mind in real life—a notion that is not too far off the mark (see section Iconic Coding: Basic Emotions).

Whereas Budd (1985) defined music as "the art of uninterpreted sounds" (p. ix), the present author instead assumes that music is *constantly interpreted*. Sometimes these interpretations may lead to the *arousal* of an emotion (e.g., Juslin, 2013). But more commonly, perhaps, we merely detect meaningful information. The notion of meaning suggests that music somehow *refers* to something else, beyond itself (Cross and Tolbert, 2009), but what kind of meaning it conveys has been a matter of much debate. Throughout history, music has been regarded as expressive of motion, tension, human characters, identity, beauty, religious faith, and social conditions. However, the most common hypothesis is arguably that listeners perceive music as expressive of emotions (for a review, see Gabrielsson and Juslin, 2003).

Empirical research largely confirms this view; for example, in a survey study by Juslin and Laukka (2004), 141 participants were asked what, if anything, music expresses. They were required to tick items from a list of options, based on a thorough survey of the literature on expression in music. Results indicated that "emotions," unlike any of the other options, was selected by 100% of the participants. The real puzzle, however, and the topic of the present discussion is this: which emotions are expressed in music and why? The previous literature presents a somewhat confusing picture: some authors write about "expression" as something vague and flexible, almost idiosyncratic; others seem to view expression as something more specific, something for which terms like *agreement* and *accuracy* seem applicable. Are they really writing about the same phenomenon? It is hoped that the present essay can bring some clarity to this issue and illustrate how different conceptions of expression might be related.

The rest of this article is organized as follows. First, I briefly review some evidence regarding what emotions music typically expresses. I also discuss which approach to emotion (categories or dimensions) can best account for these results. Then, I argue that the emotional content of music is constrained by three types of coding that can be conceptualized as distinct "layers" of musical expression. Finally, I consider the implications of this conceptualization for the field.

#### **WHICH EMOTIONS DOES MUSIC EXPRESS?**

Note that there are different senses in which music can be said to express emotions. Firstly, a listener could perceive *any* emotion in a piece of music; and in a nontrivial sense, it would be inappropriate to claim that the listener is "wrong." The subjective impression of an individual listener cannot be disputed on objective grounds. A first way to index emotional expression is thus to accept the *unique impressions* of individual listeners: Whatever a listener perceives in the music *is* what the music is expressing—for him or her at least! This is the view adopted by MacDonald et al. (2012), when they note that "we are ... free to interpret what we hear in an infinite number of ways" (p. 5).

Several researchers prefer a more "restrictive" view on expression, however, which holds that music is expressive of a specific emotion only to the extent that there is some minimum level of *agreement* among different listeners regarding the expression, presumably because there is something in the music that produces a similar impression in many listeners. Expression thus conceived brings a stronger focus on psychophysical relationships between musical features and perceptual impressions. Thus, a second way to index emotional expression in music is to focus on listener agreement (Campbell, 1942).

The notion of expression does not require that there is any correspondence between what the listener perceives in a piece of music and what the composer or performer intends to express. In contrast, the concept of "communication" requires that there is both an intention to express a specific emotion and recognition of this emotion by listeners. Presumably, many musicians care about whether listeners perceive their music the way they intended it. Hence, if we study expressed emotions in terms of communication, we might also index emotional expression in terms of *accuracy* (Juslin and Timmers, 2010; see, e.g., Thompson and Robitaille, 1992).

Most likely, there are fewer emotions for which there is *agreement* among several listeners than there are emotions that a *single* listener may perceive in a piece. Even fewer emotions may be relevant if we consider those emotions that might be reliably *communicated* from a musician to a listener; that is, where there is an intention to convey an emotional character, which is correctly recognized by a perceiver. Later in this essay, I will offer a conceptualization that covers all of the above ways in which music could be said to express emotions—from the most personal to the most communal aspects of perceived expression.

Just as there are many different ways to conceptualize expression in music, there are different approaches that may be adopted to investigate empirically which emotions music can express. One rather simple way to approach the question is to ask music listeners directly. Thus, **Table 1** shows data from three different studies in which listeners were asked which emotions music can express. In each study, the subject could choose from a long list of emotion labels. Shown are the rank orders with which each of the top ten emotion terms was selected. As can be seen, *happiness*, *sadness, anger, fear*, and *love/tenderness* were all among the top-ten emotions, and this tendency was similar across the three data sets, despite differences in samples (musicians vs. students, various countries) and selections of emotion terms (ranging from 32 to 38 terms). Hence, there seems to be agreement about which emotions are easiest to express in music<sup>1</sup>.

#### **Table 1 | Ratings of the extent to which specific emotions can be expressed in music.**


*Only the ten most highly rated emotions in each study have been included in the Table. Those emotion categories that correspond to the basic emotions are set in bold text. (Anxiety belongs to the "fear" family, and tenderness to the "love" family; see, e.g., Shaver et al., 1987.) The original lists of emotion terms contained both "basic" and "complex" emotions, as well as some terms commonly emphasized in musical contexts (e.g., solemnity).*

It could be argued that such findings are more reflective of the beliefs and folk theories that musicians and listeners have about music than they are of any real circumstances. However, evidence that there is some substance to their intuitions comes from studies in which listeners are asked to rate the emotional expression of actual pieces of music. The results from over a hundred studies demonstrate that music listeners are generally consistent in their judgments of expression. Thus, a second approach to answer what emotions music expresses is to look at what emotions tend to yield the highest levels of agreement between listeners in previous studies. For instance, in their respective overviews, Gabrielsson and Juslin (2003) and Juslin and Laukka (2003) noted that the highest agreement between listeners occurred for emotions such as *happiness, sadness, anger*, and *tenderness*, and emotion dimensions, such as *arousal*. Moreover, there was often good agreement regarding the *broad* emotional character, but less agreement about *nuances* or *variants* of this emotion. Low agreement was found for emotion labels such as *jealousy*, *pity*, *cruelty*, *eroticism*, *whimsiness*, and *devotion*. In addition, hardly any agreement at all was found for various events depicted in so-called "program music."

<sup>1</sup>Notably, the nine emotion clusters in Schubert's (2003) "updated" version of Hevner's (1936) adjective clock are quite similar to the most highly ranked emotion terms shown in **Table 1**.

In sum, previous research suggests that certain emotions are easier to express in music than others. What approach can best help to explain these findings? To answer this question, we first have to review major approaches to conceptualizing emotions.

#### **DOES MUSICAL EXPRESSION INVOLVE CATEGORIES OR DIMENSIONS?**

The dominant approaches to conceptualizing emotions in psychology are *categorical* and *dimensional* approaches, respectively. (I prefer to refer to them as approaches, rather than theories, because they represent broad perspectives on similarities and differences among emotions, which may include quite different emotion theories of a more specific kind.)

#### **CATEGORIES**

According to categorical theories, people experience emotional episodes as categories that are distinct from each other, such as *happiness*, *sadness, anger, surprise, fear*, and *interest* (Izard, 1977). Note that categorical theories of emotion come in many different forms. Thus, one type of theory is associated with the concept of basic emotions (see section The Concept of Basic Emotions). However, many other emotion theories, such as component-process theories and "music-specific" models, also involve categories and therefore represent subdivisions of the categorical approach rather than additional approaches. Component-process theories (e.g., Scherer, 1984) assume that there are as many categories as there are possible outcomes of the appraisal process. A "music-specific" model assumes that the categories are different from "everyday emotions," and, moreover, that they are "unique" to music. Zentner et al. (2008), for example, proposed nine such categories.

#### **DIMENSIONS**

In contrast, dimensional theories seek to conceptualize emotions based on their approximate placement along broad and continuous dimensions, such as *valence, activation*, and *potency*. Just like categorical approaches, dimensional models come in several different forms, from one-dimensional *arousal* models (Duffy, 1941), to two-dimensional (e.g., *Arousal-Pleasure*, Russell, 1980; *Positive Affect-Negative Affect*, Watson and Tellegen, 1985; *Energetic Arousal-Tense Arousal*, Thayer, 1989), or three-dimensional models (*Energy Arousal-Tense Arousal-Valence*, Schimmack and Grob, 2000; *Gaiety-Gloom*, *Tension-Relaxation*, *Solemnity-Triviality*, Wedin, 1972). The most popular version is clearly the *circumplex model* outlined by Russell (1980), maybe because it is easy to understand. It consists of a two-dimensional and circular structure featuring the dimensions *pleasure* and *arousal*. The model illustrates that emotions vary in their degree of similarity, and that some emotions are usually thought of as opposites.

#### **DISCRIMINATING EMPIRICAL EVIDENCE**

Which of these approaches best accounts for emotions? Many researchers view categorical and dimensional approaches as "complementary" (Nyklíček et al., 1997): they both receive some support from neurophysiological findings (Damasio, 1994), and both can be useful to characterize emotions in music (e.g., Vieillard et al., 2008). Still, theoretically, they cannot really be equally correct considering that they make opposite claims at a fundamental level. Although several studies have aimed to compare categorical and dimensional approaches to emotions (e.g., Eerola and Vuoskoski, 2011), the data reported rarely have any bearing on the fundamental assumptions of each approach<sup>2</sup>. The most important difference between a dimensional and a categorical approach is that the former assumes that emotions vary in a *continuous* manner in "emotion space," whereas the latter assumes that there is *discontinuity* (discreteness) in "emotion space." Though only a few studies have directly addressed this essential aspect, those that have done so indicate that the continuity assumption is incorrect. That is, emotions show discreteness in terms of category boundaries, rather than continuity (Haslam, 1995).

In this article, we are concerned with emotional expression. Of particular importance in this context are studies which show that continuous variation in vocal emotion expressions is processed categorically (de Gelder and Vroomen, 1996; Laukka, 2005), since there are strong parallels among vocal and musical expressions of emotion (Juslin and Laukka, 2003; Table 7). If emotions conveyed in sound are not perceived in a continuous fashion, then other types of comparisons among the two approaches suddenly don't appear that important anymore—the dimensional approach has already been found wanting.

It's easy to see why categories are needed. Emotions function to guide decisions about future behavior. A continuous dimension of, say, *valence* is all very nice—but how are you going to use it? Exactly *how much is enough* to motivate a change in behavior? We need a "cut-off" or "stop rule" to make a decision; and once we have that—*voilà*—we have a *category boundary*. (In fact, even the traditional *pleasure* or *valence* dimension of the circumplex seems to imply a discrete boundary at some point—between positive and negative; approach and avoidance).
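
The "cut-off" argument can be made concrete with a minimal sketch. It assumes a hypothetical valence scale from -1 to 1 and an arbitrary cut-off at 0; neither the scale nor the function name comes from the literature cited here. The only point is that applying a decision rule to a continuous value yields a discrete boundary.

```python
def action_tendency(valence, cutoff=0.0):
    """Map a continuous valence value to a discrete action category.

    Both the scale and the cut-off are illustrative assumptions.
    """
    return "approach" if valence >= cutoff else "avoid"


# Hypothetical valence values on either side of the cut-off.
samples = [-0.8, -0.01, 0.01, 0.8]
labels = [action_tendency(v) for v in samples]
```

A tiny difference that straddles the cut-off (-0.01 vs. 0.01) changes the category, whereas a much larger difference on one side of it (0.01 vs. 0.8) does not. That asymmetry is the signature of a category boundary.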

Categories are of crucial importance to human behavior: they aid inferences, communication, and decision making (cf. Markman and Rein, 2013). Hence, even Barrett (2006), a dimensional theorist of note, appears to have accepted that "core affect" in terms of only two dimensions is insufficient to account for human emotions. She postulates a conceptual layer of categories of emotion on top of the two dimensions, and assumes that this layer is a social construction that mainly reflects language. But can emotion categories be dismissed so easily?

The argument that categorical perception of emotional expressions reflects language is partly based on findings that categorical perception involves a left-hemisphere bias in the brain. But Holmes and Wolff (2012) reported that categorical perception is not driven by language<sup>3</sup>; and emotion categories in vocal expression appear in other mammals, which obviously don't have a verbal language. For instance, the squirrel monkey has a limited number of vocal expression categories, which are associated with important events in the monkey's life, such as warning calls (alarm peeps), threat calls (groaning), desire for social contact calls (isolation peeps), and companionship calls (cackling) (Ploog, 1992). Further, even in the first months of life, human infants are able to differentiate vocal expressions of emotions in infant-directed speech, and to respond adequately to their categorical messages (see Papoušek et al., 1990). While I certainly do not deny that language shapes several aspects of how we report, and perhaps even experience, emotions, it cannot fully account for the existence of discrete emotion categories. Panksepp (1998) has outlined distinct emotion systems, with neuroanatomical and neurochemical components in the mammal brain, associated with seven emotion categories. His emotion labels (with more commonly used labels within parentheses) are *seeking* (*interest*), *rage* (*anger*), *fear*, *lust* (*desire*), *care* (*tenderness*), *panic* (*sadness*), and *play* (*joy*). The point is that emotion categories go deeper than mere verbal labels in language.

<sup>2</sup>Eerola and Vuoskoski (2011) noted that although both discrete-emotions and dimensional ratings of perceived emotions in music showed a high internal consistency, the discrete-emotions model showed a somewhat lower consistency for ambiguous examples of emotions (poor emotion exemplars). However, the dimensional model performs better mainly because it "by-passes" the task of allocating perceived emotions to an emotion category.

While there seems to be a consensus today that dimensional approaches focus on subjective experience (*feelings*)—perhaps because they are so poorly able to account for other emotion components such as emotional expression—it is important to acknowledge that dimensional models did *not* derive from "raw data" of self-reported emotions. Instead, they were abstract dimensions that resulted from multivariate statistical techniques applied to similarity ratings of facial expressions and emotion labels (e.g., Plutchik, 1994). People do not spontaneously report emotions as coordinates within an abstract, multi-dimensional emotion space. Hence, dimensional models appear too reductionist. In the circumplex model, two emotions that are placed in the same position in the circular matrix may be very different. For example, *anger* and *fear* are two emotions that are highly correlated within this model because they are both high in arousal and unpleasantness. Yet they are very different in terms of their implications for the organism (Lazarus, 1991). Furthermore, musical expressions of the two emotions are quite different (see Juslin and Laukka, 2003; Table 7). This implies that the circumplex model cannot accommodate that we are able to distinguish *anger* and *fear* expressions.
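
The *anger*/*fear* problem can be illustrated with a small sketch. The coordinates below are hypothetical placements on a pleasure-arousal plane scaled from -1 to 1, chosen only to mirror the verbal description above (both emotions unpleasant and highly aroused). They are not values from Russell (1980) or any other study.

```python
import math

# Hypothetical (pleasure, arousal) coordinates; illustrative assumptions only.
CIRCUMPLEX = {
    "anger": (-0.6, 0.8),
    "fear": (-0.7, 0.8),
    "happiness": (0.8, 0.5),
    "sadness": (-0.7, -0.6),
}


def circumplex_distance(emotion_a, emotion_b):
    """Euclidean distance between two emotions in the two-dimensional space."""
    return math.dist(CIRCUMPLEX[emotion_a], CIRCUMPLEX[emotion_b])


anger_fear = circumplex_distance("anger", "fear")
anger_happiness = circumplex_distance("anger", "happiness")
anger_sadness = circumplex_distance("anger", "sadness")
```

Under these assumed placements, *anger* and *fear* lie far closer to each other than either lies to *happiness* or *sadness*, so a description based on the two coordinates alone carries little information for telling them apart.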

Based on the above line of reasoning, I conclude that musical expression of emotion is likely to involve emotion categories, rather than mere dimensions. (As we shall see later, this does *not* preclude that there is an implicit dimensionality in emotion categories; cf. section Resistance Against Basic Emotions.) If emotions tend to involve categories, then the next question is, which are those categories? Below, I suggest that an ecological perspective on emotions can be helpful to understand the kinds of categories that have been favored throughout evolution. But the types of emotion that are expressed and recognized in music also reflect the precise process through which the emotional contents are transmitted.

#### **HOW DOES MUSIC EXPRESS EMOTIONS? THREE TYPES OF CODING**

To explain why music appears to be expressing some emotions, rather than others, we need to take a closer look at the underlying process, particularly how the emotional meaning is *coded* in music (the specific manner in which the music carries the emotional meaning). I argue here that the emotional content of musical expression is constrained by the type of coding available and that distinct types of content are conveyed through different types of coding. Dowling and Harwood (1986) offered a useful categorization based on the ideas of Charles Peirce:


These three principles have been referred to as "iconic," "intrinsic," and "associative" sources of musical expression, in an attempt to make the concepts easier to grasp (Sloboda and Juslin, 2001). In the following, I will consider these types of coding in music and their implications for the types of emotions expressed.

#### **ICONIC CODING: BASIC EMOTIONS**

A first and very powerful source of perceived emotion in music reflects *iconic* coding. Juslin (1995, 1997, 1998, 2001) has repeatedly theorized that the code used in emotional expression in music performance is based on innate and universal "affect programs" for vocal expression of emotions. According to this "functionalist" framework—partly inspired by Spencer (1857)—the origin of iconically-coded expressions is to be sought in involuntary and emotion-specific physiological changes associated with emotional reactions, which strongly influence different aspects of voice production (for a review of the relationships among emotion, physiology and voice, see Juslin and Scherer, 2005). This notion was later named "Spencer's law" by Juslin and Laukka (2003). Because of its evolutionary origin, this is the type of coding that will have the most *uniform* impact on musical expression. I will show that iconically-coded expressions are intimately related to basic emotions.

#### *The concept of basic emotions*

The term *basic* or *discrete* emotions occurs frequently in the music psychology field today, typically to refer to certain emotions (*happiness, sadness, anger*, and *fear*), but without any deeper consideration of the theoretical basis of the concept. This is unfortunate, as it serves to obscure many of the issues under consideration.

First of all, it is quite possible to talk about emotions like *sadness, surprise, anger, happiness, interest*, and *fear* without adopting a basic-emotions perspective. Thus, simply adopting these emotions does not itself make one a "basic-emotion theorist." (Otherwise, even Scherer would be a "basic-emotion theorist" because most of his studies have focused on these emotions; e.g., Scherer and Oshinsky, 1977; Banse and Scherer, 1996; Scherer et al., 2001.) Hence, regardless of one's theoretical position, *sadness*, *happiness, anger, surprise*, and *fear* are obvious examples of emotions from "everyday life." Therefore, my recommendation is to employ the term "basic emotion" only when one is embracing the theoretical basis of this concept, and to use the term "everyday emotions" when one is simply referring to emotions like *happiness, anger, surprise, fear*, and *sadness*, without wanting to commit to the underlying theory of basic emotions.

<sup>3</sup>Bigand et al. (2005) also reported categorical effects that are unlikely to be a by-product of linguistic labelling, in the context of emotion judgments of music excerpts (p. 1130).

The concept of *basic emotions* refers to the idea that there is a limited number of innate and universal emotion categories, which are more biologically fundamental than others (Tomkins, 1962; Izard, 1977; Ekman, 1992; Oatley, 1992; Plutchik, 1994; Power and Dalgleish, 1997). Each basic emotion may be defined functionally in terms of a key appraisal of goal-relevant situations that have occurred frequently during evolution (e.g., Oatley, 1992). The situations include cooperation, conflict, separation, danger, reproduction, and caring. Support for basic emotions comes from a wide range of sources that include:


Not all of these sources of evidence are equally strong: thus, for example, the extent to which psychophysiological measures can distinguish among basic emotions is controversial, though recent multivariate approaches to emotion classification are promising (e.g., Kragel and LaBar, 2013). Yet, the most impressive evidence of basic emotions comes from studies of emotional communication (Juslin and Laukka, 2003).

#### *Basic emotions in vocal and musical communication*

To answer the question of which emotion categories we have, we first need to ask ourselves why we have categories *at all*; and, in particular, why we have emotion categories. Here, an ecological perspective on emotion could be helpful. Categories enable us to make important inferences (Corter and Gluck, 1992). For example, the ability to predict the probable behavior of another individual is quite useful: it allows the judge to adjust his or her behavior in order to affect the outcome of the interaction. Consequently I have argued elsewhere (Juslin, 1998) that when it comes to communication of emotion, the basic emotion categories represent the optimal compromise between two opposing goals of a perceiver: the desire to have the most informative categorization possible and the desire to have the categories be as discriminable as possible (Ross and Spalding, 1994). To be useful as guides to action, emotional expressions are typically decoded in terms of a few emotion categories related to important life problems such as danger (*fear*), competition (*anger*), loss (*sadness*), social cooperation (*happiness*), or caregiving (*love*) (Juslin, 2001).

In support, there is cross-cultural accuracy in decoding of basic emotions in vocal expression even in so-called traditional societies without any exposure to media (Bryan and Barrett, 2008). Critics of the basic-emotion approach in studies of vocal expression (Bachorowski, 1999) like to point out that it has been difficult to find distinct voice-profiles for basic emotions. Indeed, although basic emotions do present different acoustic features (Juslin and Laukka, 2003; Table 7), it's clear that the acoustic patterns obtained do not always neatly correspond to categories. But to look for discrete categories in the acoustic data is to look in the wrong place altogether. Categorical perception is a creation of the *mind*; it's not in the physical stimulus. The relevant support comes from work that shows that vocal emotion expression is *perceived* categorically (Laukka, 2005). The argument is that this evolved tendency to interpret emotional meaning in sounds in terms of certain categories also places some constraints on musical expression.

I have speculated (Juslin, 2001) that the origin of music lies in ceremonies of the distant past that related vocal emotion expression to singing: vocal expressions of basic emotions such as *happiness*, *sadness*, *anger* and *love* probably became gradually meshed with vocal music that accompanied associated cultural activities, such as festivities, funerals, wars, and caregiving. The implication is that basic emotions are "privileged," in the sense that they are biologically prepared for effective communication.

That basic emotions are easier to convey reliably in musical expression is also partly an effect of the fact that the communicative process involves partly redundant cues, which limits the amount of information that may be conveyed through the "channel," as captured by the *Lens Model* for music and emotion first proposed and implemented by Juslin (1995, 2000). This characteristic might also be explained in terms of evolutionary pressures: Ultimately, it is more important to avoid making serious mistakes (e.g., mistaking *anger* for *joy*), than to have the ability to make subtle discriminations among emotions (e.g., reliably recognizing different types of *joy*). Thus, a listener's interpretation of emotions in music will tend to gravitate toward basic categories.

#### *Resistance against basic emotions*

As shown above, there are plenty of reasons to adopt a categorical approach in terms of basic emotions. Why, then, has the notion of basic emotions been treated with so much skepticism in the music field recently? The reasons may be different, depending on who the skeptics are. Among *musicians*, there may be a sense that the concept of basic emotions somehow implies a low level of musical sophistication. (Who would like to have his or her music compositions or performances described as "basic"?) As pointed out by Juslin and Lindström (2010), however, the term *basic emotion* does not imply that the music itself is "basic": indeed, "basic emotions may be expressed in the most sublime manner" (p. 356). The term simply highlights the fact that basic emotions are at the core of human emotions. (Moreover, for most theorists, the idea of basic emotions also means that there are more complex emotions; see section Beyond Basic Emotions: Intrinsic and Associative Coding.) Yet, one source of resistance to basic emotions is probably the terminology as such.

One way to reduce resistance to the notion of basic emotions amongst musicians could be to demonstrate their natural relationships to the everyday praxis of musicians, even in classical music. Could it be the case that these terms used merely as shorthand for broad categories of emotion in musical expression in previous studies (Juslin, 2001) can be "translated" to some "language" more familiar to the working musician? Musical scores often include "expression marks" that serve to indicate not only the tempo of the music but also the intended expressive character of the music. In a recent study (Juslin and Wiik, submitted), professional performers and psychology students were required to rate a highly varied set of pieces of classical music with regard to 20 expression marks rated as common by music experts and 20 emotion terms rated as feasible in the context of musical expression (e.g., Lindström et al., 2003). When the ratings were combined, the analysis yielded highly significant correlations among expression marks and emotion terms—in particular for basic emotions (**Table 2**). The results may not be particularly surprising, given that expression marks typically involve reference to motion and emotion characters. But the point is that when music psychologists talk about basic emotions, they may well be referring to precisely the same expressive qualities that performers consider in expression marks throughout their daily work. Again we should not get too hung up on the superficial labels used to refer to the underlying emotion categories<sup>4</sup> .

**Table 2 | Examples of correlations between commonly used expression marks in music scores and basic-emotion labels used by psychologists.**


*\*p* < 0.01*.*

*(based on Juslin and Wiik, submitted).*
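The kind of relationship summarized in Table 2 can be illustrated with a minimal sketch. It assumes that each piece of music receives one mean rating for an expression mark and one for an emotion term, and that the two rating series are compared with a Pearson correlation; the source does not specify the exact analysis, so this is only one plausible reading. The expression mark ("dolente"), the emotion term, and all numbers below are made up for illustration and are not data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical mean ratings for six pieces (illustration only, not study data).
dolente_ratings = [1.0, 4.5, 2.0, 5.0, 1.5, 3.5]   # expression mark
sadness_ratings = [1.5, 4.0, 2.5, 4.5, 1.0, 3.0]   # emotion term

r = pearson_r(dolente_ratings, sadness_ratings)
print(f"r = {r:.2f}")
```

A value of r close to 1 in such an analysis would indicate that pieces rated high on the expression mark also tend to be rated high on the emotion term, which is the sense in which the two vocabularies may refer to the same expressive qualities.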

Among *music researchers*, resistance to basic emotions seems to be due to certain myths that have been allowed to flourish unchallenged, and that have contributed to a misunderstanding of the concept of basic emotions. Six of these myths warrant closer consideration here.

#### *Myth 1: "There is no agreement about which emotions are basic."*

Basic emotions have been criticized, based on the fact that different emotion theorists have come up with different lists of emotions (Ortony and Turner, 1990). But this argument is, on reflection, a little suspect. There is a key question we should ask about the concept of basic emotions: does the concept help to narrow down and organize the field of emotion in a way that makes for greater agreement and consistency amongst those researchers who adopt the concept than amongst those who don't? If so, the concept is heuristic. Note that ideas about emotions depend crucially on how one *defines* an emotion. This helps to explain differences with respect to the lists of basic emotions proposed so far. How can we expect the authors to come up with the same set of basic emotions if they don't define emotions in the same way? The relevant question to ask is therefore: *is there agreement about which emotions are basic amongst those who define emotions in a similar way?* In fact, if we consider the authors who adopt similar definitions of emotions (e.g., in terms of their evolutionary adaptiveness), there is a lot of agreement about which emotions are basic (e.g., Plutchik, 1980). There is arguably more disagreement about the term "emotion" itself than about basic emotions (cf. Kleinginna and Kleinginna, 1981). Yet, few would argue that we should abandon the term "emotion."

*Myth 2: "Basic emotions are incompatible with appraisal theory."* Sometimes the basic-emotion approach is contrasted with "appraisal theories" (Scherer, 1984), which aim to describe the processes through which an emotion is aroused. This is misleading, as it implies that the basic-emotion approach is somehow incompatible with appraisal. In fact, it turns out that many appraisal theorists embrace the notion of basic or primary emotions (see Lazarus, 1991; Roseman, 1991; Stein and Trabasso, 1992). Appraisal is a fundamental aspect of emotion induction that must be part of *any* emotion theory regardless of how it conceptualizes the resulting emotions. A component-process theory (e.g., Scherer, 1984) does not differ from a basic-emotion theory because it involves appraisal: The primary difference between the two types of theories is that the former assumes that there are as many emotion categories as there are possible outcome combinations of the appraisal-criteria included. (To my knowledge, this essential assumption has never actually been tested and verified by any researcher.) The latter type, in contrast, assumes that cognitive appraisals typically result in a smaller number of broad categories, with more differentiated appraisals producing nuances *within* the categories, rather than additional categories. Regardless, basic-emotion theories are compatible with attempts to model the appraisal process that produces an emotion<sup>5</sup>.

<sup>4</sup>Some musicians resist description of music in words more generally, on the basis that language is incapable of capturing all the musical nuances, but this limitation obviously applies to all emotion approaches discussed here.

<sup>5</sup>However, recall that emotions can be aroused in many different ways (Izard, 1993; Juslin, 2013). Hence, there lies a danger in defining emotion categories solely on the basis of specific appraisal outcomes.

*Myth 3: "Basic emotions are crude and lacking in nuance."* This refers to the common view that emotion categories do not allow for the occurrence of subtle nuances within a category. This reflects a misunderstanding of the very concept of a category. Just as there are different shades of *blue*, there can be different shades of *sadness*. The notion of basic emotions implies that emotions from distinct basic-level categories are more different from one another than are different emotions from within the same category (e.g., *sadness* and *joy* differ more than, say, *sadness* and *melancholy*); this doesn't preclude that there are nuances *within* categories as well. The notion of an emotion category is nicely captured by the term "emotion family" (e.g., Ekman, 1992). Each family includes a "theme" and its "variations." The "theme" represents the common characteristics of the basic emotion and the "variations" all the subtle nuances and shadings that might occur *within* the category. Laukka and Juslin (2007) reported that listeners could accurately recognize various intensity levels (high or low) of basic emotions in both vocal and musical expressions. Hence, there's an implicit dimensionality within basic-level emotion categories. Schubert (2010) points out that although we often think of using continuous-response methodology only with respect to dimensional models, it's perfectly possible to collect continuous ratings of discrete emotions also (e.g., to rate the amount of *sadness* while the music unfolds). In addition, many emotion researchers postulate "secondary" or "mixed" emotions which are founded on basic emotions, but that involve "blends" of emotions (Plutchik, 1994), or specific cognitive appraisals which occur together with a basic emotion (Oatley, 1992). Hence, Johnson-Laird and Oatley (1989) were able to sort several hundreds of emotion terms into just five basic emotion categories or some subset of them. Basic-emotion theories are able to accommodate diversity and nuances, including ebb and flow in emotion over time.

*Myth 4: "Basic emotions are always full-blown responses."* Basic emotions are commonly depicted by critics in a stereotyped manner, which borders on caricature: it's usually about hair-raising fear when confronted by a bear! But basic emotions may vary in intensity (e.g., from *frustration* or *irritation* to *anger* and *rage*). There is nothing in the concept of basic emotions as such that requires that the emotion will always be intense. Basic emotions are typically portrayed in such a way by critics in order to make the emotions appear irrelevant in everyday life (or in music). Are basic emotions relevant in everyday life? In the context of vocal expression, Cowie et al. (1999) asked participants to select a subset of emotions that they thought were important in everyday life. This produced a list of 16 emotions, and the labels chosen included basic emotions in different variants, such as *anger, fear, happiness, sadness, love, worry, interest* and *affection* (cf. Panksepp's seven emotional systems): we feel *irritated* when we can't find a parking space; *tender* when our children greet us; *anxious* when we receive letters from the tax office; or *enthusiastic* when we get a paper accepted. The mere fact that most emotions experienced in everyday life aren't particularly intense does not imply that they do not involve basic-emotion categories<sup>6</sup>. Consider Plutchik's (1994) *cone model* of basic emotions (**Figure 1**). The circular arrangement shows the degree of similarity among the emotions, whereas the vertical dimension shows the intensity dimension. One consequence of this arrangement is that emotions of a lower intensity are closer to each other, and hence more similar, than are emotions of a high intensity. It may be that music often operates in the lower section of the cone, rather than in the extreme section representing "full-blown emotions," but the same emotion categories are still involved. Therefore, we may not always "detect" discrete emotions in everyday life situations or in musical expressions, simply because milder versions of basic emotions involve more subtle differences.
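The geometric consequence of the cone arrangement can be made concrete with a small sketch. It assumes, purely for illustration, that each emotion category sits at a fixed angle on the circle and that the circle's radius at a given height is proportional to intensity (here scaled from 0 to 1); the angles and intensity values are hypothetical and are not taken from Plutchik's model.

```python
import math

def cone_distance(intensity, angle_a_deg, angle_b_deg):
    """Straight-line distance between two emotions at the same intensity level.

    Assumes the radius of the cone's cross-section equals the intensity,
    so the distance is the chord length 2 * r * sin(delta / 2).
    """
    delta = math.radians(abs(angle_a_deg - angle_b_deg))
    return 2.0 * intensity * math.sin(delta / 2.0)

# Hypothetical placement: two emotion categories 90 degrees apart on the circle.
mild = cone_distance(0.2, 0.0, 90.0)      # low-intensity variants
intense = cone_distance(1.0, 0.0, 90.0)   # full-blown variants

print(f"mild: {mild:.3f}, intense: {intense:.3f}")
```

Under these assumptions the distance between two categories shrinks in direct proportion to intensity, so mild variants of two different basic emotions lie much closer together than their intense variants, even though the angular separation between the categories is unchanged.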

*Myth 5: "Basic emotions are not relevant in music."* The above myths can explain a further myth: that basic emotions are irrelevant in the context of musical expression. One moment's reflection suggests the opposite: if there is *any* type of emotions that could be expected to have a strong and natural link to musical expression, then it's the basic-emotion type: basic emotions can be conveyed nonverbally through gesture and tone of voice using similar patterns (e.g., Clynes, 1977; Juslin, 1997), whereas more complex emotions don't have similarly distinct nonverbal patterns. We also saw that emotions that are regarded as basic emotions (e.g., *happiness, sadness, anger, tenderness, fear*) seem easiest to express and perceive in music, as indexed by listener agreement (Gabrielsson and Juslin, 2003) and ratings by both musicians (Lindström et al., 2003) and listeners (Juslin and Laukka, 2004). Zentner and Eerola (2010) submit that discrete-emotion models were not developed to study music. This is of course true, but in the context of perceived emotion, this misses the greater point: that music probably evolved on the foundation of vocal expressions of basic emotions. Hence, examples of such basic emotions may easily be found also in commercially available recorded music. For example, Leech-Wilkinson (2006) offers a large number of examples of "expressive gestures" used by singers to express basic emotions, such as *fear, sadness, anger, love*, and *disgust* in Schubert Lieder (see also analysis by Spitzer, 2010). Further, if we leave classical music aside for the moment (since it is a minority interest in the world, and even in the Western world; Hargreaves, 1986) and look at the types of music most frequently heard in everyday life, we find that popular music involves songs about things that matter to people, the stuff that makes them happy, sad, angry, afraid, or tender.

<sup>6</sup>Admittedly, many researchers prefer to look at intense emotions, but that's because it may be easier to detect *effects* of emotional responses on various measures if one uses emotion episodes with a high intensity.

*Myth 6: "Basic emotions have dominated in studies of music and emotion."* This concerns the increasingly common claim that basic or discrete emotions have somehow dominated in music and emotion research. The actual data reveal something else. Eerola and Vuoskoski (2013) recently reviewed studies of music and emotion published over a period of roughly twenty years (from 1988 to 2009). They found that about one third of these studies adopted a basic or discrete-emotions perspective. This shows, then, that the majority of studies of music and emotion have *not* focused on basic emotions. This is even more true if one extends the time-frame of the overview. For instance, Gabrielsson and Juslin (2003), who reviewed studies of emotional expression in music from the 1890s, observed that the concept of basic emotions, and other influences from emotion psychology in general, have come into studies of musical expression quite recently, and then primarily in studies of music performance. In most of the investigations to date, the emotions measured have instead been chosen based on statements from philosophers and music theorists; suggestions from previous studies; and intuition, folk psychology, and personal experience. All together, the emotion labels used in previous work number in the hundreds. Therefore, the view that basic emotions have dominated in previous studies of music and emotion is largely a "straw man"<sup>7</sup>.

#### *A positive explanatory role of basic emotions*

If we can get past the above myths about basic emotions, and consider the concept on its own merits, we may find that it can be highly heuristic to our understanding of musical expression. Few researchers in the music field have explicitly adopted a basic-emotions approach (but see Clynes, 1977). I proposed such an approach specifically in the context of studies of emotional expression in the *performance* of music (and not as an all-encompassing solution for the field of musical emotion), because I thought the concept could uniquely help to account for several of the findings in that field (see Juslin, 1997). The findings that have amassed since then have only reinforced this belief. Hence, consistent with the idea that emotional expression in music performance is mainly based on a code for vocal expression of basic emotions that has served important functions throughout evolution is evidence that:


<sup>7</sup>One exception is a set of studies of *performance,* which used the so-called "standard paradigm" to investigate whether musicians can communicate various basic emotions to listeners (reviewed by Juslin and Laukka, 2003).

It is my strong belief that no other emotion approach can nearly as convincingly account for the above findings regarding expression of emotion in music performance. The dimensional approach would have to explain why there is categorical perception of emotional expression if emotions are processed as continuous dimensions. It would also have to explain why some emotions are more easily expressed and recognized than others, if all emotions can be placed along the same continuous dimensions. Component-process theories would have to show that there are as many recognizable emotion categories in musical expression as there are possible appraisal-combination outcomes. This is a tall order, and I do not expect it to happen anytime soon. In contrast, a basic-emotions approach (Juslin, 1998) *predicts* categorical perception of emotions and higher listener agreement or decoding accuracy for emotions such as *happiness, sadness, anger, fear*, and *tenderness*.

#### **BEYOND BASIC EMOTIONS: INTRINSIC AND ASSOCIATIVE CODING**

The idea that basic emotions are "privileged" in musical expression does *not* imply, however, that other emotions cannot be conveyed in music also. It seems possible for music to convey more complex emotions under certain circumstances, even though there will tend to be lower agreement between listeners for such emotions (Senju and Ohgushi, 1987; Laukka et al., 2013). Part of the reason for this tendency is that more complex emotions are coded differently: they involve intrinsic and associative coding.

#### *Intrinsic coding*

Intrinsic coding involves internal syntactic relationships within the music itself. Music theory involves frequent references to tonal or harmonic motion (Lerdahl and Krumhansl, 2007), even gravitational forces between tones and chords (Larson and Van Handel, 2005), which can create "tension," "release," "climax," "repose," and "relaxation." Although Meyer's (1956) well-known theory focused primarily on how the thwarting of musical expectations might *arouse* emotion in listeners, it seems likely that this internal play within the musical structure could also affect *perceived* emotions (e.g., the emotional intensity; see Sloboda and Lehmann, 2001; Timmers and Ashley, 2007, for examples). Intrinsic sources of expression in music have rarely been investigated thus far, but they are unlikely to express specific emotions by themselves. Rather, their signification appears quite broad and mainly helps to qualify specific emotions conveyed by iconic or associative coding. By contributing dynamically shifting levels of tension, arousal and stability, they may help to express more complex, time-dependent emotions, such as *relief* and *hope*. This type of coding may require longer music excerpts in order to be truly effective, while most studies to date have used relatively short excerpts (Eerola and Vuoskoski, 2013).

#### *Associative coding*

Finally, music might also be perceived as expressive of emotions through *associative* coding. In other words, a performance of music may be perceived as expressive of a specific emotion simply because something in the music (a melody, timbre) has been repeatedly and arbitrarily paired with other meaningful stimuli or events in the past. Organ music could be perceived as expressive of "solemnity" or "spirituality," simply because it has been heard often in churches. Dowling and Harwood (1986) offered a classic example, in terms of Puccini's use of the first phrase of the "Star Spangled Banner" in *Madame Butterfly* to signify a feeling of "patriotism." Associative coding plays a crucial part in Wagner's *Leitmotif* strategy, where specific melodic themes are associated with particular characters in the drama. Included in this coding subtype are also expressive meanings which are purely conventional. Throughout music history, there are several examples of systems for emotional communication primarily based on convention (e.g., "the doctrine of affections"; see Buelow, 1983). Through this type of coding, music may achieve a more precise and complex expression, but its recognition will depend on having the necessary knowledge or experience. Hence, emotional expression through this type of coding will necessarily be less cross-culturally invariant and more context and/or listener dependent. Beyond a certain level, the associations will be deeply personal. DeNora (2001) describes the case of "Lucy," whose hearing of the Schubert "Impromptus" brings connotations of "comfort" because her father used to play these pieces when she was falling asleep after dinner.

Cross (2012) notes that music appears to be "a strangely malleable and flexible phenomenon" (p. 265), in that "one and the same piece can bear quite different meanings for performer and listener, or for two different listeners" (p. 266). Some of this so-called *floating intentionality* or "aboutness" (Cross, 2012, p. 266) is perhaps beyond systematic modeling, but may still be explored in terms of in-depth interviews and music analysis. For example, Delis et al. (1978) suggest that listeners construct a story in relation to the music, in order to better remember it. Moreover, some listeners perceive the music to reflect their own personality (how they think and feel), thereby confirming their self-identity (Gabrielsson and Lindström Wik, 2003). Music analyses in the more "hermeneutic" tradition may also be concerned with associative codings: for instance, when Hatten (1994) analyses the *Cavatina* of Beethoven's string quartet op. 130 and notes that "the 'willed' (basically stepwise) ascent takes on a *hopeful* character supported by the stepwise bass . . . " (p. 213, italics added), there is no doubt that this is how Hatten hears the music; there is also little doubt that few other listeners would hear the piece in exactly the same way (unless, perhaps, they have read Hatten's persuasive interpretation).

The real powers of intrinsic and associative sources of perceived emotions in music might lie in their ability to modulate or extend the expression provided by iconically-coded sources, as discussed in the following section.

#### *Codings combined: multiple layers of musical expression*

**Figure 2** illustrates the conceptualization of musical expression proposed in this article: there are three primary types of coding which correspond to three "layers" of musical expression of emotions. The bottom ("core") level is constituted by iconically-coded basic emotions (based on vocal expression). This layer may explain universal recognition of basic emotions, in both vocal (Bryant and Barrett, 2008) and musical (Fritz et al., 2009) expression. However, this layer can be extended, qualified, or even modified by two additional layers in terms of intrinsic and associative coding, enabling listeners to perceive more complex emotions, which however are less cross-culturally invariant. Intrinsically-coded expression may add dynamically changing contours (e.g., variation in "tension," "arousal," or "intensity") which help to shape more time-dependent emotional expressions (e.g., conveying "relief" may depend on changes over time). Associative coding adds an even richer level of complex emotions, although typically with a low level of cross-cultural or even inter-individual agreement. This layer can furthermore be divided into a more "communal" associative subsection/dimension and a more "idiosyncratic" (or deeply personal) subsection/dimension. The "communal" subsection involves the common associations of a particular social group, as constituted by shared experiences (group identity) or musical conventions. At the final layer of expression, the idiosyncratic layer, a listener can perceive just about *any* emotion in the music, through deeply personal associations.

Conceived in this manner, it is easy to see how perception of emotional expression in music might lead to *both* agreement (Juslin, 1997) and disagreement (Huber, 1923); cross-cultural similarities (Fritz et al., 2009) and differences (Gregory and Varney, 1996); a shared meaning (Sloboda and Juslin, 2001, p. 95) and deeply personal meaning (Gabrielsson and Lindström Wik, 2003), sometimes, perhaps, even within the same study or piece of music. Further, one might conceive of "mixed emotions," resulting from different emotional meaning at different layers, somewhat akin to what Cohen (2001) refers to as "emotional polyphony" (p. 252).

This multiple layer notion of musical expression might account for some previous findings. For example, Brown (1981) studied music listeners' ability to recognize emotions in pieces from different styles and genres in classical music. He chose 12 musical excerpts and asked listeners, both musicians and nonmusicians, to sort them into six broad emotion categories. In a second task, listeners were instead required to identify six pairs out of 12 other musical excerpts representing "Variations on Sadness" (i.e., variants *within* the same broad emotion category). While listeners were quite successful in the first task, they were not in the second task, until Brown supplied his own descriptions of the six sadness categories. However, non-musicians were still unsuccessful. Brown thus concluded that if the different expressions are not too similar (as in the first task), the emotion categories can be identified even by persons not highly knowledgeable about classical music; however, with pieces as close in expression as in the variations on sadness "the agreement on synonymous pairs can only be achieved by listeners highly conversant with the traditions involved" (p. 264). One may re-interpret these results as follows: the recognition of broad basic emotion categories was based on iconically coded expression which does not require musical expertise; the recognition of more complex or subtle nuances within the categories was based on associative coding which requires some knowledge of musical conventions.

Similarly, in a recent cross-cultural study of musical expression of emotions by Laukka et al. (2013), it was found that decoding of basic emotions was rather robust regardless of whether the music was familiar or not—presumably because it's based on the core layer of iconically coded expression. In contrast, decoding of non-basic emotions was more limited as it merely occurred for some listener groups and/or for familiar musical cultures. These emotions were probably based to a greater extent on associative coding (e.g., social conventions) at the third layer of expression (see **Figure 2**).

The relative importance of the three layers of musical expression could vary as a function of musical genre, historical context, as well as various listener characteristics. Still, I think that iconic sources tend to be the most powerful—because associative sources are too individual and intrinsic sources are too indeterminate. Hence, iconic sources, linked to basic emotions, account for the lion's share of musical expression. These sources have a clear cross-cultural component due to their direct link to autonomic arousal and the human voice. Consequently, it does not appear far-fetched to assume that the so-called "psychophysical cues" in Balkwill and Thompson's (1999) *cue redundancy model* mainly correspond to iconically-coded basic emotions that can be cross-culturally recognized, whereas their "culture-specific cues" partly correspond to emotions coded more in terms of associative or intrinsic sources. Similarly, a decomposition into different types of coding might help to account for both similarities and differences between vocal expression (e.g., Juslin and Scherer, 2005) and musical expression (Gabrielsson and Juslin, 2003): Iconic coding of basic emotions will tend to be similar across the two channels (Juslin and Laukka, 2003), but associative and intrinsic sources of emotions will diverge, since their different functions in human life will shape conventions underlying their use differently.

#### **THE SPECIAL CASE OF AROUSAL OF FELT EMOTIONS**

So far, this paper has been concerned exclusively with expression and perception of emotions. However, most researchers believe that music can also *arouse* felt emotions in listeners under certain circumstances. This issue is not uncontroversial (Juslin and Västfjäll, 2008), although as eloquently put by Ball (2010), "no one can doubt that some music is capable of exciting some emotion in some people some of the time" (p. 257). It is very important to distinguish arousal of emotions from expression and perception of emotions, because the emotions involved may be different depending on the process (Juslin and Laukka, 2004).

To be clear, the present author has *never* suggested that arousal of felt emotion during music listening is limited to basic emotions; quite the contrary, for over a decade I have repeatedly observed that music arouses a wide range of emotions (Juslin and Laukka, 2004; Juslin, 2005, 2011, 2013; Juslin and Västfjäll, 2008; Juslin et al., 2011). The notion that basic emotions are "privileged" applies *only* to expression and perception of emotion, not to arousal of emotion. (And even in the case of expression and perception, I have allowed for, and examined, more "complex" emotions; e.g., Juslin et al., 2004.) Despite this, it's not uncommon for scholars to give the impression that my position is, or has been, that music arouses only basic emotions. Again, it is helpful to consider the actual findings of relevance to the issue at hand.

Survey studies of the prevalence of musical emotions suggest that music arouses quite a wide range of states. Among the most frequently reported emotions to date are the following broad categories: *Calm-relaxation, happiness-joy, nostalgia-longing, interest-expectancy, pleasure-enjoyment, sadness-melancholy, arousal-energy, love-tenderness, pride-confidence* as well as various synonymous terms (Wells and Hakanen, 1991; Sloboda, 1992; Juslin and Laukka, 2004; Juslin et al., 2008, 2011; Zentner et al., 2008). "Mixed" emotions (e.g., both *joy* and *sadness*) also occur but in a minority of the events (13% in Gabrielsson, 2010; 11% in Juslin et al., 2011).

Hence, previous findings indicate that the emotions aroused by music include basic emotions, but also include many other emotions, depending on which underlying mechanism caused the emotion (see Juslin, 2013, for further discussion). Even the supposedly "music-specific" scale for measuring emotional reactions to music, GEMS (e.g., Zentner et al., 2008), includes basic emotions (e.g., *sad* → SADNESS; *irritated* → TENSION; *in love* → TENDERNESS; *joyful* → JOYFUL ACTIVATION). Thus, Lamont and Eerola (2011) suggest that GEMS "contains significant redundancy in comparison to traditional models" (p. 142). We need to identify the real points of agreement and disagreement: researchers agree that music arouses a wide range of emotions that go beyond basic emotions. They agree that music arouses more positive than negative emotions. They do *not* agree, however, that there exist unique emotions aroused when and only when people listen to music (Juslin, 2013).

#### **CONCLUSION: ENDING THE BASIC-EMOTION BASHING**

Let us return to the main question posed at the outset of this article: What does music express? Or, formulated more precisely: What are the emotional contents that listeners may perceive in music? As noted at the beginning, the question may have different answers depending on how we operationalize the notion of expression: Is it sufficient that any single listener perceives an emotion? Or should there be a minimum level of listener agreement? Or should the perceived emotion correspond to what the composer intended?

These various senses in which music can be said to express emotions are largely integrated in the present approach, which may be summarized as follows: There are three distinct layers of perceived musical expression of emotions. Each layer corresponds to a specific type of coding of emotional meaning. The "core" layer is constituted by iconically-coded basic emotions that can explain recent findings of universal recognition of basic emotions in vocal expression and music. The "core" layer can be extended, qualified and sometimes even modified by additional layers in terms of intrinsic and associative coding, which enable listeners to perceive complex emotions. These additional layers of expression are less cross-culturally invariant, though, and more dependent on the social context and/or the individual listener. At the "core" level of basic emotions, vocal and musical expression are fairly similar. At the additional layers that involve more complex emotions, vocal and musical expression begin to diverge from one another, due to the unique functions and uses associated with each modality. Depending on how expression is coded in particular pieces of music, we may expect to find different results across empirical investigations. Hence, I have argued that one might easily obtain evidence of either cross-cultural invariance or diversity, simply depending on how one is selecting the music in studies (Juslin, 2012).

Research to date has primarily focused on iconically-coded expression of emotions in music. It would thus be interesting to explore in future studies how associative and intrinsic sources contribute to expression, *beyond* basic emotions produced by iconically-coded sources. Still, while there is more to expression in music than basic emotions, as I have tried to show, basic emotions remain at the core of the process, and cannot be ignored. An approach that focuses only on basic emotions presents an incomplete picture (see **Figure 2**), while an approach that ignores basic emotions is plainly inadequate. Recent critiques of the basic-emotion approach in the music field have been marked by myths and misunderstandings (or by hidden agendas). Empirical data, in contrast, illustrate the value of the concept of basic emotions in accounting for musical expression of emotions (section A Positive Explanatory Role of Basic Emotions).

Hence, in closing this essay, I would like to call for an end to the "basic-emotion bashing," in an attempt to offer a more nuanced view. As I have tried to show, a distinction between basic and complex emotions, and its link to various types of coding, can help to account for several findings concerning musical expression of emotions. The basic emotions represent the crucial link between our ancient past and modern music making, and are part of the reason that music is sometimes, perhaps justifiably so, called a universal language of the emotions.

#### **ACKNOWLEDGMENTS**

This research was supported by the Swedish Research Council through a grant to Patrik Juslin (421-2010-2129). I thank the referees for useful comments on an earlier version of this article.

#### **REFERENCES**


*78.* Uppsala: Uppsala University Library.


*Psychol. Rev.* 97, 315–331. doi: 10.1037/0033-295X.97.3.315


from vocal expression correlate across languages and cultures. *J. CrossCult. Psychol.* 32, 76–92. doi: 10.1177/0022022101032001009


on music and emotion," in *Music and Emotion: Theory and Research,* eds P. N. Juslin and J. A. Sloboda (New York, NY: Oxford University Press), 71–104.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 25 June 2013; accepted: 16 August 2013; published online: 06 September 2013.*

*Citation: Juslin PN (2013) What does music express? Basic emotions and beyond. Front. Psychol. 4:596. doi: 10.3389/fpsyg.2013.00596*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Juslin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Repetition and emotive communication in music versus speech

#### **Elizabeth Hellmuth Margulis\***

Music Cognition Lab, Department of Music, University of Arkansas, Fayetteville, AR, USA

#### **Edited by:**

Petri Laukka, Stockholm University, Sweden

#### **Reviewed by:**

Steven Brown, McMaster University, Canada

Adam Ockelford, University of Roehampton, UK

#### **\*Correspondence:**

Elizabeth Hellmuth Margulis, Music Cognition Lab, Department of Music, University of Arkansas, MUSC 324, Fayetteville, AR 72701, USA. e-mail: ehm@uark.edu

Music and speech are often placed alongside one another as comparative cases. Their relative overlaps and disassociations have been well explored (e.g., Patel, 2008). But one key attribute distinguishing these two domains has often been overlooked: the greater preponderance of repetition in music in comparison to speech. Recent fMRI studies have shown that familiarity – achieved through repetition – is a critical component of emotional engagement with music (Pereira et al., 2011). If repetition is fundamental to emotional responses to music, and repetition is a key distinguisher between the domains of music and speech, then close examination of the phenomenon of repetition might help clarify the ways that music elicits emotion differently than speech.

**Keywords: repetition, basal ganglia, sequencing, speech-to-song illusion, ritual**

#### **MUSIC'S REPETITIVENESS IS SPECIAL**

Ethnomusicologist Nettl (1983) identifies musical repetition as a rare cultural universal – a characteristic exhibited by the music of every known human culture. Although some traditions, for example certain strands of contemporary art music in the West, explicitly eschew repetition, they do so in conscious response to a tendency toward musical repetition that exists elsewhere in the culture. Evolutionary biologist Fitch (2006) goes so far as to call repetition a "design feature" of music, essentially constitutive of the communicative form. This repetition can happen within a piece, or across multiple hearings.

Speech, by contrast, features a much lower incidence of repetition, and although the specifics are challenging to quantify, aspects of this distinction are plainly evident. For example, music features a litany of symbols instructing the player to repeat, from repeat signs to da capo indications (Kivy, 1993), whereas written language possesses no such lexicon for repetition. In a plea to abolish the practice of "part-repetition," a tradition in eighteenth-century music whereby performers would repeat large sections of the piece during performance, Ferdinand Praeger appeals to the unpalatability such a practice would have in speech:

Would ever a poet think of repeating half of his poem; a dramatist a whole act; a novelist a whole chapter? Such a proposition would be at once rejected as childish. Why should it be otherwise with music? . . . Since any whole part-repetition in poetry would be rejected as childish, or as the emanation of a disordered brain, why should it be otherwise with music? (Praeger, 1882–1883).

Yet the fact remains not only that sections within musical pieces are often repeated, but also that entire pieces are listened and relistened to hundreds of times, often voluntarily and even enthusiastically.

Garcia (2005) explores the ways that repetition's perceived affiliations with childishness, regression, and insanity (well exemplified by Praeger's remarks) have prevented scholars from acknowledging, let alone investigating, its function in music (with notable exceptions, such as Ockelford, 2005). They've preferred instead to emphasize music's connections with language, long recognized as a legitimate domain of inquiry. But insight into the parallels between music and language has sometimes come at the cost of insight into music's more distinctive qualities, like repetitiveness. So closely affiliated with music is this quality, in fact, that its use within speech can actually serve to engender a perceptual shift whereby an acoustic stimulus first perceived as speech comes to be perceived as music. This phenomenon, the speech-to-song illusion (Deutsch et al., 2011), documents the way that the temporally regular repetition of a particular clause can trigger a startling effect on replay of the entire utterance whereby the speaker, at the start of the clause in question, is heard to suddenly break into song. That the simple act of repetition can so dramatically musicalize speech illuminates its special role in delineating these two communicative domains.

#### **ARE THERE FUNCTIONAL COMMONALITIES UNDERLYING DIFFERENT KINDS OF REPETITION?**

Johnstone's (1994) two-volume edited collection explores a variety of special cases where language is used repetitively, asking fundamentally whether there are things "repetition always does" (p. 12). By way of an answer, Johnstone observes that:

The function of repetition in general is to point, to direct a hearer back to something and say, "Pay attention to this again. This is still salient; this still has potential meaning; let's make use of it in some way." This accounts, for example, for the cognitive utility of repetition to learners, getting the learner's attention on a token of input for a second round in order to have something to work with. We can also call attention to the fact that we're getting one's attention, and we can take that one step further, when awareness of the ability to manipulate allows us to play with attention. Immediacy may be poetic. . . . Repetition is a mode of focusing attention. . . . Repetition focuses attention on the makeup of both the repeated discourse and the earlier discourse. Repetition puts the utterance in brackets making it impossible to treat the language as if it were transparent, by forcing hearers to focus on the language itself. In that sense repetition is metalinguistic (p. 13).

Repetition in speech, in other words, encourages a listener to orient differently to the repeated element, to shift attention down to its constituent sounds on the one hand, or up to its contextual role on the other. For example, if a mob boss in a gangster movie says "take care of it," and is answered by a quizzical look, he might repeat "take care of it!" underscoring for his henchman the darker meaning of the instruction.

The speech-to-song illusion can be understood similarly as a shift to a different level of understanding, in this case, to the utterance's lower-level prosodic aspects. Semantic satiation (Severance and Washburn, 1907), the well-known phenomenon whereby repeatedly speaking a word causes it to shed its semantic associations and devolve into nonsense, can also be understood as a result of an attentional shift down to the word's lower-level phonemic content.

Recent work in music has suggested that in addition to engendering a downward shift, repetition can also trigger an attentional shift up, toward progressively higher levels of the musical structure (Margulis, 2012). When participants were reexposed to the same piece four times in a row and asked during each iteration to press a button each time they heard something repeat, having been previously informed that the repeating thing could range in scope from a two-note motive to a phrase or section, they generally identified repetitions of smaller-scale elements (such as motives) on the first hearing, and then progressively larger-scale elements (such as phrases) across additional exposures. This change can be interpreted as evidence of a shift in orientation from lower-level aspects of the musical structure to higher-level aspects of the musical structure. Although repeated exposures seemed to engender an attentional shift upward for these pieces, I hypothesize that for repertoires with less rich hierarchic structuring, repeated exposures might push attention down to attributes like microtiming and microdynamics.

In speech, then, repetition may be useful in specialized circumstances where a speaker wants a listener to attend to some different, non-obvious aspect of the utterance: its previously unseen relevance to some larger situational context, for example, or its prosodic or lexical content. But speech normally functions to relay a particular semantic meaning; once the message has been conveyed, the particular words used to convey it are no longer relevant. This condition has been explored within the context of the fuzzy trace theory (Reyna and Brainerd, 1995), which posits a distinction between gist and verbatim memory. Speech is normally associated with gist memory; if asked to recount a story, for example, people use different words to describe the events – they've invested in the *meaning* rather than in the particular words used to encode it. Music, by way of contrast, is associated with comparatively keener verbatim memory (Calvert, 1991, 2001). Recent work by Krumhansl (2010) shows that listeners can identify songs remarkably well from clips shorter than half a second, suggesting extremely acute verbatim encoding. Music doesn't serve as a discardable vessel for conceptual meaning in the same way that ordinary uses of speech can; rather, its surface, verbatim content retains communicative significance across repetitions. Moving up and down within its structure across rehearings can yield satisfying varieties of engagement with a piece, revealing a stark contrast in the kind of thing sought after by a listener when hearing a piece of music versus an excerpt of speech.

#### **REPETITION AND EMOTIONAL RESPONSE**

To gain a foothold in the relatively underexplored domain of the emotional impact of musical repetition, it's helpful to examine a better-explored domain that features repetitive behavior, such as ritual. Like music, ritual features unusual degrees of voluntarily undertaken repetition, and also like music, ritual is capable of eliciting strong emotional response. Boyer and Liénard (2006) adopt a framework for event hierarchies from Zacks et al. (2001) to characterize the special behaviors associated with ritual. Within this framework, gestures (on the order of a few seconds) combine to form episodes (such as tying your shoes) which combine in turn to form scripts (such as getting ready for school or eating dinner at a restaurant). It's typically most natural to recall events in terms of episodes, and excessive focus on the lower gestural level can indicate pathologies such as frontal lobe damage or schizophrenia (Janata and Grafton, 2003). But ritual expressly drives attention down to this level, inducing, Boyer and Liénard claim, a special mental state focusing on low-level properties of actions. Associated with the repeated gestures comes a general effect of goal demotion, where the uses to which the gestures are typically put recede and the constituent movements themselves rise to prominence. The excessive repetition also serves as a powerful signal of intentionality, revealing both the internal commitment of the ritual's participant and her ties to a social community that has defined these particular gestures as significant. Shifts in attention, then, of the sort chronicled by the studies reviewed in the first part of this article, might underlie the capacity for a special kind of emotional engagement.

Margulis (2013) arbitrarily inserted repetitions into excerpts of contemporary art music by renowned composers Elliott Carter and Luciano Berio. Everyday listeners without special training or experience with the genre rated the repetition-hacked examples as more likely to have been composed by a human artist and the original versions as more likely to have been randomly generated by a computer. Repetition in music, like repetition in ritual, then, can serve to signal intentionality, and this recognition of intentionality might facilitate the capacity to engage with sounds as emotionally communicative.

#### **INTERNAL IMAGERY, EXTERNAL SOUND, AND STRONG EXPERIENCES OF MUSIC**

One consequence of the prevalence of musical repetition is the phenomenon of the earworm. Liikkanen (2008) finds that over 90% of people report experiencing earworms at least once a week, and more than 25% say they suffer them several times a day. Brown (2006), a neuroscientist at McMaster, has reported extensively on his own "perpetual music track": tunes loop repeatedly in his mind on a near constant basis. Brown observes that the snippets usually last between 5 and 15 s, and repeat continuously – sometimes for hours at a time – before moving to a new segment.

The experience in each of these cases, the earworm and the perpetual music track, is very much one of being occupied by music, as if a passage had really taken some kind of involuntary hold on the mind, and very much one of relentless repetitiveness. The seat of such automatic routines is typically held to be the basal ganglia (Boecker et al., 1998; Nakahara et al., 2001; Lehéricy et al., 2005). Graybiel (2008) has identified episodes where neural activity within these structures becomes locked to the start and endpoints of well-learned action sequences, resulting in a chunked series that can unfurl automatically, leaving only the boundary markers subject to intervention and control. Vakil et al. (2000) showed that the basal ganglia underlie sequence learning even when the sequences lack a distinct motoric component. And, critically, Grahn and Brett (2007, 2009) used neuroimaging to demonstrate the role of the basal ganglia in listening to beat-based music; Grahn and Rowe (2012) show that this role relates to the active online prediction of the beat, rather than the mere *post hoc* identification of temporal regularities.

The circuitry that underlies habit formation and the assimilation of sequence routines, then, also underlies the process of meter-based engagement with music. And it is repetition that defines these musical routines, fusing one note to the next increasingly tightly across multiple iterations. DeBellis (1995) offers this telling example of the tight sequential fusing effected by familiar music: ask yourself whether "oh" and "you" are sung on the same pitch in the opening to *The Star-Spangled Banner.* Most people cannot answer this question without starting at the beginning and either singing through or imagining singing through to the word "you." We largely lack access to the individual pitches within the opening phrase – we cannot conjure up a good auditory image of the way "you" or "can" or "by" sounds in this song, but we can produce an excellent auditory image of the entire opening phrase, which includes these component pitches. The passage, then, is like an action sequence or a habit; we can duck in at the start and out at the end, but we have trouble entering or exiting the music midphrase. This condition contributes to the pervasiveness of earworms; once they've gripped your mind, they insist on playing through until a point of rest. The remainder of the passage is so tightly welded to its beginning that conscious will cannot intervene and apply the brakes; the music spills forward to a point of rest whether you want it to or not.

Reencountering a passage of music involves repeatedly traversing the same imagined path until the grooves through which it moves are deep, and carry the passage easily. It becomes an overlearned sequence, which we are capable of executing without conscious attention. Yet in the case of passive listening, this movement is entirely virtual; it's evocative of the experience of being internally gripped by an earworm, and this parallel forms a tantalizing link between objective, external and subjective, internal experience. This sense of being moved, of being taken and carried along in the mode of a procedural enactment, when the knowledge was presented (by simply sounding) in a way that seemed to imply a more declarative mode, can be exhilarating, immersive, and boundary-dissolving: all characteristics of strong experiences of music as chronicled by Gabrielsson and Lindström's (2003) survey of almost 1000 listeners. Most relevant to the present account are findings that peak musical experiences tended to resist verbal description; to instigate an impulse to move; to elicit quasi-physical sensations such as being "filled" by the music; to alter sensations of space and time, including out-of-body experiences and percepts of dissolved boundaries; to bypass conscious control and speak straight to feelings, emotions, and senses; to effect an altered relationship between music and listeners, such that the listener feels penetrated by the music, or merged with it, or feels that he or she is being played by the music; to cause the listener to imagine him or herself as the performer or composer, or experience the music as executing his or her will; and to precipitate sensations of an existential or transcendent nature, described variously as heavenly, ecstatic or trance-like.

These sensations can be parsimoniously explained as consequences of a sense of virtual inhabitation of the music engendered by repeated musical passages that get procedurally encoded as chunked sequences, activating motor regions and getting experienced as lived/enacted phenomena, rather than heard/cognized ones. It is repetition, specifically, that engages and intensifies these processes, since it takes multiple repetitions for something to be procedurally encoded as an automatic sequence. This mode of pleasure seems closely affiliated with and even characteristic of music, but less so of speech, where emotions are more typically elicited by the listener's relationship to the semantic meaning conveyed by the utterance. This paper argues that the difference in the appetite for repetition between musical and speech-based modes of communication is fundamentally linked with differences in the means by which these modes of communication elicit emotion. Margulis (forthcoming) explores this hypothesis in detail.

#### **REFERENCES**

Brown, S. (2006). The perpetual music track: the phenomenon of constant musical imagery. *J. Conscious. Stud.* 13, 43–62.

*Proc. Natl. Acad. Sci. U.S.A.* 102, 12566–12571.

and communicating structure in events. *J. Exp. Psychol. Gen.* 130, 29–58.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 February 2013; accepted: 17 March 2013; published online: 04 April 2013.*

*Citation: Margulis EH (2013) Repetition and emotive communication in music versus speech. Front. Psychol. 4:167. doi: 10.3389/fpsyg.2013.00167*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Margulis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Emotional communication in speech and music: the role of melodic and rhythmic contrasts

#### **Lena Quinto, William Forde Thompson\* and Felicity Louise Keating**

Department of Psychology, Macquarie University, Sydney, NSW, Australia

#### **Edited by:**

Petri Laukka, Stockholm University, Sweden

#### **Reviewed by:**

L. Robert Slevc, University of Maryland, USA

Renee Timmers, University of Sheffield, UK

#### **\*Correspondence:**

William Forde Thompson, Department of Psychology, Macquarie University, Balaclava Road, BLD C3A, Sydney, NSW 2109, Australia. e-mail: bill.thompson@mq.edu.au

Many acoustic features convey emotion similarly in speech and music. Researchers have established that acoustic features such as pitch height, tempo, and intensity carry important emotional information in both domains. In this investigation, we examined the emotional significance of melodic and rhythmic contrasts between successive syllables or tones in speech and music, referred to as Melodic Interval Variability (MIV) and the normalized Pairwise Variability Index (nPVI). The spoken stimuli were 96 tokens expressing the emotions of irritation, fear, happiness, sadness, tenderness, or no emotion. The music stimuli were 96 phrases, played with or without performance expression and composed with the intention of communicating the same emotions. Results showed that nPVI, but not MIV, operates similarly in music and speech. Spoken stimuli, but not musical stimuli, were characterized by changes in MIV as a function of intended emotion. The results suggest that these measures may signal emotional intentions differently in speech and music.

**Keywords: speech prosody, emotional communication, music cognition and perception, melodic variability index, normalized pairwise variability index**

Several commonalities exist in how emotion is expressed by speech prosody (tone of voice) and music (Sundberg, 1998). Both speakers and musicians convey emotion through cues such as timing, rate, intensity, intonation, and pitch. According to Scherer (1995), the reasons for this similarity may stem from the shared vocal constraints associated with speaking and singing. Based on a meta-analysis of 104 speech studies and 41 music studies, Juslin and Laukka (2003) concluded that similar changes in acoustic features occurred when conveying similar emotions in these domains. This observation led Juslin and Laukka to suggest that there is a "common code" for emotional expression in speech and music. Further evidence comes from studies that examined the emotional consequences of manipulating acoustic attributes. Ilie and Thompson (2006) reported that manipulations of pitch height, intensity, and rate (tempo) in speech and music yielded similar emotional ratings by listeners.

Comparisons of emotional features between the two domains have tended to focus on variables such as changes in intensity, duration, timbre, and pitch (Gabrielsson and Lindström, 2010; Juslin and Timmers, 2010). However, the full range of emotional cues, and their degree of overlap between speech prosody and music, has yet to be determined. The comparison of emotional attributes in music and speech has been challenging because direct analogs do not always exist. Speech and music may each have domain-specific cues to emotion, because they have different structural features and different functions. For example, in music, pitches tend to be discrete and are typically organized hierarchically (Krumhansl, 1990). Pitches may also be specified by the composer and are not under the control of the performer. In speech, pitch movement tends to be continuous, not hierarchically organized, and under the direct discretion of the speaker. Music is also characterized by regular cycles of stress, called meter. The deviations from expected timing contribute to the expressiveness of a musical performance (Palmer, 1997). In speech, rhythm is subtler, and debates exist as to how it is best quantified (see Patel, 2008). These issues represent a difficulty in comparing pitch and rhythm in affective speech and music. As a result, speech and music may each have shared and domain-specific cues to emotion, but only a relatively small number of the most obvious cues to emotion have been investigated.

Recently, two measures have been developed that allow an examination of the changing pitch and rhythmic properties of speech and music. The first, Melodic Interval Variability (MIV), is a measure of pitch variability. MIV takes into account differences in successive intervals (Patel, 2008). MIV is defined as the coefficient of variation (CV = standard deviation/mean) of absolute interval size for a sequence of tones. MIV yields a smaller value when interval changes are less variable, and a larger MIV value when interval changes are more diverse. This allows for comparisons between the variability of intervals in melodies independent of the average interval size.
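The definition of MIV given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis code: the function name `melodic_interval_variability`, the use of semitone values as input, the choice of population standard deviation, and the optional `scale` parameter (included because variants of the measure exist) are all assumptions made here.

```python
import statistics


def melodic_interval_variability(pitches, scale=1.0):
    """Coefficient of variation of absolute interval size (hypothetical helper)."""
    # Absolute size of each interval between successive tones or syllables.
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three pitches (two intervals)")
    mean_interval = statistics.mean(intervals)
    if mean_interval == 0:
        raise ValueError("MIV is undefined when every interval is zero")
    # CV = standard deviation / mean. Population SD is an assumption here;
    # the text does not specify which standard deviation is used.
    return scale * statistics.pstdev(intervals) / mean_interval


# Pitches given as semitone numbers (an assumed input format).
steady = melodic_interval_variability([60, 62, 64, 66])  # intervals 2, 2, 2
varied = melodic_interval_variability([60, 61, 64])      # intervals 1, 3
```

In this sketch `steady` is 0 because every interval has the same size, while `varied` is 0.5, reflecting more diverse interval changes regardless of the average interval size.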

The second measure, the normalized Pairwise Variability Index (nPVI), is a measure of rhythmic contrastiveness between successive durations (Low et al., 2000; Grabe and Low, 2002). It was developed to better understand the rhythmic differences found between languages, such as stress-timed versus syllable-timed languages (Low et al., 2000). Like MIV, a small nPVI indicates uniform durations between successive tones or syllables, whereas a greater nPVI indicates that successive durations are less uniform.<sup>1</sup> The nPVI is an overall contrast value based on the length of successive syllables or tones.
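As a rough sketch of the computation just described, the nPVI can be written as the mean of the absolute differences between successive durations, each divided by the mean of the pair, multiplied by 100. The function name `npvi` and the input format (a list of positive durations in any consistent unit) are assumptions for illustration only.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index for a sequence of durations (sketch)."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")
    # Contrast for each adjacent pair: absolute difference divided by the pair mean.
    contrasts = [
        abs(a - b) / ((a + b) / 2)
        for a, b in zip(durations, durations[1:])
    ]
    # Average the m - 1 contrasts and scale by 100.
    return 100 * sum(contrasts) / len(contrasts)


uniform = npvi([0.2, 0.2, 0.2, 0.2])      # identical successive durations
alternating = npvi([0.1, 0.3, 0.1, 0.3])  # short-long alternation
```

Here `uniform` is 0, matching the statement that a small nPVI indicates uniform successive durations, and `alternating` comes out at about 100. Because each contrast is normalized by the mean of its own pair, multiplying all durations by a constant (a change of overall rate) leaves the value unchanged.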

These measures were developed independently, and variations of each measure exist. In the calculations documented by Patel et al. (2006), MIV is normalized with respect to the average interval between adjacent syllables or tones, and nPVI is normalized with respect to the average durations of adjacent syllables or tones.

MIV and nPVI measure distinct attributes – pitch and time – but have been examined together in the work of Patel and colleagues. Early anecdotal evidence suggested that a composer's instrumental music was influenced by their nationality and language. However, because of difficulties in assessing the structural attributes of music, this hypothesis was difficult to test. Patel et al. (2006) used MIV and nPVI to compare the spoken language of a composer and the structural patterns found in their music. Patel et al. found that French speech has lower MIV and nPVI values than English speech. Similarly, music written by French composers has lower MIV and nPVI values than music written by English composers. These findings suggest that a composer's language may influence the pitch and rhythmic properties of their music.

To date, there is no widely accepted account for why average measures of MIV and nPVI differ between languages, and there is little understanding of the degree to which these variables are perceivable. Patel (2008) offers a few reasons for the observed pattern of results. One reason for differences in MIV is that English speech may have more pitch levels than French speech – allowing for greater variability. Another reason is that composers may have internalized pitch and timing patterns in the speech of their culture and these patterns are reflected in their music. Research on rhythm and language suggests that nPVI differs between languages for a few reasons. One possibility is that speakers of different languages vary in the amount of vowel reduction, a second is that languages differ in the proportion of vowels in a sentence, and a third is that there may be differences in the variability of vowel types within a language (Patel et al., 2006). Additionally, there are currently no data on the degree to which people are sensitive to changes in MIV, but a study by Hannon (2009) indicates that participants can reliably classify sequences that vary in nPVI.

In summary, speech prosody and music are powerful channels of emotional communication. Previous research has found that MIV and nPVI are important attributes in both domains (e.g., Patel et al., 2006), yet the relevance of these attributes to emotional communication has never been examined. Our aims were (a) to determine whether these features carry emotional information in one or both domains and (b) to determine if they are associated with emotions in the same way in affective speech and music, or whether they operate differently in the two domains. First, we generated spoken and melodic stimuli conveying six emotional intentions. Next, stimuli were acoustically analyzed to assess differences in MIV and nPVI for each emotion and domain. We predicted that both measures would vary as a function of the intended emotion, but there are no clear grounds for making specific hypotheses. For example, it might be expected that high levels of MIV and nPVI would be associated with high-arousal emotions such as happiness and fear, because high values reflect greater pitch and durational contrasts. On the other hand, as there are no data to support such a hypothesis, the opposite could also be true. Melodies with consistently short durations (fast tempo, low nPVI) and consistently large pitch changes (low MIV) might also be expected in high-arousal emotions. Based on evidence that music and speech share a common emotional code, we predicted that these measures would carry similar emotional information in the two domains.

<sup>1</sup>The following equation shows how the nPVI is calculated:

$$\text{nPVI} = \frac{100}{m - 1} \times \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{\frac{d_k + d_{k+1}}{2}} \right|$$

*m* is the number of durational elements in a sequence and *d<sub>k</sub>* is the duration of the *k*th element. Individual contrast values between each syllable or tone are computed, summed and averaged to yield the nPVI for an entire sentence.

#### **MATERIALS AND METHODS**

#### **SPOKEN STIMULI**

Speakers were asked to emotionally express semantically neutral phrases such as "The boy and girl went to the store to fetch some milk for lunch." Each sentence had 14 syllables and was expressed with the intention to communicate each of the six emotions of irritation, fear, happiness, sadness, tenderness, and neutral or no emotional expression. These emotions were selected because they have been identified as frequently used in previous studies (Juslin and Laukka, 2003) and involve a range of acoustic features.

#### **Speakers**

Six male and seven female speakers provided samples of emotional speech. Their average age was 23.65 years. All speakers were paid \$15 for their participation.

#### **Procedure**

Speakers were asked to read a description of an affective scenario that was associated with one of the target emotions. We adopted this procedure to prepare speakers to verbally communicate the target emotion. Once they had read the scenario, they vocalized each of seven sentences while attempting to convey the intended emotion. This process was repeated for each emotion (irritation, fear, happiness, sadness, tenderness, or neutral). An experienced recording engineer provided feedback and coaching regarding the emotional expression of each sentence. The coaching did not involve suggestions for the use of cues to express emotion but rather encouragement to attempt additional renditions. Speakers were allowed to repeat each sentence until both they and the recording engineer were satisfied that the intended emotion was communicated.

#### **Recording**

Speakers were recorded in a professional recording studio at a sample rate and bit depth of 44.1 kHz/16 bit-mono. They spoke into a K2 condenser microphone (RØDE microphones) and were recorded with Cubase 4 (2008).

#### **Pre-rating**

Initially, 462 recordings were obtained (11 speakers × 6 emotions × 7 sentences). These recordings were then assessed in a pilot investigation involving 13 male and 22 female undergraduate students at Macquarie University (mean age = 21.49, SD = 4.75), with an average of 3.16 (SD = 4.00) years of musical experience. Participants heard a subset of the stimuli and made a forced-choice decision of the emotion they believed was conveyed. Their options were the six emotional intentions conveyed by the actors. Decoding accuracy was determined for every recording: the 16 most accurately decoded recordings were selected for each of the six emotions, resulting in 96 recordings balanced for speaker sex. This procedure was adopted to ensure that the intended emotions were expressed and to reduce the battery to a manageable size for analysis. The resultant battery of 96 spoken phrases (Macquarie Battery of Emotional Prosody, or MBEP) can be downloaded from the second author's website at www.psy.mq.edu.au/me2. **Table 1** summarizes some of the acoustic features associated with each intended emotion.
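The selection step can be sketched as follows: compute each recording's decoding accuracy from the forced-choice responses, then keep the most accurately decoded recordings per intended emotion. The function name, the tuple layout of `responses`, and the toy data are all hypothetical, and the sketch does not reproduce the balancing for speaker sex or any tie-breaking rule, neither of which is specified in the text.

```python
from collections import defaultdict


def select_best_decoded(responses, per_emotion=16):
    """Pick the most accurately decoded recordings for each intended emotion.

    `responses` is an iterable of (recording_id, intended_emotion,
    chosen_emotion) tuples, one per listener judgment (assumed layout).
    """
    hits = defaultdict(int)
    trials = defaultdict(int)
    intended = {}
    for recording, target, chosen in responses:
        trials[recording] += 1
        hits[recording] += int(chosen == target)
        intended[recording] = target

    # Proportion of listeners who chose the intended emotion.
    accuracy = {rec: hits[rec] / trials[rec] for rec in trials}

    by_emotion = defaultdict(list)
    for recording, emotion in intended.items():
        by_emotion[emotion].append(recording)

    # Sort each emotion's recordings by accuracy, best first, and truncate.
    return {
        emotion: sorted(recs, key=lambda r: accuracy[r], reverse=True)[:per_emotion]
        for emotion, recs in by_emotion.items()
    }


# Toy, made-up judgments for three recordings.
toy_responses = [
    ("rec_a", "fear", "fear"),
    ("rec_a", "fear", "fear"),
    ("rec_b", "fear", "sadness"),
    ("rec_b", "fear", "fear"),
    ("rec_c", "happiness", "happiness"),
]
selected = select_best_decoded(toy_responses, per_emotion=1)
```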

### **MUSICAL STIMULI**

#### **Musicians**

Four violinists and four vocalists created the stimuli. Violin and voice were selected because both instruments allow musicians to use a wide range of performance features. All musicians were currently performing or had completed higher-level examinations for their instruments. The two vocalists who had not completed formal examinations had been actively singing for 17 and 20 years. On average, the musicians had 15 (SD = 3.89) years of formal training, with an average time of 21.15 (SD = 9.41) years performing.

#### **Procedure**

We asked musicians to compose brief melodies with the intention of expressing the emotions of anger, fear, happiness, sadness, tenderness, and neutral. They were asked to compose melodies for their own instrument and to limit their compositions to a maximum of nine notes (range = 5–9 notes, average = 7.40 notes). Examples are illustrated in **Figures 1A–F**. In the *live condition*, musicians performed their own compositions in a manner that reinforced the emotion that was intended in each composition. In the *deadpan condition*, compositions were notated in MIDI format using Cubase. Deadpan compositions were recorded using timbres selected from a Roland super JV-1080 64 voice synthesizer with four expansion modules. Compositions produced by violinists were recorded using timbre 41 from the XPA preset bank (violin); compositions produced by vocalists were recorded using timbre 54 from the D (GM) preset bank (voice). The tempo of each melody in the deadpan condition was matched to the tempo as performed by the musician in the live condition. This procedure resulted in 96 stimuli (8 musicians × 6 emotions × 2 manners of performance). The stimuli in the live condition differed from the deadpan condition because performers had the ability to deviate from the notated pitch and rhythmic information. Two judges with at least 10 years of music training independently confirmed that the intended emotion was expressed in all cases. All musicians were paid \$40 for their participation. In our study, pitch varied depending on emotional intention, whereas other properties of the sequence (whether verbal material or instrument timbre) were constant.

#### **Recording of musical stimuli**

Musicians were recorded in a quiet (testing) room at a sample rate and bit depth of 44.1 kHz/16 bit-mono. Performances were recorded using a K2 condenser microphone (RØDE microphones) and saved into Cubase 4 (2008). **Table 2** summarizes some of the acoustic attributes associated with each intended emotion.

#### **ACOUSTIC ANALYSIS**

The first two authors parsed each sentence and musical phrase manually with text grids using Praat (Boersma and Weenink, 2010). Text grids marked the boundary of every syllable and note in each phrase. In both music and speech, large glides were not considered stable pitches and were ignored. MIV was computed by measuring the interval distance between successive syllables or tones in semitones. For each syllable or tone, the mean frequency in hertz was calculated using Praat. The interval distance was then calculated in semitones using the formula $12 \times \log_2(\mathrm{Hz}_1/\mathrm{Hz}_2)$. Interval distances in semitones were then used to compute MIV by dividing the standard deviation in interval size by the mean interval size for each sentence or musical phrase. The nPVI was computed by measuring the duration of each syllable or tone from its onset to the onset of the next syllable or tone. Periods of silence were included in the calculation of nPVI but not of MIV.
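The pitch side of this procedure can be sketched in a few lines: convert successive mean frequencies to semitone intervals, then divide the standard deviation of the interval sizes by their mean. Several details are assumptions rather than facts from the text: the function names, the use of absolute interval sizes, the sample (n − 1) standard deviation, and the absence of a scaling factor (some formulations multiply the ratio by 100).

```python
import math
import statistics


def semitone_interval(hz_1, hz_2):
    """Signed distance from hz_2 to hz_1 in semitones: 12 * log2(hz_1 / hz_2)."""
    return 12 * math.log2(hz_1 / hz_2)


def miv(mean_f0_hz):
    """Melodic interval variability for a sequence of mean pitches in hertz.

    Computed here as the coefficient of variation of absolute interval
    sizes (an assumption; see the lead-in). Needs at least three pitches,
    i.e., at least two intervals.
    """
    if len(mean_f0_hz) < 3:
        raise ValueError("MIV needs at least three pitches (two intervals)")
    sizes = [
        abs(semitone_interval(later, earlier))
        for earlier, later in zip(mean_f0_hz, mean_f0_hz[1:])
    ]
    return statistics.stdev(sizes) / statistics.mean(sizes)


# Hypothetical pitches: an octave up, then an octave back down.
uniform = miv([220.0, 440.0, 220.0])      # both intervals are 12 semitones
# An octave followed by a tritone gives unequal interval sizes.
mixed = miv([220.0, 440.0, 440.0 * 2 ** 0.5])
```

A low value means successive intervals are similar in size regardless of how large they are, which is why MIV and average interval size can carry different information.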

**Table 1 | Means associated with the acoustic features of the Macquarie battery of emotional prosody (standard errors are shown in parentheses).**


**Table 2 | Means associated with the acoustic features of the musical stimuli (standard errors are shown in parentheses).**


A short comparison of some acoustic features measured in both sets of stimuli (mean fundamental frequency, the standard deviation in fundamental frequency, duration, and intensity) revealed that there were similar changes in both domains depending on emotion. However, some exceptions did occur. For example, in music, anger was associated with a low pitch, whereas in speech, irritation (or mild anger) was associated with a high pitch. There are two potential reasons for the observed differences. The first is that any acoustic feature is associated probabilistically with emotional expression. Thus, the use of any one feature in the expression of an emotion may change depending on the rendition or portrayal. A second reason may be the semantic labels associated with the emotions. Note that five of the emotions in the two stimulus sets were the same. However, whereas the speech prosody stimulus set included the negative emotion of irritation, the music stimulus set included the (more intensely negative) emotion of anger. In other words, one emotion category differed in the intensity of the emotion. This difference in semantic labels arose inadvertently because the two stimulus sets were developed independently.

There were some features that were not common to both sets of stimuli. For example, in the spoken stimuli there was a downward pitch trend for phrases intended to communicate irritation and an upward pitch trend for phrases intended to communicate happiness. This was referred to as slope and was an average measure of the pitch movement. The spoken stimuli also differed in the number of pitch direction changes that occurred. Phrases expressing happiness had a greater number of pitch changes than those expressing irritation.

In the music stimuli, pieces intended to have a negative valence (i.e., anger, sadness, and fear) were more strongly correlated with a minor mode than pieces intended to have a positive valence (i.e., happiness, tenderness). There was also a trend whereby the average interval size was larger for pieces expressing happiness than for pieces expressing sadness. Additional details can be found in Thompson et al. (2012) and Quinto et al. (in press).

#### **RESULTS**

#### **SPOKEN STIMULI**

Separate linear mixed effects models were fitted for the spoken and musical stimuli, and for each of the two dependent variables. The 96 recordings were the observations in each analysis. A linear mixed effects analysis was selected because the stimuli did not reflect independent observations and because the spoken stimuli did not include equal numbers of recordings per speaker or per sentence. It was important to account for the effects of using the same speaker (or musician) and sentence repeatedly. Stimuli that used the same speaker or the same sentence might be expected to be more similar (correlated) than stimuli that differ with respect to these variables. For the spoken stimuli, the variables of sentence and speaker were entered as random effects (intercepts). For the musical stimuli, the variable of performer was entered as a random effect (intercept).
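The random-intercept structure described here can be written compactly. The notation below is an assumed shorthand introduced for illustration and is not taken from the original analysis: $y_{ijk}$ is the dependent variable (MIV or nPVI) for recording $k$ of sentence $j$ by speaker $i$, $\beta_{e}$ is the fixed effect of the intended emotion $e$ of that recording, and $u_i$ and $v_j$ are the speaker and sentence intercepts.

$$y_{ijk} = \beta_0 + \beta_{e} + u_i + v_j + \varepsilon_{ijk}, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad v_j \sim \mathcal{N}(0, \sigma_v^2), \quad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)$$

For the musical stimuli, the sentence term $v_j$ would be dropped and $u_i$ would index the performer. Under this reading, the covariance parameters reported for each model presumably correspond to the estimated spread of the random intercepts ($\sigma_u$, $\sigma_v$).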

For the spoken stimuli, the linear mixed effects analysis with MIV as the dependent variable and emotion as a fixed effect revealed a significant main effect of emotion, *F*(5, 82.86) = 4.27, *p* = 0.002. Means and standard deviations are shown in **Table 3**. Pairwise tests with Bonferroni correction showed that expressions of happiness had a lower MIV value than expressions of sadness, *t*(82.86) = 3.84, *p* < 0.001; and tenderness, *t*(82.86) = 3.47, *p* < 0.001. No other significant differences emerged. The results suggest that portrayals of happiness in speech are associated with relatively low variability in successive interval size, whereas portrayals of sadness and tenderness are associated with higher variability in interval size. The covariance parameter indicated that the sentence standard deviation (range of the intercept) was 2.45 (Wald *Z* = 0.28, *p* = 0.78). This suggests that the variation between sentences was small. The addition of speaker as a random effect showed that the covariance parameter was redundant, suggesting that there was not enough variance or that the variances in speakers were highly correlated.
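For readers unfamiliar with the correction, the sketch below shows the arithmetic behind a Bonferroni adjustment for all pairwise comparisons among six emotion categories. The function name and the raw p-values are hypothetical and chosen only for illustration; whether the reported p-values were adjusted in exactly this way is not stated in the text.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


emotions = ["irritation", "fear", "happiness", "sadness", "tenderness", "neutral"]
# Six categories give 6 * 5 / 2 = 15 unordered pairs.
pairs = list(combinations(emotions, 2))

# Made-up raw p-values for three comparisons, adjusted as a family of three.
adjusted = bonferroni([0.01, 0.04, 0.5])
```

With 15 comparisons, a raw p-value must fall below 0.05 / 15 (about 0.0033) to remain significant at the 0.05 level after adjustment.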

The average interval size was also assessed to demonstrate the independence of information provided by the variables of MIV and average interval size. A second linear mixed effects model with the average interval size as the dependent variable, and speaker and sentence as random effects, also revealed a significant main effect of emotion, *F*(5, 86.92) = 15.72, *p* < 0.001. Pairwise tests revealed that happiness had a greater average interval size than all other emotions, *t*'s(86.92) > 3.12, *p* < 0.001. This finding demonstrates that MIV and average interval size provide different types of information in emotional speech. Specifically, while happiness might be associated with low variability in interval size, the types of intervals that are associated with the expression of happiness tend to be larger as compared to other emotions. Similarly, sadness and tenderness were associated with smaller to intermediate interval sizes, yet relatively higher MIV values were associated with these emotions. The covariance parameter indicated that the sentence standard deviation was 0.28 (Wald *Z* = 0.81, *p* = 0.42).

A linear mixed effects model with nPVI as the dependent variable, emotion as the fixed effect, and sentence and speaker as random effects revealed a significant effect of emotion, *F*(5, 81.37) = 2.88, *p* = 0.02. This effect arose from neutral or "no emotion" expressions having a lower nPVI value than the emotional expressions. Tests of simple effects with Bonferroni correction showed that "no emotion" expressions had a significantly lower nPVI than tenderness, *t*(81.37) = 3.19, *p* = 0.05. The means for the spoken nPVI in **Table 3** reflect estimated marginal means that take into account the effects of sentence. The covariance parameter indicated that the speaker standard deviation was 3.86 (Wald *Z* = 1.48, *p* = 0.14) and that the sentence standard deviation (intercept) was 10.80 (Wald *Z* = 1.67, *p* = 0.09). The relatively large standard deviation suggests that there was considerable variance between the sentences.

#### **MUSICAL STIMULI**

For the musical stimuli, we fitted a linear mixed effects model with MIV as the dependent variable and with mode of presentation (live versus deadpan) and emotion as the independent variables. Performer was treated as a random factor. There was no effect of mode of presentation, *F*(1, 77) = 0.14, *p* = 0.71, nor was there an effect of emotion, *F*(5, 77) = 0.54, *p* = 0.70. The interaction between emotion and mode of presentation also was not significant, *F*(5, 77) = 0.14, *p* = 0.98. This finding suggests that MIV did not distinguish emotional portrayals in music. The covariance parameter indicated that the standard deviation for performer was 13.80 (Wald *Z* = 1.51, *p* = 0.13).



Note that the means are unweighted for all variables except nPVI in the spoken condition. Unit for interval size is semitones.

The linear mixed effects model with nPVI as the dependent variable revealed a significant main effect of mode of presentation, *F*(1, 77) = 9.39, *p* = 0.003. The nPVI was significantly higher in the live condition (*M* = 54.03, SD = 22.48) than in the deadpan condition (*M* = 42.26, SD = 19.03). This finding demonstrates that performers enhanced the durational contrasts between successive tones when performing their compositions as compared to the deadpan renditions. There was also a significant main effect of emotion, *F*(5, 77) = 4.03, *p* = 0.003. As shown in **Table 3**, nPVI was significantly lower for neutral expressions than for melodies conveying anger, *t*(77) = 3.55, *p* < 0.001; sadness, *t*(77) = 3.60, *p* < 0.001; and (marginally for) happiness, *t*(77) = 2.15, *p* = 0.056. The interaction between emotion and mode of presentation was not significant, *F*(5, 77) = 0.14, *p* = 0.98. The covariance parameter indicated that the standard deviation for performer was 6.62 (Wald *Z* = 1.11, *p* = 0.27).

#### **DISCUSSION**

The results of this investigation confirm that both MIV and nPVI can reflect emotional intentions. MIV varied as a function of emotional intentions in speech but not in music, whereas nPVI differentiated emotional from non-emotional portrayals in both domains. While similarities have been documented in the expression of emotion in speech and music (Scherer, 1995; Juslin and Laukka, 2003; Bowling et al., 2010; Curtis and Bharucha, 2010) and emotional experiences (Ilie and Thompson, 2011; Coutinho and Dibben, 2012), the current findings reveal differences in the use of pitch contrasts as emotional information in the two domains. Differences between music and speech in the cues used to communicate emotions are hardly surprising: music contains features that have no clear analog in speech, such as harmony and the tonal hierarchy. However, nPVI was used similarly in both domains to differentiate emotional from non-emotional sentences and musical phrases.

In speech, MIV values were lower for portrayals of happiness than for portrayals of sadness, tenderness, and neutral. This means that changes in interval size were more uniform for happiness than the other emotions. Yet portrayals of happiness had the greatest average interval size as compared to other emotional portrayals (see also Banse and Scherer, 1996). From a physiological perspective, greater pitch variability may reflect a higher arousal level but the consistency of interval changes may reflect the speaker's ability to control the regularity of the pitch. Since acoustic features only probabilistically contribute to the expression of a given emotion, and do not always behave consistently, this finding contributes another acoustic feature that could be used to differentiate emotions in speech.

An analysis of musical stimuli revealed no significant effect of emotion on MIV. Thus, whereas previous studies have found that pitch-based cues such as pitch height, pitch range, pitch variability, and modality contribute to emotional communication (Ilie and Thompson, 2006; Gabrielsson and Lindström, 2010; Quinto et al., in press), MIV does not appear to be involved in the communication of emotion in music. Nonetheless, recent investigations suggest that features considered to be music-specific, including interval size and mode, may actually play a role in emotional speech (Bowling et al., 2010; Curtis and Bharucha, 2010). Specifically, excited and happy speech have been found to contain a higher proportion of major intervals (happy sounding) than minor intervals (sadder sounding), and sad speech has been found to contain a greater proportion of minor intervals than major intervals. One potential reason for the null result is that conventional associations between pitch structure and emotion, such as modality, may guide the creation of emotional music. Hence, the extent to which other pitch cues, such as MIV, could vary may be restricted.

For speech, nPVI was significantly greater for emotional utterances than for neutral or non-emotional utterances. One interpretation of this finding is that durational contrasts function to attract and maintain attention, enhancing sensitivity to emotional messages. Changing-state sounds, including changes in duration as measured by nPVI, are known to capture attention (Jones and Macken, 1993). By capturing attention, nPVI may fulfill a primary goal of emotional communication, increasing the capacity of speakers to influence the perceptions and behaviors of others (Bachorowski and Owren, 2003).

As in the speech stimuli, nPVI values in music were significantly lower in deadpan melodies than live recordings, indicating that the use of performance expression involved enhancing rhythmic contrasts between tones relative to strict notation. This increase in nPVI for performed melodies occurred for all intended emotion categories: there was no significant interaction between emotion and mode of presentation. There was also a significant main effect of emotion on nPVI. This effect was driven primarily by a comparatively large difference between mean nPVI values for neutral and emotional music, regardless of the mode of presentation. Among the five non-neutral (emotional) portrayals, differences between the nPVI values were relatively small and not statistically reliable in *post hoc* analyses. Taken together, nPVI was primarily effective in distinguishing (a) performed from deadpan music, and (b) emotional from non-emotional music.

These findings demonstrate that the processes of both composition and performance independently contribute to changes in rhythmic contrasts. Performance expression introduces these rhythmic contrasts regardless of emotional intention. This finding is compatible with the "duration contrast" rule in the KTH rule system for musical performance, proposed by Sundberg and colleagues (Sundberg et al., 1982; Thompson et al., 1989; Friberg et al., 2006). This finding also extends research reported by Gabrielsson and Juslin (1996), who observed that emotional performances tend to be characterized by exaggerated timing deviations and durational contrasts. Our results suggest that nPVI provides an effective quantification of this expressive phenomenon.

A novel finding of this investigation is that emotional music, whether performed or deadpan, is characterized by increased durational contrasts as measured by nPVI. Compared to music that was composed to be emotionally neutral, music composed to be emotional contained increased durational contrasts.

The current data confirm that MIV and nPVI are associated with emotional communication in music and speech, but the extent to which these attributes are actually used by listeners for decoding has yet to be determined. Evidence does suggest that listeners are able to perceive differences in nPVI (Hannon, 2009) and unpublished work in our lab suggests that participants can differentiate levels of MIV. However, the extent to which these attributes aid listeners in decoding emotion is uncertain.

A direction for future research might be to assess the role of MIV and nPVI when other cues are restricted. For example, MIV may play a greater role in emotional expression for atonal music, for which influences by modality and the tonal hierarchy are absent. It is also unclear whether the results for our stimuli can be generalized to speech and music produced naturally (Scherer et al., 2003). Finally, given the cross-cultural findings of Patel et al. (2006), it seems possible that the emotional connotations of MIV are not only domain-specific but may operate differently in different languages. For example, it is possible that when emotional English is compared to emotional French, the use of MIV as an emotional cue may be observed. While emotional decoding across cultures is relatively good, individuals within a culture are still better able to identify emotion than outsiders. These cultural differences in emotional communication are referred to as "pull effects" (Scherer et al., 2003) and can account for the experience of non-native speakers of a language misunderstanding the emotional intentions of native speakers (Wierzbicka, 1999; Mesquita, 2003).

We note that there are some differences between our spoken and musical stimuli that could have influenced the results. One difference between the two stimulus sets concerns the semantic labels used for one of the emotions (anger versus irritation). A second difference concerns the manner in which the stimuli were selected. It may be useful in future research to examine spoken and musical stimuli that have been developed using similar selection criteria and have matching semantic labels.

It could be argued that differences in the attributes involved in musical and spoken stimuli may account for some of the effects observed. Specifically, because spoken material contained words and the musical material involved instrument timbres, these properties may have exerted an influence on the pitch (MIV) and timing (nPVI) of the stimuli. Such influences could occur if, for example, sentences varied in the number of words that are naturally spoken with different melodic intonation (regardless of any emotional intention), and musicians composed melodies with different degrees of durational contrast depending on the instrument that they played. However, our analyses took into account these possible effects.

To conclude, nPVI reflects a "common code" of emotional expression in music and speech but MIV does not. Based on observations by Patel et al. (2006) and Juslin and Laukka (2003), it was expected that both measures would change similarly in speech and music. MIV may be a useful predictor of emotional speech, whereas the nPVI may help to differentiate emotional from non-emotional music and speech. The use of universal cues (e.g., loudness, pitch, tempo) may allow individuals to decode unfamiliar emotional speech (Elfenbein and Ambady, 2002) and music (Balkwill and Thompson, 1999; Fritz et al., 2009). However, it is possible that attributes of music and speech have domain-specific constraints, which may mean that only some cues are effective. The task of differentiating universal, domain-specific, and culture-specific cues to emotion is an exciting challenge for future research.

#### **ACKNOWLEDGMENTS**

This research was supported by a grant awarded to the second author from the Australian Research Council DP 0879017. We would like to thank Alex Chilvers, Bojan Neskovic, Catherine Greentree, Rachel Bennetts and Julia Case for technical support.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 November 2012; accepted: 26 March 2013; published online: 24 April 2013.*

*Citation: Quinto L, Thompson WF and Keating FL (2013) Emotional communication in speech and music: the role of melodic and rhythmic contrasts. Front. Psychol. 4:184. doi: 10.3389/fpsyg.2013.00184*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Quinto, Thompson and Keating. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## On the acoustics of emotion in audio: what speech, music, and sound have in common

#### **Felix Weninger<sup>1</sup>\*, Florian Eyben<sup>1</sup>, Björn W. Schuller<sup>1,2</sup>, Marcello Mortillaro<sup>2</sup> and Klaus R. Scherer<sup>2</sup>**

<sup>1</sup> Machine Intelligence and Signal Processing Group, Mensch-Maschine-Kommunikation, Technische Universität München, Munich, Germany <sup>2</sup> Centre Interfacultaire en Sciences Affectives, Université de Genève, Geneva, Switzerland

#### **Edited by:**

Anjali Bhatara, Université Paris Descartes, France

#### **Reviewed by:**

Jarek Krajewski, University of Cologne, Germany

Gabriela Ilie, St. Michael Hospital, Canada

#### **\*Correspondence:**

Felix Weninger, Technische Universität München, Theresienstraße 90, 80333 Munich, Germany. e-mail: weninger@tum.de

Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning each of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow's pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that by selection of appropriate descriptors, cross-domain arousal and valence regression is feasible, achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.

**Keywords: audio signal processing, emotion recognition, feature selection, transfer learning, music perception, sound perception, speech perception**

#### **1. INTRODUCTION**

Without doubt, emotional expressivity in sound is one of the most important methods of human communication. Not only human speech, but also music and ambient sound events carry emotional information. This information is transmitted by modulation of the acoustics and decoded by the receiver – a human conversation partner, the audience of a concert, or a robot or automated dialog system. By that, the concept of emotion that we consider in this article is the one of consciously conveyed emotion (in contrast, for example, to the "true" emotion of a human related to biosignals such as heart rate). In speech, for example, a certain affective state can be transmitted through a change in vocal parameters, e.g., by adjusting fundamental frequency and loudness (Scherer et al., 2003). In music, we consider the emotion intended by the composer of a piece – and by that, the performing artist(s) as actor(s) realizing an emotional concept such as "happiness" or "sadness." This can manifest through acoustic parameters such as tempo, dynamics (forte/piano), and instrumentation (Schuller et al., 2010). In contrast to earlier research on affect recognition from singing (e.g., Daido et al., 2011), we focus on polyphonic music – by that adding the instrumentation as a major contribution to expressivity. As a connection between music and speech emotion, for example, the effect of musical training on human emotion recognition has been highlighted in related work (Nilsonne and Sundberg, 1985; Thompson et al., 2004). Lastly, also the concept of affect in sound adopted in this article is motivated by the usage of (ambient) sounds as a method of communication – to elicit an intended emotional response in the audience of a movie, radio play, or in the users of a technical system with auditory output.

In the field of affective computing, there is currently some loosely connected research concerning either of these phenomena (Schuller et al., 2011a; Drossos et al., 2012; Yang and Chen, 2012). Despite a number of perception studies suggesting overlap in the relevant acoustic parameters (e.g., Ilie and Thompson, 2006), a holistic computational model of affect in general sound is still lacking. In turn, for tomorrow's technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output.

In order to move toward such a unified framework for affect analysis, we consider feature relevance analysis and automatic regression with respect to continuous observer ratings of the main dimensions of affect, arousal and valence, across speech, music, and ambient sound events. Thereby, on the feature side, we restrict ourselves to non-symbolic acoustic descriptors, thus eliminating more domain-specific higher-level concepts such as linguistics, chords, or key. In particular, we use a well proven set of "low-level" acoustic descriptors for paralinguistic analysis of speech (cf. Section 2.3). Then, we address the importance of acoustic descriptors for the automatic recognition of continuous arousal and valence in a "cross-domain" setting. We show that there exist large commonalities but also strong differences in the worth of individual descriptors for emotion prediction in the various domains. Finally, we carry out experiments with automatic regression on a selected set of "generic acoustic emotion descriptors."

#### **2. MATERIALS AND METHODS**

#### **2.1. EMOTION MODEL**

Let us first clarify the model of emotion employed in this article. There is a debate in the field on which type of model to adopt for emotion differentiation: discrete (categorical) or dimensional (e.g., Mortillaro et al., 2012). We believe that these approaches are highly complementary. It has been copiously shown that discrete emotions in higher dimensional space can be mapped parsimoniously into lower dimensional space. Most frequently, the two dimensions valence and arousal are chosen, although it can be shown that affective space is best structured by four dimensions – adding power and novelty to valence and arousal (Fontaine et al., 2007). Whether to choose a categorical or dimensional approach thus depends on the respective research context and the specific goals. Here, we chose a valence × arousal dimensional approach because of the range of affective phenomena underlying our stimuli. In addition, for some of our stimulus sets only dimensional annotations were available.

#### **2.2. DATABASES**

Let us now start the technical discussion in this article with a brief introduction to the data sets used in the present study on arousal and valence in speech, music, and sound. The collection of emotional audio data for the purpose of automatic analysis has often been driven by computer engineering. This is particularly true for speech data – considering applications, for example, in human-computer interaction. This has led to large databases of spontaneous emotion expression, for example, emotion in child-robot interaction (Steidl, 2009) or communication with virtual humans (McKeown et al., 2012), which are however limited to specific domains. In contrast, there are data sets from controlled experiments, featuring, for example, emotions expressed ("enacted") by professional actors, with restricted linguistic content (e.g., phonetically balanced pseudo sentences) with the goal to allow for domain-independent analysis of the variation of vocal parameters (Burkhardt et al., 2005; Bänziger et al., 2012). In the case of polyphonic music, data sets are mostly collected with (commercial) software applications in mind – for example, categorization of music databases on end-user devices ("music mood recognition"; Yang and Chen, 2012). Finally, emotion analysis of general sounds has been attempted only recently (Sundaram and Schleicher, 2010; Drossos et al., 2012; Schuller et al., 2012). In this light, we selected the following databases for our analysis: the Geneva Multimodal Emotion Portrayals (GEMEP) set as an example for enacted emotional speech; the Vera am Mittag (VAM) database as an example for spontaneous emotional speech "sampled" from a "real-life" context; the "Now That's What I Call Music" (NTWICM) database for mood recognition in popular music; and the recently introduced emotional sound database.

#### **2.2.1. Enacted emotion in speech: the Geneva multimodal emotion portrayals (GEMEP)**

The GEMEP corpus is a collection of 1260 multimodal expressions of emotion enacted by 10 French-speaking actors (Bänziger et al., 2012). GEMEP comprises 18 emotions that cover all four quadrants of the arousal-valence space. The list includes the emotions most frequently used in the literature (e.g., fear, sadness, joy) as well as more subtle differentiations within emotion families (e.g., anger and irritation, fear and anxiety). Actors expressed each emotion by using three verbal contents (two pseudo sentences and one sustained vowel) and different expression regulation strategies while they were recorded by three synchronized cameras and a separate microphone. To increase the realism and the spontaneity of the expressions, a professional director worked with the respective actor during the recording session in order to choose one scenario typical for the emotion – either by recall or mental imagery – that was personally relevant for the actor. Actors did not receive any instruction on how to express the emotion and were free to use any movement and prosody they wanted.

In the present research we consider a subselection of 154 instances of emotional speech based on the high recognition rates reported by Bänziger et al. (2012). For this set of portrayals, perceptual ratings of arousal and valence were obtained in the context of a study on the perception of multimodal emotion expressions (Mortillaro et al., unpublished). Twenty participants (10 male) listened to each of these expressions (presented in random order) and rated the content in terms of arousal and valence by using a continuous slider. Participants were given written instructions before the study. These instructions included a clear definition for each dimension that was judged. Furthermore, right before they started to rate the stimuli, they were asked whether they understood the dimensions and the two anchors and were invited to ask questions in case something was unclear. During the ratings the name of the dimension (e.g., "activation"), a brief definition (e.g., "degree of physical/physiological activation of the actor"), and the anchors ("very weak" and "very strong") were visible on the screen.

#### **2.2.2. Spontaneous emotion in speech: the VAM corpus**

The VAM corpus (Grimm et al., 2008) was collected by the institute INT of the University of Karlsruhe, Germany, and consists of audiovisual recordings taken from the German TV talk show "Vera am Mittag" (English: "Vera at noon"; Vera is the name of the talk show host). In this show, the host mainly moderates discussions between guests, e.g., by occasional questions. The corpus contains 947 spontaneous, emotionally rich utterances from 47 guests of the talk show which were recorded from unscripted and authentic discussions. There were several reasons to build the database on material from a TV talk show: there is a reasonable amount of speech from the same speakers available in each session, the spontaneous discussions between talk show guests are often rather affective, and the interpersonal communication leads to a wide variety of emotional states, depending on the topics discussed. These topics were mainly personal issues, such as friendship crises, fatherhood questions, or romantic affairs. At the time of recording, none of the subjects knew that the recordings were going to be analyzed in a study of affective expression. Furthermore, the selection of the speakers was based on additional factors, such as how emotional the utterances were or which spectrum of emotions was covered by the speakers, to assure a large spectrum of different and realistic affective states. Within the VAM corpus, emotion is described in terms of three basic primitives – valence, arousal, and dominance. Valence describes the intrinsic pleasantness or unpleasantness of a situation. Arousal describes whether a stimulus puts a person into a state of increased or reduced activity. Dominance is not used for the experiments reported in this article. For annotation of the speech data, the audio recordings were manually segmented to utterance level. A large number of human annotators were used for annotation (17 for one half of the data, six for the other).

For evaluation an icon-based method that consists of an array of five images for each emotion dimension was used. Each human listener had to listen to each utterance in the database to choose an icon per emotion dimension in order to best describe the emotion heard. Afterward, the choice of the icons was mapped onto a discrete five-point scale for each dimension in the range of +1 to −1, leading to an emotion estimation (Grimm et al., 2007a).

#### **2.2.3. Emotion in music: Now That's What I Call Music (NTWICM) database**

For building the NTWICM music database, the compilation "Now That's What I Call Music!" (UK series, volumes 1–69) was selected. It contains 2648 titles – roughly a week of total play time – and covers the time span from 1983 to 2010. It thus represents most music styles that are popular today, ranging from Pop and Rock over Rap and R&B to electronic dance music such as Techno or House. While lyrics are available for 73% of the songs, in this study we only use acoustic information.

Songs were annotated as a whole, i.e., without selection of characteristic song parts. Since mood perception is generally judged as highly subjective (Hu et al., 2008), it was decided to use four labellers. While mood may well change within a song, for example between more and less lively passages or from a sad passage to a positive resolution, annotation in such detail is particularly time-intensive. Yet, it is assumed that the addressed music type – mainstream popular and thus usually commercially oriented music – is less affected by such variation than, for example, longer arrangements of classical music. Details on the chosen raters are provided in Schuller et al. (2011b). They were picked to form a well-balanced set spanning from rather "naïve" assessors without instrument knowledge or professional relation to music to "expert" assessors including a club disc jockey (DJ). The latter can thus be expected to have a good relationship to music mood and its perception by audiences. Further, young raters proved a good choice, as they were very familiar with all the songs of the chosen database. They were asked to make a forced decision according to the two dimensions in the mood plane, assigning values in {−2, −1, 0, 1, 2} for arousal and valence, respectively. They were further instructed to annotate according to the perceived mood, that is, the "represented" mood, not the induced, that is, "felt" one, which could have resulted in too high labeling ambiguity. The annotation procedure is described in detail in Schuller et al. (2010), and the annotation along with the employed annotation tool are made publicly available<sup>1</sup>.

#### **2.2.4. Emotion in sound events: emotional sound database**

The emotional sound database (Schuller et al., 2012)<sup>2</sup> is based on the freely available on-line search engine FindSounds.com<sup>3</sup> (Rice and Bailey, 2005). It consists of 390 manually chosen sound files out of more than 10,000. To provide a set with a balanced distribution of emotional connotations, it was decided to use the following eight categories taken from FindSounds.com: *Animals*, *Musical instruments*, *Nature*, *Noisemaker*, *People*, *Sports*, *Tools*, and *Vehicles*. With this choice the database represents a broad variety of sounds that occur frequently in everyday environments. The emotional sound database was annotated by four labelers (one female, 25–28 years). They were all postgraduate students working in the field of audio processing. All labelers are of Southeast-Asian origin (Chinese and Japanese), and two reported having musical training. For the annotation these four listeners were asked to make a decision according to the two dimensions in the emotion plane, assigning values on a five-point scale in {−2, −1, 0, 1, 2} for arousal and valence. They were instructed to annotate the perceived emotion and could repeatedly listen to the sounds that were presented in random order across categories. Annotation was carried out individually and independently by each of the labelers. For annotation, the procedure as described in detail in Schuller et al. (2010) was used – thus, the annotation exactly corresponds to the one used for music mood (cf. above). The annotation tool can be downloaded freely<sup>4</sup>.

#### **2.2.5. Reliability and "gold standard"**

For all four of the databases, the individual listener annotations were averaged using the evaluator weighted estimator (EWE) as described by Grimm and Kroschel (2005). The EWE provides quasi-continuous dimensional annotations taking into account the agreement of observers. For an instance *n* and dimension *d* (arousal or valence), the EWE *y*<sup>*d*</sup><sub>EWE,*n*</sub> is defined by

$$y_{\text{EWE},n}^{d} = \frac{1}{\sum_{k=1}^{K} r_k} \sum_{k=1}^{K} r_k \, y_{n,k}^{d}, \tag{1}$$

where *K* is the number of labellers, and *y*<sup>*d*</sup><sub>*n*,*k*</sub> is the rating of instance *n* by labeller *k* in dimension *d*. Thus, the EWE is a weighted mean rating with weights corresponding to the confidence in the labeling of rater *k* – in this study, we use the correlation coefficient *r*<sub>*k*</sub> of rater *k*'s rating and the mean rating. By the first term in the above equation, the weights are normalized to sum up to one, in order to have the EWE in the same scale as the original ratings.
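As an illustration of Equation (1), the following minimal Python sketch computes the EWE for one dimension from a small rating matrix. The function names `pearson` and `ewe` and the data layout (`ratings[k][n]`) are assumptions made for this example, not part of the original tooling; the weights are the correlations of each rater with the unweighted mean rating, as described above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # undefined if a sequence is constant

def ewe(ratings):
    """Evaluator weighted estimator for one dimension (arousal or valence).

    ratings[k][n] is the rating of instance n by rater k (hypothetical layout).
    """
    n_raters, n_inst = len(ratings), len(ratings[0])
    # Unweighted mean rating per instance
    mean_rating = [sum(r[n] for r in ratings) / n_raters for n in range(n_inst)]
    # Weight r_k: correlation of rater k with the mean rating
    weights = [pearson(r, mean_rating) for r in ratings]
    total = sum(weights)
    # Weighted mean, normalized so that the weights sum to one
    return [sum(w * r[n] for w, r in zip(weights, ratings)) / total
            for n in range(n_inst)]
```

If all raters correlate perfectly with the mean rating, the weights are equal and the EWE reduces to the plain mean rating.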

<sup>1</sup>http://openaudio.eu/NTWICM-Mood-Annotation.arff – accessed 27 Mar 2013

<sup>2</sup>http://www.openaudio.eu/Emotional-Sound-Database.csv – accessed 27 Mar 2013

<sup>3</sup>http://www.findsounds.com – accessed 27 Mar 2013

<sup>4</sup>http://www.openaudio.eu/wsh\_mood\_annotation.zip – accessed 27 Mar 2013

The average *r*<sub>*k*</sub> (across the *K* raters) is given for arousal and valence annotation in the four databases in **Table 1**. For VAM, we observe that valence was more difficult to evaluate than arousal, while conversely, on the emotional sound database (ESD), raters agree more strongly on valence than arousal. In NTWICM, both arousal and valence have similar agreement (*r* = 0.70 and 0.69). Results for GEMEP are in the same order of magnitude, indicating some ambiguity despite the fact that the emotion is enacted.

Furthermore, **Table 1** summarizes the number of raters, number of rated instances, and length of the databases' audio. It can be seen that NTWICM is by far the largest regarding the number of instances and audio length, followed by VAM, ESD, and GEMEP. The huge differences in audio length are further due to the time unit of annotation, which is similar for VAM, ESD, and GEMEP (roughly 2–4 s of audio material), yet in NTWICM entire tracks of several minutes length of popular music were rated.

**Figure 1** shows the distribution of the arousal and valence EWE ratings on the four databases considered. For the purpose of this visualization, the quasi-continuous arousal/valence ratings are discretized into five equally spaced bins spanning the interval [−1, 1] on each axis, resulting in a discretization of the arousal-valence space into 25 bins. The number of instances per bin is counted. It is evident that in VAM, instances with low valence prevail – this indicates the difficulty of creating emotionally balanced data sets by sampling audio archives. Furthermore, we observe a strong concentration of ratings in the "neutral" (center) bin of the arousal-valence space. The enacted GEMEP database is overall better balanced in terms of valence and arousal ratings – yet still, there seems to be a lack of instances with low arousal and non-neutral valence rating, although some of the chosen emotion categories (e.g., pleasure) would be expected to fit in this part. For NTWICM, we observe a concentration in the first quadrant of the valence-arousal plane, and a significant correlation between the arousal and valence ratings (Spearman's ρ = 0.61, *p* < 0.001). This indicates a lack of, e.g., "dramatic" music with high arousal and low valence in the chosen set of "chart" music. Finally, in ESD, ratings are distributed all over the arousal and valence scales – as shown in more detail by Schuller et al. (2012), this is due to the different sound classes in the database having different emotional connotations (e.g., nature sounds on average being associated with higher valence than noisemakers).
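The binning used for this visualization can be sketched as follows. This is a minimal illustration, assuming ratings already lie in [−1, 1]; the helper names `bin_index` and `count_bins` are hypothetical, and the treatment of values exactly on the upper boundary (assigned to the last bin) is a choice made here rather than something the article specifies.

```python
def bin_index(value, n_bins=5, lo=-1.0, hi=1.0):
    """Index of the equally spaced bin on [lo, hi] that contains value."""
    width = (hi - lo) / n_bins
    idx = int((value - lo) / width)
    # Clamp so that value == hi falls into the last bin
    return min(max(idx, 0), n_bins - 1)

def count_bins(arousal, valence, n_bins=5):
    """Count instances per (arousal bin, valence bin) cell: 5 x 5 = 25 cells."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, v in zip(arousal, valence):
        counts[bin_index(a, n_bins)][bin_index(v, n_bins)] += 1
    return counts
```

With five bins per axis, a rating of 0.0 lands in the center bin (index 2), so `counts[2][2]` corresponds to the "neutral" cell discussed above.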

#### **2.3. EXTRACTION OF ACOUSTIC DESCRIPTORS**

In this article, the ultimate goal is automatic emotion recognition (AER) from general sound. In contrast to neighboring fields of audio signal processing such as speech or speaker recognition, which rely exclusively on rather simple spectral cues (Young et al., 2006) as acoustic features, AER typically uses a large variety of descriptors. So far no attempt has been made at defining a "standard" feature set for generic AER from sound, which may be due to the facts that AER is still a rather young field with about 15 years of active research, and that emotion recognition is a multi-faceted task owing to the manifold ways of expressing emotional cues through speech, music, and sounds, and the subjective nature of the task. Some of the currently best performing approaches for automatic speech emotion recognition (Schuller et al., 2011a) use a large set of potentially relevant acoustic features and apply a large, "brute-force" set of functionals to these in order to summarize the evolution of the contours of the acoustic features over segments of typically a few seconds in length (Ververidis and Kotropoulos, 2006). This is done to capture temporal dynamics in a feature vector of fixed length and has been shown to outperform modeling of temporal dynamics on the classifier level (Schuller et al., 2009). In the process of addressing various tasks in speech and speaker characterization in a series of research challenges (Schuller et al., 2009, 2013), various large feature sets for the speech domain have been proposed. Little work, however, has been done on cross-domain generalization of these features, which will be the focus of the present study.

**FIGURE 1 | Distribution of valence/arousal EWE on the VAM (A), GEMEP (B), emotional sound (C), and NTWICM (D) databases: number of instances per valence/arousal bin**.

For the analysis reported in this article, we use a well-evolved set for automatic recognition of paralinguistic phenomena – the one of the INTERSPEECH 2013 Computational Paralinguistics Evaluation baseline (Schuller et al., 2013). In this set, suprasegmental features are obtained by applying a large set of statistical functionals to acoustic low-level descriptors (cf. **Tables 2** and **3**). The low-level descriptors cover a broad set of descriptors from the fields of speech processing, Music Information Retrieval, and general sound analysis. For example, Mel Frequency Cepstral Coefficients (Davis and Mermelstein, 1980; Young et al., 2006) are very frequently used in ASR and speaker identification. Further, they are used in Music Information Retrieval. Spectral statistical descriptors, such as spectral variance and spectral flux, are often used in multi-media analysis, and are part of the descriptor set proposed in the MPEG-7 multi-media content description standard (Peeters, 2004). They are thus very relevant for music and sound analysis. Loudness and energy related features are obviously important for all tasks. The same holds true for the sound quality descriptors (which are used to discriminate harmonic and noise-like sounds) and the fundamental frequency and psychoacoustic sharpness. The latter is a well-known feature in sound analysis (Zwicker and Fastl, 1999). Jitter and shimmer are micro-prosodic variations of the period length and amplitude (respectively) of the fundamental frequency in harmonic sounds. They are mainly used in voice pathology analysis, but are also good descriptors of general sound quality.
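To make the idea of applying functionals to a low-level descriptor contour concrete, the sketch below maps a variable-length contour (for example, per-frame loudness) to a handful of fixed-length statistics. It is only an illustration: the function name `functionals`, the choice of four statistics, and the use of frame indices as the time axis are assumptions made here, and the actual ComParE set is far larger and is extracted with dedicated tooling.

```python
from math import sqrt

def functionals(contour):
    """Summarize one LLD contour (per-frame values, at least two frames)."""
    n = len(contour)
    mean = sum(contour) / n
    # Root quadratic mean of the contour
    rqmean = sqrt(sum(v * v for v in contour) / n)
    # Least-squares line v ~ slope * t + offset over frame indices t = 0..n-1
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(contour)) / denom
    offset = mean - slope * t_mean
    return {"mean": mean, "rqmean": rqmean,
            "linreg_slope": slope, "linreg_offset": offset}
```

Because every contour yields the same set of statistics regardless of its length, segments of different duration can be compared in one feature space.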

#### **3. RESULTS**

#### **3.1. FEATURE RELEVANCE**

Let us now discuss the most effective acoustic features out of the above-mentioned large set for single- and cross-domain emotion recognition. To this end, besides correlation coefficients (*r*) of features with the arousal or valence ratings, we introduce the cross-domain correlation coefficient (CDCC) as criterion. As we strive to identify features which carry similar meaning with respect to emotion in different domains, and at the same time provide high correlation with emotion in the domains by themselves, the purpose of the CDCC measure is to weigh high correlation in single domains against correlation deviations across different domains. Let us first consider a definition for two domains *i* and *j*, namely

$$\text{CDCC}^2_{f,i,j} = \frac{\left| r^{(i)}_f + r^{(j)}_f \right| - \left| r^{(i)}_f - r^{(j)}_f \right|}{2} \tag{2}$$

where *r*<sub>*f*</sub><sup>(*i*)</sup> is the correlation of feature *f* with the domain *i*, and "domain" refers to the arousal or valence annotation of a certain data set. We only consider the CDCC across the data sets (speech, music, and sound), not CDCC across arousal and valence.

**Table 2 | ComParE acoustic feature set: 64 provided low-level descriptors (LLD).**


**Table 3 | ComParE acoustic feature set: functionals applied to LLD contours (Table 2).**

<sup>1</sup>Arithmetic mean of LLD/positive Δ LLD. <sup>2</sup>Not applied to voice related LLD except F0. <sup>3</sup>Only applied to F0.

It is obvious that the CDCC measure is symmetric in the sense that CDCC<sup>2</sup><sub>*f*,*i*,*j*</sub> = CDCC<sup>2</sup><sub>*f*,*j*,*i*</sub>, and that it ranges from −1 to 1. If a feature *f* exhibits either strong positive or strong negative correlation with both domains, the CDCC will be near one, whereas it will be near −1 if a feature is strongly positively correlated with one domain yet strongly negatively correlated with the other. A CDCC near zero indicates that the feature is not significantly correlated with both domains at once (although it might still be correlated with one of them). Thus, we can expect a regressor to show similar performance on both domains if it uses features with high CDCC.
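A short Python sketch of Equation (2) makes these properties easy to check numerically. The function name `cdcc2` is chosen for this example only. Note that the expression equals the smaller of the two absolute correlations, with a positive sign if the correlations agree in sign and a negative sign otherwise.

```python
def cdcc2(r_i, r_j):
    """Two-domain CDCC of a feature with correlations r_i and r_j (Equation 2)."""
    return (abs(r_i + r_j) - abs(r_i - r_j)) / 2

# Same sign in both domains: the weaker correlation limits the score
print(cdcc2(0.8, 0.6))
# Opposite signs: the score becomes negative
print(cdcc2(0.8, -0.6))
# Correlated in only one domain: the score is zero
print(cdcc2(0.8, 0.0))
```

The third case illustrates why a CDCC near zero does not rule out a strong correlation in a single domain.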

Next, we generalize the CDCC<sup>2</sup> to *J* domains by summing up the CDCCs for domain pairs and normalizing to the range from −1 to +1,

$$\text{CDCC}_f^J = \frac{\sum_{i=1}^{J} \sum_{j=i+1}^{J} \left( \left| r_f^{(i)} + r_f^{(j)} \right| - \left| r_f^{(i)} - r_f^{(j)} \right| \right)}{J(J-1)}. \tag{3}$$

Intuitively, a regression function determined on features with high CDCC<sup>*J*</sup><sub>*f*</sub> is expected to generalize well to all *J* domains.

In **Tables 4** and **5**, we now exemplify the CDCC<sup>3</sup> across the three domains on selected features, along with presenting their correlation on the individual domains. Note that for the purpose of feature selection, we treat the union of VAM and GEMEP as a single domain ("speech"). Further, in our analysis we restrict ourselves to those features that exhibit high (absolute) correlation in a single domain (termed *sound, speech,* or *music features* in the table), and those with high CDCC<sup>3</sup> (termed *cross-domain features*). Thereby we do not present an exhaustive list of the top features but rather a selection aiming at broad coverage of feature types. To test the significance of the correlations, we use t-tests with the null hypothesis that feature and rating are sampled from independent normal distributions. Two-sided tests are used since we are interested in discovering both negative and positive correlations. Significance levels are adjusted by Bonferroni correction, which is conservative, yet straightforward and does not require independence of the individual error probabilities.
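The article does not spell out the test statistic. Assuming the standard t-test for a Pearson correlation coefficient *r* computed on *n* instances, and writing *m* for the number of tests and α for the nominal significance level (symbols introduced here for illustration), the test and the Bonferroni adjustment can be written as

$$t = r \sqrt{\frac{n-2}{1-r^{2}}}, \qquad p_{\text{corrected}} = \min\left(1,\; m \cdot p\right), \qquad \text{reject } H_0 \text{ if } p_{\text{corrected}} < \alpha,$$

where *p* is the two-sided p-value of *t* under a Student's t distribution with *n* − 2 degrees of freedom. With thousands of candidate features, *m* is large, which is why the correction is conservative.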

Looking at the top sound arousal features (**Table 4**), we find loudness to be most relevant – in particular, the (root quadratic) mean, the linear regression offset (corresponding to a "floor value") and the 99-percentile. This is similar to the ranking for speech. Interestingly, loudness is more strongly correlated than RMS energy, indicating the importance of perceptual auditory frequency weighting as performed in our loudness calculation. For music, these three loudness features are not as relevant, though still significantly correlated.

The overall best speech arousal feature is the root quadratic mean of spectral flux – indicating large differences of consecutive short-time spectra – which is interesting since it is independent of loudness and energy, which have slightly lower correlation (cf. above). The "second derivative" of the short-time spectra (arithmetic mean of Δ spectral flux) behaves in a similar fashion as spectral flux itself. However, the correlation of these features with arousal in sound and music is lower. Further, we find changes in the higher order MFCCs, such as the root quadratic mean of delta MFCC 14, to be relevant for speech and music arousal, relating to quick changes in phonetic content and timbre. Finally, mean F0, a "typical" speech feature characteristic for high arousal, is found to be relevant as expected, but does not generalize to the other domains.

The best music arousal features are related to mean peak distances – for example, in the loudness contour and the spectral entropy contour resembling occurrence of percussive instruments, indicating positive correlation between tempo and arousal. In contrast, the peak distance standard deviation is negatively correlated with arousal – thus, it seems that "periodic" pieces of music are more aroused, which can be explained by examples such as dance music. However, it seems that all these three features have a mostly musical meaning, since they only show weak correlations in sound and speech. Yet, a notable feature uniting speech and music is the (root quadratic) mean of the first MFCC, which is related to spectral skewness: arguably, a bias toward lower frequencies (high skewness) is indicative of absence of broadband (mostly percussive) instruments, and "calm" voices, and thus low arousal.

Summarizing cross-domain features for arousal, we find that the "greatest common divisor" of speech, sound, and music is loudness (and – relatedly – energy), but the behavior of functional types is interesting: the quadratic regression offset is much more relevant in the case of music than the mean loudness, which is mostly characteristic in speech and sound. In the NTWICM database of popular music, in fact we often find parabola shaped loudness contours, such that this offset indicates the intensity of the musical climax. A suitable cross-domain feature not directly related to loudness or energy is the spectral flux quadratic regression offset (the ordinate of the "high point" of spectral change).

Judging from the results in **Table 5**, we see that loudness is also indicative of *valence* in sound, music, and speech, but the correlations have different signs: on the one hand, loud sounds as identified by high root quadratic mean of loudness are apparently perceived as unpleasant, as are loud voices. For music, on the other hand, loudness can be indicative of high valence ("happy" music).

Among relevant speech valence features, we find mean energy (change) in the speech frequency range (1–4 kHz) and F0 (quartiles 1 and 2) – F0, however, is a "speech only" feature which exhibits low correlation in the other domains (similarly to the observations for arousal above).

Music valence features overlap with music arousal features, due to the correlation in the ratings. Among the music valence features, the median first MFCC (related to spectral skewness – cf. above) is particularly noticeable as it has "inverse" correlation on speech and music – "percussive" music with a flat spectrum is associated with positive emotion (high valence) while "noisy" voices are characteristic of negative emotion (low valence).

Cross-domain features for valence are generally rarely significant on the individual domains and hard to interpret – here, in contrast to arousal, it seems difficult to obtain descriptors that generalize across multiple domains.

We now move from discussion of single features to a broader perspective on automatic feature selection for cross-domain emotion recognition. To this end, we consider automatically selected subsets of the ComParE feature set by the CDCC criteria. In particular, for each of arousal and valence, we choose the 200 features that show the highest CDCC<sup>2</sup> for the (sound, music), (sound, speech), and (music, speech) pairs of domains. Furthermore, for each of arousal and valence, we select a set of 200 features by highest CDCC<sup>3</sup> across all three of the sound, music, and speech domains.
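The selection step described here can be sketched in a few lines of Python. The sketch implements Equation (3) for an arbitrary number of domains and then keeps the *k* highest-scoring features; the names `cdcc` and `select_features`, the dictionary layout, and the example correlation values are hypothetical and serve only to illustrate the procedure.

```python
from itertools import combinations

def cdcc(correlations):
    """CDCC over J domains (Equation 3); correlations holds one r per domain."""
    J = len(correlations)
    total = sum(abs(a + b) - abs(a - b)
                for a, b in combinations(correlations, 2))
    return total / (J * (J - 1))

def select_features(corr_by_feature, k=200):
    """Return the k feature names with the highest CDCC.

    corr_by_feature maps a feature name to its per-domain correlations.
    """
    scores = {name: cdcc(rs) for name, rs in corr_by_feature.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up correlations with arousal in (sound, music, speech)
example = {
    "loudness_mean": [0.6, 0.5, 0.4],
    "f0_mean": [0.7, 0.0, -0.1],
    "flux_offset": [0.5, 0.5, 0.5],
}
print(select_features(example, k=2))
```

For two domains, `cdcc` reduces to the pairwise definition of Equation (2), so the same function covers both the pairwise and the three-domain selections.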


**Table 4 | Cross-domain feature relevance for arousal: top features ranked by absolute correlation (r) for single domain, and CDCC across all three domains (CDCC<sup>3</sup> ).**

Significance denoted by \*\*p < 0.001, \*p < 0.01, °p < 0.05, <sup>−</sup>p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests.

**Table 5 | Cross-domain feature relevance for valence: top features ranked by absolute correlation (r) for single domain, and CDCC across all three domains (CDCC<sup>3</sup>).**

Significance denoted by \*\*p < 0.001, \*p < 0.01, °p < 0.05, <sup>−</sup>p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests.

In **Figure 2**, we summarize the obtained feature sets by the share of cepstral, prosodic, spectral, and voice quality LLDs, as well as by the share of modulation, moment, peak, percentile, regression, and temporal functionals (see **Tables 2** and **3** for a list of descriptors in each of these groups). We compare the cross-domain feature sets to the full ComParE feature set as well as the "single domain" feature sets that are obtained in analogy to the cross-domain feature sets by applying the CDCC<sup>2</sup> to a 50% split of each corpus. A feature group is considered particularly relevant for a recognition task if its share among the selected features is larger than its share of the full feature set.
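The relevance criterion in the last sentence amounts to comparing two proportions per group. A minimal sketch, assuming a hypothetical mapping `group_of` from feature name to group label and the illustrative function name `relevant_groups`:

```python
from collections import Counter

def relevant_groups(selected, full, group_of):
    """Groups whose share among selected features exceeds their share overall."""
    sel = Counter(group_of[f] for f in selected)
    ful = Counter(group_of[f] for f in full)
    return {g for g in ful
            if sel[g] / len(selected) > ful[g] / len(full)}

# Toy example with made-up feature names and group labels
group_of = {"mfcc1": "cepstral", "mfcc2": "cepstral",
            "flux": "spectral", "entropy": "spectral"}
full = list(group_of)
selected = ["mfcc1", "mfcc2", "flux"]
print(relevant_groups(selected, full, group_of))
```

In the toy example, cepstral features make up half of the full set but two thirds of the selection, so only the cepstral group is flagged.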

We observe notable differences in the importance of different LLD groups; it is of particular interest for the present study to highlight the results for the considered cross-domain emotion recognition tasks: cepstral features seem to be particularly relevant for cross-domain speech and music emotion recognition. In contrast, cross-domain emotion recognition from speech and sound, and from sound and music, are dominated by "prosodic" and spectral cues such as loudness, sub-band energies, and spectral flux. Regarding relevant functional types, the summarization reveals less evident differences between the tasks; still, percentile type functionals seem to be particularly promising for all of the tasks considered.

#### **3.2. AUTOMATIC CLASSIFICATION EXPERIMENTS**

Finally, we demonstrate the predictive power of the obtained cross-domain feature sets in automatic regression. In automatic regression, the parameters of a regression function on *N*-dimensional feature vectors are optimized to model the assignment of *L* "learning" vectors (e.g., feature vectors of emotional utterances) to the gold standard (e.g., the arousal observer rating). Then, the regression function is evaluated on a disjoint set of test vectors and the correlation of the function's predictions and the test set gold standard is computed as a measure of how well the regression function generalizes to "unseen" test data. In the present study, it is of particular interest to consider cross-domain evaluation, i.e., training on data from one domain (e.g., enacted speech) and evaluating on another domain (e.g., sound). In this context, we also treat spontaneous and enacted speech as different domains, as such analysis is receiving increasing attention at the moment (Bone et al., 2012), also for practical reasons: for instance, it is of interest to determine if training on "prototypical" data from a controlled experiment (such as the GEMEP database) can improve automatic emotion recognizers applied "in the wild," e.g., to media analysis (such as given by the VAM database). For reference, we also consider within-domain regression in a twofold cross-validation manner.

For each learning set, we determine a multivariate linear regression function by means of support vector regression (SVR) (Smola and Schölkopf, 2004), which defines a real valued mapping

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b \tag{4}$$

of *N*-dimensional feature vectors **x** to a regression value *f*(**x**). **w** is the normal vector of the *N*-dimensional hyperplane describing the regression function, and *b* is a scalar offset. Specifically for SVR, the primary optimization goal is *flatness* of the regression function, which is defined as low norm of the weight vector **w**.

This is related to the notion of *sparsity* and crucial to avoid over-fitting of the model parameters in the present case of high dimensional feature spaces. The trade-off between flatness of the weight vector and deviation of the regression values from the gold standard on the learning set is modeled as a free parameter *C* in the optimization (cf. Smola and Schölkopf, 2004 for details). In our experiments, *C* is set to 10<sup>−3</sup> for within-domain regression and 10<sup>−5</sup> for cross-domain regression. The optimization problem is solved by the frequently used Sequential Minimal Optimization algorithm (Platt, 1999). To foster reproducibility of our research, we use the open-source machine learning toolkit Weka (Hall et al., 2009). Unsupervised mean and variance normalization of each feature per database is applied since SVR is sensitive to feature scaling.
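The per-database normalization mentioned above can be sketched as follows. This is an illustrative assumption of how such a step might look, not the toolkit's implementation: the function name, the database names, and the use of the population variance (division by *n*) are choices made for the example.

```python
from math import sqrt

def zscore_per_database(features_by_db):
    """Normalize every feature to zero mean and unit variance within each database.

    features_by_db maps a database name to a list of feature vectors.
    No labels are used, so the step is unsupervised.
    """
    normalized = {}
    for name, vectors in features_by_db.items():
        n, dim = len(vectors), len(vectors[0])
        means = [sum(v[j] for v in vectors) / n for j in range(dim)]
        # Population standard deviation (an assumption of this sketch).
        stds = [sqrt(sum((v[j] - means[j]) ** 2 for v in vectors) / n) for j in range(dim)]
        normalized[name] = [
            [(v[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(dim)]
            for v in vectors
        ]
    return normalized

# Hypothetical two-feature vectors; the database names are placeholders.
raw = {
    "db_a": [[1.0, 10.0], [3.0, 30.0]],
    "db_b": [[100.0, 0.5], [300.0, 1.5]],
}
scaled = zscore_per_database(raw)
```

Although the two toy databases use very different feature ranges, both end up on the same scale, which is the point of normalizing per database before cross-domain training.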

In **Table 6**, the correlation coefficients (*r*) of automatic within-domain and cross-domain regression with the arousal observer ratings are displayed. First, we consider regression using the full 6373-dimensional ComParE feature set. In within-domain regression, results ranging from *r* = 0.54 (sound) up to *r* = 0.85 (enacted speech) are obtained, which are comparable to previously obtained results on sound, music, and spontaneous speech (Grimm et al., 2007b; Schuller et al., 2011b, 2012). Especially the result for music is notable, since we do not use any "hand-crafted" music features such as chords or tempo. In cross-domain regression, significant correlations are obtained except for the case of training on music and evaluating on sound. However, the mean *r* across all training and testing conditions (0.50) is rather low.

Considering automatic feature selection by CDCC<sup>2</sup> for each combination of two domains, results in **Table 6B** indicate a drastic gain in performance, especially for cross-domain regression. However, the results in within-domain regression are also improved. All correlations are significant at the 0.1% level. In particular, using CDCC-based feature selection, robust regression (achieving *r* > 0.76) is possible across enacted and spontaneous speech. Further, it is notable that the average result across the four testing databases does not vary much depending on the training database used, indicating good generalization capability of the selected features. The overall mean *r* in this scenario is 0.65.

Finally, if we select the top features by CDCC<sup>3</sup> on all databases (treating speech as a single domain for the purpose of feature selection), it is notable that we still obtain reasonable results (mean r of 0.58) despite the fact that the top features by CDCC<sup>3</sup> exhibit comparably low correlation with the target labels on the single domains (cf. **Table 4**).

**Table 6 | Results of within-domain and pair-wise cross-domain support vector regression on arousal observer ratings for sound (emotional sound database), music (NTWICM database), and spontaneous and enacted speech (VAM/GEMEP databases).**

Significance denoted by \*\*p < 0.001, \*p < 0.01, <sup>−</sup>p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests. Full ComParE feature set (cf. **Tables 2** and **3**); 200 top features selected by CDCC<sup>2</sup> for specific within-domain or cross-domain regression tasks; Generic features: 200 features selected by CDCC<sup>3</sup> across sound, music, and speech domains (cf. **Table 4**).

Summarizing the results for valence regression (**Table 7**), we observe that using the full feature set, we cannot obtain reasonable results in cross-domain regression. Among the cross-domain results, the only significant positive correlations are obtained in evaluation on spontaneous speech; however, these are lower than the correlation of the single best speech features. Interestingly, we observe significant negative correlations when evaluating on music and training on another domain, which is consistent with the fact that some of the music valence features are "inversely" correlated with the target label in the other domains (cf., e.g., the discussion of median MFCC 1 above). In the within-domain setting, it can be observed that regression on valence in music is possible with high robustness (*r* = 0.80). This is all the more noticeable since this correlation is higher than the one obtained in arousal regression, while for the other domains, valence seems to be harder to recognize than arousal. This can partly be attributed to the fact that in the analyzed music data, the valence rating is correlated to the arousal rating.

Concerning feature selection by CDCC<sup>2</sup> (**Table 7B**), we observe a boost in the obtained correlations (mean = 0.44, compared to 0.12 without feature selection). For instance, when training on enacted speech and evaluating on music, we obtain a significant *r* of 0.60. This result is interesting insofar as the best selected feature for this particular cross-domain setting, namely the flatness of the loudness contour, only exhibits correlations of 0.28 and 0.27 with the valence rating on the NTWICM (music) and GEMEP (enacted speech) databases, respectively. Thus, the 200 CDCC<sup>2</sup>-selected features for this regression task seem to be of complementary nature. Furthermore, by applying feature selection in the within-domain setting, best results are obtained for sound (*r* = 0.51), music (*r* = 0.82), and enacted speech (*r* = 0.50) valence recognition. However, regarding the issue of enacted vs. spontaneous speech, we find that regressors trained on one type do not generalize well to the other, which is in contrast to the finding for arousal.


**Table 7 | Results of within-domain and pair-wise cross-domain support vector regression on valence observer ratings for sound (emotional sound database), music (NTWICM database), and spontaneous and enacted speech (VAM/GEMEP databases).**


Significance denoted by \*\*p < 0.001, \*p < 0.01, °p < 0.05, <sup>−</sup>p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests. Full ComParE feature set (cf. **Tables 2** and **3**); 200 top features selected by CDCC<sup>2</sup> for specific within-domain or cross-domain regression tasks; Generic features: 200 features selected by CDCC<sup>3</sup> across sound, music, and speech domains (cf. **Table 5**).

Finally, when applying the "generic valence feature set" obtained from the CDCC<sup>3</sup> ranking across sound, music, and speech, we obtain an average correlation of 0.32. Results are considerably below the CDCC<sup>2</sup> results, particularly for sound and enacted speech. This again points to the difficulty of finding features that generalize to valence recognition across domains. However, it is notable that robust results (*r* = 0.75) are obtained in within-domain music recognition using the generic feature set, of which the "best" feature (rise time of spectral centroid) only has an (absolute) correlation of 0.16 with the music valence rating.

#### **4. DISCUSSION**

We have presented a set of acoustic descriptors for emotion recognition from audio in three major domains: speech (enacted and spontaneous), music, and general sound events. Using these features, we have obtained notable performances in within-domain regression – particularly, these surpass the so far best published results on the NTWICM database (Schuller et al., 2011b) despite the fact that the latter study used hand-crafted music features rather than the generic approach pursued in the present paper.

We have found that it is rather hard to obtain features that are equally well correlated across the three domains. For arousal, such features comprise mostly loudness-related ones. In contrast, we have not been able to obtain features that are significantly correlated with the valence rating in all domains. A further notable result for valence is that some features have an "inverse" meaning in different domains (i.e., significant correlations with opposite sign), while this does not occur for arousal. It will be the subject of further research whether this is simply due to the correlation of intended arousal and valence in popular music or to more fundamental differences.

This phenomenon has motivated the introduction of a "cross-domain correlation coefficient," which summarizes the differences in correlation across multiple domains. Using this coefficient, we were able to provide an automatic method of selecting generalizing features for cross-domain arousal and valence recognition. As a result, cross-domain arousal and valence regression has been shown to be feasible, achieving significant correlations with the observer annotations.

The degree of cross-domain consistency in encoding the two main dimensions of affect – valence and arousal – demonstrated in this article is quite astounding. Music has often been referred to as the "language of emotion" and a comprehensive review of empirical studies on the expression of emotion in speech and music (Juslin and Laukka, 2003) has confirmed the hypothesis that the acoustic parameters marking certain emotions are quite similar in music and speech (cf. also Ilie and Thompson, 2006). Scherer (1991) has suggested that speech and emotion may have evolved on the basis of primitive affect bursts serving similar communicative functions across many mammalian species. Ethological work shows that expression and impression are closely linked, suggesting that, in the process of conventionalization and ritualization, expressive signals may have been shaped by the constraints of transmission characteristics, limitations of sensory organs, or other factors. The resulting flexibility of the communication code is likely to have fostered the evolution of more abstract, symbolic language, and music systems, in close conjunction with the evolution of the brain to serve the needs of social bonding and efficient group communication.

As vocalization, which remained a major modality for analog emotion expression, became the production system for the highly formalized, segmental systems of language and singing, both of these functions needed to be served at the same time. Thus, in speech, changes in fundamental frequency (F0), formant structure, or characteristics of the glottal source spectrum can, depending on the language and the context, serve to communicate phonological contrasts, syntactic choices, pragmatic meaning, or emotional expression. Similarly, in music, melody, harmonic structure, or timing may reflect the composer's intentions, depending on specific traditions of music, and may simultaneously induce strong emotional moods. This fusion of two signal systems, which are quite different in function and in structure, into a single underlying production mechanism, vocalization, has proven to be singularly efficient for the purpose of communication, and the relatively high degree of convergence as demonstrated by the correlations found in our study suggests that it might be possible to identify elements of a common code for emotion signaling. Recently, Scherer (2013) has reviewed theoretical proposals and empirical evidence in the literature that help to establish the plausibility of this claim, in particular, the evolutionary continuity of affect vocalizations, showing that anatomical structures for complex vocalizations existed before the evidence for the presence of representational systems such as language.

As to the cross-domain consistency with different kinds of environmental sounds, it seems quite plausible to assume that once speech and music were decoupled from actually occurring affect bursts and took on representational functions, different kinds of nature sounds were used in speech and music both for reference to external events and expressive functions. It seems reasonable to assume that the type of representational coding was informed by the prior, psychobiological affect code, particularly with respect to the fundamental affect dimensions of valence and arousal.

Empirical studies like the one reported here, using machine learning approaches, may complement other approaches to examine the evolutionary history of affect expression in speech and music by empirically examining, using large corpora of different kinds of sound events, the extent to which auditory domains exhibit cross-domain consistency and which common patterns are particularly frequent.

#### **ACKNOWLEDGMENTS**

This study has received funding from the European Commission (grant no. 289021, ASC-Inclusion).

#### **REFERENCES**

*Proceeding of the ASRU* (Cancún: IEEE), 381–385.

of emotion," in *Handbook of Affective Sciences*, eds R. J. Davidson, K. R. Scherer, and H. H. Goldsmith (Oxford, NY: Oxford University Press), 433–456.

*Annual Conference of the International Speech Communication Association* (Lyon: ISCA).

*Book, Version 3.4.1*. Cambridge: Cambridge University Engineering Department.

Zwicker, E., and Fastl, H. (1999). *Psychoacoustics – Facts and Models*. Heidelberg: Springer.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 March 2013; paper pending published: 15 April 2013; accepted: 06 May 2013; published online: 27 May 2013.*

*Citation: Weninger F, Eyben F, Schuller BW, Mortillaro M and Scherer KR (2013) On the acoustics of emotion in audio: what speech, music, and sound have in common. Front. Psychol. 4:292. doi: 10.3389/fpsyg.2013.00292*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Weninger, Eyben, Schuller, Mortillaro and Scherer. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## The "Musical Emotional Bursts": a validated set of musical affect bursts to investigate auditory affective processing

#### *Sébastien Paquette<sup>1</sup>\*, Isabelle Peretz<sup>1</sup>\* and Pascal Belin<sup>1,2,3</sup>*

*<sup>1</sup> Department of Psychology, International Laboratory for Brain Music and Sound Research, Center for Research on Brain, Language and Music, University of Montreal, Montreal, QC, Canada*

*<sup>2</sup> Voice Neurocognition Laboratory, Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, UK*

*<sup>3</sup> Institut des Neurosciences de La Timone, Aix-Marseille Université, Marseille, France*

#### *Edited by:*

*Petri Laukka, Stockholm University, Sweden*

#### *Reviewed by:*

*Tuomas Eerola, University of Jyväskylä, Finland Cesar F. Lima, University College London, UK; University of Porto, Portugal*

#### *\*Correspondence:*

*Sébastien Paquette and Isabelle Peretz, International Laboratory for Brain Music and Sound Research, Pavillon 1430 Boulevard Mont-Royal, Montréal, QC H2V 4P3, Canada e-mail: sebastien.paquette.1@umontreal.ca; isabelle.peretz@umontreal.ca*

The Musical Emotional Bursts (MEB) consist of 80 brief musical executions expressing basic emotional states (happiness, sadness and fear) and neutrality. These musical bursts were designed to be the musical analog of the Montreal Affective Voices (MAV)—a set of brief non-verbal affective vocalizations portraying different basic emotions. The MEB consist of short (mean duration: 1.6 s) improvisations on a given emotion or of imitations of a given MAV stimulus, played on a violin (10 stimuli × 4 [3 emotions + neutral]), or a clarinet (10 stimuli × 4 [3 emotions + neutral]). The MEB arguably represent a primitive form of music emotional expression, just like the MAV represent a primitive form of vocal, non-linguistic emotional expression. To create the MEB, stimuli were recorded from 10 violinists and 10 clarinetists, and then evaluated by 60 participants. Participants evaluated 240 stimuli [30 stimuli × 4 (3 emotions + neutral) × 2 instruments] by performing either a forced-choice emotion categorization task, a valence rating task or an arousal rating task (20 subjects per task); 40 MAVs were also used in the same session with similar task instructions. Recognition accuracy of emotional categories expressed by the MEB (*n*:80) was lower than for the MAVs but still very high with an average percent correct recognition score of 80.4%. Highest recognition accuracies were obtained for happy clarinet (92.0%) and fearful or sad violin (88.0% each) MEB stimuli. The MEB can be used to compare the cerebral processing of emotional expressions in music and vocal communication, or used for testing affective perception in patients with communication problems.

#### **Keywords: music, emotion, auditory stimuli, voices**

#### **INTRODUCTION**

With increasing knowledge in the field and new methods to explore the human brain, emotions are no longer too obscure or subjective to be studied scientifically. In neuroscience, many research projects are now entirely dedicated to the study of emotion. Thus, it appears timely to construct a standardized and validated set of stimuli and to make these freely and easily available (www.brams.umontreal.ca/plab\_download; http://vnl.psy.gla.ac.uk/resources.php) in order to facilitate the comparability of future studies.

A great amount of work has been achieved in the field of visually perceived emotions, utilizing validated stimuli like the International Affective Picture System and the Ekman faces (Ekman and Friesen, 1978; Lang et al., 1988; Dailey et al., 2001; Ekman et al., 2002), which were designed to portray basic emotions (anger, disgust, fear, happiness, sadness, and surprise as well as a neutral expression). These validated sets of stimuli have provided highly useful tools for the study of brain structures (e.g., amygdala: Adolphs et al., 1994) involved in emotional processing and its developmental trajectory (Charlesworth and Kreutzer, 1973). With the same objectives, an increasing number of studies are being conducted in the domain of aurally perceived emotions, thus calling for validated stimuli sets.

A large part of the research on auditory affective processing has been conducted on speech prosody utilizing words or sentences spoken with various emotional expressions (Monrad-Krohn, 1963; Banse and Scherer, 1996; Buchanan et al., 2000; Kotz et al., 2003; Mitchell et al., 2003; Schirmer et al., 2005; Pell, 2006). Another way to express an emotion vocally is via non-verbal affect bursts (Scherer, 1994; also sometimes called non-verbal interjections: Schröder, 2003). Non-verbal affect bursts are vocal expressions (e.g., screams, laughter) that usually accompany intense emotional feelings. Affect bursts are minimally conventionalized, thus a relatively universal means of spontaneous human communication (see Sauter et al., 2010; Koeda et al., 2013, for cross-cultural studies). They are believed to reflect more of a biological push than a sociological pull (Scherer, 1986); they are closer to the primitive affect expressions of babies and animals than to emotional speech.

Recently, a validated set of auditory affect bursts designed as an auditory counterpart of Ekman faces was recorded and validated by Belin et al. (2008). The so-called Montreal Affective Voices (MAV) consist of a set of short vocal interjections on the vowel /a/ expressing anger, disgust, fear, pain, sadness, surprise, happiness, sensual pleasure, and neutrality. The MAV represent short primitive expressions of these emotions with minimal semantic information, providing useful stimuli for the study of the psychological mechanisms underlying auditory affective processing with minimal interaction with linguistic processes (e.g., Bestelmeyer et al., 2010).

However, vocal affect bursts are not the only means of transmitting auditory emotions. Music is often described as the "language of emotions," and recent research on basic musical emotions has shown that emotion recognition in music is consistent across listeners (Vieillard et al., 2008). The term "basic emotions" refers to a limited number of innate and universal emotion categories (happiness, sadness, anger, fear, and disgust) from which all other emotions can be derived (Ekman, 1982). Moreover, many studies have demonstrated that emotions in music fit Ekman's definition of basic emotions: they are recognized quickly [within only a quarter of a second of music, i.e., one chord or a few notes (Peretz et al., 1998; Bigand et al., 2005)], early in development (Terwogt and van Grinsven, 1991; Flom et al., 2008), and across different cultures (Balkwill et al., 2004). The latter is even true for cultures without previous exposure to Western music (Fritz et al., 2009).

Perception of specific musical emotions (e.g., fear and sadness) can also be lost after damage to the amygdala (Gosselin et al., 2005, 2007), suggesting that damage to the limbic system affects perception of basic musical emotion just as reported for other domains (e.g., vocal expression: Dellacherie et al., 2011; facial expression: Adolphs et al., 1994).

An important question that ensues is why music moves us. Recent studies have shown that certain brain areas [e.g., the striatum (Salimpoor et al., 2011), the amygdala (Gosselin et al., 2007)] are associated with musical emotional processing. These same areas have also been associated with basic biological functions (sex, pain). How can we conceptualize the relationship between music and these neurobiological substrates? One possibility is that music co-opts or invades emotional circuits that have evolved primarily for the processing of biologically important vocalizations [e.g., laughs, screams; Peretz (2010)]. There is currently little experimental data supporting or invalidating the existence of a common musical and vocal channel.

For example, Lima and Castro (2011) demonstrated that musical expertise enhances the recognition of emotions in speech prosody, suggesting that expertise in one domain could translate to the other. Conversely, Thompson et al. (2012) reported that amusics (individuals with a pitch perception deficit; Peretz et al., 2002) were also impaired in perceiving emotional prosody in speech.

More specifically, Ilie and Thompson (2006) compared domains by evaluating the effect of manipulating acoustic cues common to both the voice and music [intensity, rate (tempo), and pitch height] on emotional judgments. They found that loud excerpts were judged as more pleasant, energetic and tense compared to soft excerpts, and that fast music and speech were judged as having greater energy than slow music and speech. However, it was also found that tempo and pitch had opposite effects on other emotional scales. Their results support the view that the processing of musical and vocal emotion could utilize common circuitry, but that some of this circuitry might be domain specific.

The existence of domain-specific processes for decoding emotion is consistent with neuropsychological dissociations found between music and language (Peretz and Coltheart, 2003; Omar et al., 2011; Lima et al., 2013). These dissociations could be explained by the fact that musical emotion needs to be actively decoded by the brain based on associations learned via exposure to a musical culture (Peretz et al., 1998; Juslin and Västfjäll, 2008) and past experience with music (Eschrich et al., 2008); since not all musical emotional acoustic parameters are present in emotional vocalizations (e.g., harmony: Juslin and Laukka, 2003), it is possible that these additional cues require additional processing.

Musical and vocal stimuli have both been used to study auditory perceived emotions (Music: Vieillard et al., 2008; Roy et al., 2009; Aubé et al., 2013; Voices: Dalla Bella et al., 2001; Schirmer et al., 2005; Pell, 2006; Fecteau et al., 2007; Belin et al., 2008). Although such stimuli have been quite useful in exploring aurally perceived emotions in their respective channels, many characteristics set current musical and vocal stimuli apart, making them hard to compare in a controlled study. This is especially true for factors such as musical structure (limited by mode or tempo), length, level of complexity, as well as the context in which they have been created. The use of pre-existing music can introduce uncontrolled variability of many acoustic parameters, with various demands on attention and memory. Such acoustic and cognitive differences are likely to recruit different neural networks (Peretz and Zatorre, 2005). This is why it is important to create and validate musical stimuli that are as similar as possible to the MAV, to allow for a more proper comparison of aurally (musical and vocal) perceived emotions.

The purpose of the present study is to make available for future research a validated set of brief musical clips expressing basic emotions, designed as a musical counterpart of the MAV. We chose to only include happiness, sadness, and fear because these emotions are among the easiest to recognize from music (Gabrielsson and Juslin, 2003; Juslin and Laukka, 2003; see Zentner et al., 2008, for a more nuanced range of musically induced emotions).

Brief "musical emotional bursts" (MEB) depicting neutral and emotional (happy, sad, and fear) expressions have been recorded from different musicians. The violin and the clarinet were chosen as instruments, not only because they are representative of two different classes of instruments (strings and woodwind) but also because they share important similarities with the voice: "The quasi-vocal quality implied by a seamless progression between notes is a characteristic that can be cultivated in both the clarinet and the violin" (Cottrell and Mantzourani, 2006:33). These recordings were then pre-selected and validated based on listeners' emotion categorization accuracy, as well as on valence and arousal ratings.

#### **MATERIALS AND METHODS**

#### **RECORDING**

#### *Participants*

Twenty professional musicians (10 violinists, 10 clarinetists) participated in the recording sessions, after providing written informed consent. They received a compensation of \$20 per hour.

#### *Procedure*

The musicians were first instructed to perform 10 short improvisations with different levels of expressiveness. They were not told in advance what the recording session was about; on the day of the recording, they were told, one emotion at a time, which emotion they were to improvise on [fear (as if they were scared), happiness, sadness, and neutrality]. They were told that each improvisation had to last around a second (they could practice with a metronome); when ready, they performed 10 renditions of the emotion. Neutral stimuli were presented just like any other category of stimuli, but characterized as "without emotion." After improvising, the same musicians were asked to imitate, one after another, four MAV stimuli depicting fear, happiness, sadness, and neutrality; they could listen to the stimuli as often as they wished. Musical bursts were discarded if their emotional category was not clearly recognized by the experimenter (SP) or if the improvisations were too long.

The musical bursts were recorded in a sound-treated studio using a Neumann TLM 103 large-diaphragm microphone (Georg Neumann, Berlin, Germany) at a distance of approximately 30 cm. Recordings were pre-amplified using a Millennia Media HV-3D preamplifier and digitized at a 44-kHz sampling rate and 24-bit resolution, using an Apogee AD16X. Subsequently, they were edited into short segments and normalized at peak value (90% of maximum amplitude), using Adobe Audition 3.0 (Adobe Systems, Inc., San Jose, CA).
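Peak normalization to 90% of maximum amplitude can be illustrated with a minimal sketch. It is not the audio editor's implementation: it assumes samples represented as floats in the range [−1.0, 1.0], and the function name and toy values are hypothetical.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak.

    Samples are assumed to be floats in [-1.0, 1.0], where 1.0 is full scale.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence stays silence
    gain = target_peak / peak
    return [s * gain for s in samples]

clip = [0.0, 0.25, -0.5, 0.1]   # hypothetical short excerpt
normalized = peak_normalize(clip)
```

Because a single gain factor is applied to the whole segment, the relative dynamics within each stimulus are preserved while the peak levels are equalized across stimuli.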

We ended up with more stimuli than expected, because each musician gave us more excerpts than we asked for. In total, 1505 improvisations [a minimum of 10 × 4 emotions (happy, sad, fear, and neutral) per musician] and 319 imitations of the MAV [a minimum of 4 × 4 emotions (happy, sad, fear, and neutral) per musician] were recorded.

#### *Stimulus pre-selection*

Improvisations lasting longer than 4 s were excluded. Improvisations or imitations containing an artifact (breathing, vocal sounds, breaking bow hair sounds) were also excluded. In the end, the clearest and most representative stimuli (120 Violin-MEB and 120 Clarinet-MEB) were selected for the validation phase, regardless of their type (improvisation or imitation).

#### **VALIDATION**

#### *Participants*

Sixty participants (19 males) aged from 19 to 59 years (*M*: 28.8; *SD*: 9.2), with normal hearing participated in an on-line validation test. Each participant gave informed consent and filled out a socio-demographic information questionnaire prior to the judgment phase. Fifteen participants had 6 years or more of musical education and 45 had 5 years or less of training. They were compensated 3£ for their participation.

#### *Procedure*

Participants were instructed to evaluate each of the 240 MEB and 40 MAV (the MAV were included for comparison with the vocal stimuli). There were 30 violin-MEB, 30 clarinet-MEB and 10 MAV per emotion, and all were presented in a random order. Twenty of the 60 participants performed a four-alternative forced-choice identification task ("*Please choose the emotion you think this stimulus represents*") among fear, happiness, sadness, and neutrality labels; 20 participants gave arousal ratings ("*Please rate on the scale below the perceived arousal of the emotion expressed (from 1 not at all aroused to 9 extremely aroused)*"); and another group of 20 participants gave valence ratings ("*Please rate on the scale below the perceived valence of the emotion expressed (from 1 extremely negative to 9 extremely positive)*").

#### **RESULTS**

The stimuli (40 violin-MEB and 40 clarinet-MEB) that were categorized as the intended emotion by the largest number of participants were selected (10 MEB per emotion for each instrument: 7 improvisations and 3 imitations). When ratings were identical, the briefest stimuli were selected. Due to the small number of stimuli in each category, improvisations and imitations were not analysed separately (separate Tables can be found in the Supplementary Material).
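The selection rule described above (best recognized first, ties broken by shorter duration) can be sketched as a sort. This is a simplified illustration: it ignores the split into 7 improvisations and 3 imitations, and the field names, stimulus ids, and values are hypothetical.

```python
def select_best(stimuli, per_group=10):
    """Pick the best-recognized stimuli per (instrument, emotion) group.

    Ties in recognition accuracy are broken in favor of shorter duration.
    """
    groups = {}
    for s in stimuli:
        groups.setdefault((s["instrument"], s["emotion"]), []).append(s)
    selected = {}
    for key, items in groups.items():
        ranked = sorted(items, key=lambda s: (-s["accuracy"], s["duration_s"]))
        selected[key] = ranked[:per_group]
    return selected

# Hypothetical candidates; ids, accuracies, and durations are made up.
candidates = [
    {"id": "v1", "instrument": "violin", "emotion": "sad", "accuracy": 0.95, "duration_s": 2.1},
    {"id": "v2", "instrument": "violin", "emotion": "sad", "accuracy": 0.95, "duration_s": 1.4},
    {"id": "v3", "instrument": "violin", "emotion": "sad", "accuracy": 0.80, "duration_s": 0.9},
]
best = select_best(candidates, per_group=2)
```

In the toy data, the two equally well-recognized stimuli are both kept, with the shorter one ranked first, and the less accurately recognized stimulus is dropped even though it is the briefest.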

Acoustical analyses were also performed to allow users to individually select their stimuli (Supplementary material).

#### **EMOTIONAL CATEGORIZATION**

Overall accuracy in the four-alternative emotion categorization task is 85.5% (*SD*: 15.8) for the violin-MEB, 75.4% (23.9) for the clarinet-MEB, and 94.8% (12.1) for the voice-MAV. The average percentages of correct recognition of each intended emotion for the selected stimuli are presented in **Table 1**. As can be seen, timbre had a greater effect on certain emotional intentions than on others. For example, fear was more difficult to recognize when expressed on a clarinet than on any other timbre.

The ANOVA conducted on the recognition scores (see values in bold in **Table 1**) with Timbre (violin, clarinet, and voice) and Emotion (happiness, sadness, fear, and neutrality) as within-subject factors yielded a main effect of timbre [*F*(2, 38) = 79.51, *p* < 0.001, η<sup>2</sup> = 0.81] and of emotion [*F*(3, 57) = 6.81, *p* < 0.005, η<sup>2</sup> = 0.26]; however, they are modulated by a significant interaction between Timbre and Emotion [*F*(3.4, 64.4) = 16.41, *p* < 0.001, η<sup>2</sup> = 0.46, Greenhouse-Geisser corrected].
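As a consistency check, the effect sizes reported above can be recomputed from the *F* values and degrees of freedom, assuming (the text does not say so explicitly) that η<sup>2</sup> denotes partial eta squared. The function name below is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)

# F values and degrees of freedom as reported for the recognition scores.
timbre = partial_eta_squared(79.51, 2, 38)
emotion = partial_eta_squared(6.81, 3, 57)
interaction = partial_eta_squared(16.41, 3.4, 64.4)
```

Under this assumption, the three values round to the reported 0.81, 0.26, and 0.46.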

Recognition scores were compared using Tukey's honestly significant difference. Scores averaged across emotions for each timbre were all significantly different (all *p* < 0.005) from one another: voices yielded the highest recognition scores and clarinet the lowest. Comparing emotions, fear was overall significantly (*p* < 0.01) less accurately recognized than all other emotions.

Using binomial tests to determine whether the emotions conveyed by each of the 80 stimuli were recognized above chance level (25%), we found that 87.5% (70/80) of the MEB were recognized above chance (*p* < 0.05; Bonferroni corrected). Thus, most MEB are effective in expressing an emotion on a musical instrument. Eight of the 10 stimuli that failed to be recognized belonged to the clarinet-fear category; the other two stimuli were from the violin-joy category.
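The above-chance criterion can be illustrated with a minimal sketch. It assumes a one-sided exact binomial test against the 25% chance level and a Bonferroni threshold of 0.05/80; the exact test variant is not specified in the text. The function names, the panel size of 20 listeners, and the count of 15 correct responses are hypothetical values chosen for illustration only.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def recognized_above_chance(k: int, n: int, chance: float = 0.25,
                            alpha: float = 0.05, n_tests: int = 80) -> bool:
    """Compare the p-value with a Bonferroni-adjusted threshold (alpha / n_tests)."""
    return binomial_tail(k, n, chance) < alpha / n_tests


# Hypothetical stimulus: 15 of 20 listeners chose the intended emotion.
N_LISTENERS = 20  # assumed panel size, for illustration only
p_value = binomial_tail(15, N_LISTENERS, 0.25)
print(f"p = {p_value:.2e}, above chance: {recognized_above_chance(15, N_LISTENERS)}")
```

Under these assumptions, a stimulus counts as recognized above chance only when its uncorrected p-value falls below 0.05/80, which is stricter than the usual 0.05 threshold because 80 stimuli are tested.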

#### **EMOTIONAL RATINGS**

The arousal and valence ratings averaged across participants for each stimulus are presented in **Figure 1**. The individual ratings are provided in the Supplementary Material.

*Each row represents the percentage of choices for each emotion in each timbre. Percentage of correct recognition is presented in bold, (SE).*

The same ANOVA as the one performed on the recognition scores, with Timbre and Emotion as between-subjects factors, was computed on the arousal ratings. Main effects of timbre [*F*(2, 38) = 10.05, *p* < 0.001, η<sup>2</sup> = 0.35] and of emotion [*F*(3, 57) = 33.94, *p* < 0.001, η<sup>2</sup> = 0.64] were observed; however, as previously, an interaction between Timbre and Emotion was obtained [*F*(6, 114) = 5.85, *p* < 0.001, η<sup>2</sup> = 0.24].

In general, the clarinet stimuli were judged to be less arousing than the violin and the vocal ones (all *p* < 0.05; by Tukey's tests), whereas the latter two were judged to be equally arousing (*p* = 0.67). Neutral expressions were overall significantly less arousing (*p* < 0.001) than all other emotions, and happy stimuli were found to be more arousing (*p* < 0.001) than the sad ones.

It is important to note that the stimuli played on a clarinet were rated differently than the violin and vocal stimuli. Happy clarinet stimuli were rated as more arousing than all the other emotions played on clarinet (all *p* < 0.05); [fear was also significantly (*p* < 0.005) more arousing than the neutral stimuli]. In contrast, the only significant difference for violin and vocal emotional bursts was that neutral stimuli were significantly less arousing (all *p* < 0.01) than all other stimuli.

Regarding valence ratings, we found a qualitatively similar pattern for both the violin and vocal stimuli (Happy > Neutral > Fear > Sad). The clarinet stimuli, however, showed a slightly different pattern, where fear was rated as being more positive than neutral stimuli (Happy > Fear > Neutral > Sad). Again, both a main effect of timbre [*F*(2, 38) = 6.13, *p* < 0.05, η<sup>2</sup> = 0.24] and of emotion [*F*(3, 57) = 116.65, *p* < 0.001, η<sup>2</sup> = 0.86] were observed, while the interaction between Emotion and Timbre was again found to be significant [*F*(6, 114) = 31.64, *p* < 0.001, η<sup>2</sup> = 0.63]. Overall, violin MEB were judged to be less positive than the vocal ones (*p* < 0.005), but globally emotions were significantly different from one another in terms of their valence ratings (*p* < 0.005).

This interaction can be explained by the fact that some differences were observed within timbre. Among the vocal stimuli, the happy ones were judged to be more positive than the neutral ones which were rated as more positive than fear, which in turn was also rated more positively than sadness (all *p* < 0.01). When played on a musical instrument, the happy stimuli were also judged as most pleasant (all *p* < 0.001), whereas only the sad stimuli were rated as significantly more negative than the neutral ones when played on violin (*p* < 0.05), and also as more negative than the stimuli expressing fear played on the clarinet (*p* < 0.005).

#### **DISCUSSION**

Here we validate the MEB, a set of short music clips designed to express basic emotions (happy, sad, fear, and neutral). Despite their short duration (1.6 s on average), the MEB stimuli were correctly categorized by emotion with high accuracy (average recognition score of 80.4%). The highest accuracy was obtained on the violin for stimuli expressing fear and sadness (88%) and on the clarinet for those conveying happiness (92%). Although the MAV stimuli were best recognized, the newly created MEB still accurately portrayed the desired emotions.

Only three emotions were tested here to allow for direct comparison between basic vocal (MAV) and musical (MEB) emotions. Our limited selection of emotions does limit the voice-music comparison, but it is a first step in making that comparison. We acknowledge that there are multiple variants of positive and negative emotions in the musical and vocal literature; our aim was to use the most easily recognized emotions common to both domains. From a dimensional approach, basic emotions can be distinguished on the dimensions of valence and arousal; variations of these (and other) emotions also differ in valence and arousal and can easily be represented along basic emotions.

The arousal and valence ratings obtained here fit well with this dimensional representation of emotions, with happy stimuli conveying positive and arousing emotions, fear stimuli conveying negative and arousing emotions (with the exception of a few clips played on clarinet), sad stimuli conveying moderately arousing and negative emotions, and the neutral stimuli conveying an emotional valence that is neither positive nor negative with little arousal.

Although the valence scale had a highest possible rating of 9, it is important to note that the maximal average arousal elicited by our stimuli is 6.8 (7.1 for voice). Perhaps the short duration of our stimuli limited their arousing capabilities, which could potentially explain the partial overlap in arousal observed in **Figure 1** between our two negative emotions (fear, sadness). Also, the fact that the valence scale ranged from "extremely negative" to "extremely positive" (Belin et al., 2008; Aubé et al., 2013), and not from "unpleasant" to "pleasant", could explain why the sad stimuli are differently positioned on the scale than in previous studies (e.g., Vieillard et al., 2008). Nevertheless, our results are still quite similar to those of Vieillard et al. (2008), which were obtained with longer and more conventional musical stimuli (inspired by film music), suggesting that the MEB may tap into similar emotional processes as those evoked by more elaborate film music clips. Yet, the MEB consist of brief expressions and are less likely than more conventional musical stimuli to involve high-level cognitive mechanisms such as divided attention and sophisticated knowledge of musical structure. The MEB are not limited by tonality or defined by a specific rhythm; they were created as short musical bursts by professional musicians on their instruments.

Our stimuli can be viewed as a primitive form of musical emotion, situated somewhere in between long musical excerpts from recordings (e.g., Peretz et al., 1998) or short musical segments extracted from these (Dalla Bella et al., 2003; Filipic et al., 2010) and synthesized frequency-modulated tones designed to mimic key acoustic features of human vocal expressions (Kantrowitz et al., 2013). Our novel stimuli were created to be exactly where they are in this spectrum by representing the most basic form of musical emotion that can be closely related to vocal expressions. Although exact replicas of the MAV could have been used instead, by digitally transposing the MAV to another timbre, we chose to produce new recordings in order to keep the stimuli as natural (realistic) as possible.

The timbre, or instrument on which music is played, is known to have an important impact on emotion recognition (Behrens and Green, 1993; Gabrielsson and Juslin, 1996; Balkwill and Thompson, 1999; Hailstone et al., 2009). For example, Hailstone et al. (2009) found that melodies sound less happy when played on the violin than on other instruments, as we found here. This effect was particularly clear in the imitations of vocal expressions (see Supplementary Material). A range of timbres was used in prior studies (including violin and voice), and each instrument seemed to present its own possibilities and limitations when it came to expressing specific emotions. For instance, in our study, we observed that fear was not well recognized when expressed on the clarinet.

Other limitations will also need to be addressed. For example, a forced-choice emotion recognition task was used here, and such tasks can have an impact on statistical analyses, such as increased co-linearity (if one response is chosen, the others are not), which generates artificially high recognition rates (Cornwell and Dunlap, 1994; Frank and Stennett, 2001). This method was selected to facilitate the web-based validation procedure of a large number of stimuli (280), and we believe the technique has served its purpose, as significant differences were observed between the timbres and within each timbre as revealed within the confusion matrix.

In addition, musicians were explicitly asked to imitate vocalizations (3/10 MEB per emotion). Such imitations produced on an instrument with voice-like characteristics may limit the chance to obtain domain-specific responses. In contrast, by using such a setup, finding evidence for domain-specificity would be compelling, even more so if parameters like pitch, emotion recognition scores and valence/arousal ratings are controlled for and used as regressors (Supplementary Material) to compensate for the observed differences.

Here we propose a validated set of auditory stimuli designed as a musical counterpart of the MAV to allow a better comparison between auditory (musical and vocal) stimuli designed to convey emotions. We hope that the MEB will contribute to the understanding of emotions across domains and modalities.

#### **ACKNOWLEDGMENTS**

This work was supported by an Auditory Cognitive Neurosciences, Erasmus Mundus Mobility Scholarship by the European Union to Sébastien Paquette, by grants from Natural Sciences and Engineering Research Council of Canada, the Canadian Institutes of Health Research and from the Canada Research Chairs program to Isabelle Peretz and BBSRC grants BB/E003958/1 and BB/I022287/1, and ESRC/MRC large grant RES-060-25-0010 to Pascal Belin. We thank Patrice Voss for editing a previous version of the manuscript.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/Emotion\_Science/10.3389/fpsyg.2013.00509/abstract

### **REFERENCES**


*Research in Review*, ed P. Ekman (New York, NY: Academic Press), 91–168.


P. N., Butler, P. D., et al. (2013). Reduction in tonal discriminations predicts receptive emotion processing deficits in schizophrenia and schizoaffective disorder. *Schizophr. Bull.* 39, 86–93. doi: 10.1093/schbul/sbr060


anticipation and experience of peak emotion to music. *Nat. Neurosci.* 14, 257–262. doi: 10.1038/nn.2726


*Commun.* 40, 99–116. doi: 10.1016/S0167-6393(02)00078-X


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 04 April 2013; accepted: 18 July 2013; published online: 13 August 2013. Citation: Paquette S, Peretz I and Belin P (2013) The "Musical Emotional Bursts": a validated set of musical affect bursts to investigate auditory affective processing. Front. Psychol. 4:509. doi: 10.3389/fpsyg.2013.00509*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Paquette, Peretz and Belin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## A vocal basis for the affective character of musical mode in melody

#### *Daniel L. Bowling\**

*Department of Cognitive Biology, University of Vienna, Vienna, Austria*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Sébastien Paquette, University of Montréal, Canada*

*Meagan Curtis, Purchase College, USA*

#### *\*Correspondence:*

*Daniel L. Bowling, Department of Cognitive Biology, University of Vienna, Alserstrasse 14, 1090 Vienna, Austria e-mail: dan.bowling@univie.edu*

Why does major music sound happy and minor music sound sad? The idea that different musical modes are best suited to the expression of different emotions has been prescribed by composers, music theorists, and natural philosophers for millennia. However, the reason we associate musical modes with emotions remains a matter of debate. On one side there is considerable evidence that mode-emotion associations arise through exposure to the conventions of a particular musical culture, suggesting a basis in lifetime learning. On the other, cross-cultural comparisons suggest that the particular associations we make are supported by musical similarities to the prosodic characteristics of the voice in different affective states, indicating a basis in the biology of emotional expression. Here, I review developmental and cross-cultural studies on the affective character of musical modes, concluding that while learning clearly plays a role, the emotional associations we make are (1) not arbitrary, and (2) best understood by also taking into account the physical characteristics and biological purposes of vocalization.

**Keywords: music, emotion, voice, mode, interval-size**

#### **INTRODUCTION**

One of the oldest and most prevalent ideas in the history of music is that there is a special connection between music and emotion (Kivy, 2002, 14). Today we know that the emotional coloring of a piece of music results from a complex interplay between acoustical properties, expectation, context, and a listener's personal experience. Efforts to understand the role of acoustical properties have been particularly successful. A great deal is now known about the affective contributions of tempo, intensity, pitch, spectral energy distribution, tone attack, and microstructural irregularity—and a coherent biological framework for understanding why they affect us has been offered in terms of similarity to corresponding properties in vocal affect expression (reviewed in Juslin and Laukka, 2003). The focus here is on the affective contribution of musical *mode*, which has proven a greater challenge to understand. In musicology, the term "mode" is used to refer to a variety of different concepts, only loosely related by their usage in the study of scales and melodies (Randel, 1986, 499). This ambiguity of meaning has generated some confusion, particularly when comparing different musical traditions. Thus, as a first step, I define mode as a set of musical tones and tone-relationships that are used to create a melody. By avoiding additional related concepts such as rules for ornamentation or pre-specified melodic motifs, this simple definition can be applied to music from a wide variety of traditions.

The notion that the mode of a melody influences the perception of its emotional coloring has been a part of music theory since antiquity. In *The Republic*, Plato treats it as common sense that certain modes are best suited for the representation of particular feelings or states of character (Plato, 375 BCE/1955, 93–95). Similarly, in Indian musical traditions, descriptions of which modes are appropriate for various emotions are found in ancient Sanskrit texts, such as the *Nāṭyaśāstra* (∼200 CE; Capwell, 1986, 779; Devy, 2002, 3–4). Further examples come from the Middle East and East Asia, where the affective connotations of different modes are documented in Persian and Japanese musical traditions (Nettl, 1986, 531; Hoshino, 1996).

Despite these rich historical and cross-cultural opportunities, the affective character of musical modes has almost exclusively been studied in the context of post-Renaissance Western music, where the most salient examples of mode-emotion associations are based on the major and minor modes. When other factors such as tempo and intensity are carefully controlled, music composed using the major mode is typically heard as relatively positive and excited (e.g., joyful), whereas music composed using the minor mode is heard as relatively negative and subdued (e.g., sad; Hevner, 1935; Crowder, 1984, 1985; Kastner and Crowder, 1990; Gerardi and Gerken, 1995; Gregory et al., 1996; Peretz et al., 1998; Dalla Bella et al., 2001; Gagnon and Peretz, 2003). These connotations appear to have persisted for over 400 years: the 16th century Italian music theorist and composer Gioseffo Zarlino similarly described the effect of the major mode as "gay and lively," and the minor mode as "sad and languid" (Zarlino, 1558/1968, 21–22).

The historical, cross-cultural, and repeatedly verified relationships between musical modes and emotions make it clear that these associations must be addressed by any theory of musical emotions. In spite of this, most modern theories downplay mode, relegating it to a catch-all of cultural convention, and in so doing pay little attention to the possibility of biological roots. In what follows, I advance a biological approach to mode-emotion associations by: (1) demonstrating the logical error of using evidence that mode-emotion associations are learned, as evidence against biological underpinnings; (2) presenting the evidence that mode-emotion associations exhibit acoustical similarities across cultures; and (3) extending the biological framework of vocal imitation proposed by Spencer (1857, 1890) to account for why we associate modes and emotions.

#### **THE EVIDENCE FOR LEARNING**

Most modern accounts of why music expresses emotion suggest that emotional associations with mode are learned through exposure to the conventions of a particular musical culture (e.g., Lundin, 1967, 166–169; Juslin and Laukka, 2003; see also Huron, 2008; Trainor and Corrigall, 2010). This position is supported by evidence that mode-emotion associations develop with age, and are more reliably made by individuals with musical training. One experimental paradigm involves presenting children of different ages with major and minor music and asking them to "match the feeling" by selecting among schematic drawings of various facial expressions (Dolgin and Adelson, 1990). Using this approach, a number of studies have found that culturally appropriate mode-emotion associations do not develop until 6–8 years of age, before which they are not reliably made (Gerardi and Gerken, 1995; Gregory et al., 1996; Dalla Bella et al., 2001; but see Kastner and Crowder, 1990 for emergence at age 3). The perception of major and minor chords has also been studied in infancy (6 months) using a preferential looking paradigm (Crowder et al., 1991). In accord with the results for young children, no reliable preferences were observed for major over minor (or vice versa). With respect to musical training, a number of studies have shown that individuals who have had explicit instruction in Western music make culturally appropriate mode-emotion associations more reliably than those who have not (e.g., Heinlein, 1928; Hevner, 1935). Importantly, however, these studies also show that musical training is not necessary for appropriate associations to be made (see also Dalla Bella et al., 2001).
These findings, together with the fact that mode-emotion associations are well established in adulthood (Hevner, 1935; Crowder, 1984, 1985; Gerardi and Gerken, 1995; Gregory et al., 1996; Peretz et al., 1998; Dalla Bella et al., 2001; Gagnon and Peretz, 2003), suggest a pattern of learning over the course of development.

Evidence that mode-emotion associations strengthen with age and musical training is often taken to imply that such associations are arbitrary, arising solely through learning the conventions of a given musical culture (Heinlein, 1928; Lundin, 1967, 166–169; Gregory et al., 1996). Over the past 70 years, however, ethologists have repeatedly demonstrated that evidence for learning does not necessarily imply an absence of biological preparation. On the contrary, learned associations are often supported by inherited predispositions. For example, honeybees must learn to associate specific flower types with food, but they instinctively attend to flower-like objects (Gould and Marler, 1987). Similarly, rats must learn whether specific foods are safe for consumption, but they are predisposed to attend to olfactory as opposed to visual or auditory cues in doing so (Garcia and Koelling, 1966). Another example comes from songbirds that, despite having to learn local song dialects from adult conspecifics, are predisposed to recognize and preferentially learn species-typical song patterns (Marler, 1991). Among primates, young vervet monkeys must learn to emit "eagle" alarm calls to objects that pose an aerial threat (such as eagles or hawks), but are predisposed to emit these calls to a wide range of objects moving overhead, including, for example, falling leaves (Seyfarth et al., 1980). Human children must also learn about potential sources of danger, a process that is in part supported by perceptual biases to associate certain classes of animals (e.g., spiders or snakes) with fear responses (Lobue and DeLoache, 2008). Finally, and perhaps of even greater significance, we are somehow predisposed to learn the myriad associations and rules that constitute language (Tomasello, 1999; Chomsky, 2006). In sum, these examples underscore the importance of biology in explaining perception and behavior and caution against the assumption that evidence for learning in mode-emotion associations precludes a species-wide predisposition to make them.

#### **MODE-EMOTION ASSOCIATIONS ACROSS CULTURES**

Interpretation of developmental studies is complicated by the confounding roles of perceptual and cognitive development. Thus, in the developmental studies alluded to above, it remains possible that mode-emotion associations are somehow made by young children but were not observed due to perceptual or cognitive limitations, such as difficulties with inferring harmonic structure (see Gerardi and Gerken, 1995) or because the task of pointing to schematic faces representing emotional reactions to music requires cognitive skills poorly developed at the ages examined (see Dalla Bella et al., 2001). Cross-cultural studies offer a clear advantage here because adults from different cultures presumably have comparable perceptual and cognitive abilities. Accordingly, a number of studies have examined how adults respond to emotion in music from unfamiliar cultures. Few of these studies, however, have been specifically designed to assess the influence of mode, which is thus typically confounded with other variables such as tempo and intensity. Nevertheless, this body of work shows that adult listeners perceive the emotions intended by composers/performers in culturally unfamiliar music with considerable accuracy (particularly for joy and sadness), and that their emotional judgments vary with mode in a way that is either in agreement with unfamiliar traditions (Balkwill and Thompson, 1999; Balkwill et al., 2004), or at least matches the judgments of listeners native to the culture in question (Fritz et al., 2009; Zacharopoulou and Kyriakidou, 2009). Only one cross-cultural study has isolated the influence of mode on perceived emotion. Hoshino (1996) played simple major and minor melodies to Japanese adults who grew up before WWII and were reportedly unfamiliar with Western musical conventions.
Although Hoshino's methods were somewhat unorthodox—subjects were asked to associate melodies with colors, which they later described with emotional labels—she found some evidence for cross-cultural similarity in modal perception: major melodies were described as bright and warm, whereas minor melodies were described as dark and melancholic.

A different approach to studying mode-emotion associations is to compare the structure of music composed using modes from different cultures that are associated with similar emotions. Features that are held in common are candidates for mediating emotional associations. Accordingly, Bowling et al. (2012) compiled melodies composed in modes associated with either negative-subdued or positive-excited emotions in classical South Indian (*Carnatic*) and classical Western music and compared their structure in terms of the sizes of the intervals that occur between adjacent melody notes (see also Huron, 2008). In both traditions, it is apparent that the use of modes associated with negative-subdued emotion results in melodies with a significantly greater proportion of smaller intervals (<200 cents<sup>1</sup>), whereas the use of modes associated with positive-excited emotion results in melodies with a significantly greater proportion of larger intervals (≥200 cents). These results are not specific to classical music, as the same pattern has also been observed in major and minor Finnish folk melodies (Bowling et al., 2010). These comparisons cast further doubt on the idea that the associations between mode and emotion are inherently arbitrary.
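The interval-size comparison described above can be sketched in a few lines. This is an illustrative sketch only, not the analysis pipeline of Bowling et al. (2012): the helper `midi_to_hz`, the function names, the two toy melodies, and the rounding step that guards against floating-point error at the 200-cent boundary are all assumptions introduced here. It uses the standard cents formula, 1200 × log<sub>2</sub> of the frequency ratio, and takes the absolute value so that ascending and descending intervals are treated alike.

```python
import math


def interval_cents(f1: float, f2: float) -> float:
    """Signed interval size in cents between two frequencies (Hz)."""
    return 1200 * math.log2(f1 / f2)


def midi_to_hz(note: int) -> float:
    """Hypothetical helper: equal-tempered frequency with A4 (note 69) = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def proportion_small(freqs: list[float], threshold: float = 200.0) -> float:
    """Share of adjacent-note intervals whose absolute size is below threshold."""
    sizes = [round(abs(interval_cents(b, a)), 6) for a, b in zip(freqs, freqs[1:])]
    return sum(size < threshold for size in sizes) / len(sizes)


# Toy melodies (not data from the studies cited): the first moves mostly by
# semitones, the second by whole tones and thirds.
stepwise = [midi_to_hz(n) for n in (69, 71, 72, 71, 69, 68, 69)]
leaping = [midi_to_hz(n) for n in (60, 64, 67, 64, 62, 60)]

print(f"small-interval share, stepwise melody: {proportion_small(stepwise):.2f}")
print(f"small-interval share, leaping melody:  {proportion_small(leaping):.2f}")
```

Repeated notes (0 cents) are left out of the toy melodies; how unisons are counted would need to be decided explicitly in a real analysis.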

#### **A VOCAL BASIS FOR THE AFFECTIVE CHARACTER OF MUSICAL MODES**

What are cross-cultural similarities in mode-emotion associations based on? In his essay "On the Origin and Function of Music," Herbert Spencer (1857) proposed that music expresses emotion by imitation and exaggeration of acoustical properties of emotional expression in the voice. His logic entails two steps. First, he argued that the physiological components of emotion affect the mechanisms of vocal production and thus the acoustical properties of the voice, resulting in routine associations between those properties and the subjective components of emotional experience. Second, he argued that by employing the same acoustical properties, music gains access to the same emotional associations. Although Spencer's theory co-exists with several others that attempt to explain how music conveys emotion [reviewed in Davies (2001); see also Crowder (1984) and Juslin and Västfjäll (2008)], increasing evidence of acoustical similarities between music and voice, paired with careful consideration of the various relationships that might explain them (Juslin and Laukka, 2003), has provided strong support for Spencer's vocal imitation hypothesis.

While Spencer (1857, 399) did not specifically consider musical mode, he did comment on interval size, noting that "calm speech is comparatively monotonous" whereas "emotion makes use of fifths, octaves, and even wider intervals." Distinguishing between different types of emotion, more recent studies have found that frequency contours in sad speech are relatively flat and stable, whereas frequency contours in joyful speech are more dynamic and variable [reviewed in Scherer (1986, 2003)]. With respect to interval-size, Bowling et al. (2012) calculated the frequency differences between adjacent voiced intensity maxima in recordings of sad and joyful speech. In parallel with the pattern of interval-sizes found in modal melodies (see above), sad speech comprised a greater proportion of smaller intervals, while joyful speech comprised a greater proportion of larger intervals. Furthermore, the interval-size at which the reversal in prevalence occurred between the two emotional conditions was roughly the same in speech and music, between 100 and 200 cents.

Evidence of similarities in interval-size between musical and vocal expression has also been found by Curtis and Bharucha (2010), who report that descending minor thirds (-300 cents) frequently occur between the first and second syllables of two-syllable expressions conveying sadness (e.g., "come on"). This result is of particular interest because minor thirds play a central role in distinguishing music composed in the minor modes (Bowling et al., 2010). Accordingly, Curtis and Bharucha argue for the possibility of an "interval code" by which the occurrence of specific intervals signals specific emotions in music and the voice. Two issues complicate this interpretation. First, only descending minor thirds ending on the tonic (the "tonal center" of a mode) are emphasized in minor vs. major music. Defined more generally (i.e., not limited to those intervals ending on the tonic), descending minor thirds are only marginally more prevalent in minor vs. major melodies: by 0.1% in classical Western music and 0.4% in Finnish folk music (Bowling et al., 2010, 2012). Furthermore, the opposite pattern occurs in Carnatic melodies, where descending minor thirds are actually more prevalent in melodies composed in modes associated with joy (accounting for 8% of melodic intervals) than they are in melodies composed in modes associated with sadness (accounting for 4.2%; Bowling et al., 2012). It remains possible that tonic minor thirds tap into some form of interval code, but further research is needed to explore whether or not the concept of a tonic has any bearing in speech. Second, there is conflicting evidence as to whether or not specific musical intervals are emphasized in speech prosody. In contrast to Curtis and Bharucha, Bowling et al. (2012) found no evidence of emphasis at descending minor thirds in two-syllable expressions of sadness.
Methodological differences in speaker selection, method of interval calculation, and histogram-bin size are likely sources of this difference. However, aside from these conflicting findings over the importance of relative vs. specific interval-sizes in mode-emotion associations, the available evidence clearly indicates that interval-size, like many other acoustical properties, varies with emotion in music and the voice in a parallel fashion.

#### **THEORETICAL ACCOUNTS**

Five theoretical frameworks can account for the relationship between interval-size in modal music and vocal expression. In what follows, I argue for the exclusion of four of them in favor of the fifth, i.e., Spencer's theory that music expresses emotion by imitating the voice. Juslin and Laukka took a similar approach in their 2003 review of acoustical similarities between musical and vocal expression, although they did not explicitly consider interval-size.

#### **EXPLANATION 1 - INTERVAL-SIZES IN MUSICAL AND VOCAL EXPRESSION ARE ENTIRELY UNRELATED, OBSERVED SIMILARITIES ARE COINCIDENTAL**

The principal argument against this possibility is that the same relationship between interval-size and emotion is observed in music and speech from different musical traditions and cultures. Whereas we can be fairly certain that basic acoustical properties of vocal expression are conserved across cultures (Elfenbein and Ambady, 2002), assessments of interval-size in mode-emotion associations from additional musical traditions could potentially strengthen the argument against spurious correlation (Persian music offers perhaps the best opportunity here). If such assessments are undertaken in the future, the prediction from the Carnatic and Western music comparisons (see above) is that musical traditions that systematically associate smaller intervals with positive-excited emotion, and/or larger intervals with negative-subdued emotion, are either non-existent or exceptionally rare. Additionally, the creation of new tonal traditions with these "backward" emotional associations should be considerably more difficult than the creation of new tonal traditions with "normal" associations.

<sup>1</sup>Cents are units of frequency interval size. One cent is 1/100 of an equally tempered semitone. An octave thus comprises 1200 cents. The size C, in cents, of an interval between two frequencies F<sub>1</sub> and F<sub>2</sub> is C = 1200 × log<sub>2</sub>(F<sub>1</sub>/F<sub>2</sub>).

A second argument against a coincidental interpretation of interval-size similarities in musical and vocal expression is that interval-size is part of a larger pattern of overall acoustic similarity between vocal and musical expression that includes a host of other properties. The possibility that all of these similarities are coincidental is remote (Juslin and Laukka, 2003), and there is no apparent reason to make an exception of interval-size.
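As a worked illustration of the cents formula given in the footnote, the following minimal sketch converts a pair of frequencies into an interval size in cents. The function name `interval_in_cents` and the example frequencies (a 440 Hz reference tone and a 5:4 frequency ratio) are illustrative assumptions and are not taken from the article.

```python
import math


def interval_in_cents(f1: float, f2: float) -> float:
    """Return the interval size C = 1200 * log2(f1 / f2) in cents.

    Hypothetical helper for illustration; f1 and f2 are frequencies in Hz.
    A positive result means f1 is higher than f2.
    """
    if f1 <= 0 or f2 <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(f1 / f2)


# Example values are assumptions chosen for illustration only.
octave = interval_in_cents(880.0, 440.0)                    # frequency ratio 2:1
semitone = interval_in_cents(440.0 * 2 ** (1 / 12), 440.0)  # equally tempered semitone
just_major_third = interval_in_cents(5.0, 4.0)              # frequency ratio 5:4

print(round(octave, 2), round(semitone, 2), round(just_major_third, 2))
```

Because the measure is logarithmic, it depends only on the ratio of the two frequencies, so the same musical interval has the same size in cents regardless of the register in which it is sung or played.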

#### **EXPLANATION 2 - INTERVAL-SIZES IN MUSICAL AND VOCAL EXPRESSION ARE RELATED, BUT ONLY THROUGH THE COMMON INFLUENCE OF A THIRD CAUSAL FACTOR**

The arguments against this explanation are best made using examples. One potential third factor that could influence music and the voice is the expression of emotion through body postures and movements (Kivy, 2002, 37–48). According to this hypothesis, acoustical properties such as tempo or intensity express emotion in both music and the voice because they imitate analogous properties of bodily expression, such as the speed and force of movements. However, it is unclear how this hypothesis would apply to the sizes of frequency intervals, as there is no clear analogue in body postures or movements. This problem reflects a general limitation of *third causal factor* accounts of interval-size similarities: frequency intervals are essentially unique to music and the voice. It is true that a metaphorical similarity between frequency intervals in music and the size/extent of body movements could be drawn, but it is unclear why this would be more plausible than a direct similarity with frequency intervals in the voice.

Another type of third causal factor interpretation considers influence apart from explicit imitation. For example, it is possible that a physiological factor such as arousal determines the relationship between emotion and interval-sizes in music and the voice. With respect to the voice, one could postulate that changes in arousal determine interval-size by regulating the energy available to the muscles that control breathing and laryngeal posture. With respect to music, however, this link between arousal and interval-size does not necessarily apply. The reason is that the forms of musical instruments are not physiologically constrained. Accordingly, the relationship between muscular effort and interval-size may be decoupled (as it is to some extent with a piano or guitar, and to a fuller extent with computer software used for composition) or even reversed (imagine an instrument on which the more forcefully one plays, the slower, softer, and smaller frequency intervals become). The lack of necessary connection between arousal and interval-size in instrumental music poses a problem for an arousal-based third causal factor account because it makes clear that arousal does not always determine interval-sizes in music. This example thus provides another general argument against third causal factor accounts of interval-size similarities in musical and vocal expression. For any cause that does not explicitly comprise frequency intervals, a necessary link to interval-size must be demonstrated in both domains.

#### **EXPLANATIONS 3–5**

Each of the remaining explanations posits some form of causal relationship between interval-size in music and the voice. Before considering them, it is necessary to draw a distinction between vocal and non-vocal music since any argument claiming that emotional expression in the voice derives from vocal music (or vice versa) is inherently circular. Keeping this in mind, there are three possible causal relationships that could exist between interval-size in musical and vocal expression: (3) *the voice imitates music*; (4) *both music and the voice imitate each other, with neither being primary*; and (5) *Spencer's theory that music imitates the voice*.

The critical evidence for deciding between these alternatives concerns the evolutionary primacy of affective vocal expression. Despite the well-known debate between Spencer and Darwin over whether music derives from speech or speech derives from music, both men agreed on the primacy of some form of vocalization in auditory affective communication (Darwin, 1879/2004, 638; 1889/1998, 88–99; Spencer, 1857, 1890), and the evidence has only grown stronger since. With respect to musical instruments, the earliest uncontested archaeological evidence (a pair of flutes made from the wing bones of a swan) dates to approximately 37,000 years ago (reviewed in Fitch, 2006). In contrast, many of the neural and physiological mechanisms responsible for vocal affect expression (e.g., descending motor control from the brainstem, and various aspects of laryngeal anatomy) are shared by a large variety of mammals, suggesting relatively ancient phylogenetic roots (Jürgens, 1992; Ploog, 1992).

Even if one pushes the origins of instrumental music back to our common ancestor with *H. neanderthalensis* (as suggested by certain interpretations of slightly older flute-like objects found with other Neanderthal artifacts; Kunej and Turk, 2000), or even further back to our last common ancestor with chimpanzees and gorillas (which both display some forms of drumming behavior; Fitch, 2006), the primacy of vocal affect expression remains incontrovertible. Explanations that the voice imitates music, or that both imitate each other with neither having primacy (explanations 3 and 4 above), are both incompatible with these facts. Only Spencer's theory, that interval-size in music expresses emotion by imitation and exaggeration of interval-size in the voice (explanation 5 above), is compatible with current archaeological and phylogenetic data relevant to the ages of vocal affect expression and instrumental music.

#### **A NOTE ABOUT HARMONY**

The present discussion has focused on changes in frequency over time (i.e., melody), but the affective character of musical modes is also realized in the simultaneous presentations of multiple frequencies, i.e., harmonies or chords (Heinlein, 1928; Crowder, 1984, 1985). Nevertheless, the relationship between melodic interval-size and emotion discussed here may also have relevance for the affective character of mode in harmony. Krumhansl (1990) provides extensive evidence that the notes in a melody line are perceived in the harmonic context of previously occurring notes, even though there is no physical simultaneity. Accordingly, when this context is made explicit in the form of chords, simultaneous intervals may in some sense be interpreted in relation to their melodic counterparts, i.e., the response of the nervous system may be partially redundant between these situations, activating associations originally made in melodic context in harmonic context and vice versa. For alternative theories on the affective impact of mode in harmony, based on sensory dissonance, familiarity with harmonic spectra, or sound symbolism, see Helmholtz (1877/1895, 211–219), Crowder (1984), and Cook (2007), respectively.

#### **CONCLUSION**

Explanations of mode-emotion associations that rely solely on lifetime learning of arbitrary cultural conventions are incompatible with evidence suggesting that the particular associations we make are similar across cultures and constrained by the interval-size properties of modes as expressed in melody. The parallel patterns observed between interval-size and emotion in modal music and the voice suggest a relationship between these domains. Upon consideration of the possible explanations that could account for interval-size similarities between musical and vocal expression, Spencer's theory that music imitates the voice is the only explanation that is entirely consistent with the available evidence. Accordingly, it seems reasonable to conclude that the affective character of modes as realized in melody is, like many other aspects of music, best understood by also taking into account the physical characteristics and biological purposes of vocalization.

#### **FINANCIAL DISCLOSURE**

This work was supported by a grant from the National Science Foundation [BCS-0924181] awarded to Dale Purves, and a European Research Council advanced grant [No. 230604 "SOMACCA"] awarded to Tecumseh Fitch.

#### **ACKNOWLEDGMENTS**

I thank Douglas Bowling, Dale Purves, Sudara Williams, Marisa Hoeschele, and Bruno Gringas for their comments on the manuscript. Thanks are also due to Tecumseh Fitch and the University of Vienna for financial support during writing.

#### **REFERENCES**

harmonies. *Music Percept.* 24, 315–319.

*Basis for the Theory of Music*. Translated by A. J. Ellis. London: Longmans, Green, and Co.

Oxford University Press. doi: 10.1017/S0140525X08005293

in *Music Perception (Springer Handbook of Auditory Research)*, eds M. R. Jones, A. N. Popper, and R. R. Fay (New York, NY: Springer), 89–127.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 08 April 2013; accepted: 03 July 2013; published online: 31 July 2013. Citation: Bowling DL (2013) A vocal basis for the affective character of musical mode in melody. Front. Psychol. 4:464. doi: 10.3389/fpsyg.2013.00464*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Bowling. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Animal signals and emotion in music: coordinating affect across groups

#### *Gregory A. Bryant\**

*Department of Communication, Center for Behavior, Evolution, and Culture, University of California at Los Angeles, Los Angeles, CA, USA*

#### *Edited by:*

*Daniel J. Levitin, McGill University, Canada*

#### *Reviewed by:*

*Rajagopal Raghunathan, University of Texas at Austin, USA Charles T. Snowdon, University of Wisconsin–Madison, USA*

#### *\*Correspondence:*

*Gregory A. Bryant, Department of Communication, Center for Behavior, Evolution, and Culture, University of California at Los Angeles, 2303 Rolfe Hall, Los Angeles, CA 90095, USA e-mail: gabryant@ucla.edu*

Researchers studying the emotional impact of music have not traditionally been concerned with the principled relationship between form and function in evolved animal signals. The acoustic structure of musical forms is related in important ways to emotion perception, and thus research on non-human animal vocalizations is relevant for understanding emotion in music. Musical behavior occurs in cultural contexts that include many other coordinated activities which mark group identity, and can allow people to communicate within and between social alliances. The emotional impact of music might be best understood as a proximate mechanism serving an ultimately social function. Recent work reveals intimate connections between properties of certain animal signals and evocative aspects of human music, including (1) examinations of the role of nonlinearities (e.g., broadband noise) in non-human animal vocalizations, and the analogous production and perception of these features in human music, and (2) an analysis of group musical performances and possible relationships to non-human animal chorusing and emotional contagion effects. Communicative features in music are likely due primarily to evolutionary byproducts of phylogenetically older, but still intact communication systems. But in some cases, such as the coordinated rhythmic sounds produced by groups of musicians, our appreciation and emotional engagement might be driven by an adaptive social signaling system. Future empirical work should examine human musical behavior through the comparative lens of behavioral ecology and an adaptationist cognitive science. By this view, particular coordinated sound combinations generated by musicians exploit evolved perceptual response biases – many shared across species – and proliferate through cultural evolutionary processes.

**Keywords: emotion in music, arousal, nonlinearities, music distortion, coalition signaling**

"fpsyg-04-00990" — 2013/12/31 — 11:24 — page 1 — #1

#### **INTRODUCTION**

Musical sounds can evoke powerful emotions in people, both as listeners and performers. A central problem for researchers examining music and emotion is to draw clear causal relationships between affective acoustic features in music and the associated responses in listeners. Behavioral ecologists have long studied emotional communication in non-human animals, and one guiding principle in this research is that the physical forms of evolved signals are shaped by their respective communicative functions (Morton, 1977; Owren and Rendall, 2001). Signals evolve as part of a signaling system – that is, the production of a signal is necessarily tied to a systematic response by target listeners. This basic fact of animal signaling leads us to an inescapable conclusion regarding music and emotion: the physical structure of musical forms must be related in important ways to people's perceptions and behavioral responses to music. The complex question thus arises: does music, in any way, constitute a signal that is shaped by selection to affect listeners' behavior and potentially convey adaptive information to conspecifics (i.e., members of the same species)? Alternatively, perhaps music is a by-product of a variety of cognitive and behavioral phenomena. In any case, comparative analyses examining acoustic signals in non-human animals can shed light on musical behaviors in people.

Here I will describe research that explores the perception of arousal in music from a comparative perspective, and frame this work theoretically as the exploration of one important proximate mechanism (i.e., an immediate causal process) among many underlying our special attention and attraction to affective properties in musical sound. Music is a cultural product that often exploits pre-existing perceptual sensitivities originally evolved for a variety of auditory functions including navigating sonic environments as well as communication. Cultural evolution has led to increasingly complex, cumulative musical developments through a sensory exploitation process. I suggest that humans have evolved an adaptive means to signal relevant information about coalitions and collective affect within and between social groups. This is accomplished through the incorporation of elaborate tonal and atonal sound, combined with the development of coordinated performance afforded by rhythmic entrainment abilities.

A key issue for understanding the nature of music is to explain why it is emotionally evocative. Darwin (1872) famously described many affective signals in humans and non-human animals, and biologists have since come to understand animal emotional expressions not as cost-free reflections of internal states, but rather as strategic signals that have evolved to alter the behavior of target organisms in systematic ways (Maynard Smith and Harper, 2003). Receivers have evolved response biases that allow them to react adaptively to these signals resulting in co-evolutionary processes shaping animal communication systems (Krebs and Dawkins, 1984). Many scholars have noted the clear connections between human music and emotional vocalizations (Juslin and Laukka, 2003), as well as the connections between human and animal vocalizations (Owren et al., 2011). Snowdon and Teie (2013) recently outlined a theory of the emotional origins of music from a comparative perspective. But researchers examining emotion in music do not typically draw explicit connections to animal vocal behavior.

#### **FORM AND FUNCTION IN ANIMAL SIGNALS**

Recently there has been an increased focus on the form–function relationship between acoustic structure in animal signals and their communicative purposes. The principle of form and function has been indispensable in the study of, for example, functional morphology, but is also crucial for understanding animal signaling. Morton (1977) in his classic paper described the convergent evolution of specific structural features in animal signals based on the behavioral communicative context, and the motivations of senders. Low, broadband (i.e., wide frequency range) sounds are often honestly tied to body size and hostile intent, and can induce fear in receivers. Conversely, high pitched tonal sounds are related to appeasement, and are often produced to reduce fear in listeners. These motivational–structural (MS) rules apply widely across many species and have provided an evolutionary basis for studying the acoustic structure of animal signals (see Briefer, 2012 for a recent review). MS rules illustrate nicely how sound is often much more important than semantics in animal signals. Owren and Rendall (2001) described researchers' frequent reliance on linguistic concepts in understanding primate vocalizations. Animal signals have often been studied as potentially containing "meaning" with referential specificity. An alternative approach is to examine patterns of responses to closely measured non-referential acoustic features of signals. Many signals can affect perceivers in beneficial ways that do not require the activation of mental representations, analogs to "words," or the encoding of complex concepts. Owren and Rendall (2001) encouraged researchers to rule out simple routes of communication before invoking necessarily more complex cognitive abilities that would be required of the signaling organism. That is not to say that complex meanings are never instantiated in non-human animal signals, but that we should not begin with that assumption.

So how do specific acoustic parameters in vocal signals underlie the communicative purposes for which they are deployed? Consider the interactive affordances of the acoustic-startle reflex. Many animal calls consist of loud bursts of acoustic energy with rapid onsets, loudness variation, and nonlinear spectral characteristics that often give the signals a harsh or noisy sound quality. These features serve to get the attention of a target audience, and can effectively interrupt motor activity. The direct effect of this kind of sound on the mammalian nervous system is a function that has been phylogenetically conserved across many taxa. Humans rely on this reflexive principle in vocal behaviors such as infant-directed (ID) speech, crying, pain shrieks, and screams of terror. For instance, in the case of ID speech, prohibitive utterances across cultures contain similar acoustic features – including fast rise times in amplitude, lowered pitch (compared to other ID utterances), and small repertoires (e.g., No! No! No!; Fernald, 1992). These directed vocalizations are often produced in contexts where caretakers want to quickly interrupt a behavior, and must do so without the benefit of grammatical language.

In studies examining the recognition of speaker intent across disparate cultures, subjects are quite able to identify prohibitive intentions of mothers speaking to infants, and other adults as well (Bryant and Barrett, 2007; Bryant et al., 2012). This ability is not a function of understanding the words, but instead due to the acoustic properties of the vocalizations (Cosmides, 1983; Bryant and Barrett, 2008). In the case of ID prohibitives, proximate arousal in senders contributes to the generation of particular kinds of sound features, including rapid amplitude increases and lowered pitch for the authoritative stance. People, including infants, respond in predictable ways to high arousal sounds, such as stopping their motor activity and re-orienting their attention to the sound source. Research with animal trainers also reveals the systematic relationships between specific communicative forms and desired outcomes in animals such as sheep, horses, and dogs (McConnell, 1991). Vocal commands to initiate motor activity in a variety of species typically contain multiple short and repeated broadband calls, while signals intended to inhibit behavior tend to be longer and more tonal. McConnell (1991) also draws an explicit connection to music and cites several older studies from the 1930s showing the above characteristics in music correlating with physiological changes in human listeners. Short repeating rising notes are associated with increased physiological responses such as pulse rate and blood pressure, while longer, slower musical pieces have the opposite effects.

Research has shown that non-human animals respond predictably to musical stimuli, if the music is based on affective calls of their species. Snowdon and Teie (2010) created synthesized musical excerpts that were based on acoustic features of cotton-top tamarin affiliation and threat signals, and they played these compositions, as well as music made for humans, to adult tamarins. Musical stimuli based on threat calls resulted in increased movement, and huddling behavior shortly after exposure. Conversely, the tamarins reacted to affiliation-based music with calming behavior and reduced movement. There was little response to human music, except some reduced movement in response to human threat-based music, suggesting that species-specific characteristics were crucial in eliciting predictable reactions. Because the stimuli did not contain actual tamarin vocalizations, the responses were likely due to structural features of their vocal repertoire, and not merely the result of conditioning. The acoustic structure in the music clearly triggered tamarin perceptual systems designed for perceiving conspecific vocalizations, but importantly, this work demonstrates how acoustic forms can be readily transposed into stimuli we would consider musical, and that it can be affective for non-human listeners. There is some evidence that human music can have effects on non-human animals. Akiyama and Sutoo (2011) found that exposure to recordings of Mozart reduced blood pressure in spontaneously hypertensive rats, and the effect was driven by relatively high frequencies (4–16 kHz), an optimal range for rat hearing sensitivity. The authors proposed that the blood pressure reduction was a result of accelerated calcium-dependent dopamine synthesis. These data again show the importance of species-specific response biases in examinations of the effects of musical stimuli on humans and non-humans alike.

Universal form and function relationships are due to the fact that emotional communication systems in animals are evolutionarily conserved (Owren et al., 2011; Zimmermann et al., 2013), and recent work examining the perception of non-human animal affective vocalizations by humans shows that even when people cannot accurately recognize the affect in an animal vocal expression, brain structures react differentially as a function of the emotional valence in the vocalizations. Belin et al. (2008) found that judges could not reliably judge rhesus monkey or cat vocalizations on a positive–negative scale, but still had varying activation in right ventrolateral orbitofrontal cortex (OFC) in response to the recorded vocalizations. There was also greater overall activation for negative affect in the vocal samples, whether produced by human or non-human animals. Other research shows that experience also matters for how accurately humans judge affect in non-human vocal signals. Trained pig ethologists were more accurate than naïve students at classifying the behavioral context of domestic pig vocalizations, and caretakers also systematically judged intensity features as being lower overall (Tallet et al., 2010). Chartrand et al. (2007) found that bird experts had different brain responses (measured using EEG) to birdsong than naïve listeners, but the difference extended to environmental sounds and voices as well, suggesting that expertise in one domain of auditory processing can affect how people hear sounds in other ways.

#### **SOUND OF AROUSAL**

Excitement in mammals is often characterized by physiological activation that prepares the animal for immediate action. An emotional state characterized by heightened arousal occurs in context-specific ways, but often motivates vocal communication shaped by selection to affect others' behavior in an urgent manner. Animals produce pain shrieks, alarm calls, and urgent contact calls, each demanding particular responses perceptually and behaviorally. Specifically in vocalizations, the physiology of high arousal results in increased activation of upper body musculature (including vocal motor systems and respiration) that can cause increased subglottal air pressure and heightened muscle tension. Consequently, vocal folds can vibrate at their natural limit, generating sound waves that reach their maximum amplitude given particular laryngeal and supralaryngeal structural constraints. This saturating nonlinearity (e.g., deterministic chaos) correlates perceptually with a harsh, noisy sound – a sound that effectively penetrates noisy environments, and is hard for listeners to habituate to. **Figure 1** shows a single coyote (*Canis latrans*) contact call that contains subtle deterministic chaos, subharmonics, and a downward pitch shift.

Nonlinearities can be adaptive features of conspicuous signals that require a quick response or certain attention (Fitch et al., 2002). As is the case with many acoustic features of emotional vocalizations, the sound of arousal in scared or excited animals has been conserved across numerous mammal species (Mende et al., 1990; Blumstein et al., 2008; Blumstein and Récapet, 2009; Zimmermann et al., 2013). Researchers examining how noisy features manifest in particular communicative contexts have found that results are not always predictable (e.g., Slaughter et al., 2013) but responses to noisy vocalizations are typically consistent with the idea that these sounds invoke fear in listeners and prepare them for a quick response. Accurate recognition of high arousal in a vocalizer can provide valuable cues concerning threats in the immediate environment, predicting events such as an imminent attack by a conspecific, or an external danger like the approach of a predator. Signaling behavior can evolve from these cues when senders and receivers mutually benefit from the communicative interaction (Maynard Smith and Harper, 2003), and behavioral features often become ritualized in a co-evolutionary process of production enhancement and perceptual sensitivity (Krebs and Dawkins, 1984).

The sound of arousal example provides a very clear logic for why specific sound features (i.e., forms) are associated with systematic emotional reactions and likely subsequent behavioral responses (i.e., functions). Audio engineers and musicians have exploited the sound of arousal in music, and as a result, instrumentation and performances across a variety of music genres seem well-suited to invoke arousal in listeners including inducing fear, excitement, anger, and exhilaration. For the same reasons people watch horror films, ride roller coasters, or surprise each other for amusement, particular sounds in music are interesting and sometimes exciting.

#### **IS MUSIC SPECIAL?**

A complete explanation of the sound features of music is most likely going to be developed from an adaptationist cognitive science informed by a cultural evolutionary framework. The perception and appeal of music is currently best characterized as the co-occurring activation of a collection of by-product perceptual and judgment processes (McDermott, 2009). Pinker (1997) famously described music as "auditory cheesecake" – the theory to beat when proposing adaptive functions for music. It is clear that many systems designed to solve adaptive auditory problems faced recurrently by mammalian species are triggered by phenomena most people would call music. That is, the melodic and rhythmic properties of "musical" sounds satisfy input conditions in a variety of auditory processing mechanisms. Auditory scene analysis research has examined in great detail many fundamental sound perceptual processes and how they relate to navigating the sonic environment (Bregman, 1990). We can segregate sound streams, locate sound sources, and categorize sounds efficiently – abilities that clearly contribute to our perception of music.

Musical forms affect the full range of human emotions. I will focus on the sound of arousal, which often induces fear, as one good example of how a specific vocal phenomenon can manifest itself in music and be perpetuated culturally. This is not intended to explain other emotional phenomena in music, although I would certainly expect similar principles to apply widely across the emotion spectrum. Theories such as these, however, do not fully explain the appeal of Mozart or Bach, for example. Formal accounts of musical structure have laid out in rich detail the hierarchical patterning in tonal organization (e.g., Lerdahl and Jackendoff, 1983), so a complete account of the nature of music must incorporate connections with other aspects of our cognition beyond emotional vocalizations. Snowdon and Teie (2013) proposed four categories of elements to explain the various factors contributing to music. The first two categories involve the development of auditory perception and sensitivity to vocal emotion information. But in the other two categories they point to elements such as melody, harmony, counterpoint, and syntax that are fundamental to the complexity and beauty in music (see also Patel, 2008).

#### **SPEECH AND MUSIC**

Speech is often cited as an important domain contributing to music perception. Speech communication in people has likely resulted in many refinements of phylogenetically older vocal production and perception abilities shared with many non-human animals (Owren et al., 2011). Models of efficient coding of sound also suggest that any specialized auditory processes for speech could be achieved by integrating auditory filtering strategies shared by all mammalian species (Lewicki, 2002). Human hearing sensitivity, however, appears particularly well-attuned to the frequency range of normal speech (Moore, 2008) just as all vocalizing species' auditory abilities are adapted to conspecific vocalization characteristics. Based on modeling work examining potential filtering strategies of peripheral auditory systems, Lewicki (2002) proposed that the representational coding of speech could be effectively instantiated using schemes specialized for broadband environmental sounds combined with schemes for encoding narrowband (i.e., tonal) animal vocalizations. That is, evolutionarily conserved auditory processes might have constrained speech production mechanisms such that speech sounds fell into frequency and temporal ranges exploiting prelinguistic perceptual sensitivities.

Speech perception is quite robust in normal speakers even in cases where high degradation or interruption is occurring (e.g., Miller and Licklider, 1950), and the temporal rate at which speech can be reliably understood far exceeds the production capability of the most efficient speakers (Goldman-Eisler, 1968). These facts hint at perceptual specialization. But a good deal of our speech processing ability is likely due to auditory abilities widely shared across mammals (Moore, 2008). Cognitive neuroscience research has shown repeatedly that music and speech share brain resources indicating that speech perception systems accept music as input (for recent reviews see Arbib, 2013), though evidence exists for separate processing as well (Zatorre et al., 2002; Peretz and Coltheart, 2003; Schmithorst, 2005). The relationship between speech and music is certainly more than a coincidence. Amplitude peaks in the normalized speech spectrum correspond well to musical intervals of the chromatic scale, and consonance rankings (Schwartz et al., 2003). Many parallels also exist between music and speech development (McMullen and Saffran, 2004).

The physical properties of the sounds are not the only dimensions that link speech and music. The structure of various sound sequences also seems to activate the same underlying cognitive machinery. Research examining rule learning of auditory stimuli demonstrates the close connection between perceiving speech and music. Marcus et al. (2007) found that infants could learn simple rules (e.g., ABA) in consonant–vowel (CV) sequences, and the learning can apply to non-speech stimuli such as musical tones or non-human animal sounds. However, extracting rules from sequences of non-speech stimuli was facilitated by first learning the rules with speech, suggesting that the proper domain (see below) of rule learning in sound sequences is speech, but musical tones and other sounds satisfy the input conditions of the rule learning system once the system is calibrated by spoken syllables. Studies exploring the acquisition of conditional relations between non-adjacent entities in speech or melodic sequences show similar patterns (Creel et al., 2004; Newport and Aslin, 2004).

A good deal of music perception is likely due to the activity of speech processing mechanisms, but perception is only half of the system. We should be concerned with how production and perception systems evolved together. There are clear adaptations in place underlying breathing processes in speech production and laryngeal and articulator control (MacLarnon and Hewitt, 1999). Moreover, we have fine cortical control over pitch, loudness, and spectral dynamics (Levelt, 1989). These production systems, as a rule of animal signaling, must have complementary adaptive response patterns in listeners. Many perceptual biases were in place before articulated speech evolved, such as the categorical perception of continuous sounds (Kuhl, 1987). But other response biases might be new, such as sensitivity to the coordinated isochronic (i.e., steady, pulse-based repetition) rhythms produced by multiple conspecifics. Sperber (1994) made a distinction between the proper domain of a mechanism and its actual domain. Proper domain refers to those specific features that allow a system to solve an adaptive problem. Depending on the nature of the dynamics (i.e., costs and benefits) of the adaptation, systems will vary in how flexible the input conditions are to respond to a stimulus. The actual domain of a system is the range of physical variation in stimuli that will result in a triggering of that mechanism, something that is often a function of context and the evolutionary history of the cognitive trait. In these terms, the actual domain of speech processors presumably includes most music.

Domain specificity in auditory processing can illuminate the nature of people's preferences for certain sounds, including why certain musical phenomena are so interesting to listeners. But how these preferences manifest themselves as social phenomena remains to be explained. One possibility is that cultural evolutionary processes act on those sound characteristics that people are motivated to produce and hear. For example, rhythmic sound that triggers spatial localization mechanisms could be preferred by listeners, and consequently be subject to positive cultural selection resulting in the feature spreading through musical communities. Other examples include singing patterns that exaggerate the sound of affective voices, or frequency and amplitude modulations that activate systems designed to detect speech sounds. The question becomes, of course, is any sound pattern unique to music?

#### **CULTURAL TRANSMISSION OF MUSICAL FEATURES**

Researchers are starting to explore how listeners' specific sound preferences can lead to the evolution of higher order structure that can constitute eventual musical forms. MacCallum et al. (2012) created a music engine that generates brief clips of sounds that
were judged by listeners – clips that started out quite non-musical. Passages that were preferred in forced-choice trials "reproduced," that is, were recombined with other preferred passages. This evolutionary process resulted in several higher order structures manifesting as unquestionably musical attributes. For instance, an isochronic beat emerged. Understanding perceptual sensitivities (i.e., solutions to auditory processing adaptive problems) that are relevant in music listening contexts will help explain preference patterns, and evolutionary cultural processes can provide a framework for understanding the proliferation of these sensitivities (Merker, 2006; Claidière et al., 2012). The sound of fear represents one dimension of auditory processing relevant for music which is in place because of conserved signaling incorporating arousal. As a consequence, people are interested in sounds associated with high arousal, and cultural transmission processes perpetuate them.
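The selection-and-recombination loop described above can be sketched in a few lines of Python. This is only an illustration, not the engine of MacCallum et al. (2012): the clip representation (a list of inter-onset intervals), the `irregularity` score standing in for listener preference, and all parameter values are assumptions made here. Because the stand-in preference rewards steady timing, the sketch can only show how pairwise choices plus recombination spread a preferred feature through a population; it says nothing about what real listeners prefer.

```python
import random

def irregularity(clip):
    """Spread of inter-onset intervals; 0 means a perfectly steady beat."""
    mean = sum(clip) / len(clip)
    return sum(abs(x - mean) for x in clip) / len(clip)

def recombine(a, b, rng):
    """One-point crossover of two parent clips plus one small mutation."""
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]
    i = rng.randrange(len(child))
    child[i] = max(0.05, child[i] + rng.gauss(0.0, 0.02))
    return child

def evolve(generations=200, pop_size=40, clip_len=8, seed=1):
    """Return mean irregularity before and after simulated selection."""
    rng = random.Random(seed)
    # Start from random, non-musical timing patterns (hypothetical clips).
    pop = [[rng.uniform(0.1, 1.0) for _ in range(clip_len)]
           for _ in range(pop_size)]
    start = sum(map(irregularity, pop)) / pop_size
    for _ in range(generations):
        # Forced-choice trial: pair clips at random, keep the preferred one.
        rng.shuffle(pop)
        winners = [min(pop[i], pop[i + 1], key=irregularity)
                   for i in range(0, pop_size, 2)]
        # Preferred clips "reproduce" by recombining with other winners.
        children = [recombine(rng.choice(winners), rng.choice(winners), rng)
                    for _ in range(pop_size - len(winners))]
        pop = winners + children
    end = sum(map(irregularity, pop)) / pop_size
    return start, end

start, end = evolve()
```

Under these assumed settings, `end` is expected to be lower than `start`, meaning the simulated population drifts toward steadier timing as the preferred clips recombine.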

Consider the form and function of punk rock in western culture. The relevant cultural phenomena for a complete description of any genre of music are highly complex, and not well understood. But we can clearly recognize some basic relationships between the sonic nature of certain genres of music and their behavioral associations in their listeners. As in much music across cultures, there is a strong connection between music production and movement in listeners, epitomized by dancing, resulting in a cross-cultural convergence on isochronic beats in music traditions. The tight relationship between musical rhythm perception and associated body movement is apparent in babies as young as seven months (Phillips-Silver and Trainor, 2005). Punk rock is no exception. Early punk is characterized by a return to fundamentals in rock music (Lentini, 2003). It began as a reaction to a variety of cultural factors, and the perceived excesses of ornate progressive music in general. The initial creative ethos was that anybody can do it, and it was more of an expression of attitude than the making of cultural artifacts. In short, it was intense (and sometimes aggressive) in many ways, and whatever one's interpretation of the cultural underpinnings, the energy is apparent. The music is characterized by fast steady rhythms, overall high amplitude, and noisy sound features in all instruments – attributes that facilitate forceful dancing. But the distortion noise is especially distinct and key for the genre. Of course, many genres of rock use noise – the punk example is just preferred here for many cultural and explanatory reasons, but the same principle applies to many variations of blues and rock music.

Noisy features in rock took on a life of their own in the No Wave, post punk, and experimental movements of the 1980s and beyond (e.g., O'Meara, 2013). In rock music, what originally likely arose as a by-product of amplification (i.e., attempting to be loud along with an intense style of playing) soon became conventionalized in ways that are analogous to ritualization in the evolution of animal signals (Krebs and Dawkins, 1984). Particular manifestations of noisy features (forms) were directly related to compositional and performance goals of musicians (functions). Products were developed that harnessed particular kinds of distortion in devices (e.g., effects pedals) that modified the signal path between an instrument and the amplifier. This allowed artists to achieve the desired distortion sounds without having to push amplifiers beyond their natural limit. The use of noise quickly became a focus of a whole family of musical styles, most being avant-garde and experimental. Continuing the trend of rejecting aspects of dominant cultural practices, artists could signal their innovation and uniqueness by using this new feature of music in ways that set them apart. The sound affordances of broadband noise provide a powerful means for artists to generate cultural attractors fueled by discontent with mass market music. Moreover, the creative use of distortion and other effects can result in spectrally rich and textured sounds. Cultural evolutionary forces will tap into any feature that allows socially motivated agents to differentially sort based on esthetic phenomena (Sperber, 1996; McElreath et al., 2003). Simple sound quality dimensions like intensity might be excellent predictors of how people are drawn to some genres and not others (Rentfrow et al., 2011). Listeners also often find moderate incongruities (as opposed to great disparities) between established forms and newer variations the most interesting (Mandler, 1982).
For example, modern noise rock with extreme distortion that is quite popular today would likely have been considered much more unlistenable in 1960 because it is such a dramatic departure from the accepted sounds for music at the time. But today it is only slightly noisier than its recent predecessors. What gets liked depends on what is liked.

#### **DISTORTION, AROUSAL, AND MUSIC**

Distortion effects in contemporary music mimic in important ways the nonlinear characteristics we see in highly aroused animal signals, including human voices. Electronic amplification, including the development of electromagnetic pickups in guitars, was arguably the most important technological innovation that led to the cultural evolution of rock music, and the situation afforded an incredible palette of sound-making that is ongoing well over half a century later (Poss, 1998). Just as an animal's vocal system can be "overblown," so can the physical hardware of amplification systems. Early garage rock music, the precursor to punk rock, was likely the first genre to systematically use this overblown amplification effect on purpose. Specific manipulations of electronic signal pathways were developed that allowed musicians to emulate in music what is an honest feature of a vocalization: high arousal. A basic distortion pedal works as follows. The first process is typically an amplitude gain accompanied by a low-pass filter, pushing the signal toward a saturation point where nonlinear alterations will occur. This saturating nonlinearity is filtered again, resulting in output that becomes a multi-band-passed nonlinearity. **Figure 2** shows the effect of a wave shaping function on a 4 s recording of an acoustic guitar and **Figure 3** shows a 78 ms close-up segment of several cycles of the complex waveform in both unaltered and distorted treatments. Yeh et al. (2008) have used ordinary differential equations (ODEs) to digitally model this analog function, suggesting that analog distortion used by musicians closely approximates noisy features in vocalization systems that are also well described by the same mathematics. **Figure 1** shows the spectrogram of a coyote vocalization with subtle nonlinear phenomena that appear quite similar to broadband noises generated by ODEs.
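The signal chain described above (gain and low-pass filtering, a saturating nonlinearity, then a second filter) can be illustrated with a minimal Python sketch. It is not the ODE circuit model of Yeh et al. (2008): the `tanh` wave shaper, the one-pole filters, and the values of `gain` and `alpha` are assumptions chosen for brevity. The sketch feeds a pure 200 Hz tone through the chain and measures energy at the third harmonic (600 Hz), which is absent from the clean tone and appears once the signal saturates.

```python
import math

SAMPLE_RATE = 8000  # Hz; an arbitrary rate chosen for this sketch

def one_pole_lowpass(signal, alpha):
    """Smooth a signal with a one-pole low-pass filter (0 < alpha <= 1)."""
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out

def distort(signal, gain=20.0, alpha=0.5):
    """Gain plus low-pass, then tanh saturation, then a second low-pass."""
    pre = one_pole_lowpass([gain * x for x in signal], alpha)
    saturated = [math.tanh(x) for x in pre]  # assumed wave shaping function
    return one_pole_lowpass(saturated, alpha)

def sine(freq_hz, duration_s):
    """Generate a unit-amplitude sine tone."""
    n = int(duration_s * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def amplitude_at(signal, freq_hz):
    """Amplitude of one frequency component (a single DFT bin)."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(s * math.cos(w * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

clean = sine(200, 0.5)
dirty = distort(clean)
third_clean = amplitude_at(clean, 600)  # essentially zero for a pure tone
third_dirty = amplitude_at(dirty, 600)  # harmonic energy added by saturation
```

A pure tone is used here only because it makes the added harmonics easy to isolate; a recorded guitar signal already contains many partials, so saturation yields a much denser and noisier spectrum.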

Recently, we produced musical stimuli to examine the role of noise in emotional perceptions of music, and used digital
models created for musicians as our noisy source (Blumstein et al., 2012). Twelve 10 s compositions were created that were then manipulated into three different versions: one with added musical distortion noise, one with a rapid frequency shift in the music, and one unaltered control. The manipulations were added at the halfway point in the pieces. These stimuli were played to listeners and they were asked to rate them for arousal and valence. We expected that distortion effects approximating deterministic chaos would cause higher ratings of arousal, and negative valence judgments – the two-dimensional description of vocalized fear (Laukka et al., 2005). This is precisely what we found. Subjects also judged rapid pitch shifts up as arousing, but not pitch shifts down. Downward pitch shifts were judged as more negatively valenced, which is what we should expect given the acoustic correlates of sadness in voices (Scherer, 1986). Surprisingly, previous work had not explored the role of distortion in affective judgments of music, but an animal model of auditory sensitivity afforded a clear prediction which was confirmed.

We were interested in how these effects occurred in the context of film. Previous work had found that horror soundtracks contained nonlinearities at a much higher rate than other film genres (Blumstein et al., 2010). Film soundtrack composers were exploiting people's sensitivity to noisy features in their efforts to scare or otherwise excite their viewers. Of course, for the most part the direct connection is not consciously made between the ecology of fear screams in animals and the induction of fear in a human audience. But composers and music listeners have an intuitive sense of what sounds are associated with what emotions, and this intuition is rooted in our implicit understanding of form and function in nature – a principle that is strongly reinforced by cultural processes bringing these sounds to us repeatedly generation after generation.

But would sound features alone be sufficient to invoke fear even in the context of an emotionally benign film sequence? We created simple 10-s videos of people engaged in emotionally neutral actions, such as reading a paper, or drinking a cup of coffee. The videos were edited so that the key "action" happened at the exact midpoint, the same time that our nonlinear features in the music clips occurred. Subjects viewed these videos paired with the same music as described above, and we found something interesting. Judgments of arousal were no longer affected by the nonlinear features in the music clips when viewed in the context of a benign action, but the negative valence remained. Clearly, decision processes used in judgments of affect in multimodal stimuli will integrate these perceptual dimensions. One obvious possibility for our result is that the visual information essentially trumped the auditory information when assessing urgency, but the emotional quality of a situation was still shaped by what people heard. Future research should explore how consistent fearful information is processed, and we should expect that auditory nonlinearities will enhance a fear effect as evidenced by the successful pairing of scary sounds and sights in movies. Currently, we are examining psychophysiological responses to nonlinearities, with the expectation that even when judges do not explicitly report greater arousal while hearing nonlinear musical features in certain contexts, there will be measurable autonomic reactions, similar to how brain (OFC) responses to non-human animal voices do not correspond to people's judgments (Belin et al., 2008).

As mentioned earlier, nonlinear characteristics in music represent one dimension in sound processing that plays a role in music perception and enjoyment. Our sensitivity to such features is rooted in a highly conserved mammalian vocal signaling system. I argue that much of what makes music enjoyable can be explained similarly. But one aspect of music that is not well explained as a by-product is the conspicuous feature that it is often performed by groups – coordinated action of multiple individuals sharing a common cultural history, generating synchronized sounds in a context of ritualized group activity.

#### **MUSIC AS COALITION SIGNALING**

Humans are animals – animals with culture, language, and a particular set of cognitive adaptations designed to interface with a complex social network of sophisticated conspecifics. Pinker (2010) called this the "cognitive niche" taking after ideas earlier proposed by Tooby and DeVore (1987). Information networks and social ecologies have co-evolved with information processors, and thus, a form–fit relationship exists between the cognitive processes in the human mind and the culturally evolved environments for social information. Humans cooperate extensively – in an extreme way when viewed zoologically – and we have many reliably developing cognitive mechanisms designed to solve problems associated with elaborate social knowledge (Barrett et al., 2010). Because many of the adaptive problems associated with extreme sociality involve communicating intentions to cooperate as well as recognizing cues of potential defection in conspecifics, we should expect a variety of abilities that facilitate effective signaling between cooperative agents.

Many species, ranging from primates, to birds, to canines, engage in coordinated signaling. By chorusing together, groups can generate a signal that honestly communicates their numbers, and many other properties of their health and stature. Chorusing sometimes involves the ability to rhythmically coordinate signal production. When two signaling systems synchronize their periodic output (i.e., enter a phase relationship), it can be described as entrainment – an ability that is phylogenetically old, and evolutionarily widespread (Schachner et al., 2009; Phillips-Silver et al., 2010). Fitch (2012) described the paradox of rhythm, which is the puzzle of why periodic phenomena are so ubiquitous in nature, but overt rhythmic ability in animals is so exceedingly rare. The answer, Fitch argued, lies in how we conceptualize rhythm in the first place. When we consider the component abilities that contribute to our capacity for rhythmic entrainment, the complexity in the neurocomputational underpinnings makes the capacity much less paradoxical, and instead understandably rare.
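Entrainment in this sense can be illustrated with two coupled phase oscillators, a standard simplification from the study of synchronization rather than a model taken from the work cited above. In the Python sketch below, the function name, the rates, and the coupling strength are all hypothetical. Each oscillator nudges its phase toward the other; when the coupling is strong enough relative to the difference in their natural rates, the phase difference stops drifting and the two settle into a fixed phase relationship.

```python
import math

def phase_drift(rate_a_hz, rate_b_hz, coupling, seconds=20.0, dt=0.001):
    """Simulate two mutually coupled phase oscillators with Euler steps.

    Returns how much the phase difference (in radians) changed during
    the final second of the simulation; a value near 0 means phase locking.
    """
    wa = 2 * math.pi * rate_a_hz
    wb = 2 * math.pi * rate_b_hz
    a, b = 0.0, 1.0  # arbitrary starting phases
    steps = int(seconds / dt)
    window = int(1.0 / dt)
    diff_at_window_start = 0.0
    for step in range(steps):
        if step == steps - window:
            diff_at_window_start = b - a
        # Each oscillator advances at its own rate, adjusted toward the other.
        da = wa + coupling * math.sin(b - a)
        db = wb + coupling * math.sin(a - b)
        a, b = a + da * dt, b + db * dt
    return abs((b - a) - diff_at_window_start)

# Two signalers with slightly different natural rates (2.0 vs. 2.2 Hz).
uncoupled = phase_drift(2.0, 2.2, coupling=0.0)  # phases keep drifting apart
coupled = phase_drift(2.0, 2.2, coupling=2.0)    # phases lock together
```

The sketch captures only the mutual adjustment of timing. It leaves out the beat detection and sensorimotor integration that make overt rhythmic entrainment demanding for real animals.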

The basic ability to coordinate behavior with an external stimulus requires at a minimum three capabilities: detecting rhythmic signals, generating rhythms through motor action, and integrating sensory information with motor output (Phillips-Silver et al., 2010; Fitch, 2012). Phillips-Silver et al. (2010) described the ecology of entrainment, and the assortment of its manifestations in nature. While many species have variations of these abilities, only humans seem to have a prepared learning system designed to govern coordinated action of a rhythmic nature. The ability to entrain with others develops early, and is greatly facilitated by interactions with other social agents, but not mechanized rhythmic producers, or auditory stimuli alone (Kirschner and Tomasello, 2009). Young infants reliably develop beat induction quite early (Winkler et al., 2009) and have also been shown to engage rhythmically with music stimuli without the participation of social agents, which is associated with positive affect (Zentner and Eerola, 2010). Most rhythmic ability demonstrated by human infants has never been replicated in any other adult primate. Even with explicit training, a grown chimpanzee cannot entrain their rhythmic production with another agent, let alone another chimpanzee. African apes, including chimps and gorillas, will drum alone, and this behavior is likely to be homologous with human drumming (Fitch, 2006), suggesting that coordinated (as opposed to solo) rhythmic production evolved after the split with the last common ancestor. So what is it about the hominin line that allowed for our unique evolutionary trajectory in the domain of coordinated action?

There are other species that have the ability to entrain their behavior to rhythmic stimuli and other agents. Birds that engage in vocal mimicry, such as the sulfur-crested cockatoo (*Cacatua galerita*) have been shown to be capable of highly coordinated responses to music and rhythmic images, and will even attempt to ignore behaviors around them produced by agents who are not in synch with the stimulus to which they are coordinated (Patel et al., 2009). African gray parrots (*Psittacus erithacus*) also have this ability (Schachner et al., 2009). Recently, Cook et al. (2013) found motor entrainment in a California sea lion (*Zalophus californianus*), an animal that does not have vocal mimicry skills, suggesting that the ability either does not require vocal mimicry mechanisms, or the behavior can emerge through multiple motor control pathways. Fitch (2012) pointed out that examining these analogous behaviors can quite possibly elucidate human adaptations for entrainment, but he did not address the larger question of why humans might possess entrainment abilities uniquely across all terrestrial mammals.

Hagen and Bryant (2003) proposed that music and dance constitute a coalition signaling system. Signals of coalition strength might have evolved from territorial displays seen in other primates, including chimpanzees (Hagen and Hammerstein, 2009). The ideal signal of coalition quality should be easily and rapidly decoded by a target audience, and only plausibly generated by stable coalitions able to engage in complex, coordinated action. A coordinated performance affords an opportunity to signal honest information about time investments with fellow performers, individual skills related to practice time investment, and creative ability indicating cognitive competence. In short, individuals can signal about themselves (which could be subject to sexual selection), and the group can signal about their quality as well. To test these ideas, original music was recorded, and versions were made that contained different kinds of performance errors (Hagen and Bryant, 2003). As expected, the composition with introduced errors that disrupted the synchrony between the performers was judged by listeners as lower in music quality. We also asked the listeners to judge the relationships between the performers, including questions about how long they have known each other, and whether they liked each other. Listeners' judgments of the coalition quality between the performers were a function of the music quality judgments – the lower they rated the music quality, the worse coalition they perceived between the musicians.

The ethnographic record clearly reveals the importance of music and dance displays to traditional societies throughout history (Hagen and Bryant, 2003). Initial meetings where groups introduce one another to their cultures, including these coordinated displays, can have crucial adaptive significance in the context of cooperation and conflict. The potential for selection on such display behaviors is clear, as is the important interface with cultural evolutionary processes (McElreath et al., 2003). Cultural traditions that underlie the nature of specific coordinated displays are revealed in contemporary manifestations of the role of music in social identity and early markers of friendship preferences and alliances (Mark, 1998; Giles et al., 2009; Boer et al., 2011). Mark (1998) proposed an ecological theory of music preference suggesting that music can act as a proxy for making judgments about social similarity. According to the theory, musical preferences spread through social network ties unified by principles of social similarity and history. Investment of time in one preference necessarily imposes time constraints on other preferences. Developing a strong esthetic preference, therefore, can honestly signal one's social affiliation.

Music can also function to increase coalition strength within groups (McNeill, 1995) and this effect has been documented in children. Kirschner and Tomasello (2010) had pairs of 4-year-old children partake in one of two matched play activities that differed only in the participation of a song and dance. The musical condition involved singing to a prerecorded song (with a periodic pulse) while striking a wooden toy with a stick, and walking to the time. The non-musical condition involved only walking together in a similar manner with non-synchronized utterances. Pairs of children who participated together in the musical condition spontaneously helped their partner more in a set-up scenario immediately after the play activity where one needed assistance, and they engaged in more joint problem solving in that set-up as well. Our proximate experiences of pleasure in engaging with other social agents in musical activity might serve to bolster within-group relationships, and provide a motivating force for generating a robust signal of intragroup solidarity that can be detected by out-group members.

Patterns of cultural transmission occur through different channels. Many cultural traits get passed not only vertically from older members of a culture to their offspring, but also horizontally across peers. For instance, children typically will adopt the dialect and accent of their same-aged peers rather than their parents (Chambers, 2002), illustrating how language learning and communicative-pragmatic mechanisms are quite sensitive to the source of their input. Similarly, peers should be an important source of musical taste development if that esthetic is important for social assortment (Selfhout et al., 2009). Variations of forms in any cultural domain will typically cluster around particular attractors, but the nature of the attraction depends on the type of artifact. For instance, artifacts such as tools that have some specific functional use will be selected based largely (though not completely) on physical affordances (e.g., hammers have the properties they have because they have undergone selection for effectiveness in some task), whereas esthetic artifacts tap into perceptual sensitivities that evolved for reasons other than enjoying or using the artifacts. For example, people prefer landscape portrayals with water over those without water because of evolved foraging psychology (Orians and Heerwagen, 1992). As described earlier, music exploits many auditory mechanisms that were designed for adaptive auditory problems like speech processing, sound source localization, or vocal emotion signaling. Physical characteristics of musical artifacts that appealed to people's perceptual machinery were attractive, and as a result, the motivation to reproduce and experience these sounds repeatedly provides the groundwork for cultural selection.

Many proposals exist describing potential factors that might contribute to the spreading of any kind of cultural product, and theorists debate about the nature of the representations (including whether they need to be conceived as representations at all) and what particular dynamics are most important for the successful transmission of various cultural phenomena (Henrich and Boyd, 2002; McElreath et al., 2003; Claidière and Sperber, 2007). In the case of music, some aspects seem relatively uncontroversial. For example, the status of an individual composer or a group of individual music makers likely plays an important role in whether musical ideas get perpetuated. A coordinated display by the most prestigious and influential members of a group was likely to be an important factor in whether the musical innovations by these people were learned and perpetuated by the next generation. Subsequent transmission can be facilitated by conformity-based processes. A combination of factors related to the physical properties of the music, the social intentions and status of the producers, and the social network dynamics of the group at large will all interact in the cultural evolution of musical artifacts. McElreath et al. (2003) showed formally that group marking (which in an ancestral environment could quite plausibly have included knowledge of specific musical traditions) can culturally evolve and stabilize if participants preferentially interact in a cooperative way with others who are marked like them, and they acquire the markers (e.g., musical behaviors) of successful individuals. By this formulation, acquired arbitrary musical markers can honestly signal one's past cooperative behavior beyond the investment to develop the marker, and potentially provide that information to outside observers.

#### **EMOTIONS AND MUSIC IN GROUPS**

There are many possible evolutionary paths for the perpetuation of musical forms, and even the propensity for musical ability in the first place (e.g., Miranda et al., 2003). But how does emotion play into the process? Little research has explored directly the affective impact of group performances aside from the evocative nature of the music itself. The feelings associated with experiencing coordinated action between groups of people might not fit into a traditional categorical view of emotions, and instead may be better categorized as something like profundity or awe (Davies, 2002; Keltner and Haidt, 2003). According to the coalition signaling perspective, elaborate coordinated performances are an honest signal that is causally linked to the group of signalers. This view does not require any specific affective component, at least not in the traditional approach of studies on emotion and music. The affect inducing qualities of music facilitate its function in that the generated product is inherently interesting to listeners and relevant to the context-specific emotional intentions of the participants. The surface features of the signals satisfy input conditions of a variety of perceptual systems (i.e., they act proximately), and cultural processes perpetuate these characteristics because coordinated displays that embody esthetically attractive features do better than alternatives. But the ultimate explanation addresses how coordinated displays provide valuable information about the group producing them. A form–function approach again can illuminate the nature of the signaling system and how it operates. Musical features such as predictable isochronic beats and fixed pitches facilitate the coordinated production of multiple individuals and afford a platform for inducing intended affect in listeners. Our perceptual sensitivity to rhythm and pitch, also important for human speech and other auditory adaptations, allows listeners to make fine grained judgments about relationships between performers. We can tell if people have practiced, whether they have skill that requires time, talent, and effort, and whether they have spent time with the other performers.

Hatfield et al. (1994) developed the idea of emotional contagion as an automatic and unconscious tendency of people to align behaviorally as a means to transfer affect across multiple individuals. Contagion effects in groups are likely connected to a variety of non-human animal behaviors. Several primate species seem to experience some version of contagious affect, including quite notably the pant hoots of chimpanzees that could be phylogenetically related to music behavior in humans (Fritz and Koelsch, 2013). While rhythmic entrainment is zoologically rare, other acoustic features can be coordinated in non-human animal signals, a phenomenon Brown (2007) calls contagious heterophony, which he believes played a crucial role in the evolution of human music. In the case of people, Spoor and Kelly (2004) proposed that emotions experienced by groups might assist in communicating affect between group members and help build social bonds. Recent work shows that the transmission of emotion across crowds can act like an unconscious cascade (Dezecache et al., 2013), so the utility of a unifying source of affect (e.g., music) is clear. While all of these ideas are likely to be part of the human music puzzle, scholars have neglected to develop the idea of how coordinated musical action might constitute a collective signal to people outside of the action. Many of the claimed benefits of coordinated action, such as increased social cohesion and alignment of affect, might be proximate mechanisms serving ultimate communicative functions. As is common in the social sciences, proximate mechanisms are often treated as ultimate functions, or function is not considered at all.

Evidence is mounting that affect is not necessarily tied to synchronous movement or the benefits associated with it. A variety of studies have shown that positive affect is not needed for successful coordination, and that explicit instruction to coordinate action can result in cooperative interactions without any associated positive emotions being experienced by participants (e.g., Wiltermuth and Heath, 2009). Recent research has demonstrated that strangers playing a prisoner's dilemma (PD) economic game after a brief conversation were more likely to cooperate with one another as a function of how much they converged in their speech rate (Manson et al., 2013), and this effect occurred independent of positive emotions between conversationalists. Language style matching was also not related to cooperative moves in the PD game, suggesting that coordinated action can impact future interaction behavior without mediating emotions or behavior matching lacking temporal structure.

The role of emotions in group musical performances is not clear, but what is intuitively obvious is that the experience of a group performance is often associated with feelings of exhilaration, and a whole range of emotions. But such emotional experiences are necessarily tied up in the complexities of the social interaction, and the cultural evolutionary phenomena that contribute to the transmission of the musical behavior. Researchers should examine more closely how specific emotions are conjured during group performances: in players, dancers, and audience members alike. Moreover, how much of the impact of the emotional experience is due to the particular structural features of the music, independent of the coordinated behavioral components? In players and listeners, the psychological concept of the "groove" is related to easily achievable sensorimotor coupling and an associated positive emotional experience (Janata et al., 2012), which is consistent with notions of "flow" that underlie a broad range of individual and coordinated behaviors (Csikszentmihalyi, 1990). Flow can be thought of as an experiential pleasure that is derived from certain moderately difficult activities, and it can facilitate the continued motivation to engage in those activities. One study examined flow in piano players, and found that several physiological variables such as blood pressure, facial muscle movements, and heart rate measures were positively correlated with self-reported flow experiences (de Manzano et al., 2010). The psychological constructs of the groove and flow speak to both the motivational mechanisms underlying music, and the high degree of processing that many musical and non-musical phenomena share. In many cultures, the concept of music as separate from the social contexts and rituals in which it manifests is non-existent (Fritz and Koelsch, 2013). The western perspective has potentially isolated music as a phenomenon that is often divorced from the broader repertoire of behaviors in which it typically occurs, and this situation might have important consequences for understanding it as an evolved behavior (McDermott, 2009).

#### **CONCLUSION**

Music moves us – emotionally and physically. The physical characteristics of music are often responsible, such as the wailing sound of a guitar that is reminiscent of a human emotional voice, or the solid beat that unconsciously causes us to tap our foot. The reasons music has these effects are related in important ways to the information-processing mechanisms it engages, most of which did not evolve for the purposes of listening to music. Music sounds like voices, or approaching objects, or the sounds of animals. Cognitive processes of attraction, and cultural transmission mechanisms, have cumulatively shaped an enormous variety of genres and innovations that help people define themselves socially. Music is an inherently social phenomenon, a fact often lost on scientists studying its structure and effects. The social nature of music and the complex cultural processes that have led to its important role in most human lives strongly suggest an evolutionary function: signaling social relationships. Evidence of adaptive design is there: people are especially susceptible to the isochronic beats so common across cultures, we are particularly skilled like no other animal in coordinating our action with others in a rhythmic way, and the ability develops early and reliably across cultures. Group performances in music and dance are universal across all known cultures, and they are usually inextricably tied to central cultural traditions.

Several predictions emerge from this theoretical perspective. For example, if listeners are attuned to the effects of practice on well-coordinated musical displays as a proxy for time investment and group solidarity, then manipulations of practice time between a set of musicians should affect subjects' judgments on a variety of perceptual measures, including measures that do not explicitly ask about the musical performance. Subjects should be able to readily judge coalition quality through music and dance production (Hagen and Bryant, 2003). High resolution analyses of synchrony between performers should be closely associated with listeners' assessments of social coordination, and this association should be independent of the assessment of any individual performer's skills. Researchers need to closely examine the developmental trajectory of entrainment abilities and begin to explore children's ability to infer social relationships based on coordinated displays. Kirschner and Tomasello (2009, 2010) have begun work in this area that I believe will prove to be quite fruitful in understanding the nature of group-level social signaling.

The current approach also makes predictions about the culturally evolved sound of music. We should expect musical elements to exploit pre-existing sensory biases, including sensitivity to prosodic signals conveying vocal emotion in humans and nonhuman animals (Juslin and Laukka, 2003; Blumstein et al., 2012) and sound patterns that facilitate auditory streaming (Bregman, 1990), for example. These characteristics should be stable properties of otherwise variable musical traditions across cultures, and persistent across cultural evolutionary time. One obvious case described earlier is the perpetuation of electronically generated nonlinearities across a broad range of musical styles today that can be traced back to fairly recent technological innovations. In a matter of a few decades, most popular music now includes nonlinear features of one sort or another that only experimental avant-garde music used before. Indeed, sound features present in the vocal emotions of mammalian species are reflected in the most sophisticated instrumentation of modern classical and jazz. Following Snowdon and Teie (2010, 2013) we should also expect to find predictable responses in many non-human animals to musical creations based on the structural features of their emotional vocal signals. The question of why humans have evolved musical behavior, and other social animals have not, can only be answered by understanding the nature of culture itself – no small task.

Comparative analyses provide crucial insights into evolutionary explanations for any behavioral trait in a given species. In the case of human music, there is clear uniqueness, but we recognize traits common across many species that play into the complex behavior (Fitch, 2006). Convergent evolutionary processes lead to structural similarities across diverse taxa, such as the relationships between birdsong and human music (e.g., Marler, 2000; Rothenberg et al., 2013), and while there are possible limitations in what we can learn from such analogies (McDermott and Hauser, 2005), there is certainly value in exploring the possibilities. Many animals signal in unison, or at least simultaneously, for a variety of reasons related to territorial behavior, and mating. These kinds of behaviors might be the most important ones to examine in our effort to identify any adaptive function of human musical activity, as the structural forms and typical manifestations of human music seem particularly well-suited for effective and efficient communication between groups. This is especially interesting considering the fact that music often co-occurs with many other coordinated behaviors such as dancing, and themes in artifacts like clothing and food. Music should be viewed as one component among many across cultures that allows groups to effectively signal their social identity in the service of large scale cooperation and alliance building. The beautiful complexity that emerges stands as a testament to the power of biological and cultural evolution.

#### **REFERENCES**

Davies, S. (2002). Profundity in instrumental music. *Br. J. Aesthet.* 42, 343–346. doi: 10.1093/bjaesthetics/42.4.343

**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 July 2013; accepted: 11 December 2013; published online: 25 December 2013.*

*Citation: Bryant GA (2013) Animal signals and emotion in music: coordinating affect across groups. Front. Psychol. 4:990. doi: 10.3389/fpsyg.2013.00990*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Bryant. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Speech vs. singing: infants choose happier sounds

#### *Marieve Corbeil<sup>1</sup>\*, Sandra E. Trehub<sup>1,2</sup> and Isabelle Peretz<sup>1</sup>\**

*<sup>1</sup> International Laboratory for Brain, Music and Sound Research, Department of Psychology, Université de Montréal, Montréal, QC, Canada*

*<sup>2</sup> Music Development Laboratory, Department of Psychology, University of Toronto Mississauga, Mississauga, ON, Canada*

#### *Edited by:*

*Anjali Bhatara, Université Paris Descartes, France*

#### *Reviewed by:*

*Christine Tsang, Huron University College at Western, Canada*

*Carolyn Quam, University of Arizona, USA*

#### *\*Correspondence:*

*Marieve Corbeil and Isabelle Peretz, International Laboratory for Brain, Music and Sound Research, Department of Psychology, Université de Montréal, 1430 Mont Royal boul., Montreal, QC, H2V 4P3, Canada e-mail: marieve.corbeil@umontreal.ca; isabelle.peretz@umontreal.ca*

Infants prefer speech to non-vocal sounds and to non-human vocalizations, and they prefer happy-sounding speech to neutral speech. They also exhibit an interest in singing, but there is little knowledge of their relative interest in speech and singing. The present study explored infants' attention to unfamiliar audio samples of speech and singing. In Experiment 1, infants 4–13 months of age were exposed to happy-sounding infant-directed speech vs. hummed lullabies by the same woman. They listened significantly longer to the speech, which had considerably greater acoustic variability and expressiveness, than to the lullabies. In Experiment 2, infants of comparable age who heard the lyrics of a Turkish children's song spoken vs. sung in a joyful/happy manner did not exhibit differential listening. Infants in Experiment 3 heard the happily sung lyrics of the Turkish children's song vs. a version that was spoken in an adult-directed or affectively neutral manner. They listened significantly longer to the sung version. Overall, happy voice quality rather than vocal mode (speech or singing) was the principal contributor to infant attention, regardless of age.

**Keywords: infants, music, language, singing, speech, emotion, attention**

#### **INTRODUCTION**

There is considerable debate about similarities and differences in the processing of language and music (e.g., Pinker, 1997; Patel, 2008; Jackendoff, 2009; Peretz, 2009). Because the greatest differences arise from the presence of propositional meaning in language but not in music, comparisons in the early pre-verbal period are of particular interest (Trehub et al., 1993; Chen-Hafteck, 1997; McMullen and Saffran, 2004; Brandt et al., 2012), notably when both modes of parental communication are used to regulate infant attention and affect (Fernald, 1992; Papoušek, 1994; Kitamura and Burnham, 2003; Trehub et al., 2010). To date, however, the only study comparing young infants' behavioral responsiveness to speech and singing (Nakata and Trehub, 2004) used audiovisual stimuli, obscuring the relative contributions of auditory and visual expressiveness to infants' greater engagement with maternal music. Another study found no difference in newborns' neural responses to happy-sounding speech and singing (Sambeth et al., 2008). The present investigation examined infants' attentiveness to speech and singing on the basis of auditory cues alone.

Whereas verbal aspects of speech convey propositional meaning, non-verbal or prosodic aspects such as intonation and rhythm convey the speaker's affective intent and emotional state (Frick, 1985). Mothers across cultures speak and sing to their pre-verbal infants in the course of providing care (Fernald, 1992; Trehub and Trainor, 1998; Dissanayake, 2000; Trehub, 2000). Their manner of speaking or singing to infants (infant-directed or ID) differs dramatically from their manner in other contexts (adult-directed or AD; self-directed or non-ID) (Ferguson, 1964; Jacobson et al., 1983; Fernald and Simon, 1984; Trainor et al., 1997; Trehub et al., 1997a,b), with notable variations across cultures (Grieser and Kuhl, 1988; Fernald et al., 1989; Kitamura et al., 2002). In general, ID speech features higher pitch, expanded pitch contours, slower speaking rate, longer vowels, larger dynamic range, and greater rhythmicity and repetition than AD speech (Stern et al., 1982, 1983; Fernald and Simon, 1984; Fernald et al., 1989). These features, especially high pitch, expanded pitch contours, rhythmicity, repetition, and reduced speaking rate, make ID speech sound much more musical than AD speech (Fernald, 1989, 1992). High pitch, expanded pitch contours, and large dynamic range also reflect the heightened affective quality of typical ID speech, which contrasts with the affective restraint of typical AD speech (Trainor et al., 2000). Nevertheless, ID speech is finely tuned to the infant's age and needs, with mothers using relatively more comforting speech for 3-month-olds, more approving speech for 6-month-olds, and more directive speech for 9-month-olds (Kitamura and Burnham, 2003). Approving speech, with its higher pitch and greater pitch range, receives higher ratings of positive affect by adult listeners (Kitamura and Lam, 2009).

Unlike speech, singing is constrained by the prescribed pitch and rhythmic form of the material (i.e., specific songs). Nevertheless, ID versions of singing are also characterized by higher pitch and slower tempo than non-ID versions of the same songs by the same singers (Trainor et al., 1997; Trehub et al., 1997a,b). While repetition is an important aspect of ID speech, it is central to music in general (Kivy, 1993; Trainor and Zatorre, 2008) and to songs for young children in particular (Trehub and Trainor, 1998).

The available evidence indicates that infants find ID singing more engaging than non-ID singing (Trainor, 1996; Masataka, 1999) just as they find ID speech more engaging than AD speech (Fernald, 1985; Werker and McLeod, 1989; Pegg et al., 1992). One possible source of infants' enhanced engagement is the heightened positive expressiveness of typical ID speech and singing (Trainor et al., 2000; Trehub et al., 2010; Nakata and Trehub, 2011). In fact, infants exhibit preferential listening to speech that sounds happy rather than sad or inexpressive regardless of the intended audience (Kitamura and Burnham, 1998; Singh et al., 2002). For example, infants listen longer to happy AD speech than to affectively neutral ID speech even when the latter is higher in pitch (Singh et al., 2002). Note, however, that happy ID vocalizations are closer to AD vocalizations described as *high-arousal joy/happiness* or elation than to *low-arousal joy/happiness* (Banse and Scherer, 1996; Bänziger and Scherer, 2005). Infants also exhibit more positive affect to ID expressions of approval than to disapproval or prohibition even when the utterances are low-pass filtered (Papoušek et al., 1990) or presented in an unfamiliar language (Fernald, 1993). The general consensus is that positive vocal emotion, especially the high arousal variety, makes a substantial contribution to infants' interest in ID speech. Nevertheless, one cannot rule out alternative explanations such as the attention-getting potential of expanded pitch and dynamic range and the attention-holding potential of repetition. When these acoustic factors are controlled, however, infants exhibit preferences for the happier speech version (Kitamura and Burnham, 1998; Singh et al., 2002), suggesting that these acoustic features make secondary contributions to infant preferences. Infants' interest is also affected by their age and corresponding needs. For example, 3-month-old infants exhibit greater attention to comforting than to approving ID speech (Kitamura and Lam, 2009).

The influence of ID pitch contours is seen in infants' preferential listening for sine-wave replicas of ID speech that preserve the pitch contours (and timing) with uniform amplitude over those that preserve the timing and amplitude with unvarying pitch (Fernald and Kuhl, 1987). Despite the fact that infants display greater positive affect to approving than to disapproving ID utterances, they listen longer to the former only if they exhibit greater F0 modulation (Fernald, 1993). Interestingly, pitch modulation also makes important contributions to the differentiation of emotions in music and in AD speech (Scherer, 1986, 1995; Laukka et al., 2005). Across cultures, happy-sounding speech and music feature high mean pitch, large pitch variability, relatively high mean amplitude, and rapid rate or tempo (Juslin and Laukka, 2003). Smiling elevates pitch and increases amplitude by altering the mouth opening and shape of the vocal tract, contributing to the vocal qualities associated with happiness (Tartter, 1980). Tender speech and music, by contrast, have lower mean pitch, pitch variability, mean amplitude, and slower rate or tempo than happy speech and music (Juslin and Laukka, 2003).

Perhaps the two classes of songs for infants, lullabies and play songs, are caregivers' expressions of tenderness and happiness, respectively, as well as tools for soothing or amusing infants. In line with their soothing function, lullabies feature very slow tempo, low pitch, falling pitch contours, limited amplitude variation, and soothing tone of voice (Unyk et al., 1992; Trehub et al., 1993; Trehub and Trainor, 1998), properties that are shared with soothing ID speech (Papoušek and Papoušek, 1981; Fernald, 1989). Lullabies are also soothing to adult listeners, so it is not surprising that they are used, at times, as laments (Trehub and Prince, 2010) and in palliative care (O'Callaghan, 2008). Although play songs are commonly sung to Western infants, they are not universal, as lullabies are (Trehub and Trainor, 1998).

Maternal speech melodies are considered central to the expression of maternal affect and the regulation of infant attention and arousal (Fernald, 1992; Papoušek, 1994). Is it possible that musical melodies would be equally effective or even more effective in regulating infant attention and arousal? The melodies or pitch contours of expressive speech differ from those in music (Zatorre and Baum, 2012). In music, pitches are discrete and sustained, and steps from one pitch level to another are generally small, most commonly, one or two semitones, with larger pitch jumps being much less frequent (Vos and Troost, 1989). By contrast, pitches in speech glide continuously over a larger range (Patel et al., 1998), which is even larger in ID speech (Ferguson, 1964; Stern et al., 1982, 1983; Fernald and Simon, 1984). Moreover, pitches have precise targets in music but not in speech (Zatorre and Baum, 2012).

If the expanded pitch and dynamic range of ID speech underlies infants' greater attention to ID than to AD speech (e.g., Fernald, 1993), then infants could show more interest in ID speech than ID singing. If rhythmicity and predictability are relevant (e.g., McRoberts et al., 2009), then infants might exhibit more attention to ID singing than to ID speech. If positive emotion is the critical feature (Kitamura and Burnham, 1998; Singh et al., 2002), then infants could show greater interest in the stimulus expressing more positive affect regardless of whether it is speech or music. For adults, music generates a range of positive emotions from tranquillity and tenderness to joy and euphoria (Blood and Zatorre, 2001; Menon and Levitin, 2005; Zentner et al., 2008; Salimpoor et al., 2011). Some scholars contend that the expression of emotion by some form of music (e.g., protomusic) preceded language (Darwin, 1871; Mithen, 2005). Others regard speech, even at present, as a type of music, especially when considered in developmental perspective (Brandt et al., 2012). If the status of speech is privileged, as some contend (Vouloumanos and Werker, 2004, 2007; Shultz and Vouloumanos, 2010; Vouloumanos et al., 2010), then ID speech would be favored over forms of singing that exclude speech. Obviously, the aforementioned factors are not independent. Nevertheless, comparisons of infants' responsiveness to speech and music are a first step toward the long-range goal of identifying the acoustic features that attract and hold infants' attention. Such features may differ for infants of different ages, as reflected in age-related changes in listening biases for ID speech with comforting, approving, or directive tones of voice (Kitamura and Lam, 2009) and for regular or slowed ID speech (Panneton et al., 2006).

It is difficult to assess infants' degree of engagement with music and even more difficult to ascertain their aesthetic preferences. Instead of overt affective responses to music, infants commonly exhibit interest or attention, sometimes accompanied by reduced motor activity (Nakata and Trehub, 2004). The usual assumption is that longer listening to one of two auditory stimuli reflects preference or greater liking for that stimulus (e.g., Fernald and Kuhl, 1987; Trainor, 1996; Vouloumanos and Werker, 2004). In general, such "preferences" are assessed with the headturn preference procedure, which is used with infants as young as 2 or 3 months of age (e.g., Trainor et al., 2002; Shultz and Vouloumanos, 2010). The procedure involves pairing one auditory stimulus with a visual display and a contrasting auditory stimulus with the same visual display, at the same or different locations, on a series of trials. Infants control the procedure in the sense that looking away from the visual stimulus terminates the auditory stimulus. In other words, they can *choose* to listen to one stimulus longer than another. The interpretation of longer or shorter listening times as positive or negative aesthetic evaluations is questionable in the absence of positive or negative affective displays (Trehub, 2012). At times, infants listen longer to familiar stimuli and, at other times, to novel stimuli (e.g., Rose et al., 1982; Volkova et al., 2006; Soley and Hannon, 2010). Even when infants show positive affect to one auditory stimulus and negative or neutral affect to another, their listening times to the stimuli may not differ (Fernald, 1993). Unquestionably, looking or listening times indicate infants' listening choice or relative attention to the stimuli, but the factors that contribute to such attention are unclear. 
Some listening biases may be innate, arising from the salience of biologically significant stimuli (e.g., human vocal sounds) or biologically significant parameters of sound (e.g., loud or unexpected). Other listening biases may arise from acquired salience, as in preferential responding to the sound of one's name (Mandel et al., 1995) or to a stimulus heard previously (Zajonc, 2001). Attention biases, regardless of their origin, are likely to facilitate learning (Vouloumanos and Werker, 2004).

In addition to the well-documented listening bias for ID over AD speech, there are reported biases for vocal over nonvocal sounds (Colombo and Bundy, 1981; Vouloumanos and Werker, 2004, 2007), speech over non-human vocalizations (Vouloumanos et al., 2010), speech over human non-speech vocalizations (Shultz and Vouloumanos, 2010), musical consonance over dissonance (Trainor and Heinmiller, 1998; Zentner and Kagan, 1998), and familiar over unfamiliar musical meters (Soley and Hannon, 2010). Infants also exhibit considerable interest in vocal music (Glenn et al., 1981), but their exposure to music is much more limited than their exposure to speech (Eckerdal and Merker, 2009). To date, however, there has been little exploration of infants' relative interest in speech and singing. In the single study that addressed this question directly (Nakata and Trehub, 2004), 6-month-old infants watched audio-visual recordings of their mother singing or speaking from an earlier interaction. Infants showed more intense and more sustained interest in singing than in speech episodes, as reflected in greater visual fixation coupled with reduced body movement. Infants' heightened interest in these maternal singing episodes could stem from mothers' propensity to smile more when singing than when talking to infants (Plantinga et al., 2011). In the present study, we used the head-turn preference procedure to assess infants' interest in speech and singing with unfamiliar materials and voices. As noted above, the procedure provides information about infants' listening choices or relative attention rather than their aesthetic preferences.

In line with age-related changes in infants' attention to the affective tone of ID speech (Kitamura and Lam, 2009), developmental changes might be evident in infants' responsiveness to ID speech and song. Accordingly, infants in the present research, who were 4–13 months of age, were divided into three age groups to explore the possibility of comparable age-related changes. In Experiment 1, infants were exposed to ID or happy-sounding speech syllables and soothing hummed lullabies produced by the same woman. The principal question concerned the relative efficacy of soothing hummed song and happy ID speech for attracting and maintaining infants' attention. In other words, is vocal music compelling for infants, as it is for adults, even in the absence of speech or properties associated with heightened arousal? If infants listened longer to hummed lullabies than to simple ID speech, it would challenge the prevailing view that infants have an innate or early developing preference for speech over any other auditory stimulus (Vouloumanos and Werker, 2004, 2007; Shultz and Vouloumanos, 2010; Vouloumanos et al., 2010). Experiments 2 and 3 narrowed the differences between speech and singing stimuli by comparing the same verbal materials that were spoken or sung with comparable or contrasting affective intentions. Specifically, infants in Experiment 2 heard sung vs. spoken renditions of the lyrics of a Turkish children's song, both in an ID/joyful manner. Infants in Experiment 3 heard the ID children's song vs. a spoken version of the lyrics in an AD or affectively neutral manner.

All of the stimuli in the present study were portrayed or acted rather than being recorded during actual interactions with infants and adults. Early research on infants' responsiveness to ID and AD speech (e.g., Fernald, 1985) used recordings of women's interactions with their infant and with an adult experimenter. Such stimuli differed dramatically in content as well as expressiveness, making it difficult to identify the factors contributing to infants' responsiveness. Later research used portrayals of ID and AD speech (e.g., Singh et al., 2002; Kitamura and Lam, 2009) so that the content could be carefully controlled across speech registers. When studying infants' responsiveness to ID and non-ID singing (e.g., Trainor, 1996; Masataka, 1999), it is possible to use recordings of mothers singing the same song in the presence or absence of their infant. Comparisons of natural ID speech and singing (e.g., Nakata and Trehub, 2004), however, necessarily differ in content as well as form. Because the features of ID speech and singing have been described extensively (e.g., Ferguson, 1964; Trainor et al., 1997), it is possible to create relatively natural portrayals of those stimuli. For practical as well as ethical reasons, most of the research on vocal emotion (e.g., Scherer, 1986, 1995; Juslin and Laukka, 2003) has used portrayals of various emotions rather than emotional expressions produced in natural contexts.

#### **EXPERIMENT 1**

The goal of the present experiment was to examine the possibility that infants might be more responsive to vocal music than to happy ID speech even for vocal music lacking the acoustic features (e.g., highly variable pitch and dynamics) and expressive intentions (high-arousal happiness) that have been linked to infant preferences for ID speech (e.g., Fernald, 1985; Singh et al., 2002). By using hummed songs, it was possible to generate vocal music without speech. Humming, usually with closed mouth, can be used to generate melodies with sustained nasal sounds that have low spectral amplitude (Kent et al., 2002). Because humming constrains amplitude modulation, it provides reduced scope for expressing high-arousal emotions. There are speculations, however, that humming played an important role in early hominid evolution, functioning like contact calls in other species (Jordania, 2010). At present, humming may be the most common type of informal, solitary singing.

We considered lullabies the musical genre of choice because of their suitability for humming, their universal use in caregiving (Trehub and Trainor, 1998), and their stark contrast with happy ID speech in acoustic features and affective intentions. As noted, lullabies transmit positive affective qualities such as tranquillity and tenderness both in their musical features and vocal tone. The ID speech stimuli approximated those used in previous research on infants' listening biases for speech (Vouloumanos and Werker, 2004, 2007). They consisted of nonsense syllables with typical exaggerated pitch contours and happy voice quality. For adults, it is likely that the lullabies, although unfamiliar, would have high aesthetic appeal, while the repetitive, high-pitched nonsense syllables would sound boring or worse. Nevertheless, the speech combined the exaggerated pitch contours and joyful expressiveness that have been linked to infant preferences in contemporary urban cultures (Fernald and Kuhl, 1987; Kitamura and Burnham, 1998; Singh et al., 2002). If infants share adults' aesthetic appraisals or favor universal forms, they would listen longer to the hummed versions of traditional lullabies. On the basis of previous research with Western infants, however, one might expect them to listen longer to the arousing and joyfully rendered speech.

#### **METHOD**

#### *Participants*

The sample consisted of 50 healthy, full-term infants who were 4.3–13.1 months of age (*M* = 8.6 months, *SD* = 2.6) divided into 3 age groups: 4–6 months (*M* = 5.5, *SD* = 0.48; *n* = 16), 7–9 months (*M* = 8.6, *SD* = 0.87; *n* = 16) and 10–13 months (*M* = 11.5, *SD* = 0.74; *n* = 18). No infant had a family history of hearing loss or personal history of ear infections, and all were free of colds or ear infections on the day of testing. An additional five infants failed to complete the test session because of fussiness. This experiment and others in this report were approved by the Arts and Sciences ethics committee of the University of Montreal, and written informed consent was obtained from all participating parents.

#### *Stimuli*

The speech stimulus, which was comparable to that used by Vouloumanos and Werker (2004) except for a different speaker, consisted of 12 variations of each of two nonsense syllables (*lif* and *neem*) spoken with ID prosody. Varied repetitions of each syllable had rising, falling, and rising-falling (i.e., bell-shaped) pitch contours. There were two versions of the syllabic sequence, differing only in the order of elements. Each sequence consisted of a semi-random ordering of syllables, with the constraint that any four consecutive syllables contained two instances each of *lif* and *neem*. Syllables were separated by silent inter-stimulus intervals (ISIs) of 300–500 ms, and the order of ISIs was randomly distributed, with a mean of 450 ms, as in Vouloumanos and Werker (2004). Each sequence was approximately 20 s in duration, and was repeated for an overall duration of 40 s. The music stimulus consisted of a hummed version of a lullaby. There were two traditional lullabies, one Chilean (in duple meter, AA form) and one German (in triple meter, AB form), each approximately 40 s in duration and each assigned to half of the infants. Hummed and spoken stimuli were produced by a native speaker of English who had considerable music training, singing experience, and experience with children. She was instructed to produce the nonsense syllables in a lively ID manner and to hum the melodies as if lulling an infant to sleep. She listened to many samples of ID speech and singing beforehand (including the Vouloumanos and Werker syllables) and used pictures of infants to help induce the appropriate mood for her speaking or lulling. Sample stimuli are presented in Supplementary Materials.
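The ordering constraint on the syllable sequence can be illustrated with a short sketch. It is not the authors' stimulus-preparation code: the function name `build_sequence`, the token labels, and the block-wise reading of the constraint (each successive block of four syllables holds two *lif* and two *neem* tokens) are assumptions made for illustration, and the ISIs are not modeled.

```python
import random

SYLLABLES = ("lif", "neem")


def build_sequence(n_variants=12, seed=None):
    """Return a list of (syllable, variant_index) tokens.

    Assumption: the sequence uses each of the n_variants recordings of
    each syllable exactly once, arranged in blocks of four that contain
    two tokens of each syllable in shuffled order.
    """
    rng = random.Random(seed)
    # Shuffle the variant indices of each syllable independently.
    pools = {s: rng.sample(range(n_variants), n_variants) for s in SYLLABLES}
    sequence = []
    for _ in range(n_variants // 2):  # each block consumes two variants per syllable
        block = ["lif", "lif", "neem", "neem"]
        rng.shuffle(block)
        for syllable in block:
            sequence.append((syllable, pools[syllable].pop()))
    return sequence
```

A stricter sliding-window reading, in which every run of four consecutive syllables is balanced, would force the syllable order to repeat with a period of four, so the block-wise reading is used here.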

Acoustic features of the stimuli, which were measured with Praat software (Boersma and Weenink, 2010), are shown in **Table 1**. Because pitch extraction software is prone to octave errors, it is common to manually specify a minimum and maximum fundamental frequency (F0 in Hz) or to use a formula for setting the F0 range of each sound such as that suggested by De Looze and Hirst (2008): floor = q25 × 0.75; ceiling = q75 × 1.5. We used this formula for acoustic analyses in the present study. Mean F0 was higher for singing (*M* = 280.2 Hz) than for speech (*M* = 244.2 Hz, difference of 2.46 semitones), but speech was more variable in F0, amplitude, and timing. The standard deviation (*SD*) of F0, a measure of pitch variability, was 3.81 and 3.40 semitones for speech and singing, respectively. As can be seen in **Figure 1**, which depicts the F0 contours, changes in pitch were larger and more abrupt for the speech than for the humming stimuli. Amplitude variation (*SD*), measured in the voiced portions of each sound, was 9.31 dB for speech and 4.46 dB for singing. The timing of the syllables was varied deliberately as in Vouloumanos and Werker (2004).
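The F0 range rule and the semitone comparisons described above can be sketched as follows. This is a minimal illustration rather than the authors' Praat workflow: the function names are hypothetical, the quartiles are assumed to come from a first-pass F0 track in Hz, and Python's default quartile method may not match the one used in Praat. Semitone differences computed from the rounded means in the text can also differ slightly from the reported values.

```python
import math
import statistics

def f0_range(f0_hz):
    """Pitch floor and ceiling (Hz) from the quartiles of a first-pass F0 track.

    Rule cited in the text (De Looze and Hirst, 2008):
    floor = q25 * 0.75, ceiling = q75 * 1.5.
    """
    q25, _median, q75 = statistics.quantiles(f0_hz, n=4)
    return 0.75 * q25, 1.5 * q75

def semitones(f_high, f_low):
    """Interval between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_high / f_low)

if __name__ == "__main__":
    # Hypothetical first-pass F0 values in Hz, for illustration only.
    track = [100, 200, 300, 400, 500, 600, 700]
    print(f0_range(track))
    print(semitones(440.0, 220.0))  # one octave: 12.0
```

In practice the floor and ceiling would be passed back to the pitch tracker for a second extraction pass, which is what limits octave errors.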

#### *Apparatus*

Testing was conducted in a sound-attenuating booth (IAC) 4 by 4 m in size. Infants were seated on their parent's lap facing a central computer monitor at a distance of 127 cm, with two identical monitors to the right and left side of the central monitor and at a distance of 152 cm from infants. Parents wore earphones (ER-4 MicroPro with reusable ER-4S eartips) with an approximate attenuation of 35 dB and earmuffs (Peltor H10A, Optime 105) with an approximate attenuation of 30 dB. They heard continuous music through the earphones to mask the sounds presented to infants. The walls and table for the monitors were covered with black cloth to reduce visual distraction and optimize attention to the target stimuli. A camera immediately above the central screen provided a continuous record of infant visual behavior on a monitor outside the booth. Two loudspeakers (Genelec 8040A) located behind the lateral monitors transmitted the sounds at a comfortable listening level, approximately 60–65 dB (A). The procedure was controlled by customized software on a computer (Mac Pro 8 cores) located outside the booth.

#### **Table 1 | Acoustic features of stimuli.**

**FIGURE 1 | Fundamental frequency (F0) contours of 5-s excerpts from each sound type. (A)** hummed lullaby (Chilean) and syllable sequence, **(B)** ID sung and spoken lyrics of Turkish play song, **(C)** ID sung and AD spoken lyrics of Turkish play song.

#### *Procedure*

The head-turn preference procedure (Kemler Nelson et al., 1995) was used. Infants remained seated on their parent's lap throughout the procedure, and parents were asked to minimize their own movement. Infants were randomly assigned to one of the two speech sequences and one of the two hummed lullabies. The speech and singing stimuli were presented on 10 alternating trials, with order of stimuli (speech or singing first) and side of presentation (left or right) counterbalanced across infants. On each trial, the infant's attention was attracted to one monitor by a flashing red square. As soon as the infant looked at that monitor, one sound stimulus was presented together with a visual animation of a carousel. When the infant looked away from the monitor for more than 2 s, the visual and sound stimuli were terminated. The infant's attention was then attracted to the other monitor. Looking at that monitor initiated the same visual stimulus but the contrasting auditory stimulus, which continued until the infant looked away for 2 s. On each trial, the stimulus was always presented from the beginning (i.e., beginning of the lullaby or syllable sequence). The experimenter outside the booth, who had no access to sound (auditory stimuli or infant vocalization) and no information about test conditions, observed the infant's behavior on the external monitor and continuously recorded looking toward or away from each monitor in the booth by means of key codes on a computer keyboard. Looking times during the presentation of each stimulus type were computed automatically. Typically, infants completed the procedure in approximately 5 min.
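The trial logic described above (accumulate looking time, end the trial once a look-away exceeds 2 s) can be sketched as follows. The customized software is not described in the text, so everything here is an assumption made for illustration: the event format, the function name, and the choice to exclude brief look-aways from the looking time.

```python
LOOK_AWAY_LIMIT_S = 2.0  # look-away criterion stated in the text

def trial_looking_time(events, limit=LOOK_AWAY_LIMIT_S):
    """Return (looking_time_s, trial_end_s) for one trial.

    events: time-sorted (time_s, is_looking) state changes coded by the
    experimenter, starting with the first look at the monitor. The last
    entry only marks the end of the record; its flag is ignored.
    Assumption: look-aways shorter than the limit pause the count
    without ending the trial.
    """
    looking_time = 0.0
    for (start, looking), (stop, _flag) in zip(events, events[1:]):
        span = stop - start
        if looking:
            looking_time += span
        elif span > limit:
            # The look-away exceeded the criterion, so the stimulus
            # is stopped once the limit has elapsed.
            return looking_time, start + limit
    return looking_time, events[-1][0]

if __name__ == "__main__":
    # Hypothetical coding record for a single trial.
    record = [(0.0, True), (5.0, False), (6.0, True), (10.0, False), (13.0, False)]
    print(trial_looking_time(record))  # (9.0, 12.0)
```

Cumulative looking time per stimulus type would then be the sum of these per-trial values over the analyzed trials.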

#### **RESULTS**

Infants often look disproportionately long on the initial trial of a novel stimulus, so it is common to exclude the first two trials (i.e., initial exposure of each stimulus) from data analysis (e.g., Vouloumanos and Werker, 2004; Volkova et al., 2006), a procedure followed here. These initial trials can be considered familiarization rather than test trials. Missing values from infants (4 incomplete trials: 1–2 trials from 3 infants) were replaced with the multiple imputation method (Graham, 2012) implemented with NORM software (Schafer, 1999). Substitution or omission of those values yielded similar results. A preliminary analysis of variance revealed that the effect of age (4–6, 7–9, 10–13 months) on looking time was not significant. Age, considered as a continuous variable in a regression analysis, also made no contribution to looking time. Consequently, age was excluded from the main analysis. A paired sample *t*-test on cumulative looking time across the four trials with each stimulus revealed a significant difference between speech and singing [*t*(49) = 3.35, *p* < 0.01, two tailed]. Infants looked longer during the syllable sequences (*M* = 77.93 s, *SD* = 53.98 s) than during the hummed lullabies (*M* = 50.14 s, *SD* = 29.58 s) (see **Figure 2**). A binomial test revealed that of the 50 infants in the sample, 36 (72.0%) had longer looking times for speech, *z* = 3.11, *p* < 0.01.
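The two test statistics reported above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' analysis script: the function names are hypothetical, the binomial test is taken to be the normal approximation without continuity correction (the text does not say which variant was used), and the paired *t* statistic needs the per-infant looking times, which are not given here.

```python
import math
import statistics

def binomial_z(k, n, p=0.5):
    """z for k successes in n trials under chance level p
    (normal approximation, no continuity correction)."""
    return (k - n * p) / math.sqrt(n * p * (1 - p))

def paired_t(x, y):
    """Paired-sample t statistic and degrees of freedom for two
    equal-length sequences of per-infant looking times."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

if __name__ == "__main__":
    # 36 of 50 infants looked longer during speech (Experiment 1).
    print(round(binomial_z(36, 50), 2))  # 3.11
```

Under these assumptions the uncorrected approximation reproduces the *z* value reported for Experiment 1.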

#### **DISCUSSION**

Infants exhibited greater attention to the ID speech syllables than to the hummed lullabies despite the greater coherence and continuity of the lullabies. Although our findings are consistent with the speech bias that has been proposed for young infants (Vouloumanos and Werker, 2004, 2007), there are a number of alternative interpretations. The stimuli contrasted in other respects than the presence or absence of speech or syllabic content. For one thing, the speech was considerably more variable than the humming in pitch and amplitude. Typical ID speech has much more continuity than the present sequence of disconnected syllables, each of which had the properties of stressed syllables. Moreover, each of the syllables had the exaggerated pitch contours that are considered critical in capturing infant attention (e.g., Fernald and Kuhl, 1987), and these contours were highly variable. The speech stimuli also had bursts of very high-pitched sound at irregular time intervals (see **Figure 1**), which could have functioned as salient alerting signals. Hummed speech produces less neural activation than natural speech (Perani et al., 2011), so one would expect hummed music to produce less cortical activation than other types of vocal music.

The affective qualities of the stimuli also differed dramatically, with the speech having the properties of high-arousal happiness or joy and the hummed lullabies being tranquil and soothing. Joyful or happy speech reliably attracts and maintains the attention of Western infants (Singh et al., 2002), and joyful music may do likewise. In contrast to Western mothers, who engage in lively vocal and non-vocal interactions with their infants, mothers in many other cultures interact in ways that are primarily soothing rather than arousing (Toda et al., 1990; Trehub and Schellenberg, 1995; Trehub and Trainor, 1998). It is possible that infants who are accustomed to soothing vocal interactions would distribute their attention differently from the infant participants in the present experiment. Nevertheless, the youngest infants in the present study, who might experience more soothing interactions than the older infants (Kitamura and Burnham, 2003), responded no differently than the older infants.

Finally, the stimuli in the present study were atypical in a number of respects. The speech stimulus had the usual exaggerated pitch contours and happy affect of Western mothers in the unusual context of two single, non-contiguous syllables that repeated with variable renditions (following Vouloumanos and Werker, 2004, 2007). In other words, it was dramatically different from conventional ID speech. Although lullabies, sung or hummed, are common in non-Western cultures, they are used infrequently in Western cultures (Trehub and Trainor, 1998). If Western infants are exposed to lullabies, such exposure typically occurs when they are sleepy or distressed rather than awake or alert. For those reasons, we used more conventional stimulus materials in subsequent experiments, namely the lyrics of foreign children's songs that were spoken or sung.

#### **EXPERIMENT 2**

The goal of the present experiment was to ascertain the relative efficacy of speech and singing for maintaining infant attention when verbal or syllabic content and affective intentions are similar across vocal modes. Infants were presented with a sung and spoken version of an unfamiliar Turkish play song, both produced in an ID or joyful manner. The same lyrics ensured comparable phoneme sequences despite their different realization in speech and singing. Although the overall affective intentions were joyful in both cases, the means of achieving those intentions differ in speech and singing, with unknown consequences.

In research with ID and AD speech, the stimuli are often drawn from natural interactions with infants and adults (e.g., Kitamura and Burnham, 1998) so that verbal content and speaking style differ. At other times, actors portray ID and AD speech with the same verbal content (e.g., Singh et al., 2002). No previous study used the texts of play songs, which include words and nonsense syllables that are distinctive and memorable as well as alliteration, assonance, and rhyme. As a result, the spoken ID version was closer to a spoken nursery rhyme than to conventional ID speech, reducing many of the usual differences between spoken and sung material for infants. Differences between speech and singing still remained, however, with speech being more variable in its pitch patterns and amplitude and also lacking the steady beat of music. If the expanded pitch range and greater pitch variability of speech drive infant attention (e.g., Fernald and Kuhl, 1987; Fernald, 1992), then infants could be expected to attend longer to the spoken lyrics. If happy affect is primarily responsible for infants' listening choices, as is the case for speech style (Singh et al., 2002), then infants might respond no differently to happy ID speech and singing with comparable verbal content.

#### **METHOD**

#### *Participants*

The sample included 48 healthy full-term infants who were 4.2–12.4 months of age (*M* = 8.3 months, *SD* = 2.3), with the same inclusion criteria as Experiment 1, and the same age groups: 4–6 months (*M* = 5.7, *SD* = 0.9; *n* = 16), 7–9 months (*M* = 8.5, *SD* = 0.8; *n* = 16), and 10–12 months (*M* = 10.8, *SD* = 0.8; *n* = 16). An additional 6 infants were excluded from the final sample because of experimenter error (*n* = 2) or failure to complete the test session (*n* = 4).

#### *Stimuli*

Stimuli consisted of unfamiliar foreign lyrics (Turkish) of a play song (duple meter, AABAA form) that were spoken or sung. The performer was a native Turkish speaker and trained singer who had considerable experience with children. She listened to many samples of ID speech and singing and was instructed to speak and sing as if doing so for an infant. Stimuli are available in Supplementary Materials. Acoustic features of the sounds, as analyzed by Praat software (Boersma and Weenink, 2010) with pitch range settings following Experiment 1, are shown in **Table 1**. Sung versions were slightly longer than spoken versions, 26.8 s vs. 24.6 s. Mean pitch level was 2.3 semitones higher for sung (*M* = 351.14 Hz) than spoken versions (*M* = 312.28 Hz), but spoken versions had considerably greater pitch range (17.64 vs. 11.41 semitones) and pitch variability (*SD*s of 3.86 and 2.34 semitones, respectively). The mean pitch of the sung lyrics was substantially higher for the highly trained Turkish singer than for mothers' ID singing of play songs (253.6 Hz) (Trainor et al., 1997), but the pitch level of the spoken lyrics was comparable to that of mothers' ID speech (Fernald et al., 1989). As can be seen in **Figure 1**, however, there was more overlap of the ID speech and singing contours than was the case for Experiment 1.

#### *Apparatus and procedure*

The apparatus and procedure were identical to Experiment 1.

#### **RESULTS**

As in Experiment 1, a preliminary ANOVA revealed no effect of age on looking time, so age was excluded from the main analysis. A paired sample *t*-test on cumulative looking time across four trials with each stimulus (initial two trials omitted, as in Experiment 1) revealed no difference between speech (*M* = 66.97 s, *SD* = 43.24 s) and singing (*M* = 56.58 s, *SD* = 31.57 s) [*t*(47) = 1.30, *p* = 0.199, two tailed] (see **Figure 2**).

#### **DISCUSSION**

Infants' attention did not differ for spoken and sung versions of a Turkish play song performed in an ID manner. The absence of differential attention, even in the presence of greater pitch and duration variability of the spoken versions (i.e., lively and rhythmic ID speech), implies that such acoustic variability, in itself, cannot account for the attention differences in Experiment 1 or in previous research (Nakata and Trehub, 2004). The findings raise the possibility that happy vocal affect, which characterized the spoken and sung versions, is primarily responsible for infants' engagement. Affective voice quality may be transmitted, in part, by the acoustic features that were measured but it is also transmitted by vocal timbre (i.e., tone of voice), which is not readily amenable to quantification. Issues of affective intent were addressed in the subsequent experiment.

#### **EXPERIMENT 3**

In the present experiment, we altered the affective intent of the spoken lyrics of Experiment 2 for comparison with the ID sung lyrics. Infants were exposed to the ID sung version from Experiment 2 and a spoken version in a non-ID style with neutral affect. If infants' attention is driven primarily by the joyful or happy quality of adult vocalizations, then they should exhibit greater attention to the sung versions than to the spoken versions. Just as infants are more engaged by happy speech than by neutral speech regardless of the ID or AD register (Kitamura and Burnham, 1998; Singh et al., 2002), we expected them to be more engaged by happy than by neutral vocal material regardless of whether it was spoken or sung.

#### **METHOD**

#### *Participants*

The sample included 48 healthy, full-term infants who were 4.7–12.5 months of age (*M* = 8.3 months, *SD* = 2.5). Inclusion criteria were comparable to Experiment 1, as were the age groups: 4–6 months (*M* = 5.7, *SD* = 0.7; *n* = 16), 7–9 months (*M* = 8.0, *SD* = 0.9; *n* = 16), and 10–12 months (*M* = 11.3, *SD* = 0.8; *n* = 16). An additional five infants were excluded from the final sample because of failure to complete the test session (*n* = 4) or parents' interaction with infants during the test session (*n* = 1).

#### *Stimuli*

Stimuli consisted of the same sung lyrics of the Turkish play song used in Experiment 2, which was unfamiliar to infants or mothers, and an affectively neutral version of the spoken lyrics. The lyrics were spoken by the same native Turkish speaker from Experiment 2, who was instructed to speak with neutral affective tone as if communicating with an adult. Stimuli are available in Supplementary Materials. Acoustic features of the sounds (analyzed by means of Praat software) are shown in **Table 1**. Pitch range setting followed the procedures described in Experiment 1. The sung version was substantially longer (26.8 s) than the spoken version (19.02 s), reflecting the slow pace of singing relative to ordinary speech. Mean pitch level for the sung and spoken versions was 350.14 and 210.24 Hz, respectively, corresponding to a difference of 8.9 semitones. F0 variability (*SD*) for the spoken and sung lyrics was similar at 2.30 and 2.34 semitones, respectively, as was the pitch range (i.e., difference between minimum and maximum pitch) of 11.33 and 11.41 semitones, respectively (see **Figure 1**). In short, the singing and speech stimuli differed substantially in pitch level, rate, and vocal tone (happy vs. neutral) but were comparable in pitch variability and pitch range.

#### *Apparatus and procedure*

The apparatus and procedure were identical to Experiment 1.

#### **RESULTS**

Missing values for one infant on the final trial were handled by the multiple imputation method (Graham, 2012), as in Experiment 1. Data from one outlier (>3 *SD* from the mean) were excluded from the data set. Inclusion of the outlier and omission of the missing trial did not alter the results. A preliminary ANOVA revealed no effect of age on looking time, so age was excluded from the main analysis. A paired sample *t*-test on cumulative looking time across the four trials for each stimulus type revealed a significant difference between speech and singing [*t*(46) = 2.34, *p* < 0.05, two tailed]. Infants looked longer in the context of singing (*M* = 68.17 s, *SD* = 40.41 s) than in the context of neutral speech (*M* = 49.20 s, *SD* = 29.45 s) (see **Figure 2**). A binomial test revealed that, of the 47 infants in the sample, 34 (72.3%) looked longer during the presentation of singing, *z* = 3.016, *p* < 0.01.

#### **DISCUSSION**

As predicted, infants exhibited greater attention during the presentation of the happy ID singing than during the neutral AD speech. Despite identical lyrics, similar pitch range (but different pitch register), and similar pitch variability of the sung and spoken versions, singing maintained infants' attention more effectively than did speech. The findings are consistent with a critical role for positive vocal affect, specifically happy or joyful vocalizations. An alternative explanation is that infants responded on the basis of pitch register, with the higher register of ID singing attracting their attention more effectively than the lower register of AD speech (see **Figure 1**). In speech contexts, however, happy vocal affect makes a greater contribution to infant attention than pitch register does (Kitamura and Burnham, 1998; Singh et al., 2002).

#### **GENERAL DISCUSSION**

The purpose of the present study was to ascertain infants' relative interest in singing and speech. In Experiment 1, infants showed greater attention to happy ID versions of a series of unconnected nonsense syllables than to soothing hummed lullabies. The soothing humming proved to be no match for the effusively spoken syllables, which combined features of alerting vocalizations and joyful speech as well as high acoustic variability. In general, Western mothers' interactions with infants, whether spoken or sung, are lively and playful, in contrast to the soothing interactions and high levels of body contact that prevail in many non-Western cultures (Morikawa et al., 1988; Fernald, 1992; Trehub and Trainor, 1998). Perhaps infants' listening choices to stimuli such as these would differ in different cultures (e.g., non-Western) and contexts (e.g., when infants are experiencing fatigue or distress).

In Experiment 2, infants heard the lyrics of a Turkish play song that were spoken or sung in a lively, joyful manner. Neither the higher mean pitch of the sung versions nor the greater pitch range and pitch variability of the spoken version resulted in differential infant attention, as they have in previous studies of ID and AD speech (Fernald and Simon, 1984; Fernald and Kuhl, 1987) or ID and non-ID singing (Trainor, 1996; Trainor and Zacharias, 1998). Obviously, the absence of a difference does not provide definitive evidence of equivalent interest in the stimuli, but it is consistent with the notion that infants' listening preferences are influenced primarily by the joyful or happy expressiveness of speech and singing. It is also consistent with newborns' comparable right hemisphere responses to lyrics that are spoken or sung in a happy manner (Sambeth et al., 2008).

In Experiment 3, infants' greater interest in the joyfully sung lyrics than in the neutrally spoken lyrics is in line with high positive affect driving infant attention. The speech stimuli of Experiment 1, the speech and singing stimuli of Experiment 2, and only the singing stimuli of Experiment 3 had features associated with vocal expressions of high-arousal happiness or joy (Banse and Scherer, 1996; Bänziger and Scherer, 2005). Taken together, the results of the three experiments are consistent with the possibility that features associated with vocal expressions of high-arousal happiness or joy are the principal determinants of infant preferences. Infants' attention to stimuli reflecting high levels of positive affect has been documented in visual (Kuchuk et al., 1986; Serrano et al., 1995) as well as auditory (Papoušek et al., 1990; Fernald, 1993; Kitamura and Burnham, 1998; Singh et al., 2002) contexts.

Although caregivers' expressive intentions are important for regulating infants' attention, other factors such as timing and pitch patterns may play an independent role. Music is much more predictable than speech in its temporal and pitch structure, generating expectations and the fulfillment of those expectations as the music unfolds (Kivy, 1993; Trainor and Zatorre, 2008; Jones, 2010). Such predictability contributes to the appeal of music for mature listeners (Kivy, 1993), and it may do so for infants as well. Maternal sung performances for infants have even greater predictability than other music, with many mothers singing the same songs at the same tempo and pitch level on different occasions (Bergeson and Trehub, 2002). Although maternal speech, with its frequent repetition of phrases and intonation contours, is much more predictable than AD speech, the contours are usually repeated with different verbal content (Bergeson and Trehub, 2002, 2007). The speech in Experiment 1, consisting of variable renditions of two syllables, carried repetition to an extreme from the perspective of adults, but the predictable content in the context of changing pitch contours may have highlighted those contours. The lullabies were also repetitive, as are most lullabies (Unyk et al., 1992), but repetition occurred on a longer timescale than for the monosyllabic speech sounds.

The slow tempo and minimal amplitude variation of the lullabies de-emphasized the typical rhythmic regularity of music. The Turkish play song was more rhythmic than its spoken counterpart in Experiment 2, but the simple, repetitive lyrics sounded more like a nursery rhyme or poetry than conventional ID speech. Poetry blurs many of the distinctions between speech and singing by its inclusion of rhythm, meter, rhyme, alliteration, and assonance (Tillmann and Dowling, 2007; Obermeier et al., 2013), all of which were featured to varying degrees in the ID spoken and sung versions of the play song. In addition to having several repeated and rhyming syllables, the speech stimuli in Experiment 2 also had wider pitch contours than the sung stimuli. Such pitch contours have been linked to infants' listening bias for ID over AD speech (e.g., Fernald and Kuhl, 1987). Expanded pitch contours may compete with timing regularity for gaining and retaining infants' attention. Differences in pace, timing regularity, and rhythmicity between speech and singing were pronounced in Experiment 3 when singing finally prevailed. Naturally, one would expect infants' attention to be influenced by several factors acting together rather than a single factor (Singh et al., 2002), with some features being more salient than others in different situations. The acoustic parameters of the speech stimuli in Experiments 2 and 3 conformed to conventional differences between Western ID and AD registers (e.g., Fernald and Simon, 1984), with the ID speech having substantially higher mean pitch, a pitch range that was over 6 semitones greater, and a speaking rate that was substantially slower than the AD or neutral versions (Ferguson, 1964; Stern et al., 1982, 1983). In fact, the ID version of spoken lyrics, with its heightened pitch and slowed rate (see **Table 1**), was much closer to the sung version than it was to the neutral or AD spoken version (see **Figure 1**).

Obviously, speech and singing are not uniform across persons or contexts, and the differences between them narrow or widen in different situations. ID speech capitalizes on dimensions that are central to music, especially pitch and rhythm, which make it sound more musical than non-ID speech (Fernald, 1992; Trainor et al., 2000). Although maternal speech is more acoustically variable than maternal singing (Bergeson and Trehub, 2002), mothers make their speech more accessible to infants by the use of individually distinctive intonation patterns or tunes (Bergeson and Trehub, 2007).

To the adult ear, speech and singing, even ID speech and singing, are distinct classes. For young infants, however, melodious speech and singing may be variations on a theme. Brandt et al. (2012) suggest that speech is a special form of music, at least from the perspective of pre-verbal infants. Before language achieves referential status, infants may hear human vocal sequences as sound play, which is what music is all about (Brandt et al., 2012). Because speech lacks the constraints of music, it can become music-like without losing the essential properties of speech. Not only does ID speech exaggerate the features of conventional speech; it also incorporates some musical features such as sustained vowels and phrase-final lengthening, exaggerating others such as pitch range expansion (e.g., Fernald et al., 1989). The elevated pitch and slow tempo of ID speech are comparable to the pitch and tempo of ID singing and to music in general. Perhaps ID speech would be misjudged as music in cultures in which vocal music incorporates free rhythm and pitch glides (e.g., Clayton, 2000).

The present study provides support for the view that happy vocalizations or those with high positive affect, whether speech or singing, play an important role in regulating infant attention. The happy talk of Experiment 1 elicited greater infant attention than the soothing humming, and the happy singing of Experiment 3 elicited greater attention than the neutral speech. When speech and singing were both happy, as in Experiment 2, there was no difference in infants' attention. Can one conclude that there would be no difference in infants' attention to happy speech and singing outside as well as inside the laboratory? Not necessarily. In everyday life, ID vocal interactions typically involve a familiar voice (e.g., parent), familiar content (e.g., frequently sung song, familiar phonemes, repeated syllable sequences), familiar face and facial expressions, as well as physical contact or movement, creating many possibilities for differential responsiveness to multimodal speech and singing. In fact, infants are more attentive to happy maternal singing than to happy maternal speech when the material is presented audiovisually (Nakata and Trehub, 2004).

Finally, the present research examined infants' attention in a series of relatively brief trials, providing insight into the potential of the stimuli for *capturing* their attention rather than *maintaining* it for sustained periods of time. In principle, one stimulus might be better for initial attention capture (e.g., unconnected speech syllables rendered in a happy voice) while another could have greater efficacy for maintaining attention or contentment, preventing distress, or alleviating distress (e.g., coherent passages of speech or singing). Visual fixation, the measure used in the present study, provides a limited perspective on attention and engagement, being imperfectly correlated with physiological and neural measures of infant attention (Richards et al., 2010) and with infant facial affect (Fernald, 1993). We know, for example, that infants move rhythmically to rhythmic music but not to ID or AD speech (Zentner and Eerola, 2010) and that intense infant attention to vocal music initially leads to reduced body movement (Nakata and Trehub, 2004). Maternal singing also modulates infant cortisol levels (Shenfield et al., 2003). Future research with a wider variety of stimuli and measures may resolve the unanswered questions about infants' responsiveness to expressive speech and singing.

#### **ACKNOWLEDGMENTS**

We thank Roxane Campeau, Audrey Morin, Émilie Gilbert and Cynthia Paquin for their assistance in recruiting and data collection. We also thank Jessica Phillips-Silver and Beste Kalender for their talking, humming, and singing. Finally, we thank Athena Vouloumanos for providing samples of speech stimuli from her research with infants. This research was supported by grants from the Social Sciences and Humanities Research Council of Canada and Advances in Interdisciplinary Research in Singing (AIRS) to the second author and by a doctoral fellowship from the Natural Sciences and Engineering Research Council of Canada to the first author.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/Emotion\_Science/10.3389/fpsyg.2013.00372/abstract

#### **REFERENCES**

*Sci.* 13, 72–75. doi: 10.1111/1467-9280.00413


coding of intonation," in *Proceedings of International Conference on Speech Prosody,* Vol. 4, (Campinas), 135–138.


auditory preferences in nonhandicapped infants and infants with Down's syndrome. *Child Dev.* 52, 1303–1307. doi: 10.2307/1129520


in infant-directed speech: pitch modifications as a function of infant age and sex in a tonal and non-tonal language. *Infant Behav. Dev.* 24, 372–392. doi: 10.1016/S0163-6383 00086-638300083


in infant-directed and non-infant-directed singing. *Psychomusicology* 21, 45. doi: 10.1037/h0094003


their own culture: a cross-cultural comparison. *Dev. Psychol.* 46, 286. doi: 10.1037/a0017555


infants. *Music Percept.* 20, 187–194. doi: 10.1525/mp.2002.20.2.187


bias for speech in neonates. *Dev. Sci.* 10, 159–164. doi: 10.1111/j.1467-7687.2007.00549.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 March 2013; paper pending published: 23 March 2013; accepted: 06 June 2013; published online: 26 June 2013.*

*Citation: Corbeil M, Trehub SE and Peretz I (2013) Speech vs. singing: infants choose happier sounds. Front. Psychol. 4:372. doi: 10.3389/fpsyg.2013.00372*

*This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Corbeil, Trehub and Peretz. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Child implant users' imitation of happy- and sad-sounding speech

#### *David J. Wang<sup>1</sup>, Sandra E. Trehub<sup>2</sup>\*, Anna Volkova<sup>2</sup> and Pascal van Lieshout<sup>2,3</sup>*

*<sup>1</sup> Mississauga Academy of Medicine, University of Toronto, Mississauga, ON, Canada*

*<sup>2</sup> Department of Psychology, University of Toronto, Toronto, ON, Canada*

*<sup>3</sup> Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada*

#### *Edited by:*

*Petri Laukka, Stockholm University, Sweden*

#### *Reviewed by:*

*Swann Pichon, Swiss Center for Affective Sciences, Switzerland*

*Björn Lyxell, Linköping University, Sweden*

#### *\*Correspondence:*

*Sandra E. Trehub, Department of Psychology, University of Toronto Mississauga, 3359 Mississauga Road North, Mississauga, ON L5L 1C6, Canada*

*e-mail: sandra.trehub@utoronto.ca*

Cochlear implants have enabled many congenitally or prelingually deaf children to acquire their native language and communicate successfully on the basis of electrical rather than acoustic input. Nevertheless, degraded spectral input provided by the device reduces the ability to perceive emotion in speech. We compared the vocal imitations of 5- to 7-year-old deaf children who were highly successful bilateral implant users with those of a control sample of children who had normal hearing. First, the children imitated several happy and sad sentences produced by a child model. When adults in Experiment 1 rated the similarity of imitated to model utterances, ratings were significantly higher for the hearing children. Both hearing and deaf children produced poorer imitations of happy than sad utterances because of difficulty matching the greater pitch modulation of the happy versions. When adults in Experiment 2 rated electronically filtered versions of the utterances, which obscured the verbal content, ratings of happy and sad utterances were significantly differentiated for deaf as well as hearing children. The ratings of deaf children, however, were significantly less differentiated. Although deaf children's utterances exhibited culturally typical pitch modulation, their pitch modulation was reduced relative to that of hearing children. One practical implication is that therapeutic interventions for deaf children could expand their focus on suprasegmental aspects of speech perception and production, especially intonation patterns.

**Keywords: prosody, emotion, production, cochlear implants, children**

#### **INTRODUCTION**

Modern cochlear implants (CIs) enable large numbers of prelingually deaf children to perceive speech and acquire the native language of their community by means of electrical rather than acoustic cues (Spencer et al., 1998; Svirsky et al., 2000; Blamey et al., 2001). Because the devices relay degraded pitch and spectral cues (Geurts and Wouters, 2001; Green et al., 2004), CI users have difficulty perceiving pitch sequences (Cousineau et al., 2010) such as the melodies in speech (i.e., intonation) (Hopyan-Misakyan et al., 2009; Nakata et al., 2012) and music (Vongpaisal et al., 2006; Cooper et al., 2008; Kang et al., 2009).

Intonation, perhaps the most salient aspect of speech prosody, corresponds to changes in fundamental frequency (F0) or pitch over time. Such pitch variations are often accompanied by variations in amplitude and duration (Ladd, 1996). In specific contexts, prosodic variations carry linguistic meaning, as when they distinguish nouns (e.g., *pro*ject) from verbs (e.g., pro*ject*) and statements (e.g., You're hungry) from yes/no questions (e.g., You're hungry?). Prosodic variations also provide information about a speaker's emotional state (e.g., happy, sad, angry, fearful) and intentions (e.g., approving, disapproving, sarcastic). This kind of information, pitch patterning in particular, is less accessible to listeners who use CIs. The pitch processing limitations of implants also have implications for speakers of tone languages (e.g., Mandarin, Vietnamese) where contrasting pitch height or contour can signal differences in meaning.

Research on prosody in CI users has focused mainly on perception. In general, child and adult CI users can distinguish statements (i.e., falling terminal pitch contour) from yes/no questions (i.e., rising contour) by gross periodicity cues (Rosen, 1992), but their performance is well below that of their normally hearing (NH) peers (Most and Peled, 2007; Peng et al., 2008). Their pitch processing limitations put them at an even greater disadvantage in the differentiation of vocal emotions. In one study, child CI users 7–13 years of age failed to identify utterances with neutral content that were expressed in a happy, sad, angry, and fearful manner, but they readily identified facial expressions of the same emotions (Hopyan-Misakyan et al., 2009). In other studies, child CI users identified happy and sad vocal expressions on the basis of prosodic cues alone, but they performed significantly worse than their hearing peers (Nakata et al., 2012; Volkova et al., 2013). Happy utterances typically have higher pitch and pitch variability than sad utterances (Scherer, 1986), but CI users, especially adults, may capitalize on available amplitude and duration cues (e.g., greater amplitude variation and faster speaking rate for happy utterances), as indicated by decrements in performance when those cues are unavailable (Luo et al., 2007).

Research on speech production in CI users has focused primarily on intelligibility (Peng et al., 2004a; Flipsen and Colvard, 2006), with relatively limited attention to speech prosody (but see Carter et al., 2002; Lenden and Flipsen, 2007; Peng et al., 2008) or lexical tones. The available evidence indicates that child CI users have difficulty perceiving (Barry et al., 2002) and producing (Wei et al., 2000) lexical tones. They often have difficulty producing the rising pitch contours of yes/no questions, with correlations evident between perception and production of these distinctions (Peng et al., 2008). Their differentiation of emotional expressions is predictive of their ability to imitate familiar expressions with culturally typical prosody (Nakata et al., 2012).

The ability to produce expressive variations in speech is central to communicative and social competence. Little is known, however, about child CI users' ability to produce age-appropriate distinctions in expressive prosody involving the most basic emotions such as happiness or sadness. As a first step in addressing this issue, we sought to determine the extent to which highly competent child CI users and a control sample of NH children could provide credible imitations of happy and sad prosody. In previous research, young NH children as well as child CI users produced imperfect prosodic imitations of brief Japanese utterances (Nakata et al., 2012). For NH children, the major prosodic distinctions are in place by about 5 years of age, but refinements in expressive prosody continue for some years (Cruttenden, 1985). In general, mature control of F0 is not achieved before 7 years of age (Patel and Grigos, 2006). In the present study, we recorded children's imitation of happy and sad utterances produced by a child model. Adults in Experiment 1 listened to each model utterance and imitation, rating the extent to which children's prosody matched that of the model. On the basis of NH children's advantages in the processing of F0 patterns (Vongpaisal et al., 2006; Volkova et al., 2013), we expected them to produce better imitations of the model than child CI users. Adults in Experiment 2 listened to low-pass filtered versions of the utterances that obscured the verbal content and rated each utterance on a scale ranging from very sad to very happy. We expected the happy and sad versions to be more differentiated for NH children than for child CI users. Because happy utterances embody greater prosodic variability than sad utterances (Banse and Scherer, 1996), we predicted that both NH children and child CI users would produce poorer matches of the happy utterances. 
Finally, with the verbal content obscured by electronic low-pass filtering in Experiment 2, we expected the utterances of child CI users to be less interpretable as happy or sad than those of NH children.

#### **EXPERIMENT 1**

The purpose of the first experiment was to explore the ability of child CI users and NH children to imitate conventional happy and sad prosody. Previous research has indicated that child CI users can differentiate happy from sad utterances with age-appropriate stimuli and tasks (Nakata et al., 2012; Volkova et al., 2013). What remains unclear is whether they can produce distinctive happy and sad prosody. Children in the present study were required to imitate a model child's utterances, matching, as closely as possible, her expressive prosody. Adult listeners subsequently rated the closeness of each imitated utterance to the model utterance on a 10-point scale from not at all similar to extremely similar. Utterance content conflicted with prosodic form in half of the utterances. When young children are asked to judge a speaker as feeling happy or sad from utterances with conflicting verbal content and prosodic form, they typically rely on verbal content, in contrast to older children and adults who rely more on prosody (Morton and Trehub, 2001). No such judgment was required in the present experiment because children were simply asked to talk exactly like the model. Nevertheless, the conflicting content and form had the potential to interfere with children's focus on prosody and lead to poorer imitations.

#### **MATERIALS AND METHODS**

#### *Participants*

The deaf participants, or talkers, consisted of nine bilateral CI users (five boys, four girls), 5–7 years of age (*M* = 6.0, SD = 0.7) from well-educated middle-class families who spoke English regularly at home. Of the nine CI users, six were congenitally deaf and had used their prostheses for at least 4.0 years (*M* = 4.8). Of the remaining three children, all were prelingually deaf. One became deaf in the neonatal period and two were diagnosed with progressive hearing loss at 1 year of age. Their implant experience was 3.1, 5.9, and 5.3 years, respectively. All child CI users had normal cognitive abilities. They were considered successful implant users as indicated by their speech perception skills, speech intelligibility, speech quality, and ease of communicating orally with hearing adults and peers. They had participated in auditory-verbal therapy with a focus on language acquisition for at least 2 years after implantation, and all communicated exclusively by oral means. Age of implantation, type of implant, age at recording, and etiology are shown in **Table 1**. The comparison sample consisted of 17 NH children (5 boys, 12 girls), 4–6 years of age (*M* = 5.2, SD = 0.8) who were also from middle-class, English-speaking families. It is common to select NH comparison groups that are slightly younger than the target CI groups to compensate for the reduced years of listening experience of child CI users (Lenden and Flipsen, 2007). Hearing was not tested in NH children, but there was no family history of hearing impairment, personal history of ear infections, or current cold, according to parents' report. The adult raters consisted of 15 NH university students (5 men, 10 women) 19–28 years of age (*M* = 23.0) who participated for partial course credit or token payment. Their hearing status was presumed to be normal by self-report.

#### *Apparatus*

Children's utterances were recorded in a double-walled, sound-attenuating booth (Industrial Acoustics Company) with a microphone (Sony F-V30T) and external sound card (SoundBlaster X-Fi Fatal1ty) linked to a computer workstation outside the booth running Windows XP and Audio Recording Wizard version 4 (NowSmart) software. Audio stimuli for imitation were presented via an amplifier (Harman/Kardon HK3380) outside the booth and two loudspeakers (Electro-Medical Instrument Co.), one on either side of the seated child at a distance of 80 cm and 45° azimuth. NH undergraduates were tested in the same sound-attenuating booth with audio stimuli presented over the loudspeakers. Rating tasks were presented through an interactive computer program that automatically recorded response selections on a 17-inch touchscreen monitor (Elo LCD TouchSystems).

#### **Table 1 | Description of the CI sample.**

#### *Stimuli*

A 10-year-old native speaker of English (female) produced several versions of sentences (see **Table 2**) in a happy and sad manner. The most clearly articulated and prosodically natural versions were selected, by consensus, as model utterances. High-quality digital sound files (44.1 kHz, 16-bit, monaural) were created by means of a digital audio editor (Sound Forge 6.0). Child CI users and NH children began by playing an interactive game in which they copied whatever the experimenter said, doing so exactly the way she said it. After this orientation phase, they were instructed to listen to each recorded utterance of the girl (the model), attempting to imitate it as closely as possible. They were told to pay particular attention to the way the girl spoke, copying her happy or sad way of talking. Then the model utterances were presented, one by one, at approximately 65 dB SPL, and children were recorded while imitating each utterance. The child model presented each utterance in both a happy and sad manner for a total of 16 utterances. Children's imitations were normalized for root-mean-square amplitude by means of PRAAT speech analysis and synthesis software (v. 5.3.17; Boersma and Weenink, 2008). Stimuli were played to the adult raters at approximately 65 dB SPL. Only four of the eight utterances (1, 4, 6, 7 from Table 2, selected randomly) were used in the rating task because of time constraints of testing (1 h session). The final stimulus set for adult listeners consisted of 8 utterances (four happy and four sad versions) from 26 children for a total of 203 utterances (5 of the potential set of 208 utterances were missing because of instances in which children failed to provide an imitation). Sample happy and sad utterances from the child model and from a child CI user are provided in Supplementary Materials.
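For readers unfamiliar with root-mean-square (RMS) normalization, the following minimal sketch illustrates the general idea in Python. It is not the PRAAT procedure used in the study; the function names, the target level of 0.1, and the synthetic tones are assumptions made purely for illustration.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_rms(samples, target_rms=0.1):
    """Scale samples so that their RMS equals target_rms (hypothetical target)."""
    current = rms(samples)
    if current == 0:
        return list(samples)  # silent input: nothing to scale
    gain = target_rms / current  # one gain factor for the whole utterance
    return [s * gain for s in samples]

# Two synthetic "recordings" of the same tone at very different levels
# (illustrative stand-ins for imitations recorded at different loudness).
quiet = [0.01 * math.sin(2 * math.pi * 220 * n / 44100) for n in range(4410)]
loud = [0.50 * math.sin(2 * math.pi * 220 * n / 44100) for n in range(4410)]

quiet_norm = normalize_rms(quiet)
loud_norm = normalize_rms(loud)
```

Because a single gain factor is applied to each utterance, overall level differences between utterances are removed while the relative amplitude variation within an utterance is left intact.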

#### *Procedure*

Normally hearing undergraduates were tested individually. Eight utterances from each child were presented, half happy and half sad versions. Participants listened to each model utterance followed by the imitation of each child CI user and NH child in random order and rated how closely each utterance matched the model on a 10-point scale (1 = not similar at all to 10 = extremely similar). Prior to the actual test trials, participants completed a practice phase with utterances that were not included in the test phase. Participants were instructed to base their ratings on utterance intonation rather than content. In other words, they were encouraged to ignore the occasional word errors that children made. They were not told anything about children's age or hearing status.

#### **Table 2 | Sentences imitated by children.**

PRAAT software was used to extract the acoustic features in children's imitations. Vowel boundaries were demarcated to include the entire vowel from spectrographic depictions of the model utterances and imitations, after which estimates of F0 (mean, SD, range), duration, and intensity variability (SD) were obtained automatically by means of a custom-made script.
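The summary measures named above (F0 mean, SD, and range) can be illustrated with a short Python sketch. This is not the authors' custom PRAAT script; it assumes that an F0 track in Hz has already been extracted, that unvoiced frames are coded as 0 or `None`, and the numerical values are invented for illustration only.

```python
import statistics

def f0_summary(f0_track_hz):
    """Summarize an F0 track (Hz), skipping unvoiced frames coded as 0 or None."""
    voiced = [f for f in f0_track_hz if f]
    return {
        "mean": statistics.mean(voiced),
        "sd": statistics.stdev(voiced),      # sample standard deviation
        "range": max(voiced) - min(voiced),
    }

# Invented F0 values for one vowel in a happy and a sad rendition.
happy = [310, 340, 395, 0, 420, 360, 300]
sad = [250, 255, 248, 0, 240, 238, 235]

happy_stats = f0_summary(happy)
sad_stats = f0_summary(sad)
```

In this toy example the happy rendition has a higher mean F0, a larger SD, and a wider range than the sad rendition, which is the direction of difference the study examines.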

#### **RESULTS**

An analysis of variance (ANOVA) with hearing status (CI, NH) as a between-subjects factor and content/form (consistent, conflicting) as a within-subjects factor revealed a significant effect of hearing status, *F*(1, 29) = 172, *p* < 0.001, reflecting better performance of NH children, and a significant effect of content/form, *F*(1, 29) = 26.47, *p* < 0.001, but no interaction between hearing status and content/form. Unexpectedly, children matched the model better for conflicting than for consistent utterances. Examination of the model's consistent and conflicting utterances indicated systematically lower F0 (i.e., slightly less happy-sounding) for inconsistent happy than for consistent happy utterances. Because the conflicting utterances did not put children at a disadvantage and had comparable effects for both groups, the consistent and conflicting utterances were combined in subsequent analyses.

Adults' mean ratings of the imitations of happy and sad utterances by child CI users and NH children are shown in **Figure 1**. An ANOVA with hearing status (CI, NH) as a between-subjects factor and utterance type (happy or sad) as a within-subjects factor revealed a main effect of hearing status, *F*(1, 14) = 148, *p* < 0.001. This effect reflected lower ratings for child CI users' imitations (*M* = 5.33, SD = 0.32) than for those of NH children (*M* = 6.68, SD = 0.24). In fact, child CI users received significantly lower ratings than NH children on happy utterances (CI: *M* = 5.13, SD = 1.22; NH: *M* = 6.28, SD = 1.03), *t*(14) = 12.98, *p* < 0.001, as well as sad utterances (CI: *M* = 5.53, SD = 1.36; NH: *M* = 7.07, SD = 0.98), *t*(14) = 8.81, *p* < 0.001. There was also a main effect of utterance type, *F*(1, 14) = 10.01, *p* = 0.007, reflecting higher overall ratings for sad utterances (*M* = 6.30, SD = 0.29) than for happy utterances (*M* = 5.71, SD = 0.29). Finally, there was a significant interaction between hearing status and utterance type, *F*(1, 14) = 5.48, *p* = 0.035, which arose from greater rating differences between the happy and sad utterances of NH children than child CI users. In fact, the rating differences for NH children's happy and sad utterances were highly significant, *t*(14) = −3.81, *p* = 0.002, and the same trend was evident for child CI users, *t*(14) = −1.96, *p* = 0.07.

Acoustic features of the child model's utterances and children's imitations are shown in **Figure 2**. It is apparent that the child model's happy and sad utterances were much more distinct in pitch level, pitch variability, pitch range, and intensity variability than were those of the young child imitators, whether hearing or deaf. Nevertheless, the happy and sad imitations of both groups of children were still distinct. The model's happy and sad utterances differed most from the imitators in their greater variability in F0 and F0 range. Likewise, the NH children differed most from child CI users in these indices of variability.

#### **EXPERIMENT 2**

Experiment 1 revealed that NH children imitated the model's happy and sad utterances more effectively than child CI users did, but both groups produced clearly differentiated utterances. In addition, both groups produced better imitations of sad than happy utterances. Although the acoustic analyses revealed distinctive cues for happy and sad utterances, the model's cues were considerably more distinctive than those of the imitators. The question of interest here was whether the imitators' utterances would be interpretable as happy and sad on the basis of prosodic cues alone, that is, when listeners had no access to verbal content.

#### **MATERIALS AND METHODS**

#### *Participants*

The participants were 16 NH undergraduates (4 men, 12 women) 19–28 years of age (*M* = 21.1), who received partial course credit or token payment for their participation. Their hearing status was presumed to be normal by self-report. An additional participant was tested but excluded from the final sample for failure to provide ratings for several utterances.

#### *Apparatus*

The apparatus was the same as in Experiment 1.

#### *Stimuli*

A randomly selected subset of the happy and sad imitations of child CI users and NH children from Experiment 1 – utterances 1, 2, 4, 5, 6, and 7 in **Table 2** – was normalized for root-mean-square amplitude and low-pass filtered with a cutoff frequency of 500 Hz (via PRAAT). Low-pass filtering preserved frequencies below 500 Hz and attenuated higher frequencies, which made the verbal content unintelligible while retaining cues to emotion such as intonation, speech rate, and speech rhythm (Ben-David et al., 2013). The stimuli were presented at approximately 65 dB SPL. The low-pass filtered versions of a happy and sad utterance from one CI user can be found in Supplementary Materials.
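The effect of a 500 Hz low-pass filter can be illustrated with a small Python sketch. This is not the PRAAT filter used in the study. It assumes a Hamming-windowed sinc FIR design with 101 taps and, to keep the example small, a toy sampling rate of 8 kHz rather than the 44.1 kHz of the recordings; the test tones are likewise hypothetical.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass coefficients, scaled to unit gain at 0 Hz."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def apply_fir(samples, taps):
    """Centered convolution with zero padding; output has the input's length."""
    half = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out

fs = 8000                                    # toy sampling rate (assumption)
taps = lowpass_fir(500, fs)
low_tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]    # below cutoff
high_tone = [math.sin(2 * math.pi * 2000 * n / fs) for n in range(800)]  # above cutoff
low_out = apply_fir(low_tone, taps)
high_out = apply_fir(high_tone, taps)
```

The 200 Hz tone, which lies in the range of a child's F0, passes almost unchanged, whereas the 2000 Hz tone, which lies in the range of the spectral detail that supports consonant and vowel identification, is strongly attenuated.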

#### *Procedure*

Participants were tested individually. Happy and sad versions of each utterance (total of 12 utterances per child) were presented for a total of 304 utterances (26 children × 12 utterances each = 312 minus the occasional missing imitations). Participants listened to each filtered utterance and rated how happy or sad each sounded on a 7-point scale (1 = very sad, 4 = neutral, 7 = very happy). Unlike the rating scale in Experiment 1, which involved a single dimension of similarity, the present bipolar scale had a neutral midpoint (neither sad nor happy). Testing was preceded by a familiarization phase to provide exposure to the sound quality of filtered utterances and practice rating the utterances on the happy/sad scale. Utterances in the familiarization phase differed from those in the test phase.

#### **RESULTS**

Mean ratings for happy and sad utterances produced by child CI users and NH children are shown in **Figure 3**. Note that the mean rating for NH children appears to be above the neutral midpoint of four (i.e., in the happy zone) for happy utterances but slightly below the midpoint for child CI users. Note also that both groups achieved mean ratings below four (i.e., in the sad zone) for sad utterances. This clustering of ratings close to the neutral midpoint suggests that, on average, the filtered versions did not sound particularly happy or sad. To ascertain whether adults provided differential ratings of the happy and sad utterances, we examined differences in mean ratings (happy minus sad ratings) for all happy and sad utterances of both groups. One-sample *t*-tests indicated that the difference scores significantly exceeded zero for child CI users (*M* = 0.95, SD = 0.68), *t*(15) = 5.60, *p* < 0.001, as well as NH children (*M* = 1.20, SD = 0.74), *t*(15) = 6.48, *p* < 0.001. A paired samples *t*-test revealed that the difference scores were significantly larger for NH children, *t*(15) = 3.77, *p* = 0.002, than for child CI users, reflecting adults' greater ease of identifying the emotional intentions of NH children from prosodic cues alone.
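The one-sample *t* statistics can be approximately recovered from the reported means and standard deviations, given the 16 raters. The Python sketch below is an illustrative check rather than the authors' analysis; the function name is hypothetical, and small discrepancies from the published values are expected because the published *M* and SD are rounded.

```python
import math

def one_sample_t(mean_diff, sd_diff, n):
    """t statistic and degrees of freedom for H0: mean difference = 0."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1

# Reported happy-minus-sad difference scores, 16 adult raters.
t_ci, df_ci = one_sample_t(0.95, 0.68, 16)   # child CI users
t_nh, df_nh = one_sample_t(1.20, 0.74, 16)   # NH children

print(f"CI: t({df_ci}) = {t_ci:.2f}")
print(f"NH: t({df_nh}) = {t_nh:.2f}")
```

The computed values (about 5.59 and 6.49) agree with the reported 5.60 and 6.48 to within rounding. The paired-samples *t* cannot be checked in the same way, because it depends on the standard deviation of the within-rater differences between the two groups' scores, which is not reported.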

As can be seen from the boxplots in **Figure 4**, there were large individual differences in the efficacy of child CI users' prosodic cues. Although the emotional intentions of NH children were more transparent than those of child CI users, difference scores for the top quartile of child CI users (2.00) and NH children (1.98) were roughly equivalent. Because pitch level and pitch variability are particularly distinctive markers of happy vocal affect (Scherer, 1986), the mean F0 and SD of F0 were compared for the happy and sad utterances of NH children and child CI users by means of paired-sample *t*-tests (with Bonferroni corrections for multiple tests). Happy utterances of NH children had significantly higher mean F0, *t*(17) = 5.92, *p* < 0.001, and SD of F0, *t*(17) = 5.04, *p* < 0.001, than sad utterances. Mean F0 also differentiated the happy and sad utterances of child CI users, *t*(8) = 3.86, *p* = 0.049, but F0 variability did not. Again, there were large individual differences in child CI users' use of F0 and F0 variability to distinguish their happy from their sad utterances. Despite the modest sample size of child CI users (*n* = 9), mean F0 difference of happy and sad utterances was highly correlated with adults' difference scores (ratings), *r*(7) = 0.71, *p* = 0.03. The correlation between F0 variability and adult difference scores did not reach conventional significance levels, *r*(7) = 0.6, *p* = 0.086.
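The significance of a Pearson correlation with *n* = 9 can be checked by converting *r* to a *t* statistic with *n* − 2 = 7 degrees of freedom, using t = r·√(df / (1 − r²)). The Python sketch below is an illustrative check, not the authors' analysis; the function name is hypothetical, and the two-tailed critical value of 2.365 for df = 7 is taken from standard *t* tables.

```python
import math

def t_from_r(r, df):
    """Convert a Pearson correlation to a t statistic with df = n - 2."""
    return r * math.sqrt(df / (1 - r * r))

T_CRIT_DF7 = 2.365  # two-tailed critical t at alpha = .05, df = 7 (standard tables)

t_mean_f0 = t_from_r(0.71, 7)   # mean F0 difference vs. adult difference scores
t_sd_f0 = t_from_r(0.60, 7)     # F0 variability vs. adult difference scores

mean_f0_significant = t_mean_f0 > T_CRIT_DF7
sd_f0_significant = t_sd_f0 > T_CRIT_DF7
```

The first correlation yields *t* of roughly 2.67, which exceeds the critical value, whereas the second yields roughly 1.98, which does not. This matches the pattern of reported *p* values (0.03 and 0.086).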

#### **DISCUSSION**

The goal of the present study was to ascertain the ability of child CI users and young NH children to signal happiness and sadness by speech prosody alone. Children 4–7 years of age imitated utterances with conventional happy and sad prosody that had been produced by a 10-year-old child. Half of the model utterances had happy content and half had sad content, but all utterances were produced in both a happy and a sad manner. Adults listened to the model's version of an utterance before hearing each child's imitation of that utterance, rating how closely the imitation matched the model. In principle, the divergent content and expressive style could have been a source of confusion, leading to less adequate imitations of those utterances than for utterances with consistent content and style. Surprisingly, children, both hearing and deaf, produced better prosodic matches in the context of inconsistent content and prosody, which indicates that they can focus on prosody when imitating utterances even though they have difficulty doing so in emotional judgment tasks (Morton and Trehub, 2001). The lower mean F0 of the model's inconsistent happy utterances, like their sad utterances, may have contributed to children's greater ease of imitation. It is also possible that the discordant messages captured children's attention, increasing their sensitivity to the acoustic cues and leading to better imitations.

Both groups of children produced better imitations of sad prosody than happy prosody. Unquestionably, happy prosody is more engaging than sad prosody for listeners in general and children in particular, but it is more difficult to reproduce because of its greater pitch range and modulation (Banse and Scherer, 1996). For example, young children's imitations of expressive utterances such as exclamations or simulated animal sounds (e.g., meow) reveal a considerably smaller pitch range than that of older children (Nakata et al., 2012).

Although NH children and child CI users showed similar overall patterns of performance, their levels of performance differed significantly. NH children produced better imitations of happy and sad messages than did the child CI users, as reflected in adults' ratings. The imitations of child CI users, on average, were modest in quality rather than being poor, with mean ratings near the midpoint on the 10-point scale of similarity to the model. Acoustic analyses revealed that both groups of children used distinctive F0 cues for their happy and sad utterances, but even NH children, on average, failed to produce happy and sad utterances that were as highly contrastive in mean F0 and F0 variability as those of the older child model (see **Figure 2**). What is impressive, however, is that the best child CI users were equivalent to the best performing NH children. Sample utterances from the model and from one high-performing CI user can be found in Supplementary Materials.

The modest pitch modulation in many children's utterances increased the difficulty of judging their low-pass filtered utterances as happy or sad, as evident in the ratings and in listeners' comments after completing the task. Amplitude normalization removed obvious cues such as the higher overall amplitudes of happy than sad utterances although it preserved the greater amplitude variability of happy utterances. In general, speaking rate, especially vowel duration, distinguishes adults' happy from sad utterances (Scherer, 1986), but even the model did not use timing cues for such purposes. Although adults did not rate the utterances as particularly happy or sad, they assigned significantly higher (happier) ratings to the happy versions than to the sad versions both for NH children and for child CI users. Our finding of more differentiated ratings for NH children's utterances than for those of child CI users is consistent with reports of lesser prosodic expressiveness by child CI users (Lenden and Flipsen, 2007; Nakata et al., 2012). It is important to note, however, that distinctive productions of happy and sad speech remained distinctive after low-pass filtering (see Supplementary Materials for examples).

The happy and sad utterances of NH children differed in mean F0 and F0 variability, but F0 variability did not distinguish the happy and sad utterances of CI users. Perhaps the cluster of acoustic cues that predicts listeners' ratings is different for NH children and child CI users. Given the emotion perception (e.g., Hopyan-Misakyan et al., 2009) and prosodic production limitations (Lenden and Flipsen, 2007) reported in previous studies, child CI users' performance in the present study is impressive. The use of imitations rather than spontaneous speech reduced the processing demands on child CI users, perhaps optimizing performance. For example, emphatic stress is less problematic in imitated (Carter et al., 2002) than in spontaneous (Lenden and Flipsen, 2007) speech.

Unquestionably, device limitations increase the cognitive effort of listening in general (Pals et al., 2012) as well as emotion perception and production difficulties in particular (Peng et al., 2008; Hopyan-Misakyan et al., 2009; Nakata et al., 2012; Volkova et al., 2013). Remarkably, however, they do not preclude successful performance on such tasks by the best CI users (Peng et al., 2004b; Nakata et al., 2012; Volkova et al., 2013), including the top performers in the present study. The highest performing child CI users had a number of background factors associated with favorable outcomes, including early implantation (Tomblin et al., 2005) and highly educated and motivated parents (Teagle and Eskridge, 2010). Interestingly, these "star" children were also taking music lessons, which may have helped focus their attention on the pitch patterns and rhythms of speech. There is evidence linking music lessons in childhood to improved pitch perception (Chen et al., 2010) and enhanced linguistic abilities (Moreno et al., 2009). One practical implication of the findings is that therapeutic interventions for child CI users, which focus primarily on speech perception and speech intelligibility and secondarily on some aspects of prosody, would do well to expand their focus on emotional expressiveness.

In short, young child CI users effectively reproduce the prosody of happy and sad utterances, but their reproductions are less accurate than those of NH children. Despite the fact that child CI users provide fewer cues than NH peers to signal their happy and sad intentions, adults interpret their intentions at better than chance levels on the basis of prosodic cues alone. Child CI users, who were 5–7 years of age, spent one or more years without functional hearing, so their chronological age does not reflect their cumulative listening experience, as it does for NH children. It is important to ascertain whether the gap between the prosodic skills of young child CI users and NH children narrows or disappears over time either spontaneously or as a result of intervention.

#### **REFERENCES**


hearing listeners as measured by the Montreal battery for the evaluation of Amusia. *Ear Hear.* 29, 618–626. doi: 10.1097/AUD.0b013e318174e787


#### **AUTHOR NOTE**

All research reported in this paper was approved by local ethical committees. Funding for this project was provided by grants from the Social Sciences and Humanities Research Council of Canada (Sandra E. Trehub) and from the Comprehensive Research Experience for Medical Students at the University of Toronto (David J. Wang). We thank the families of participating children for their cooperation and Judy Plantinga for her assistance in implementing the experiments.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Emotion\_Science/10.3389/fpsyg.2013.00351/abstract


4, 7 and 8-year-old children. *Speech Commun.* 48, 1308–1318. doi: 10.1016/j.specom.2006.06.007


development in profoundly deaf children with cochlear implants. *Psychol. Sci.* 11, 153–158. doi: 10.1111/1467-9280.00231


on expressive language growth in infants and toddlers. *J. Speech Lang. Hear. Res.* 48, 853–867. doi: 10.1044/1092-4388(2005/059)


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 05 March 2013; accepted: 30 May 2013; published online: 21 June 2013.*

*Citation: Wang DJ, Trehub SE, Volkova A and van Lieshout P (2013) Child implant users' imitation of happy- and sad-sounding speech. Front. Psychol. 4:351. doi: 10.3389/fpsyg.2013.00351 This article was submitted to Frontiers in Emotion Science, a specialty of Frontiers in Psychology.*

*Copyright © 2013 Wang, Trehub, Volkova and van Lieshout. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

## Age-related differences in affective responses to and memory for emotions conveyed by music: a cross-sectional study

#### *Sandrine Vieillard<sup>1</sup>\* and Anne-Laure Gilet<sup>2</sup>*

*<sup>1</sup> Laboratoire de Psychologie (EA 3188), Psychology, Université de Franche-Comté, Besançon, France*

*<sup>2</sup> Laboratoire de Psychologie des Pays de la Loire (EA 4638), Université Nantes Angers Le Mans, Nantes, France*

#### *Edited by:*

*Petri Laukka, Stockholm University, Sweden*

#### *Reviewed by:*

*Jonna K. Vuoskoski, University of Oxford, UK*

*Michaela Riediger, Max Planck Institute for Human Development, Germany*

#### *\*Correspondence:*

*Sandrine Vieillard, Laboratoire de Psychologie (EA 3188), Université de Franche-Comté, 30 Rue Mégevand, 25030 Besançon, France e-mail: sandrine.vieillard@univ-fcomte.fr*

There is mounting evidence that aging is associated with the maintenance of positive affect and the decrease of negative affect to ensure emotion regulation goals. Previous empirical studies have primarily focused on a visual or autobiographical form of emotion communication. To date, little investigation has been done on musical emotions. The few studies that have addressed aging and emotions in music were mainly interested in emotion recognition, thus leaving unexplored the question of how aging may influence emotional responses to and memory for emotions conveyed by music. In the present study, eighteen older (60–84 years) and eighteen younger (19–24 years) listeners were asked to evaluate the strength of their experienced emotion on happy, peaceful, sad, and scary musical excerpts (Vieillard et al., 2008) while facial muscle activity was recorded. Participants then performed an incidental recognition task followed by a task in which they judged to what extent they experienced happiness, peacefulness, sadness, and fear when listening to music. Compared to younger adults, older adults (a) reported a stronger emotional reactivity for happiness than other emotion categories, (b) showed an increased zygomatic activity for scary stimuli, (c) were more likely to falsely recognize happy music, and (d) showed a decrease in their responsiveness to sad and scary music. These results are in line with previous findings and extend them to emotion experience and memory recognition, corroborating the view of age-related changes in emotional responses to music in a positive direction away from negativity.

**Keywords: aging, musical emotions, emotional responses, facial muscle activity, incidental recognition, positivity effect**

#### **INTRODUCTION**

Research on age differences in emotion processing has been mostly restricted to visual stimuli (e.g., facial expressions, videos, words, and pictures), but a growing body of research indicates that music also serves as a powerful emotional trigger. For example, Blood et al. (1999) have shown that classical musical stimuli which are selected to elicit intensely pleasant emotional responses engage neural networks that are implicated in reward. Neuropsychological studies have also demonstrated that the amygdala is recruited when processing scary music (e.g., Gosselin et al., 2005, 2007). Physiological data have also shown that music is a strong emotion inducer (e.g., Khalfa et al., 2002). Furthermore, music has the clear advantage of maintaining attention toward the emotions conveyed because it does not allow perceptual attention to be redirected (unless the listener takes off his/her headphones). For these reasons, music appears to be a viable means of testing age-related changes in emotion processing.

Among the past studies that have addressed age-related changes in musical emotion processing, most have focused on people's ability to *recognize* musical emotions (Allen and Brosgole, 1993; Laukka and Juslin, 2007; Drapeau et al., 2009; van Tricht et al., 2010; Lima and Castro, 2011; Vieillard et al., 2012). For instance, Drapeau et al. (2009) compared healthy elderly adults and elderly adults with Alzheimer's disease in their ability to rate the extent to which selected musical stimuli communicating happiness, peacefulness, sadness and fear (Vieillard et al., 2008) expressed each of these four emotions. Their findings showed that the recognition performance of healthy older adults was relatively preserved, with the highest recognition accuracy for happy stimuli. Laukka and Juslin (2007) compared young and older adults' ability to recognize anger, fear, happiness, sadness, and neutrality in short melodies performed on an electric guitar with different degrees of expressivity. In their study, the participants judged the emotional expression of each stimulus in a forced-choice task comprising anger, fear, happiness, sadness, neutral, and "other emotion" alternatives. Compared to young adults, older adults were less accurate in recognizing negative emotions such as sadness and fear, but their ability to recognize other emotion categories was spared. More recently, Lima and Castro (2011) went one step further by examining age-related changes in emotion recognition among three age groups of participants (i.e., young, middle-aged, and old). Selecting the same musical excerpts as those used in the study by Drapeau et al. (2009), the authors asked the participants to judge the emotional intensity *perceived* for each stimulus on four rating scales (i.e., happiness, sadness, peacefulness, and fear) presented simultaneously. Consistent with previous findings, the authors found an emotion-specific age-related change characterized by stable recognition of the happiness and peacefulness categories but a gradual decline in responsiveness to sad and scary music from young adulthood to older age.

As an alternative to the hypotheses of age-related differences in cognitive functioning<sup>1</sup> or hearing loss<sup>2</sup>, the above findings have been interpreted as the result of a combination of age-related changes in brain structure and functioning and in motivational goals. On the one hand, empirical evidence suggests that the decline in negative emotion recognition observed in older adults may be the result of a linear reduction in the volume of the amygdala (e.g., Zimmerman et al., 2006) and/or of a decrease in the reactivity of the amygdala to negative information (e.g., Mather et al., 2004). On the other hand, Socioemotional Selectivity Theory (Carstensen et al., 1999; Carstensen, 2006) suggests that the decline in the recognition of negative emotions would reflect a motivational shift toward emotionally meaningful goals due to an increased awareness of the limited perspective of time. This so-called "positivity effect," which refers to all combinations of *enhanced* processing of positive information and *reduced* processing of negative information, has been thought of as an emotion regulation strategy to preserve high levels of well-being in later life. In a recent literature review, Reed and Carstensen (2012) showed that this positivity effect (1) requires cognitive resources (e.g., Mather and Knight, 2005), (2) is sensitive to the experimental context (Kensinger et al., 2002; Grühn et al., 2005), and (3) is adaptive, i.e., it emerges when emotional well-being is prioritized. Consequently, the authors claimed that the positivity effect would represent a controlled shift in attentional resources rather than an *automatic* process associated with neuronal degeneration in brain regions. Such a view is compatible with the idea that the positivity effect may have a cognitive counterpart.

Recently, Vieillard et al. (2012) conducted a study to test for age-related differences in the psychological structure of musical emotions, and to assess whether these changes may be associated with a decrease in emotional complexity. In this research, younger and older participants were presented with musical excerpts conveying different emotions such as happiness, peacefulness, sadness, and fear (Vieillard et al., 2008). Participants were asked to perform an emotional judgment task using different rating scales (i.e., valence, hedonic value, arousal, and liking) as well as a free categorization task in which they freely created emotional categories based on the perceived acoustical cues. Findings showed age-related differences characterized by a reduced processing of arousal for scary music, an increased focus on happy music, and an emotional dedifferentiation corresponding to a decrease in differentiation between the arousal and valence dimensions. Such results have been explained within the framework of the Dynamic Integration Theory (Labouvie-Vief, 2009; Labouvie-Vief et al., 2010), which postulates that the degradation of emotional complexity would be the cognitive counterpart of older adults' attempt to maximize positive affect and minimize negative affect in order to preserve well-being.

In short, the studies reviewed above show converging evidence for a positivity effect in how emotions in music are perceived, categorized, and recognized with advancing age. However, several questions remain open. First, little is known about age-related changes regarding the emotions *experienced* while listening to music. Previous findings have suggested that participants were more accurate in their judgment of intended emotions in musical excerpts when focusing on their own emotional experience (Vieillard et al., 2008). One of the goals of the present study was thus to investigate the influence of aging on emotion processing while being personally engaged in musical listening. This is an important question because it has been suggested that emphasizing emotion rather than knowledge may be more meaningful to older adults (e.g., Mikels et al., 2010). Second, as far as we know, the possibility of age-related changes in memory recognition for positive musical stimuli has yet to be examined. Past research showed that age-related changes in emotional goals influence memory. A positivity effect in memory recognition tasks has already been shown in older adults for affective pictures (e.g., Charles et al., 2003; Mather and Carstensen, 2005) and for words (e.g., Kensinger, 2008). The main explanation was that memory can work as an elaborative process to regulate emotions such that the older adults' goal to maintain well-being would influence mental constructions of the past and thus lead to a positivity effect in the way they remember events. In line with this hypothesis, it has been shown that memory for negative pictures decreased in older adults both in recall and recognition tasks (e.g., Charles et al., 2003). Kensinger (2008) found a positivity effect in older adults for non-arousing words, explaining this as an age-related difference in the way positive information was primarily processed as a function of differences in motivational goals at each age level.
To date, the question remains open whether age-related differences in the elaborative processing of memory may be observed in a non-verbal channel of emotion communication such as music. Third, to our knowledge, no previous study has investigated age differences in facial muscle activity when listening to music. Past studies examining age-related differences in facial expressiveness have found that young and older adults express similar patterns of facial responding to visual stimuli such as emotional scenes, objects or faces (Reminger et al., 2000; Smith et al., 2005; Bailey et al., 2009). However, older adults compared to younger adults may exhibit diminished reactivity in facial expressiveness (Smith et al., 2005; Burriss et al., 2007). The reduction of facial expressiveness in the elderly has been thought to be a possible consequence of general physiological losses in the nervous system. However, another explanation suggests that the reduction of facial expressiveness may reflect an attempt to regulate emotion since facial expression may be motivationally driven (Smith et al., 2005). Given this emotion regulation hypothesis and in line with the embodiment theory of emotion (Niedenthal et al., 2005), one can imagine that facial expressions may not only help to down-regulate emotion (by displaying fewer facial expressions), but also allow the emotional reaction to be modified (by displaying a facial expression contrary to what one feels). In this perspective, facial electromyography (EMG) appears to be an interesting indicator of whether there is congruence between facial expressivity and musical emotion in both young and older adults, or whether older adults express positive facial expressions as a means to counteract negative emotions.

<sup>1</sup>Emotion categories that were less accurately recognized by older adults were not those known to be the hardest to recognize.

<sup>2</sup>Age groups did not differ or only differed marginally on self-reported hearing loss.

There is agreement that emotional responding is a multi-component process, giving rise to affective experiences, physiological adjustments and expressive behaviors (e.g., Scherer, 2005). These various aspects of emotion may be differentially influenced by age. Therefore, a unifying view of these changes is necessary to give more insight into the lifespan developmental course of emotion, particularly in the musical domain in which this topic has remained unexplored. To this end, we focused on different indexes of emotional response to music, namely subjective experience, emotion expression and memory recognition for musical excerpts that conveyed different emotions.

Finally, although the positivity effect has been observed across a number of experimental paradigms such as dot-probe tasks (Charles et al., 2003), eye-tracking paradigms (Isaacowitz et al., 2006a,b), working memory (Mikels et al., 2005), memory recognition and free recall tasks (Charles et al., 2003), and across a variety of stimuli (e.g., pictures, word lists, facial expressions), the robustness of this phenomenon has been mainly demonstrated through the visual channel of emotion communication. Consequently, research regarding the effect of age on emotion processing in music is needed to test for the generalizability of the positivity effect.

#### **CURRENT RESEARCH**

The present study was designed to further extend previous studies and expand experimental designs to the domain of music. Our aim was twofold: first, to investigate the effects of aging on the emotion felt when listening to music and second, to address age-related differences in memory recognition for musical excerpts as a function of their intended emotion. To this end, we used a set of musical stimuli expressing happiness, peacefulness, sadness, and fear which were all controlled for valence and arousal (Vieillard et al., 2008). We used a rating task focused on the emotion experienced by the participants rather than on the emotion recognized by the participants. In order to address more extensively the effect of aging on the experienced emotion, we designed the experiment so that the subjective report of the intensity of the emotion felt was coupled with a recording of participants' facial expressions. These particular indexes were chosen since past research has shown that facial expressions, measured by the corrugator (i.e., frowning) and zygomatic (i.e., smiling) muscle activity, were mostly related to the valence in music: positive emotions generally lead to increased zygomatic activity, while negative ones were associated more with increased corrugator activity (e.g., Witvliet and Vrana, 2007; Khalfa et al., 2008). Because it has been suggested that facial EMG may also be voluntarily modulated to serve emotion regulation goals (Smith et al., 2005), we used this index to examine to what extent and how older adults show positive or negative facial expressions as a function of musical emotions.

Based on the hypothesis postulating a motivated attention toward positivity with advancing age, and given the view that situations relevant to a person's motivational goals may elicit more intense emotional experience (e.g., Charles and Piazza, 2007), we expected that older adults, compared to younger adults, would judge their feeling as more intense when listening to positive musical excerpts (especially happy music, which is more arousing than peaceful music) than when listening to negative musical excerpts. Since no age-related changes in facial expressivity were found in previous research (e.g., Levenson et al., 1991; Tsai et al., 2000; Magai et al., 2006), we also expected older adults to be spared in their facial expressions (i.e., corrugator and zygomatic muscle activity). More specifically, if older adults have spontaneous facial activity, it is expected that they would display greater zygomatic activity for positive music in comparison with their younger counterparts. At the same time, it is also expected that older adults would display voluntary facial expression as a tool to manage emotion. Under this hypothesis, older adults would show reduced expressivity or incongruent expression, in particular in response to negativity. Moreover, in view of the scarce data available on the influence of aging on memory recognition for musical excerpts that convey different emotions, and because the memory elaborative processes for music are based on more abstract information than those involved in the memory for visual and autobiographical material, it is difficult to predict the nature of the effects likely to be observed. Consequently, we adopted an exploratory approach to test whether the positivity effect may be generalized to memory recognition for musical excerpts conveying different emotions. We expected that, compared to younger adults, older adults would recognize positive musical excerpts better than negative ones.

### **METHOD**

#### **PARTICIPANTS**

A total of 40 native French-speaking volunteers (22% amateur musicians<sup>3</sup>) participated in the present study. Exclusion criteria included the presence of uncorrected hearing loss, medical or psychiatric antecedents, psychotic symptoms, and history of substance abuse. As a result, the data of 18 young adults (19–24 years, *M* = 21 years; 61% female) and 18 older adults (60–84 years, *M* = 66 years; 83% female) were analyzed. Younger and older adults were recruited, respectively, at the psychology department of the University of Franche-Comté and through senior social programs in Besançon. Participants did not receive financial compensation for their participation.

#### **APPARATUS**

Participants were tested individually in a quiet room at stable ambient temperature at the University. Facial muscle activity was monitored continuously during the listening and rating phases using an MP150 Biopac system (Biopac Systems, Inc., Goleta, CA) at a sampling rate of 500 Hz and processed using AcqKnowledge software. E-Prime software (Schneider et al., 2002) was used for excerpt presentation and rating recording. Musical excerpts were presented binaurally through Professional 240 Sennheiser headphones.

<sup>3</sup>The musical training was measured as the proportion of participants who received at least 3 years of formal training and who were still practicing a musical instrument without reaching professional levels.

#### **MATERIALS**

Forty short musical excerpts, computer-generated in a piano timbre and taken from the Vieillard et al. (2008) set of unfamiliar musical stimuli, were selected for their power to convey four distinct emotions (i.e., happiness, peacefulness, sadness, and fear). Musical excerpts were controlled for their valence (unpleasant vs. pleasant) and arousal (low vs. high). Each emotion category included ten musical excerpts that lasted an average of 10 s. The happy excerpts were written in a major mode at an average tempo of 137 Metronome Markings (MM range: 92–196), with the melodic line lying in the medium high pitch range (the pedal was not used). The peaceful excerpts were composed in a major mode, had an intermediate tempo (mean: 74 MM, range: 54–100), and were played with pedal and arpeggio accompaniment. The sad excerpts were written in a minor mode at a slow average tempo of 46 MM (range: 40–60), with the pedal. The scary excerpts were composed with minor chords on the third and sixth degree, hence implying the use of many out-of-key notes. Although most scary excerpts were regular and consonant, a few had irregular rhythms and were dissonant. Their tempo varied from 44 to 172 MM. Examples can be heard at www.brams.umontreal.ca/peretz. A previous study that was conducted to examine the effect of age on emotion perception in music demonstrated that older listeners successfully distinguished happiness, peacefulness, sadness, and fear conveyed by these musical excerpts (Vieillard et al., 2012). In a study phase described below, participants were presented with 20 musical excerpts (i.e., 5 happy, 5 peaceful, 5 sad, and 5 scary) and were instructed to indicate what they experienced in terms of Emotional Intensity. The 20 remaining musical excerpts were then used as lures in the incidental recognition task.

#### **PROCEDURE**

The experiment was divided into two sessions separated by an interval of ∼1 week. During the first session, participants completed a consent form and were asked about their age, musical listening, education level, self-reported health, visual and auditory acuity, and medical history. Auditory perception was controlled using free AudioTest software (www.cotral.com). More specifically, it was assessed by presenting pure tones at intervals between 500 and 8000 Hz to both ears through Professional 240 Sennheiser headphones. For each participant, the lowest sound pressure level at which each frequency was detected was recorded. In addition, several tests assessing general cognitive function (MMSE, Petit et al., 1998), fluid intelligence (Raven's progressive matrices, set I, Raven et al., 1998, updated 2003), and working memory (letter-number sequencing from WAIS-III, Wechsler, 2000) were administered. The first session lasted about an hour.

During the second session, physiological sensors were attached while the participants sat comfortably in a quiet room in the presence of the experimenter. To prevent participants from focusing on their facial muscles, they were informed that the electrodes placed on their face were used to record their electrodermal activity during the experiment. At the beginning of the session, two musical excerpts different from those used in the experiment were played in order to adjust the volume of the headphones for each participant. In the study phase, two practice excerpts (1 happy and 1 peaceful) followed by 20 excerpts (5 of each emotion) were then presented binaurally. After each trial, participants were asked to rate the intensity of the emotion felt using a 10-point scale ranging from 0 "weak" to 9 "strong." Facial muscle activity was also recorded. The excerpts were presented in two pseudo-randomized orders that were created to ensure that no more than two excerpts of the same emotion category were presented consecutively. Each musical excerpt was preceded and followed by two baseline periods of at least 10 s of silence.
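The ordering constraint described above (no more than two consecutive excerpts from the same emotion category) can be met by simple rejection sampling. The following Python sketch is illustrative only: the article does not report how the two orders were generated, so the shuffle-and-check strategy, the function names, the category labels, and the `seed` parameter are all assumptions.

```python
import random

# Assumed labels for the four emotion categories used in the study phase.
EMOTIONS = ("happy", "peaceful", "sad", "scary")


def max_run_length(labels):
    """Return the length of the longest run of identical consecutive labels."""
    longest = current = 0
    previous = object()  # sentinel that never equals a real label
    for label in labels:
        current = current + 1 if label == previous else 1
        longest = max(longest, current)
        previous = label
    return longest


def pseudo_randomized_order(per_emotion=5, max_run=2, seed=None):
    """Shuffle (emotion, index) trials until no run exceeds max_run.

    Hypothetical helper: reshuffles the full trial list and keeps the first
    order that satisfies the constraint.
    """
    rng = random.Random(seed)
    trials = [(emotion, i) for emotion in EMOTIONS for i in range(per_emotion)]
    while True:
        rng.shuffle(trials)
        if max_run_length([emotion for emotion, _ in trials]) <= max_run:
            return list(trials)
```

Calling `pseudo_randomized_order(seed=1)` and `pseudo_randomized_order(seed=2)` would give two fixed orders of 20 trials each; any other method that enforces the same constraint would serve equally well.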

Before the incidental recognition task, participants completed two questionnaires to assess depression (BDI-II; Beck et al., 1998) and anxiety (STAI; Spielberger, 1993). In the recognition task participants were asked to indicate whether an excerpt had been heard before ("old") or not ("new") by pressing the appropriate key. In this phase, 40 musical excerpts (i.e., 20 old excerpts and 20 new excerpts) were randomly presented.

Finally, participants were instructed to listen to the same set of 40 musical excerpts presented binaurally in a randomized order. After each musical excerpt, participants were asked to judge to what extent they experienced "happiness," "peacefulness," "sadness," or "fear" using a 10-point scale ranging from 0 "not at all" to 9 "a lot." Accordingly, each excerpt was presented four times; each presentation was associated with one of the four emotion scales. The presentation order of each musical excerpt and each rating scale was fully randomized across participants. The second session lasted about 2 h. At the end of the session, participants were fully debriefed.

#### **DATA ACQUISITION AND TRANSFORMATION**

Facial EMG activity (μVolts) was recorded over the left corrugator and zygomatic sites, using two pairs of 8 mm Ag/AgCl shielded electrodes filled with isotonic gel. The EMG data were band-pass filtered from 100 to 500 Hz and processed with a root mean square algorithm over 20 samples (with a 100-ms window). Recording artifacts were visually identified and discarded from the sample. These corresponded to less than 0.5% of all measurements.
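As an illustration of the smoothing step, the sketch below computes a moving root mean square over a trailing window. It is not the AcqKnowledge implementation: the trailing (rather than centered) window, the handling of the first few samples, and the function name are assumptions, and the window length is left as a parameter because the text states it both in samples and in milliseconds.

```python
import math


def moving_rms(signal, window):
    """Root mean square of `signal` over a trailing window of `window` samples.

    The first window - 1 outputs use as many samples as are available
    (an assumption; other edge conventions are possible).
    """
    if window < 1:
        raise ValueError("window must be at least 1 sample")
    squares = [x * x for x in signal]
    out = []
    running = 0.0
    for i, sq in enumerate(squares):
        running += sq
        if i >= window:
            running -= squares[i - window]  # drop the sample leaving the window
        n = min(i + 1, window)
        # Guard against tiny negative values caused by floating-point error.
        out.append(math.sqrt(max(running, 0.0) / n))
    return out
```

For example, `moving_rms(emg_samples, window=20)` would apply a 20-sample window to a list of filtered EMG values, where `emg_samples` is a hypothetical variable name.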

#### **RESULTS**

#### **SAMPLE CHARACTERISTICS**

Younger adults reported more years of education than the older adults, *t*(34) = −2.33, *p* < 0.05. A chi-square goodness-of-fit test (χ<sup>2</sup>) indicated no significant differences between age groups in the proportion of formal musical training of at least 3 years, χ<sup>2</sup>(1, *N* = 36) = 4.5, *p* > 0.05. Age groups did not differ regarding their depression, state anxiety, or trait anxiety scores<sup>4</sup>.

<sup>4</sup>BDI-II scores indicated that 75% (*n* = 27) of the participants scored below the cut-off (score of 11) for a minimum depression symptomatology and that 19.44% (*n* = 7) of the participants scored below the cut-off (score of 19) for at least mild depression. The remaining two participants (one young and one older adult) scored between 22 and 24, which indicates moderate depression symptomatology. STAI state anxiety scores indicated that 83.33% (*n* = 30) of the participants scored below the cut-off (score of 35) for very low state anxiety, 13.89% (*n* = 5) of them scored below the cut-off (score of 45) for low state anxiety, and that 2.78% (*n* = 1) of them scored below the cut-off (score of 55) for mild state anxiety. STAI trait anxiety scores indicated that 30.56% (*n* = 11) of the participants scored below the cut-off (score of 35) for very low

Non-parametric Mann-Whitney *U*-tests<sup>5</sup> performed on the auditory thresholds (dB) did not show statistically significant differences between younger and older adults (*U* = 10.50, *z* = 0.77, *p* = 0.44 for 500 Hz; *U* = 0, *z* = 0, *p* = 1 for 1000 Hz; *U* = 10, *z* = 0.84, *p* = 0.40 for 2000 Hz; *U* = 8, *z* = 1.12, *p* = 0.26 for 4000 Hz; *U* = 33, *z* = 0.23, *p* = 0.81 for 8000 Hz). As expected, younger adults scored better than older adults on fluid intelligence (Raven's progressive matrices, set I, Raven et al., 1998, updated 2003), *t*(34) = −2.84, *p* < 0.05. Younger adults tended to perform better than older adults on a working memory test (Digit Span from WAIS-III, Wechsler, 2000), *t*(34) = −1.94, *p* = 0.06. There were no significant age differences on self-reported health, *t*(34) = −0.81, *p* = 0.42. Finally, the Mini Mental State Examination (MMSE; Petit et al., 1998) scores for the older adults suggested no apparent signs of dementia (*M* = 29.7, range: 28–30). Sample characteristics are detailed in **Table 1**.

#### **EMOTION INTENSITY FELT**

A mixed model analysis of variance was conducted on the mean score of the Emotional Intensity Felt with Age Group (younger adults, older adults) as the between-subjects factor and Intended Emotions (happiness, peacefulness, sadness, fear) as the within-subjects factor.

*Standard deviations are listed in parentheses. \*Significant difference at p* < *0.05.*

As illustrated in **Figure 1**, we found a significant Age Group by Intended Emotion interaction, *F*(3, 102) = 3.75, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.06.<sup>6</sup> In order to test our hypothesis, we computed a planned comparison between the emotion intensity felt by young adults and that felt by older adults when listening to happy music. As expected, older adults compared to young adults reported experiencing higher emotional intensity when listening to happy music, *F*(1, 34) = 5.57, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.14. This reactivity to positivity in older adults was also confirmed by another set of planned comparisons indicating that older adults reported experiencing higher emotional activation when listening to happy music than when listening to sad music, *F*(1, 34) = 7.34, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.18, or scary music, *F*(1, 34) = 8.76, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.20, while younger adults did not. No other significant effect was found. A separate analysis with years of education, fluid intelligence scores, and working memory performances (i.e., factors that were found to be different between the two age groups) as covariates indicated that the Age Group by Intended Emotion interaction remained significant, *F*(3, 93) = 3.14, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.05.

#### **FACIAL MUSCLE ACTIVITY**

Facial EMG responses were calculated as the difference between the signal (area under the curve, μV·s) over the time course of the musical excerpt and a baseline EMG level measured during the 1 s preceding the onset of the excerpt (time −1 to 0 s). The area under the curve was extracted within these two time windows and was averaged for each condition and for each participant. Two participants (one younger and one older adult) were excluded from the initial sample due to technical problems. Analyses were therefore conducted on seventeen younger and seventeen older adults. The Shapiro-Wilk normality test reached significance for the EMG data, meaning that the assumption of normality had to be rejected. As a result, statistical analyses were performed using non-parametric tests. First, the differences between age groups were tested separately for zygomatic and corrugator activity using the Mann-Whitney *U*-test. Results showed a significant effect of Age Group both for the zygomatic muscle (*U* = 42, *z* = 3.51, *p* < 0.001) and for the corrugator muscle (*U* = 83, *z* = 2.10, *p* < 0.05), indicating that facial activity was greater in older adults than in their younger counterparts. Friedman repeated measures analyses of variance (RM-ANOVA) were conducted separately on zygomatic and on corrugator activity to test the effect of the Intended Emotion factor within each Age Group. For the zygomatic muscle, data revealed a significant effect of Intended Emotion in older adults, χ<sup>2</sup> = 12.46, *df* = 3, *p* < 0.05, but not in young adults, χ<sup>2</sup> = 2.01, *df* = 3, *p* = 0.57. As shown in **Figure 1**, older adults showed increased zygomatic activity, in particular for scary music. Regarding corrugator activity, no significant effect of Intended Emotion was found either in older adults, χ<sup>2</sup> = 3.00, *df* = 3, *p* = 0.39, or in younger adults, χ<sup>2</sup> = 5.68, *df* = 3, *p* = 0.13.

state anxiety, 41.67% (*n* = 15) scored below the cut-off (score of 45) for low trait anxiety, and 19.44% (*n* = 7) scored below the cut-off (score of 55) for moderate trait anxiety. The remaining two participants (one young and one older adult) scored between 57 and 58, which corresponds to a high anxiety state.

<sup>5</sup>Because the audio test software did not provide absolute thresholds, we treated the lowest sound pressure level (dB) at which each frequency was detected as ordinal data. For this reason, the non-parametric Mann-Whitney test (independent samples) was used. Although hearing thresholds varied as a function of age, the Mann-Whitney test did not reveal significant differences between age groups, probably because of the variability observed within the groups, in particular among the older adults.

<sup>6</sup>We computed generalized eta squared statistics (η<sup>2</sup><sub>*G*</sub>) in order to obtain measures of effect size comparable across a wide variety of research designs (Bakeman, 2005), regardless of whether the factor is between or within subjects. These effect-size measures provide indices of effect that are consistent with Cohen's (1988) guidelines, according to which η<sup>2</sup><sub>*G*</sub> = 0.01 corresponds to a small effect, η<sup>2</sup><sub>*G*</sub> = 0.09 to a medium effect, and η<sup>2</sup><sub>*G*</sub> = 0.25 to a large effect.
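As a rough illustration of the group comparison reported above, the following sketch computes the Mann-Whitney *U* statistics from two samples and converts a *U* value to *z* with the standard large-sample normal approximation. It assumes no tie or continuity correction; the article does not state which correction, if any, was applied, and the function names are illustrative rather than taken from the authors' analysis.

```python
import math


def u_statistics(x, y):
    """Return (U_x, U_y): U_x counts pairs where a value of x exceeds a
    value of y, with ties counted as 0.5; U_y is the complement."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


def z_from_u(u, n1, n2):
    """Normal approximation for U (no tie or continuity correction)."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs(u - mean_u) / sd_u


# With the reported U values and 17 participants per group:
print(round(z_from_u(42, 17, 17), 2))  # 3.53
print(round(z_from_u(83, 17, 17), 2))  # 2.12
```

Under these assumptions the approximation gives values close to the reported *z* = 3.51 and *z* = 2.10; the small gap may reflect a correction for ties, which this sketch omits.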

**(happiness, peacefulness, sadness, fear) and age group (younger adults, older adults).**

#### **INCIDENTAL RECOGNITION TASK**

Proportions of hits, false alarms, and corrected recognition scores (hits minus false alarms) are reported in **Table 2**. An analysis of variance was conducted on corrected recognition scores with Age Group (younger adults, older adults) as the between-subjects factor and Intended Emotion (happiness, peacefulness, sadness, fear) as the within-subjects factor. The analysis showed a significant main effect of Age Group, indicating that younger adults recognized more musical stimuli than older adults, *F*(1, 34) = 30.87, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.50. There was also a significant main effect of Intended Emotion, *F*(3, 102) = 4.86, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.12, indicating that scary music stimuli were better recognized than peaceful and sad music stimuli. This was confirmed by *post-hoc* Bonferroni comparisons (*p*s < 0.05). No significant Age Group by Intended Emotion interaction was observed, *F*(3, 102) = 0.27, *p* = 0.84, η<sup>2</sup><sub>*G*</sub> = 0.01. Younger adults' recognition performance was at or above chance level only for the highly arousing musical stimuli, namely happy (52%) and scary (61%) ones, while older adults' recognition performance varied between 10 and 39% across the four intended emotions.

Additional separate analyses conducted on hits and false alarms revealed that older adults generated more false alarms than younger adults, *F*(1, 34) = 18.47, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.35. There was also a significant main effect of Intended Emotion, *F*(3, 102) = 22.02, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.36, as well as a significant Intended Emotion by Age Group interaction for false alarm rates, *F*(3, 102) = 4.85, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.08. *Post-hoc* Bonferroni comparisons showed that the older listeners, compared with their younger counterparts, had more difficulty correctly rejecting new happy musical stimuli (*p* < 0.05). Moreover, older adults were better at correctly rejecting new scary musical stimuli than new stimuli of all other intended emotions (*p*s < 0.001), while younger adults were better at correctly rejecting new scary musical stimuli only in comparison to low-arousal stimuli, namely peaceful (*p* < 0.05) and sad (*p* < 0.05) ones. No significant effect or interaction was found for hit rates. Again, the analysis of covariance with years of education, fluid intelligence scores, and working memory performance as covariates showed that the Intended Emotion by Age Group interaction remained significant, *F*(3, 93) = 3.70, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.07.

D-prime, which indicates the ability to discriminate between true targets and false targets (Green and Swets, 1966), was calculated using the tables for d-prime and beta available in Hochhaus (1972) and analyzed using another mixed ANOVA with Age Group (younger adults, older adults) as the between-subjects factor and Intended Emotion (happiness, peacefulness, sadness, fear) as the within-subjects factor. We obtained a significant main effect of Age Group, *F*(1, 34) = 26.41, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.44, indicating that, overall, older adults showed lower d-prime scores (*M* = 0.43, *SE* = 0.15) than younger adults (*M* = 1.50, *SE* = 0.15). This suggests a lower sensitivity in the discrimination of true musical excerpts from false musical excerpts in older adults. No other significant effect was found with d-prime as the dependent variable. The beta value, which indicates the minimum level of activation necessary for a participant to respond to a true target (Green and Swets, 1966), was also calculated (Hochhaus, 1972). No significant main effect or interaction was found with the beta value as the dependent variable.
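The recognition measures described above can be sketched as follows. The study obtained d-prime and beta from the Hochhaus (1972) tables; this sketch instead uses the closed-form equal-variance Gaussian signal detection formulas, which is an assumption about what those tables encode. The function names and example rates are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def corrected_recognition(hit_rate, fa_rate):
    """Corrected recognition score: hits minus false alarms."""
    return hit_rate - fa_rate


def d_prime(hit_rate, fa_rate):
    """Sensitivity: z(hit rate) - z(false alarm rate).
    Rates must lie strictly between 0 and 1."""
    return _STD_NORMAL.inv_cdf(hit_rate) - _STD_NORMAL.inv_cdf(fa_rate)


def beta(hit_rate, fa_rate):
    """Likelihood-ratio criterion: values above 1 indicate a conservative
    response bias, values below 1 a liberal one."""
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)
    return math.exp((z_fa ** 2 - z_hit ** 2) / 2)


# Hypothetical rates for one participant and one emotion category
hits, false_alarms = 0.80, 0.20
print(corrected_recognition(hits, false_alarms))
print(d_prime(hits, false_alarms))
print(beta(hits, false_alarms))
```

Hit or false alarm rates of exactly 0 or 1 have no finite z-score, so in practice they would need an adjustment before these functions are applied; the article does not say how such cases were handled.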

#### **DIFFERENTIATION IN EMOTION FELT**

As in previous studies (Vieillard et al., 2008), we derived the best label attributed to each musical excerpt by each participant. This was done by selecting the label (i.e., happy, peaceful, sad, scary) that had received the maximal rating. When the maximal rating corresponded to the label that matched the intended emotion, a score of 1 was given. When the maximal rating did not correspond to the intended emotion, a score of 0 was given. When the highest rating was given for more than one label, the response was considered ambivalent and received a score of 0. For example, when an excerpt was perceived as eliciting both peacefulness and sadness to the same degree (e.g., with a rating of 7), it was considered ambivalent. Best label scores are presented in **Table 3**.

**Table 3 | Mean percentage of the label that received the maximal rating of Emotion Felt by younger and older listeners as a function of the Intended Emotions.**

*Bold indicates the match between Emotion Felt and Intended Emotions. Ambivalent responses correspond to highest ratings given to more than one label.*
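The best-label scoring rule described above can be sketched in a few lines. The dictionary layout and function name are illustrative assumptions; only the rule itself (maximal rating, match scored 1, mismatch or tie scored 0) comes from the text.

```python
def best_label_score(ratings, intended):
    """Return (best_label, score) for one excerpt rated by one participant.

    ratings  -- dict mapping each label to the rating it received
    intended -- the emotion the excerpt was designed to convey
    """
    top = max(ratings.values())
    winners = [label for label, rating in ratings.items() if rating == top]
    if len(winners) > 1:
        # Highest rating shared by several labels: ambivalent, scored 0
        return "ambivalent", 0
    best = winners[0]
    return best, 1 if best == intended else 0


# Hypothetical ratings for a peaceful excerpt
print(best_label_score({"happy": 2, "peaceful": 7, "sad": 7, "scary": 0}, "peaceful"))
# ('ambivalent', 0)
print(best_label_score({"happy": 2, "peaceful": 8, "sad": 5, "scary": 0}, "peaceful"))
# ('peaceful', 1)
```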

A mixed model analysis of variance was conducted on the mean Best Label with Age Group (younger adults, older adults) as a between-subjects factor and Intended Emotion (happiness, peacefulness, sadness, fear) and Emotion Felt (happiness, peacefulness, sadness, fear) as within-subjects factors. Significant main effects of Age Group, *F*(1, 34) = 7.43, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.18, and Emotion Felt, *F*(3, 102) = 5.87, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.14, were found. As expected, results also indicated a significant Intended Emotion by Emotion Felt interaction, *F*(9, 306) = 71.45, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.58, as well as a significant Intended Emotion by Emotion Felt by Age Group interaction, *F*(9, 306) = 5.10, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.09. No other significant main effect or interaction was observed. We first compared the emotion experienced by younger and older adults for each intended emotion. The results indicated that older adults, compared to younger adults, reported experiencing lower levels of sadness when listening to sad music, *F*(1, 34) = 5.44, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.14, but reported higher levels of sadness when listening to scary music, *F*(1, 34) = 6.35, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.16.
Moreover, when listening to scary music, older adults reported experiencing lower levels of fear than their younger counterparts, *F*(1, 34) = 18.78, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.36. The second set of comparisons was conducted to compare the emotion experienced for each intended emotion within each age group. The results showed that older adults reported similar levels of sadness and peacefulness when listening to peaceful music, *F*(1, 34) = 1.92, *p* = 0.18, η<sup>2</sup><sub>*G*</sub> = 0.05, and sad music, *F*(1, 34) = 0.85, *p* = 0.36, η<sup>2</sup><sub>*G*</sub> = 0.03, as well as similar levels of sadness and fear when listening to scary music, *F*(1, 34) = 0.44, *p* = 0.51, η<sup>2</sup><sub>*G*</sub> = 0.01. Younger adults reported similar levels of sadness and peacefulness only when listening to peaceful music, *F*(1, 34) = 0.73, *p* = 0.40, η<sup>2</sup><sub>*G*</sub> = 0.02. Happy music was the only music that primarily elicited happiness (when compared with the level of peacefulness felt) in both young and older adults, *F*(1, 34) = 117.63, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.78, and *F*(1, 34) = 105.92, *p* < 0.001, η<sup>2</sup><sub>*G*</sub> = 0.76, respectively.
The Intended Emotion by Emotion Felt by Age Group interaction remained significant when years of education, fluid intelligence scores, and working memory performance were entered as covariates, *F*(9, 279) = 2.90, *p* < 0.05, η<sup>2</sup><sub>*G*</sub> = 0.06.

#### **THE RELATIONSHIP BETWEEN AGE AND DEPENDENT MEASURES**

In order to check for any relationships between age and the different dependent measures (i.e., emotion intensity felt, physiological responses to music, recognition accuracy, and the type of emotion felt) for each of the four intended emotions, we computed a series of correlations. Because age was significantly correlated with years of education, *r*(34) = −0.41, *p* < 0.05, fluid intelligence, *r*(34) = −0.48, *p* < 0.05, working memory, *r*(34) = −0.35, *p* < 0.05, and each of the five measures of auditory thresholds, *r*(34) = 0.56, *p* < 0.001 for 500 Hz; *r*(34) = 0.54, *p* < 0.001 for 1000 Hz; *r*(34) = 0.61, *p* < 0.001 for 2000 Hz; *r*(34) = 0.42, *p* < 0.05 for 4000 Hz; *r*(34) = 0.75, *p* < 0.001 for 8000 Hz, these variables were controlled for in partial correlations. Results indicated that the mean scores of hits, corrected recognition, and d-prime for peaceful music were negatively and significantly correlated with age, *r*(34) = −0.34, *p* < 0.05 for hits, *r*(34) = −0.51, *p* < 0.05 for corrected recognition, and *r*(34) = −0.53, *p* < 0.05 for d-prime. Similarly, age was negatively and significantly correlated with the mean scores of corrected recognition and d-prime for happy music, *r*(34) = −0.69, *p* < 0.001 for corrected recognition, and *r*(34) = −0.65, *p* < 0.001 for d-prime, while it was positively and significantly correlated with the mean score of false alarms, *r*(34) = 0.63, *p* < 0.001. Altogether, the data indicated that the older the participants were, the lower their ability to discriminate studied positive musical excerpts conveying peacefulness and happiness from unstudied ones. The mean beta index for sad music was also positively and significantly correlated with age, *r*(34) = 0.50, *p* < 0.05, suggesting that the older the participants were, the more conservative they were in discriminating studied from unstudied musical excerpts conveying sadness.
Moreover, data indicated that the older the participants were, the stronger their experience of sadness while listening to peaceful music, *r*(34) = 0.46, *p* < 0.05, and the weaker their experience of fear while listening to scary music, *r*(34) = −0.49, *p* < 0.05. No other significant correlations were found.
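To make the partial-correlation step concrete, here is a minimal sketch that removes the linear influence of any number of covariates from a Pearson correlation using the textbook recursive formula. The article does not report the software or exact procedure used, so this is an illustration under that assumption, and all names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def partial_r(x, y, covariates):
    """Correlation between x and y controlling for a list of covariates.

    Peels off one covariate at a time:
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    where every correlation on the right is itself partialled on the
    remaining covariates.
    """
    if not covariates:
        return pearson_r(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: x and y are related only through the covariate z
z = [-3, -1, 1, 3]
x = [-2, -2, 0, 4]
y = [-4, 2, -2, 4]
print(pearson_r(x, y))       # clearly positive raw correlation
print(partial_r(x, y, [z]))  # essentially zero once z is controlled for
```

The recursion is exponential in the number of covariates, which is acceptable for the eight control variables listed above but would call for a regression-based approach with many more.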

#### **THE RELATIONSHIP BETWEEN EMOTION INTENSITY FELT, FACIAL MUSCLE ACTIVITY, AND RECOGNITION PERFORMANCES**

For each age group, we investigated to what extent the emotion intensity felt during the first presentation of musical stimuli was linked to the physiological responses as well as to the subsequent cognitive performances on the incidental recognition task. The relationship between physiological reactions and recognition performances was also examined. In younger adults, results indicated that the stronger the emotion intensity felt in response to sad music, the higher the hits, *r*(14) = 0.54, *p* < 0.05. In older adults, results showed that the stronger the emotion intensity felt for happy music, the higher the false alarms, *r*(14) = 0.54, *p* < 0.05. No other significant correlations were found.

#### **DISCUSSION**

In this study, we investigated how the emotional experience as well as memory recognition for musical excerpts eliciting four different emotions (happiness, sadness, peacefulness, and fear) may change with age. To this end, younger and older listeners were asked to evaluate the intensity of the emotion felt while their facial expressions (i.e., zygomatic and corrugator muscle activity) were recorded. They were then instructed to perform an incidental recognition task followed by another task in which they had to assess for each musical excerpt to what extent they experienced each of the four emotions.

As predicted, the results showed that, when presented with happy music, older adults assessed the emotion felt as more intense than their younger counterparts. The fact that older adults rated their emotional experience as significantly more intense for happy music stimuli in comparison to sad and scary music stimuli is consistent with the literature showing that aging is associated with a relative preference for positivity over negativity. This also supports the view that emotions and motivations cannot be disentangled from each other. However, the assumption that the stronger emotional experience reported by older adults while listening to happy music would be reflected in greater zygomatic activity was not supported. Compared to younger adults, older adults showed stronger facial expressions for both corrugator and zygomatic muscles as well as for all intended emotions. This suggests that facial expressions are not exclusively aligned with the emotional state, thus raising the question of whether the general increase in facial expressiveness in older adults simply reflects a deeper engagement in the task. Interestingly, however, the current results also showed that older adults' zygomatic activity, but not young adults', varied as a function of the intended emotions, such that older adults showed increased zygomatic activity for scary excerpts but not for happy or peaceful ones. This is in line with the idea that smiling may serve a defensive goal for older adults by inhibiting negative feelings. It can also be argued that zygomatic activity may reflect a partial facial expression of fear, but in that case we should have observed greater concomitant activity of the corrugator muscle, which was not the case. Taken together, these findings are consistent with the idea that older adults' facial expressions possibly reflect an attempt to regulate emotion.
Further research is needed to substantiate the role of voluntary facial expressions in older adults' response to emotions.

Consistent with our expectations, our findings indicated that older adults correctly recognized fewer musical excerpts than their younger counterparts. Moreover, older adults' range of performance was quite similar to that found by Kensinger (2008) with emotional words. This suggests that modality has little impact on the strength of the memory decline with aging. The results of the present study also indicated that younger adults as well as older adults better recognized negative and arousing musical excerpts (i.e., scary music) than all other excerpts while producing low false alarm rates for these scary music stimuli. This corroborates the hypothesis of an increased distinctiveness of negative stimuli (e.g., Pesta et al., 2001) and extends previous studies that showed that older adults can visually detect arousing and negative stimuli as well as their younger counterparts (e.g., Magai et al., 2006; Knight et al., 2007). Our findings also provided evidence for increased false recognition of happy music stimuli in older adults but not in young adults. We found a negative relationship between age and the ability to discriminate between true and false happy musical excerpts as well as a positive relationship between the emotion intensity felt in older adults and their rate of false alarms for happy stimuli. Taken together, these findings suggest that positive emotion elicited by happy excerpts may produce an attentional bias in older adults that can lead to confusion between studied and non-studied excerpts and thus enhance the probability of false alarms. Such an increase in discrimination threshold is consistent with previous studies showing that aging is associated with a higher false response rate to positive words (Fernandes et al., 2008; Piguet et al., 2008) and corroborates the idea that the reduction of distinctiveness for positive information in older adults would be the result of their liberal bias toward positivity.
However, in the present study, the positivity bias is detrimental to memory accuracy.

Another main finding of the current research is that, when presented with sad and scary musical excerpts, older adults reported experiencing lower levels of sadness and fear than their younger counterparts. Correlation analyses indicated that the older the participants were, the weaker their experience of fear while listening to scary music. This fits nicely with previous research demonstrating age-related changes in emotion recognition (Laukka and Juslin, 2007; Lima and Castro, 2011) and emotion perception (Vieillard et al., 2012) in music, and extends these studies by showing that these changes also occur when participants are focused on their own emotional experience. Interestingly, compared to Lima and Castro's (2011) recognition paradigm, the personal engagement involved in the current task seems to facilitate older adults' ability to process negative emotions. This is consistent with previous findings demonstrating that older adults benefit more from instructions encouraging them to focus on emotion than on information acquisition (Mikels et al., 2010) and corroborates the view of an age-related emphasis on emotion processing. Of course, further research is needed to compare older adults' responsiveness to musical emotions in both contexts of recognition and of emotional experience.

Given the relatively short duration of the musical stimuli used in the present study, one may argue that this could challenge their ability to induce emotions, leading participants to rate their perceived emotions rather than their felt emotions. Although studies aiming to induce felt emotions in listeners tend to use longer excerpts than those investigating perceived emotions (Eerola and Vuoskoski, 2013), we believe that the short excerpts used in our experiment also successfully induced emotions. First of all, our results indicated that participants reported moderate intensity of the emotion felt along with significant differences in facial expressivity. Furthermore, previous findings demonstrated that musical excerpts as short as 13 s may recruit neural mechanisms involved in pleasant/unpleasant emotional responses (Blood et al., 1999). In the study of Vieillard et al. (2008), which used similar 10 s musical excerpts recorded in a piano timbre, listeners better recognized some intended emotions when focusing on their emotional experience than when focusing on the recognition of the emotion. This suggests that asking participants to focus on felt emotions increases the degree of personal engagement in musical emotion even for short musical excerpts. Taken together, these data corroborate the hypothesis that the musical emotions were not only recognized, but indeed felt.

One limitation of this study is that we used a cross-sectional design. Historical differences in the cultural system and in musical exposure may have affected young and older adults' performances differently. The observed age differences in emotional responses to music might thus reflect a cohort effect rather than an age effect. Future research would benefit from investigating this issue more thoroughly. Nevertheless, our study suggests that emotional response to music and memory recognition for musical excerpts conveying emotions show differences with advancing age. These age-related differences are characterized by a stronger emotional reactivity for happiness, an increased zygomatic activity in response to scary stimuli, an increase in false recognition for happy musical excerpts, and a decrease in responsiveness to sad and scary music. This study extends previous findings and expands them to music, a powerful channel of emotion communication. Importantly, the findings suggest that aging may cause a decrease in negative affects and an increase in positive affects even when these affects are elicited by a more abstract source of emotion that does not refer to specific events. Finally, the current data are in line with the hypothesis that older adults could use emotional coping skills acquired over the life span in order to avoid potentially negative events and maintain positive ones (Charles et al., 2001; Labouvie-Vief et al., 2010).

#### **ACKNOWLEDGMENTS**

This research was funded by the ANR "EMCO" program (Project Streem N ANR 11 EMCO 003 01). We are grateful to Alexandra Richen for her help in running the experiments as well as to Alejandra R. Velasquez and Joanna Blatter-Minn for proofreading the manuscript.

#### **REFERENCES**


emotion studies: approaches, emotion models, and stimuli. *Music Percept.* 30, 307–340. doi: 10.1525/mp.2012.30.3.307



using emotionally toned words. *Psychol. Aging* 20, 579–588. doi: 10.1037/0882-7974.20.4.579


*Emot.* 31, 182–191. doi: 10.1007/s11031-007-9063-z


*Learn. Mem. Cogn*. 27, 328–338. doi: 10.1037/0278-7393.27.2.328


disease. *Brain Cogn*. 74, 58–65. doi: 10.1016/j.bandc.2010.06.005


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 March 2013; accepted: 17 September 2013; published online: 16 October 2013.*

*Citation: Vieillard S and Gilet A-L (2013) Age-related differences in affective responses to and memory for emotions conveyed by music: a cross-sectional study. Front. Psychol. 4:711. doi: 10.3389/fpsyg.2013.00711*

*This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology.*

*Copyright © 2013 Vieillard and Gilet. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*