ORIGINAL RESEARCH article

Front. Psychol., 08 November 2016
Sec. Psychology of Language

Rhythm on Your Lips

  • 1Laboratorio de Neurociencias Cognitivas, Escuela de Psicología, Pontificia Universidad Católica de Chile, Santiago, Chile
  • 2Language, Cognition and Development Lab, International School for Advanced Studies, Trieste, Italy
  • 3Center for Social and Cognitive Neuroscience, School of Psychology, Universidad Adolfo Ibañez, Santiago, Chile

The Iambic-Trochaic Law (ITL) accounts for speech rhythm: sounds alternating in duration are grouped as Iambs, while sounds alternating in pitch and/or intensity are grouped as Trochees. The two rhythms signal word order, one of the basic syntactic properties of language. We investigated the extent to which Iambic and Trochaic phrases can be recognized auditorily and visually, when the visual stimuli engage lip reading. Our results show that both rhythmic patterns were recognized from auditory as well as from visual stimuli, suggesting that speech rhythm has a multimodal representation. We further explored whether participants could match Iambic and Trochaic phrases across the two modalities. We found that participants auditorily familiarized with Trochees, but not with Iambs, were more accurate in recognizing visual targets, while participants visually familiarized with Iambs, but not with Trochees, were more accurate in recognizing auditory targets. The latter results suggest an asymmetric processing of speech rhythm: in the auditory domain, changes in pitch or intensity are better perceived and represented than changes in duration, while in the visual domain changes in duration are better processed and represented than changes in pitch, raising important questions about domain-general and specialized mechanisms for speech rhythm processing.

Introduction

Spoken language is governed by rhythm, and rhythm can be found at almost every level of speech. At the most basic level, linguistic rhythm is signaled by the proportion of time occupied by vowels in the speech stream (%V) and the standard deviation of consonantal intervals (ΔC) (Ramus et al., 1999). Moreover, rhythm in spoken language is also signaled through periodic changes in intensity, duration and pitch involving speech units longer than phonemes, such as syllables, which help us to identify, for instance, which syllables are strong in a word or where prominence falls in phonological phrases. These changes in intensity, duration and pitch involving phonemes, syllables and longer linguistic units alternate regularly, conferring a rhythmic alternation on speech prosody.
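
For concreteness, these two basic rhythm metrics can be computed directly from a segmented utterance. The following is a minimal sketch, assuming vowel/consonant intervals have already been obtained from a phonetic segmentation; the function name and input format are illustrative, not taken from Ramus et al. (1999).

```python
# Sketch of the basic rhythm metrics of Ramus et al. (1999): %V is the
# proportion of utterance time occupied by vocalic intervals, and delta-C
# is the standard deviation of consonantal interval durations.
from statistics import pstdev

def rhythm_metrics(intervals):
    """intervals: list of (label, duration_s) pairs, label 'V' or 'C'."""
    vowels = [d for kind, d in intervals if kind == "V"]
    consonants = [d for kind, d in intervals if kind == "C"]
    percent_v = 100.0 * sum(vowels) / (sum(vowels) + sum(consonants))
    delta_c = pstdev(consonants)   # spread of consonantal interval durations
    return percent_v, delta_c

# Toy utterance segmented into C and V intervals (durations in seconds).
example = [("C", 0.08), ("V", 0.12), ("C", 0.15),
           ("V", 0.10), ("C", 0.07), ("V", 0.14)]
print(rhythm_metrics(example))   # -> (%V, delta-C)
```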

Because rhythmic alternation at different levels of the rhythmic hierarchy signals different linguistic properties, it offers language learners cues that might allow them to break into different regularities of language detectable from the speech stream. For example, languages can be discriminated on the basis of rhythm at the basic level (e.g., Ramus et al., 1999), and rhythmic alternation offers cues to the size of the syllabic repertoire (Nespor et al., 2011). Continuous speech can be segmented into words (Jusczyk, 1999; Shukla et al., 2007) and phrases (Christophe et al., 1994) on the basis of rhythm, and rhythmic alternation even offers a cue to such basic syntactic properties as word order (Christophe et al., 2003). The ability to represent, recognize and discriminate rhythm in spoken language is therefore likely to play a crucial role for infants acquiring their mother tongue (Langus et al., 2016) and possibly for adults learning a second language. Speech perception, however, is a multi-sensory experience. In addition to the sound of spoken language, speech is also perceived visually from the movements of the lips (McGurk and MacDonald, 1976), the face (Graf et al., 2002; Blossom and Morgan, 2006), the hands (McNeill, 2005; Guellaï et al., 2014), and possibly also from other parts of the speaker's body. Visual information may be sufficient to discriminate between different languages. For example, bilingual Spanish-Catalan as well as monolingual Spanish and Catalan speakers, but not monolingual speakers of English and Italian, can discriminate Catalan and Spanish using only visual cues (Soto-Faraco et al., 2007). Monolingual and bilingual English- and Spanish-speaking adults have also been shown to discriminate between Spanish and English, two languages differing at the basic rhythmic level, only on the basis of the visual cues provided by speaking faces (Ronquest et al., 2010). These results suggest that adult listeners can discriminate between rhythmically similar (Spanish and Catalan) as well as rhythmically different (English and Spanish) languages by analyzing facial movements when they know at least one of the two languages. However, because these visual rhythmic discrimination tasks relied on utterances from languages that differed in prosodic, segmental, lexical and syntactic characteristics, it is difficult to determine which of the speech cues with a visual correlate contributes to the discrimination of the stimuli. Nevertheless, a recent study showed that both English monolingual and Spanish/Catalan bilingual adult speakers discriminated resynthesized flat-prosody versions of English and Japanese utterances, two languages differing in the mean duration of their consonant and vowel clusters, not only when they were presented as auditory stimuli but also when they were transformed into visual sequences of mouth openings and closings of a schematic face, and into vibro-tactile streams (Navarra et al., 2014). Speech rhythm perceived through different sensory modalities is thus relevant for discovering not only segmental but also supra-segmental properties of speech.

Here we therefore investigate audiovisual discrimination of rhythm in phonological phrases in a non-native language. The phonological phrase extends from the left edge of a phrase to the right edge of its head (e.g., the noun in noun phrases or the verb in verb phrases) in head-complement languages, and from the left edge of a head to the left edge of its phrase in complement-head languages (Nespor and Vogel, 1986, 2007). Thus, in a language like English a phonological phrase starts with a function word or preposition and ends with the head, as in for the girls; in a language like Turkish it starts with the head and ends at the end of the maximal projection, e.g., at a postposition, as in benim için (literally "me for"), meaning "for me." Among the languages spoken around the world there are only two known types of phrasal rhythm: iambic and trochaic. Languages with the basic Object-Verb word order, where the head of the phrase follows its complements, such as Turkish and Japanese, mark prominence trochaically, mainly through pitch and intensity on the stressed syllable of the first word of the phonological phrase. Languages with the basic Verb-Object word order, where the head of the phrase precedes its complements, like English and Italian, mark prominence iambically, mainly through duration on the stressed syllable of the last word of phonological phrases (Nespor et al., 2008).

Compared to the other levels of rhythm, phrasal rhythm is likely to play a prominent role in audiovisual discrimination because phrasal prominence is highly salient. The location of prominence at higher levels of the prosodic hierarchy coincides with the location of prominence at lower levels. For example, phonological phrase prominence is signaled on top of lexical stress, which in turn is signaled on top of prominence in feet. Prominence at the phonological phrase level is thus acoustically more salient than lexical stress or prominence in feet, the latter being hardly prominent in connected speech. Furthermore, because phonological phrase boundaries never fall within words and phonological phrases are fully contained in intonational contours, the rhythm signaled through phonological phrase prominence provides cues to both phrase and word boundaries (Christophe et al., 1994), signals how words combine into phrases (Nespor and Vogel, 1986, 2007; Langus and Nespor, 2010), and correlates even with the basic word order of the language (Nespor et al., 2008). Sensitivity to phonological phrase rhythm thus provides acoustically highly salient entry points to the understanding of the structure of continuous speech.

Given the importance of rhythm at the phonological phrase level both in language perception and acquisition, could phonological phrase rhythm also be discernable from visual information accompanying speech? At least some prosody is discernable from visual speech cues because the timing and the motor organization of the head are linked to the production of lexical stress, and to prosody in general (Hadar et al., 1983, 1984). When relying only on visual cues of their native language, participants can correctly discriminate the intonation of a statement from that of a question (Srinivasan and Massaro, 2003), detect contrastive focus (Dohen and Loevenbruck, 2005; Dohen et al., 2008), determine when utterances end (Barkhuysen et al., 2008), and identify the location of phrasal as well as lexical stress (Bernstein et al., 1989). In accordance with the fact that prominence at higher levels of the prosodic hierarchy is acoustically more salient, participants are also significantly better at judging the location of phonological phrase prominence than that of lexical stress when relying on visual cues alone (Bernstein et al., 1989).

Adult participants are not only good at perceiving visual prosody: the prosodic cues embedded in the visual information accompanying speech also enhance their perception of speech sounds. For example, speech intelligibility increases when speech is accompanied by head nods, head movements and eyebrow movements (Granström et al., 1999; House et al., 2001; Krahmer et al., 2002; Massaro and Beskow, 2002; Srinivasan and Massaro, 2003; Munhall et al., 2004). Perceivers' judgments about stress are also better with audiovisual speech than with the sound of spoken language alone (Dohen et al., 2008), and the perception of prominence in words is significantly improved when speech sounds are accompanied by hand gestures (Krahmer and Swerts, 2004). Recent findings suggest that prosody in the spontaneous gestures accompanying speech may even help to disambiguate ambiguous sentences (Guellaï et al., 2014). This suggests that adult listeners are quite good at determining the location of phrasal stress in their native language from visual cues alone. However, it remains unclear to what extent adult second language learners are capable of discriminating iambic/trochaic phrasal stress visually in non-native languages.

The ability to discriminate rhythm will depend on the visibility of the main acoustic correlates of phrasal stress: fundamental frequency (F0), duration, and intensity. Because the laryngeal muscles that control fundamental frequency produce only small visible movements in speech production, fundamental frequency is difficult to perceive visually. Even though several studies have tried to associate F0 with other visual cues, including eyebrow movements (Cavé et al., 1996) and head movements (Yehia et al., 2002), the evidence for visual perception of F0 is considered impoverished at best (Smith et al., 2010). In contrast, prominence signaled through duration and intensity is realized through the vocal articulators and is therefore considerably more visible than fundamental frequency. The pronunciation of stress is in fact associated with larger, faster and longer jaw and lip movements (e.g., Beckman and Edwards, 1994; de Jong, 1995; Erickson et al., 1998; Harrington et al., 2000; Erickson, 2002; Cho, 2005, 2006). The role of the articulators in stress perception is also supported by the finding that judgments about phrasal stress are not affected when the face is hidden from the nose up, suggesting that the mouth may contain enough information for discriminating trochaic phrasal stress signaled through intensity and iambic phrasal stress signaled through duration (Lansing and McConkie, 1999). Perceiving prominence signaled through pitch, which is difficult to discern visually, should therefore rely more on stable auditory information. In contrast, perceiving prominence signaled through intensity and duration, both of which are visible in the face of the speaker, could also rely on visual information.

This raises the issue of the role of audio-visual information in speech perception. While auditory and visual information clearly contribute to the perception of supra-segmental information, the differences in the visibility of the different acoustic correlates of prominence suggest that they are likely to contribute differently to speech perception. The majority of studies that investigate the audio-visual perception of prosody test how auditory and visual information are integrated in speech perception. This generally entails comparing participants' performance exclusively in the auditory modality, or exclusively in the visual modality, to their performance with audio-visual speech. However, because auditory information alone is often sufficient for participants to perform at ceiling (e.g., Brunellière et al., 2013), the importance of visual cues in speech perception has remained difficult to describe. Rather than testing the advantages of audio-visual rhythm over rhythm perceived from the single modalities, we tested participants' ability to discriminate iambic/trochaic rhythm within and across the two modalities. By comparing participants' performance in discriminating iambic/trochaic rhythm either in the auditory modality alone (Experiment 1a) or in the visual modality alone (Experiment 1b), we investigate whether there are significant differences between auditory and visual cues to phonological phrase rhythm. By testing whether participants can transfer rhythm from the auditory to the visual modality (Experiment 2a) and vice versa (Experiment 2b), we aim at discovering whether participants can dynamically integrate information from the auditory modality with information from the visual modality when, at any given moment, such information is available from only a single source (either auditory or visual). For example, can participants switch from auditory to visual prosody (or vice versa) when the first of the two becomes degraded or inaudible due to situational constraints?

In this paper, we therefore investigate adult participants' ability to match rhythmic patterns in audio and visual presentations of faces uttering either iambic or trochaic nonsense phrases. Although recognition of Iambs and Trochees has been extensively investigated auditorily (Hay and Diehl, 2007; Iversen et al., 2008; Bion et al., 2011; Bhatara et al., 2013), the evidence for the Iambic-Trochaic Law (ITL) in the visual domain is scarce. To the best of our knowledge, only one previous study has shown that visual sequences of colored squares are grouped according to the ITL (Peña et al., 2011), and no study has reported that iambic/trochaic rhythm can be recognized from the visual information provided by the lips and mouth when perceiving non-native languages, nor that the information obtained from the ITL in one modality, auditory or visual, can be transferred to the other sensory modality. Since the ITL is informative about word order, and visual and auditory cues are exploited for language learning, our study will add new data on how particular audio-visual cues might support language processing and learning. We recorded the stimuli from German native speakers because in German, subordinate clauses, depending on the type of complementizer chosen, can either have the Object-Verb order (e.g., weil ich papa sehe, literally "because I father see"), where phonological phrase prominence is signaled trochaically at the beginning of the phrase, or the Verb-Object order (e.g., denn ich sehe papa, literally "because I see father"), where phonological phrase prominence is signaled iambically at the end of the phrase (Nespor et al., 2008). We replaced the object-verb (e.g., papa sehe)/verb-object (e.g., sehe papa) pairs in these constructions with nonsense words and then video recorded German native speakers uttering the resulting nonsense subordinate clauses. These Object-Verb/Verb-Object constructions within a single language enabled us to test the discrimination of iambic-trochaic rhythm in a controlled way: the discrimination between iambic and trochaic phonological phrase prominence could not be due to cross-linguistic or cross-speaker differences. In addition, the rhythmic differences could only be related to phonological phrase prominence rather than to either lexical stress or secondary stress within feet, since the latter are identical in the two types of phrases. Finally, all relevant phrases were uttered in a non-emphatic way, as if out of the blue, and were included in a single intonational phrase. Thus, the rhythmic differences could not be caused by different intonational phrase breaks.

Experiment 1: Matching Iambic and Trochaic Items from Unimodal Contexts

Even though the ITL has been investigated in spoken language, as well as in tones (e.g., Hay and Diehl, 2007) and in visual stimuli (Peña et al., 2011), we know of no study that directly compares adult participants' ability to discriminate phrasal prominence in both the auditory and the visual component of speech samples. In Experiment 1a we therefore tested Spanish-speaking adult listeners' ability to discriminate iambic and trochaic phrasal prominence within auditory stimuli recorded from German speakers, in whose native language phrasal prominence in subordinate clauses can be either iambic or trochaic. There is no experimental evidence showing that speakers of Spanish, an iambic language where phrasal prominence is signaled primarily through duration at the end of phonological phrases, can recognize Iambs and Trochees in linguistic stimuli. However, on the basis of previous findings from speakers of other iambic languages, such as Italian and English, we predict that Spanish speakers will also discriminate Iambs from Trochees. Spanish speakers are also likely to succeed in this task because violations in perceiving iambs have only been found in trochaic languages that have a basic Object-Verb word order (e.g., Turkish and Persian), but never in iambic languages that have a basic Verb-Object order (e.g., Italian and English) (Langus et al., 2016). Testing Spanish speakers with German stimuli would thus provide evidence for the perception of iambs and trochees at the phrasal level in a language that has previously not been investigated.

Following the discrimination in the auditory modality, in Experiment 1b we tested Spanish-speaking adult listeners' ability to discriminate iambic and trochaic phrasal prominence within visual stimuli extracted from the same audio-visual recordings used in Experiment 1a. Previous findings with visual speech suggest that adult participants are highly accurate in determining the location of phrasal prominence in visual presentations of their native language (Soto-Faraco et al., 2007). Is it possible that hearing adults could also discriminate Iambic and Trochaic nonsense phrases extracted from the visual prosody of a foreign language? The iambic and trochaic prominences in phonological phrases differ not only in location but also in the acoustic correlates that signal them: pitch and intensity at the beginning of the phrase in the case of trochaic rhythm, and duration at the end of the phrase in the case of iambic rhythm. This suggests that participants have at least two cues, prominence location and the acoustic correlate of prominence, for discriminating between iambic and trochaic phrases. However, it is unknown whether these cues are available when perceiving foreign speech visually. We thus expect that Spanish-speaking adults would detect the prosodic properties involving pitch and duration in the auditory presentations of trochaic and iambic nonsense phrases, respectively (Experiment 1a), as well as in the visual representation of pitch and duration in the movements of the mouth and the lips (Experiment 1b).

Experiment 1a: Matching Iambic and Trochaic Items in the Auditory Modality

We tested whether Spanish-speaking adults can discriminate iambic and trochaic nonsense phrases extracted from German prosody when only auditory cues are available.

Materials and Methods

Participants

Eighteen college students (9 male, 9 female; mean age = 26.5 years) completed the study. In this, as in all experiments of our study, participants were native speakers of Spanish with either normal or corrected-to-normal vision. They received academic credits for their participation and, as in all experiments of this study, signed a written consent form approved by the local ethics committee.

Stimuli

We video-recorded two native speakers of German (one female and one male, both 25 years old) while they uttered a series of 6 nonsense phrases (“bole tase,” “bale tose,” “dofe mave,” “dafe move,” “move fape,” “mave fope”). In order to record the same nonsense phrases with natural iambic and trochaic prosody in a single language, we chose native speakers of German because it is a language where both forms are equally used (Nespor et al., 2008). We prepared a written list of nonsense phrases by replacing the last two words of an iambic and a trochaic real phrase. In the real phrase denn ich sehe papa (“because I see father”), which has an iambic structure, we replaced sehe papa by each of the six nonsense two-word phrases, e.g., denn ich bole tase; in the real phrase weil ich papa sehe (“because I see father”), which has a trochaic structure, we replaced papa sehe by the same nonsense phrases, e.g., weil ich bole tase. Speakers were unaware of the purpose of the experiment and were asked to read the list and utter each sentence with the same prosody they would use for the real sentence. Each sentence was randomly presented 6 times in the list. To counter participants' tendency to direct their gaze to the speakers' eyes, we asked speakers to wear black sunglasses. Speakers were also asked to avoid facial movements and to speak non-emphatically. In a second step, we segmented the last two nonsense words of each video recording and selected the best 5 iambic and 5 trochaic exemplars (hereafter nonsense phrases) from each speaker. We chose sequences that had similar duration and pitch and did not contain salient head or facial movements. The resulting stimuli were prepared as auditory and visual files, containing either only the auditory or only the visual track of the nonsense phrases (see Supplemental Videos and sounds). In Experiment 1a only auditory files were used.

To quantify whether and how the stimuli acoustically differed in pitch distribution and duration, for each speaker we first measured the duration and the maximum F0 of each of the four syllables of each phrase, and we then statistically compared these values in trochees vs. iambs. We verified that the maximum F0 in the first syllable was significantly higher for Trochaic than for Iambic stimuli, and that the duration of the third syllable was significantly longer for Iambs than for Trochees in both the female and the male speaker. Moreover, only in the female speaker, the pitch of the second syllable was also significantly higher for trochees than for iambs (see Table 1).

Table 1. Duration and pitch of the auditory tracks of the iamb and trochee stimuli.
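
As an illustration of how such per-syllable measurements can be obtained, the sketch below extracts the duration and maximum F0 of each syllable from a recording, given syllable boundaries from manual annotation. The use of parselmouth (a Python interface to Praat) is our assumption for the pitch tracker; the text does not specify which tool was used.

```python
# Per-syllable duration and maximum F0, given annotated syllable boundaries.
import numpy as np
import parselmouth  # Python interface to Praat (assumed tool)

def syllable_measures(wav_path, syllable_bounds):
    """syllable_bounds: list of (start_s, end_s) tuples, one per syllable."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()                     # default autocorrelation method
    f0 = pitch.selected_array['frequency']     # Hz; 0 where unvoiced
    times = pitch.xs()
    measures = []
    for start, end in syllable_bounds:
        frame_f0 = f0[(times >= start) & (times <= end)]
        voiced = frame_f0[frame_f0 > 0]
        measures.append({"duration": end - start,
                         "max_f0": voiced.max() if voiced.size else float("nan")})
    return measures
```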

Trial Structure

Participants heard 48 randomly presented trials, 12 in each of the four experimental conditions resulting from the combination of two trial types (i.e., without a change, hereafter standard trials, and with a change, hereafter deviant trials) and two prosodic contrast types (i.e., iambic and trochaic). Each trial comprised the presentation of four consecutive nonsense phrases uttered by a single speaker, separated by 1000 ms of no stimulation. The inter-trial interval varied from 1500 to 3000 ms.
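
The factorial design is compact enough to summarize in code. The sketch below is illustrative only: names and data structures are ours, and the sampling of concrete exemplars (three different context exemplars plus a target, all from one speaker) is abstracted away.

```python
# 48 trials: 12 per cell of (trial type x prosodic contrast), randomized.
import random

ISI_MS = 1000                 # silence between the four phrases of a trial
ITI_RANGE_MS = (1500, 3000)   # inter-trial interval

def make_trials(n_per_cell=12, seed=0):
    rng = random.Random(seed)
    trials = []
    for contrast in ("iambic", "trochaic"):
        deviant = "trochaic" if contrast == "iambic" else "iambic"
        for _ in range(n_per_cell):
            # Standard trial: the 4th phrase keeps the context rhythm.
            trials.append({"context": contrast, "target": contrast,
                           "type": "standard"})
            # Deviant trial: the 4th phrase switches rhythm.
            trials.append({"context": contrast, "target": deviant,
                           "type": "deviant"})
    rng.shuffle(trials)
    return trials

assert len(make_trials()) == 48
```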

Procedure

All experiments were carried out in a dimly lit soundproof room (0.05–0.1 μW/cm²). Participants were seated at 60–70 cm from the screen and heard the stimuli through headphones. Written instructions were presented on a computer screen. Before starting the experiment, each participant was informed that in normal speech people can utter a single phrase in different ways, e.g., welcome! vs. welcome?, and that this study explored their ability to recognize similarities and differences in the prosody of nonsense phrases by listening to them. We emphasized that similarities and differences would only involve prosody, and we gave participants one practice trial of each trial type using material not included in the experiment. Participants were then informed that in all trials the first three stimuli were three different exemplars of the same nonsense phrase pronounced with the same melody, e.g., welcome!—welcome!—welcome!. In half of the trials the fourth nonsense phrase would be uttered with the same melody, such as welcome! in the previous example, while in the other half it would be uttered with a different melody, e.g., welcome?. We thus exposed participants to trials with no change in prosodic contrast, where all four exemplars matched, i.e., iambic1-iambic2-iambic3-iambic4 or trochee1-trochee2-trochee3-trochee4 (i.e., standard trials), and to trials with a change in prosodic contrast, where the fourth exemplar mismatched the previous three, i.e., iambic1-iambic2-iambic3-trochee4 or trochee1-trochee2-trochee3-iamb4 (i.e., deviant trials). Participants were asked to listen attentively to the four nonsense phrases of each trial because at the end of the trial they would be asked to judge whether the fourth nonsense phrase had the same or a different prosody from the three preceding phrases. They were asked to respond as accurately and as fast as possible by pressing the “same” or the “different” button. Figure 1A illustrates the structure of the trials in this experiment.

Figure 1. Schematic structure of trials in Experiments 1a (A), 1b (B), 2a (C), and 2b (D). Loudspeaker and face still images represent auditory and visual video files, respectively. The first 3 stimuli in each trial (the context) and the last stimulus of each trial (the target) were different exemplars of a prosodic category. However, the 4th stimulus was identical across standard and deviant trials.

Data Analysis

Data analysis was similar for all the experiments. For each participant, in each prosodic contrast type (i.e., iambic and trochaic) and each trial type (i.e., standard and deviant), we measured the mean accuracy as the percentage of correct responses (i.e., percentage of correct Same + percentage of correct Different over all trials), and the mean reaction time for correct responses. To estimate participants' perceptual discriminability of the prosodic contrasts and their response bias in our “same-different” task, we computed A-prime (hereafter A') and Beta (hereafter B”), respectively. A' measures participants' ability to discriminate iambs from trochees in the given task, estimating the probability of answering “same” when the target was the same and “different” when it was not. B” measures the bias that participants may have toward answering “same” or “different” regardless of the target. An A' of 0.5 indicates performance at chance, while values near 1.0 indicate good discriminability. A B” equal to 0.0 indicates no bias, while positive and negative values (up to a maximum of 1 and −1) reflect a tendency to answer “different” and “same,” respectively.
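
The two indices can be computed per participant from hit and false-alarm rates. The sketch below uses one common parameterization of A' and of Grier's B” (in the spirit of the algorithms cited in the next paragraph); the mapping of responses to hits is our assumption, chosen so that a negative B” indicates a bias toward answering “same,” as in the text.

```python
# Non-parametric discriminability (A') and response bias (B'').
# Assumed mapping: hit = "same" response on a standard trial,
# false alarm = "same" response on a deviant trial, so that a
# negative B'' reflects a bias toward "same".

def a_prime(h, f):
    """A': 0.5 = chance, values near 1.0 = good discriminability."""
    if h >= f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

def b_double_prime(h, f):
    """Grier's B'': 0 = no bias; the sign indicates the response tendency."""
    num = h * (1 - h) - f * (1 - f)
    den = h * (1 - h) + f * (1 - f)
    return num / den if den else 0.0

# Example: "same" on 90% of standard trials, 40% of deviant trials.
print(a_prime(0.9, 0.4))          # ~0.85: well above chance
print(b_double_prime(0.9, 0.4))   # ~-0.45: bias toward "same"
```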

For the statistical analysis, we first computed A' and B” for each participant using classical algorithms (Snodgrass et al., 1985; Mueller and Weidemann, 2008). We then submitted the A' values of all participants to a one-sample t-test (alpha = 0.05; two-tailed) against chance (i.e., 0.5), and we computed the mean B” for the group. Subsequently, only if A' was significantly different from chance for the group, that is, if the data showed good discriminability, we submitted participants' mean accuracy for Iambic and Trochaic trials to separate one-sample t-tests (alpha = 0.05; two-tailed) against chance (50%). Moreover, to estimate possible differences in the ability to recognize Iambic and Trochaic items, we submitted participants' mean accuracy and mean reaction time for both types of prosodic contrast to a repeated measures ANOVA with Prosodic contrast (Iamb and Trochee) as within-subject factor, with Greenhouse-Geisser correction.
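
At the group level these tests reduce to standard one-sample comparisons. A minimal sketch with SciPy follows, using placeholder data in place of the real per-participant scores (n = 18); note that with only two within-subject levels, the repeated measures ANOVA on accuracy is equivalent to a paired t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a_prime_scores = 0.90 + 0.05 * rng.standard_normal(18)  # placeholder A' values

# A' against chance (0.5), with Cohen's d for a one-sample design.
t_a, p_a = stats.ttest_1samp(a_prime_scores, 0.5)
d_a = (a_prime_scores.mean() - 0.5) / a_prime_scores.std(ddof=1)

# Per-contrast accuracy (%) against chance (50%).
acc_iamb = 85 + 10 * rng.standard_normal(18)    # placeholder accuracies
acc_troch = 80 + 8 * rng.standard_normal(18)
t_i, p_i = stats.ttest_1samp(acc_iamb, 50.0)
t_t, p_t = stats.ttest_1samp(acc_troch, 50.0)

# Iamb vs. Trochee within participants: with two levels the repeated
# measures ANOVA is equivalent to a paired t-test (F = t**2).
t_rm, p_rm = stats.ttest_rel(acc_iamb, acc_troch)
```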

Results

The mean accuracy and mean reaction times for correct responses in Iambic and Trochaic trials are illustrated in Figure 2.

Figure 2. Mean accuracy (left panel) and mean reaction time (right panel) for Experiments 1a and 1b, for Iambic and Trochaic trials. Fifty percent accuracy represents the chance level. Error bars indicate 95% confidence intervals.

The mean A' across all trials was significantly higher than chance [Mean = 0.927 ± 0.045, t(17) = 39.920, p < 0.001, Cohen's d = 13.81], showing that subjects did discriminate the prosodic cues evaluated in this study. Moreover, the mean B” was 0.060 (range = 0.584 to −0.464), showing that the group of participants had no tendency to press either mostly the “Same” or mostly the “Different” key. We thus submitted the mean accuracy for Iambic and Trochaic trials to separate one-sample t-tests (alpha = 0.05; two-tailed) against chance (50%). Mean accuracy was significantly above chance for Iambic [Mean ± SD = 90.135 ± 10.374, t(17) = 16.413, p < 0.001, Cohen's d = 5.61] and Trochaic trials [Mean ± SD = 84.670 ± 7.486, t(17) = 19.650, p < 0.001, Cohen's d = 6.74]. No significant differences were observed between the mean accuracies for Iambs and Trochees, or between participants' reaction times for correct responses across conditions. Our results thus show that both Iambic and Trochaic nonsense phrases are recognized by adult Spanish speakers in the auditory modality.

Because previous studies suggest that facial movements, particularly of the eyebrows, may serve as a cue associated with pitch increases, we explored this possibility. For each speaker, we classified the trials as containing or lacking salient eyebrow movements, regardless of whether they were iambic or trochaic. We found that the male speaker frequently raised his eyebrows, especially when uttering trochees, whereas the female speaker rarely raised hers. We then measured accuracy for iambs and trochees separately in trials with and without eyebrow movements (match and mismatch). In both types of trials we found results similar to those observed when all trials were compared together, suggesting that eyebrow movement is a salient cue but is not indispensable for performing the task.

Experiment 1b: Matching Iambic and Trochaic Items in the Visual Modality

We tested whether Spanish-speaking adults can discriminate iambic and trochaic nonsense phrases extracted from German prosody when only visual cues are available. The results of Experiment 1a are directly comparable to those of Experiment 1b because the visual stimuli of Experiment 1b were extracted from exactly the same audio-visual recordings as the auditory stimuli of Experiment 1a.

Method

Participants

Eighteen college students (7 male, 11 female; mean age = 21.5 years) completed the study. None of them had participated in Experiment 1a.

Stimuli

The stimuli corresponded to the visual track (mouth and lip movements plus the facial movements associated with speech) of the nonsense phrases evaluated in Experiment 1a, with the auditory information completely removed.

Procedure

The procedure was similar to that described for Experiment 1a, but this time the stimuli were presented exclusively as videos. Participants received instructions similar to those of Experiment 1a, but were informed that the goal of this study was to explore whether the different ways of uttering a single phrase can be detected visually, from how the lips and the mouth of the speaker move. Before starting the experiment, participants saw an example of each trial type. Participants were then asked to watch the lips/mouth of the speakers attentively because at the end of each trial they would have to judge whether the fourth video used the same or a different prosody as the three preceding videos. Responses were given by pressing the “same” or “different” button. Figure 1B illustrates the structure of the trials.

Results

The mean accuracy and mean reaction time for correct responses in Iambic and Trochaic trials are illustrated in Figure 2. We found that the mean B” was −0.507 (range = −0.258 to −0.765), suggesting that the group of participants had a tendency to press “same” in this experiment. However, the mean A' across all trials was significantly higher than chance [Mean = 0.678 ± 0.104, t(17) = 7.203, p < 0.001, Cohen's d = 2.49], showing that despite the tendency to press “same,” participants were able to discriminate the prosodic contrasts we were testing. We thus submitted the mean accuracy for Iambic and Trochaic trials to separate one-sample t-tests (alpha = 0.05; two-tailed) against chance (50%). We found that mean accuracy was significantly above chance for both Iambic [Mean ± SD = 62.611 ± 11.231, t(17) = 4.629, p < 0.001, Cohen's d = 1.63] and Trochaic trials [Mean ± SD = 57.975 ± 13.548, t(17) = 2.498, p = 0.023, Cohen's d = 0.86]. No other significant differences were observed. The results show that participants discriminated the presence and absence of changes in both the duration (Iambic) and the pitch (Trochaic) of the target. Together with Experiment 1a, our results suggest that the visual cues underpinning duration and pitch are detectable from faces by normally hearing persons.

Figure 3. Mean accuracy (left panel) and mean reaction time (right panel) for Experiments 2a and 2b, for Iambic and Trochaic trials. Fifty percent accuracy represents the chance level. Error bars indicate 95% confidence intervals.

To quantify the differences between the results of Experiments 1a and 1b, we submitted the results of both experiments to a t-test for two independent samples (alpha = 0.05, two-tailed). We found that accuracy in Experiment 1b was significantly lower than in Experiment 1a for both Iambic (p < 0.001) and Trochaic (p < 0.001) trials, showing that recognizing prosody from visual cues is harder than from auditory cues. No significant differences were found in reaction time.

Experiment 2: Matching Iambic and Trochaic Targets from a Context with Different Modality

The results of Experiments 1a and 1b show that Spanish-speaking adults can distinguish iambic-trochaic patterns in both the auditory and the visual modality of a foreign language. This suggests that the representations of phrasal rhythm (i.e., whether prominence is signaled trochaically at the beginning of the phrase or iambically at the end of the phrase) are not modality specific. However, if the representations extracted within the auditory and visual modalities emerge from amodal or multimodal processing, participants should also be able to use the information extracted in one modality to detect iambic-trochaic patterns in the other modality, even when the audio and visual components of speech are not presented simultaneously. In the next two experiments, we therefore explored whether participants can discriminate iambic phrasal prominence presented in one modality from trochaic phrasal prominence presented in the other modality and vice versa. If the representation of phrasal rhythm is not modality specific, we would predict that adult Spanish-speaking participants' performance in discriminating iambic-trochaic rhythm across the audio-visual modalities would parallel their performance in Experiments 1a and 1b. In Experiment 2a, we therefore tested whether participants can discriminate visual iambic or trochaic targets when familiarized with auditory examples, and in Experiment 2b, whether they can discriminate auditory iambic or trochaic targets when familiarized with visual examples. The ability to correctly discriminate rhythm from one modality to another would constitute strong evidence for multimodal representations of rhythm in audiovisual speech perception.

Experiment 2a: Matching Visual Iambic and Trochaic Targets from Auditory Context

We tested whether Spanish-speaking adults can discriminate iambic and trochaic nonsense phrases extracted from German prosody when the context of each trial is presented only auditorily and the test items are presented only visually.

Method

Participants

Eighteen college students (8 male, 10 female; mean age = 23.5 years) completed the study. None of the participants had participated in the previous experiments.

Stimuli

The stimuli were the auditory and visual tracks of the nonsense phrases evaluated in Experiment 1a.

Procedure

The procedure and instructions were similar to those of Experiments 1a and 1b, but this time participants were informed that the first three stimuli of each trial would be presented auditorily, while the target would appear visually. Participants were shown an example of each trial type and were asked to listen attentively and to watch the lips/mouth of the speakers, because at the end of each trial they would have to judge whether in the fourth video the speaker used the same or a different prosody as that used in the three preceding audio files. Responses were given by pressing the “same” or the “different” button. Figure 1C illustrates the structure of the trials.

Results

The mean accuracy and mean reaction time for correct responses in Iambic and Trochaic trials are illustrated in Figure 3. The mean B” was −0.415 (range = −0.080 to −0.750), suggesting that the group of participants had a tendency to press “Same” in this experiment. However, the mean A' across all trials was significantly higher than chance [Mean = 0.618 ± 0.074, t(17) = 6.737, p < 0.001, Cohen's d = 2.32], showing that despite the tendency to press “same,” participants were able to discriminate the prosodic contrast we were testing. We thus submitted the mean accuracy for Iambic and Trochaic trials to separate one-sample t-tests (alpha = 0.05; two-tailed) against chance (50%). We found that mean accuracy was significantly above chance only for Trochaic trials [Mean = 60.685 ± 9.375, t(17) = 4.709, p < 0.001, Cohen's d = 1.66]. Moreover, mean accuracy in Trochaic trials was significantly higher than in Iambic trials [F(1, 17) = 17.270, p = 0.001, ηp² = 0.504]. No significant differences in reaction time were observed. The results show that only when the auditory context was Trochaic, i.e., when the phrases had higher pitch on the initial word, were participants able to match similarities and detect variations in the visual target. Our results thus support our prediction only for trochaic familiarization: matching and mismatching of visual targets was better than chance only when the auditory familiarization involved trochees. This suggests that the transfer of ITL properties from an auditory context to a visual format is harder when the auditory familiarization involves duration cues than when it involves cues related to pitch. Memory traces for pitch appear to generate more stable representations during familiarization than memory traces for duration, allowing better performance in both the match and the mismatch of the visual target relative to auditory familiarization. Previous research on infants agrees with our data in showing that pitch and duration in speech are asymmetrically processed: while subtle changes in pitch are quite easily detected by infants at 7 months of age, proportionally similar changes in duration are not (Bion et al., 2011).

Experiment 2b: Matching Auditory Iambic and Trochaic Targets from Visual Context

In this experiment, we tested whether Spanish-speaking adults can discriminate iambic and trochaic nonsense phrases extracted from German prosody when the context of each trial is presented only visually and the test items are presented only auditorily.

Method

Participants

Eighteen college students (11 male, 7 female; mean age = 20.8 years) completed the study. None of the participants had taken part in the previous experiments.

Stimuli

The stimuli were the same audio and visual tracks of the nonsense phrases evaluated in Experiment 2a.

Procedure

The procedure and instructions were similar to those of Experiment 2a, but this time we informed participants that the first 3 stimuli of each trial would be presented visually, while the target would be presented auditorily. Participants were shown an example of each trial type and were asked to watch the speakers' lips/mouth attentively and then listen, because at the end of each trial they would have to judge whether in the fourth audio file the speaker used the same or a different prosody as s/he used in the three preceding videos. Responses were given by pressing the “same” or “different” button. Figure 1D illustrates the structure of the trials.

Results

The mean accuracy and mean reaction time for correct responses in Iambic and Trochaic trials are illustrated in Figure 3. The mean B” was −0.452 (range = −0.083 to −0.821), suggesting that the group of participants had a tendency to press “Same” in this experiment. However, the mean A' across all trials was significantly higher than chance [Mean = 0.620 ± 0.060, t(17) = 6.384, p < 0.001, Cohen's d = 2.91], showing that despite the tendency to press “Same,” participants were able to discriminate the prosodic contrast we were testing. We thus submitted the mean accuracy for Iambic and Trochaic trials to separate one-sample t-tests (alpha = 0.05; two-tailed) against chance (50%). We found that mean accuracy was significantly above chance only for Iambic trials [Mean ± SD = 58.161 ± 7.765, t(17) = 2.868, p < 0.01, Cohen's d = 1.53]. No significant differences were observed in accuracy between Iambic and Trochaic trials, although the mean for Trochees was near chance. Similarly, no difference was found in reaction times. The results show that participants discriminated the presence and absence of changes in the target only when the context was visual iambic, suggesting that duration cues allow the emergence of more stable multimodal linguistic representations during visual familiarization. Visual iambic stimuli can thus be matched and mismatched to the auditory modality with higher performance than visual trochees. To estimate the differences observed in Experiments 2a and 2b, we submitted the mean accuracy in both experiments to a repeated measures ANOVA with Prosodic contrast (Iamb and Trochee) as within-subject factor and Experiment (2a and 2b) as between-subject factor, with Greenhouse-Geisser correction. We found a significant Prosodic contrast × Experiment interaction [F(1, 32) = 10.932, p = 0.002, ηp² = 0.255], due to the fact that the mean accuracy for Trochees was significantly higher in Experiment 2a than in Experiment 2b [F(1, 32) = 8.832, p = 0.006, ηp² = 0.216], and the mean accuracy for Iambs was significantly higher in Experiment 2b than in Experiment 2a [F(1, 32) = 4.194, p = 0.049, ηp² = 0.116]. Our results thus demonstrate participants' difficulty in matching the audio correlates of the ITL to its visual correlates when familiarization and target stimuli are presented consecutively. This may mean that speech perception is fastest when the auditory and visual components of speech are perceived simultaneously (e.g., Soto-Faraco et al., 2007).
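
The mixed design of this final comparison can be expressed compactly. The sketch below assumes the pingouin package (our choice of tool, not necessarily the authors') and synthetic placeholder data shaped like the real design: 18 participants per experiment, each measured on both prosodic contrasts.

```python
import numpy as np
import pandas as pd
import pingouin as pg  # assumed stats package with mixed ANOVA support

# Placeholder data: 18 participants per experiment, both contrasts each.
rng = np.random.default_rng(1)
rows = [{"subject": f"{exp}-{subj}", "experiment": exp,
         "contrast": contrast, "accuracy": 55 + 5 * rng.standard_normal()}
        for exp in ("2a", "2b")
        for subj in range(18)
        for contrast in ("iamb", "trochee")]
df = pd.DataFrame(rows)

# Mixed ANOVA: contrast within subjects, experiment between subjects.
aov = pg.mixed_anova(data=df, dv="accuracy", within="contrast",
                     subject="subject", between="experiment")
print(aov[["Source", "F", "p-unc", "np2"]])
```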

General Discussion

Our results show that Spanish-speaking adults can discriminate Iambic and Trochaic phrasal rhythms both in the auditory and in the visual modality, even when the stimuli have prosodic patterns that do not match those that are unmarked in their native language, such as trochees for Spanish speakers. Because changes in duration and intensity involve changes in the configuration of the lips and the anterior mouth, these correlates are highly salient in the visual modality (Smith et al., 2010). Pitch, however, being generated by the frequency of vibration of the vocal cords, is considerably less salient in the visual modality and has proven difficult to discern from the lips, mouth and other facial movements associated with speech (Vatikiotis-Bateson and Yehia, 1996), although eyebrows may be a strong cue to this parameter in hearing persons (Borràs-Comes and Prieto, 2011; Prieto et al., 2015). Phrasal rhythm can be signaled either mainly through duration (iambic) or mainly through pitch and intensity (trochaic). Spanish-speaking participants' ability to discriminate phrasal rhythm visually thus suggests that participants are sensitive to the visual correlates of duration and pitch/intensity patterns in a foreign language. Furthermore, because the visual displays of our stimuli covered the eyes of the speaker, our results suggest that suprasegmental linguistic rhythm is discernable solely from the movements of the mouth and the lips involved in speech production.

Our current results extend previous findings on the discrimination of speech through visual cues only. For example, it has been found that both bilingual and monolingual Spanish and Catalan speakers, but not speakers of English and Italian, can discriminate Catalan and Spanish using only visual cues (Soto-Faraco et al., 2007). Both monolingual and bilingual English- and Spanish-speaking adults have also been shown to discriminate between Spanish and English only on the basis of visual cues (Ronquest et al., 2010). Likewise, Navarra et al. (2014) showed that English and Spanish/Catalan adult speakers do exploit visual information concerning the temporal distribution of consonant and vowel intervals to discriminate languages that differ in this property, such as English and Japanese. These results suggest that adult listeners can discriminate between rhythmically similar (Spanish and Catalan) as well as rhythmically different (English and Spanish) languages when they know at least one of the languages. In most languages, auditory phrasal prominence is either iambic (e.g., in Italian, French and English) or trochaic (e.g., in Turkish, Japanese, and Persian). Our results show that Spanish-speaking adults can discriminate between iambic and trochaic phrasal rhythm using only visual cues. Importantly, because the stimuli of our experiments were recorded by German speakers and modeled on the iambic/trochaic subordinate clauses of German, while our participants were native speakers of Spanish, our results also support the view that language discrimination using only visual cues does not necessarily require knowledge of the target language. Because Spanish is an iambic language that signals phrasal prominence with duration at the end of phrases, our results with German stimuli also suggest that speakers of iambic languages are capable of discriminating between iambic and trochaic phrasal rhythms when only visual cues are available. Our results cannot directly attest whether speakers of trochaic languages can also discriminate iambic and trochaic phrasal rhythm in a similar manner. Previous findings with trochaic languages, such as Turkish and Persian, suggest that listeners violate the ITL by grouping syllables alternating in duration as well as in pitch trochaically (Langus et al., 2016). However, because in a discrimination task participants do not necessarily have to group syllables into phrases, and they can simultaneously rely on the location of the phonological phrase prominence (initial/final) as well as on the specific acoustic/visual correlate of prominence (pitch/duration/intensity), it is likely that speakers of trochaic languages would also be able to discriminate between two unknown languages differing in phrasal prominence.

Despite participants' high accuracy in discriminating phrasal prominence within the auditory and visual modalities, it is important to note that there were significant differences in their accuracy in matching prosodic contrasts across modalities. In fact, there is no perfect transfer of rhythm between the auditory and the visual modalities when they are presented consecutively. Participants in our experiments could only match iambic patterns from the visual to the auditory speech modality, and trochaic patterns from the auditory to the visual speech modality. While the representations of Iambs acquired through the visual perception of speech appear to be multimodal (i.e., they can be transferred to visual as well as to auditory targets), they are modality specific when perceived in the auditory modality (i.e., they can only be transferred to the auditory modality). Exactly the reverse is true for trochees, which are multimodal when acquired in the auditory modality but modality specific when perceived in the visual modality. The fact that participants appear to be highly consistent in auditory-to-visual transfer (trochees) and in visual-to-auditory transfer (iambs) suggests that these differences emerge from the representations of rhythm in the auditory and visual modalities of speech, rather than from general limitations of transfer.

Our results thus suggest that audiovisual speech perception is complementary: the information that is transferred from the auditory to the visual domain does not also need to be transferred from the visual to the auditory domain. At least for the audiovisual perception of phrasal rhythm, the processing of spoken language has found an ecological way to distribute the amount of information that must be transferred. This is problematic for theories that see audiovisual speech perception as invariably superior to speech perception in the auditory domain alone (Sumby and Pollack, 1954; Hardison, 2003). Our results suggest that perceiving speech audio-visually in adverse conditions is not always beneficial. Because the transfer of information depends on its direction (auditory-to-visual or visual-to-auditory), listeners will benefit differently when perceiving speech in a noisy environment or when the speaker's face is not clearly visible (Sumby and Pollack, 1954; Hardison, 2003). Thus, while in perfect conditions of audio-visual speech perception auditory and visual cues complement each other for a better understanding of speech, their usefulness in adverse conditions may differ considerably.

While it has been shown that facial movements (Weikum et al., 2007) and lip reading (Bristow et al., 2009; Yeung and Werker, 2013) are cues for language and phoneme discrimination, respectively, our results show that the prosodic information carried by the visual component of speech is also detectable from the speaker's lips. It is of course possible that pitch might be detected visually more clearly from the eyebrows (Krahmer and Swerts, 2004), as is the case in sign languages (Nespor and Sandler, 1999). Further research is needed to investigate the range of pitch information that can be detected in speech from the eyebrows. The extent to which information from the auditory and the visual modalities can be combined to aid the perception and comprehension of the prosody of human language will depend on how accessible the different aspects of language are in the specific modalities.

Our results show that hearing persons exploit the ITL to process speech in both the auditory and the visual modality, supporting once more a general mechanism responsible for grouping as predicted by the ITL.

Author Contributions

MP, MN, and AL designed the study, CG and DH acquired the data, AL and MP analyzed the data, MP, MN, and AL interpreted the data and wrote the manuscript. MP, AL, CG, DH, and MN approved the final version of the manuscript to be published and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The present research has received funding from FONDECYT # 1110928 to MP and European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement n° 269502 (PASCAL) to MN.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01708/full#supplementary-material

References

Barkhuysen, P., Krahmer, E., and Swerts, M. (2008). The interplay between the auditory and visual modality for end-of-utterance detection. J. Acoust. Soc. Am. 123, 354–365. doi: 10.1121/1.2816561

Beckman, M. E., and Edwards, J. (1994). “Articulatory evidence for differentiating stress categories,” in Phonological Structure and Phonetic Form: Papers in Laboratory Phonology III, ed P. A. Keating (New York, NY: Cambridge University Press), 7–33.

Bernstein, L. E., Eberhardt, S. P., and Demorest, M. E. (1989). Single-channel vibrotactile supplements to visual perception of intonation and stress. J. Acoust. Soc. Am. 85, 397–405. doi: 10.1121/1.397690

Bhatara, A., Boll-Avetisyan, N., Unger, A., Nazzi, T., and Höhle, B. (2013). Native language affects rhythmic grouping of speech. J. Acoust. Soc. Am. 134, 3828–3843. doi: 10.1121/1.4823848

Bion, R. A. H., Benavides, S., and Nespor, M. (2011). Acoustic markers of prominence influence infants' and adults' segmentation of speech sequences. Lang. Speech 54, 123–140. doi: 10.1177/0023830910388018

Blossom, M., and Morgan, J. L. (2006). “Does the face say what the mouth says? A study of infants' sensitivity to visual prosody,” in Proceedings of the 30th Annual Boston University Conference on Language Development, eds D. Bamman, T. Magnitskaia, and C. Zaller (Somerville, MA: Cascadilla Press).

Borràs-Comes, J., and Prieto, P. (2011). ‘Seeing tunes.’ The role of visual gestures in tune interpretation. Lab. Phonol. 2, 355–380. doi: 10.1515/LABPHON.2011.013

Bristow, D., Dehaene-Lambertz, G., Mattout, J., Soares, C., Gliga, T., Baillet, S., et al. (2009). Hearing faces: how the infant brain matches the face it sees with the speech it hears. J. Cogn. Neurosci. 21, 905–1002. doi: 10.1162/jocn.2009.21076

Brunellière, A., Sánchez-García, C., Ikumi, N., and Soto-Faraco, S. (2013). Visual information constrains early and late stages of spoken-word recognition in sentence context. Int. J. Psychophysiol. 89, 136–147. doi: 10.1016/j.ijpsycho.2013.06.016

Cavé, C., Guaïtella, I., Bertrand, R., Santi, S., Harlay, F., and Espesser, R. (1996). “About the relationship between eyebrow movements and F0 variations,” in Proceedings of the 4th International Conference on Spoken Language Processing (Philadelphia), 2175–2178.

Cho, T. (2005). Prosodic strengthening and featural enhancement: evidence from acoustic and articulatory realizations of /a,i/ in English. J. Acoust. Soc. Am. 117, 3867–3878. doi: 10.1121/1.1861893

Cho, T. (2006). “Manifestation of prosodic structure in articulation: evidence from lip kinematics in English,” in Laboratory Phonology 8: Varieties of Phonological Competence, eds L. M. Goldstein, D. H. Whalen, and C. T. Best (Berlin/New York, NY: Mouton de Gruyter), 519–548.

Christophe, A., Dupoux, E., Bertoncini, J., and Mehler, J. (1994). Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. J. Acoust. Soc. Am. 95, 1570–1580. doi: 10.1121/1.408544

Christophe, A., Guasti, M. T., Nespor, M., and van Ooyen, B. (2003). Prosodic structure and syntactic acquisition: the case of the head-complement parameter. Dev. Sci. 6, 213–222. doi: 10.1111/1467-7687.00273

de Jong, K. (1995). The supraglottal articulation of prominence in English: linguistic stress as localized hyperarticulation. J. Acoust. Soc. Am. 97, 491–504. doi: 10.1121/1.412275

Dohen, M., and Loevenbruck, H. (2005). “Audiovisual production and perception of contrastive focus in French: a multispeaker study,” in Interspeech 2005, 2413–2416. Available online at: http://www.isca-speech.org/archive/interspeech_2005 (Accessed February 2009).

Dohen, M., Wu, C. H., and Hill, H. (2008). “Auditory-visual perception of prosodic information: inter-linguistic analysis – contrastive focus in French and Japanese,” in Proceedings of the International Conference on Auditory-Visual Speech Processing 2008 (Moreton Island), 89–93.

Erickson, D. (2002). Articulation of extreme formant patterns for emphasized vowels. Phonetica 59, 134–149. doi: 10.1159/000066067

Erickson, D., Fujimura, O., and Pardo, B. (1998). Articulatory correlates of prosodic control: emotion and emphasis. Lang. Speech 41, 395–413.

Graf, H. P., Cosatto, E., Strom, V., and Huang, F. J. (2002). “Visual prosody: facial movements accompanying speech,” in Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (Washington, DC).

Granström, B., House, D., and Lundeberg, M. (1999). “Prosodic cues to multimodal speech perception,” in Proceedings of the 14th International Congress on Phonetic Sciences (San Francisco, CA), 655–658.

Guellaï, B., Langus, A., and Nespor, M. (2014). Prosody in the hands of the speaker. Front. Psychol. 5:700. doi: 10.3389/fpsyg.2014.00700

Hadar, U., Steiner, T. J., Grant, E. C., and Rose, F. C. (1983). Head movement correlates of juncture and stress at sentence level. Lang. Speech 26, 117–129.

Hadar, U., Steiner, T. J., Grant, E. C., and Rose, F. C. (1984). The timing of shifts of head postures during conversation. Hum. Mov. Sci. 3, 237–245. doi: 10.1016/0167-9457(84)90018-6

Hardison, D. M. (2003). Acquisition of second language speech: effects of visual cues, context, and talker variability. Appl. Psycholinguist. 24, 495–522. doi: 10.1017/S0142716403000250

Harrington, J., Fletcher, J., and Beckman, M. E. (2000). “Manner and place conflicts in the articulation of accent in Australian English,” in Papers in Laboratory Phonology V: Acquisition and the Lexicon, eds M. B. Broe and J. B. Pierrehumbert (Cambridge; New York, NY: Cambridge University Press), 40–51.

Hay, J. S., and Diehl, R. L. (2007). Perception of rhythmic grouping: testing the iambic/trochaic law. Percept. Psychophys. 69, 113–122. doi: 10.3758/BF03194458

House, D., Beskow, J., and Granström, B. (2001). “Timing and interaction of visual cues for prominence in audiovisual speech perception,” in Proceedings of Eurospeech 2001, eds P. Dalsgaard, B. Lindberg, H. Benner, and Z. Tan, 387–390. Available online at: http://www.isca-speech.org/archive/eurospeech_2001 (Accessed February 2009).

Iversen, J. R., Patel, A. D., and Ohgushi, K. (2008). Perception of rhythmic grouping depends on auditory experience. J. Acoust. Soc. Am. 124, 2263–2271. doi: 10.1121/1.2973189

Jusczyk, P. W. (1999). How infants begin to extract words from speech. Trends Cogn. Sci. 3, 323–328. doi: 10.1016/S1364-6613(99)01363-7

Krahmer, E., Ruttkay, Z., Swerts, M., and Wesselink, W. (2002). “Pitch, eyebrows and the perception of focus,” in Proceedings of Speech Prosody 2002, eds B. Bel and I. Marlien, 443–446. Available online at: http://www.isca-speech.org/archive/sp2002 (Accessed February 2009).

Krahmer, E., and Swerts, M. (2004). “More about brows: a cross-linguistic study via analysis-by-synthesis,” in From Brows to Trust: Evaluating Embodied Conversational Agents, eds Z. Ruttkay and C. Pelachaud (Dordrecht: Kluwer Academic Publishers), 191–216.

Langus, A., and Nespor, M. (2010). Cognitive systems struggling for word order. Cogn. Psychol. 60, 291–318. doi: 10.1016/j.cogpsych.2010.01.004

Langus, A., Seyed-Allaei, S., Uysal, E., Pirmoradian, S., Marino, C., Asaadi, S., et al. (2016). Listening natively across perceptual domains? J. Exp. Psychol. Learn. Mem. Cogn. 42, 1127–1139. doi: 10.1037/xlm0000226

Lansing, C. R., and McConkie, G. W. (1999). Attention to facial regions in segmental and prosodic visual speech perception tasks. J. Speech Lang. Hear. Res. 42, 526–539. doi: 10.1044/jslhr.4203.526

Massaro, D. W., and Beskow, J. (2002). “Multimodal speech perception: a paradigm for speech science,” in Multimodality in Language and Speech Systems, eds B. Granström, D. House, and I. Karlsson (Dordrecht: Kluwer Academic Publishers), 45–71.

McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature 264, 746–748. doi: 10.1038/264746a0

McNeill, D. (2005). Gesture and Thought. Chicago: University of Chicago Press.

Mueller, S. T., and Weidemann, C. T. (2008). Decision noise: an explanation for observed violations of signal detection theory. Psychon. Bull. Rev. 15, 465–494. doi: 10.3758/PBR.15.3.465

Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., and Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility. Psychol. Sci. 15, 133–137. doi: 10.1111/j.0963-7214.2004.01502010.x

Navarra, J., Soto-Faraco, S., and Spence, C. (2014). Discriminating speech rhythms in audition, vision, and touch. Acta Psychol. 151, 197–205. doi: 10.1016/j.actpsy.2014.05.021

Nespor, M., and Sandler, W. (1999). Prosody in Israeli sign language. Lang. Speech 42, 143–176. doi: 10.1177/00238309990420020201

Nespor, M., Shukla, M., and Mehler, J. (2011). “Stress-timed vs. syllable-timed languages,” in The Blackwell Companion to Phonology, eds M. van Oostendorp, C. J. Ewen, E. Hume, and K. Rice (Malden, MA: Wiley-Blackwell), 1147–1159.

Nespor, M., Shukla, M., van de Vijver, R., Avesani, C., Schraudolf, H., and Donati, C. (2008). Different phrasal prominence realization in VO and OV languages. Lingue e Linguaggio 7, 139–168. doi: 10.1418/28093

Nespor, M., and Vogel, I. (1986, 2007). Prosodic Phonology. Berlin: Mouton de Gruyter.

Peña, M., Bion, R. A. H., and Nespor, M. (2011). How modality specific is the iambic-trochaic law? Evidence from vision. J. Exp. Psychol. Learn. Mem. Cogn. 37, 1199–1208. doi: 10.1037/a0023944

Prieto, P., Puglesi, C., Borràs-Comes, J., Arroyo, E., and Blat, J. (2015). Exploring the contribution of prosody and gesture to the perception of focus using an animated agent. J. Phon. 49, 41–54. doi: 10.1016/j.wocn.2014.10.005

Ramus, F., Nespor, M., and Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition 73, 265–292. doi: 10.1016/S0010-0277(99)00058-X

Ronquest, R. E., Levi, S. V., and Pisoni, D. B. (2010). Language identification from visual-only speech signals. Atten. Percept. Psychophys. 72, 1601–1613. doi: 10.3758/APP.72.6.1601

Shukla, M., Nespor, M., and Mehler, J. (2007). An interaction between prosody and statistics in the segmentation of fluent speech. Cogn. Psychol. 54, 1–32. doi: 10.1016/j.cogpsych.2006.04.002

Smith, D., Attina, V., So, C., and Burnham, D. (2010). “The visual perception of lexical tone,” in Proceedings of 20th International Congress on Acoustics, ICA 2010 23-27 August 2010 (Sydney, NSW).

Snodgrass, J. G., Levy-Berger, G., and Haydon, M. (1985). Human Experimental Psychology. New York, NY: Oxford University Press.

Soto-Faraco, S., Navarra, J., Weikum, W. M., Vouloumanos, A., Sebastián-Gallés, N., and Werker, J. F. (2007). Discriminating languages by speech reading. Percept. Psychophys. 69, 218–237. doi: 10.3758/BF03193744

Srinivasan, R. J., and Massaro, D. W. (2003). Perceiving prosody from the face and voice: distinguishing statements from echoic questions in English. Lang. Speech 46, 1–22. doi: 10.1177/00238309030460010201

Sumby, W. H., and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215. doi: 10.1121/1.1907309

Vatikiotis-Bateson, E., and Yehia, H. (1996). Physiological modeling of facial motion during speech. Trans. Tech. Comm. Psychol. Physiol. Acoust. H-96-65, 1–8.

Weikum, W. M., Vouloumanos, A., Navarra, J., Soto-Faraco, S., Sebastián-Gallés, N., and Werker, J. F. (2007). Visual language discrimination in infancy. Science 316, 1159. doi: 10.1126/science.1137686

Yehia, H. C., Kuratate, T., and Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. J. Phon. 30, 555–568. doi: 10.1006/jpho.2002.0165

Yeung, H. H., and Werker, J. F. (2013). Lip movements affect infants' audiovisual speech perception. Psychol. Sci. 24, 603–612. doi: 10.1177/0956797612458802

Keywords: language, speech perception, visual perception, lip reading, iambic-trochaic law

Citation: Peña M, Langus A, Gutiérrez C, Huepe-Artigas D and Nespor M (2016) Rhythm on Your Lips. Front. Psychol. 7:1708. doi: 10.3389/fpsyg.2016.01708

Received: 21 January 2016; Accepted: 17 October 2016;
Published: 08 November 2016.

Edited by:

F.-Xavier Alario, Centre National de la Recherche Scientifique, France

Reviewed by:

Jordi Navarra, Sant Joan de Déu Research Foundation, Spain
Ulrike Domahs, Free University of Bozen-Bolzano, Italy

Copyright © 2016 Peña, Langus, Gutiérrez, Huepe-Artigas and Nespor. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Marcela Peña, mpenag@uc.cl

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.