# LANGUAGE AND COGNITION

EDITED BY: Kuniyoshi L. Sakai and Leonid Perlovsky PUBLISHED IN: Frontiers in Behavioral Neuroscience

#### *Frontiers Copyright Statement*

*© Copyright 2007-2015 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

*All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-627-2 DOI 10.3389/978-2-88919-627-2


### **LANGUAGE AND COGNITION**

Topic Editors: **Kuniyoshi L. Sakai,** The University of Tokyo, Japan **Leonid Perlovsky,** Harvard University and Air Force Research Laboratory, USA

Functional evidence of a syntax-related network.

In a given sentence with syntactic structures, the greater depth of merged subtrees, but not the linear order of words, elicited significant activation in the pars opercularis and pars triangularis of the left inferior frontal gyrus (L. F3op/F3t), as well as the left supramarginal gyrus (L. SMG). Activations (red) are projected onto the left (L) and right lateral surfaces of a standard brain. Taken from Ohta, Fukui, and Sakai (2013; doi: 10.3389/fnbeh.2013.00204).

Interaction between language and cognition remains an unsolved scientific problem. What are the differences in neural mechanisms of language and cognition? Why do children acquire language by the age of six, while taking a lifetime to acquire cognition? What is the role of language and cognition in thinking? Is abstract cognition possible without language? Is language just a communication device, or is it fundamental in developing thoughts? Why are there no animals with human thinking but without human language? Combinations even among 100 words and 100 objects (multiple words can represent multiple objects) exceed the number of all the particles in the Universe, and it seems that no amount of experience would suffice to learn these associations. How does the human brain overcome this difficulty?

Since the 19th century we have known about the involvement of Broca's and Wernicke's areas in language. What new knowledge of language and cognition areas has been found with fMRI and other brain imaging methods? Every year we know more about their anatomical and functional/effective connectivity. What can be inferred about mechanisms of their interaction, and about their functions in language and cognition? Why does the human brain show hemispheric (i.e., left or right) dominance for some specific linguistic and cognitive processes? Is understanding of language and cognition processed in the same brain area, or are there differences in language-semantic and cognitive-semantic brain areas? Is the syntactic process related to the structure of our conceptual world?

Chomsky has suggested that language is separable from cognition. In contrast, cognitive and construction linguistics have emphasized a single mechanism for both. Neither has led to a computational theory so far. Evolutionary linguistics has emphasized evolution leading to a mechanism of language acquisition, yet the proposed approaches also lead to incomputable complexity.

There are further related issues in linguistics and language education as well. Which brain regions govern the phonology, lexicon, semantics, and syntax systems, as well as their acquisition? What are the differences in acquisition of the first and second languages? Which mechanisms of cognition are involved in reading and writing? Do different writing systems affect relations between language and cognition? Are there differences in language-cognition interactions among different language groups (such as Indo-European, Chinese, Japanese, Semitic) and types (different degrees of analytic-isolating, synthetic-inflected, fused, agglutinative features)? What can be learned from sign languages?

Rizzolatti and Arbib have proposed that language evolved on top of an earlier mirror-neuron mechanism. Can this proposal answer the open questions about language and cognition? Can it explain mechanisms of language-cognition interaction? How does it relate to known brain areas and their interactions identified in brain imaging?

Emotional and conceptual contents of voice sounds in animals are fused. Evolution of human language has demanded splitting of emotional and conceptual contents and mechanisms, although language prosody still carries emotional content. Is it a dying-off remnant, or is it fundamental for interaction between language and cognition? If language and cognitive mechanisms differ, unifying these two contents requires motivation, hence emotions. What are these emotions? Can they be measured? Tonal languages use pitch contours for semantic contents; are there differences in language-cognition interaction between tonal and atonal languages? Are emotional differences among cultures exclusively cultural, or do they also depend on languages?

Interaction of language and cognition is thus full of mysteries, and we encouraged papers addressing any aspect of this topic.

**Citation:** Sakai, K. L., Perlovsky, L., eds. (2015). Language and Cognition. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-627-2

## Table of Contents


Carmelo M. Vicario and Raffaella I. Rumiati

*Why open-access publication should be nonprofit—a view from the field of theoretical language science*

Martin Haspelmath

*The importance of Open Access publishing in the field of Linguistics for spreading scholarly knowledge and preserving languages diversity in the era of the economic financial crisis*

Nicola L. Bragazzi

**EDITORIAL** published: 16 December 2014 doi: 10.3389/fnbeh.2014.00436

### Language and Cognition

#### *Leonid Perlovsky<sup>1</sup>\* and Kuniyoshi L. Sakai<sup>2</sup>\**

*<sup>1</sup> Department of Electrical Engineering, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA <sup>2</sup> Department of Basic Science, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan*

*\*Correspondence: lperl@rcn.com; sakai@mind.c.u-tokyo.ac.jp*

#### *Edited and reviewed by:*

*Nuno Sousa, University of Minho, Portugal*

**Keywords: language, cognition, brain, functional imaging, emotion**

Interaction between language and cognition remains an unsolved scientific problem. What are the differences in neural mechanisms of language and cognition? Why do children acquire language by the age of six, while taking a lifetime to acquire cognition? What is the role of language and cognition in thinking? Is abstract cognition possible without language? Is language just a communication device, or is it fundamental in developing thoughts? Why are there no animals with human thinking but without human language? Combinations even among 100 words and 100 objects (multiple words can represent multiple objects) exceed the number of all the particles in the Universe, and it seems that no amount of experience would suffice to learn these associations. How does the human brain overcome this difficulty?
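The scale of this combinatorial claim can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: it counts word-object associations in two simple ways (each word names exactly one object, or every word-object pair is independently present or absent) and compares the results with 10^80, a commonly quoted order-of-magnitude estimate for the number of particles in the observable Universe. Neither the counting schemes nor the particle estimate come from the text, and the function name is hypothetical.

```python
# Assumed order-of-magnitude estimate of particles in the observable Universe;
# this figure is an illustrative assumption, not a value given in the text.
PARTICLES_IN_UNIVERSE = 10 ** 80


def count_mappings(n_words: int, n_objects: int) -> tuple[int, int]:
    """Return (one_label_each, many_to_many) counts of word-object associations."""
    # Scheme 1: each word names exactly one object (n_objects choices per word).
    one_label_each = n_objects ** n_words
    # Scheme 2: each (word, object) pair is independently present or absent,
    # so multiple words can represent multiple objects.
    many_to_many = 2 ** (n_words * n_objects)
    return one_label_each, many_to_many


one_label_each, many_to_many = count_mappings(100, 100)

# Both counts exceed the assumed particle estimate by a wide margin.
print(one_label_each > PARTICLES_IN_UNIVERSE)
print(many_to_many > PARTICLES_IN_UNIVERSE)
# Number of decimal digits in the many-to-many count.
print(len(str(many_to_many)))
```

Under either counting scheme, exhaustively testing associations is infeasible, which is the point the paragraph above makes about learning from experience alone.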

Since the nineteenth century we have known about the involvement of Broca's and Wernicke's areas in language. What new knowledge about the brain regions responsible for language and cognition has been found with fMRI and other brain imaging methods? Every year we know more about their anatomical and functional/effective connectivity. What can be inferred about their interactions and functions in language and cognition? Why does the human brain show hemispheric (i.e., left or right) dominance for some specific linguistic and cognitive processes? Is linguistic and cognitive comprehension processed in the same or different regions? Do the syntactic processes affect the structure of our conceptual world?

Such issues regarding brain functions and mind have been drawing increasing attention from various fields in recent years, and investigations that go beyond the boundaries of previous fields of study are becoming necessary. The need for study spanning the brain and the mind has given birth to new disciplines such as cognitive neuroscience, neurolinguistics, and biolinguistics. We assume that mind is a part of brain function, and we tentatively define the mind as a combination of three main cognitive factors: perception, memory, and consciousness. Language is created by the mind, yet, once uttered, words return to the mind, where they are understood. The cycle from the mind to language and then from language back to the mind is *recursive*, in that the language produced by the mind comes back to the mind once again. This recursiveness is important when considering the relationship between language and mind.

When language and mind are viewed as a whole system, it is evident that the functions of language are part of the brain system while also being involved in the workings of the mind. Moreover, information is exchanged between language and each of perception, memory, and consciousness in both directions. Namely, language is involved in both reciprocal and recursive information exchange with each element of the mind. Since language is tightly linked to the mind, it would be more natural to assume that language is a part of the mind than to think of it as an entity that exists outside the mind. The study of language is, in essence, an attempt to understand a part of the "human" mind. The more we study the language used by humans, the more we will understand the structure of the mind.

Chomsky has suggested that language is separable from cognition (Berwick et al., 2013), and this notion has been well supported by functional imaging experiments in neuroscience (Sakai, 2005). In contrast, cognitive and construction linguistics have emphasized a single mechanism for both. Neither has led to a computational theory so far, but language is learned early in life with only limited cognitive understanding of the world (Perlovsky, 2009). Evolutionary linguistics has emphasized evolution leading to a mechanism of language acquisition, yet the proposed approaches also lead to incomputable complexity. Papers in this volume report new knowledge on interacting language and cognition, yet there remain more questions than answers.

In animals, emotional and conceptual contents of voice sounds are fused. Evolution of human language has demanded splitting of emotional and conceptual contents, as well as of their mechanisms, although language prosody still carries emotional content. Is it a dying-off remnant, or is it fundamental for interaction between language and cognition? If language and cognitive mechanisms differ, unifying these two contents requires motivation, hence emotions. What are these emotions? Can they be measured? If tonal languages use pitch contours for semantic contents, are there differences in language-cognition interaction between tonal and atonal languages? Are emotional differences among cultures exclusively cultural, or do they also depend on languages?

This volume introduces a broad range of research addressing these topics, including three opinion articles, one hypothesis and theory article, eight original research articles, and a pair of an opinion article and a general commentary article. Their summaries are as follows.

First, Perlovsky (2013) introduces joint acquisition, dual hierarchy, and emotional prosody of language and cognition, such that emotional prosody may perform a fundamental function in connecting sounds and meanings of words. Vicario (2013) discusses the *FOXP2* gene and language development, which might inform us about the origin of language. Perry and Lupyan (2013) explain that language and thought are different but strongly interacting abilities, based on the online manipulation of linguistic activity.

Next, Ohta et al. (2013) propose computational principles of syntax in the regions specialized for language, thereby integrating theoretical linguistics and functional neuroimaging. Nagels et al. (2013b) present an fMRI study on the neural substrates of figurative language during natural speech perception. De La Cruz et al. (2013) show that finger counting helps cognitive robots to learn words. Straube et al. (2013) suggest that abstract information conveyed by speech and gesture may be processed independent of modality. Tilles and Fontanari (2013) examine reinforcement and inference in cross-situational word learning. Nagels et al. (2013a) indicate the role of semantic abstractness and perceptual category in processing speech accompanied by gestures. Zhong et al. (2013) study a self-organizing pre-symbolic neural model representing sensorimotor information. Shuai and Gong (2013) analyze temporal relationships between top-down and bottom-up processing in lexical tone perception. Vicario and Rumiati (2013) demonstrate how notions of left and right affect processing of trading verbs.

We end the volume with a highly popular discussion on the role of open-access publications in linguistics, contributed by Haspelmath (2013) and Bragazzi (2013).

#### **REFERENCES**


Nagels, A., Kauschke, C., Schrauf, J., Whitney, C., Straube, B., and Kircher, T. (2013b). Neural substrates of figurative language during natural speech perception: an fMRI study. *Front. Behav. Neurosci.* 7:121. doi: 10.3389/fnbeh.2013.00121


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 13 November 2014; accepted: 01 December 2014; published online: 16 December 2014.*

*Citation: Perlovsky L and Sakai KL (2014) Language and Cognition. Front. Behav. Neurosci. 8:436. doi: 10.3389/fnbeh.2014.00436*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2014 Perlovsky and Sakai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Language and cognition—joint acquisition, dual hierarchy, and emotional prosody

#### *Leonid Perlovsky\**

*The AFRL and Athinoula A. Martinos Center for Biomedical Imaging, Harvard University, Charlestown, MA, USA \*Correspondence: lperl@rcn.com*

#### *Edited by:*

*Kuniyoshi L. Sakai, The University of Tokyo, Japan*

**Keywords: language, cognition, acquisition, dual hierarchy, prosody, emotion**

#### **FUNCTION OF LANGUAGE AND COGNITION IN THINKING**

Do we think with language, or is it just a communication device used for expression of completed thoughts? What is the difference between language and cognition? Chomsky (1995) suggested that these two abilities are separate and independent. Cognitive linguistics emphasizes a single mechanism for both (Croft and Cruse, 2004). Evolutionary linguistics considers the process of transferring language from one generation to the next (Cangelosi and Parisi, 2002; Christiansen and Kirby, 2003; Hurford, 2008). This process is a "bottleneck" that forms the language. Brighton et al. (2005) demonstrated the emergence of compositional language due to this bottleneck. Still, none of these approaches has resulted in a computational theory explaining how humans acquire language and cognition. Here I discuss a computational model that overcomes previous difficulties and is based on the hypothesis that language and cognition are two separate but closely integrated abilities. I identify their functions and discuss why human thinking ability requires both language and cognition.

Among the fundamental mechanisms of cognition are mental representations, memories of objects and events (Perlovsky, 2001, 2006a). The surrounding world is understood by matching mental representations to patterns in sensor signals. However, mathematical modeling of this process since the 1950s has met with difficulties. The first difficulty is related to the need to consider combinations of sensor signals, objects, and events. The number of combinations is very large: even a limited number of signals or objects form a very large number of combinations, exceeding all interactions of all elementary particles in the lifetime of the Universe (Perlovsky, 1998). This is known as combinatorial complexity (CC). This difficulty in modeling the mind has been overcome by dynamic logic (Perlovsky, 2001, 2006a,b, 2007a; Perlovsky et al., 2011). Whereas classical logic considers static statements such as "this is a chair," dynamic logic models processes from vague to crisp representations. These processes do not need to consider combinations: an initial vague state of a "chair" matches any object in the field of view, and at the end of the process it matches the chair actually present, without CC.
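The vague-to-crisp idea can be illustrated with a toy computation. The sketch below is not the published dynamic logic model; it is a loose analogue (soft clustering with a shrinking Gaussian width) chosen only to show how models that initially match everything can settle on specific data without enumerating assignments. The function name, the one-dimensional data, the geometric width schedule, and all parameter values are hypothetical choices made for illustration.

```python
import math


def dynamic_logic_fit(data, means, sigma_start=3.0, sigma_end=0.3, steps=30):
    """Soft-assign data to models while model vagueness (sigma) shrinks.

    Illustrative analogue only: 'vague' means a wide Gaussian that matches
    every data point; 'crisp' means a narrow one that matches only nearby points.
    """
    means = list(means)
    for step in range(steps):
        # Geometric schedule from vague (large sigma) to crisp (small sigma).
        t = step / (steps - 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** t
        # Association weight of every data point with every model,
        # normalized so the weights for each point sum to one.
        weights = []
        for x in data:
            sims = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            total = sum(sims)
            weights.append([s / total for s in sims])
        # Move each model to the weighted mean of the data it partly explains.
        for k in range(len(means)):
            wk = [w[k] for w in weights]
            means[k] = sum(w * x for w, x in zip(wk, data)) / sum(wk)
    return means


# Two groups of "sensor signals"; both models start vague and near the middle.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
fitted = dynamic_logic_fit(data, means=[2.0, 4.0])
print(fitted)
```

No step of this loop enumerates the possible assignments of data points to models; each iteration costs only the number of points times the number of models, which is the sense in which a vague-to-crisp process sidesteps combinatorial search in this toy setting.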

The second difficulty is similar but even more complex. It is related to the fact that "events" and "situations" in the world do not necessarily exist "ready for cognition." There are many combinations of percepts and objects, a near infinity, and events and situations important for understanding and learning have to be separated from those that are just random collections of meaningless percepts or random objects (Perlovsky and Ilin, 2012). Events and situations recognized by non-human animals are very limited compared to human abilities to differentiate events in the world. Human cognitive abilities acquire their power due to language. Language is "easier" to learn than cognitive representations. Language representations (words, phrases) exist in the surrounding language "ready made," created during millennia of cultural evolution. Therefore, language could be learned without much real-life experience; only interactions with language speakers are required. Every child learns language early in life before acquiring full cognitive understanding of events and their cognitive meanings. Thus, language is learned early in life with only limited cognitive understanding of the world (Perlovsky, 2009a, 2012c). Cognitive representations of situations and abstract concepts initially exist in vague states. Throughout the rest of life, language guides acquisition of cognitive representations from experience. Vague cognitive representations become more crisp and concrete. Thinking involves both language and cognition, and, as we discuss later, thinking about abstract ideas usually involves language more than cognition, not too different from thinking by children.

#### **THE DUAL HIERARCHY**

Cognitive representations are organized in the mind in an approximate hierarchy (Grossberg, 1988) from sensor-motor percepts near the "bottom," to objects "higher up," to situations, and to still more abstract cognitive representations. Language representations are organized in a parallel hierarchy from sounds, and words for objects and situations, to phrases, and to more abstract language representations. Our previous discussion can be described by an integrated mathematical model of language and cognition forming a dual hierarchy (Perlovsky, 2009a), as illustrated in **Figure 1**. Neural evidence suggests that the hierarchy is approximate, not as definite as shown in this figure.

Hierarchical organization of cognition and related brain structures is reviewed in Badre (2008). In particular, the anterior-posterior axis corresponds to a gradient of abstract-concrete cortex functions. Hierarchical organization of language functions is also well established. However, hierarchical organization of language does not correspond to a particular spatial axis in the brain; it is distributed (Price, 2012). Therefore, the dual hierarchy in **Figure 1** is a functional hierarchy, not one organized along a spatial axis in the brain as the figure might suggest. A fundamental aspect of acquiring mental representations is interaction between higher and lower layer representations (top and bottom layers). In this interaction, lower-layer representations are organized into more abstract and general concept-representations at a higher layer. These interactions are referred to as bottom-up and top-down signals (BU and TD), indicated in **Figure 1** by vertical arrows.

A mathematical model of the dual hierarchy is described in Perlovsky (2009a, 2012c) and Perlovsky and Ilin (2010, 2012). This model explains many facts about thinking, language, and cognition that have remained unexplained and would be considered mysteries if they were not so commonplace.

The dual model makes a number of experimentally testable predictions. (1) It explains the functions of language and cognition in thinking: cognitive representations model the surrounding world, relations between objects, events, and abstract concepts. Language stores culturally accumulated knowledge about the world, yet language is not directly connected to objects, events, and situations in the world. Language guides acquisition of cognitive representations from random percepts and experiences, according to what is considered worth learning and understanding in the culture. Events that are not described in language are likely not even noticed or perceived in cognition. (2) Whereas language is acquired early in life, acquiring cognition takes a lifetime. The reason is that language representations exist in the surrounding language "ready-made"; acquisition of language requires only interaction with language speakers and does not require much life experience. Cognition, in contrast, requires life experience. (3) This is the reason why abstract words excite only language regions of the brain, whereas concrete words also excite cognitive regions (Binder et al., 2005). The dual model predicts that abstract concepts are often understood as word descriptions, but not in terms of objects, events, and relations among them. (4) This model explains why language is acquired early in life, whereas cognition takes a lifetime. It also explains why children can acquire the entire hierarchy of language, including abstract words, without the experience necessary for understanding them. (5) Since dynamic logic is the basic mechanism for learning language and cognitive representations, the dual model suggests that language representations become crisp after language is learned (5–7 years of age), whereas cognitive representations may remain vague for much longer; the vagueness is exactly the meaning of "continuing learning," which takes longer for more abstract and less used concepts. (6) The dual model gives a mathematical description of the recursion mechanism (Perlovsky and Ilin, 2012). Whereas Hauser et al. (2002) postulate that recursion is a fundamental mechanism in cognition and language, the dual model suggests that recursion is not fundamental: the hierarchy is a mechanism of recursion.

(7) Another mystery of human cognition, not addressed by cognitive or language theories, is basic human irrationality. This has been widely discussed and experimentally demonstrated following the discoveries of Tversky and Kahneman (1974), leading to the 2002 Nobel Prize. According to the dual hierarchy model, the "irrationality" originates from the dichotomy between cognition and language. Language is crisp and conscious, while cognition might be vague and ignored when making decisions. Yet, collective wisdom accumulated in language may not be properly adapted to one's personal circumstances, and may therefore be irrational in a concrete situation. In the 12th century Maimonides wrote that Adam was expelled from paradise because he refused original thinking using his own cognitive models, but ate from the tree of knowledge and acquired the collective wisdom of language (Levine and Perlovsky, 2008).

#### **EMOTIONAL PROSODY AND ITS COGNITIVE FUNCTION**

The dual model implies connections between language and cognitive representations, indicated by a wide horizontal arrow in **Figure 1**. These neural connections have to be developed and maintained. This requires motivation, in other words, emotions. These emotions must be in addition to utilitarian meanings of words, otherwise only practically useful words would be connected to their cognitive meanings. Also these emotions must "flow" from language to cognition, so that language is able to perform its cognitive function of guiding acquisition of cognitive representations, organizing experience according to cultural contents of language. These emotions therefore must be contained in language sounds, before cognitive contents are acquired.

This requirement of emotionality of language sounds is surprising and contradicts the assumed direction of the evolution of language. Evolution of the language ability required rewiring of the human brain in the direction of freeing vocalization from uncontrollable emotions (Deacon, 1997; Perlovsky, 2009b). Yet, the dual model requires that language sounds be emotional. Emotionality of the human voice is most pronounced in songs (Perlovsky, 2010, 2012a,d, 2013b). Emotions of everyday speech are low, unless affectivity is specifically intended. We may not notice emotions in everyday "non-affective" speech. Nevertheless, this emotionality is important for developing the cognitive part of the dual model. If a language is highly emotional, speakers are passionate about what they say; however, evolving new meanings might be slow, because emotional ties of sounds to old meanings might be "too strong." If a language is low-emotional, new words are easy to create; however, motivation to develop the cognitive part of the dual model might be low, and the real-world meaning of language sounds might be lost. Cultural values might be lost as well. Indeed, languages differ in how strong the emotional connections between sounds and meanings are. This leads to cultural differences. Thus, the dual model leads to the Emotional Sapir-Whorf Hypothesis (Perlovsky, 2007b, 2009b, 2012b). The strength of emotional connections between sound and meaning depends on language inflections. In particular, after English lost most of its inflections, it became a low-emotional language, powerful for science and engineering. At the same time, English is losing autonomous connections to cultural values that used to be partially inherent in language sounds. The fast change of cultural values during the recent past is usually attributed to progress in thinking, whereas effects of change in the emotionality of language sounds have not been noticed.

Emotional prosody can be important for overcoming cognitive dissonance. Cognitive dissonance is a discomfort due to holding contradictory cognitions (Festinger, 1957; Harmon-Jones et al., 2009). It is resolved by discarding contradictions. If a new word contradicts existing knowledge, its meaning might be discarded. Emotional prosody as well as songs could be fundamental mechanisms that overcome cognitive dissonance and enable keeping new contradictory knowledge (Masataka and Perlovsky, 2012; Perlovsky, 2013a).

#### **CONCLUSION AND EXPERIMENTAL PREDICTIONS**

This article advances a hypothesis about the functions of language and cognition in thinking, and a possible model of their interactions. This is the only computable model explaining a number of mysteries about language and cognition and overcoming computational difficulties. It makes a number of predictions that could be experimentally tested, including the following: cognitive representations model the world, while language representations only model language; abstract cognitive representations can only be acquired due to language; abstract cognition is more clearly represented in language, whereas cognitive representations may remain vague throughout life.

#### **ACKNOWLEDGMENTS**

I am thankful for discussions with my colleagues, Michel Cabanac and Nobuo Masataka.

#### **REFERENCES**

Badre, D. (2008). Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes. *Trends Cogn. Sci.* 12, 193–200. doi: 10.1016/j.tics.2008.02.004


(Heidelberg: Springer Verlag), 73–108. doi: 10.1007/978-3-540-73267-9\_5


*Received: 26 August 2013; accepted: 02 September 2013; published online: 19 September 2013.*

*Citation: Perlovsky L (2013) Language and cognition joint acquisition, dual hierarchy, and emotional prosody. Front. Behav. Neurosci. 7:123. doi: 10.3389/fnbeh.2013.00123*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2013 Perlovsky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### *FOXP2* gene and language development: the molecular substrate of the gestural-origin theory of speech?

#### *Carmelo M. Vicario\**

*School of Psychology, The University of Queensland, Brisbane, QLD, Australia \*Correspondence: uqcvicar@uq.edu.au*

*Edited by:*

*Kuniyoshi L. Sakai, The University of Tokyo, Japan*

The view that language evolved from a primarily gestural mode of communication has its roots in the speculations of 18th-century philosophers (Vico, 1953/1744; de Condillac, 1971/1756).

Over time, these philosophical ideas have gradually gained substance thanks to research in psychology and neuroscience, which has provided exciting evidence in support of the so-called gestural-origin theory of language (Corballis, 2002). This theory assigns gestures a precise role in language development. In particular, in evolutionary terms it has been suggested that spoken language evolved from an ancient communication system using arm gestures. Accordingly, it has been suggested that gestures of the mouth might have been added to the manual system to form a combined manuofacial gestural system (Corballis, 2002; Gentilucci and Corballis, 2006).

The literature on mirror neurons has provided strong support for the gestural-origin theory thanks to evidence of a close relationship between arm (Gallese et al., 1996; Rizzolatti et al., 1996) and mouth actions (Ferrari et al., 2003) in the brain of non-human primates. In particular, it has been shown that area F5 of the monkey premotor cortex also includes a class of neurons that discharge when the animal grasps an object with either the hand or the mouth (Rizzolatti et al., 1988). However, although a recent meta-analysis of 125 human fMRI studies (Molenberghs et al., 2012) identified a core network of human brain regions that possess mirror properties associated with action observation and execution, there is also a literature that challenges the existence of these neurons in humans. In fact, direct evidence for the existence of mirror neurons in humans is still lacking (Dinstein et al., 2008; Hickok, 2009). Moreover, some studies provide data against it. For example, the study of Lingnau et al. (2009) failed to find adaptation for motor acts that were first executed and then observed (as found for executed motor acts, when these were preceded by execution or observation of the same motor act) in the brain areas that are typically considered to be endowed with mirror properties. This implies that the link between motor gestures and spoken language is not necessarily mediated by neurons with "mirror" properties.

Clues in support of the gestural-origin theory of language are also provided by research on humans. For example, grasping larger objects (Gentilucci et al., 2001) and bringing them to the mouth (Gentilucci et al., 2004) induces selective increases in parameters of lip kinematics and voice spectra of syllables pronounced simultaneously with action execution. Moreover, it has been reported that repetitive transcranial magnetic stimulation of Broca's area affects verbal responses to gesture observation (Gentilucci et al., 2006). This suggests that Broca's area is probably involved in the simultaneous control of gestures and word pronunciation.

Neuroimaging studies on humans provide further support for this direct link between gestures and verbal language. Sakai et al. (2005) used functional magnetic resonance imaging to examine hemispheric dominance during the processing of signed and spoken sentences. Their study comprised two conditions: (i) the sign condition, with sentence stimuli in Japanese Sign Language (JSL), in which Deaf signers of JSL and hearing bilinguals (children of Deaf adults, CODA) of JSL and Japanese (JPN) were tested; (ii) the speech condition, in which hearing monolinguals (Mono) of JPN were tested with auditory JPN stimuli alone (AUD), or with an audiovisual presentation of JPN and JSL stimuli (A and V). The authors found that the ventral part of the left inferior frontal gyrus (F3t/F3O) showed no main effects of modality condition, providing evidence in support of the existence of a common area for the processing of linguistic information from both signed and spoken sentences. Moreover, the common involvement of the left area 7A in the superior parietal lobule has recently been documented during a sequenced button-press task and a task requiring the repetition of a sequence of different syllables (Heim et al., 2012). These data demonstrate the existence of a common cortical module in area 7A for sequencing vocal gestures and hand motor actions.

Finally, support for the gestural-origin theory of language comes from the study of human infants. For example, Fogel and Hannan (1985) provided evidence of gesture-vocalization synchrony in 2- and 3-month-old human infants. Word comprehension in children between 8 and 10 months and word production between 11 and 13 months are typically accompanied by deictic gestures (Volterra et al., 1979; Bates and Snyder, 1987). Deictic gestures (referring to an object or location) are particularly important since they allow reference to grow from the immediate context toward abstraction by helping infants understand the link between symbols and referents (De Villiers Rader and Zukow-Goldring, 2010). Moreover, deictic gestures seem able to predict linguistic development in both typical and atypical human populations across many cultures (Iverson and Goldin-Meadow, 2005). All these studies suggest that gestures provide a foundation for each new stage in early linguistic development.

A recent discovery in the field of genetics seems to provide new insights in support of the gestural-origin theory. In particular, evidence suggests that the *FOXP2* gene, located on human chromosome 7 (Fisher et al., 1998), could be the molecular substrate linking speech with gesture. In fact, this gene is involved not only in speech production and comprehension but also in gesture coordination.

In an early work, Gopnik (1990) argued that the *FOXP2* gene is involved in the development of morphosyntax. For this reason, this gene has been identified more broadly as the "grammar gene" (Pinker, 1994). However, a subsequent investigation suggested that the core deficit associated with the abnormal expression of this gene is one of articulation, with grammatical impairment as a secondary outcome (Watkins et al., 2002). Thus, it was proposed (Corballis, 2004) that this gene may play a role in the incorporation of vocal articulation, but has little to do with grammar itself. In support of this suggestion, it has been reported that *FOXP2* shows overlapping expression patterns within the brains of zebra finches and fetal human brains, particularly in subcortical regions that play important roles in sensorimotor integration and in coordinated movements important for vocalization and speech (Teramitsu et al., 2004). Moreover, recent studies on humans extend the role of the *FOXP2* gene to the coordination of upper limb movements. For example, the recent study of Peter et al. (2011) found an influence of the *FOXP2* gene on several language processing tasks, such as nonword repetition, real word reading efficiency, and rapid oral reading. Interestingly, they also documented an effect of this gene on rapid motor sequencing ability, which included finger movements.

Another recent work (Wilcke et al., 2012) has shown that a single nucleotide polymorphism (rs12533005) of the *FOXP2* gene can be associated with congenital dyslexia. Interestingly, difficulty with sequential finger movements is another type of deficit that may characterize reading disorders (Tiffin-Richards et al., 2004). Furthermore, *FOXP2* mutations seem to account for childhood apraxia of speech (CAS) (MacDermot et al., 2005; Laffin et al., 2012), which is characterized by problems in saying sounds, syllables, and words. Peter (2012) recently described the CAS *FOXP2* phenotype in multi-generational families as characterized not only by deficits in sequential processing at the level of alternating oral motor movements, which is consistent with the traditional definition of CAS as a motor programming disorder, but also by deficits in sequential hand movements.

All these studies provide suggestive evidence that the *FOXP2* gene might be the molecular substrate linking gestures with verbal language. However, the research in support of this hypothesis is still limited, although there are promising fields of investigation. For example, it would be interesting to assess the impact of *FOXP2* gene polymorphisms on linguistic and manual skills in healthy adults. Such an investigation would not only provide further support for the molecular substrate hypothesis for the gestural-origin theory of speech, but could also have practical implications for developmental and educational psychology, as it might allow an early assessment of the risk for dyslexia and/or dysgraphia in children.

Other issues worthy of investigation include the expression of *FOXP2* in individuals with special linguistic and/or manual skills (e.g., polyglots, talented painters), and the influence of particular socio-environmental factors on its expression, which in turn might influence the linguistic and/or manual skills of healthy individuals.

Finally, it would be intriguing to evaluate whether *FOXP2* genetic variations influence the resilience of linguistic and/or manual functions in patients affected by stroke.

#### **REFERENCES**


*Locke's Essay on the Human Understanding (A facsimile reproduction of the 1756 translation by T. Nugent of Condillac's 1747 essay)*. Gainesville, FL: Scholars' Facsimiles and Reprints.


action: a qualitative look," in *The Emergence of Symbols: Cognition and Communication in Infancy,* eds E. Bates, L. Benigni, I. Bretherton, L. Camaioni, and V. Volterra (New York, NY: Academic Press), 141–222.


*Received: 09 July 2013; accepted: 18 July 2013; published online: 05 August 2013.*

*Citation: Vicario CM (2013) FOXP2 gene and language development: the molecular substrate of the gestural-origin theory of speech? Front. Behav. Neurosci. 7:99. doi: 10.3389/fnbeh.2013.00099*

*Copyright © 2013 Vicario. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### What the online manipulation of linguistic activity can tell us about language and thought

#### *Lynn K. Perry\* and Gary Lupyan*

*Department of Psychology, University of Wisconsin-Madison, Madison, WI, USA \*Correspondence: lkperry@wisc.edu*

*Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

**Keywords: verbal interference, transcranial direct current stimulation (tDCS), language and thought, linguistic relativity, labeling**

Questions about the relationship between language and thought have long fascinated psychologists, philosophers, and the general public. One specific question is the extent to which verbal labels causally impact cognitive processes—how does calling an object by a particular name influence the way people categorize it; how does knowing words for mental states influence our reasoning about the minds of others; how does learning and using words like *left* influence our navigation behavior? One way to learn how the words we use to label objects, mental states, or locations affect our thoughts is to increase or decrease the ease with which we can use these words and observe outcomes of these manipulations on "non-linguistic" tasks. For example, if the word *left* enables us to remember which way to turn, preventing its activation might be expected to disrupt navigation. Manipulating the labeling process (and the engagement of language more broadly) is therefore very useful in exploring how language influences cognition. In this paper, we review two methodologies for implementing linguistic manipulations: verbal interference and transcranial direct current stimulation (tDCS), and discuss what we can learn about the role of language in cognitive processes from this line of research.

#### **VERBAL INTERFERENCE**

The primary method for manipulating linguistic activity is the use of verbal interference. The logic of verbal interference is well summarized by Winawer et al. (2007) in the context of studying the effects of language on color perception:

"[I]f linguistic processes play an active, online role in perceptual tasks, then a verbal dual task, but not a non-linguistic dual task, should diminish the goluboy/siniy [light blue / dark blue] category advantage found in Russian speakers," (Winawer et al., 2007, p. 7781).

Winawer et al. reported that Russian speakers appear to perceive a larger perceptual distinction between light and dark blues, consistent with a lexical difference between these categories in Russian. One possibility is that these differences stem from long-term perceptual learning caused by years of distinguishing colors in one language (e.g., Özgen and Davies, 2002). An alternative is that the cross-linguistic perceptual differences arise from online top-down influences of language (e.g., Fonteneau and Davidoff, 2007; Lupyan, 2008). If so, then disrupting these top-down effects should eliminate the cross-linguistic difference, which is the pattern observed in the study. It is important to note that an effect of verbal interference does not require the involvement of language to be strategic. Rather, the involvement could be one of "the spontaneous but unspoken use of lexical codes," (Gilbert et al., 2006, p. 489). On this account, previous associations of using the word *blue* to describe the color blue cause the word to become reactivated when the color is seen (Lupyan, 2012a,b). The automatic recruitment of color labels may temporarily warp the perceptual space, producing cross-linguistic differences of the sort observed by Winawer et al.

Using similar logic, verbal interference has been used to argue for a role of language in number concepts (Frank and Barner, 2012), spatial memory (Hermer-Vazquez et al., 1999), categorization (Lupyan, 2009), and theory of mind (Newton and de Villiers, 2007). A similar logic underlies behavioral *up*-regulation of linguistic processes through redundant presentation of labels (e.g., Lupyan et al., 2007) or overt self-directed speech (Lupyan and Swingley, 2012). If internally generated labels support some cognitive or perceptual process, then redundant externally-presented labels can be thought to *up-regulate* the linguistic contribution and verbal interference to *down-regulate* it.

Verbal interference has also been used to examine mechanisms of developmental change. For example, in a task requiring participants to locate an object hidden in the corner of a room, children rely on the room's shape, while adults use one of the walls (painted in a distinct color from the other walls) to find the object (Hermer and Spelke, 1994; cf. Twyman and Newcombe, 2010). While developments in spatial language correlate with developments in using the wall as a landmark, this correlation cannot tell us whether language causally influences spatial memory. In order to determine whether language supports development of spatial memory, Hermer-Vazquez et al. (1999) used a verbal interference paradigm. When adults performed the task while shadowing speech, their performance mirrored that of children. Thus, in addition to improving our understanding of how language affects cognition in real time, verbal interference has improved our understanding of how influences of language on cognition develop.

However, the use of verbal interference is not without problems. First, despite being used for many years across many domains, there is at present no working theory of verbal interference. Put bluntly: no one is sure how it works. Its use is vaguely based on the notion that language is a unitary system that can be disrupted, but this assumption is potentially problematic given the degree to which language-related activity is distributed both anatomically (Jung-Beeman, 2005) and functionally; e.g., syntax and semantics, rather than being entirely dissociable systems, tightly interact (MacDonald et al., 1994). Additionally, there is little agreement as to what constitutes verbal interference. Researchers have used numerous tasks, with little theoretical basis for choosing one over another. Tasks falling under the umbrella of verbal interference have included: rehearsing multi-digit numbers for later memory tests (e.g., Gilbert et al., 2006; Lupyan, 2009); repeating the letters "a,b,c" (e.g., Emerson and Miyake, 2003); alternating between naming months and days (e.g., Baddeley et al., 2001); making rhyme judgments (Roberson et al., 2007); answering factual questions such as "What is your name?" (e.g., Hatano et al., 1977); and repeating text, known as speech shadowing (Hermer-Vazquez et al., 1999; Frank and Barner, 2012). These interference tasks differ both in general difficulty and in the ease with which performance can be assessed. For example, an interference task with a memory component gives an indication of how well participants rehearsed information (a proxy for how much effort was put into the verbal task), whereas repetition of "a,b,c" does not. Such inconsistencies make it hard (1) to infer why verbal interference sometimes interferes with primary task performance and sometimes does not and (2) to assess what aspect of language is recruited in the primary task which, when interfered with, disrupts performance.

A final problem with verbal interference is that it requires participants to perform two tasks simultaneously. It is therefore necessary to use a control interference task to determine which changes in primary task performance stem from the manipulation of specifically linguistic processes and which stem from having to perform two tasks. Control tasks also vary widely across experiments: from tests of visuospatial memory (e.g., Gilbert et al., 2006; Lupyan, 2009) to foot tapping (Baddeley et al., 2001; Emerson and Miyake, 2003) and rhythm-shadowing (Hermer-Vazquez et al., 1999). Importantly, unless interference tasks are equated in all ways except for their "verbality," little can be said about the role of language in the primary task. For example, it has been argued that verbal, but not non-verbal, interference disrupted performance on a false-belief task (Newton and de Villiers, 2007). However, when the interference tasks were better equated for difficulty, both were similarly disruptive (Dungan and Saxe, 2012). Additionally, because verbal interference uses a dual-task paradigm, participants can put different amounts of effort into the two tasks. Such differential effort is difficult both to measure and to control.

Despite their appeal, verbal interference paradigms have clear shortcomings. Below, we outline an alternative way of perturbing linguistic processes that solves *some* of these shortcomings, and advocate for systematic cross-method comparisons of linguistic perturbation methods to more fully inform our understanding of how language augments cognition and perception.

#### **TRANSCRANIAL DIRECT CURRENT STIMULATION**

One way to avoid some of the challenges posed by verbal interference is to manipulate language processing without secondary tasks, through noninvasive brain stimulation. Here, we focus on one such method, tDCS, a painless method of regulating cortical excitability by applying weak electrical current to the scalp, which allows the experimenter to subtly up- and down-regulate neural activity over cortical areas implicated in language processing. For example, using tDCS to up-regulate activity over Wernicke's area (associated with aspects of labeling, particularly comprehension of word meaning; e.g., Price, 2000) is associated with increased ability to map novel words to pictures (Flöel et al., 2008); using it to up-regulate activity over Broca's area (associated with linguistic processes such as speech production; e.g., Gernsbacher and Kaschak, 2003) is associated with increased artificial grammar learning (de Vries et al., 2010).

Similar to using verbal interference, using tDCS to study linguistic influences on cognition assumes that language is a system that can be selectively perturbed. However, tDCS avoids the need to use dual task paradigms. The participant simply performs the main task while undergoing tDCS which, depending on the electrode arrangement, either up- or down-regulates cortical activity in targeted regions. Up-regulating activity is theoretically analogous to behavioral up-regulation through presentation of overt labels (Lupyan, 2008) or behavioral self-directed speech (Lupyan and Swingley, 2012); down-regulating activity is theoretically analogous to verbal interference.

Recently, Lupyan et al. (2012) examined effects of tDCS on non-verbal categorization, showing that down-regulating activity over Broca's area was associated with impairments in the ability to form categories that required selectively representing specific perceptual features to the exclusion of others (e.g., GREEN THINGS), a deficit similar to one shown by individuals performing verbal interference (Lupyan, 2009) or by those with language impairments such as aphasia (Davidoff and Roberson, 2004; Lupyan and Mirman, 2013). By avoiding the pitfalls of a dual-task design, however, Lupyan and colleagues' tDCS study allows for more definitive conclusions about the mechanisms by which language might affect categorization: all participants completed the same task, and between-group differences can be linked to changes in neural activity in a particular cortical region. This provides at least a foothold for connecting behavioral data on the role of language in categorization to particular neural mechanisms.

#### **IMPORTANT FUTURE CONSIDERATIONS**

Although tDCS may be well-suited for manipulating linguistic activity, at present there are no direct comparisons of tDCS to verbal interference. It would be useful to know whether domains in which we have seen effects of verbal interference on performance are affected by tDCS over areas associated with language processes and whether domains in which we have *not* previously seen effects are similarly *unaffected*. For example, tDCS can be used to determine if language-related cortical regions shown to be recruited in non-verbal color-judgment tasks (e.g., Ting Siok et al., 2009) are *causally* implicated, by showing that perturbations with tDCS have behavioral consequences in color judgment, thereby informing the neural basis of language-augmented perception. tDCS also has the potential to open up additional domains to investigate effects of language on cognition without the error-fraught hunt for perfect control interference tasks.

In sum, we advocate: (1) a more systematic comparison of linguistic perturbation methods, specifically comparing behavioral linguistic perturbations to perturbations utilizing stimulation techniques such as tDCS; (2) a more rigorous comparison of different verbal interference tasks; and (3) a theory that elucidates the mechanisms by which verbal interference actually works. Such an examination will be critical to clarifying the contributions of language to cognition, helping us answer such questions as whether the mechanisms by which language affects perception of color categories are in some broad sense similar to mechanisms by which language affects spatial cognition. Performing the cross-method comparisons we advocate will highlight possible contradictions leading to further theoretical refinement. For example, what would it mean for the underlying cognitive and neural processes if one verbal interference method affects a primary task but another does not?

The extant empirical literature on effects of language on cognition and perception takes us considerably beyond the question as it is often phrased: "Does language affect thought," (Boroditsky, 2010a). This literature (e.g., Gentner and Goldin-Meadow, 2003; Casasanto, 2008; Boroditsky, 2010b; Lupyan, 2012a,b), while rich in demonstrations, requires a more rigorous investigation of the mechanisms by which learning and using language augment and perhaps fundamentally alter cognition and perception. We believe significant clarity on this important question can be achieved by combining creative uses of linguistic perturbation techniques with theoretical refinement of their mechanisms.

#### **ACKNOWLEDGMENTS**

We would like to thank Pierce Edmiston for his comments on an earlier version of this paper.

#### **REFERENCES**


evidence for a "category adjustment" model. *Mem. Cogn.* 35, 1814–1829. doi: 10.3758/BF03193512


*Received: 15 July 2013; accepted: 29 August 2013; published online: 17 September 2013.*

*Citation: Perry LK and Lupyan G (2013) What the online manipulation of linguistic activity can tell us about language and thought. Front. Behav. Neurosci. 7:122. doi: 10.3389/fnbeh.2013.00122*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2013 Perry and Lupyan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Computational principles of syntax in the regions specialized for language: integrating theoretical linguistics and functional neuroimaging

#### *Shinri Ohta1,2, Naoki Fukui 3,4 and Kuniyoshi L. Sakai 1,4,5\**

*<sup>1</sup> Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan*

*<sup>2</sup> Japan Society for the Promotion of Science, Tokyo, Japan*

*<sup>3</sup> Department of Linguistics, Sophia University, Tokyo, Japan*

*<sup>4</sup> CREST, Japan Science and Technology Agency, Tokyo, Japan*

*<sup>5</sup> Department of Basic Science, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Cedric A. Boeckx, Catalan Institute for Research and Advanced Studies, Spain*

*Noriaki Yusa, Miyagi Gakuin Women's University, Japan*

#### *\*Correspondence:*

*Kuniyoshi L. Sakai, Department of Basic Science, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan e-mail: sakai@mind.c.u-tokyo.ac.jp*

The nature of computational principles of syntax remains to be elucidated. One promising approach to this problem would be to construct formal and abstract linguistic models that parametrically predict the activation modulations in the regions specialized for linguistic processes. In this article, we review recent advances in theoretical linguistics and functional neuroimaging in the following respects. First, we introduce the two fundamental linguistic operations: Merge (which combines two words or phrases to form a larger structure) and Search (which searches for and establishes a syntactic relation between two words or phrases). We also illustrate certain universal properties of human language, and present hypotheses regarding how sentence structures are processed in the brain. Hypothesis I is that the Degree of Merger (DoM), i.e., the maximum depth of merged subtrees within a given domain, is a key computational concept to properly measure the complexity of tree structures. Hypothesis II is that the basic frame of the syntactic structure of a given linguistic expression is determined essentially by functional elements, which trigger Merge and Search. We then present our recent functional magnetic resonance imaging experiment, demonstrating that the DoM is indeed a key syntactic factor that accounts for syntax-selective activations in the left inferior frontal gyrus and supramarginal gyrus. Hypothesis III is that the DoM domain changes dynamically in accordance with iterative Merge applications, the Search distances, and/or task requirements. We confirm that the DoM accounts for activations in various sentence types. Hypothesis III successfully explains activation differences between object- and subject-relative clauses, as well as activations during explicit syntactic judgment tasks. Future research on the computational principles of syntax will further deepen our understanding of uniquely human mental faculties.

#### **Keywords: syntax, universal grammar, recursive computation, inferior frontal gyrus, supramarginal gyrus, fMRI**

#### **INTRODUCTION**

Tree structures are one of the most ubiquitous structures in nature, appearing in the branchings of rivers, lightning, snowflakes, trees, blood vessels, nervous systems, etc., and can be simulated in part by fractal geometry (Mandelbrot, 1977). To properly quantify the complexity of such tree structures, various models have been proposed. The number of nodes would be one of the simplest models; this approach consists of simply counting the total number of non-terminal nodes (branching points) and terminal nodes of a tree structure (**Figure 1A**). This model obviously cannot capture hierarchical levels within the tree (sister relations in linguistic terms). To properly measure the hierarchical levels of a tree structure, we have proposed the Degree of Merger (DoM) as a key computational concept (**Figure 1B**) (Ohta et al., 2013). The DoM is defined as the *maximum depth* of merged subtrees (called Mergers) within a given domain. With this model, the same numbers are assigned to the nodes with an identical hierarchical level. The DoM corresponds to the number of iterations for generating fractal figures, when the tree structures are self-similar.
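The contrast between the two measures can be made concrete with a minimal sketch. It assumes a simple encoding that is not taken from Ohta et al. (2013): a terminal node is a string, a non-terminal node is a tuple of its children, and depth is counted as the number of nested merged subtrees, with a bare terminal counted as 0. The function names `count_nodes` and `max_merge_depth` are hypothetical, and the exact numbering convention of Figure 1B may differ from the one used here.

```python
# Illustrative encoding (an assumption, not the authors' implementation):
# a terminal node is a string, a non-terminal node is a tuple of children.

def count_nodes(tree):
    """Count all nodes, terminal and non-terminal (the 'number of nodes' model)."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)


def max_merge_depth(tree):
    """Maximum depth of nested merged subtrees; a bare terminal counts as 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(max_merge_depth(child) for child in tree)


# Two trees over the same four terminals, differing only in hierarchy.
balanced = (("a", "b"), ("c", "d"))
left_branching = ((("a", "b"), "c"), "d")

for name, tree in [("balanced", balanced), ("left_branching", left_branching)]:
    print(name, count_nodes(tree), max_merge_depth(tree))
```

Under these assumptions both trees contain seven nodes, so the node count cannot tell them apart, whereas the maximum depth is 2 for the balanced tree and 3 for the left-branching one.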

In this article, we first explain certain universal properties of human language discovered in modern linguistics, and we present hypotheses regarding how sentence structures are processed in the brain. We then introduce our recent functional magnetic resonance imaging (fMRI) study, which demonstrated that the DoM is indeed a key syntactic factor that accounts for syntax-selective activations in the regions specialized for language (Ohta et al., 2013). We also show that the top-down connectivity from the left inferior frontal gyrus to the left supramarginal gyrus is critical for syntactic processing. Next, we clarify that the DoM can account for activation modulations in the frontal region, depending on different sentence structures. Finally, we hypothesize that the DoM domain changes dynamically in accordance with iterative Merge applications, the distance required for Search operations (or simply the "Search distance"), and/or task requirements. This hypothesis accounts for activation differences between subject-relative and object-relative clauses, as well as for activations during explicit syntactic judgment tasks.

#### **UNIVERSAL PROPERTIES OF HUMAN LANGUAGE**

#### **THEORETICAL BACKGROUND**

Modern linguistics has clarified universal properties of human language, which, directly or indirectly, reflect the computational power, or engine, of the human language faculty. A sentence is not a mere string of words, but is made of phrase structure (called constituent structure). Moreover, a single phrase contains the key element (i.e., the "head") that determines the basic properties of the phrase. Furthermore, a sentence can be recursively embedded within other sentences, as in, e.g., "*I think that John believes that Mary assumes that. . .* ," and there is in principle no upper bound for the length of sentences. These universal properties can be adequately and minimally expressed by hierarchical tree structures with a set of relevant structural relations defined on such structures (Chomsky, 1957, 1965).

To construct hierarchical tree structures, modern linguistics has proposed the fundamental linguistic operation of *Merge* (capitalized in linguistics to indicate a formal operation). Merge is a structure-building operation that combines two syntactic objects (words or phrases) to form a larger structure (Chomsky, 1995). Merge would be theoretically "costless," requiring no driving force for its application (Saito and Fukui, 1998; Chomsky, 2004; Fukui, 2011). Besides Merge, we have proposed the *Search* operation, which searches syntactic features and applies to a syntactic object already constructed by Merge. Search couples and connects two distinct parts of the same structure, thereby assigning relevant features from one part to the other (Fukui and Sakai, 2003). Various other "miscellaneous" operations that have been employed in the linguistics literature, such as Agree, Scope determination, Copy, etc., are in fact different manifestations of one and the same, i.e., more generalized, operation of Search (Fukui and Sakai, 2003). Human language, therefore, should minimally contain two universal operations, Merge and Search. The total numbers of Merge and Search applications within an entire sentence are here simply denoted as "number of Merge" and "number of Search," respectively. The number of Merge in a sentence is always one less than the number of terminal nodes, *irrespective of sentence structures* (see Appendix S2 of Ohta et al., 2013).
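The relation between the number of Merge and the number of terminal nodes can be checked with a minimal sketch. It assumes, for illustration only, that a syntactic object is either a word (a string) or the pair produced by one binary Merge; the function names and example trees are hypothetical and are not taken from Ohta et al. (2013).

```python
# A syntactic object is either a word (str) or a pair produced by Merge (2-tuple).

def merge(a, b):
    """Combine two syntactic objects into a larger one (binary Merge)."""
    return (a, b)

def count_terminals(obj):
    """Number of terminal nodes (words) in a syntactic object."""
    if isinstance(obj, str):
        return 1
    return sum(count_terminals(child) for child in obj)

def count_merges(obj):
    """Number of Merge applications (non-terminal nodes) that built the object."""
    if isinstance(obj, str):
        return 0
    return 1 + sum(count_merges(child) for child in obj)

# Three hypothetical four-word structures with different shapes.
nested = merge(merge("the boy", merge("we", "like")), "sings")
right_branching = merge("a", merge("b", merge("c", "d")))
balanced = merge(merge("a", "b"), merge("c", "d"))

for tree in (nested, right_branching, balanced):
    # Holds for every binary tree, whatever its shape.
    assert count_merges(tree) == count_terminals(tree) - 1
```

Because the count is the same for all three shapes, the number of Merge alone cannot distinguish between them, which is why a depth-sensitive measure is needed.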

#### **SYMBOL SEQUENCES AND FORMAL LANGUAGES**

In regard to formal symbol sequences beyond the bounds of finite state languages, three specific types of language have been discussed in the linguistics literature: (i) "counter language," (ii) "mirror-image language," and (iii) "copying language" (cf. Chomsky, 1957, p. 21).


The counter language can be handled by a counting mechanism to match the number of each symbol, whereas the mirror-image language contains a mirror-image dependency, requiring more than a mere counter. If the number of symbols is not fixed (i.e., infinite), both of these languages are beyond the bounds of finite-state grammars, and are to be generated by context-free (simple) phrase structure grammars, while the copying language with a cross-serial dependency clearly goes beyond the bounds of even context-free phrase structure grammars, requiring a more powerful device, viz., context-sensitive phrase structure grammars or transformational grammars (Chomsky, 1959; Hopcroft and Ullman, 1979).
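The three types of language can be illustrated with simple membership tests. This is only a sketch: the two-letter alphabet, the requirement of non-empty strings, and the function names are assumptions made for illustration.

```python
def is_counter(s):
    """Type (i): n copies of 'a' followed by n copies of 'b' (n >= 1)."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

def is_mirror_image(s):
    """Type (ii): a non-empty string X followed by its reversal."""
    half = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s[half:] == s[:half][::-1]

def is_copying(s):
    """Type (iii): a non-empty string X followed by X itself."""
    half = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s[half:] == s[:half]

for s in ("aabb", "abba", "abab"):
    print(s, is_counter(s), is_mirror_image(s), is_copying(s))
```

The mirror-image test pairs the first symbol with the last (a nested dependency), whereas the copying test pairs the first symbol with the first symbol of the second half (a cross-serial dependency).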

It remains a central issue in cognitive sciences whether or not the faculty of language is also shared by animals. Animals have thus been tested with regular symbol sequences such as A<sup>*n*</sup>B<sup>*n*</sup> (*n* ≥ 2; i.e., AABB, AAABBB, ...) and (AB)<sup>*n*</sup> (*n* ≥ 2; i.e., ABAB, ABABAB, ...), which differ in *symbol order*. In an animal study, songbirds were trained to discriminate patterns of A<sup>*n*</sup>B<sup>*n*</sup> and (AB)<sup>*n*</sup> in more than ten thousand trials (Gentner et al., 2006). However, this learning can be achieved by tracking symbol repetition or counting strategy alone (Corballis, 2007). There is also a recent report that songbirds seemed to discriminate strings with or without nesting (Abe and Watanabe, 2011), but this learning can be achieved by simply remembering partial strings (Beckers et al., 2012). Along the lines of contrasting A<sup>*n*</sup>B<sup>*n*</sup> and (AB)<sup>*n*</sup>, fMRI studies have tested participants with different symbol sequences, such as A<sub>2</sub>A<sub>1</sub>B<sub>1</sub>B<sub>2</sub> vs. A<sub>1</sub>B<sub>1</sub>A<sub>2</sub>B<sub>2</sub> (each subscript denotes a matching order), which also differ in matching order (Bahlmann et al., 2008). The difference in activation patterns can be simply explained by differences in any factor associated with matching orders and symbol orders, i.e., temporal order-related factors. It is thus necessary to completely control these general factors when extracting any syntactic factor from a number of cognitive factors involved in actual symbol processing.

Since the number of symbols is inevitably fixed (i.e., finite) in any actual experiment, it should be noted that any symbol sequence can be expressed by a regular (finite state) grammar, i.e., the least powerful grammar in the so-called Chomsky hierarchy. Therefore, one cannot, in principle, claim from the experiments that individual grammars (e.g., context-free phrase structure grammars vs. regular grammars) are differentially represented in the brain. Thus, the neural representation of individual grammars was *not* within the scope of Ohta et al. (2013). In addition to the various models examined, other non-structural and non-symbolic models with simple recurrent networks have been proposed to process some examples of even context-free and context-sensitive phrase structure languages, generalizing to some degree to longer strings than the training set (Rodriguez, 2001). However, these models do not account for any parametric modulation of the activations reported in Ohta et al. (2013), except the length of sentences.
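The point that a bounded set of sequences is always regular can be made concrete with a short sketch. The bound `MAX_N` and the helper name `accepted` are hypothetical; the example only assumes that an experiment uses finitely many sequence lengths.

```python
import re

MAX_N = 3  # assumed upper bound on n in a hypothetical experiment

# Enumerate A^nB^n for 1 <= n <= MAX_N and join the strings into one
# regular expression: a finite list of alternatives needs no counter or stack.
finite_anbn = re.compile("|".join("A" * n + "B" * n for n in range(1, MAX_N + 1)))

def accepted(s):
    """True if s is one of the finitely many A^nB^n strings up to MAX_N."""
    return finite_anbn.fullmatch(s) is not None

print(accepted("AABB"), accepted("AAAABBBB"), accepted("ABAB"))
```

Once the bound is fixed, successful discrimination of such sequences is compatible with a finite-state device, so it cannot by itself indicate which type of grammar is used.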

In the previous experiment, we introduced letter strings, which had no lexical associations but had both symbol orders (e.g., AABB and ABAB) and matching orders (e.g., A<sub>2</sub>A<sub>1</sub>B<sub>1</sub>B<sub>2</sub>). There were two basic types of strings: reverse-order strings (Reverse) and same-order strings (Same). In the Reverse strings, the first and second halves of a string were presented in the reverse order, while in the Same strings the halves were presented in the same order (**Figure 2**). Under these conditions, there was actually no path connecting the non-terminal nodes of symbol pairs (e.g., A<sub>1</sub>B<sub>1</sub> and A<sub>2</sub>B<sub>2</sub>), as there was *no* Merge application to connect the multiple pairs. In regard to the symbol orders, both the Reverse and Same strings took the above type (i) of A<sup>*n*</sup>B<sup>*n*</sup>. In regard to the matching orders, the Reverse string took the type (ii) of A<sub>2</sub>A<sub>1</sub>B<sub>1</sub>B<sub>2</sub> or A<sub>3</sub>A<sub>2</sub>A<sub>1</sub>B<sub>1</sub>B<sub>2</sub>B<sub>3</sub>, while the Same string took the type (iii) of A<sub>1</sub>A<sub>2</sub>B<sub>1</sub>B<sub>2</sub> or A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>B<sub>1</sub>B<sub>2</sub>B<sub>3</sub>.
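The two matching orders can be written out as pairs of serial positions. The following sketch is illustrative only; the function name and the 1-based position convention are assumptions.

```python
def matching_pairs(n, order):
    """Return 1-based positions (i, j) of each A_k/B_k pair in a string of 2n symbols.

    order="reverse": A_n ... A_1 B_1 ... B_n  (mirror-image matching)
    order="same":    A_1 ... A_n B_1 ... B_n  (cross-serial matching)
    """
    if order == "reverse":
        return [(i, 2 * n + 1 - i) for i in range(1, n + 1)]
    if order == "same":
        return [(i, n + i) for i in range(1, n + 1)]
    raise ValueError(f"unknown order: {order!r}")

print(matching_pairs(2, "reverse"))  # pairs for a four-symbol Reverse string
print(matching_pairs(2, "same"))     # pairs for a four-symbol Same string
```

Both conditions share the same symbol order (all A symbols before all B symbols) and differ only in which positions are paired.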

#### **HYPOTHESIS I**

Given a tree structure with a formal property of Merge and *iterativity* (recursiveness) (Fukui, 2011), we propose the following hypothesis (Hypothesis I):

(1) The DoM, which can be defined as the *maximum depth* of merged subtrees within a given domain, is a key computational concept to properly measure the complexity of tree structures.

The DoM can quantify and compare various syntactic phenomena, such as self-embedding, scrambling, *wh*-movement, etc. Furthermore, when Search applies to each syntactic object with its hierarchical structure, the calculation of the DoM plays a critical role. Indeed, from a nested sentence "[[*The boy*<sub>2</sub> [*we*<sub>3</sub> *like*<sub>3</sub>]<sub>2</sub>]<sub>1</sub> *sings*<sub>1</sub>]<sub>0</sub>" (subscripts denote the DoM for each node), two sentences "[*The boy*...]<sub>1</sub> *sings*<sub>1</sub>" and "*we*<sub>3</sub> *like*<sub>3</sub>" are obtained, where relevant features (numbers and persons here) are searched and matched between the nodes with the identical DoM. Since such analyses of hierarchical structures would produce specific loads in syntactic computation, we expect that the DoM and associated "number of Search" would affect performances and cortical activations.
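The node labels in the nested example can be reproduced with a short sketch. It assumes that the tree is encoded as nested tuples, that the DoM domain is the whole sentence, and that the root has depth 0; the function names are hypothetical.

```python
def label_depths(obj, depth=0):
    """Return (label, depth) pairs for every node, in left-to-right preorder."""
    if isinstance(obj, str):
        return [(obj, depth)]
    labels = [("[...]", depth)]  # a non-terminal node created by Merge
    for child in obj:
        labels.extend(label_depths(child, depth + 1))
    return labels

def degree_of_merger(obj):
    """DoM over the whole tree: the maximum depth reached by any node."""
    return max(depth for _, depth in label_depths(obj))

# [[The boy [we like]] sings], encoded as nested tuples.
sentence = (("The boy", ("we", "like")), "sings")

for label, depth in label_depths(sentence):
    print(depth, label)
print("DoM:", degree_of_merger(sentence))
```

In this encoding, "*we*" and "*like*" receive the same depth, as do "[*The boy*...]" and "*sings*", which corresponds to the pairs of nodes between which features are searched and matched.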

Sentences with various constructions have been previously discussed in terms of the acceptability of sentences (cf. Chomsky, 1965, p. 12).


The nested constructions are created by *centrally* embedding a phrase within another phrase (with some non-null element to its left and some non-null element to its right), and the self-embedded constructions are the special case of nested constructions when nesting occurs within the *same* type of phrases (e.g., noun phrases). The multiple-branching constructions are made by conjoining phrases at the same hierarchical level, and the left/right-branching constructions are yielded by merging a phrase in the left-most or right-most phrase. The degrees of nesting and self-embedding have already been proposed to model the understanding of sentences (Miller and Chomsky, 1963). By generalizing this attractive idea in such a way as to include any construction with merged phrases, we introduced the DoM as a key computational concept.

Based on the nested (self-embedded), left/right-branching, and multiple-branching constructions, three basic types of sentences can be distinguished: the nested sentence (Nested), simple sentence (Simple), and conjoined sentence (Conjoined), respectively. The sentences shown in **Figure 3** are some examples in Japanese. Given syntactic structures like the ones shown, the correspondence of each subject-verb pair becomes fixed. Here N and V denote a noun phrase and a verb phrase, respectively. For the sentence shown in **Figure 3A**, an entire sentence is constructed by nesting sentences in the form of [N<sub>2</sub>[N<sub>1</sub>V<sub>1</sub>]V<sub>2</sub>], where [N<sub>*i*</sub>V<sub>*i*</sub>] represents a subject-verb pair of a sentence. Since Japanese is a head-last, and hence an SOV (verb-final) language, a main verb is placed after a subordinate clause. Therefore, Japanese sentences naturally yield nested structures without having to employ, as in English, object-relative clauses (e.g., "*The boy who<sub>i</sub> we like t<sub>i</sub> sings*"), which require "movement" of an object (i.e., with more Merge applications) and thus leave behind a "trace" (*t<sub>i</sub>*, subscripts denote the same entity). For the sentence shown in **Figure 3B**, a simple sentence is constructed by adding the same number of left/right branches to both Ns and Vs. The last noun (i.e., head) in the branches of Ns makes a subject-verb pair with the last verb (i.e., head) of a compound verb. Each simple sentence thus takes the form of [(NN<sub>1</sub>)(VV<sub>1</sub>)]. For the sentence shown in **Figure 3C**, an entire sentence is constructed by conjoining sentences in the form of [N<sub>1</sub>V<sub>1</sub>][N<sub>2</sub>V<sub>2</sub>]. When considering longer sentences like N<sub>3</sub>N<sub>2</sub>N<sub>1</sub>V<sub>1</sub>V<sub>2</sub>V<sub>3</sub>, these constructions have distinct values for DoM.
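How the three constructions differ in DoM can be sketched as follows. The bracketings below are assumptions inferred from the bracket notation in the text (the actual trees are given in **Figure 3**), and the conjoined sentence is treated as a multiple-branching structure with one common root.

```python
def dom(tree):
    """Maximum depth of a tree given as nested tuples (leaves are strings)."""
    if isinstance(tree, str):
        return 0
    return 1 + max(dom(child) for child in tree)

# Assumed bracketings for the short (four-word) sentences.
nested = ("N2", (("N1", "V1"), "V2"))      # [N2 [[N1 V1] V2]]
simple = (("N", "N1"), ("V", "V1"))        # [(N N1)(V V1)]
conjoined = (("N1", "V1"), ("N2", "V2"))   # [N1 V1][N2 V2]

# Assumed bracketings for the long (six-word) sentences.
nested_long = ("N3", (("N2", (("N1", "V1"), "V2")), "V3"))
simple_long = ((("N", "N"), "N1"), (("V", "V"), "V1"))
conjoined_long = (("N1", "V1"), ("N2", "V2"), ("N3", "V3"))

# DoM relative to the Conjoined sentence of the same length.
subtracted = [
    dom(nested_long) - dom(conjoined_long),
    dom(nested) - dom(conjoined),
    dom(simple_long) - dom(conjoined_long),
    dom(simple) - dom(conjoined),
]
print(subtracted)
```

Under these assumed bracketings, lengthening a nested sentence increases the DoM by two levels per added subject-verb pair, whereas lengthening a conjoined sentence leaves it unchanged.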

#### **HYPOTHESIS II**

In any sentence, functional elements, such as inflections, auxiliary verbs, and grammatical particles, serve an essentially grammatical function without descriptive content. In regard to the fundamental role of these functional elements, we propose the following hypothesis (Hypothesis II):

(2) The basic frame of the syntactic structure of a given linguistic expression (e.g., sentence) is determined essentially by functional elements, which trigger Merge and Search operations.

In the non-sense poem "Jabberwocky" by Lewis Carroll, e.g., "'*Twas* ('*It was*') *brillig, and the slithy toves did . . .* ," the basic frames of syntactic structures are indeed determined by the functional elements of "*'Twas*," "*and*," "*the*," "*-s*," and "*did*." In the Japanese language, grammatical particles and morphosyntactic inflections are functional elements. The sentences shown in **Figure 3** actually contain only three kinds of grammatical particles, which represent *canonical* (i.e., in a prototypical use) case markings and syntactic information in Japanese: *-ga*, a nominative case marker; *-no*, a genitive case marker; and *-to*, a complementizer. It should be noted that both the nested and simple sentences have the same symbol order of N<sup>*n*</sup>V<sup>*n*</sup>, but they have different grammatical particles and syntactic structures. In contrast, both the simple and conjoined sentences have the same tree structures as a result, but they have different symbol orders of N<sup>*n*</sup>V<sup>*n*</sup> or (NV)<sup>*n*</sup> (*n* ≥ 2). It is the grammatical particles and morphosyntactic inflections, but not symbol orders or matching orders themselves, that determine the basic frame of syntactic structures of a sentence.

**FIGURE 3 | Japanese sentences with three major constructions.** The figure shows three basic types of sentences in Japanese: the nested sentence, simple sentence, and conjoined sentence. Based on contemporary linguistics, each diagram represents a unique tree structure of each sentence constructed from nouns and verbs. Below each example, word-by-word translations in English are shown. **(A)** A sentence (S) at the lowest hierarchical level was nested into an entire sentence (S') ("*Taro-ga Hanako-ga utau-to omou*," "*Taro thinks that Hanako sings*"). **(B)** A simple sentence was constructed by adding the same number of left/right branches to both nouns and verbs ("*Taro-no ani-ga tabe hajimeru*," "*Taro's brother starts eating*"). **(C)** An entire sentence (S') was constructed by conjoining two sentences ("*Taro-ga utatte Hanako-ga odoru*," "*Taro sings, and Hanako dances*"). Symbols used: S and S', sentence; N, noun phrase; V, verb phrase; *-ga*, nominative case marker; *-no*, genitive case marker; *-to*, complementizer; *-te*, gerundive form; Nom, nominative case; Gen, genitive case; Comp, complementizer.

Following morphosyntactic and phonological features of Japanese verbs (Tsujimura, 2007), Vs take a non-past-tense form (*-ru*), past-tense form (*-ta*), or gerundive form (*-te*); Vs ending with *-to* and *-te* introduce *that*-clauses and *and*-conjunctives, respectively. The gerundive form can be used not only in *and*-conjunctives, but in compound verbs (e.g., "*tabete-simau*," "*finish eating*"; actual Japanese words will be translated hereafter), much as gerunds can in English. The *-ga*, *-no*, *-to*, and *-te* endings (*green* letters in **Figures 3**, **4**), together with the first verb of a compound verb in an adverbial form (e.g., "*tabe*"), are associated with Merge applications to connect multiple nouns/verbs or sentences, amounting to "number of Merge." The Japanese language lacks the "agreement features" (i.e., number, person, gender, etc.), but it is nevertheless equipped with the general Search procedure that is employed in agreement phenomena in other languages. This Search mechanism is in fact attested for various other phenomena in Japanese (see Fukui and Sakai, 2003 for further discussion). For example, the Japanese language exhibits a phenomenon called "honorification," where a noun phrase denoting an honored person and the form of honorifics in verbs are to be matched (Gunji, 1987; Ivana and Sakai, 2007).

syntactic structures of these jabberwocky sentences are the same as those of real sentences in **Figure 3**. The digits shown in red and blue denote the DoM for each node and "number of Search," respectively. We also tested the long stimuli with six words.

In this section, we provided some theoretical discussions based on modern linguistics, focusing on the two fundamental linguistic operations of Merge and Search. We hypothesized that the DoM is a key computational concept to properly quantify the complexity of tree structures, and that the basic frame of the syntactic structure of a given linguistic expression is determined essentially by grammatical particles and morphosyntactic inflections, which trigger Merge and Search operations.

#### **THE DoM AS A KEY SYNTACTIC FACTOR ELUCIDATED BY AN fMRI EXPERIMENT**

One possible way to elucidate the neural basis of computational properties of natural language is to examine how the brain responds to the modulation of specified syntactic factors. We should not be content with such a general cognitive factor as so-called "syntactic complexity" or "syntactic working memory," which could involve both linguistic and non-linguistic factors. We should instead identify minimal factors that *sufficiently* explain any activation change obtained. In our recent study, we focused on different sentence constructions, and found that the DoM and "number of Search" were the *minimal* syntactic factors associated with phrase structures, which parametrically modulate cortical responses measured with event-related fMRI (Ohta et al., 2013). In this section, we will present the basic paradigm and results of this work.

#### **A PARADIGM TO TEST HYPOTHESES I AND II**

We used jabberwocky sentences, which consist of pseudonoun phrases (Ns) and pseudoverb phrases (Vs) that lack lexical associations, but have grammatical particles and morphosyntactic inflections (**Figure 4**). According to Hypothesis II stated above, these jabberwocky sentences had the same syntactic structures as normal sentences. Under the sentence conditions of Nested, Simple, and Conjoined with the same structures shown in **Figure 3**, the jabberwocky sentences were visually presented in a phrase-by-phrase manner to the participants. We made six pseudonouns by repeating the same syllables with voiced consonants and any one of /a/, /u/, or /o/: *rara*, *zaza*, *mumu*, *gugu*, *yoyo*, and *dodo*. We also made four pseudoverb roots by repeating the same syllables with voiceless consonants and either /i/ or /e/: *kiki*, *hihi*, *sese*, and *tete*. Here, vowel harmony was adopted to change the last, i.e., the second, vowel of the verb root, so that this vowel harmonized with the vowel (i.e., /a/, /u/, or /o/) of the corresponding subject (e.g., "*rara-ga tetaru*" from "*teteru*," underlined vowels within pseudowords). These features of vowels were only *experimentally* introduced, and these pseudoverbs lacked grammatical features, as in the Japanese verbs. In all jabberwocky sentences, the distinction between Ns and Vs was clear without memorizing pseudowords, because Ns, but not Vs, ended with either *-ga* or *-no*, i.e., case markers in Japanese such as -*ga* and -*no* can be generally attached only to nominal phrases.

To test whether participants actually paid attention to the correspondence of each subject-verb pair, we used a matching task, such that the vowel of a subject (N<sub>*i*</sub> as a sample stimulus) was matched with the last vowel of the corresponding verb root (V<sub>*i*</sub> as a comparison stimulus), probing the goal with the same vowel as explained above. It follows that the same syntactic structures were constructed from matching stimuli and non-matching stimuli (e.g., "*rara-ga teturu*"), which were both well-formed, i.e., *grammatical*, in Japanese. A matching strategy (counting, for example, the first and the fourth stimuli for matching) was useful in solving the task, but performing the task was *not* prerequisite for constructing syntactic structures. Our matching task is different from classification tasks for symbol orders (e.g., AABB vs. ABAB, where A and B are symbols representing certain sets of stimuli), which can be solved by counting the maximum number of consecutively repeated symbols. The order of the Nested, Simple, and Conjoined was pseudo-randomized without repetition. We further examined whether cortical activations were modulated by the length of sentences: short (S as a subscript, e.g., Conjoined<sub>(S)</sub>; four-word) and long (L as a subscript, e.g., Conjoined<sub>(L)</sub>; six-word) sentences, where the DoM domain spanned four and six relevant words, respectively.
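The vowel-based matching rule can be sketched as follows. This is an illustrative reconstruction from the description in the text, not the stimulus-generation code of Ohta et al. (2013); the function names are hypothetical, and pseudowords are assumed to be written in plain romanization.

```python
VOWELS = "aiueo"

def subject_vowel(noun):
    """Vowel of a reduplicated pseudonoun such as 'rara' or 'mumu'."""
    return next(ch for ch in noun if ch in VOWELS)

def harmonize(verb_root, noun):
    """Replace the last vowel of the verb root with the subject's vowel."""
    idx = max(i for i, ch in enumerate(verb_root) if ch in VOWELS)
    return verb_root[:idx] + subject_vowel(noun) + verb_root[idx + 1:]

def is_matching(noun, verb_root):
    """True if the last vowel of the verb root equals the subject's vowel."""
    last_vowel = [ch for ch in verb_root if ch in VOWELS][-1]
    return last_vowel == subject_vowel(noun)

# Matching stimulus built from the root 'tete' and the subject 'rara'.
print("rara-ga", harmonize("tete", "rara") + "ru")
# The non-matching example from the text has the root 'tetu'.
print(is_matching("rara", "teta"), is_matching("rara", "tetu"))
```

The rule refers only to vowels, so matching and non-matching stimuli carry the same particles and inflections and therefore the same syntactic structure.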

We also used the same matching task under the string conditions of Reverse and Same (**Figure 2**), such that the first half of a string (A<sub>*i*</sub> as a sample stimulus) was matched with the corresponding second half (B<sub>*i*</sub> as a comparison stimulus) in the reverse or same order. These string conditions also controlled any involvement of the matching strategy stated above. Between the Nested (N<sub>2</sub>N<sub>1</sub>V<sub>1</sub>V<sub>2</sub>) and Reverse (A<sub>2</sub>A<sub>1</sub>B<sub>1</sub>B<sub>2</sub>) conditions, the curved arrows shown in **Figures 2**, **4** represent the *same* matching order of sequentially presented stimuli. The symbol order was also identical among the Nested, Simple, Reverse, and Same conditions in the form of N<sup>*n*</sup>V<sup>*n*</sup> or A<sup>*n*</sup>B<sup>*n*</sup>. Combining these multiple conditions, we were able to properly examine whether different structures were actually constructed between sentences and strings. The spatial and temporal resolution of fMRI, as well as its sensitivity, has been proven to be high enough to confirm various hypotheses about human cognitive functions like ours.

#### **SYNTAX-SELECTIVE ACTIVATIONS MODULATED BY THE DoM AND THE NUMBER OF SEARCH**

To control both matching orders and symbol orders, we directly compared the Nested with the Reverse condition, using the Simple and Same conditions as respective references, i.e., (Nested − Simple) *>* (Reverse − Same), where we combined the short and long stimuli. This contrast further controlled various linguistic and non-linguistic factors, such as the number of Merge, number of case markers, number of nodes, memory span, and counting. This point is particularly important, because temporal order-related or memory-related factors have often been confused with differences in structure or grammar type. Significant activation was elicited by this contrast in the pars opercularis and pars triangularis of the left inferior frontal gyrus (L. F3op/F3t) [local maximum: (*x*, *y*, *z*) = (−51, 24, 24), *Z* = 5.8], and the left supramarginal gyrus (L. SMG) [(−39, −45, 42), *Z* = 5.7] (**Figure 5A**). Our results are best explained by the linguistic factors associated with the Nested condition, supporting our second hypothesis that basic syntactic structures are constructed when well-formed sentences are given even without lexical meanings.

For these two critical regions, we examined the percent signal changes under the Nested and Simple conditions by subtracting those under the Conjoined condition, which had the simplest tree structures (**Figure 4** and **Table 1**), separately for long and short sentences. Since we used the Conjoined<sub>(L)</sub> and Conjoined<sub>(S)</sub> as appropriate references, we examined whether likewise *subtracted* estimates of each factor (e.g., DoM for Nested<sub>(L)</sub> − Conjoined<sub>(L)</sub>; see **Table 1**) directly explained the parametric modulation of activations in the four contrasts of Nested<sub>(L)</sub> − Conjoined<sub>(L)</sub>, Nested<sub>(S)</sub> − Conjoined<sub>(S)</sub>, Simple<sub>(L)</sub> − Conjoined<sub>(L)</sub>, and Simple<sub>(S)</sub> − Conjoined<sub>(S)</sub>. The percent signal changes in the L. F3op/F3t and L. SMG, averaged across significant voxels, indeed correlated exactly in a step-wise manner with the parametric models of the DoM [3, 1, 1, 0] and "DoM + number of Search" [3, 1, 0, −1], respectively (**Figure 5B**). By generalizing the role of Search, we assumed that Search applied to a subject-verb pair, where the relevant features (vowels here) are experimentally "inserted" (Ohta et al., 2013).

**FIGURE 5 | Functional and anatomical evidence of a syntax-related network. (A)** Regions identified by the (Nested − Simple) *>* (Reverse − Same) contrast (see **Figure 4**). Activations were projected onto the left (L) and right lateral surfaces of a standard brain. **(B)** Percent signal changes for Nested − Conjoined and Simple − Conjoined in the L. F3op/F3t and L. SMG. Overlaid red dots and lines denote the values fitted with the estimates (digits in red) for the best models: DoM for the L. F3op/F3t and "DoM + number of Search" for the L. SMG. **(C)** The results of DCM, testing effective connectivity between the L. F3op/F3t and L. SMG. The best model included a significant top-down connection from the L. F3op/F3t to L. SMG (a thick line). **(D)** Anatomical connectivity between the L. F3op/F3t and L. SMG revealed by DTI. The population probability map is shown on the left lateral and dorsal surfaces of a standard brain with maximum intensity projection. Blue spheres represent seed regions of the L. F3op/F3t and L. SMG. Symbols used: L, long sentences; S, short sentences.

**Table 1 | Estimates of various factors to account for activations in Ohta et al. (2013).**

*Estimates under the Conjoined condition were subtracted from those under the other Nested and Simple conditions [e.g., DoM for Nested(L)* − *Conjoined(L), 5* − *2* = *3], separately for long and short sentences. We regarded "DoM* + *number of Search" (i.e., adding the estimates of two factors) as an additional factor.*

We further examined 19 models proposed in theoretical linguistics, psycholinguistics, and natural language processing to verify that the models of the DoM and "DoM + number of Search" best explained the cortical activations (Ohta et al., 2013). All contrasts of Nested<sub>(L)</sub> − Conjoined<sub>(L)</sub>, etc. predicted that the activations should be exactly zero when a factor produced no effect or load relative to the Conjoined. We thus adopted a no-intercept model, in which percent signal changes of each region were fitted with a single (thus minimal) scale parameter to a model of each factor using its subtracted estimates. For the four contrasts, a least-squares method was used to minimize the residual sum of squares (RSS) for the four fitted values (i.e., four estimates multiplied by a fitting scale) against the corresponding signal changes averaged across participants (**Table 2**).
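The no-intercept fit has a closed-form solution, sketched below. The signal values are hypothetical placeholders chosen for illustration and are not data from Ohta et al. (2013); only the DoM estimates [3, 1, 1, 0] are taken from the text.

```python
def fit_no_intercept(model, signal):
    """Least-squares scale k minimizing sum((signal - k * model)^2), with no intercept."""
    k = sum(m * y for m, y in zip(model, signal)) / sum(m * m for m in model)
    rss = sum((y - k * m) ** 2 for m, y in zip(model, signal))
    return k, rss

# Subtracted DoM estimates for the four contrasts, as given in the text.
dom_model = [3, 1, 1, 0]
# Hypothetical percent signal changes, NOT data from the study.
signal = [0.31, 0.09, 0.11, 0.01]

k, rss = fit_no_intercept(dom_model, signal)
fitted = [k * m for m in dom_model]
print(k, rss, fitted)
```

Because the model has no intercept, a contrast with an estimate of zero is always fitted with a value of zero, which is the prediction stated above for factors that produce no load relative to the Conjoined condition.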

The model of the DoM for the L. F3op/F3t, as well as that of "DoM + number of Search" for the L. SMG, indeed produced by far the least RSS (≤ 0.0020) and largest coefficient of determination (*r*<sup>2</sup>) (≥ 0.97). Goodness of fit was further evaluated for each model by using a one-sample *t*-test (significance level at α = 0.0125, Bonferroni corrected) between the fitted value for each contrast and individual activations. The model of the DoM for the L. F3op/F3t, as well as that of "DoM + number of Search" for the L. SMG, produced no significant deviation for the four contrasts (*P* ≥ 0.17). To further take account of interindividual variability, we fitted "linear mixed-effects models" with individual activations, and found that the models of the DoM and "DoM + number of Search" were by far more likely for the L. F3op/F3t and L. SMG, respectively. Even if we took the Simple condition as a reference for subtracted estimates, we obtained the same results of best models. These results directly support Hypotheses I and II, such that the basic frame of syntactic structures is determined essentially by functional elements, whereas the DoM, together with the number of Search, is a key factor to properly quantify the complexity of the syntactic structures.

#### **THE SIGNIFICANCE OF THE CONNECTIVITY BETWEEN THE L. F3op/F3t AND L. SMG**

It has been reported that the L. F3op/F3t is specialized for syntactic processing (Stromswold et al., 1996; Dapretto and Bookheimer, 1999; Embick et al., 2000; Hashimoto and Sakai, 2002; Friederici et al., 2003; Musso et al., 2003; Suzuki and Sakai, 2003; Kinno et al., 2008), suggesting that this region subserves a grammar center (Sakai, 2005). On the other hand, the left angular gyrus and SMG (L. AG/SMG) have been suggested to be important for vocabulary knowledge or lexical processing (Lee et al., 2007; Pattamadilok et al., 2010). To elucidate the relationships between the L. F3op/F3t and L. SMG, we modeled the effective connectivity between these two regions by using dynamic causal modeling (DCM). Our interest was to identify the direction of the connectivity modulated by the Nested condition, which has the largest DoM of all conditions. First, we assumed intrinsic, i.e., task-independent, bi-directional connections, and the models were grouped into three "modulatory families": families with modulation for the bottom-up connection from the L. SMG to L. F3op/F3t, for the top-down connection from the L. F3op/F3t to L. SMG, and for both connections. Each family was composed of three "input models" as regards the regions receiving driving inputs. We found that the model with the modulation for the bottom-up connection, in which the L. F3op/F3t received driving inputs, was the best and most probable model (**Figure 5C**). We further confirmed that the intrinsic top-down connectivity was significantly positive (+0.22; *P* < 0.0002), while the bottom-up connectivity was negatively modulated.
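The size of the model space described above can be enumerated in a few lines. This sketch only lists the candidate models and does not perform any DCM estimation; the three input options (driving input to the L. F3op/F3t, to the L. SMG, or to both regions) are an assumption about what the three "input models" were.

```python
from itertools import product

# Which intrinsic connection is modulated by the Nested condition.
MODULATION_FAMILIES = ["bottom-up", "top-down", "both"]
# Assumption: the three input models are input to one region, the other, or both.
DRIVING_INPUTS = ["L. F3op/F3t", "L. SMG", "both regions"]

model_space = [
    {"modulation": family, "driving_input": region}
    for family, region in product(MODULATION_FAMILIES, DRIVING_INPUTS)
]

# The model reported as best in the text.
best = {"modulation": "bottom-up", "driving_input": "L. F3op/F3t"}

print(len(model_space), best in model_space)
```

Grouping the candidates into families in this way allows a comparison first between modulation types and then between input models within the winning family.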

A recent DCM study with a picture-sentence matching task has suggested that the L. F3op/F3t received driving inputs (den Ouden et al., 2012), which was consistent with our DCM results. Moreover, our previous studies revealed that the functional connectivity between the L. F3t/F3O (pars orbitalis) and L. AG/SMG was selectively enhanced during sentence processing (Homae et al., 2003), and that the L. AG/SMG was also activated during the identification of correct past-tense forms of verbs, probably reflecting an integration of syntactic and vocabulary knowledge (Tatsuno and Sakai, 2005). Considering the role of the L. AG/SMG in lexical processing, the Search operation based on the DoM would be essential in assigning relevant features to the syntactic objects derived from lexical items.

**Table 2 | Fittings and likelihood of various models tested in Ohta et al. (2013).**

*Percent signal changes in the L. F3op/F3t and L. SMG were fitted with a single scale parameter to a model of each factor using its subtracted estimates (Table 1) for the four contrasts of Nested(L)* − *Conjoined(L), Nested(S)* − *Conjoined(S), Simple(L)* − *Conjoined(L), and Simple(S)* − *Conjoined(S). The P-values for the t-tests are shown in ascending order. The models with an asterisk resulted in the best fit of 19 models tested (four models are shown here) for explaining activations in the L. F3op/F3t or L. SMG, i.e., with the least residual sum of squares (RSS), largest coefficient of determination (r2), and larger P-values. The likelihood ratio was taken as the ratio of each model's likelihood to the best model's likelihood. The best models were by far more likely than the other models.*

To further confirm the anatomical plausibility of the network between the L. F3op/F3t and L. SMG revealed by DCM, we used diffusion tensor imaging (DTI) with a probabilistic tractography. We observed that a single continuous cluster of the left superior longitudinal and arcuate fasciculi (SLF/AF) connected these regions (cluster size, 3189 mm<sup>3</sup>), together with much smaller clusters or islands (**Figure 5D**). Moreover, the left SLF/AF was consistently observed in all participants.

The findings of recent DTI studies have been controversial regarding the functional roles of two different pathways in language processes: the dorsal tracts of the SLF/AF, and the ventral tracts of the middle longitudinal fasciculus (MdLF) and extreme capsule (EmC). Both pathways connect the inferior frontal and superior/middle temporal areas (Saur et al., 2008; Wilson et al., 2011; Wong et al., 2011; Griffiths et al., 2013). Our DCM and DTI results indicate that the L. SMG activations reflecting the DoM mirrored a top-down influence from the L. F3op/F3t through the left dorsal pathway of the SLF/AF, revealing the most crucial network and pathway for syntactic computation.

#### **FURTHER CONFIRMATION OF HYPOTHESES I AND II**

#### **A PICTURE-SENTENCE MATCHING PARADIGM**

We further examined whether our hypotheses hold for various cases discussed in previous studies. In our fMRI study (Kinno et al., 2008), we used a picture-sentence matching task with three sentence types in Japanese: active, passive, and scrambled sentences (**Figure 6A**). In the picture-sentence matching task, the participants read a sentence covertly and judged whether or not the action depicted in a picture matched the meaning of the sentence. Each sentence had two noun phrases called *arguments*, each of which assumes a different grammatical relation ("subject, direct object, or indirect object" in linguistic terms) and a semantic role ("agent, experiencer, or patient" in linguistic terms, i.e., an agent who performs the action, and an experiencer/patient who is affected by it); these three conditions were thus called Two-argument conditions. More specifically, the active, passive, and scrambled sentences corresponded to "agent and patient" (subject and direct object), "experiencer and agent" (subject and indirect object), and "patient and agent" (direct object and subject) types, respectively. Pictures consisted of two stick figures, each of which was distinguished by a "head" symbol: a circle (◦), square (□), or triangle (△). These sentences excluded the involvement of pragmatic information about word use (e.g., "*An officer chases a thief*" is more acceptable than "*A thief chases an officer*"). To minimize the effect of general memory demands, a whole sentence of a minimal length was visually presented for a longer time than was needed to respond.

In Japanese syntax, the grammatical relations are first marked by grammatical particles (nominative, dative, or accusative), which in turn allow the assignment of semantic roles. In the active sentences we used, a noun phrase with the nominative case marker *-ga* (*green* letters in **Figure 6B**) is associated with an agent, and the one with the accusative case marker *-o* is associated with a patient. For the passive sentences we used, however, a noun phrase with the nominative case marker *-ga* is associated with an experiencer (a person experiencing a situation), whereas a passive bound verb "-(*r*)*areru*" marks passiveness, making a subject-verb pair with the experiencer. In contrast, a noun phrase with the dative marker *-ni* is associated with an agent, whereas an action verb (e.g., "*hik*(*u*)," "*pull*") makes a subject-verb pair with the agent, forming a subordinate clause within the main clause "◦-*ga* ... -(*r*)*areru*." Note that there exist similar causative structures in both Japanese and English: "*Hanako-ga kare-ni hikaseta*," "*Hanako made him pull*." Actually, there are two types of passivization in Japanese: *ni* passive (e.g., "*Hanako-ga Taro-ni hik-areru*," "*Hanako is affected by Taro's pulling her*") and *ni yotte* passive (e.g., "*Hanako-ga Taro-ni yotte hik-areru*," "*Hanako is pulled by Taro*"). According to Kuroda (1992), the *ni* passive involves no noun-phrase movement, while the *ni yotte* passive involves a movement similar to the case in English. For the scrambled sentences, an object moves from its canonical position to higher nodes by undergoing another Merge operation. This type of construction is perfectly normal, not only in Japanese but also in German, Finnish, and other languages. We also tested the One-argument condition, under which each sentence was presented with an intransitive verb and double agents. This condition did not involve two-argument relationships, and was thus syntactically simpler than any of the Two-argument conditions.

#### **HYPOTHESIS III**

Here we present the following hypothesis (Hypothesis III):

(3) The DoM domain changes dynamically in accordance with iterative Merge applications, the Search distances, and/or task requirements.

Since Merge combines two syntactic objects to form a larger structure, Merge always produces a one-level higher node. When Merge applies iteratively to an existing phrase or sentence, the DoM domain thus becomes larger in accordance with the number of Merge applications. The Search distance is the structural distance between two distinct parts to which the Search operation applies, regardless of the nodes that are irrelevant to the Search operation. As observed from **Figure 4**, the DoM domain changes in accordance with the Search distance. On the other hand, for every sentence stimulus in the study of Ohta et al. (2013), the construction of syntactic structures was ensured by task requirements, in which three sentence types had to be distinguished while they were completely mixed. Task requirements include not only certain constraints required by experimental tasks, but also the detailed parsing naturally required to understand a part of phrases or sentences (e.g., subject-verb relationships and noun-pronoun (coreference) relationships).
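The relation between iterative Merge and the size of the DoM domain can be illustrated with a minimal Python sketch. It assumes, following our reading of the definition used earlier in this article, that the DoM of a structure can be read off as the depth of its deepest merged subtree; the names `Node`, `merge`, and `dom`, as well as the example words, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    """A syntactic object produced by one application of Merge."""
    left: "Tree"
    right: "Tree"


# A tree is either a terminal (a word) or a merged node.
Tree = Union[str, Node]


def merge(a: Tree, b: Tree) -> Node:
    """Combine two syntactic objects into a node one level higher."""
    return Node(a, b)


def dom(tree: Tree) -> int:
    """Depth of the deepest merged subtree (assumed proxy for the DoM)."""
    if isinstance(tree, str):
        return 0  # a bare terminal involves no Merge
    return 1 + max(dom(tree.left), dom(tree.right))


# Iterative Merge on a right-branching example: each application
# adds one level, so the estimate grows with the number of Merges.
noun_phrase = merge("the", "girl")
verb_phrase = merge("likes", noun_phrase)
sentence = merge("Hanako", verb_phrase)
```

In this toy example, `dom` increases by one with each Merge applied to the existing structure, whereas merging two structures of equal depth raises the estimate by one only once.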

In the above-mentioned paradigm (Kinno et al., 2008), the four task conditions (three sentence types under the Two-argument conditions, as well as one type under the One-argument condition) were completely mixed (see **Figure 6A**). With such task requirements, the DoM domain spanned three relevant words for all sentence types under the Two-argument conditions. Under the One-argument condition, the action of two stick figures was always identical, and thus a subject (a triangle just below N in **Figure 6B**) is regarded as a unit. Under these four task conditions, participants were required to check at least one of the argument-verb relationships, demanding Search at least once. For the scrambled sentences alone, an additional Search operation should match the identical indices of the moved object and its trace. For the active, passive, and scrambled sentences, the estimates of the DoM were 2, 3, and 3, respectively, while the estimate was 1 under the One-argument condition.

#### **APPLYING THE DoM TO VARIOUS SENTENCE TYPES**

In the study of Kinno et al. (2008), we directly contrasted passive and active sentence conditions to identify a cortical region that is activated by purely syntactic processes. This stringent contrast resulted in significant activation in the left dorsal F3t (L. dF3t) alone [(−48, 24, 21), *Z* = 3.8] (**Figure 7A**), which was very close to the L. F3op/F3t activation in the study of Ohta et al. (2013). The L. dF3t activation was significantly enhanced under both the passive and scrambled sentence conditions compared to that under the active sentence condition (*P* ≤ 0.033) (**Figure 7B**), whereas there was no significant difference between the passive and scrambled sentence conditions (*P* = 0.15). Taking the One-argument condition as a reference for subtracted estimates, the signal changes in the L. dF3t were precisely correlated in a step-wise manner with the parametric model of the DoM [1, 2, 2], producing the RSS of 0.0001 and *r*<sup>2</sup> of 0.99, without significant deviation for the three contrasts (*P* ≥ 0.87). The model of the DoM thus *sufficiently* explains the L. dF3t activations. It should be noted that the parametric model of "the number of nodes" [2, 4, 4] also yielded the same fitting results in this case. The design of experimental paradigms limits the separation of multiple factors.
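The equivalence of the two parametric models follows from the fact that [2, 4, 4] is a scalar multiple of [1, 2, 2]. The sketch below assumes a least-squares fit with a single scaling factor and no intercept, and computes *r*<sup>2</sup> as one minus the ratio of the RSS to the total sum of squares about the mean; both choices are assumptions made for illustration. The function name `fit_scaled_model` and the signal changes are hypothetical and are not values from Kinno et al. (2008).

```python
def fit_scaled_model(model, signal):
    """Least-squares fit of signal ~ beta * model (no intercept).

    Returns the scaling factor, the residual sum of squares (RSS),
    and r^2 computed as 1 - RSS / TSS (TSS taken about the mean).
    """
    beta = sum(m * s for m, s in zip(model, signal)) / sum(m * m for m in model)
    fitted = [beta * m for m in model]
    rss = sum((s - f) ** 2 for s, f in zip(signal, fitted))
    mean = sum(signal) / len(signal)
    tss = sum((s - mean) ** 2 for s in signal)
    return beta, rss, 1.0 - rss / tss


# Invented signal changes for the active, passive, and scrambled
# conditions, each relative to the One-argument reference.
signal_changes = [0.10, 0.21, 0.19]

beta_dom, rss_dom, r2_dom = fit_scaled_model([1, 2, 2], signal_changes)
beta_nodes, rss_nodes, r2_nodes = fit_scaled_model([2, 4, 4], signal_changes)
```

Because the two models differ only by a factor of two, the fitted values coincide and only the scaling factor changes, so a design of this kind cannot distinguish the two factors.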

In a recent fMRI study, only right-branching constructions were examined, and activations in the L. F3t were modulated by the size of constituents (i.e., number of terminal nodes) (Pallier et al., 2011). Since the estimates of the DoM were identical to those of "the number of Merge" or "the number of non-terminal nodes" in this case, it was not possible to separate these factors. Taking their simplest condition (lists of unrelated words) as an appropriate reference, the model of the DoM actually showed a comparable or better goodness of fit for activations in the L. F3t, when compared with their log-fitting models.

#### **FURTHER CONFIRMATION OF HYPOTHESIS III**

#### **THE EFFECT OF THE SEARCH DISTANCES ON THE DoM**

Neuroimaging and psycholinguistic studies have reported that English sentences with object-relative clauses have higher processing loads than those with subject-relative clauses (Just et al., 1996; Stromswold et al., 1996; Gibson, 2000). To properly parse the relative clauses, the relative pronoun and its antecedent are coindexed; "*who<sub>i</sub>*" and "*the boy<sub>i</sub>*," respectively, in the example shown in **Figure 8**. In a subject-relative clause, a relative pronoun "*who<sub>i</sub>*" was displaced from the *subject* position denoted by a trace *t<sub>i</sub>* (originally, "*the boy<sub>i</sub> likes the girl*"), while in an object-relative clause, a relative pronoun was displaced from the *object* position (originally, "*the girl likes the boy<sub>i</sub>*"). Following the proposal by Hawkins (1999), we assume that the relative pronoun searches the corresponding trace within tree structures of a sentence (see curved arrows in **Figure 8**). In a subject-relative clause, Search ends at the initiation of the verb phrase, while in an object-relative clause, Search ends *after* a verb appears within a subordinate clause. In accordance with the Search distances for these examples, the DoM would become one unit larger for the object-relative clause than the subject-relative one. Higher processing loads observed with object-relative clauses are consistent with this inference about the DoM domain.

**FIGURE 8 | The DoM domains varied with the Search distances. (A)** A sentence with a subject-relative clause. **(B)** A sentence with an object-relative clause. In these relative clauses, a relative pronoun *who<sub>i</sub>* is displaced from its subject or object position denoted by a trace *t<sub>i</sub>*. A set of red straight arrows corresponds to the DoM domain. The digits shown in red denote the DoM for each node within the domain. Symbols used: S and S', sentence; N, noun phrase; V, verb phrase; *t<sub>i</sub>*, trace (subscripts denote the same entity).

#### **THE EFFECT OF TASK REQUIREMENTS ON THE DoM**

If Hypothesis III is correct, then the L. F3op/F3t activations can be different in accordance with task requirements, even when the same sentences are presented. In our previous fMRI study, we compared three explicit linguistic tasks with the same set of normal two-word sentences: syntactic decision, semantic decision, and phonological decision tasks (Suzuki and Sakai, 2003). In the syntactic decision task, the participants judged whether or not the presented sentence was syntactically correct, and this judgment required syntactic knowledge about the distinction between transitive and intransitive verbs (e.g., normal sentence, "*yuki-ga tumoru*," "*snow lies* (*on the ground*)"; anomalous sentence, "*yuki-o tumoru*," "(*something*) *lies snow*"). In the semantic decision task, lexico-semantic knowledge about selectional restrictions was indispensable. In the phonological decision task, phonological knowledge about accent patterns was required. Neither the semantic decision task nor the phonological decision task, both with *implicit* syntactic processing, elicited significant activations in the L. F3op/F3t (−57, 9, 6), which was significantly activated during *explicit* syntactic processing, even by a direct comparison between the syntactic decision task and the other tasks. These results suggest the presence of the DoM domain in accordance with the task requirements of explicit syntactic processing.

#### **THE MIXED EFFECTS OF THE SEARCH DISTANCES AND TASK REQUIREMENTS ON THE DoM**

In another fMRI study, we directly compared syntactic decision and short-term memory tasks (Hashimoto and Sakai, 2002). In this unique paradigm, we visually presented nested sentences that included two proper nouns, two verbs, and one pronoun, in which either verb or pronoun was underlined. After presenting one complete sentence in a phrase-by-phrase manner, paired phrases including an underlined phrase were shown. In one syntactic decision task (SYN-1), participants were required to judge whether the subject of an underlined verb corresponded to the person in paired phrases (**Figure 9A**). In this case, the Search distance was the structural distance between the subject and verb of the same clause. In the other syntactic decision task (SYN-2), the participants were required to judge whether an underlined pronoun was able to refer to the person in paired phrases (**Figure 9B**). In this case, the Search distance was the structural distance between the coindexed noun and pronoun. In these syntactic decision tasks, the Search distance, and consequently the DoM domain, changed dynamically in accordance with the different task requirements, even when the same sentences were presented. The estimate of the resultant DoM was 2 for both cases. In a short-term memory task with a sentence, the participants memorized the linear order of the phrases, and judged whether the left-hand phrase preceded the right-hand one in the original sequence (**Figure 9C**). With such a task requirement, the factor of DoM would become less effective. Indeed, we found that activations in the L. F3op/F3t were equally enhanced in both syntactic decision tasks when compared with the short-term memory task.

#### **CONCLUSIONS**

In this article, we reviewed recent advances in theoretical linguistics and functional neuroimaging in the following respects. First, we provided theoretical discussions about the hierarchical tree structures of sentences, and introduced the two fundamental linguistic operations of Merge and Search. We also presented our hypotheses that the DoM is a key computational concept to properly measure the complexity of tree structures (Hypothesis I), and that the basic frame of the syntactic structure of a given linguistic expression is determined essentially by functional elements, which trigger Merge and Search operations (Hypothesis II). Second, we presented our recent fMRI studies, which have demonstrated that the DoM, together with the number of Search, is indeed a key syntactic factor that accounts for syntax-selective activations in the L. F3op/F3t and L. SMG (Ohta et al., 2013). Moreover, based on the DCM and DTI results, we revealed the significance of the top-down connection from the L. F3op/F3t to L. SMG, suggesting that information about the DoM is transmitted through this specific dorsal pathway. Third, we further hypothesized that the DoM domain changes dynamically in accordance with iterative Merge applications, the Search distances, and/or task requirements (Hypothesis III). We showed that the DoM sufficiently explains activation modulations due to different structures reported in previous fMRI studies (Kinno et al., 2008; Pallier et al., 2011). Finally, we confirmed that Hypothesis III accounts for higher processing loads observed with object-relative clauses, as well as activations in the L. F3op/F3t during explicit syntactic decision tasks, reported in the previous neuroimaging and psycholinguistic studies (Just et al., 1996; Stromswold et al., 1996; Gibson, 2000; Hashimoto and Sakai, 2002; Suzuki and Sakai, 2003). 
It is likely that the DoM serves as a key computational principle for other human-specific cognitive capacities, such as mathematics and music, both of which can be expressed by hierarchical tree structures. A future investigation into the computational principles of syntax will further deepen our understanding of uniquely human mental faculties.

#### **ACKNOWLEDGMENTS**

We would like to thank R. Kinno, H. Miyashita, K. Iijima, and T. Inubushi for their helpful discussions, N. Komoro for her technical assistance, and H. Matsuda for her administrative assistance. This research was supported by a Core Research for Evolutional Science and Technology (CREST) grant from the Japan Science and Technology Agency (JST), by a Grant-in-Aid for Scientific Research (S) (No. 20220005) from the Ministry of Education, Culture, Sports, Science and Technology, and by a Grant-in-Aid for Japan Society for the Promotion of Science (JSPS) Fellows (No. 24·8931).

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 30 September 2013; paper pending published: 17 October 2013; accepted: 01 December 2013; published online: 18 December 2013.*

*Citation: Ohta S, Fukui N and Sakai KL (2013) Computational principles of syntax in the regions specialized for language: integrating theoretical linguistics and functional neuroimaging. Front. Behav. Neurosci. 7:204. doi: 10.3389/fnbeh.2013.00204*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience. Copyright © 2013 Ohta, Fukui and Sakai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Neural substrates of figurative language during natural speech perception: an fMRI study

#### *Arne Nagels<sup>1</sup>\*, Christina Kauschke<sup>2</sup>, Judith Schrauf<sup>2</sup>, Carin Whitney<sup>3</sup>, Benjamin Straube<sup>1</sup> and Tilo Kircher<sup>1</sup>*

*<sup>1</sup> Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Marburg, Germany*

*<sup>2</sup> Department of Germanic Linguistics, Philipps-University Marburg, Marburg, Germany*

*<sup>3</sup> Department of Psychology and York Neuroimaging Centre, University of York, York, UK*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Seth D. Norrholm, Emory University School of Medicine, USA; Yueqiang Xue, The University of Tennessee Health Science Center, USA*

#### *\*Correspondence:*

*Arne Nagels, Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Str. 8, 35039 Marburg, Germany e-mail: nagels@med.uni-marburg.de*

Many figurative expressions are fully conventionalized in everyday speech. Regarding the neural basis of figurative language processing, research has predominantly focused on metaphoric expressions in minimal semantic context. It remains unclear to what extent metaphoric expressions during continuous text comprehension activate neural networks similar to those activated by isolated metaphors. We therefore investigated the processing of similes (figurative language, e.g., "He smokes like a chimney!") occurring in a short story. Sixteen healthy, male, native German speakers listened to similes that came about naturally in a short story, while blood-oxygenation-level-dependent (BOLD) responses were measured with functional magnetic resonance imaging (fMRI). For the event-related analysis, similes were contrasted with non-figurative control sentences (CS). The stimuli differed with respect to figurativeness, while they were matched for frequency of words, number of syllables, plausibility, and comprehensibility. Similes contrasted with CS resulted in enhanced BOLD responses in the left inferior frontal gyrus (IFG) and adjacent middle frontal gyrus. Concrete CS as compared to similes activated the bilateral middle temporal gyri as well as the right precuneus and the left middle frontal gyrus (LMFG). Activation of the left IFG for similes in a short story is consistent with results on single sentence metaphor processing. The findings strengthen the importance of the left inferior frontal region in the processing of abstract figurative speech during continuous, ecologically-valid speech comprehension; the processing of concrete semantic contents goes along with a down-regulation of bilateral temporal regions.

#### **Keywords: figurative speech, simile, abstractness, inferior frontal gyrus, fMRI**

#### **INTRODUCTION**

Figurative expressions are an established part of everyday speech and are often fully conventionalized. For example, parts of the human body are used in a multitude of figurative expressions (head of department, eye of needle, arm of tree, etc.). Thus, figurative speech use goes far beyond the concept of a mere stylistic device and can be seen as an integral part of day-to-day communication. So far, it remains unclear to what extent the comprehension of figurative speech draws on additional resources when presented in a naturally-evolving continuous and coherent story. Increases in activation of the left inferior frontal gyrus (IFG) have previously been reported for isolated sentences that contained figurative as opposed to non-figurative elements (Rapp et al., 2004, 2007; Kircher et al., 2007). However, prior context and repeated exposure to figurative speech, as it appears in more natural environments, can have an impact on how we process figures of speech such as similes. Context and conventionalization may facilitate comprehension (for idiom comprehension see Gibbs, 1992) and therefore neural activation of contextualized similes might not differ from non-simile sentences.

The current study focuses on the processing of figurative speech in the form of similes such as "The sun is like the eye of heaven" in a natural context, without constraining instructions or predetermined cognitive tasks (e.g., decision tasks in which participants are asked to press a button whenever abstract or concrete content is presented).

A simile, such as "The sun is like the eye of heaven," can be divided into three different components: (1) the explained element ("the sun"), (2) the explaining element ("the eye of heaven"), and (3) the term of comparison (TOC; "is like"), which connects the two elements in the simile (Leech, 1969). For successful comprehension, the listener needs to refer back to the explained element and identify similarities or common features with the corresponding explaining element (e.g., "He smokes like a chimney!" stresses that someone smokes heavily). The aspect that the two elements have in common is the so-called tertium comparationis, i.e., the third domain involved in a comparison. According to prevailing theories, similes are strongly linked to metaphors, which can be regarded as similes without the TOC (i.e., elliptical similes; cf. Aristotle).

When metaphors are presented, the listener is confronted with a semantic conflict between the explained and explaining element, which needs to be resolved. The initiator of a figurative utterance selects, emphasizes, suppresses, and organizes features of the explained element by applying characteristics of the explaining elements. Thus, a "mental linkage" between the explained element and the corresponding explaining element is required which goes beyond the usual semantic or word-by-word analysis (Rapp et al., 2004). The TOC ("like") in similes may facilitate this linking process, since it explicitly points to the comparative nature of the utterance.

Usually, similes or metaphors do not stand alone but are integrated into a speech or text. This context can alter the ease with which meaning is integrated, including processing of figures of speech (Gibbs, 1992). Prior context facilitates the comprehension of idioms when it is consistent with the specific entailments of the idiom (i.e., their conceptual representation): though there might be different ways of expressing anger in an idiomatic way (e.g., "bite your head off" as opposed to "blow your stack"), it is easier to comprehend idioms whose specific conceptual representation has been primed by previous information, e.g., describing anger in a way that refers to anger as "animal behavior" (Nayak and Gibbs, 1990). A coherent and evolving story can provide such information, and participants should therefore understand figurative expressions that are embedded in a story more easily than isolated sentences or sentence pairs with little detail (e.g., Rapp et al., 2007; Schmidt and Seger, 2009). Behavioral studies of other complex semantic operations, such as ambiguity processing, have also shown that cognitive resources can be saved when the target item is embedded in semantically coherent context as opposed to isolated or neutral environments [for a review see Simpson (1994)]. Also, the amount of prior, coherent information can facilitate inference making and improve comprehension of non-figurative material (Zwaan and Radvansky, 1998).

A number of functional magnetic resonance imaging (fMRI) studies have investigated the neural correlates of figurative speech mostly in the form of metaphoric sentences (Rapp et al., 2004, 2007; Eviatar and Just, 2006; Kircher et al., 2007; Mashal et al., 2007, 2009; Shibata et al., 2007; Stringaris et al., 2007; Schmidt and Seger, 2009; Desai et al., 2011; Diaz and Hogstrom, 2011; Diaz et al., 2011), and for similes (Shibata et al., 2012). The brain response to similes in a story context, however, remains unexplored thus far. The results of these previous studies support the involvement of the left lateral prefrontal cortex [for a critical review on the neural basis of metaphor processing see Schmidt et al. (2010)], as a correlate for increased cognitive demand during the comprehension of figurative language. In particular, Rapp et al. (2004) for the first time reported enhanced cortical activation in the left IFG [Brodmann area (BA) 45/47] for metaphor reading as compared to literal sentences. The left IFG is an integral component of the semantic processing network and has been related to executive aspects of meaning retrieval, such as semantic search, retrieval, selection, and integration (e.g., Thompson-Schill et al., 1997; Wagner et al., 2001; Noppeney et al., 2004; Badre et al., 2005; Badre and Wagner, 2007; Bedny et al., 2008; Binder et al., 2009). Moreover, difficult metaphors in comparison to easy metaphors such as "Political success is a house of cards" vs. "Books are treasure chests of information" (Schmidt and Seger, 2009) as well as anomalous metaphors ["Their (financial) capital has a lot of rhythm" (Ahrens et al., 2007)] were found to selectively activate the left IFG. Similarly, conventional metaphors as compared to novel metaphors were found to engage the left IFG, whereas novel metaphors activated the left middle frontal gyrus (LMFG) during a reading paradigm (Mashal et al., 2007).
The graded response in prefrontal cortex, particularly the left IFG, suggests that activation correlates with the level of cognitive-semantic resources required for successful performance (e.g., integration effort). This is in accordance with previous semantic retrieval studies of graded difficulty, which showed enhanced left IFG response during tasks of high vs. low executive-semantic demands (Thompson-Schill et al., 1997; Roskies et al., 2001; Wagner et al., 2001; Badre et al., 2005; Snyder et al., 2007; Zempleni et al., 2007; Kuperberg et al., 2008; Nagel et al., 2008; Ruff et al., 2008; Snijders et al., 2009; Whitney et al., 2009a,b). A recent study found activations in the head of the caudate when metaphors were presented in a context (Uchiyama et al., 2012). However, the majority of past studies on figurative language comprehension utilized highly controlled sentence reading paradigms with minimal prior information (Rapp et al., 2004, 2007; Eviatar and Just, 2006; Stringaris et al., 2006, 2007; Mashal et al., 2007; Yang et al., 2009). The observed left IFG activations for metaphors might partly reflect executive-semantic processes that are related to the lack of sufficient contextual priming, as it would occur in naturally evolving texts.

The aim of the current study was therefore to analyze the neural responses to simile processing and determine the involvement of the left IFG when similes were presented within a natural, unconstrained short story context. Naturalistic stimuli were successfully investigated in the context of different experimental fMRI paradigms (Hasson et al., 2004; Skipper et al., 2007; Domahs et al., 2012). However, the neural correlates of understanding similes in a short story have not been investigated so far. We hypothesized left IFG activation during processing of sentences containing similes (e.g., "he jumps like a gazelle") vs. literal sentences of comparable frequency, plausibility, comprehensibility, and length. In addition, we expected significant correlations between unfamiliar as well as highly abstract similes and enhanced blood-oxygenation-level-dependent (BOLD) responses in the left frontal region.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Initially, 19 male subjects took part in the fMRI study. Due to head movement, data of three subjects had to be discarded from further analysis. For the remaining 16 participants movement was minimal as the maximum change in translation and rotation for each participant was less than one voxel size (i.e., 3.5 mm) and less than 1°, respectively. All 16 participants (mean age = 27.00 years, *SD* = 6.65; mean years of education = 14.50 years, *SD* = 1.67) were native speakers of German, right-handed according to the Edinburgh Inventory of Handedness (Oldfield, 1971) and showed average or above average verbal IQ as assessed by the German MWT-B multiple choice vocabulary test (Lehrl et al., 1995; mean estimated verbal IQ = 120.06, *SD* = 17.16). Subjects with recent substance use or general MRI incompatibility (e.g., metal implants) were excluded. All subjects gave informed consent and were paid 20 Euros for participation in the study. The local ethics committee at RWTH Aachen University approved the study.

#### **STIMULI**

A slightly modified version of the short story "Der Kuli Kimgun" by Dauthendey (1930) was chosen for this study, as in Whitney et al. (2009b). Low-frequency or foreign words were replaced with more familiar or high-frequency words. The final version included a total of 3581 words.

In general, short stories are well-structured narratives restricted to a few protagonists and basic narrative events which, together, are ascribed to a single, central conflict. Our story was written from an omniscient perspective, leaving out elaborate descriptions of the characters' emotional or mental states. The events occur in chronological order, which allows the listener to build up a temporally continuous mental representation of the story. The short story chosen for the current fMRI study contained figurative descriptions of events and situations.

For auditory presentation during fMRI, the story was professionally recorded and spoken in a natural way by a trained, male speech therapist. The duration of the story was 23:32 min.

#### **PROCEDURE**

The story was presented via MRI compatible headphones in two successive runs lasting 14:32 and 9:00 min, respectively. Participants were instructed to close their eyes and listen to the story carefully. To make sure that subjects attended to the content, they were informed at the beginning of the experiment that a short interview about the content of the story would follow the MRI session. In this interview, 10 questions regarding critical episodes of the short story had to be answered.

#### **FIGURATIVE AND CONTROL SENTENCES**

First, 32 similes as well as 50 randomly chosen control sentences (CS) were extracted from the short story independently by two linguists (Christina Kauschke and Judith Schrauf). In a behavioral test, the isolated similes and CS were rated by 20 volunteers, who did not take part in the fMRI study, according to the dimensions "plausibility," "comprehensibility," and "figurativeness." For this purpose, an analogue scale from 1 to 7 was used. Regarding the plausibility rating the instruction was as follows: "Please rate the subsequent sentences according to their plausibility. Very plausible sentences describe ordinary events being easy to follow." Comprehensibility was rated according to the instruction: "Please rate the subsequent sentences according to their comprehensibility. Very comprehensible sentences are those where the meaning can be understood easily and within a very short period of time." Regarding the dimension figurativeness, the rating instruction was formulated: "Sentences can be distinguished with regard to their properties to evoke inner pictures from the content being conveyed. The figurative content can be perceived faster and be grasped more easily in some sentences. Please rate the degree of figurativeness in the following sentences."

For the imaging analysis, 30 similes and 30 CS were chosen so that no significant differences were found between the similes and the CS according to the dimensions of plausibility [*F*(1, 58) = 2.68, *p* = 0.108], comprehensibility [*F*(1, 58) = 0.493, *p* = 0.49], and figurativeness [*F*(1, 58) = 1.76, *p* = 0.19]. The similes were found to be rather unfamiliar [mean = 2.91 (*SD* = 2.14)] and abstract [mean = 4.86 (*SD* = 2.29)]. All similes and CS were matched according to word frequency as well as to the number of syllables. Similes in comparison to CS revealed no significant differences with regard to their individual length in the story.

#### **fMRI DATA ACQUISITION**

All scanning was performed on a 1.5 T scanner (Gyroscan Intera, Philips Medical, Eindhoven, The Netherlands) using standard gradients and a circular polarized phase array head coil. For each subject, we acquired two series of functional volumes of T2*-weighted axial EPI-scans parallel to the AC/PC line with the following parameters: number of slices (NS), 22; slice thickness (ST), 5.0 mm; interslice gap (IG), 0.55 mm; matrix size (MS), 64 × 64; field of view (FOV), 240 × 240 mm; echo time (TE), 50 ms; repetition time (TR), 2.0 s. Four hundred and thirty-six functional volumes were acquired for the first part of the story and 270 functional volumes for the second part, adding up to 706 volumes in total.
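As a consistency check on these acquisition parameters, the number of volumes per run should equal the run duration divided by the TR. The short calculation below uses the run lengths given in the Procedure section; the helper name `n_volumes` is hypothetical.

```python
TR_S = 2.0  # repetition time in seconds


def n_volumes(minutes: int, seconds: int, tr_s: float = TR_S) -> int:
    """Number of volumes acquired in a run of the given length."""
    return round((minutes * 60 + seconds) / tr_s)


run1 = n_volumes(14, 32)  # first part of the story
run2 = n_volumes(9, 0)    # second part of the story
total = run1 + run2
```

The two runs together account for the full story duration of 23:32 min at a TR of 2.0 s.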

#### **fMRI DATA ANALYSIS**

MR images were analyzed using Statistical Parametric Mapping software (SPM5; www.fil.ion.ucl.ac.uk) implemented in MATLAB (v. R2006b, Mathworks Inc., Sherborn, MA). After discarding the first three volumes, all images were realigned to the first image to correct for head movement. Unwarping was used to correct for the interaction of susceptibility artifacts and head movement. After realignment and unwarping, the signal measured in each slice was shifted relative to the acquisition time of the middle slice using a sinc interpolation in time to correct for their different acquisition times. Volumes were then normalized into standard stereotaxic anatomical MNI-space by using the transformation matrix calculated from the first EPI-scan of each subject and the EPI-template. Afterwards, the normalized data with a resliced voxel size of 4 × 4 × 4 mm were smoothed with a 10 mm full width at half maximum (FWHM) isotropic Gaussian kernel to accommodate intersubject variation in brain anatomy. The time series data were filtered with a high-pass cut-off of 1/128 Hz. The autocorrelation of the data was estimated and corrected for.
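For readers less familiar with the FWHM convention, the relation between the reported kernel width and the standard deviation of the Gaussian is sigma = FWHM / (2 * sqrt(2 * ln 2)). The sketch below applies this standard formula to the 10 mm kernel and the 4 mm resliced voxels; the function name `fwhm_to_sigma` is illustrative and not part of SPM.

```python
import math


def fwhm_to_sigma(fwhm):
    """Convert a Gaussian full width at half maximum to its standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


FWHM_MM = 10.0   # smoothing kernel reported in the text
VOXEL_MM = 4.0   # resliced voxel edge length reported in the text

sigma_mm = fwhm_to_sigma(FWHM_MM)
sigma_voxels = sigma_mm / VOXEL_MM
print(f"sigma = {sigma_mm:.2f} mm = {sigma_voxels:.2f} voxels")
```

The kernel thus has a standard deviation of about 4.25 mm, i.e., slightly more than one resliced voxel.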

Onsets for the simile phrases were set at the beginning of the phrase that referred to the explained element. The duration was measured individually for each simile and included the explained element, TOC, and explaining element. CS were modeled in a similar way, with the onset at the beginning of the phrase and the duration of the event being equal to the duration of the complete phrase. A random-effects group analysis was performed entering the contrast images for similes and CS from the first-level analysis into a full-factorial design matrix.
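The event coding described above can be sketched as a boxcar regressor in scan units. This is a simplified illustration, not the SPM5 implementation: the function name `boxcar`, the onset, and the duration are hypothetical, and the convolution with a hemodynamic response function that SPM applies is omitted.

```python
import math


def boxcar(onsets_s, durations_s, n_volumes, tr_s=2.0):
    """Mark every volume whose acquisition window overlaps an event.

    Volume i is assumed to cover the interval [i * tr_s, (i + 1) * tr_s).
    """
    regressor = [0.0] * n_volumes
    for onset, duration in zip(onsets_s, durations_s):
        first = math.floor(onset / tr_s)
        stop = math.ceil((onset + duration) / tr_s)
        for volume in range(max(first, 0), min(stop, n_volumes)):
            regressor[volume] = 1.0
    return regressor


# One hypothetical simile starting 4 s into the run and lasting 3 s.
print(boxcar([4.0], [3.0], n_volumes=6))
```

Because each simile has its own duration, the number of marked volumes differs from event to event, which is what distinguishes this coding from a fixed-length event model.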

In a separate analysis, rating values for figurativeness, familiarity, and abstractness were entered individually as parametric variates for each simile into the first-level analysis in order to analyze correlations between brain responses during figurative speech processing and the aforementioned dimensions.

A further *post-hoc* analysis was performed with respect to the particular role of the TOC in the processing of abstract figurative speech. Therefore, the particular onset of the TOC (engl. "as" or "like"; e.g., in "He smokes like a chimney") was individually calculated using an event-related design.

The results were thresholded at a voxel-wise level of *p* < 0.001. A Monte Carlo simulation of the brain volume of the current study was conducted to establish an appropriate voxel contiguity threshold (Slotnick et al., 2003). The procedure is based on the fact that the probability of observing clusters of activity due to voxel-wise Type I error (i.e., noise) decreases systematically as cluster size increases. Assuming an individual voxel Type I error of *p* < 0.001 in our study, a cluster extent of 13 contiguous resampled voxels was indicated as necessary to correct for multiple voxel comparisons.
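The principle behind the cluster extent threshold can be illustrated with a minimal Monte Carlo sketch. It is not the procedure of Slotnick et al. (2003): it assumes spatially independent noise (no smoothing), face-connected clusters, and a small hypothetical volume, so it will not reproduce the 13-voxel threshold reported here. All function names and parameter values are illustrative.

```python
import random

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def max_cluster_size(active):
    """Size of the largest face-connected cluster among active voxel coordinates."""
    remaining = set(active)
    largest = 0
    while remaining:
        stack = [remaining.pop()]
        size = 0
        while stack:
            x, y, z = stack.pop()
            size += 1
            for dx, dy, dz in NEIGHBORS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)
    return largest


def cluster_extent_threshold(shape, voxel_p, alpha=0.05, n_sims=1000, seed=0):
    """Smallest cluster size k for which pure noise yields a cluster of at
    least k voxels in fewer than a fraction alpha of the simulated volumes."""
    rng = random.Random(seed)
    nx, ny, nz = shape
    maxima = []
    for _ in range(n_sims):
        # Each voxel is a false positive with probability voxel_p.
        active = [(x, y, z)
                  for x in range(nx) for y in range(ny) for z in range(nz)
                  if rng.random() < voxel_p]
        maxima.append(max_cluster_size(active))
    k = 1
    while sum(m >= k for m in maxima) / n_sims >= alpha:
        k += 1
    return k


# Hypothetical 16 x 16 x 8 volume with a voxel-wise false-positive rate of 0.001.
print(cluster_extent_threshold((16, 16, 8), voxel_p=0.001, n_sims=500))
```

With smoothed data, neighboring voxels are correlated and noise clusters become larger, which is why a realistic simulation has to model the spatial autocorrelation and yields a higher extent threshold than this independent-noise sketch.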

Each of the reported activations was determined with the Anatomy Toolbox for SPM5 (v. 1.7b, http://www.fz-juelich.de/inm/inm-1/spm\_anatomy\_toolbox). The imaging figures were made with the MRIcron software package (http://www.cabiatl.com/mricro/mricron/).

#### **RESULTS**

#### **BEHAVIORAL RESULTS**

During the post-scan interview about critical story episodes, all participants were able to recall all the desired details in response to each of the 10 questions (Whitney et al., 2009b). Answers to all questions were provided quickly and effortlessly.

#### **fMRI RESULTS**

#### *Simile > CS*

In the whole brain analysis, the simile sentences as contrasted with the CS selectively activated the left IFG including the pars triangularis and adjacent middle frontal gyrus (LMFG; *p* < 0.05, Monte Carlo corr.; **Figure 1**). Local maxima were found in both regions, though BOLD responses for the left IFG were stronger (**Table 1**). Extracted beta values for the activated region revealed activations for CS as well; however, activations were significantly stronger for the simile condition.

#### *CS > simile*

A network of activations encompassing the precuneus, bilateral middle temporal gyri as well as the LMFG was found for the whole brain *CS > simile* contrast. A large cluster of activation extended from the left (BA 23) and right precuneus (BA 7) to left middle (BA 31) and posterior cingulate cortex (BA 7). Activations in the right middle temporal gyrus (BA 19) extended into the region of the angular gyrus (BA 39) as well as the middle occipital gyrus. Contralaterally, left middle temporal as well as the left angular gyrus (BA 39) were found to be more activated during CS processing. Enhanced BOLD responses were also found in the LMFG as well as in the bilateral middle and inferior temporal gyri (BA 20; **Table 1**).

#### **CORRELATION ANALYSES**

#### *Familiarity*

The correlation analysis between the degree of familiarity and BOLD signal changes revealed a pattern of activation in the left parahippocampal region (**Table 1**, **Figure 1**).

**FIGURE 1 | Top:** Imaging results for the contrast Simile *>* CS. The bar graph illustrates the contrast estimates (beta values; yellow bar = Simile, blue bar = CS). **Bottom**: Imaging results for the interaction of brain responses with both familiar and abstract similes.

#### *Abstractness*

Highly abstract similes resulted in BOLD enhancements in the anterior cingulate cortex as well as in activations in the right superior frontal gyrus (**Table 1**, **Figure 1**).

#### *Figurativeness*

No significant relation was found between BOLD enhancements and figurativeness.

#### *Post-hoc* **ANALYSES RESULTS**

#### *TOC > CS*

Enhanced BOLD responses for the TOC as opposed to CS were found in the left-hemispheric IFG (p. triangularis) and the superior parietal region (Supplementary Material).

#### *CS > TOC*

The opposed contrast revealed pronounced activations in the right precuneus, middle temporal gyrus, and angular gyrus. In the left hemisphere, CS *>* TOC elicited neural responses in the middle temporal region (Supplementary Material).

#### *([TOC > CS] > [SIMILE > CS])*

The contrast for TOC *>* CS as opposed to SIMILE *>* CS again resulted in left lateralized activations encompassing the IFG and the superior area of the parietal cortex (Supplementary Material).

#### *([SIMILE > CS] > [TOC > CS])*

The inverse contrast resulted in right hemispheric activations in the precuneus and the middle temporal region (Supplementary Material).

#### *SIMILE > CS* **∩** *TOC > CS*

The conjunction analyses for both similes and TOC as contrasted with CS activated the left pars triangularis (Supplementary Material).

**Table 1 | Peak activation within clusters for the contrasts Simile *>* CS, CS *>* Simile as well as for the correlation analysis with familiarity and abstractness (whole-brain analysis, Monte Carlo corr. *p* < 0.001).**


*Coordinates refer to MNI space.*

#### *CS > SIMILE* **∩** *CS > TOC*

CS as opposed to similes and TOC activated a neural pattern encompassing the right precuneus, middle temporal gyrus, and angular gyrus. In the left hemisphere, enhanced BOLD responses were found in the middle temporal region (Supplementary Material).

#### **DISCUSSION**

Figurative expressions, such as metaphors and similes, are fundamental to language and thought. They represent a conventionalized part of everyday communication (Lakoff and Johnson, 2003). The processing of such figurative expressions requires a mental linkage between the explained and the explaining element. In the case of a particular kind of metaphor, i.e., the simile, this linkage is made explicit by the use of a TOC ("as" or "like," German: "wie"), also referred to as a hedge word (Shibata et al., 2012). The aim of the present study was to investigate the neural basis of simile processing using a highly naturalistic, continuous speech perception paradigm. This allowed us to examine whether brain activation reported in previous investigations of figurative language in the left IFG also holds true in a natural setting of short story comprehension. In line with findings on metaphor processing (Rapp et al., 2004), we observed an involvement of the left IFG for the simile condition, suggesting that more neural resources are required to interpret the figurative meaning of the simile under naturalistic conditions. We moreover found that highly abstract figurative expressions were associated with enhanced BOLD responses in the anterior cingulate region and the right superior frontal gyrus. Correlation analysis for familiar similes resulted in activations in the left parahippocampal region.

The neural processing of continuous, naturalistic stimuli has thus far only rarely been investigated. Recent studies have explored natural speech production (Kircher et al., 2000, 2004; Buchheim et al., 2006), narrative comprehension (Wilson et al., 2008; Whitney et al., 2009b; Brennan et al., 2012; Domahs et al., 2012), or naturalistic audio-visual processing mechanisms (Bartels and Zeki, 2004; Hasson et al., 2004, 2010).

#### **SIMILE PROCESSING IN THE LEFT IFG**

The results demonstrate that similes elicited enhanced BOLD responses in the dorsal part of the pars triangularis (left IFG) and the ventral portion of the LMFG when contrasted with nonfigurative CS. The increased activation in these brain regions might reflect the enhanced demand on deep semantic processing that integrates figurative expressions into the surrounding context. Based on the initial semantic conflict between explained and explaining element, the listener interprets the figurative expression by means of a parallel, which is drawn to a different entity ("tertium comparationis"). Thus, common characteristics as well as distinct features of both elements have to be selected, emphasized, inhibited, and organized, which goes beyond the usual level of contextual semantic word processing (Rapp et al., 2004).

Enhanced neural responses in the left IFG have previously been found in studies on metaphors compared to literal phrases using single sentences (Rapp et al., 2004). In a recent fMRI study, Schmidt and Seger (2009) compared sentences with easy and difficult metaphors. While easy metaphors were found to selectively activate the left MFG, difficult metaphors elicited enhanced BOLD responses in the left IFG. Similarly, the processing of metaphorical sentences taken from poetry resulted in activations of the left dorsolateral prefrontal cortex (Mashal et al., 2009). Bambini and colleagues investigated the neural correlates of implicit metaphor processing as compared to non-metaphorical passages, while participants were explicitly engaged in an adjective matching task to be performed after reading the target passages (Bambini et al., 2011). The authors found a widespread neural network encompassing the left and right inferior frontal gyri, the right superior temporal gyrus, the left angular gyrus, and the anterior cingulate region. Imaging results were interpreted in terms of integrating linguistic material and world knowledge into the context. The left IFG and in particular the pars triangularis may hence represent a key region for figurative speech processing, including similes in a story context.

#### **ACTIVATION FOR CONTROL SENTENCES**

A widespread bilateral cortical network was found to be activated for control vs. simile sentences. Thus, enhanced BOLD responses were observed for CS in contrast to similes in the middle temporal gyrus bilaterally, the precuneus and the LMFG. Since the CS were matched with regard to plausibility and comprehensibility, enhanced activation may reflect general semantic analysis of concrete information continuously presented as previously found by Whitney et al. (2009a,b). The activations, in particular in the bilateral temporal gyri, were consistently reported for auditory language processing [for review see: Ferstl et al. (2008)]. Continuous listening to the concrete sentences includes inferences for bridging successive utterances, the use of background knowledge about concrete entities of the world and discourse context as well as lexical retrieval (Ferstl et al., 2008), all processes that have been attributed to the neural network found in our study.

The fact that the processing of similes resulted in a reduced involvement of this bilateral "concrete sentence" network indicates that either concrete representations were inhibited in favor of the relevant abstract interpretation, or the double representation (e.g., in the sentence "He smokes like a chimney!" the representation of smoke will be activated by both smoke and chimney) of the respective concept led to a facilitation of related processing mechanisms. Nevertheless, this finding in general supports the theory that abstract figurative meaning is mainly represented in the left hemisphere (Perlovsky and Ilin, 2010).

#### **ACTIVATIONS FOR FAMILIARITY**

All of our similes were non-conventional, but more or less familiar to the listeners. We found a positive association between familiarity of similes and activation in the left parahippocampal region. No negative correlations with BOLD enhancements were found. These data suggest that familiar, lexicalized and therefore well-known similes, e.g., "serve like a slave," are associated with enhanced semantic memory processes (Hoenig and Scheef, 2005). Thus, the enhanced semantic memory retrieval from the long-term storage as well as the integrative associative-mnemonic processes (Hoenig and Scheef, 2005) suggest a contribution of the left parahippocampal region to the processing of familiar figurative speech. With regard to the neural substrates of metaphor processing easy and familiar metaphors as contrasted with literal sentences have previously been found to activate the left parahippocampal gyrus (Schmidt and Seger, 2009). Similarly, Yang and colleagues (Yang et al., 2009) revealed activations in the bilateral hippocampal gyri for conventional metaphors, e.g., "She is a peach," as opposed to a redundant condition such as "She is a female." In the current investigation significant correlations for familiar similes were solely restricted to BOLD enhancements in the left parahippocampal region. A number of reasons may account for the selective recruitment of the parahippocampal gyrus. First, the linguistic structure of similes as compared to other metaphoric expressions differs with regard to the presence of a TOC (usually "like"). The presence of a mental linkage presumably facilitates the understanding of the figurative speech part, which might be explained by evoking wider associations of memory. 
Second, the auditory task design asking the participants to listen carefully to the narrative instead of reading or judging the isolated figurative or literal expressions, respectively, differs from recent experimental designs (Shibata et al., 2007; Schmidt and Seger, 2009; Yang et al., 2009).

#### **ACTIVATIONS FOR ABSTRACTNESS**

In general, the anterior cingulate is involved in many processes, such as verbal working memory as well as in selective attention, online-monitoring processes, and abstract auditory sequencing (Carter et al., 1998; Macdonald et al., 2000; Lee et al., 2011). With regard to figurative speech processing, enhanced neural responses in the anterior cingulate were previously reported (Rapp et al., 2004; Shibata et al., 2007; Bambini et al., 2011; Diaz and Hogstrom, 2011). In the current study, however, BOLD enhancements in this region and in the right superior frontal gyrus were found for more abstract similes, such as "he moved forward like a swimmer against the tide." This pattern of activation suggests an involvement of enhanced semantic selection and monitoring processes, since relevant aspects and appropriate literal meanings of the abstract explaining element must be filtered and interpreted. These enhanced cognitive control mechanisms together with the comparatively stronger abstract semantic integration demands may have resulted in the recruitment of the anterior cingulate.

#### **ACTIVATIONS FOR TERM OF COMPARISON**

*Post-hoc* analyses for the TOC resulted in a neural pattern of activations encompassing the left hemispheric pars triangularis as well as superior parts of the parietal region. The conjunction analyses with similes (including the whole figurative speech phrase) again resulted in BOLD enhancements in the pars triangularis. It might be hypothesized that the TOC predicts the upcoming figurative information early; the TOC represents a bridging element linking the concrete (the explained element) to the subsequent abstract mental image. The TOC ("like") may support this linking process, since it explicitly points to the comparative nature of the utterance. Moreover, it can be assumed that the competition between the abstract and the literal meaning resulted in the additional recruitment of the left IFG (Chen et al., 2008).

#### **LIMITATIONS**

Analyzing continuous and authentic speech perception in a natural context also entails a number of methodological problems. First, CS, though carefully matched, still represent an arbitrary selection that could differ in a specific aspect that cannot be systematically controlled for. Second, ratings (e.g., plausibility evaluations) were performed on isolated sentences and consequently do not consider the specific narrative context. However, for the processing of similes in contrast to CS we could demonstrate a result pattern highly comparable to that previously found in highly controlled experiments on figurative speech processing (e.g., Rapp et al., 2004, 2007; Kircher et al., 2007). Correlation analyses moreover revealed plausible result patterns, indicating that sentence evaluations are associated with corresponding cognitive mechanisms.

#### **CONCLUSIONS**

The present study suggests that the left IFG plays a crucial role in the processing of figurative comparisons embedded into highly naturalistic continuous speech processing within a short story. These findings add novel plausibility to previous, highly restrained experiments and show the applicability of this approach. In general, future investigations may consider employing experimental paradigms, using ecologically valid and naturally evolving stimulus material, e.g., including contextual information or multi-modal processing (audio-visual perception), with a high resemblance to the real world.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/BehavioralNeuroscience/10.3389/fnbeh.2013.00121/abstract

#### **REFERENCES**

M., et al. (2006). Measuring attachment representation in an FMRI environment: a pilot study. *Psychopathology* 39, 144–152. doi: 10.1159/000091800

studies on text comprehension. *Hum. Brain Mapp.* 29, 581–593. doi: 10.1002/hbm.20422

Chicago Press. doi: 10.7208/chicago/9780226470993.001.0001

*Neuroimage* 33, 784–793. doi: 10.1016/j.neuroimage.2006.06.057


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 July 2013; paper pending published: 13 August 2013; accepted: 26 August 2013; published online: 19 September 2013.*

*Citation: Nagels A, Kauschke C, Schrauf J, Whitney C, Straube B and Kircher T (2013) Neural substrates of figurative language during natural speech perception: an fMRI study. Front. Behav. Neurosci. 7:121. doi: 10.3389/fnbeh.2013.00121*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience. Copyright © 2013 Nagels, Kauschke, Schrauf, Whitney, Straube and Kircher. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Making fingers and words count in a cognitive robot

*Vivian M. De La Cruz<sup>1</sup>, Alessandro Di Nuovo<sup>2,3</sup>\*, Santo Di Nuovo<sup>4,5</sup> and Angelo Cangelosi<sup>2</sup>*

*<sup>1</sup> Dipartimento di Scienze Cognitive, della Formazione e degli Studi Culturali, Università degli Studi di Messina, Messina, Italy*

*<sup>2</sup> Centre for Robotics and Neural Systems, School of Computing and Mathematics, Plymouth University, Plymouth, UK*

*<sup>3</sup> Facoltà di Ingegneria e Architettura, Università degli Studi di Enna "Kore", Enna, Italy*

*<sup>4</sup> Dipartimento dei Scienze della Formazione, Università degli Studi di Catania, Catania, Italy*

*<sup>5</sup> Unità operativa di Psicologia, IRCCS Oasi Maria SS di Troina, Enna, Italy*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Anna M. Borghi, University of Bologna and Institute of Cognitive Sciences and Technologies, Italy Giovanni Acampora, Nottingham Trent University, UK*

#### *\*Correspondence:*

*Alessandro Di Nuovo, Centre for Robotics and Neural Systems, School of Computing and Mathematics, Plymouth University, B109 PSQ, Drake Circus, PL4 8AA, Plymouth, UK e-mail: alessandro.dinuovo@plymouth.ac.uk*

Evidence from developmental as well as neuroscientific studies suggests that finger counting activity plays an important role in the acquisition of numerical skills in children. It has been claimed that this skill helps in building motor-based representations of number that continue to influence number processing well into adulthood, facilitating the emergence of number concepts from sensorimotor experience through a bottom-up process. The act of counting also involves the acquisition and use of a verbal number system of which number words are the basic building blocks. Using a Cognitive Developmental Robotics paradigm we present results of a modeling experiment on whether finger counting and the association of number words (or tags) to fingers could serve to bootstrap the representation of number in a cognitive robot, enabling it to perform basic numerical operations such as addition. The cognitive architecture of the robot is based on artificial neural networks, which enable the robot to learn both sensorimotor skills (finger counting) and linguistic skills (using number words). The results obtained in our experiments show that learning the number words in sequence along with finger configurations helps the fast building of the initial representation of number in the robot. Number knowledge is, instead, not as efficiently developed when number words are learned out of sequence without finger counting. Furthermore, the internal representations of the finger configurations themselves, developed by the robot as a result of the experiments, sustain the execution of basic arithmetic operations, something consistent with evidence coming from developmental research with children. The model and experiments demonstrate the importance of sensorimotor skill learning in robots for the acquisition of abstract knowledge such as numbers.

**Keywords: embodied cognition, developmental robotics, finger counting, number words, number cognition**

#### **INTRODUCTION**

Whether finger counting is an essential stage in the development of the cognition of number is still highly debated, though strong evidence exists on the positive contribution of sensorimotor skills and representation in numerical cognition. A growing number of researchers, consider finger counting an important tool children (as well as adults) use across a variety of cultures in the development of numerical cognition (e.g., Andres et al., 2008; Di Luca and Pesenti, 2011). Consideration of the links between counting and the emergence of number concepts is not new. Piaget (1952) for example, considered the linking of numbers to objects as being an important characteristic of the sensorimotor stage of cognitive development, possibly being one of a series of prerequisites for the child's construction of the concept of number. Quite recently, however, the topic of finger based number knowledge has seen a surge of new interest, especially from embodied cognition perspectives (for a recent special issue on the topic see Fischer et al., 2012). Finger counting has generally been assumed to be important to the acquisition of a mature counting system (e.g., Gelman and Gallistel, 1978; Fuson et al., 1982; Butterworth, 2005) as well as instrumental to the development of children's arithmetic abilities (e.g., Fuson and Kwon, 1992). It has been hypothesized that this capability helps children to acquire a variety of principles proposed as being fundamental to the development of a counting system (Gelman and Gallistel, 1978) such as: the acquisition of the one-to-one correspondence principle (i.e., when counting only one word is assigned to each object) through the tagging or assignment of one number word to each item, the assimilation of the stable order principle (i.e., when counting, number words are always assigned in the same order) and the cardinality principle (i.e., the last number word uttered when counting, is the total number of objects in a set). 
More recent studies have reported an association between finger gnosis (the ability to mentally represent one's fingers) and mathematical abilities (Noël, 2005; Costa et al., 2011), and found finger training helpful in improving the performance of children with weak numerical skills (Gracia-Bafalluy and Noël, 2008). Evidence such as this supports the view that finger representations play a special role in number cognition, and might serve as a basic building block in the child's unfolding capacity to mentally manipulate abstract numerical information.
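The three counting principles described above (one-to-one correspondence, stable order, and cardinality) can be made concrete with a short sketch. It is purely illustrative and is not the robot model presented in this paper; the names `NUMBER_WORDS` and `count_items` are hypothetical.

```python
# Stable order principle: number words are always used in this fixed order.
NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]


def count_items(items):
    """Tag each item with a number word and return (tags, cardinality)."""
    if len(items) > len(NUMBER_WORDS):
        raise ValueError("more items than known number words")
    # One-to-one correspondence: exactly one number word per item.
    tags = [(item, word) for item, word in zip(items, NUMBER_WORDS)]
    # Cardinality principle: the last word uttered names the size of the set.
    cardinality = tags[-1][1] if tags else None
    return tags, cardinality


tags, cardinality = count_items(["thumb", "index", "middle"])
print(tags)
print(cardinality)
```

In this sketch fingers take the role of the counted items, which mirrors the idea that fingers can serve as a stable set of tokens to which number words are attached.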

That a close link might exist between finger counting strategies and patterns, and that they may influence the mental representation and processing of number, has also been suggested by evidence coming from neuroimaging studies. For example, studies using fMRI on adult subjects to investigate aspects of embodied theories of cognition have found intrinsic functional links between finger counting and number processing. Cortical motor activity is evoked by Arabic digits and number words, which reflects particular individual finger counting habits (i.e., whether when counting small digits subjects started with their right or left hand) (Tschentscher et al., 2012). These results have been interpreted in several different ways by the authors of this study, one interpretation invoking a shared neural network for number processing and planning of finger movements, which would include parietal cortical areas, the precentral gyrus and the primary motor cortex, in which number perception might very well elicit the sub-threshold tendency to move associated fingers. Another interpretation used by these authors, to explain how the association between numbers, number words and individual finger counting movements might have come about in their subjects, during their individual development of numerical skills in childhood, would be predicted by a Hebbian learning approach to semantic circuits (Pulvermüller, 1999). The prediction is, that due to the fact that children often use their fingers when counting and solving simple counting problems, a correlation between the neuronal activation for the processing of numbers and the movement of fingers is established. 
A number of neuroimaging studies done in the last decade, using both PET and fMRI, had already found activation of part of the left precentral gyrus (where hand movements are represented) when subjects were asked to engage in numerical tasks such as addition (e.g., Pesenti et al., 2000), subtraction (Rueckert et al., 1996) and multiplication (Dehaene et al., 1996), leading some authors to suggest that the activation of the left precentral gyrus, along with the inferior parietal cortex, might be evidence of a finger moving network that might, in turn, be reflecting a trace of a finger counting strategy (Sato and Lalain, 2008).
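The Hebbian account mentioned above can be illustrated with the plain Hebbian learning rule, in which a connection weight grows in proportion to the joint activity of the two units it connects. The sketch is a generic textbook rule under simplifying assumptions (binary activities, a single weight, a hypothetical learning rate), not the model of Pulvermüller (1999) nor the architecture used in this paper.

```python
def hebbian_update(weight, pre, post, rate=0.1):
    """Plain Hebbian rule: the weight grows with the product of both activities."""
    return weight + rate * pre * post


def train(episodes, rate=0.1):
    """Apply the rule over a sequence of (number_unit, finger_unit) activities."""
    weight = 0.0
    for pre, post in episodes:
        weight = hebbian_update(weight, pre, post, rate)
    return weight


# Counting with fingers: the number unit and the finger unit are co-active.
with_fingers = train([(1, 1)] * 10)
# Counting without finger movements: only the number unit is active.
without_fingers = train([(1, 0)] * 10)

print(round(with_fingers, 2), round(without_fingers, 2))
```

Only repeated co-activation strengthens the link, which is the sense in which habitual finger counting is predicted to tie number processing to finger movement representations.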

The act of counting often involves the acquisition and use of a verbal number system, of which number words are the basic building blocks. Number words are highly frequent in child directed speech, but their meanings are acquired slowly, with effort and in stages. Wynn (1992) has argued, in fact, that the developing knowledge of the meanings of counting words is a central part of the process of understanding the counting system. Though children as young as 6 months can discriminate between set sizes (Xu et al., 2005), and children 1 and 2 years of age are good at reciting the count sequence (Fuson, 1988), as well as capable of recognizing number words as designators of quantity (Bloom and Wynn, 1997), their difficulty seems to lie in understanding *how* specific words match to specific quantities. One proposal is that the syntax of number words as well as the contexts in which they appear might serve as cues that help children bootstrap this process early (e.g., Gleitman, 1990; Wynn, 1992; Bloom and Wynn, 1997), but finger counting may also very well serve as an early entry point to this understanding.

Other evidence coming from developmental as well as neurocognitive studies, in keeping with what has been found in neuroimaging studies, suggests that finger counting activity helps build motor-based representations of number that continue to influence number processing well into adulthood, suggesting that abstract cognition may be rooted in bodily experience (Domahs et al., 2010). In fact, these motor-based representations have been argued to facilitate the emergence of number concepts from sensorimotor experience through a bottom-up process (Andres et al., 2008). In our view, finger counting can also be seen as a means by which direct sensory experience with the body can serve the purpose of *grounding* number as well as number words initially as low level labels, which later serve as the basis for the acquisition of new higher level symbols from the combination of already grounded ones, something known as grounding transfer (e.g., Harnad, 1990; Cangelosi and Riga, 2006). The grounding approach has also been useful for the modeling of the acquisition of words for objects (Morse et al., 2010; Tikhanoff et al., 2011) and for actions (Marocco et al., 2010; Stramandinoli et al., 2012) as well as for numbers (Rucinski et al., 2011, 2012).

In sum, while finger counting may not be strictly necessary for children to get on their way to the cognition of number, there is evidence that it does seem to help the learning process, serving as a bridge between possibly innate abilities to perceive and respond to numerosity (e.g., Butterworth, 2005) and the development of the capacity to mentally represent and process number as well as linguistic number related concepts (Lafay et al., 2013). Not much work using robotics has attempted to build on this.

A number of connectionist models have simulated different aspects of number learning. Ma and Hirai (1989), for example, studied how children learn to count using an associative memory network model, which mimicked three phenomena proposed by Fuson et al. (1982) to be present in the acquisition of counting by children: the number word sequence produced by children is dividable into three distinct portions (conventional, stable nonconventional, and unstable); irregular number words (e.g., "fifteen") are omitted more often than regular ones ("fourteen," "sixteen"); and initially the number word sequence is in recitation form.

Yet other models have focused on the identification of the number of objects in a visual scene as a result of learning. Dehaene and Changeux (1993), for example, created a numerosity detector using a system consisting of three modules (an input retina, an intermediate topological map of object locations, and a map of detectors). The system was able to simulate the distance effect in counting, by which performance increases with increasing numerical distance between two discriminated quantities. More recently, Ahmad et al. (2002) explored quantification abilities and how they might arise in development, using a multi neural net approach that combined supervised and unsupervised nets and learning techniques in order to simulate subitization (the phenomenon by which subjects appear to produce immediate quantification judgments, usually involving up to 4 objects, without the need to count them) and counting. Their combined and modular approach provides a simulation of different cognitive abilities that might be involved in the cognition of number (each of which would have its own evolutionary history in the brain), and is in keeping with Dehaene's triple code model (2000). Rajapakse et al. (2005) targeted aspects of language related to number, such as linguistic quantifiers. Using a hybrid artificial vision connectionist architecture, they ground linguistic quantifiers such as *few*, *several*, and *many* in perception, taking contextual factors into consideration. Their model, after being trained and tested with experimental data using a dual-route neural network, is able to count objects (fish) in visual scenes and select the quantifier that best describes the scene. Even more recently, Rucinski et al. (2011), using a cognitive robotics paradigm, explored embodied aspects of mathematical cognition such as the interactions between numbers and space, reproducing three psychological phenomena connected with number processing, namely size and distance effects, the SNARC effect and the Posner-SNARC effect. The same group, in another work using the same paradigm (Rucinski et al., 2012), focused instead on counting and, in particular, on the contribution of counting gestures such as pointing. These models, however, did not consider the role of finger counting in numerical abilities.

In this paper, using a Cognitive Developmental Robotics paradigm (Asada et al., 2009; Cangelosi and Schlesinger, 2014) we present results of an exploration on whether finger counting and the association of number words (or tags) to the fingers, could serve to bootstrap the representation of number in a cognitive robot enabling it to perform basic numerical operations, such as addition.

#### **MATERIALS AND METHODS**

The robotic model used for the experiments is a computer simulation model of the iCub humanoid robot (Tikhanoff et al., 2008, 2011). The iCub is an open-source humanoid robot platform designed to facilitate cognitive developmental robotics research, as detailed by Metta et al. (2010). In its current state, the iCub platform is a child-like humanoid robot 1.05 m tall, with 53 degrees of freedom (DoF) distributed in the head, arms, hands and legs. The simulated iCub has been designed to reproduce, as accurately as possible, the physics and the dynamics of the physical iCub. The simulator allows the creation of realistic physical scenarios in which the robot can interact with a virtual environment. Physical constraints and interactions that occur between the environment and the robot are simulated using a software library that provides an accurate simulation of rigid body dynamics and collisions. One of the most advanced parts of the iCub is the hand, which comprises 9 DoF (for a total of 18 DoF over the two hands). It is the result of a design that optimized the level of integration of the hand in the overall robot to meet the child-like project specifications in terms of dimensions, dexterity and sensorization. Details on the iCub hand can be found in Schmitz et al. (2010).

In this work we focus on the fingers, which means that we use 7 DoF for each hand, distributed as follows: 2 DoF each for the thumb, index and middle fingers, but only one for controlling the ring and pinky fingers, which are "glued" together. Because of this limitation with the last two fingers, the finger representation of numbers with the right hand is as in **Figure 1**. Numbers from six to ten are represented by adding left hand fingers with all the right hand fingers open (e.g., six is five on the right hand plus one on the left hand). In this work, we suppose that the robot is right handed.

The iCub is not provided with ears, so the auditory input (i.e., the number words) was recorded from a child's voice using a standard microphone and stored in the Waveform Audio File Format (WAV) at 22 kHz with lossless compression. From the waveform, we extracted the mel-frequency cepstral coefficients (MFCC) to represent each number word from one to ten, using Slaney's Auditory Toolbox 2.0 for MATLAB (1998). The MFCC technique combines an auditory filter-bank with a cosine transform to give a rate representation roughly similar to that of the auditory system (Davis and Mermelstein, 1980).
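The filter-bank-plus-cosine-transform pipeline can be illustrated with a minimal NumPy sketch. This is not Slaney's toolbox implementation: the frame length, hop size, number of filters, number of coefficients, the 22050 Hz sampling rate and the particular mel formula are all assumptions chosen for illustration, and the input is a synthetic tone rather than a recorded number word.

```python
import numpy as np

def mfcc(signal, sr=22050, frame_len=512, hop=256, n_filters=26, n_coeffs=13):
    """Illustrative MFCC: framing -> power spectrum -> mel filter bank -> log -> DCT-II."""
    signal = np.asarray(signal, dtype=float)
    # Split the waveform into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters equally spaced on a mel scale (one common mel formula).
    def to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def from_mel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    edges = from_mel(np.linspace(to_mel(0.0), to_mel(sr / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    fbank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)

    # Log filter-bank energies, then a cosine transform (DCT-II) across filters.
    log_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return log_energy @ basis.T  # shape: (n_frames, n_coeffs)

# Hypothetical input: one second of a 440 Hz tone instead of a spoken word.
tone = np.sin(2 * np.pi * 440.0 * np.arange(22050) / 22050.0)
coeffs = mfcc(tone)
```

Each row of `coeffs` describes one short frame; how the frames of a word are reduced to a single vector per number word is not specified here and would be a further design choice.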

**Figure 2** presents the architecture of the robot's cognitive system, in which the different units and their connections are shown in schematic form. The lower part of the implemented neural system is directly connected with the robotic platform and comprises: (i) the motor controller/memory (Motor System and Right/Left Layers), which is able to plan finger movements by setting the finger joints' angles and to memorize the finger number sequence; and (ii) an auditory memory (Auditory System and Auditory Layer), which is able to memorize the number word sequence. The upper part of **Figure 2** presents the inner units that are responsible for abstract functions (i.e., not directly connected with the robot): the switch/associative layer, which allows the two lower systems to cooperate in order to perform other functions, and the competitive layer classifier, which we implemented to test the quality of the number learning. After supervised training, the classifier is able to represent the correspondence between numbers from 1 to 10 and the internal representations (i.e., hidden layer activations and/or cepstral coefficients). The role of the competitive layer classifier is to simulate the final processing of the numbers: after a number is correctly classified into its class, the appropriate action can be started, e.g., the production of the corresponding word or of a symbol, the manipulation of an object, and so on.

The motor controller/memory was designed using two different RNNs in order to model lateralization when processing numbers, as shown by Tschentscher et al. (2012). In this way, the network that controls the left hand will be switched off when low numbers (1–5) are processed. The two RNNs that compose the motor controller/memory were trained separately, i.e., with different random weight initializations. Note that, although the motor controller is implemented by two separately trained RNNs, we refer to it as a single unit. The use of RNNs to learn to count was investigated by Rodriguez et al. (1999), who explored the capabilities of recurrent networks in the task of learning to predict the next character in a simple deterministic context-free language, in order to provide a more detailed understanding of how dynamics can be harnessed to solve language problems.

**FIGURE 2 |** [...] upper part there are the units with abstract functions that are the switch/associative network and the competitive layer classifier. Bold links indicate a full (one-to-all) connection between each layer, while dotted links are direct (one-to-one) connections. Note that the system's external inputs coincide with the outputs; indeed, proprioceptive information from motor and auditory systems is an input for the system during the training phase, while it is the control output when the system is operating.

The artificial neural networks were implemented using the MATLAB Neural Network Toolbox 8.0. The supervised training algorithm for all networks was the Levenberg-Marquardt algorithm (LMA), one of the fastest and most widely used optimization algorithms that can be applied to artificial neural networks (Hagan and Menhaj, 1994). The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts far from the final minimum. Like the quasi-Newton methods, the LMA was designed to approach second-order training speed without having to compute the Hessian matrix. When the performance function has the form of a sum of squares (as is typical in training feedforward networks), the Hessian matrix can be approximated as $H = J^{T}J$ and the gradient can be computed as $g = J^{T}e$, where *J* is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights and biases, and *e* is a vector of network errors. In our implementation, the error *e* is calculated as the average of the squared errors of outputs. The Jacobian matrix can be computed through a standard backpropagation technique (see Hagan and Menhaj, 1994), which is much less complex than computing the Hessian matrix. The LMA uses this approximation to the Hessian matrix in the following Newton-like update:

$$
\Delta \mathbf{x} = \left[ J(\mathbf{x})^T J(\mathbf{x}) + \mu I \right]^{-1} J(\mathbf{x})^T \mathbf{e}(\mathbf{x})
$$

When the scalar µ is zero, this is just Newton's method, using the approximate Hessian matrix. When µ is large, this becomes gradient descent with a small step size. Newton's method is faster and more accurate near an error minimum, so the aim is to shift toward Newton's method as quickly as possible. Thus, µ is decreased after each successful step (reduction in performance function) and is increased only when a tentative step would increase the performance function. In this way, the performance function is always reduced at each iteration of the algorithm. In our experiments the initial value of µ was 0.001, the increase factor was 10, the decrease factor was 0.1, and the maximum µ was $10^{10}$. The number of iterations (or epochs) of the algorithm was variable because we adopted as stop criterion a minimum performance gradient of $10^{-7}$ or a maximum of 1000 epochs.
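The µ schedule described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the MATLAB toolbox routine: the performance function is taken as a plain sum of squared residuals, the Jacobian is defined as the derivative of the residuals so the step is subtracted (sign conventions vary with how *e* is defined), the lower floor on µ is an added safeguard, and the curve-fitting problem is hypothetical.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, mu=1e-3, mu_dec=0.1,
                        mu_inc=10.0, mu_max=1e10, min_grad=1e-7,
                        max_epochs=1000):
    """Minimise sum(residual(x)**2) using the mu schedule described in the text."""
    x = np.asarray(x0, dtype=float)
    e = residual(x)
    perf = float(e @ e)
    for _ in range(max_epochs):
        J = jacobian(x)                  # J[i, k] = d e_i / d x_k
        g = J.T @ e                      # gradient (up to a constant factor)
        if np.linalg.norm(g) < min_grad:
            break                        # minimum performance gradient reached
        H = J.T @ J                      # Gauss-Newton approximation of the Hessian
        while mu <= mu_max:
            step = np.linalg.solve(H + mu * np.eye(x.size), g)
            x_new = x - step
            e_new = residual(x_new)
            perf_new = float(e_new @ e_new)
            if perf_new < perf:          # successful step: accept and decrease mu
                x, e, perf = x_new, e_new, perf_new
                mu = max(mu * mu_dec, 1e-20)  # floor is an assumption, not from the text
                break
            mu *= mu_inc                 # tentative step failed: increase mu and retry
        else:
            break                        # mu exceeded mu_max without improvement
    return x, perf

# Hypothetical problem: recover a = 2.0 and b = 0.5 in y = a * exp(b * t).
t = np.linspace(0.0, 1.0, 10)
y = 2.0 * np.exp(0.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.column_stack([np.exp(x[1] * t),
                                      x[0] * t * np.exp(x[1] * t)])
x_fit, perf_fit = levenberg_marquardt(residual, jacobian, x0=[1.0, 0.0])
```

Because a step is only accepted when it lowers the performance value, the sketch shares the property noted above that the performance function never increases from one epoch to the next.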

The derivative function of the RNNs was backpropagation through time (Rumelhart et al., 1986), a gradient-based technique that begins by unfolding the recurrent neural network through time into a feed-forward neural network, so that training then proceeds in a manner similar to training a feed-forward neural network with classic back-propagation, except that each epoch must run through the observations in sequential order.

The competitive layer classifier is implemented using the *softmax* transfer function, which gives as output the probability/likelihood of each classification. By construction, it ensures that all of the output values are between 0 and 1 and that their sum is 1. The *softmax* function used is as follows:

$$
\text{softmax}(q, i) = \frac{e^{q_i}}{\sum_{j=1}^{n} e^{q_j}}
$$

where the vector *q* is the net input to a *softmax* node, and *n* is the number of nodes in the *softmax* layer.
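As a small illustration of the formula, the following NumPy sketch computes the *softmax* output for a hypothetical net input vector with ten entries (one per number class). Subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged; it is an addition made for this sketch.

```python
import numpy as np

def softmax(q):
    """Return exp(q_i) / sum_j exp(q_j) for every entry of q."""
    q = np.asarray(q, dtype=float)
    z = np.exp(q - q.max())   # shifting by the maximum does not change the ratios
    return z / z.sum()

# Hypothetical net input of a ten-node softmax layer.
net_input = np.array([0.1, 0.3, 2.5, 0.2, 0.0, -0.4, 0.1, 0.0, -1.0, 0.3])
likelihoods = softmax(net_input)
predicted_number = int(np.argmax(likelihoods)) + 1   # classes are numbered 1..10
```

The largest net input receives the largest likelihood, so picking the class with the greatest likelihood is equivalent to picking the largest net input.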

The architecture of the hidden layers of the RNNs was chosen after a performance test in which, after 100 runs with varying numbers of hidden neurons, the best trade-off solutions were selected in terms of minimization of the error and number of iterations needed to converge. Not surprisingly, we found that 10 neurons was the ideal solution, since 10 is also the number of different states to represent. Furthermore, in our preliminary experiments we also found that pure linear transfer functions for the hidden layers were more effective than the usual sigmoid. We chose not to use biases for the RNNs (i.e., they are set to zero). Due to these choices, when the networks are not active, i.e., all activations are zero, they can be activated by injecting activation values into the respective neurons in order to start counting from a specific number.

In addition to the main blocks, an associative network is included in the system to initiate the computation of the system and to implement the number manipulation. Indeed, after the RNNs have learned the number sequence, the switch is needed to stop the counting and to redirect the signals to the competitive classifier for the processing of the result.

**Figure 3** shows the details of the switch/associative layer that, once the two systems have learned to count, allows them to operate and communicate with each other. In particular, the unit is responsible for starting the counting by initializing all the hidden units to 1, and for redirecting the hidden unit activation to the competitive classifier when the counting is finished. Furthermore, this unit is crucial in development for acquiring the ability to add numbers, because it can reset one of the two networks to make it count the new operand while letting the other continue as a buffer memory. Finally, thanks to the associative connections between the two layers (with weights **w**<sup>1</sup> and **w**<sup>2</sup> in **Figure 3**), there are two other states that allow a specific number representation to be input starting from another: from fingers to words and vice versa. These states will be studied in more detail in the number manipulation experiments. All states are reported in the table on the left of **Figure 3**.

As can be seen from the switch/state table in **Figure 3**, we set the initial state of all the hidden layers' neurons to 1 in order to start the sequence. Conversely, if the initial state is set to 0, there is no activation because the RNNs do not have a bias in the hidden layer.
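Why a zero initial state keeps the network silent can be seen in a toy example. The sketch below assumes a hidden layer with a linear transfer function, no bias and no external input, and uses random recurrent weights as a stand-in for trained ones, so it illustrates the principle only and not the trained counting behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(10, 10))   # hypothetical recurrent weights

def run(h0, steps):
    """Iterate h <- W h (linear transfer, no bias, no external input)."""
    h = np.asarray(h0, dtype=float)
    states = [h]
    for _ in range(steps):
        h = W @ h
        states.append(h)
    return np.array(states)

silent = run(np.zeros(10), steps=5)   # all-zero start: nothing can ever activate
active = run(np.ones(10), steps=5)    # all-one start: activity propagates
```

With no bias term, the all-zero state is a fixed point of the update, which is why an external switch is needed to inject a non-zero state before counting can begin.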

**Table 1** presents the actual finger joint positions for the ten number representations plus the rest (zero) position. A high value of the joint position represents the finger when it is closed, while low values indicate the finger is open. Note that because of element collision and tendon limitations the actual values are not the ideal ones (i.e., 90, 180, 220 when finger is closed, 0 when open).

**Table 2** reports the MFCCs for the number words extracted from a child voice.

In our experiments, all values in the input/output datasets used in training were pre-processed by dividing them by the maximum absolute value of the series, in order to have them in the range [−1, 1]. This is beneficial for the learning of weights and biases of the artificial neural networks.
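The scaling step can be written as a one-line operation. The following sketch assumes NumPy and uses made-up values; the guard for an all-zero series is an addition for robustness and is not mentioned in the text.

```python
import numpy as np

def scale_by_max_abs(series):
    """Divide a series by its maximum absolute value so it lies in [-1, 1]."""
    series = np.asarray(series, dtype=float)
    peak = np.max(np.abs(series))
    if peak == 0:                 # assumption: leave an all-zero series unchanged
        return series.copy()
    return series / peak

# Hypothetical values, e.g. joint angles or cepstral coefficients.
raw = np.array([0.0, -45.0, 90.0, 220.0])
scaled = scale_by_max_abs(raw)
```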

#### **EXPERIMENTS AND RESULTS**

Using the materials and methods presented above, we first studied the part of the cognitive system that learns to count. The results of the training are presented in the subsection "Numbers learning." As a second step, we build on this by developing the capacity of the associative network to control basic operations like the addition of two operands and to derive the number representation of one of the networks from the other (i.e., from fingers to words and vice versa).

#### **NUMBERS LEARNING**

For this first experiment, the main goal was to test the ability of the proposed cognitive system to learn numbers by comparing the performance of different ways of training the number knowledge of the robot with: (1) the internal representation (hidden units activation) of a given finger sequence, (2) the MFCC coefficients of number words out of sequence, (3) the internal representation of the number words sequence, (4) the internal representation of finger sequences plus the MFCC of number words out of sequence (i.e., learning words while counting); (5) internal representations of the sequences of both fingers and number words together (i.e., learning to count with fingers and words).

To this end, we set up the experiment with the following steps: (i) the motor controller learns the opening of the fingers in a given sequence, in order to later establish a finger counting routine, and creates an internal representation for each step in the sequence by means of the hidden units' activations; (ii) MFCCs are extracted from number words; (iii) the auditory memory learns the verbal number words in order from 1 to 10 and creates an internal representation for each word in the sequence. From each learning step, relevant data are collected and stored as datasets for the experimentation. These datasets can be summarized as follows:

(1) Internal representations of the finger sequence: 10 values corresponding to the activation values of the hidden units of motor controller/memory network.


 = scalar product; f2w = fingers to words; w2f=words to fingers; NaN preserves the activation at **w** and **h** are weight vectors, of associative layer and motor/auditory layer respectively.

**FIGURE 3 | Details of the Switch/Associative Layer.** The table on the right summarizes the outputs according to the different states. In practice the layer operates as a recursive feedback with the possibility to start and reset the motor/auditory layers and to derive the activations of one layer from the ones of the other. Bold lines indicate a full weighted connection, while normal lines are single connections. For simplicity hidden units of the two RNNs of the motor system are represented with one block.



**Table 2 | Mel Frequency Cepstral Coefficients for number words.**



Datasets 4 and 5 are built to model the learning when both fingers and number words are presented together as training input to the cognitive system.

**Figure 4** shows the activation values of the hidden layers of the RNNs: finger sequences on the left and word sequences on the right. Note that we present together the activations of the two RNNs that compose the motor controller/memory network. Motor activations show a lateralization because the network that controls the left hand (neurons 6–10) is switched off for low numbers; furthermore, the units from 1 to 5 remain fixed from the number five on because we suppose that the right hand is open (we reason as if the robot is right handed). Moreover, in **Figure 5** we present the dendrogram after the optimal leaf ordering (Bar-Joseph et al., 2001), which shows that the internal finger representation is more similar to the number sequence: numbers that are close in the actual sequence are linked together. Meanwhile, the grouping of number words (learned in or out of sequence) is more random, and this affects the learning, as shown in the classification experiment.

All datasets were used to train the competitive layer classifier to classify inputs into the ten classes that represent the numbers from 1 to 10. Classification results after 10 epochs of training are presented for each class/number in **Table 3**. The low number of epochs is, in this case, imposed in order to study the robot's number learning in the early stages. However, results show that 10 epochs are enough for the LMA to converge.

**Table 3** reports the medians and standard deviations for each class, calculated after 100 runs for each classification training dataset. If we consider as "good" classifications only the cases in which the likelihood is greater than 0.5, we can consider the plain words dataset as not adequate to train the network because it fails for all numbers. However, if we consider as successful classifications the cases in which the correct class has the greatest likelihood, the only misclassification observed is for the number three. All the other datasets are good and, as expected, when finger and word sequences are used together, the cognitive system learns numbers quickly and with a very good likelihood, greater than 90% for all numbers.

We performed a pairwise *t*-test to evaluate the statistical significance of the results reported in **Table 3**. The *t*-test results confirm that all the differences are statistically significant except for the number three, when finger sequences are compared with word sequences, and for the number two, when finger sequences only are compared with finger sequences and number words.

The "finger sequence and number words" (i.e., dataset 4), shows that associating the number words with the fingers sequence helps to drastically improve the classification performance without needing to learn number words in a sequence. However, to learn number words in sequence helps to additionally improve the classification performance to highest likelihood, if internal representations are associated with motor ones.

In order to study in more detail the development of learning, we measured the classification performance over the 10 epochs for the competitive layer trained with the different datasets. In this case, performance is evaluated by means of the average likelihood of classification (**Figure 6**) and median number of misclassifications (**Figure 7**).

Looking at the developmental results, we once again see that number words learned out of sequence are the least efficient to learn: misclassifications disappear only after 10 epochs, and the average likelihood is still low (0.256) after ten epochs. Conversely, if number words are learned in sequence and internal representations are used as inputs, the learning is faster in terms of precision of classification (i.e., no errors after just 2 epochs), but the maximum average likelihood, which converges at 0.688, is not as strong as when the learning also involves fingers. Indeed, the finger sequence reaches a higher average likelihood (0.765), but the best results are obtained when internal representations of words and fingers are used together as input: the average median likelihood is 0.94 after just 8 epochs.

#### **NUMBERS MANIPULATION**

Once the number sequences are learned, an interesting feature of the proposed cognitive system is the possibility to easily build up the ability to manipulate numbers with the development of the switch-associative network.

Indeed, this ability can be modeled by extending the capabilities of the associative network from simple start and stop to transferring (mapping between representations) and to the basic operation of addition.

By transferring, we mean deriving one network's representation from the number counted by the other network, e.g., mapping the number word "three," when the robot hears it, to the correlated finger representation. This can be considered, in a sense, an associative mapping between internal representations. It is implemented by activating a weighted connection between the two networks, which can be learned by applying the LMA to the two-layer network that comprises the hidden units of both networks. This training is quite fast and effective both ways: an average of 4 iterations (over 100 trials) is needed to reach an average estimation error lower than $10^{-15}$. This practically does not affect the performance of the classifier, which shows differences in the statistical indicators (mean, median, and standard deviation over 100 trials) lower than $10^{-12}$ when its inputs are derived by the associative network.
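The idea of an associative mapping between two sets of internal representations can be illustrated with a linear map. The sketch below swaps in ordinary least squares (`numpy.linalg.lstsq`) for the LMA and uses random matrices as hypothetical stand-ins for the ten hidden-state vectors of each network, so it shows only why such a mapping can be learned almost exactly when ten patterns are mapped between ten-unit layers.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical hidden states: one row per number (1..10), one column per hidden unit.
finger_states = rng.normal(size=(10, 10))
word_states = rng.normal(size=(10, 10))

# Least-squares weights for both directions (fingers to words, words to fingers).
w_f2w, *_ = np.linalg.lstsq(finger_states, word_states, rcond=None)
w_w2f, *_ = np.linalg.lstsq(word_states, finger_states, rcond=None)

# Largest absolute estimation error of each mapping.
err_f2w = np.max(np.abs(finger_states @ w_f2w - word_states))
err_w2f = np.max(np.abs(word_states @ w_w2f - finger_states))

# Example: derive the word representation of "three" from its finger representation.
word_three_estimate = finger_states[2] @ w_f2w
```

With as many patterns as hidden units and linearly independent states, the linear system is exactly determined, so the residual error is limited only by floating-point precision.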

The operation of addition can be seen as a direct development of the concurrent learning of the two recurrent units (motor and auditory). Indeed, if one of the two does the actual counting of the operands, the other can be used as a buffer memory to accumulate the result. When the counting is done, the final number can be transferred from the buffer to the other unit and then input to the final processor (the classifier in our system).

As an example let us consider 2 + 2, the following steps will be taken:


The steps are depicted in **Figure 8**.
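The counter-plus-buffer scheme described in this section can be abstracted as follows. This sketch replaces the hidden states of the two recurrent networks with plain integers and is only an illustration of the general idea; it does not reproduce the step list of **Figure 8**, and the function name is hypothetical.

```python
def add_by_counting(first_operand, second_operand):
    """Add two small numbers by counting each operand while a buffer keeps the total."""
    buffer = 0                      # stands in for the network used as buffer memory
    for operand in (first_operand, second_operand):
        counter = 0                 # the switch resets the counting network
        while counter < operand:
            counter += 1            # counting network advances one step
            buffer += 1             # buffer network advances in parallel
    return buffer                   # the result would then be passed to the classifier

result = add_by_counting(2, 2)
```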

#### **DISCUSSION**

The results obtained in our experiments with the iCub childlike robotic platform show that learning the number words in sequence along with finger configurations helps the fast building of the initial representation of number in the robot. Number knowledge is, instead, not as efficiently developed when number words are learned out of sequence without finger counting. Furthermore, the internal representations of the finger configurations themselves, developed by the robot as a result of the experiments, sustain the execution of basic arithmetic operations, something consistent with evidence coming from developmental research with children.

This does not mean that just learning the counting sequence from one to ten, is enough for children (or our robot), to understand number concepts, but it is the repeated experience using the number word sequence when counting sets of things that is important in the development of numerical understanding (Sarnecka and Carey, 2008; Donlan, 2009). While the use of fingers does not necessarily precede the use of language in the acquisition of a symbolic numerical system (e.g., Nicoladis et al., 2010), what many children seem to be doing initially, in fact, is learning small number word sequences by rote, and later, associations between these small number words and objects in the world (first among which, their readily available fingers). Later on down the developmental path, with the child's early schooling experience, this mapping will also include written representations (or numerals). These written representations, eventually take on the meaning of the spoken number word (Fuson and Kwon, 1992). It is this kind of associative multi-modal learning that we are in a sense reproducing in our model.

Studies focusing on how children acquire abstract words and concepts, have proposed that multiple representational systems involving both sensorimotor as well as linguistic information might be playing a role in conceptual representation (e.g., Louwerse and Jeuniaux, 2010). While the case of the acquisition of number words might be considered as a particular type of abstract word learning, theories such as the LASS theory (Barsalou et al., 2008), according to which both the linguistic system as well as the sensorimotor system (through simulation) are activated in the processing of word meaning to different degrees under different task conditions, and the WAT (Words as Tools) proposal put forth by Borghi and Cimatti (2009) (but also see Borghi et al., 2011, and the special issue of Borghi and Pecher, 2011), have argued and furnished evidence on the synergetic role both language and sensorimotor experience play in the acquisition of abstract concepts, and on how important the modality by which words are learned is. In our model, number words or tags heard repeatedly, when coupled to the experience of moving the fingers, do serve as tools, used in the subsequent manipulation of the quantities they come to represent.

In fact, the internal representations of the finger configurations themselves, found as a result of the experiments, can be considered to be a basis for the building of an embodied number representation in the robot, something in line with embodied and grounded cognition approaches to the study of mathematical cognitive processes. Just as has been found with young children, through the use of finger counting and verbal counting strategies, our model develops finger and word representations that subsequently sustain the robot's learning of the basic arithmetic operation of addition. While the experiments done with the model in this work have targeted simple addition, the same model can be easily adapted to implement other operations such as subtraction by training the motor controller/memory with the backward number sequence (from 10 to 1) and then selecting this sequence at the beginning (e.g., by setting all hidden activations to −1). In this way, the subsequent manipulation will give the result of the subtraction. Future work with the model will implement this.

Another thing we would like to highlight is that, since the hidden layer without any external input does the actual counting, our system is also able to count "mentally" without necessarily producing actions or words. Future work with the model, will explore offline simulation aspects of the motor programs involved in finger based representations of number, or the "mental motor imagery" activated whenever a number and/or number word is encountered, after the robot has learned finger counting and finger calculation early in its training. This direction is stimulated by recent research on the simulation of mental imagery in cognitive systems and robots (see Di Nuovo et al., 2013b), in particular by the successful application of motor imagery models for mental practice in the execution of verbal commands (Di Nuovo et al., 2012) and for performance improvement (Di Nuovo et al., 2013a). The interested reader can find more details about this line of research in a recent special issue (see Di Nuovo et al., 2013b). In recent theoretical accounts of embodied numerosity (e.g., Moeller et al., 2012), something akin to the offline simulation of the motor programs involved in finger based representations of number, has been suggested to take place in children as well as adults. This line of investigation with the model will also compare the representation of embodied numerosity or "manunumeral" representations (Fischer and Brugger, 2011), using different culturally transmitted counting habits or strategies, in order to explore how they might influence the number processing in the robot (e.g., Bender and Beller, 2011; Previtali et al., 2011).

The utility of children's learning finger counting strategies early in their mathematical education continues to be debated in mathematics education research, despite the evidence coming from neurocognitive and psychological studies indicating that such strategies are useful (for a review of the debate see Moeller et al., 2011). Our experiments show that, in fact, learning to count with the fingers, using verbal tags, can be helpful in the numerical training of a robot as well. While being inspired by the evidence from the studies we have cited in previous sections, our implementation is nonetheless an abstraction of complex and as yet not totally understood processes that may underlie the development of numerical cognition. Our results, however, are in line with what has been theoretically claimed in the developmental literature (e.g., Gelman and Gallistel, 1978), namely that finger counting may be playing a functional role in the acquisition of a variety of principles considered necessary for children to have "under their belts" in order to reach an understanding of number. Examples of these principles and of the role of finger counting that are relevant to our present study, and that we think we have simulated at least in part, are: finger counting as an aid in keeping track of the number words while reciting the counting sequence; as contributing to the induction of the one-to-one correspondence principle, by which children are helped by their fingers to coordinate the processes of tagging, or the attribution of a number word to each item; and as facilitating the assimilation of the stable-order principle, where numerical labels have to be enumerated in the same order across counting sequences (see also Andres et al., 2008; and Lafay et al., 2013).

Mathematical cognitive processes, traditionally considered to be quintessential examples of abstract and symbolic processing, have been assumed to primarily involve the mind rather than the body. Our embodied robot experiments indicate that aspects of the development of this knowledge can be accounted for not only by way of bodily representations, but also with an artificial network in the place of a mind.

#### **ACKNOWLEDGMENTS**

This work was supported in part by the European Commission under Grants n. 288899 (ROBOT-ERA) and n. 288382 (POETICON++) within the Seventh Framework Programme for Research and Technological Development.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 30 September 2013; paper pending published: 25 November 2013; accepted: 08 January 2014; published online: 03 February 2014.*

*Citation: De La Cruz VM, Di Nuovo A, Di Nuovo S and Cangelosi A (2014) Making fingers and words count in a cognitive robot. Front. Behav. Neurosci. 8:13. doi: 10.3389/fnbeh.2014.00013*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2014 De La Cruz, Di Nuovo, Di Nuovo and Cangelosi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Supramodal neural processing of abstract information conveyed by speech and gesture

#### *Benjamin Straube<sup>1</sup>\*, Yifei He<sup>1,2</sup>, Miriam Steines<sup>1</sup>, Helge Gebhardt<sup>3</sup>, Tilo Kircher<sup>1</sup>, Gebhard Sammer<sup>3</sup> and Arne Nagels<sup>1</sup>*

*<sup>1</sup> Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Marburg, Germany*

*<sup>2</sup> Department of General Linguistics, Johannes Gutenberg-University Mainz, Mainz, Germany*

*<sup>3</sup> Cognitive Neuroscience at Centre for Psychiatry, Justus Liebig University Giessen, Giessen, Germany*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Nashaat Z. Gerges, Medical College of Wisconsin, USA*

*Yueqiang Xue, The University of Tennessee Health Science Center, USA*

#### *\*Correspondence:*

*Benjamin Straube, Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Str. 8, 35039 Marburg, Germany; e-mail: straubeb@med.uni-marburg.de*

Abstractness and modality of interpersonal communication have a considerable impact on comprehension. They are relevant for determining thoughts and constituting internal models of the environment. Whereas concrete object-related information can be represented in mind irrespective of language, abstract concepts require a representation in speech. Consequently, modality-independent processing of abstract information can be expected. Here we investigated the neural correlates of abstractness (abstract vs. concrete) and modality (speech vs. gestures), to identify an abstractness-specific supramodal neural network. During fMRI data acquisition 20 participants were presented with videos of an actor either speaking sentences with an abstract-social [AS] or concrete-object-related content [CS], or performing meaningful abstract-social emblematic [AG] or concrete-object-related tool-use gestures [CG]. Gestures were accompanied by a foreign language to increase the comparability between conditions and to frame the communication context of the gesture videos. Participants performed a content judgment task referring to the person vs. object-relatedness of the utterances. The behavioral data suggest a comparable comprehension of contents communicated by speech or gesture. Furthermore, we found common neural processing for abstract information independent of modality (AS *>* CS ∩ AG *>* CG) in a left hemispheric network including the left inferior frontal gyrus (IFG), temporal pole, and medial frontal cortex. Modality-specific activations were found in bilateral occipital, parietal, and temporal as well as right inferior frontal brain regions for gesture (G *>* S) and in left anterior temporal regions and the left angular gyrus for the processing of speech semantics (S *>* G). These data support the idea that abstract concepts are represented in a supramodal manner. Consequently, gestures referring to abstract concepts are processed in a predominantly left-hemispheric, language-related neural network.

**Keywords: gesture, speech, fMRI, abstract semantics, emblematic gestures, tool-use gestures**

#### **INTRODUCTION**

Human communication is distinctly characterized by the ability to convey abstract concepts such as feelings, evaluations, cultural symbols, or theoretical assumptions. This can be differentiated from references to our physical environment consisting of concrete objects and their relationships to each other. In addition to our language capacity, humans also employ gestures as a flexible tool to communicate both concrete and abstract information (Kita et al., 2007; Straube et al., 2011a). The investigation of abstractness and modality of communicated information can deliver important insight into the neural representation of concrete and abstract meaning. However, up to now, evidence about commonalities or differences in the neural processing of abstract vs. concrete meaning communicated by speech vs. gesture is missing.

Recently, a hierarchical model of language and thought has been suggested (Perlovsky and Ilin, 2010) which proposes that abstract thinking is impossible without speech (Perlovsky and Ilin, 2013). According to this model, abstract information is processed by a neural language system, regardless of whether speech or gesture is chosen as a tool to convey this information. Following this assumption, concrete object-related information is represented in mind independent of speech and hence in a modality-dependent manner in brain regions sensitive to, for example, visual or motor information. The latter assumption, at least partly, contradicts existing embodiment theories, which suggest a strong overlap of the sensory-motor and language system in particular with respect to the processing of concrete concepts (Gallese and Lakoff, 2005; Arbib, 2008; Fischer and Zwaan, 2008; D'Ausilio et al., 2009; Pulvermüller and Fadiga, 2010). However, the particular role of the communication modality for the neural representation of abstract as opposed to concrete concepts has not been investigated so far.

The impact of abstractness on speech processing (e.g., Rapp et al., 2004, 2007; Eviatar and Just, 2006; Lee and Dapretto, 2006; Kircher et al., 2007; Mashal et al., 2007, 2009; Shibata et al., 2007; Schmidt and Seger, 2009; Desai et al., 2011) and on the neural integration of speech and gesture information has been demonstrated in several functional magnetic resonance imaging (fMRI) studies using different experimental approaches (Cornejo et al., 2009; Kircher et al., 2009b; Straube et al., 2009, 2011a, 2013a; Ibáñez et al., 2011). There is converging evidence suggesting that especially the left inferior frontal gyrus (IFG) plays a decisive role in the processing of abstract semantic figurative meaning in speech (Rapp et al., 2004, 2007; Kircher et al., 2007; Shibata et al., 2007). However, results can further differ due to other factors, such as familiarity, imageability, figurativeness, or processing difficulty (Mashal et al., 2009; Schmidt and Seger, 2009; Cardillo et al., 2010; Schmidt et al., 2010; Diaz et al., 2011).

In contrast to abstract information processing, it has been suggested that concrete information is processed in different brain regions sensitive to the specific information type: e.g., spatial information in the parietal lobe (Ungerleider and Haxby, 1994; Straube et al., 2011c), form or color information in the temporal lobe (Patterson et al., 2007). A similar finding is illustrated by Binder and Desai (2011): by reviewing 38 imaging studies that examined concrete knowledge processing during language comprehension tasks, the authors found that the processing of action-related speech material activates brain regions that are also involved in action execution (see also Hauk et al., 2004; Hauk and Pulvermüller, 2004); similarly, the processing of other concrete speech information, such as sound and color, tends to activate areas that process these perceptual modalities (Binder and Desai, 2011). In sum, abstract information processing has been shown to recruit a mainly left-lateralized fronto-temporal neural network whereas concrete information comprehension involves rather diverse activation foci, which are primarily related to the corresponding perceptual origin.

In addition to our speech capacity, gesturing is a flexible communicative tool which humans use to communicate both concrete and abstract information via the visual modality. Previous studies on object- or person-related gesture processing have either presented pantomimes of tool or object use, hands grasping for tools or objects (e.g., Decety et al., 1997; Faillenot et al., 1997; Decety and Grèzes, 1999; Grèzes and Decety, 2001; Buxbaum et al., 2005; Filimon et al., 2007; Pierno et al., 2009; Biagi et al., 2010; Davare et al., 2010; Emmorey et al., 2010; Jastorff et al., 2010); or symbolic gestures like "thumbs up" (Nakamura et al., 2004; Molnar-Szakacs et al., 2007; Husain et al., 2009; Xu et al., 2009; Andric et al., 2013). However, few studies directly compared abstract-social (person-related) with concrete-object-related gestures. A previous study demonstrated that the left IFG is involved in the processing of expressive (emotional) in contrast to body referred and isolated (object-related) hand gestures (Lotze et al., 2006). This finding suggests that the left IFG is sensitive to the processing of abstract information irrespective of communication modality (speech or gestures).

In sum, the left IFG represents a sensitive region for abstract information processing in speech or gesture, whereas the brain areas activated by concrete information depend on communication modality and semantic content. However, whether the same neural structures are relevant for the processing of gestures and sentences with an abstract content or gestures and sentences with a concrete content remains unknown.

Common neural networks for the processing of speech and gesture information have been suggested (Willems and Hagoort, 2007), and empirically tested in several recent studies (Xu et al., 2009; Andric and Small, 2012; Straube et al., 2012; Andric et al., 2013). Andric et al. (2013) performed an fMRI study on gesture processing presenting two different kinds of hand actions (emblematic gestures and grasping movements) and speech to their participants. Thus, either emblematic gestures—hand and arm movements conveying social or symbolic meaning (e.g., "thumbs up" for having done a good job)—or grasping movements (e.g., grasping a stapler) not carrying any semantic meaning *per se* were presented. The authors identified two different types of brain responses for the processing of emblematic gestures: the first type was related to the processing of linguistic meaning, the other type corresponded to the processing of hand actions or movements, regardless of the symbolic meaning conveyed. The latter type involved brain responses in parietal and premotor areas in connection with hand movements, whereas meaning bearing information, e.g., emblem and speech, resulted in activations in left lateral temporal and inferior frontal areas. Altogether, different modalities were involved distinguishing the level of mere perceptual recognition and interpretation of socially and culturally relevant emblematic gestures. More importantly, although lacking baseline conditions containing more concrete semantics (either in gesture or speech), the results from this study tentatively imply a common neural network for processing abstract meaning, irrespective of its input modality.

In a similar vein, Xu et al. (2009) investigated the processing of emblems and pantomimes and their corresponding speech utterances via fMRI. Their finding converges with the imaging results of Andric and colleagues in the sense that both input modalities activated a common, left-lateralized network encompassing inferior frontal and posterior temporal regions. However, although utilizing emblems (abstract) and pantomimes (concrete) as stimuli, the authors did not elaborate on how different levels of semantics (abstract/concrete) are processed via gesture or speech. Moreover, in a recent study from our laboratory, Straube et al. (2012) looked at less conventionalized iconic gestures, but still found a fronto-temporal network which was responsible for both the processing of gesture and speech semantics. Altogether, the three aforementioned studies unanimously suggest a common fronto-temporal neural network to be responsible for the processing of not only speech but also gesture semantics.

Although tentative proposals regarding a supramodal neural network for speech and gesture semantics have been made (Xu et al., 2009; Straube et al., 2012), it remains unclear how different levels of semantics—either concrete or abstract—are processed differently with respect to the input modalities. To date, no study results on a direct comparison between abstract and concrete semantic information processing with visual (gesture) or auditory (speech) input are available.

As hypothesized above, concrete object-related information might be represented in mind with and/or without speech, whereas abstract information could require/rely on a representation in speech. Consequently, common processing mechanisms for the processing of speech and gesture semantics can be specifically expected when abstract (in contrast to concrete) information is communicated. Therefore, the current study focused on the neural correlates of abstractness and modality in a communication context. With a factorial manipulation of content (abstract vs. concrete) and communication modality (speech vs. gestures) we wanted to shed light on supramodal neural network properties relevant for the processing of abstract in contrast to concrete information. We tested the following alternative hypotheses: first, if only abstract concepts—activated through speech or gesture in natural communication situations—are processed in a supramodal manner, then we predict consistent neural signatures only for abstract in contrast to concrete contents across different types of communication modality. However, if concrete concepts—activated through speech or gestures—are also represented in a supramodal network, we predict overlapping neural responses for concrete in contrast to abstract contents across modality.

To manipulate abstractness and communication modality we used video clips of an actor either speaking sentences with an abstract-social [AS] or concrete-object-related content [CS], or performing meaningful abstract-social (emblematic) [AG] or concrete-object-related (tool-use) gestures [CG]. Gestures were accompanied by a foreign language (Russian) to increase the comparability between conditions and naturalness of the gesture videos where spoken language frames the communication context. We used emblematic and tool-related gestures to guarantee high comprehensibility of the gestures. During the experiments participants performed a content judgment task referring to the person vs. object-relatedness of the speech and gesture communications to ensure their attention to the semantic information and the adequate comprehension of the corresponding meaning. We hypothesized modality independent activations exclusively for the processing of abstract information (AS *>* CS ∩ AG *>* CG) in language-related regions encompassing the left inferior frontal gyrus, the left middle, and superior temporal gyrus (MTG/STG) as well as regions related to social/emotional processing such as the temporal pole, the medial frontal, and anterior cingulate cortex (ACC). In addition, modality specific activations were expected in bilateral occipital, parietal, and temporal brain regions for gesture (G *>* S) and in left temporal, temporo-parietal, and inferior frontal regions for the processing of speech semantics (S *>* G).

#### **METHODS**

#### **PARTICIPANTS**

Twenty healthy subjects (7 females) participated in the study. The mean age of the subjects was 25.4 years (*SD*: 3.42, range: 22.0–35.0). All participants were right-handed (Oldfield, 1971), native German speakers and had no knowledge of Russian. All subjects had normal or corrected-to-normal vision, and none reported any hearing deficits. A history of relevant medical or psychiatric illness was an exclusion criterion. All subjects gave written informed consent prior to participation in the study. The study was approved by the local ethics committee.

#### **STIMULUS MATERIAL**

Video clips were selected from a large pool of different videos. Some of them have been used in previous fMRI studies, focusing on different aspects of speech and gesture processing (Green et al., 2009; Kircher et al., 2009b; Straube et al., 2009, 2010, 2011a,b, 2012, 2013a,b; Leube et al., 2012; Mainieri et al., 2013). Here, we used emblematic and tool-related gestures and corresponding sentences to guarantee high comprehensibility of the gestures and a strong difference in abstractness between conditions. For the current analysis, 208 (26 videos per condition × 4 conditions × 2 sets) short video clips depicting an actor were used. The actor performed the following conditions: (1) German sentences with an abstract-social content [AS], (2) Russian sentences with abstract-social (emblematic) gestures [AG], (3) German sentences with a concrete-object-related content [CS], and (4) Russian sentences with concrete-object-related (tool-use) gestures [CG] (**Figure 1**). Thus, we presented videos with semantic information only in speech or only in gesture, both of them in either a highly abstract-social or a concrete-object-related version. Additionally, two bimodal meaningful speech-gesture conditions and one meaningless speech-gesture condition have been presented, which are not of interest for the current analysis.

We decided to present gestures accompanied by a foreign language to increase the comparability between conditions and the naturalness of the gesture videos where spoken language frames the communication context. All sentences had a similar grammatical structure (subject-predicate-object) and were translated into Russian for the gesture conditions. Words that sounded similar in each language were avoided. Examples for the German sentences are: "The blacksmith **hammers** on the metal plate" ("Der Schmied hämmert auf die Metallplatte"; CS condition) or "The bishop **exhorts** the believers" ("Der Bischof ermahnt die Gläubigen"; AS condition; see **Figure 1**). Thus, the sentences had a similar length of five to eight words and a similar grammatical form, but differed considerably in content. The corresponding gestures (keyword indicated in bold) matched the corresponding speech content, but were presented here only in a foreign language context.

**FIGURE 1 | For each of the four conditions (AG, abstract-gesture; CG, concrete-gesture; AS, abstract-speech; CS, concrete-speech) an example of the stimulus material is depicted.** Note: For illustrative purposes the spoken German sentences were translated into English and all spoken sentences were written into speech bubbles.

The same male bilingual actor (German and Russian) performed all the utterances and gestures in a natural spontaneous way. Intonation, prosody and movement characteristics in the corresponding variations of one item were closely matched. At the beginning and at the end of each clip the actor stood with arms hanging comfortably. Each clip had a duration of 5 s including 500 ms before and after the experimental manipulation, where the actor neither spoke nor moved. In the present study the semantic aspects of the stimulus material refer to differences in abstractness of the communicated information (abstract vs. concrete content).

For stimulus validation, 20 participants not taking part in the fMRI study rated each video on a scale from 1 to 7 concerning understandability, imageability and naturalness (1 = very low to 7 = very high). In order to assess *understandability* participants were asked: How understandable is the video clip? (original: "Wie VERSTÄNDLICH ist dieser Videoclip?"). The rating scale ranged from 1 = very difficult to understand (sehr schlecht verständlich) to 7 = very easy/good to understand (sehr gut verständlich). For *naturalness* ratings the participants were asked: How natural is the scene? (original: "Wie NATÜRLICH ist diese Szene?"). The rating scale ranged from 1 = very unnatural (sehr unnatürlich) to 7 = very natural (sehr natürlich). Finally, for judgments of *imageability* the participants were asked: How pictorial/imageable is the scene? (original: "Wie BILDHAFT ist dieser Videoclip?"). The rating scale ranged from 1 = very abstract (sehr abstrakt) to 7 = very pictorial/imageable (sehr bildhaft). These scales have been used in previous investigations, too (Green et al., 2009; Kircher et al., 2009b; Straube et al., 2009, 2010, 2011a,b). A set of 338 video clips (52 German sentences with concrete-object-related content, 52 German sentences with abstract-social content and their counterparts in Russian-gesture and German-gesture condition and 26 Russian control condition) were chosen as stimuli for the fMRI experiment on the basis of high naturalness and high understandability for the German and gesture conditions. The stimuli were divided into two sets in order to present each participant with 182 clips during the scanning procedure (26 items per condition), counterbalanced across subjects. A single participant only saw complementary derivatives of one item, i.e., the same sentence or gesture information was only presented once per participant. This was done to avoid speech or gesture repetition or carryover effects. Again, all parameters listed above were used for an equal assignment of the video clips to the two experimental sets, to avoid set-related between-subject differences. As an overview, **Table 1** lists the mean durations of speech and gestures as well as the mean ratings of comprehension, imageability, and naturalness of the items used for the current analyses.
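The clip counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the reading that the 26 control clips were shown in both sets is an assumption that happens to reconcile the totals of 338 and 182, not something stated in the text.

```python
# Numbers quoted in the stimulus description.
SENTENCES_PER_CONTENT = 52   # sentence pool per content type
CONTENT_TYPES = 2            # abstract-social, concrete-object-related
VERSIONS_PER_SENTENCE = 3    # German speech, Russian + gesture, German + gesture
CONTROL_CLIPS = 26           # Russian control condition
ITEMS_PER_CONDITION = 26     # items per condition seen by one participant
ANALYSED_CONDITIONS = 4      # AS, CS, AG, CG
N_SETS = 2

# Full stimulus pool selected for the experiment.
total_clips = (SENTENCES_PER_CONTENT * CONTENT_TYPES * VERSIONS_PER_SENTENCE
               + CONTROL_CLIPS)

# Assumption: each set holds 26 items for the six sentence-derived
# conditions plus the same 26 control clips.
clips_per_participant = (CONTENT_TYPES * VERSIONS_PER_SENTENCE * ITEMS_PER_CONDITION
                         + CONTROL_CLIPS)

# Clips entering the current four-condition analysis across both sets.
analysed_clips = ITEMS_PER_CONDITION * ANALYSED_CONDITIONS * N_SETS

print(total_clips, clips_per_participant, analysed_clips)
```

Under these assumptions the three values match the 338, 182 and 208 clips reported in the text.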

The ratings on understandability for the videos of the four conditions used in this study clearly show a main effect of modality, with the speech varieties scoring higher than the gesture varieties [*F*(1, 113.51) = 1878.79, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe]. This effect stems from the fact that different languages were used for speech only and gesture with speech conditions. Video clips with German speech scored higher than 6 while Russian speech with gestures videos scored between 3 and 4 (6.58 vs. 3.47, respectively). This difference is in line with the assumption that when presented without the respective sentence context isolated gestures are less meaningful, but even then they still are more or less understandable, which was important for the current study.

**Table 1 | Number of videos and their mean durations of stimulus parameters speech and gesture as well as their mean stimulus ratings of understandability, imageability, and naturalness according to the four conditions abstract-gesture (AG), concrete-gesture (CG), abstract-speech (AS), and concrete-speech (CS) for set 1, set 2 and in total.**

*SD, standard deviation.*

Imageability ratings indicated that the conditions also differed in their capacity to evoke mental images. A significant main effect for modality showed that videos consisting of Russian sentences with gesture were evaluated as being more imageable than videos consisting only of German sentences [4.57 vs. 3.25, respectively; *F*(1, 144.92) = 349.89, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe]. A significant interaction effect indicated that this difference was even more pronounced for the concrete conditions [*F*(1, 144.92) = 24.22, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe].

Naturalness ratings showed a main effect for modality as well. Videos including Russian sentences with gestures were evaluated as more natural than videos including German speech [4.39 vs. 3.59, respectively; *F*(1, 160.63) = 225.65, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe]. There was also a difference in naturalness ratings concerning the abstractness of the included content. Videos depicting concrete content were evaluated as being less natural than videos depicting abstract content [4.26 vs. 3.72, respectively; *F*(1, 160.63) = 104.48, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe]. Additionally, an interaction effect indicated that videos consisting of German speech with concrete content were evaluated as least natural [*F*(1, 160.63) = 28.18, *P* < 0.001, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe].

The sentences had an average speech duration of 2263 ms (*SD* = 340 ms), with German sentences being somewhat longer than Russian sentences [2335 vs. 2192 ms, respectively; *F*(1, 180.94) = 9.51, *P* < 0.05, two-factorial between-subjects ANOVA with adjusted degrees of freedom according to Brown–Forsythe]. The gestures analyzed here had an average gesture duration of 2639 ms (*SD* = 538 ms), with gestures for concrete content being longer than gestures for abstract content [3011 vs. 2266 ms, respectively; *t*(102) = 9.78, *P* < 0.001].

Events for the fMRI statistical analysis were defined in accordance with the bimodal German conditions [compare for example Green et al. (2009); Straube et al. (2012)] as the moment with the highest semantic correspondence between speech and gesture stroke (peak movement): Each sentence contained only one element that could be illustrated, which was intuitively done by the actor. The events occurred on average 2036 ms (*SD* = 478 ms) after the video start and were used for the modulation of events in the event-related fMRI analysis. The use of these predefined integration time points (see Green et al., 2009) for the fMRI data analysis had the advantage that the timing for all conditions of one stimulus was identical since conditions were counterbalanced across subjects. Additionally, speech and gesture duration were used as parameters of no interest on single trial level to control for condition specific differences in these parameters.

#### **EXPERIMENTAL PROCEDURE**

During fMRI data acquisition participants were presented with videos of an actor either speaking sentences (S) or performing meaningful gestures (G) with an abstract-social (A) or concrete-object-related (C) content. Gestures were accompanied by an unknown foreign language (Russian). Participants performed a content judgment task referring to the person vs. object-relatedness of the utterances.

#### **fMRI DATA ACQUISITION**

All MRI data were acquired on a 3T scanner (Siemens MRT Trio series). Functional images were acquired using a T2-weighted echo planar image sequence (*TR* = 2 s, *TE* = 30 ms, flip angle 90°, slice thickness 4 mm with a 0.36 mm interslice gap, 64 × 64 matrix, FoV 230 mm, in-plane resolution 3.59 × 3.59 mm, 30 axial slices orientated parallel to the AC-PC line covering the whole brain). Two runs of 425 volumes were acquired during the experiment. The onset of each trial was synchronized to a scanner pulse.
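The acquisition parameters are related by simple arithmetic, which the following sketch makes explicit. It only recombines the values quoted above; the function names are illustrative and not part of any scanner or analysis software.

```python
# Values quoted in the acquisition paragraph.
FOV_MM = 230.0
MATRIX = 64
N_SLICES = 30
SLICE_THICKNESS_MM = 4.0
SLICE_GAP_MM = 0.36
TR_S = 2.0
VOLUMES_PER_RUN = 425


def in_plane_resolution(fov_mm, matrix):
    """Edge length of one in-plane voxel."""
    return fov_mm / matrix


def slab_coverage(n_slices, thickness_mm, gap_mm):
    """Extent covered by n slices and the (n - 1) gaps between them."""
    return n_slices * thickness_mm + (n_slices - 1) * gap_mm


def run_duration_s(n_volumes, tr_s):
    """Duration of one functional run."""
    return n_volumes * tr_s


voxel_mm = in_plane_resolution(FOV_MM, MATRIX)
coverage_mm = slab_coverage(N_SLICES, SLICE_THICKNESS_MM, SLICE_GAP_MM)
run_s = run_duration_s(VOLUMES_PER_RUN, TR_S)

print(f"in-plane voxel: {voxel_mm:.2f} mm")
print(f"slab coverage: {coverage_mm:.2f} mm")
print(f"run duration: {run_s / 60:.1f} min")
```

The computed voxel size agrees with the reported 3.59 mm, and a run of 425 volumes at a TR of 2 s lasts a little over 14 min.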

#### **EXPERIMENTAL DESIGN AND PROCEDURE**

An experimental session comprised 182 trials (26 for each condition) and consisted of two 14-min blocks. Each block contained 91 trials with a matched number of items from each condition (13). The stimuli were presented in an event-related design in pseudo-randomized order and counterbalanced across subjects. As described above (stimulus material) across subjects each item was presented in corresponding conditions, but a single participant only saw complementary derivatives of one item, i.e., the same sentence or gesture information was only seen once per participant. Each clip was followed by a gray background with a variable duration of 2154–5846 ms (jitter average: 4000 ms).
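A block with this structure could be generated along the following lines. This is a minimal sketch under stated assumptions: a plain shuffle stands in for the unspecified pseudo-randomization constraints, the jitter is assumed to be uniformly distributed over the reported range, and the labels for the three conditions not analyzed here are hypothetical.

```python
import random
from collections import Counter

ANALYSED = ["AS", "CS", "AG", "CG"]
# Hypothetical labels for the three conditions not analysed in this paper.
OTHER = ["A_BIMODAL", "C_BIMODAL", "CONTROL"]
CONDITIONS = ANALYSED + OTHER

TRIALS_PER_CONDITION_PER_BLOCK = 13
CLIP_MS = 5000
JITTER_MIN_MS, JITTER_MAX_MS = 2154, 5846


def make_block(seed):
    """Return one block of trials with shuffled order and jittered intervals."""
    rng = random.Random(seed)
    labels = [c for c in CONDITIONS for _ in range(TRIALS_PER_CONDITION_PER_BLOCK)]
    rng.shuffle(labels)  # assumption: no further ordering constraints
    trials = []
    onset_ms = 0
    for label in labels:
        # Assumption: uniform jitter over the reported range.
        jitter_ms = rng.randint(JITTER_MIN_MS, JITTER_MAX_MS)
        trials.append({"condition": label, "onset_ms": onset_ms, "jitter_ms": jitter_ms})
        onset_ms += CLIP_MS + jitter_ms
    return trials


block = make_block(seed=1)
counts = Counter(trial["condition"] for trial in block)

# Expected block length from the clip duration and the mid-range jitter.
mean_jitter_ms = (JITTER_MIN_MS + JITTER_MAX_MS) / 2
expected_block_s = len(block) * (CLIP_MS + mean_jitter_ms) / 1000

print(len(block), dict(counts))
print(f"expected block length: {expected_block_s / 60:.2f} min")
```

With 91 trials of 5 s plus an average 4 s interval, the expected block length is 819 s, which fits within the 14-min blocks described above.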

Before scanning, each participant received at least six practice trials outside the scanner to ensure comprehensive understanding of the experimental task. Prior to the start of the experiment, the volume of the videos was individually adjusted so that the clips were clearly audible. During scanning, participants were instructed to watch the videos and to indicate via left hand key presses whether the content of the sentence or the gesture referred to objects (index finger) or to interpersonal social information (e.g., feelings or requests; middle finger). This task enabled us to focus participants' attention on the semantic content of speech and gesture and to investigate comprehension in a rather implicit manner. Performance rates and reaction times were recorded.

#### **MRI DATA ANALYSIS**

MR images were analyzed using Statistical Parametric Mapping (SPM8) standard routines and templates (www.fil.ion.ucl.ac.uk). After discarding the first five volumes to minimize T1-saturation effects, all images were spatially and temporally realigned, normalized (resulting voxel size 2 × 2 × 2 mm<sup>3</sup>), smoothed (8 mm isotropic Gaussian filter) and high-pass filtered (cut-off period 128 s).

Statistical whole-brain analysis was performed in a two-level, mixed-effects procedure. In the first level, single-subject BOLD responses were modeled by a design matrix comprising the onsets of each event within the videos (see stimulus material) of all seven experimental conditions. As an additional factor, each video phase was modeled as a mini-block with 5 s duration. To control for condition specific differences in speech and gesture duration these stimulus characteristics were used as parameters of no interest on single trial level. The hemodynamic response was modeled by the canonical hemodynamic response function (HRF). Parameter estimate (β-) images for the HRF were calculated for each condition and each subject. Parameter estimates for the four relevant conditions were entered into a within-subject flexible factorial ANOVA.
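To illustrate how event onsets become a regressor in such a design matrix, the sketch below convolves unit impulses with a double-gamma HRF and samples the result once per scan. It is not the SPM8 implementation: the shape parameters (6 and 16) and the undershoot ratio (1/6) are the commonly cited defaults and are assumptions here, and the function names are ours.

```python
import math


def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)


def canonical_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1.0 / 6.0):
    """Double-gamma HRF: a positive response minus a scaled late undershoot."""
    return gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, undershoot_shape)


def event_regressor(onsets_s, n_scans, tr_s, dt=0.1, hrf_length_s=32.0):
    """Convolve impulses at the event onsets with the HRF, sampled at each scan."""
    n_fine = int(round(n_scans * tr_s / dt))
    neural = [0.0] * n_fine
    for onset in onsets_s:
        idx = int(round(onset / dt))
        if 0 <= idx < n_fine:
            neural[idx] += 1.0
    kernel = [canonical_hrf(k * dt) for k in range(int(round(hrf_length_s / dt)))]
    bold = [0.0] * n_fine
    for i, amplitude in enumerate(neural):
        if amplitude == 0.0:
            continue
        for k, h in enumerate(kernel):
            if i + k >= n_fine:
                break
            bold[i + k] += amplitude * h
    step = int(round(tr_s / dt))
    return [bold[scan * step] for scan in range(n_scans)]


# One event 2 s after the start of a run, sampled with TR = 2 s.
example = event_regressor([2.0], n_scans=20, tr_s=2.0)
print([round(v, 3) for v in example[:8]])
```

In the actual analysis each condition would contribute its own regressor, built from the predefined event times within the videos, alongside the 5 s mini-block and duration regressors described above.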

A Monte Carlo simulation of the brain volume was employed to establish an appropriate voxel contiguity threshold (Slotnick and Schacter, 2004). This correction has the advantage of higher sensitivity to smaller effect sizes, while still correcting for multiple comparisons across the whole brain volume. Assuming an individual voxel type I error of *P* < 0.001, a cluster extent of 50 contiguous resampled voxels was indicated as necessary to correct for multiple voxel comparisons at *P* < 0.05. This cluster threshold (based on the whole brain volume) has been applied to all contrasts. The reported voxel coordinates of activation peaks are located in MNI space. For the anatomical localization, functional data were referenced to probabilistic cytoarchitectonic maps (Eickhoff et al., 2005) and the AAL toolbox (Tzourio-Mazoyer et al., 2002).
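The logic of a Monte Carlo cluster-extent threshold can be sketched as follows. This toy version is not the procedure of Slotnick and Schacter (2004) as applied here: it uses a small hypothetical grid, assumes spatially independent voxels (no smoothing), and uses face connectivity, so it will not reproduce the 50-voxel threshold reported above. It only shows the principle of simulating null volumes and choosing the smallest cluster extent that is rarely exceeded by chance.

```python
import random

NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def max_cluster_size(active):
    """Size of the largest face-connected cluster among active voxel coordinates."""
    remaining = set(active)
    largest = 0
    while remaining:
        stack = [remaining.pop()]
        size = 0
        while stack:
            x, y, z = stack.pop()
            size += 1
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    stack.append(neighbour)
        largest = max(largest, size)
    return largest


def simulate_cluster_threshold(shape, voxel_p, alpha, n_iter, seed=0):
    """Smallest extent k such that P(max null cluster >= k) < alpha."""
    rng = random.Random(seed)
    nx, ny, nz = shape
    maxima = []
    for _ in range(n_iter):
        # Assumption: each voxel is a false positive independently with voxel_p.
        active = [(x, y, z)
                  for x in range(nx) for y in range(ny) for z in range(nz)
                  if rng.random() < voxel_p]
        maxima.append(max_cluster_size(active))
    extent = 1
    while sum(1 for m in maxima if m >= extent) / n_iter >= alpha:
        extent += 1
    return extent


# Hypothetical small volume and a liberal voxel threshold, for illustration.
threshold = simulate_cluster_threshold((10, 10, 10), voxel_p=0.01, alpha=0.05,
                                       n_iter=100, seed=1)
print("cluster extent threshold:", threshold)
```

With realistic volume dimensions and spatial smoothness, neighboring voxels are correlated, chance clusters become larger, and the resulting extent threshold rises accordingly.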

#### **CONTRASTS OF INTEREST**

The neural processing of abstract information was isolated by computing the difference contrast of abstract-social vs. concrete-object-related sentences [AS *>* CS] and gestures [AG *>* CG], whereas the opposite contrasts were applied to reveal brain regions sensitive to the processing of concrete information communicated by speech [CS *>* AS] and gesture [CG *>* AG].

In order to find regions that are commonly activated by both processes, contrasts were entered into a conjunction analysis (abstract: [AS *>* CS ∩ AG *>* CG]; concrete: [CS *>* AS ∩ CG *>* AG]), testing for independently significant effects compared at the same threshold (conjunction null, see Nichols et al., 2005).
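Under the conjunction null, a voxel counts as commonly activated only if every contributing contrast exceeds the threshold on its own, which is equivalent to thresholding the voxel-wise minimum statistic. A minimal sketch with hypothetical t-values and a hypothetical threshold:

```python
def conjunction_null(t_map_a, t_map_b, t_threshold):
    """Voxel-wise conjunction: the smaller of the two t-values must pass."""
    return [min(a, b) > t_threshold for a, b in zip(t_map_a, t_map_b)]


# Hypothetical t-values for four voxels in the two abstractness contrasts.
t_as_gt_cs = [4.1, 5.0, 1.2, 3.6]
t_ag_gt_cg = [3.9, 2.0, 4.8, 3.5]
T_THRESHOLD = 3.3  # hypothetical critical value, not taken from the study

common = conjunction_null(t_as_gt_cs, t_ag_gt_cg, T_THRESHOLD)
print(common)
```

Only the first and last voxel survive, because the second and third pass the threshold in just one of the two contrasts.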

The identical approach was applied to demonstrate the effect of modality by calculating the following conjunction analyses, for gesture [AG *>* AS ∩ CG *>* CS] and for speech semantics [AS *>* AG ∩ CS *>* CG].

Finally, interaction analyses were performed ([AS vs. AG] vs. [CS vs. CG]) to explore modality specific effects with regard to the processing of abstract vs. concrete information. A masking procedure was used to ensure that all interactions are based on significant differences of the first contrast (e.g., [CG *>* CS] *>* [AG *>* AS] inclusively masked by [CG *>* CS]).
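The contrasts described in this section can be written as weight vectors over the four condition estimates. The sketch below assumes the column order AS, AG, CS, CG, which is our choice for illustration and not necessarily the order used in the original design matrix; the beta values are hypothetical.

```python
CONDITION_ORDER = ("AS", "AG", "CS", "CG")  # assumed column order

CONTRASTS = {
    "AS>CS": {"AS": 1, "CS": -1},
    "AG>CG": {"AG": 1, "CG": -1},
    "CG>CS": {"CG": 1, "CS": -1},
    # (CG - CS) - (AG - AS)
    "[CG>CS]>[AG>AS]": {"CG": 1, "CS": -1, "AG": -1, "AS": 1},
}


def weight_vector(name):
    """Contrast weights in the assumed condition order."""
    return [CONTRASTS[name].get(condition, 0) for condition in CONDITION_ORDER]


def contrast_value(name, betas):
    """Weighted sum of condition estimates for one voxel."""
    weights = weight_vector(name)
    return sum(w * betas[c] for c, w in zip(CONDITION_ORDER, weights))


def inclusive_mask(effect_significant, mask_significant):
    """Keep a voxel only if it is significant in both the effect and the mask."""
    return [e and m for e, m in zip(effect_significant, mask_significant)]


# Hypothetical parameter estimates for a single voxel.
betas = {"AS": 1.2, "AG": 0.9, "CS": 0.4, "CG": 1.5}
interaction = contrast_value("[CG>CS]>[AG>AS]", betas)
print(weight_vector("[CG>CS]>[AG>AS]"), round(interaction, 2))
```

The inclusive mask then restricts the interaction map to voxels where the simple effect [CG *>* CS] is itself significant, so that an interaction cannot arise solely from a reversed effect in the other pair.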

#### **RESULTS**

#### **BEHAVIORAL RESULTS**

Subjects were instructed to indicate via button press whether the actor in the video described a socially related action or an object-related action. Correct responses and their reaction times were each analyzed with a two-way within-subjects ANOVA with the repeated measurement factors modality (gesture vs. speech) and abstractness (abstract vs. concrete).

Correct responses showed a significant main effect for modality with videos depicting gesture with Russian speech receiving slightly lower scores than videos depicting German speech only [21.8 vs. 22.95 out of 26, respectively; *F*(1, 19) = 8.369, *P* < 0.05, partial-eta-squared = 0.31]. A significant main effect for abstractness clearly indicated that videos describing abstract social content were less often identified correctly than videos showing concrete object-related content [20.3 vs. 24.45 out of 26, respectively; *F*(1, 19) = 15.361, *P* < 0.001, partial-eta-squared = 0.45]. The factors modality and abstractness also showed a modest significant interaction effect on correct responses [*F*(1, 19) = 4.572, *P* < 0.05, partial-eta-squared = 0.19] stemming from the fact that for videos depicting abstract content the difference between gesture with Russian speech and German speech was more pronounced than for videos showing concrete object-related content (**Figure 2A**).
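The reported effect sizes can be recovered from the F statistics and their degrees of freedom, since partial eta squared equals F·df1 / (F·df1 + df2). The short check below uses only the values quoted above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values for correct responses, each with df = (1, 19).
reported = {
    "modality": (8.369, 0.31),
    "abstractness": (15.361, 0.45),
    "interaction": (4.572, 0.19),
}

recomputed = {name: round(partial_eta_squared(f, 1, 19), 2)
              for name, (f, _) in reported.items()}
print(recomputed)
```

All three recomputed values agree with the reported effect sizes after rounding to two decimals.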

For each participant the median reaction time for each condition was computed from all correct responses of that condition. A significant interaction effect of modality and abstractness [*F*(1, 19) = 5.227, *P* < 0.05, partial-eta-squared = 0.22] indicated that while there was no difference for videos depicting concrete content, participants reacted slightly faster to videos depicting abstract content with gesture and slightly slower to videos of abstract content with German speech (**Figure 2B**).
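The reported effect sizes can be cross-checked against the F-values. The sketch below uses the standard identity for partial eta squared, η²ₚ = (F · df_effect) / (F · df_effect + df_error), with the degrees of freedom (1, 19) given above; the function name and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (F-value, reported partial eta squared) taken from the behavioral results above.
reported = {
    "accuracy, modality": (8.369, 0.31),
    "accuracy, abstractness": (15.361, 0.45),
    "accuracy, interaction": (4.572, 0.19),
    "reaction time, interaction": (5.227, 0.22),
}

for label, (f_value, eta_reported) in reported.items():
    eta = partial_eta_squared(f_value, df_effect=1, df_error=19)
    print(f"{label}: computed {eta:.2f}, reported {eta_reported:.2f}")
```

All four computed values round to the reported ones, so the F-statistics and effect sizes are mutually consistent.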

#### **fMRI RESULTS**

#### *Effects of modality*

For the effect of gesture in contrast to speech semantics independent of the abstractness [AG > AS ∩ CG > CS] we found activation in bilateral occipital, parietal, and right frontal brain regions (see **Table 2**, and **Figure 3C**, yellow). By contrast, for the processing of speech semantics independent of abstractness [AS > AG ∩ CS > CG] we found activations in the left anterior temporal lobe and the supramarginal gyrus (see **Table 2**, and **Figure 3D**, yellow).

**FIGURE 2 | Graphical illustration of the interaction effects of the two factors modality (gesture vs. speech) and abstractness (abstract vs. concrete) on (A) the number of correct responses in percent and on (B) the corresponding reaction times in ms (vertical lines indicate standard errors of the mean).**

**Table 2 | Activation peaks and anatomical regions comprising activated clusters for the conjunction contrasts representing effects of modality (speech vs. gesture and vice versa).**

*Table lists the respective contrast, anatomical regions, cluster size, MNI coordinates, and t-values for each significant activation (p < 0.05 corrected for multiple comparisons). MNI, Montreal Neurological Institute; AS, abstract speech; AG, abstract gesture; CS, concrete speech; CG, concrete gesture; IFG, inferior frontal gyrus; L, left; R, right.*

The exploration of general activation for each condition in contrast to low-level baseline (gray background) indicates that other regions are commonly activated in all conditions (**Figures 3A,B**). Most interestingly, the IFG seems to be activated bilaterally in the gesture conditions (**Figure 3A**) and left lateralized in the speech conditions (**Figure 3B**).

#### *Within modality effects of abstractness*

Analyses targeting within-modality processing of abstractness in language semantics [AS > CS] showed activation in a mainly left-lateralized network encompassing an extended fronto-temporal cluster (IFG, precentral gyrus, middle, inferior, and superior temporal gyrus) as well as medial frontal regions and the right anterior middle temporal gyrus (**Table 3** and **Figure 4** top, blue). We obtained a comparable activation pattern for the within-modality processing of abstractness in gesture semantics ([AG > CG]; see **Figure 4** top, yellow). The opposite contrasts revealed activation in clusters encompassing the left cerebellum, fusiform, and inferior temporal gyrus in the language contrast (CS > AS; see **Figure 4** bottom, blue) and the bilateral occipital lobe for the gesture contrast (CG > AG; see **Figure 4** bottom, yellow).

#### *Common activations for abstractness contained in gestures and spoken language*

Processing of abstract information independent of input modality as disclosed by the conjunction of [AS > CS ∩ AG > CG] was related to a left-sided frontal cluster including the temporal pole, the IFG (pars triangularis and orbitalis), the middle temporal and angular as well as the medial superior frontal gyrus (**Table 3** and **Figure 4** top middle/right, green). The opposite conjunction analysis [CS > AS ∩ CG > AG] revealed no significant common activation for the processing of concrete in contrast to abstract information.

#### *Interaction*

No significant activation could be identified in the interaction analyses at the selected significance threshold. However, by applying a different combination of voxel-level threshold and cluster size to correct for multiple comparisons (*p* < 0.005 and 86 voxels), as indicated by an additional Monte Carlo simulation, we found an interaction in occipital (MNI *x, y, z*: −20, −90, −8, *t* = 3.63, *p* < 0.001, 140 voxels), parietal (MNI *x, y, z*: −34, −48, 68, *t* = 3.80, *p* < 0.001, 143 voxels; MNI *x, y, z*: −34, −40, 48, *t* = 3.11, *p* < 0.001, 88 voxels) and premotor (MNI *x, y, z*: −34, −4, 62, *t* = 3.55, *p* < 0.001, 129 voxels) regions, reflecting a specific increase of activation in these regions for the processing of concrete object-related gesture meaning ([CG > CS] > [AG > AS] inclusively masked by [CG > CS]).

#### **DISCUSSION**

We hypothesized that the processing of abstract semantic information of spoken language and symbolic emblematic gestures is based on a common neural network. Our study design tailored the comparison to the level of abstract semantics, controlling for processing of general semantic meaning of speech and gesture by using highly meaningful concrete object-related information as control condition. The results demonstrate that the pathways engaged in the processing of semantics contained in both abstract spoken language and abstract-social gestures comprise the temporal pole, the IFG (pars triangularis and orbitalis), the middle temporal, angular and the superior frontal gyri. Thus, in line with our hypothesis we found modality-independent activation in a left hemispheric fronto-temporal network for the processing of abstract information. The strongly left-lateralized activation pattern supports the theory that abstract semantics is represented in language independently of communication modality (at least at the neural level, in language-related brain regions).


**Table 3 | Activation peaks and anatomical regions comprising activated clusters for the contrasts representing effects of abstractness (abstract vs. concrete and vice versa) dependent on modality (speech or gesture).**

*Table lists the respective contrast, anatomical regions, cluster size, MNI coordinates, and t-values for each significant activation (p < 0.05 corrected for multiple comparisons). MNI, Montreal Neurological Institute; AS, abstract speech; AG, abstract gesture; CS, concrete speech; CG, concrete gesture; IFG, inferior frontal gyrus; L, left; R, right.*

#### **EFFECTS OF MODALITY**

The results of the speech [CS > CG ∩ AS > AG] and gesture contrasts [CG > CS ∩ AG > AS] clearly demonstrate that communication modality affects neural processing in the brain independent of the communication content (abstract/concrete). In line with other studies that contrasted the processing of a native against an unknown foreign language (Perani et al., 1996; Schlosser et al., 1998; Pallier et al., 2003; Straube et al., 2012), we found activation along the left temporal lobe (including STG, MTG, and ITG) for German speech contrasted with Russian speech and gesture. This strongly left-lateralized pattern has been found in all of the above-mentioned studies. Apart from these studies with conditions very similar to ours, temporal as well as inferior frontal regions have been frequently implicated in various language tasks (for reviews see Bookheimer, 2002; Vigneau et al., 2006; Price, 2010). The lack of IFG activation in our study is probably due to the fact that we compared a native language (CS, AS) with a foreign language which was accompanied by a meaningful gesture (CG, AG). Thus, motoric or semantic processes of the left IFG might be equally involved in the speech and gesture conditions, as indicated by baseline contrasts (see **Figures 3A,B**).

In line with studies on action observation (e.g., Decety et al., 1997; Decety and Grèzes, 1999; Grèzes and Decety, 2001; Filimon et al., 2007) and co-verbal gesture processing (e.g., Green et al., 2009; Kircher et al., 2009b; Straube et al., 2011a), we found for the processing of gesture in contrast to speech information a bilaterally distributed network of activation including occipital, parietal, posterior temporal, and right frontal brain regions.

#### **SUPRAMODAL PROCESSING OF ABSTRACT SEMANTICS OF SPEECH AND GESTURE**


The processing of abstract spoken language semantics (AS > CS) and abstract semantic information conveyed through abstract-social in contrast to concrete-object-related gestures (AG > CG) activated an overlapping network of brain regions. These include a cluster in the left inferior frontal cortex (BA 44, 45) which expanded into the temporal pole, the left inferior, and middle temporal gyrus as well as a cluster in the left medial superior frontal gyrus. Those findings support the model of a supramodal semantic network for the processing of abstract information. By contrast, for concrete vs. abstract information we obtained no overlapping activation.

These results extend studies from both the gesture and the language domain (see above) in showing a common neural representation of specific speech and gesture semantics. Furthermore, the findings go beyond previous reports about common activation for symbolic gestures and speech semantics (Xu et al., 2009), in showing specific effects for abstract but not concrete speech and gesture information. Interestingly, we previously found similar activation of the left IFG and temporal brain regions for the processing of concrete speech and gesture semantics of iconic gestures (Straube et al., 2012). Whereas iconic gestures are not symbolic and usually occur in a concrete sentence context (e.g., "The ball is round," using both hands to indicate a round shape), they might implicate rather abstract information without speech, since any concrete meaning can be revealed from these iconic gestures in this context. Thus, the left IFG activation in our previous study could also be explained by an abstract interpretation of isolated iconic gestures (Straube et al., 2012).

The left-lateralization of our findings is congruent with the majority of fMRI studies on language (see Bookheimer, 2002; Price, 2010, for reviews). Left fronto-temporal activations have been frequently observed for semantic processing [e.g., Gaillard et al., 2004; for a review see Vigneau et al. (2006)], the decoding of meaningful actions (e.g., Decety et al., 1997; Grèzes and Decety, 2001) and also with regard to co-verbal gesture processing (Willems et al., 2007, 2009; Holle et al., 2008, 2010; Kircher et al., 2009b; Straube et al., 2011a).

With regard to the inferior frontal activations, functional imaging studies have underlined the importance of this region in the processing of language semantics. The junction of the precentral gyrus and the pars opercularis of the left IFG has been implicated in controlled semantic retrieval (Thompson-Schill et al., 1997; Wiggs et al., 1999; Wagner et al., 2001), semantic priming (Sachs et al., 2008a,b, 2011; Kircher et al., 2009a; Sass et al., 2009a,b) and a supramodal network for semantic processing of words and pictures (Kircher et al., 2009a). The middle frontal gyrus (MFG) was found to be activated by intramodal semantic priming (e.g., Tivarus et al., 2006). However, medial frontal activation in our study might be better explained by differences in social-emotional content between conditions, since such activation has often been found for social functioning, social cognition, theory of mind, or mentalizing (e.g., Uchiyama et al., 2006, 2012; Krach et al., 2009; Straube et al., 2010).

Since semantic memory represents the basis of semantic processing, an amodal semantic memory (Patterson et al., 2007) is a likely explanation for how speech and gesture semantics could activate a common neural network. Our findings suggest supramodal semantic processing in regions including the left temporal pole, which has been described as the best candidate for a supramodal semantic "hub" (Patterson et al., 2007). Thus, abstract semantic information contained in speech and gestures might have activated supramodal semantic knowledge in our study more strongly than concrete information communicated by speech and gesture.

Our data also partially coincide with Binder and Desai's (2011) neuroanatomical model of semantic processing: in this model, low-level (concrete) sensory, action and emotion semantics are processed in brain areas that are located near corresponding perceptual networks; higher-level (abstract) semantics, by contrast, converge at temporal and inferior parietal regions (Binder and Desai, 2011). As a next step, inferior prefrontal cortices are responsible for the selection of the information stored in temporo-parietal cortices. In the current experiment, abstract information activates both temporal and inferior frontal cortices, and this could be considered as evidence supporting the role of fronto-temporal pathways in the processing of higher-level semantics. More importantly, our results suggest that this processing of abstract information is independent of input modality.

As for the processing of concrete semantics, our results are somewhat surprising because we did not find an overlap between gestural and verbal-auditory input. This result falls outside the predictions of both strict embodiment theories (Barsalou, 1999; Gallese and Lakoff, 2005; Pulvermüller and Fadiga, 2010) and theories which propose less strict embodiment: all these theories would predict that the concrete semantics in our experiment, being predominantly action-driven, would activate motoric brain regions such as (pre-)motor and parietal cortices, and this activation pattern should be independent of the input modality. However, previous support for these theories is based on studies using single words (e.g., Willems et al., 2010; Moseley et al., 2012) instead of sentences, which might increase the task effort and specifically trigger motoric simulation. Thus, one explanation for the discrepancy between studies could be that we investigated the processing of tool-use information in a sentence context (see Tremblay and Small, 2011). Here, motoric simulation might not be necessary since contextual information facilitates semantic access (e.g., the blacksmith primes the hammer).

Our results are also in line with a recent mathematically motivated language-cognition model proposed by Perlovsky and Ilin (2013). This model suggests that high-level abstract thinking relies on the language system, whereas low-level and concrete thinking does not necessarily have to. Transferred to a neural perspective, both abstract meaning (irrespective of input modality) and language (processing) would recruit similar neural networks. In our experiment, the left-lateralized network for abstract meaning comprehension fits this prediction perfectly. Although it still remains unclear how language and higher-level thinking are related at a functional level, our study provides initial neural evidence, which closely connects the two different domains.

#### **CONCLUSION**

Language is not only a communication device, but also a fundamental part of cognition and learning concepts, especially with respect to abstract concepts (Perlovsky and Ilin, 2013). In recent years the understanding of speech and gesture processing has increased; both communication channels have been disentangled and brought together again. Here we investigated the neural correlates of abstractness (abstract vs. concrete) and modality (speech vs. gestures), to demonstrate the existence of an abstractness-specific supramodal neural network.

In fact, we could demonstrate the activation of a supramodal network for abstract speech and abstract gesture semantics. The identified left-lateralized fronto-temporal network not only maps sound patterns and their corresponding abstract meanings in the auditory domain, but also combines gestures and their abstract meanings in the gestural-visual domain. This modality-independent network most likely gets input from modality-specific areas in the superior temporal (speech) and occipito-temporal brain regions (gestures), where the main characteristics of the spoken and gestured signals are decoded. The inferior frontal regions are responsible for the process of selection and integration, relying on more general world knowledge distributed throughout the brain (Xu et al., 2009). The challenge for future studies will be the identification of specific aspects of speech and gesture semantics or the respective format relevant for the understanding of natural receptive and productive communicative behavior and its dysfunctions in patients, for example with schizophrenia or autism (Hubbard et al., 2012; Straube et al., 2013a,b).

#### **ACKNOWLEDGMENTS**

This research project is supported by a grant from the "Von Behring-Röntgen-Stiftung" (project no. 59-0002) and by the "Deutsche Forschungsgemeinschaft" (project no. DFG: Ki 588/6-1). Yifei He and Helge Gebhardt are supported by the "Von Behring-Röntgen-Stiftung" (project no. 59-0002). Arne Nagels and Miriam Steines are supported by the DFG (project no. Ki 588/6-1). Benjamin Straube is supported by the BMBF (project no. 01GV0615).

#### **REFERENCES**

et al. (2009). Gesture and metaphor comprehension: electrophysiological evidence of cross-modal coordination by audiovisual stimulation. *Brain Cogn.* 70, 42–52. doi: 10.1016/j.bandc.2008.12.005

994–1005. doi: 10.1016/j.neuroimage.2009.08.001


fMRI investigation of the neural correlates underlying the processing of novel metaphoric expressions. *Brain Lang.* 100, 115–126. doi: 10.1016/j.bandl.2005.10.005


speech and gesture in schizophrenia: evidence for differential processing of metaphoric gestures. *Hum. Brain Mapp.* 34, 1696–1712. doi: 10.1002/hbm.22015


brain activity by varying semantic distances. *Cogn. Behav. Neurol.* 19, 194–201.


left hemisphere language areas: phonology, semantics, and sentence processing. *Neuroimage* 30, 1414–1432. doi: 10.1016/j.neuroimage.2005.11.002


*Neuroimage* 47, 1992–2004. doi: 10.1016/j.neuroimage.2009.05.066

Xu, J., Gannon, P. J., Emmorey, K., Smith, J. F., and Braun, A. R. (2009). Symbolic gestures and spoken language are processed by a common neural system. *Proc. Natl. Acad. Sci. U.S.A.* 106, 20664–20669. doi: 10.1073/pnas.0909197106

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 09 July 2013; accepted: 24 August 2013; published online: 13 September 2013.*

*Citation: Straube B, He Y, Steines M, Gebhardt H, Kircher T, Sammer G and Nagels A (2013) Supramodal neural processing of abstract information conveyed by speech and gesture. Front. Behav. Neurosci. 7:120. doi: 10.3389/fnbeh.2013.00120*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2013 Straube, He, Steines, Gebhardt, Kircher, Sammer and Nagels. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Reinforcement and inference in cross-situational word learning

#### *Paulo F. C. Tilles and José F. Fontanari\**

*Departamento de Física e Informática, Instituto de Física de São Carlos, Universidade de São Paulo, São Carlos, Brazil*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Kenny Smith, University of Edinburgh, UK Angelo Cangelosi, University of Plymouth, UK George Kachergis, Leiden University, Netherlands*

#### *\*Correspondence:*

*José F. Fontanari, Departamento de Física e Informática, Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, 13560-970 São Carlos, Brazil e-mail: fontanari@ifsc.usp.br*

Cross-situational word learning is based on the notion that a learner can determine the referent of a word by finding something in common across many observed uses of that word. Here we propose an adaptive learning algorithm that contains a parameter that controls the strength of the reinforcement applied to associations between concurrent words and referents, and a parameter that regulates inference, which includes built-in biases, such as mutual exclusivity, and information of past learning events. By adjusting these parameters so that the model predictions agree with data from representative experiments on cross-situational word learning, we were able to explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference. These strategies can vary wildly depending on the conditions of the experiments. For instance, for fast mapping experiments (i.e., the correct referent could, in principle, be inferred in a single observation) inference is prevalent, whereas for segregated contextual diversity experiments (i.e., the referents are separated in groups and are exhibited with members of their groups only) reinforcement is predominant. Other experiments are explained with more balanced doses of reinforcement and inference.

**Keywords: statistical learning, word learning, cross-situational learning, associative learning, mutual exclusivity**

#### **1. INTRODUCTION**

A desirable goal of a psychological theory is to offer explanations grounded on elementary principles for the data available from psychology experiments (Newell, 1994). Although most of these quantitative psychological data are related to mental chronometry and memory accuracy, recent explorations of human performance in acquiring an artificial lexicon under controlled laboratory conditions have paved the way to the understanding of the learning strategies humans use to infer a word-object mapping (Yu and Smith, 2007; Kachergis et al., 2009; Smith et al., 2011; Kachergis et al., 2012; Yu and Smith, 2012a). These experiments are based on the cross-situational word-learning paradigm, which avers that a learner can determine the meaning of a word by finding something in common across all observed uses of that word (Gleitman, 1990; Pinker, 1990). In that sense, learning takes place through the statistical sampling of the contexts in which a word appears, in accord with the classical associationist stance of Hume and Locke that the mechanism of word learning is sensitivity to covariation: if two events occur at the same time, they become associated (Bloom, 2000).

In a typical cross-situational word-learning experiment, participants are exposed repeatedly to multiple unfamiliar objects concomitantly with multiple spoken pseudo-words, such that a word and its correct referent (object) always appear together on a learning trial. Different trials exhibiting distinct word-object pairs will eventually allow the disambiguation of the word-object associations and the learning of the correct mapping (Yu and Smith, 2007). However, it is questionable whether this scenario is suitable to describe the actual word-learning process of children, even in the unambiguous situation where a single novel object is followed by the utterance of its corresponding pseudo-word. In fact, it was shown that young children will only make the connection between the object and the word provided they have a reason to believe that they are in the presence of an act of naming, and for this the speaker has to be present (Baldwin et al., 1996; Bloom, 2000; Waxman and Gelman, 2009). Adults could learn those associations either because they were previously instructed by the experimenter that they would be learning which words go with which objects or because they could infer that the disembodied voice is an act of naming by a concealed person. Although there have been claims that cross-situational statistical learning is part of the repertoire of young word learners (Yu and Smith, 2008), the effect of individual differences in attention and vocabulary development of the infants considerably complicates this issue, which is still a matter for debate (Yu and Smith, 2012b; Smith and Yu, 2013).

There are several other alternative or complementary approaches to the statistical learning formulation of language acquisition considered in this paper. For instance, the social-pragmatic hypothesis claims that the child makes the connections between words and their referents by understanding the referential intentions of others. This approach, which seems to be originally due to Augustine, implies that children use intuitive psychology to "read" the adults' minds (Bloom, 2000). A more recent approach that explores the grounding of language in perception and action has proved effective in the design of linguistic capabilities in humanoid cognitive robots (Cangelosi et al., 2007; Cangelosi, 2010; Pezzulo et al., 2013) as well as in the support of word learning by toddlers through the stabilization of their attention on the selected object (Yu and Smith, 2012b). In contrast with the unsupervised cross-situational learning scheme, the scenario known as operant conditioning involves the active participation of the agents in the learning process, with exchange of non-linguistic cues to provide feedback on the learner's inferences. This supervised learning scheme has been applied to the design of a system for communication by autonomous robots in the Talking Heads experiments (Steels, 2003). We note that a comparison between the cross-situational and operant conditioning learning schemes indicates that they perform similarly in the limit of very large lexicon sizes (Fontanari and Cangelosi, 2011).

As our goal is to interpret the learning performance of adults using a few plausible reasoning tenets, here we assume that in order to learn a word-object mapping within the cross-situational word-learning scenario the learner should be able to (i) recall at least a fraction of the word-object pairings that appeared in the learning trials, (ii) register both co-occurrences and non-co-occurrences of words and objects and (iii) apply the mutual exclusivity principle which favors the association of novel words to novel objects (Markman and Wachtel, 1988). Of course, we note that a hypothetical learner could achieve cross-situational learning solely by registering and recalling co-occurrences of words and objects without carrying out any inferential reasoning (Blythe et al., 2010; Tilles and Fontanari, 2012a), but we find it implausible that human learners would not reap the benefits (e.g., fast mapping) of employing mutual exclusivity (Vogt, 2012; Reisenauer et al., 2013).

In this paper we offer an adaptive learning algorithm that comprises two parameters which regulate the associative reinforcement of pairings between concurrent words and objects, and the non-associative inference process that handles built-in biases (e.g., mutual exclusivity) as well as information of past learning events. By setting the values of these parameters so as to fit a representative selection of experimental data presented in Kachergis et al. (2009, 2012) we are able to identify and explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference.

#### **2. CROSS-SITUATIONAL LEARNING SCENARIO**

We assume there are $N$ objects $o_1, \ldots, o_N$, $N$ words $w_1, \ldots, w_N$ and a one-to-one mapping between words and objects represented by the set $\Gamma = \{(w_1, o_1), \ldots, (w_N, o_N)\}$. At each learning trial, $C$ word-object pairs are selected from $\Gamma$ and presented to the learner without providing any clue on which word goes with which object. For instance, pictures of the $C$ objects are displayed in a slide while $C$ pseudo-words are spoken sequentially such that their spatial and temporal arrangements do not give away the correct word-object associations (Yu and Smith, 2007; Kachergis et al., 2009). We refer to the subset of words and their referents (objects) presented to the learner in a learning trial as the context $\Omega = \{w_1, o_1, w_2, o_2, \ldots, w_C, o_C\}$. The context size $C$ is then a measure of the within-trial ambiguity, i.e., the number of co-occurring word-object pairs per learning trial. The selection procedure from the set $\Gamma$, which may favor some particular subsets of word-object pairs, determines the different experimental setups discussed in this paper. Although each individual trial is highly ambiguous, repetition of trials with partially overlapping contexts should in principle allow the learning of the $N$ word-object associations.
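The trial structure described above can be sketched in a few lines. This is a minimal illustration, assuming uniform sampling of pairs without replacement within a trial (the experiments discussed in the paper may bias the selection toward particular subsets); the function names, the labels `w1`/`o1`, and the values of the lexicon size, context size and number of trials are arbitrary choices for the example.

```python
import random

def make_lexicon(n_words):
    """One-to-one mapping: word w_i refers to object o_i."""
    return [(f"w{i}", f"o{i}") for i in range(1, n_words + 1)]

def sample_context(lexicon, context_size, rng):
    """Draw C word-object pairs and shuffle words and objects independently,
    so that presentation order gives no clue about the correct pairing."""
    pairs = rng.sample(lexicon, context_size)
    words = [w for w, _ in pairs]
    objects = [o for _, o in pairs]
    rng.shuffle(words)
    rng.shuffle(objects)
    return words, objects

rng = random.Random(0)            # fixed seed for reproducibility
lexicon = make_lexicon(18)        # illustrative lexicon size
trials = [sample_context(lexicon, 4, rng) for _ in range(27)]

for words, objects in trials[:3]:
    print(words, objects)
```

Each trial contains exactly the referents of its words, but the learner only sees two unordered collections, which is the source of the within-trial ambiguity.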

After the training stage is completed, which typically comprises about two dozen trials, the learning accuracy is measured by instructing the learner to pick the object among the *N* objects on display which the learner thinks is associated to a particular target word. The test is repeated for all *N* words and the average learning accuracy calculated as the fraction of correct guesses (Kachergis et al., 2009).

This cross-situational learning scenario does not account for the presence of noise, such as the effect of out-of-context words. This situation can be modeled by assuming that there is a certain probability (noise) that the referent of one of the spoken words is not part of the context (so that word can be said to be out of context). Although theoretical analysis shows that there is a maximum noise intensity beyond which statistical learning is unattainable (Tilles and Fontanari, 2012b), as yet no experiment was carried out to verify the existence of this threshold phenomenon on the learning performance of human subjects.

#### **3. MODEL**

We model learning as a change in the confidence with which the algorithm (or, for simplicity, the learner) associates the word $w_i$ to an object $o_j$ that results from the observation and analysis of the contexts presented in the learning trials. More to the point, this confidence is represented by the probability $P_t(w_i, o_j)$ that $w_i$ is associated to $o_j$ at learning trial $t$. This probability is normalized such that $\sum_{o_j} P_t(w_i, o_j) = 1$ for all $w_i$ and $t > 0$, which then implies that when the word $w_i$ is presented to the learner in the testing stage the learning accuracy is given simply by $P_t(w_i, o_i)$. In addition, we assume that $P_t(w_i, o_j)$ contains information presented in the learning trials up to and including trial $t$ only.

If at learning trial $t$ the learner observes the context $\Omega_t = \{w_1, o_1, w_2, o_2, \ldots, w_C, o_C\}$ then it can infer the existence of two other informative sets. First, the set of the words (and their referents) that appear for the first time at trial $t$, which we denote by $\tilde{\Omega}_t = \{\tilde{w}_1, \tilde{o}_1, \tilde{w}_2, \tilde{o}_2, \ldots, \tilde{w}_{\tilde{C}_t}, \tilde{o}_{\tilde{C}_t}\}$. Clearly, $\tilde{\Omega}_t \subseteq \Omega_t$ and $\tilde{C}_t \leq C$. Second, the set of words (and their referents) that do not appear in $\Omega_t$ but that have already appeared in the previous trials, $\bar{\Omega}_t = \{\bar{w}_1, \bar{o}_1, \ldots, \bar{w}_{N_t - C}, \bar{o}_{N_t - C}\}$, where $N_t$ is the total number of different words that appeared in contexts up to and including trial $t$. Clearly, $\bar{\Omega}_t \cap \Omega_t = \emptyset$. The update rule of the confidences $P_t(w_i, o_j)$ depends on which of these three sets the word $w_i$ and the object $o_j$ belong to (if $i \neq j$ they may belong to different sets). In fact, our learning algorithm comprises a parameter $\chi \in [0, 1]$ that measures the associative reinforcement capacity and applies only to known words that appear in the current context, and a parameter $\beta \in [0, 1]$ that measures the inference capacity and applies either to known words that do not appear in the current context or to new words in the current context. Before the experiment begins ($t = 0$) we set $P_0(w_i, o_j) = 0$ for all words $w_i$ and objects $o_j$. Next we describe how the confidences are updated following the sequential presentation of contexts.
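The bookkeeping behind these sets can be made concrete with a small sketch. It tracks words only, which suffices here because each word always appears together with its referent; the function name `partition_trial` and the toy trial history are hypothetical.

```python
def partition_trial(context_words, seen_words):
    """Split the learner's vocabulary for one trial into three word sets.

    new          : words appearing for the first time (the 'tilde' set)
    repeated     : known words present in the current context
    absent_known : known words missing from the current context (the 'bar' set)
    """
    context = set(context_words)
    new = context - seen_words
    repeated = context & seen_words
    absent_known = seen_words - context
    return new, repeated, absent_known

seen = set()
history = [["w1", "w2", "w3"], ["w2", "w3", "w4"], ["w1", "w4", "w5"]]
partitions = []
for t, words in enumerate(history, start=1):
    new, repeated, absent = partition_trial(words, seen)
    seen |= set(words)
    # N_t distinct words seen so far, C in the context: the 'bar' set has N_t - C members.
    assert len(absent) == len(seen) - len(words)
    partitions.append((new, repeated, absent))
    print(t, sorted(new), sorted(repeated), sorted(absent))
```

In this toy history every word is new at the first trial, whereas at the third trial only `w5` is new while `w2` and `w3` are known but absent from the context.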

In the first trial (*t* = 1) all words are new (*C̃*1 = *N*1 = *C*), so we set

$$P\_1\left(\tilde{w}\_i, \tilde{o}\_j\right) = \frac{1}{C} \tag{1}$$

for *w̃i, õj* ∈ Ω̃<sub>1</sub> = Ω<sub>1</sub>. In the second or in an arbitrary trial *t* we expect to observe contexts exhibiting both novel and repeated words. Novel words must go through an inference preprocessing stage before the reinforcement procedure can be applied to them. This is so because if *w̃i* appears for the first time at trial *t* then *P<sub>t−1</sub>(w̃i, oj)* = 0 for all objects *oj* and since the reinforcement is proportional to *P<sub>t−1</sub>(w̃i, oj)* the confidences associated to *w̃i* would never be updated (see Equation 5 and the explanation thereafter). Thus, when a novel word *w̃i* appears at trial *t* ≥ 2, we redefine its confidence values at the previous trial (originally set to zero) as

$$P\_{t-1} \left( \tilde{w}\_i, \tilde{o}\_j \right) = \frac{\beta}{\tilde{C}\_t} + \frac{1-\beta}{N\_{t-1} + \tilde{C}\_t},\tag{2}$$

$$P\_{t-1} \left( \tilde{w}\_i, o\_j \right) = \frac{1 - \beta}{N\_{t-1} + \tilde{C}\_t},\tag{3}$$

$$P\_{t-1} \left( \tilde{\boldsymbol{w}}\_i, \bar{\boldsymbol{o}}\_j \right) = \frac{1 - \beta}{N\_{t-1} + \tilde{C}\_t}. \tag{4}$$

On the one hand, setting the inference parameter β to its maximum value β = 1 enforces the mutual exclusivity principle which requires that the new word *w̃i* be associated with equal probability to the *C̃t* new objects *õj* in the current context. Hence in the case *C̃t* = 1 the meaning of the new word would be inferred in a single presentation. On the other hand, for β = 0 the new word is associated with equal probability to all objects already seen up to and including trial *t*, i.e., to *Nt* = *N<sub>t−1</sub>* + *C̃t* objects. Intermediate values of β describe a situation of imperfect inference. Note that using Equations 2–4 we can easily verify that ∑<sub>*õj*</sub> *P<sub>t−1</sub>(w̃i, õj)* + ∑<sub>*oj*</sub> *P<sub>t−1</sub>(w̃i, oj)* + ∑<sub>*ōj*</sub> *P<sub>t−1</sub>(w̃i, ōj)* = 1, in accord with the normalization constraint.
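The preprocessing of a novel word can be sketched in a few lines of Python. This is only an illustrative sketch of Equations 2–4: the function name `init_novel_word` and the representation of objects as set elements are assumptions made here, not part of the original model description.

```python
def init_novel_word(novel_objects, seen_objects, beta):
    """Preprocessed confidences P_{t-1}(w~, o) for one novel word (Eqs. 2-4).

    novel_objects: objects appearing for the first time at trial t (C~_t of them)
    seen_objects:  objects that appeared in trials up to t-1 (N_{t-1} of them)
    beta:          inference parameter in [0, 1]
    The two sets are assumed to be disjoint (hypothetical representation).
    """
    all_objects = seen_objects | novel_objects
    n_total = len(all_objects)              # N_{t-1} + C~_t
    base = (1.0 - beta) / n_total           # common term of Eqs. 2, 3 and 4
    conf = {obj: base for obj in all_objects}
    for obj in novel_objects:               # extra inference term of Eq. 2
        conf[obj] += beta / len(novel_objects)
    return conf


# One novel object and perfect inference (beta = 1): single-shot learning.
conf = init_novel_word({"o5"}, {"o1", "o2", "o3", "o4"}, beta=1.0)
print(conf["o5"])  # 1.0
```

For intermediate values of β the returned confidences still sum to one, as required by the normalization constraint.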

Now we can focus on the update rule of the confidence *Pt(wi, oj)* in the case both word *wi* and object *oj* appear in the context at trial *t*. The rule applies both to repeated and novel words, provided the confidences of the novel words are preprocessed according to Equations 2–4. In order to fulfill automatically the normalization condition for word *wi*, the increase of the confidence *Pt(wi, oj)* with *oj* ∈ Ω<sub>*t*</sub> must be compensated by the decrease of the confidences *Pt(wi, ōj)* with *ōj* ∈ Ω̄<sub>*t*</sub>. This can be implemented by distributing evenly the total flux of probability out of the latter confidences, i.e., ∑<sub>*ōj*∈Ω̄<sub>*t*</sub></sub> *P<sub>t−1</sub>(wi, ōj)*, over the confidences *Pt(wi, oj)* with *oj* ∈ Ω<sub>*t*</sub>. Hence the net gain of confidence on the association between *wi* and *oj* is given by

$$r\_{t-1}(w\_i, o\_j) = \chi P\_{t-1}(w\_i, o\_j) \frac{\sum\_{\bar{o}\_j \in \bar{\Omega}\_t} P\_{t-1}(w\_i, \bar{o}\_j)}{\sum\_{o\_j \in \Omega\_t} P\_{t-1}(w\_i, o\_j)} \tag{5}$$

where, as mentioned before, the parameter χ ∈ [0, 1] measures the strength of the reinforcement process. Note that if both *oj* and *ok* appear in the context together with *wi* then the reinforcement procedure should not create any distinction between the associations *(wi, oj)* and *(wi, ok)*. This result is achieved provided that the ratio of the confidence gains equals the ratio of the confidences before reinforcement, i.e., *r<sub>t−1</sub>(wi, oj)/r<sub>t−1</sub>(wi, ok)* = *P<sub>t−1</sub>(wi, oj)/P<sub>t−1</sub>(wi, ok)*. This is the reason that the reinforcement gain of a word-object association given by Equation 5 is proportional to the previous confidence on that association. The total increase in the confidences between *wi* and the objects that appear in the context, i.e., ∑<sub>*oj*∈Ω<sub>*t*</sub></sub> *r<sub>t−1</sub>(wi, oj)*, equals the product of χ and the total confidence previously assigned to the associations between *wi* and the objects that do not appear in the context, i.e., ∑<sub>*ōj*∈Ω̄<sub>*t*</sub></sub> *P<sub>t−1</sub>(wi, ōj)*. So for χ = 1 the confidences associated to objects absent from the context are fully transferred to the confidences associated to objects present in the context. Lower values of χ allow us to control the flow of confidence from objects in Ω̄<sub>*t*</sub> to objects in Ω<sub>*t*</sub>.
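The statement about the total gain can be checked directly from Equation 5. As a short worked step (the summation index inside the fraction is renamed *k* here only to avoid a clash with the outer index), summing the gains over the objects in the context gives

$$\sum\_{o\_j \in \Omega\_t} r\_{t-1}(w\_i, o\_j) = \chi \, \frac{\sum\_{\bar{o}\_k \in \bar{\Omega}\_t} P\_{t-1}(w\_i, \bar{o}\_k)}{\sum\_{o\_k \in \Omega\_t} P\_{t-1}(w\_i, o\_k)} \sum\_{o\_j \in \Omega\_t} P\_{t-1}(w\_i, o\_j) = \chi \sum\_{\bar{o}\_k \in \bar{\Omega}\_t} P\_{t-1}(w\_i, \bar{o}\_k),$$

since the sum over the context cancels the denominator. The fraction is the same for every object in the context, so the ratio of two gains reduces to the ratio of the corresponding previous confidences.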

Most importantly, in order to implement the reinforcement process the learner should be able to gauge the relevance of the information about the previous trials, which is condensed on the confidence values *Pt(wi, oj)*. The gauging of this information is quantified by the word and trial dependent quantity α*t(wi)* ∈ [0, 1] that allows for the interpolation between the cases of maximum relevance (α*t(wi)* = 1) and complete irrelevancy (α*t(wi)* = 0) of the information stored in the confidences *Pt(wi, oj)*. In particular, we assume that the greater the certainty on the association between word *wi* and its referent, the more relevant that information is to the learner. A quantitative measure of the uncertainty associated to the confidences regarding word *wi* is given by the entropy

$$H\_t(w\_i) = -\sum\_{o\_j \in \Omega\_t \cup \bar{\Omega}\_t} P\_t(w\_i, o\_j) \log \left[ P\_t(w\_i, o\_j) \right] \tag{6}$$

whose maximum (log *Nt*) is obtained by the uniform distribution *Pt(wi, oj)* = 1*/Nt* for all *oj* ∈ Ω<sub>*t*</sub> ∪ Ω̄<sub>*t*</sub>, and whose minimum (0) by *Pt(wi, oj)* = 1 and *Pt(wi, ok)* = 0 for *ok* ≠ *oj*. So we define

$$\alpha\_t(\boldsymbol{w}\_i) = \alpha\_0 + (1 - \alpha\_0) \left[ 1 - \frac{H\_t(\boldsymbol{w}\_i)}{\log N\_t} \right],\tag{7}$$

where α<sup>0</sup> ∈ [0*,* 1] is a baseline information gauge factor corresponding to the maximum uncertainty about the referent of a target word.
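Equations 6 and 7 translate into a short computation. The sketch below is illustrative only: the function names are hypothetical, the confidences of one word are passed as a plain list, and the handling of the case *Nt* = 1 (where log *Nt* = 0) is an assumption not discussed in the text.

```python
import math


def entropy(conf):
    """Shannon entropy of one word's confidences (Eq. 6); 0*log(0) is taken as 0."""
    return -sum(p * math.log(p) for p in conf if p > 0.0)


def gauge_factor(conf, alpha0):
    """Information gauge factor alpha_t(w_i) of Eq. 7.

    conf:   confidences P_t(w_i, o_j) over all N_t objects seen so far
    alpha0: baseline gauge factor in [0, 1]
    """
    n = len(conf)
    if n < 2:
        # Assumption: with a single known object there is no uncertainty.
        return 1.0
    return alpha0 + (1.0 - alpha0) * (1.0 - entropy(conf) / math.log(n))


# Maximum uncertainty gives the baseline, full certainty gives 1.
print(gauge_factor([0.25, 0.25, 0.25, 0.25], alpha0=0.3))
print(gauge_factor([1.0, 0.0, 0.0, 0.0], alpha0=0.3))
```

The first call returns a value equal to α<sup>0</sup> up to rounding, the second returns 1, matching the two limiting cases described above.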

Finally, recalling that at trial *t* the learner has access to the sets Ω<sub>*t*</sub>, Ω̄<sub>*t*</sub> as well as to the confidences at trial *t* − 1 we write the update rule

$$P\_t\left(w\_i, o\_j\right) = P\_{t-1}\left(w\_i, o\_j\right) + \alpha\_{t-1}(w\_i) \, r\_{t-1}\left(w\_i, o\_j\right) + \left[1 - \alpha\_{t-1}\left(w\_i\right)\right] \left[\frac{1}{N\_t} - P\_{t-1}\left(w\_i, o\_j\right)\right] \tag{8}$$

for *wi, oj* ∈ Ω<sub>*t*</sub>. Note that if α<sub>*t*−1</sub>*(wi)* = 0 the learner would associate word *wi* to all objects that have appeared up to and including trial *t* with equal probability. This situation happens only if α<sup>0</sup> = 0 and if there is complete uncertainty about the referent of word *wi*. Hence the quantity α*t(wi)* determines the extent to which the previous confidences on associations involving word *wi* influence the update of those confidences.

Now we consider the update rule for the confidence *Pt(wi, ōj)* in the case that word *wi* appears in the context at trial *t* but object *ōj* does not. (We recall that object *ōj* must have appeared in some previous trial.) According to the reasoning that led to Equation 5 this confidence must decrease by the amount χ*P<sub>t−1</sub>(wi, ōj)* and so, taking into account the information gauge factor, we obtain

$$P\_t(w\_i, \bar{o}\_j) = P\_{t-1}(w\_i, \bar{o}\_j) - \alpha\_{t-1}(w\_i) \, \chi P\_{t-1}(w\_i, \bar{o}\_j) + \left[1 - \alpha\_{t-1}(w\_i)\right] \left[\frac{1}{N\_t} - P\_{t-1}(w\_i, \bar{o}\_j)\right] \tag{9}$$

which can be easily seen to satisfy the normalization

$$\sum\_{o\_j \in \Omega\_t} P\_t(w\_i, o\_j) + \sum\_{\bar{o}\_j \in \bar{\Omega}\_t} P\_t(w\_i, \bar{o}\_j) = 1. \tag{10}$$
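The normalization can be verified in one line. As a worked check, write *S* and *S̄* (shorthand introduced here, not in the original notation) for the total previous confidence of *wi* on the objects in Ω<sub>*t*</sub> and in Ω̄<sub>*t*</sub>, respectively, and assume that the previous confidences are normalized, *S* + *S̄* = 1, and that Ω<sub>*t*</sub> ∪ Ω̄<sub>*t*</sub> contains *Nt* objects. Summing Equation 8 over Ω<sub>*t*</sub> and Equation 9 over Ω̄<sub>*t*</sub> then gives

$$\sum\_{o\_j} P\_t(w\_i, o\_j) = (S + \bar{S}) + \alpha\_{t-1}(w\_i) \left[ \sum\_{o\_j \in \Omega\_t} r\_{t-1}(w\_i, o\_j) - \chi \bar{S} \right] + \left[1 - \alpha\_{t-1}(w\_i)\right] \left[ N\_t \frac{1}{N\_t} - (S + \bar{S}) \right] = 1 + 0 + 0 = 1,$$

where the first bracket vanishes because the gains of Equation 5 add up to χ*S̄*, and the second vanishes because *S* + *S̄* = 1.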

We focus now on the update rule for the confidence *Pt(w̄i, ōj)* with *w̄i, ōj* ∈ Ω̄<sub>*t*</sub>, i.e., both the word *w̄i* and the object *ōj* are absent from the context shown at trial *t*, but they have already appeared, not necessarily together, in previous trials. An inference reasoning similar to the one that led to the expressions for the preprocessing of new words would allow the learner to conclude that a word absent from the context should be associated to an object that is also absent from it. In that sense, confidence should flow from the associations between *w̄i* and objects *oj* ∈ Ω<sub>*t*</sub> to the associations between *w̄i* and objects *ōj* ∈ Ω̄<sub>*t*</sub>. Hence, ignoring the information gauge factor for the moment, the net gain to confidence *Pt(w̄i, ōj)* is given by

$$\bar{r}\_{t-1}(\bar{w}\_{i},\bar{o}\_{j}) = \beta P\_{t-1}(\bar{w}\_{i},\bar{o}\_{j})\,\frac{\sum\_{o\_{j}\in\Omega\_{t}}P\_{t-1}(\bar{w}\_{i},o\_{j})}{\sum\_{\bar{o}\_{j}\in\bar{\Omega}\_{t}}P\_{t-1}(\bar{w}\_{i},\bar{o}\_{j})}.\tag{11}$$

The direct proportionality of this gain to *P<sub>t−1</sub>(w̄i, ōj)* can be justified by an argument similar to that used to justify Equation 5 in the case of reinforcement. The information relevance issue is also handled in a similar manner so the desired update rule reads

$$P\_t(\bar{w}\_i, \bar{o}\_j) = P\_{t-1}(\bar{w}\_i, \bar{o}\_j) + \alpha\_{t-1}(\bar{w}\_i) \, \bar{r}\_{t-1}(\bar{w}\_i, \bar{o}\_j) + \left[1 - \alpha\_{t-1}(\bar{w}\_i)\right] \left[ \frac{1}{N\_t} - P\_{t-1}(\bar{w}\_i, \bar{o}\_j) \right] \tag{12}$$

for *w̄i, ōj* ∈ Ω̄<sub>*t*</sub>. To ensure normalization the confidence *Pt(w̄i, oj)* must decrease by an amount proportional to β*P<sub>t−1</sub>(w̄i, oj)* so that

$$P\_t(\bar{w}\_i, o\_j) = P\_{t-1}(\bar{w}\_i, o\_j) - \alpha\_{t-1}(\bar{w}\_i) \, \beta P\_{t-1}(\bar{w}\_i, o\_j) + \left[ 1 - \alpha\_{t-1}(\bar{w}\_i) \right] \left[ \frac{1}{N\_t} - P\_{t-1}(\bar{w}\_i, o\_j) \right] \tag{13}$$

for *w̄i* ∈ Ω̄<sub>*t*</sub> and *oj* ∈ Ω<sub>*t*</sub>. We can verify that prescriptions (12) and (13) satisfy the normalization

$$\sum\_{\bar{o}\_j \in \bar{\Omega}\_t} P\_t(\bar{w}\_i, \bar{o}\_j) + \sum\_{o\_j \in \Omega\_t} P\_t(\bar{w}\_i, o\_j) = 1,\tag{14}$$

as expected.
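The four update rules share the same structure, which the following Python sketch makes explicit. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `update_row`, the dictionary representation, and the treatment of the degenerate case in which the gaining side carries zero confidence are all choices made here. The gauge factor α<sub>*t*−1</sub> is passed in as a number.

```python
def update_row(p_prev, in_context, word_present, alpha, chi, beta):
    """One-trial update of all confidences of a single word (Eqs. 5, 8, 9, 11-13).

    p_prev:       dict object -> P_{t-1}(w, o) over the N_t objects seen so far
                  (novel words are assumed to be preprocessed with Eqs. 2-4)
    in_context:   set of objects shown at trial t (Omega_t)
    word_present: True if the word is in the context, False if it is absent
    alpha:        gauge factor alpha_{t-1}(w) of Eq. 7
    chi, beta:    reinforcement and inference parameters
    """
    n_t = len(p_prev)
    mass_in = sum(p for o, p in p_prev.items() if o in in_context)
    mass_out = sum(p for o, p in p_prev.items() if o not in in_context)
    # A present word gains on context objects (rate chi); an absent word
    # gains on absent objects (rate beta).
    if word_present:
        rate, source, target = chi, mass_out, mass_in
    else:
        rate, source, target = beta, mass_in, mass_out
    if target > 0.0:
        gain_scale, loss_rate = rate * source / target, rate
    else:
        # Assumption: nothing to reinforce, so no confidence is moved.
        gain_scale, loss_rate = 0.0, 0.0
    p_new = {}
    for obj, p in p_prev.items():
        gaining = (obj in in_context) == word_present
        delta = gain_scale * p if gaining else -loss_rate * p
        p_new[obj] = p + alpha * delta + (1.0 - alpha) * (1.0 / n_t - p)
    return p_new


prev = {"o1": 0.4, "o2": 0.3, "o3": 0.2, "o4": 0.1}
new = update_row(prev, {"o1", "o2"}, word_present=True, alpha=0.8, chi=0.5, beta=0.9)
print(sum(new.values()))  # normalization is preserved up to rounding
```

Writing the gain and loss as a single signed `delta` keeps the normalization argument of Equations 10 and 14 visible in the code: the gains and losses cancel, and the last term only pulls each confidence toward 1/*Nt*.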

In summary, before any trial (*t* = 0) we set all confidence values to zero, i.e., *P*0*(wi, oj)* = 0, and fix the values of the parameters α<sup>0</sup>, χ and β. In the first trial (*t* = 1) we set the confidences of the words and objects in Ω<sub>1</sub> according to Equation (1), so we have the values of *P*1*(wi, oj)* for *wi, oj* ∈ Ω<sub>1</sub>. In the second trial, we separate the novel words *w̃i* ∈ Ω̃<sub>2</sub> and reset *P*1*(w̃i, oj)* with *oj* ∈ Ω<sub>2</sub> ∪ Ω̄<sub>2</sub> according to Equations 2–4. Only then do we calculate α1*(wi)* with *wi* ∈ Ω<sub>1</sub> ∪ Ω̃<sub>2</sub> using Equation (7). The confidences at trial *t* = 2 then follow from Equations (8), (9), (12), and (13). As before, in the third trial we separate the novel words *w̃i* ∈ Ω̃<sub>3</sub>, reset *P*2*(w̃i, oj)* with *oj* ∈ Ω<sub>3</sub> ∪ Ω̄<sub>3</sub> according to Equations 2–4, calculate α2*(wi)* with *wi* ∈ Ω<sub>1</sub> ∪ Ω<sub>2</sub> ∪ Ω̃<sub>3</sub> using Equation (7), and only then resume the evaluation of the confidences at trial *t* = 3. This procedure is repeated until the training stage is completed, say, at *t* = *t*<sup>∗</sup>. At this point, knowledge of the confidence values *P<sub>t∗</sub>(wi, oj)* allows us to answer any question posed in the testing stage.

Our model borrows many features from other proposed models of word learning (Siskind, 1996; Fontanari et al., 2009; Frank et al., 2009; Fazly et al., 2010; Kachergis et al., 2012). In particular, the entropy expression (6) was used by Kachergis et al. (2012) to allocate attention trial-by-trial to the associations presented in the contexts. Here we use that expression to quantify the uncertainty associated to the various confidences in order to determine the extent to which those confidences are updated on a learning trial. A distinctive feature of our model is the update of associations that are not in the current trial according to Equation (12). In particular, we note that whereas *ad hoc* normalization can only decrease the confidences on associations between words and objects that did not appear in the current context, our update rule can increase those associations as well. The extent of this update is weighted by the inference parameter β and it allows the application of mutual exclusivity to associations that are not shown in the current context. In fact, the splitting of mental processes into two classes, namely, reinforcement processes that update associations in the current context and inference processes that update the other associations, is the main thrust of our paper. In the next section we evaluate the adequacy of our model to describe a selection of cross-situational word-learning experiments carried out on adult subjects by Kachergis et al. (2009, 2012).

#### **4. RESULTS**

The cross-situational word-learning experiments of Kachergis et al. (2009, 2012) aimed to understand how word sampling frequency (i.e., number of trials in which a word appears), contextual diversity (i.e., the co-occurrence of distinct words or groups of words in the learning trials), within-trial ambiguity (i.e., the context size *C*), and fast-mapping of novel words affect the learning performance of adult subjects. In this section we compare the performance of the algorithm described in the previous section with the performance of adult subjects reported in Kachergis et al. (2009, 2012). In particular, once the conditions of the training stage are specified, we carry out 10<sup>4</sup> runs of our algorithm for fixed values of the three parameters α0, β, χ, and then calculate the average accuracy at trial *t* = *t* <sup>∗</sup> over all those runs for that parameter setting. Since the algorithm is deterministic, what changes in each run is the composition of the contexts at each learning trial. As our goal is to model the results of the experiments, we search the space of parameters to find the setting such that the performance of the algorithm matches that of humans within the error bars (i.e., one standard deviation) of the experiments.

#### **4.1. WORD SAMPLING FREQUENCY**

In these experiments the number of words (and objects) is *N* = 18 and the training stage totals *t*<sup>∗</sup> = 27 learning trials, with each trial comprising the presentation of 4 words together with their referents (*C* = 4). Following Kachergis et al. (2009), we investigate two conditions which differ with respect to the number of times a word is exhibited in the training stage. In the two-frequency condition, the 18 words are divided into two subsets of 9 words each. The words in the first subset appear 9 times and those in the second only 3 times. In the three-frequency condition, the 18 words are divided into three subsets of 6 words each. Words in the first subset appear 3 times, in the second, 6 times and in the third, 9 times. In these two conditions, the same word was not allowed to appear in two consecutive learning trials.
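A quick arithmetic check confirms that both conditions fill exactly the word slots available in the training stage, which number *t*<sup>∗</sup> × *C*:

$$27 \times 4 = 108, \qquad 9 \times 9 + 9 \times 3 = 81 + 27 = 108, \qquad 6 \times (3 + 6 + 9) = 6 \times 18 = 108.$$

The first expression counts the slots, the second the word presentations in the two-frequency condition, and the third those in the three-frequency condition.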

**Figures 1**, **2** summarize our main results for the two-frequency and three-frequency conditions, respectively. The left panels show the regions (shaded areas) in the *(*χ*,* β*)* plane for fixed α<sup>0</sup> where the algorithm describes the experimental data. We note that if those regions are located to the left of the diagonal χ = β then the inference process is dominant, whereas if they are to the right of the diagonal then reinforcement is the dominant process. The middle panels show the accuracy of the best fit as a function of the parameter α<sup>0</sup> and the right panels exhibit the values of χ and β corresponding to that fit. The broken horizontal lines and the shaded zones around them represent the means and standard deviations of the results of experiments carried out with 33 adult subjects (Kachergis et al., 2009).

It is interesting that although the words sampled more frequently are learned best in the two-frequency condition as expected, this advantage practically disappears in the three-frequency condition, in which case all words are learned at equal levels within the experimental error. Note that the average accuracy for the words sampled 3 times is actually greater than the accuracy for the words sampled 6 times. This inversion is not statistically significant but, most surprisingly, the algorithm does reproduce it for α<sup>0</sup> ∈ [0*.*7*,* 0*.*8]. According to Kachergis et al. (2009), the reason for the observed sampling frequency insensitivity might be that the high-frequency words are learned quickly and once they are learned subsequent trials containing those words will exhibit an effectively smaller within-trial ambiguity. In this vein, the inversion could be explained if by chance the words less frequently sampled were generally paired with the highly sampled words. Thus, contextual diversity seems to play a key role in cross-situational word learning.

#### **4.2. CONTEXTUAL DIVERSITY AND WITHIN-TRIAL AMBIGUITY**

In the first experiment aiming to probe the role of contextual diversity in cross-situational learning, the 18 words were divided in two groups of 6 and 12 words each, and the contexts of size *C* = 3 were formed with words belonging to the same group only. Since the sampling frequency was fixed to 6 repetitions for each word, those words belonging to the more numerous group are exposed to a larger contextual diversity (i.e., the variety of different words with which a given word appears in the course of the training stage). The results summarized in **Figure 3** indicate clearly that contextual diversity enhances the learning accuracy. Perhaps more telling is the finding that incorrect responses are largely due to misassignments to referents whose words belong to the same group as the test word. In particular, Kachergis et al. (2009) found that this type of error accounts for 56% of incorrect answers when the test word belongs to the 6-components subgroup and for 76% when it belongs to the 12-components subgroup. The corresponding statistics for our algorithm with the optimal parameters set at α<sup>0</sup> = 0*.*9 are 43% and 70%, respectively. The region in the space of parameters where the model can be said to describe the experimental data is greatly reduced in this experiment and even the best fit is barely within the error bars. It is interesting that, in contrast with the previous experiments, in this case the reinforcement procedure seems to play the more important role in the performance of the algorithm.

regions around them (one standard deviation). The blue symbols represent the accuracy for the group of words sampled 9 times whereas the red symbols represent the accuracy for the words sampled 3 times. **Right panel:** Parameters χ and β corresponding to the best fit shown in the middle panel. The other parameters are *N* = 18 and *C* = 4.

**FIGURE 2 | Summary of the results for the three-frequency condition experiment. Left panel:** Regions in the plane *(*χ*,* β*)* where the algorithm fits the experimental data for fixed α<sup>0</sup> as indicated in the figure. **Middle panel:** Average accuracy for the best fit to the results of Experiment 1 of Kachergis et al. (2009) represented by the broken horizontal lines (means) and shaded regions around them (one standard deviation). The blue symbols represent the accuracy for the group of words sampled 9 times, the green symbols for the words sampled 6 times, and the red symbols for the words sampled 3 times. **Right panel:** Parameters χ and β corresponding to the best fit shown in the middle panel. The other parameters are *N* = 18 and *C* = 4.

**FIGURE 3 | Summary of the results of the two-level contextual diversity experiment. Left panel:** Regions in the plane *(*χ*,* β*)* where the algorithm fits the experimental data for fixed α<sup>0</sup> as indicated in the figure. **Middle panel:** Average accuracy for the best fit to the results of Experiment 2 of Kachergis et al. (2009) represented by the broken horizontal lines (means) and shaded regions around them (one standard deviation). The blue symbols represent the accuracy for the group of words belonging to the 12-components subgroup and the red symbols for the words belonging to the 6-components subgroup. All words are repeated exactly 6 times during the *t*<sup>∗</sup> = 27 learning trials. **Right panel:** Parameters χ and β corresponding to the best fit shown in the middle panel. The other parameters are *N* = 18 and *C* = 3.

The effect of the context size or within-trial ambiguity is addressed by the experiment summarized in **Figure 4**, which is similar to the previous experiment, except that the words that compose the context are chosen uniformly from the entire repertoire of *N* = 18 words. Two context sizes are considered, namely, *C* = 3 and *C* = 4. In both cases, there is a large selection of parameter values that explain the experimental data, yielding results indistinguishable from the experimental average accuracies. This is the reason we do not exhibit a graph akin to those shown in the right panels of the previous figures. Since a perfect fit can be obtained both for χ *>* β and for χ *<* β, this experiment is uninformative with respect to these two abilities. As expected, increasing the within-trial ambiguity makes learning more difficult. In addition, the (experimental) results for *C* = 3 yield a learning accuracy value that is intermediate between those measured for the 6 and 12-components subgroups, which is in agreement with the conclusion that the increase of the contextual diversity enhances learning, since the mean number of different co-occurring words is 4*.*0 in the 6-components subgroup, 9*.*2 in the 12-components subgroup and 8*.*8 in the uniformly mixed situation (Kachergis et al., 2009).

#### **4.3. FAST MAPPING**

The experiments carried out by Kachergis et al. (2012) were designed to elicit participants' use of the mutual exclusivity principle (i.e., the assumption of one-to-one mappings between words and referents) and to test the flexibility of a learned word-object association when new evidence is provided in support of a many-to-many mapping. To see how mutual exclusivity implies fast mapping assume that a learner who knows the association *(w*1*, o*1*)* is exposed to the context Ω = {*w*1*, o*1*, w*2*, o*2} in which the word *w*<sup>2</sup> (and its referent) appears for the first time. Then it is clear that a mutual-exclusivity-biased learner would infer the association *(w*2*, o*2*)* in this single trial. However, a purely associative learner would give equal weights to *o*<sup>1</sup> and *o*<sup>2</sup> if asked about the referent of *w*2.
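This example can be worked through with the preprocessing rules of Equations 2–4. Assuming, purely for illustration, that *o*1 is the only object seen before this trial (so that *N<sub>t−1</sub>* = 1 and *C̃t* = 1), the initial confidences of the novel word are

$$P\_{t-1}(w\_2, o\_2) = \beta + \frac{1-\beta}{2}, \qquad P\_{t-1}(w\_2, o\_1) = \frac{1-\beta}{2}.$$

For β = 1 these reduce to 1 and 0, i.e., the single-trial inference of the mutual-exclusivity-biased learner, whereas for β = 0 both equal 1/2, i.e., the equal weights of the purely associative learner. These values refer to the preprocessing stage only, before the update of Equation 8 is applied.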

In the specific experiment we address in this section, *N* = 12 words and their referents are split up into two groups of 6 words each, say *A* = {*(w*1*, o*1*),...,(w*6*, o*6*)*} and *B* = {*(w*7*, o*7*),...,(w*12*, o*12*)*}. The context size is set to *C* = 2 and the training stage is divided in two phases. In the early phase, only the words belonging to group *A* are presented and the duration of this phase is set such that each word is repeated 3, 6 or 9 times. In the late phase, the contexts consist of one word belonging to *A* and one belonging to *B* forming fixed couples, i.e., whenever *wi* appears in a context, *wi*+6, with *i* = 1*,...,* 6, must appear too. The duration of the late phase depends on the number of repetitions of each word that can be 3, 6, or 9 as in the early phase (Kachergis et al., 2012). The combinations of the sampling frequencies yield 9 different training conditions but here we will consider only the case that the late phase comprises 6 repetitions of each word.

The testing stage comprises the play of a single word, say *w*1, and the display of 11 of the 12 trained objects (Kachergis et al., 2012). Each word was tested twice with a time lag between the tests: once without its corresponding early object (*o*<sup>1</sup> in the case) and once without its late object (*o*<sup>7</sup> in the case). This procedure requires that we renormalize the confidences for each test. For instance, in the case *o*<sup>1</sup> is left out of the display, the renormalization is

$$P\_{t^\*}\left(w\_1, o\_j\right) = P\_{t^\*}\left(w\_1, o\_j\right) / \sum\_{o\_k \neq o\_1} P\_{t^\*}\left(w\_1, o\_k\right) \tag{15}$$

with *j* = 2*,...,* 12 so that ∑<sub>*oj*≠*o*1</sub> *P<sub>t∗</sub>(w*1*, oj)* = 1. Similarly, in the case *o*<sup>7</sup> is left out the renormalization becomes

$$P\_{t^\*}\left(\boldsymbol{w}\_1, \boldsymbol{o}\_j\right) = P\_{t^\*}\left(\boldsymbol{w}\_1, \boldsymbol{o}\_j\right) / \sum\_{o\_k \neq o\_7} P\_{t^\*}\left(\boldsymbol{w}\_1, \boldsymbol{o}\_k\right) \tag{16}$$

with *j* = 1*,...,* 6*,* 8*,...,* 12 so that ∑<sub>*oj*≠*o*7</sub> *P<sub>t∗</sub>(w*1*, oj)* = 1. We are interested in the (renormalized) confidences *P<sub>t∗</sub>(w*1*, o*1*)*, *P<sub>t∗</sub>(w*1*, o*7*)*, *P<sub>t∗</sub>(w*7*, o*7*)*, and *P<sub>t∗</sub>(w*7*, o*1*)*, which are shown in **Figures 5**, **6** for the conditions where words *wi, i* = 1*,...,* 6 are repeated 3 (left panel), 6 (middle panel), and 9 (right panel) times in the early learning phase, and the words *wi, i* = 1*,...,* 12 are repeated 6 times in the late phase. The figures exhibit the performance of the algorithm for the set of parameters χ and β that fits best the experimental data of Kachergis et al. (2012) for fixed α<sup>0</sup>. This optimum set is shown in **Figure 7** for the 6 early repetition condition, which is practically indistinguishable from the optima of the other two conditions. The conditions with the different word repetitions in the early phase were intended to produce distinct confidences on the learned association *(w*1*, o*1*)* before the onset of the late phase in the training stage. The insensitivity of the results to these conditions probably indicates that the association was already learned well enough with 3 repetitions only. Finally, we note that, though the testing stage focused on words *w*<sup>1</sup> and *w*<sup>7</sup> only, all word pairs *wi* and *wi*+6 with *i* = 1*,...,* 6 are strictly equivalent since they appear the same number of times during the training stage.

**FIGURE 5 | Results of the experiments on mutual exclusivity in the case the late phase of the training process comprises 6 repetitions of each word.** The blue symbols represent the probability that the algorithm picks object *o*<sup>1</sup> as the referent of word *w*<sup>1</sup> whereas the red symbols represent the probability it picks *o*7. The broken horizontal lines and the shaded zones around them represent the experimental means and standard deviations (Kachergis et al., 2012). The left panel shows the results for 3 repetitions of *w*<sup>1</sup> in the early training phase, the middle panel for 6 repetitions and the right panel for 9 repetitions. The results correspond to the parameters χ and β that best fit the experimental data for fixed α<sup>0</sup>.
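The renormalization of Equations (15) and (16) amounts to dropping the undisplayed object and rescaling the remaining confidences. A minimal Python sketch, with a hypothetical function name and made-up confidence values used only for illustration:

```python
def renormalize_without(conf, left_out):
    """Renormalize one word's confidences when one object is not displayed.

    conf:     dict object -> P_{t*}(w, o) for a single word
    left_out: the object omitted from the test display
    Implements the rescaling of Eqs. 15-16 for an arbitrary omitted object.
    """
    kept = {obj: p for obj, p in conf.items() if obj != left_out}
    total = sum(kept.values())
    if total == 0.0:
        # Assumption: this case is not discussed in the text; fail loudly.
        raise ValueError("no confidence left on the displayed objects")
    return {obj: p / total for obj, p in kept.items()}


# Illustrative values only (not taken from the experiments).
conf_w1 = {"o1": 0.5, "o7": 0.3, "o2": 0.2}
print(renormalize_without(conf_w1, "o1"))
```

With *o*1 left out, the remaining confidences 0.3 and 0.2 are divided by their sum 0.5, so the relative preference between the displayed objects is unchanged.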

The experimental results exhibited in **Figure 6** offer indirect evidence that the participants have resorted to mutual exclusivity to produce their word-object mappings. In fact, from the perspective of a purely associative learner, word *w*<sup>7</sup> should be associated to objects *o*<sup>1</sup> or *o*<sup>7</sup> only, but since in the testing stage one of those objects was not displayed, such a learner would surely select the correct referent. However, the finding that *P<sub>t∗</sub>(w*7*, o*7*)* is considerably greater than *P<sub>t∗</sub>(w*7*, o*1*)* (they should be equal for an associative learner) indicates that there is a bias against the association *(w*7*, o*1*)* which is motivated, perhaps, by the previous understanding that *o*<sup>1</sup> was the referent of word *w*1. In fact, a most remarkable result revealed by **Figure 6** is that *P<sub>t∗</sub>(w*7*, o*7*) <* 1. Since word *w*<sup>7</sup> appeared only in the late phase context Ω = {*w*1*, o*1*, w*7*, o*7} and object *o*<sup>1</sup> was not displayed in the testing stage, we must conclude that the participants produced spurious associations between words and objects that never appeared together in a context. Our algorithm accounts for these associations through Equation (4) in the case of new words and, more importantly, through Equations (9) and (13) due to the effect of the information efficiency factor α*t(wi)*. The experimental data are well described only in the narrow range α<sup>0</sup> ∈ [0*.*85*,* 0*.*9].

**Figure 8** exhibits the developmental timeline of the cross-situational learning history of the algorithm with the optimal set of parameters (see the figure caption) for the three different training conditions in the early training phase. This phase is characterized by the steady growth of the confidence on the association *(w*1*, o*1*)* (blue symbols) accompanied by the decrease of the confidence on association *(w*1*, o*7*)* (red symbols). As the word *w*<sup>7</sup> does not appear in the early training phase, the confidences on its association with any object remain constant corresponding to the accuracy value 1*/*11 (we recall that *o*<sup>1</sup> is left out of the display in the testing stage). The beginning of the late training stage is marked by a steep increase of the confidence on the association *(w*7*, o*7*)* (green symbols) whereas the confidence on *(w*1*, o*1*)* decreases gradually. A similar gradual increase is observed on the confidence on the association *(w*7*, o*1*)* (orange symbols). As expected, for large *t* all confidences presented in this figure tend to the same value, since the words *w*<sup>1</sup> and *w*<sup>7</sup> always appear together in the context Ω = {*w*1*, o*1*, w*7*, o*7}. Finally, we note that this developmental timeline is qualitatively similar to that produced by the algorithm proposed by Kachergis et al. (2012).

**strength) corresponding to the best fit shown in Figures 5 and 6 in the case word** *w***<sup>1</sup> is repeated 6 times in the early training phase.**

#### **5. DISCUSSION**

The chief purpose of this paper is to understand and model the mental processes used by human subjects to produce their word-object mappings in the controlled cross-situational word-learning scenarios devised by Yu and Smith (2007) and Kachergis et al. (2009, 2012). In other words, we seek to analyze the psychological phenomena involved in the production of those mappings. Accordingly, we assume that the completion of that task requires the existence of two cognitive abilities, namely, the associative capacity to create and reinforce associations between words and referents that co-occur in a context, and the non-associative capacity to infer word-object associations based on previous learning events, which accounts for the mutual exclusivity principle, among other things. In order to regulate the effectiveness of these two capacities we introduce the parameters χ ∈ [0*,* 1], which yields the reinforcement strength, and β ∈ [0*,* 1], which determines the inference strength.

In addition, since the reinforcement and inference processes require storage, use and transmission of past and present information (coded mainly in the values of the confidences *P<sub>t</sub>*(*w<sub>i</sub>*, *o<sub>j</sub>*)) we introduce a word-dependent quantity α<sub>*t*</sub>(*w<sub>i</sub>*) ∈ [0, 1] which gauges the impact of the confidences at trial *t* − 1 on the update of the confidences at trial *t*. In particular, the greater the certainty about the referent of word *w<sub>i</sub>*, the greater the relevance of the previous confidences. However, there is a baseline information gauge factor α<sub>0</sub> ∈ [0, 1] used to process words for which the uncertainty about their referents is maximum. The adaptive expression for α<sub>*t*</sub>(*w<sub>i</sub>*) given in Equation (7) seems to be critical for the fitting of the experimental data. In fact, our first choice was to use a constant information gauge factor (i.e., α<sub>*t*</sub>(*w<sub>i</sub>*) = α ∀ *t*, *w<sub>i</sub>*) with which we were able to describe only the experiments summarized in **Figures 1**, **4** (data not shown). Note that a consequence of prescription (7) is that once the referent of a word is learned with maximum confidence (i.e., *P<sub>t</sub>*(*w<sub>i</sub>*, *o<sub>j</sub>*) = 1 and *P<sub>t</sub>*(*w<sub>i</sub>*, *o<sub>k</sub>*) = 0 for *o<sub>k</sub>* ≠ *o<sub>j</sub>*) it is never forgotten.

**FIGURE 8 | Knowledge development for the model parameters that best fit the results of the mutual exclusivity experiments summarized in Figures 5, 6 in the case the late phase of the training process comprises 6 repetitions of each word.** The symbol colors follow the convention used in those figures, i.e., the blue symbols represent the confidence on association (*w*<sub>1</sub>, *o*<sub>1</sub>), the red symbols on association (*w*<sub>1</sub>, *o*<sub>7</sub>), the green symbols on association (*w*<sub>7</sub>, *o*<sub>7</sub>) and the orange symbols on association (*w*<sub>7</sub>, *o*<sub>1</sub>). The left panel shows the results for 3 repetitions of *w*<sub>1</sub> in the early training phase (α<sub>0</sub> = 0.85, χ = 0.25, β = 0.95), the middle panel for 6 repetitions (α<sub>0</sub> = 0.85, χ = 0.45, β = 0.99) and the right panel for 9 repetitions (α<sub>0</sub> = 0.85, χ = 0.4, β = 0.95). For each trial *t* the symbols represent the average over 10<sup>5</sup> realizations of the learning process.

The algorithm described in Section 3 comprises three free parameters χ, β and α<sub>0</sub> which are adjusted so as to fit a representative selection of the experimental data presented in Kachergis et al. (2009, 2012). A robust result from all experiments is that the baseline information gauge factor is in the range 0.7 < α<sub>0</sub> < 1. Actually, the fast mapping experiments narrow this interval down to 0.85 < α<sub>0</sub> < 0.9. This is a welcome result because we do not have a clear-cut interpretation for α<sub>0</sub> (it encompasses storage, processing and transmission of information), and so the fact that this parameter does not vary much for wildly distinct experimental settings is evidence that, whatever its meaning, it is not relevant to explain the learning strategies used in the different experimental conditions. Fortunately, this is not the case for the two other parameters χ and β.

For instance, in the fast mapping experiments discussed in Subsection 4.3 the best fit of the experimental data is achieved for β ≈ 1 indicating thus the extensive use of mutual exclusivity, and inference in general, by the participants of those experiments. Moreover, in that case the best fit corresponds to a low (but non-zero) value of χ, which is expected since for contexts that exhibit two associations (*C* = 2) only, most of the disambiguations are likely to be achieved solely through inference. This contrasts with the experiments on variable word sampling frequencies discussed in Subsection 4.1, for which the best fit is obtained with intermediate values of β and χ so the participants' use of reinforcement and inference was not too unbalanced. The contextual diversity experiment of Subsection 4.2, in which the words are segregated in two isolated groups of 12 and 6 components, offers another extreme learning situation, since the best fit corresponds to χ ≈ 1 and β ≈ 0 in that case. To understand this result, first we recall that most of the participants' errors were due to misassignments of referents belonging to the same group of the test word, and those confidences were strengthened mainly by the reinforcement process. Second, in contrast to the inference process, which creates and strengthens spurious intergroup associations via Equation (12), the reinforcement process solely weakens those associations via Equation (9). Thus, considering the learning conditions of the contextual diversity experiment it is no surprise that reinforcement was the participants' choice strategy.

It is interesting to note that the optimal set of parameters that describes the fast mapping experiments (see **Figures 5**–**8**) indicates that there is a trade-off in the values of those parameters, in the sense that high values of the inference parameter β require low values of the reinforcement parameter χ. Since this is not an artifact of the model, which poses no constraint on those values (e.g., they are both large for small α<sub>0</sub>), the trade-off may reveal a limitation on the amount of attentional resources available to the learner to distribute among the two distinct mental processes.

Our results agree with the findings of Smith et al. (2011) that participants use various learning strategies, which in our case are determined by the values of the parameters χ and β, depending on the specific conditions of the cross-situational word-learning experiment. In particular, in the case of low within-trial ambiguity those authors found that participants generally resorted to a rigorous eliminative approach to infer the correct word-object mapping. This is exactly the conclusion we reached in the analysis of the fast mapping experiment for which the within-trial ambiguity takes the lowest possible value (*C* = 2).

Although the adaptive learning algorithm presented in this paper reproduced the performance of adult participants in cross-situational word-learning experiments quite successfully, the deterministic nature of the algorithm somewhat hindered the psychological interpretation of the information gauge factor α<sub>*t*</sub>(*w<sub>i</sub>*). In fact, not only are learning and behavior best described as stochastic processes (Atkinson et al., 1965), but the modeling of those processes also requires (and facilitates) a precise interpretation of the model parameters, since they are introduced in the model as transition probabilities.

#### **ACKNOWLEDGMENTS**

The work of José F. Fontanari was supported in part by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Paulo F. C. Tilles was supported by grant # 2011/11386-1, São Paulo Research Foundation (FAPESP).

#### **REFERENCES**


Gleitman, L. (1990). The structural sources of verb meanings. *Lang. Acquis.* 1, 1–55. doi: 10.1207/s15327817la0101\_2


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 14 July 2013; paper pending published: 01 October 2013; accepted: 28 October 2013; published online: 19 November 2013.*

*Citation: Tilles PFC and Fontanari JF (2013) Reinforcement and inference in cross-situational word learning. Front. Behav. Neurosci. 7:163. doi: 10.3389/fnbeh.2013.00163*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2013 Tilles PFC and Fontanari JF. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### The role of semantic abstractness and perceptual category in processing speech accompanied by gestures

#### *Arne Nagels<sup>1</sup>\*, Anjan Chatterjee<sup>2</sup>, Tilo Kircher<sup>1</sup> and Benjamin Straube<sup>1</sup>*

*<sup>1</sup> Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Marburg, Germany*

*<sup>2</sup> Department of Neurology and the Center for Cognitive Neuroscience, The University of Pennsylvania, Philadelphia, PA, USA*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*D. Caroline Blanchard, University of Hawaii at Manoa, USA*

*Mar Sanchez, Emory University, USA*

#### *\*Correspondence:*

*Arne Nagels, Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Straße 8, 35039 Marburg, Germany e-mail: nagels@med.uni-marburg.de*

Space and shape are distinct perceptual categories. In language, perceptual information can also be used to describe abstract semantic concepts like a "rising income" (space) or a "square personality" (shape). Despite being inherently concrete, co-speech gestures depicting space and shape can accompany concrete or abstract utterances. Here, we investigated the way that abstractness influences the neural processing of the perceptual categories of space and shape in gestures. Thus, we tested the hypothesis that the neural processing of perceptual categories is highly dependent on language context. In a two-factorial design, we investigated the neural basis for the processing of gestures containing shape (SH) and spatial information (SP) when accompanying concrete (c) or abstract (a) verbal utterances. During fMRI data acquisition participants were presented with short video clips of the four conditions (cSP, aSP, cSH, aSH) while performing an independent control task. Abstract (a) as opposed to concrete (c) utterances activated temporal lobes bilaterally and the left inferior frontal gyrus (IFG) for both shape-related (SH) and space-related (SP) utterances. An interaction of perceptual category and semantic abstractness in a more anterior part of the left IFG and inferior part of the posterior temporal lobe (pTL) indicates that abstractness strongly influenced the neural processing of space and shape information. Despite the concrete visual input of co-speech gestures in all conditions, space and shape information is processed differently depending on the semantic abstractness of its linguistic context.

**Keywords: iconic gestures, deictic gestures, metaphoric gestures, functional magnetic resonance imaging, speech-associated gestures, cognition**

#### **INTRODUCTION**

In face-to-face communication people often use gestures to complement the content of their verbal message. People produce different kinds of gestures (McNeill, 1992), such as iconic gestures illustrating shape (e.g., "The ball is round") or deictic gestures referring to spatial information in our physical environment (e.g., "The cat is sitting on the roof"; pointing gesture). Shape gestures resemble the information they convey, as when someone draws a circle in the air to indicate a round shape ("The table in the kitchen is round," circle gesture). Space and shape gestures typically refer to concrete entities in the world. However, they can also make abstract references depending on the nature of the verbal message (McNeill, 1992; McNeill et al., 1993a,b). For instance, shape-related gestures can illustrate a deep connection between twins when the speaker touches the fingertips of both hands ("The twins had a spiritual bond between them"). Similarly space-related gestures can refer to abstract relationships or locations such as lifting the hand when saying that the discussion occurred at a very "high level."

In direct face-to-face communication people use gestures (Ozyurek and Kelly, 2007), regardless of whether the utterances are concrete or abstract. In line with theories suggesting gestures may represent the phylogenetic origin of human speech (Corballis, 2003, 2009, 2010; Gentilucci and Corballis, 2006; Gentilucci et al., 2006; Bernardis et al., 2008), gestures might represent the basis of spatial or action representations in human language [for example, see Tettamanti and Moro (2011)]. Such spatial elements transferred into speech and gestures could be an expression of how our language is rooted in embodied experiences (Lakoff, 1987; Gibbs, 1996). Following this idea, perceptual elements and the sensory-motor system might both contribute to the processing and comprehension of figurative abstract language (particularly in the context of metaphors such as "grasp an idea"), as suggested by the embodiment theory (Gallese and Lakoff, 2005; Arbib, 2008; Fischer and Zwaan, 2008; D'Ausilio, 2009; Pulvermüller and Fadiga, 2010). Thus, investigating the neural substrates underlying the processing of perceptual categories such as shape or space in the context of concrete vs. abstract language semantics would allow this hypothesis to be tested.

Recent fMRI investigations have focused on the processing of speech and gesture for different gesture types [beat gestures: Hubbard et al., 2009; iconic gestures: Willems et al., 2007, 2009; metaphoric gestures: Kircher et al., 2009; Straube et al., 2009, 2011a]. In general, left hemispheric posterior temporal (Holle et al., 2008, 2010; Green et al., 2009) and inferior frontal brain regions (Willems et al., 2007; Kircher et al., 2009; Straube et al., 2009, 2011a) are commonly found for the semantic processing of speech and gesture. The left posterior temporal lobe (pTL) seems to be involved during the apprehension of co-verbal gestures, whereas the left inferior frontal gyrus (IFG) seems to be additionally recruited when processing gestures in an abstract sentence context (Kircher et al., 2009; Straube et al., 2011a, 2013) or when accompanying incongruent ("The fisherman has caught a huge fish," while the actor is angling his arms) concrete speech (Willems et al., 2007, 2009; Green et al., 2009). However, these studies do not examine the neural effects of processing concrete or abstract utterances with different perceptual categories, such as gestures referring to shape (e.g., "The ball is round") or space (e.g., "The shed is next to the building").

In a previous study, we compared brain activation in response to object-related (non-social) and person-related (social) co-verbal gestures (Straube et al., 2010). Person-related as opposed to object-related gestures activated anterior brain regions including the medial and bilateral frontal cortex as well as the temporal lobes. These data indicate that, depending on speech and gesture content (person-related vs. object-related), different brain regions are activated during comprehension. However, in the aforementioned study the content of the verbal utterances was confounded by differences in the level of abstractness, since person-related gestures are not only social, but also more abstract and symbolic than object-related gestures (e.g., "The actor did a good job in the play"). Therefore, the specific influence of person-related and object-related content independent of abstractness was not disentangled.

Besides this evidence for a posterior to anterior gradient of processing for concrete to abstract speech-gesture information, it is generally assumed that specific regions of the brain are specialized for the processing of specific kinds of contents (Patterson et al., 2007). Information about the shapes of objects is processed in lateral occipital and inferior temporal brain areas (e.g., Kourtzi and Kanwisher, 2000, 2001; Grill-Spector et al., 2001; Kourtzi et al., 2003; Panis et al., 2008; Karnath et al., 2009), whereas the parietal lobe is involved in the processing of spatial information (Rizzolatti et al., 1997, 2006; Koshino et al., 2000, 2005; Rizzolatti and Matelli, 2003; Chica et al., 2011; Gillebert et al., 2011). Although gestures can be distinguished by perceptual category [e.g., deictic gestures convey spatial information and iconic gestures predominantly convey shape information (McNeill, 1992)], there is insufficient knowledge about the neural processing of these different perceptual categories in abstract and concrete sentence contexts.

Here we investigate the way in which perceptual category and semantic abstractness of co-verbal gestures interact. Our experiment addresses the question of whether different perceptual categories are processed in the same or in distinct brain regions, irrespective of their linguistic abstractness. To answer this question, we applied a naturalistic approach comparing shape-related and space-related gestures in the context of concrete and abstract sentences.

On a cognitive level, (concrete physical) gesture content has to be aligned with the content of speech, regardless of whether the message is concrete or abstract. We hypothesize that the effort to integrate abstract speech with concrete gestures will likely result in enhanced neural responses in the left inferior frontal cortex (Willems et al., 2007) and in bilateral temporal brain regions (Kircher et al., 2009) as compared to the concrete conditions, independent of perceptual category. With regard to shape-related and space-related gestural information we expected differential activation within the inferior temporal and parietal lobe, respectively. For the interaction of perceptual (space and shape) and semantic category (concreteness and abstractness) two alternative results were hypothesized: (1) If the same neural processes are engaged when processing shape and space information regardless of the abstractness of the message, we will find no significant activation in interaction analyses. In this case, conjunction analyses (e.g., aSP > aSH ∩ cSP > cSH) will result in common activation patterns in the parietal cortex for space and inferior temporal cortex for shape. (2) If abstractness influences the processing of shape-related and space-related gesture information, interaction analyses will show differential activations between conditions. Here, we expected an interaction since language content may differentially influence the interpretation of perceptual categories and consequently the neural processing predominantly in the left IFG and pTL. Enhanced neural responses in classical "language regions" would strengthen the assumption that perceptual categories are differentially processed if embedded into an abstract vs. concrete language context.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Seventeen male right-handed (Oldfield, 1971) healthy volunteers, all native speakers of German (mean age = 23.8 ± 2.7 years, range: 20–30 years; mean years of school education = 12.65 ± 0.86, range: 10–13 years), without impairments of vision or hearing, participated in the study. None of the participants had any serious medical, neurological or psychiatric illness, past or present. All participants gave written informed consent and were paid 20 Euro for participation. The study was approved by the local ethics committee. Because of technical problems one fMRI data set was excluded from the analyses.

#### **STIMULUS CONSTRUCTION**

A set of 388 short video clips depicting an actor was initially created, consisting of 231 concrete and 157 abstract sentences, each accompanied by co-verbal gestures.

Iconic gestures refer to the concrete content of sentences, whereas metaphoric gestures illustrate abstract information in sentences. For example in the sentences "To get down to business" (drop of the hand) or "The politician builds a bridge to the next topic" (depicting an arch with the hand), abstract information is illustrated using metaphoric gestures. By contrast, the same gestures can be iconic (drop of the right hand or depicting an arch with the right hand) with the sentences "The man goes down the hill" or "There is a bridge over the river" when they illustrate concrete physical features of the world. Thus, concrete utterances are those containing referents that are perceptible to the senses ("The man ascends to the top of the mountain"). Abstract sentences, on the other hand, contain referents that are not directly perceptible ("The man ascends to the top of the company"), where the spatial or shape terms in the utterance are being used figuratively. For the distinction between concrete and abstract concepts see Holmes and Rundle (1985).

Here we were interested in the neural processing of the following types of sentences accompanied by gestures: (1) utterances with concrete content and space-related perceptual information (cSP; "deictic gesture"); (2) utterances with concrete content and shape-related perceptual information (cSH; "iconic gesture"); (3) utterances with an abstract content and space-related perceptual information (aSP; "abstract deictic gestures"); and (4) utterances with an abstract content and shape-related perceptual information (aSH; "metaphoric gestures").

All sentences accompanying gestures had a length of 5–10 words, with an average duration of 2.37 s (*SD* = 0.35) and a similar grammatical form (subject-predicate-object). The speech and gestures were performed by the same male actor in a natural, spontaneous way. This procedure was continuously supervised by two of the authors (Benjamin Straube, Tilo Kircher) and timed digitally. All video clips had the same length of 5 s with at least 0.5 s before and after the sentence onset and offset, respectively, where the actor did not speak or move.

#### **STIMULUS SELECTION: RATING / MATERIAL SELECTION/MATCHING**

For stimulus validation, 17 raters not participating in the fMRI study evaluated each video on a scale ranging from 1 to 7 (1 = very low to 7 = very high) according to three content dimensions (space, shape and action information) and familiarity. Other general parameters like "understandability" and "naturalness" were previously validated and controlled for (for detailed information see Green et al., 2009; Kircher et al., 2009; Straube et al., 2011a,b).

Material was selected to address our manipulations of interest (cf. above):


For each condition 30 sentences were selected to differentiate both factors. Therefore, co-verbal gestures conveying space-related perceptual information (cSP, aSP) were selected to have similar spatial rating scores independent of the level of the abstractness of the utterance (c vs. a). Abstract co-verbal gestures (aSP, aSH) were selected to be similarly abstract independent of the perceptual category of information (space or shape; see **Table 1**).

To confirm that our stimuli met our design criteria, we calculated analyses of variance for the factors perceptual category (space-related, shape-related) and semantic category (concrete, abstract), as represented in the 2 × 2 experimental design.

As intended, we found for the rating of spatial information a significant main effect for perceptual category [SP > SH; *F*(1, 116) = 72.532, *p* < 0.001], but no significant effects for the main effect of semantic category [a vs. c; *F*(1, 116) = 0.149, *p* = 0.603] or the interaction of perceptual and semantic category [*F*(1, 116) = 3.250, *p* = 0.074].

For the rating of shape information we obtained again a significant main effect for perceptual category [SH > SP; *F*(1, 120) = 98.466, *p* < 0.001], but no significant effects for the main effect of abstractness [a vs. c; *F*(1, 120) = 0.001, *p* = 0.988] or the interaction of perceptual category and abstractness [*F*(1, 120) = 2.053, *p* = 0.155].

For the rating of abstractness we obtained a significant main effect for abstractness [a > c; *F*(1, 116) = 116.124, *p* < 0.001], but no significant effects for the main effect of perceptual category [SP vs. SH; *F*(1, 116) = 0.005, *p* = 0.942] or the interaction of perceptual category and abstractness [*F*(1, 116) = 2.975, *p* = 0.087]. For means and confidence intervals see **Table 1**. Together, these analyses confirm that the stimulus selection was successful and that the stimulus characteristics for each condition met our design criteria.
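The 2 × 2 item analysis described above can be sketched as a balanced between-items ANOVA. This is a minimal illustration, not the authors' SPSS procedure: the function name `two_way_anova`, the dictionary layout and the toy ratings are assumptions made for the example.

```python
from statistics import mean


def two_way_anova(cells):
    """Balanced 2 x 2 between-items ANOVA.

    `cells` maps (perceptual, semantic) level pairs, e.g. ("SP", "c"),
    to equally long lists of item ratings. Returns F ratios and error df.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    assert len(levels_a) == 2 and len(levels_b) == 2, "expects a 2 x 2 design"
    n = len(next(iter(cells.values())))              # items per cell
    assert all(len(v) == n for v in cells.values()), "design must be balanced"

    grand = mean(x for v in cells.values() for x in v)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    mean_a = {a: mean(cell_mean[(a, b)] for b in levels_b) for a in levels_a}
    mean_b = {b: mean(cell_mean[(a, b)] for a in levels_a) for b in levels_b}

    # Sums of squares for both main effects, the interaction and the error.
    ss_a = n * 2 * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = n * 2 * sum((m - grand) ** 2 for m in mean_b.values())
    ss_ab = n * sum(
        (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in levels_a for b in levels_b
    )
    ss_err = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_err = len(cells) * (n - 1)
    ms_err = ss_err / df_err
    # Every effect has one degree of freedom in a 2 x 2 design.
    return {
        "F_perceptual": ss_a / ms_err,
        "F_semantic": ss_b / ms_err,
        "F_interaction": ss_ab / ms_err,
        "df_error": df_err,
    }


# Toy ratings (hypothetical): a pure perceptual-category effect.
toy = {("SP", "c"): [5, 7], ("SP", "a"): [5, 7],
       ("SH", "c"): [1, 3], ("SH", "a"): [1, 3]}
result = two_way_anova(toy)
```

With 30 items in each of the four cells, the error term has 4 × 29 = 116 degrees of freedom, which matches the *F*(1, 116) values reported for the spatial and abstractness ratings.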

For the control variables familiarity, naturalness and action information we found no significant main effects or interactions (for all *p* > 0.10). However, we found significant effects for understandability [main effect perceptual category: SP > SH: *F*(1, 120) = 4.960, *p* = 0.028; interaction: *F*(1, 116) = 17.704, *p* < 0.001], speech duration [main effect abstractness: a > c: *F*(1, 116) = 9.024, *p* < 0.003] and gesture duration [main effect abstractness: c > a: *F*(1, 116) = 10.821, *p* < 0.001]. The differences in understandability were small (< 0.22 rating points) and most likely due to ceiling effects in the aSP (skewness = −1.68; kurtosis = 4.31) and cSH (skewness = −1.40; kurtosis = 1.80) conditions. For means and confidence intervals of the control variables see **Table 2**.

In the event-related fMRI study design focusing on the co-occurrence of speech and gesture, differences in speech or gesture duration should not have a crucial impact on our results. However, we included differences in speech and gesture duration for each event as a covariate of no interest in our single-subject design matrix.


*Rating results for the conditions cSP, cSH, aSP, and aSH for space-relatedness, abstractness and shape-relatedness.*


**Table 2 | Control variables: understandability, familiarity and naturalness.**

*Rating results for the conditions cSP, cSH, aSP, and aSH for the dimensions understandability, familiarity and naturalness.*

Apart from the aforementioned factors, further differences in movement characteristics were found between the conditions. In all four conditions, predominantly right-hand (cSP = 19; cSH = 13; aSP = 16; aSH = 11) or bimanual movements (cSP = 11; cSH = 17; aSP = 14; aSH = 19) were performed. To ensure that none of the patterns of neural activation were produced by differences in hand movements (right hand vs. both hands) and speech length, a separate control analysis was run accounting for the aforementioned dimensions. A set of 11 exactly paired video clips for each condition was used for the additional analysis.

To account for differences in the size of movements between conditions, we coded each video clip with regard to the extent of the hand movement. We divided the video screen into small rectangles that corresponded to the gesture space described by McNeill (1992, 2005) and counted the number of rectangles in which gesture movements occurred (see Straube et al., 2011a). For each video the number of rectangles was also included as a covariate of no interest in the single-subject model.
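The rectangle count used as a movement-extent covariate can be illustrated as follows. This is only a sketch: the text does not say how hand positions were obtained or how fine the grid was, so the pixel coordinates, the frame size and the 6 × 6 grid below are hypothetical.

```python
def count_occupied_cells(points, width, height, n_cols, n_rows):
    """Count grid rectangles touched by at least one hand position.

    points: iterable of (x, y) pixel coordinates of the hand over time.
    width, height: frame size in pixels; n_cols, n_rows: grid resolution.
    """
    occupied = set()
    for x, y in points:
        col = min(int(x / width * n_cols), n_cols - 1)    # clamp right edge
        row = min(int(y / height * n_rows), n_rows - 1)   # clamp bottom edge
        occupied.add((row, col))
    return len(occupied)


# Hypothetical hand trajectory on a 720 x 576 frame with a 6 x 6 grid.
trajectory = [(100, 100), (110, 105), (300, 100), (300, 400)]
extent = count_occupied_cells(trajectory, 720, 576, 6, 6)
```

The first two positions fall into the same rectangle, so this trajectory occupies three rectangles in total.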

#### **EXPERIMENTAL DESIGN AND PROCEDURE**

During the fMRI scanning procedure, videos were presented via MR-compatible video goggles (VisuaStim©, Resonance Technology, Inc.) and non-magnetic headphones (audio presenting systems for stereophonic stimuli: Commander; Resonance Technology, Inc.), which additionally dampened scanner noise.

Thirty items of each of the four conditions were presented in an event-related design, in a pseudo-randomized order and counterbalanced across subjects. Each video was followed by a baseline condition (gray background with a fixation cross) with a variable duration of 3750–6750 ms (average: 5000 ms; see **Figure 1**).

During scanning participants were instructed to watch the videos and to indicate via left hand key presses at the beginning of each video whether the spot displayed on the actor's sweater was light or dark colored. This task was chosen to focus participants' attention on the middle of the screen and enabled us to investigate implicit speech and gesture processing without possible instruction-related attention biases. Performance rates and reaction times were recorded. Prior to scanning, each participant received at least 10 practice trials outside the scanner, which were different from the stimuli used in the main experiment. During the preparation scans additional clips were presented to adjust the volume of the headphone. Each participant performed two runs with 60 video clips and a total duration of 10.5 min each.

**FIGURE 1 | Examples of the different speech and gesture video-clips.** The stimulus material consisted of video clips of an actor performing either space-related **(top)** or shape-related **(bottom)** gestures to corresponding sentences with a concrete **(left)** or abstract content **(right)**. One screen shot of an example video is shown for each condition (cSP, concrete space-related; cSH, concrete shape-related; aSP, abstract space-related; aSH, abstract shape-related). In order to exemplify the stimulus material, German sentences are translated into English and written in speech bubbles for illustration (unlike in the actual stimuli).

#### **fMRI DATA ACQUISITION**

MRI was performed on a 3T Siemens scanner (Siemens MRT Trio series). Functional data were acquired with echo planar images in 38 transversal slices (repetition time [TR] = 2000 ms; echo time [TE] = 30 ms; flip angle = 90°; slice thickness = 3 mm; interslice gap = 0.30 mm; field of view [FoV] = 220 × 199 mm; voxel resolution = 3.44 × 3.44 mm; matrix dimensions 64 × 58). Slices were positioned to achieve whole brain coverage. During each functional run 315 volumes were acquired.
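The acquisition parameters can be cross-checked with a few lines of arithmetic. The sketch below uses only the values quoted above; the variable names are illustrative.

```python
tr_s = 2.0                 # repetition time in seconds (TR = 2000 ms)
n_volumes = 315            # volumes per functional run
run_minutes = n_volumes * tr_s / 60

fov_mm = (220.0, 199.0)    # field of view
matrix = (64, 58)          # acquisition matrix
in_plane_mm = tuple(f / m for f, m in zip(fov_mm, matrix))

slice_mm, gap_mm, n_slices = 3.0, 0.30, 38
# Slab thickness: all slices plus the gaps between neighbouring slices.
coverage_mm = n_slices * slice_mm + (n_slices - 1) * gap_mm

print(f"run duration: {run_minutes:.1f} min")
print(f"in-plane voxel size: {in_plane_mm[0]:.2f} x {in_plane_mm[1]:.2f} mm")
print(f"slab coverage: {coverage_mm:.1f} mm")
```

The run duration comes out at 10.5 min, in agreement with the run length given in the procedure, and the in-plane voxel size is approximately 3.44 × 3.43 mm.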

#### **DATA ANALYSIS**

MR images were analyzed using Statistical Parametric Mapping (SPM2; www.fil.ion.ucl.ac.uk) implemented in MATLAB 6.5 (Mathworks Inc., Sherborn, MA). The first five volumes of every functional run were discarded from the analysis to minimize T1 saturation effects. To correct for different acquisition times, the signal measured in each slice was shifted relative to the acquisition time of the middle slice using a slice interpolation in time. All images of one session were realigned to the first image of a run to correct for head movement and normalized into standard stereotaxic anatomical MNI space by using the transformation matrix calculated from the first EPI scan of each subject and the EPI template. Afterwards, the normalized data with a resliced voxel size of 3.5 × 3.5 × 3.5 mm were smoothed with a 6 mm FWHM isotropic Gaussian kernel to accommodate intersubject variation in brain anatomy. Proportional scaling with high-pass filtering was used to eliminate confounding effects of differences in global activity within and between subjects.
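For readers more used to thinking of a Gaussian kernel in terms of its standard deviation, the 6 mm FWHM can be converted with the standard relation FWHM = 2√(2 ln 2) σ. The helper name below is illustrative.

```python
import math


def fwhm_to_sigma(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


sigma_mm = fwhm_to_sigma(6.0)    # 6 mm FWHM smoothing kernel
sigma_vox = sigma_mm / 3.5       # expressed in resliced 3.5 mm voxels
```

The kernel therefore has a standard deviation of roughly 2.55 mm, or about 0.73 voxels at the resliced resolution.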

The expected hemodynamic response at the defined "points of integration" for each event-type was modeled by two response functions, a canonical hemodynamic response function (HRF; Friston et al., 1998) and its temporal derivative. The temporal derivative was included in the model to account for the residual variance resulting from small temporal differences in the onset of the hemodynamic response, which is not explained by the canonical HRF alone. The functions were convolved with the event sequence, with fixed event duration of 1 s, for the onsets corresponding to the integration points of gesture stroke and sentence keyword to create the stimulus conditions in a general linear model (Green et al., 2009; Kircher et al., 2009; Straube et al., 2010, 2011b). The fixed event duration of 1 s was chosen to get a broader range of data around the assumed time point of integration. This methodological approach was also applied successfully in previous studies of co-verbal gesture processing (Kircher et al., 2009; Straube et al., 2010, 2011b).
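The construction of the two regressors per event type can be sketched as follows. This is a simplified illustration rather than the SPM implementation: the double-gamma shape parameters (peak delay 6 s, undershoot delay 16 s, ratio 1/6) are commonly used defaults and are assumptions here, the temporal derivative is approximated by a finite difference, and the function names and the single onset at 10 s are hypothetical. Only the 1 s event duration and the 2 s TR are taken from the text.

```python
import math


def gamma_pdf(t, shape):
    """Gamma probability density (unit scale) at time t in seconds."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)


def double_gamma_hrf(t, peak=6.0, undershoot=16.0, ratio=6.0):
    """Double-gamma response: positive lobe minus a scaled undershoot."""
    return gamma_pdf(t, peak) - gamma_pdf(t, undershoot) / ratio


def build_regressors(onsets, n_scans, tr=2.0, duration=1.0, dt=0.1, hrf_len=32.0):
    """Return (canonical, derivative) regressors sampled once per scan."""
    n_fine = int(round(n_scans * tr / dt))
    boxcar = [0.0] * n_fine
    for onset in onsets:                      # events at the integration points
        start = int(round(onset / dt))
        stop = int(round((onset + duration) / dt))
        for i in range(start, min(stop, n_fine)):
            boxcar[i] = 1.0

    n_kernel = int(round(hrf_len / dt))
    hrf = [double_gamma_hrf(i * dt) for i in range(n_kernel)]
    # Temporal derivative approximated by a backward difference (assumption).
    dhrf = [0.0] + [(hrf[i] - hrf[i - 1]) / dt for i in range(1, n_kernel)]

    def convolve(kernel):
        out = [0.0] * n_fine
        for i, b in enumerate(boxcar):
            if b:
                for j in range(min(n_kernel, n_fine - i)):
                    out[i + j] += kernel[j] * dt
        return out

    step = int(round(tr / dt))                # fine samples per scan
    return convolve(hrf)[::step], convolve(dhrf)[::step]


# One hypothetical integration point at 10 s in a 30-scan run.
canonical, derivative = build_regressors([10.0], n_scans=30)
```

The canonical regressor peaks a few seconds after the event, while the derivative regressor changes sign around that peak, which is what allows it to absorb small shifts in response latency.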

A group analysis was performed by entering contrast images into a flexible factorial analysis as implemented in SPM5 in which subjects are treated as random variables. A Monte Carlo simulation of the brain volume of the current study was conducted to establish an appropriate voxel contiguity threshold (Slotnick et al., 2003). Assuming an individual voxel type I error of *p* < 0.005, a cluster extent of 8 contiguous re-sampled voxels was necessary to correct for multiple voxel comparisons at *p* < 0.05. Thus, voxels with a significance level of *p* < 0.005 uncorrected, belonging to clusters with at least eight voxels, are reported (Straube et al., 2010). Activation peaks of some of the activation clusters also survive a family-wise error (FWE) correction. Corresponding corrected *p*-values for each activation peak were included in the tables. The reported voxel coordinates of activation peaks are located in MNI space. Statistical analyses of data other than fMRI were performed using SPSS version 14.0 for Windows (SPSS Inc., Chicago, IL, USA). Greenhouse–Geisser correction was applied whenever necessary.
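The logic of a Monte Carlo cluster-extent threshold can be sketched in simplified form. The sketch below is not the procedure of Slotnick et al. (2003): it uses a small hypothetical volume (12 × 12 × 12 voxels), independent voxels without spatial smoothing, and face-connected clusters. All function names are assumptions. Because these simplifications change the result, it will not reproduce the eight-voxel extent reported above.

```python
import random

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def largest_cluster(active):
    """Size of the largest face-connected cluster among active voxel coordinates."""
    remaining = set(active)
    best = 0
    while remaining:
        stack = [remaining.pop()]
        size = 0
        while stack:
            x, y, z = stack.pop()
            size += 1
            for dx, dy, dz in NEIGHBORS:
                nb = (x + dx, y + dy, z + dz)
                if nb in remaining:
                    remaining.remove(nb)
                    stack.append(nb)
        best = max(best, size)
    return best

def cluster_extent_threshold(shape, voxel_p, alpha=0.05, n_iter=200, seed=0):
    """Smallest extent k such that fewer than alpha of the noise volumes contain a cluster of size >= k."""
    rng = random.Random(seed)
    nx, ny, nz = shape
    maxima = []
    for _ in range(n_iter):
        # Each voxel is independently "active" with probability voxel_p (pure noise).
        active = [(x, y, z)
                  for x in range(nx) for y in range(ny) for z in range(nz)
                  if rng.random() < voxel_p]
        maxima.append(largest_cluster(active))
    k = 1
    while sum(m >= k for m in maxima) / n_iter >= alpha:
        k += 1
    return k

strict = cluster_extent_threshold((12, 12, 12), voxel_p=0.005)
lenient = cluster_extent_threshold((12, 12, 12), voxel_p=0.05)
print(strict, lenient)
```

A more lenient voxel-level threshold produces larger chance clusters and therefore requires a larger cluster extent. Spatial smoothing, which this sketch omits, has the same effect.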

#### **CONTRASTS OF INTEREST**

To test our hypothesis on the neural processing of different perceptual categories in concrete vs. abstract sentence contexts (cf. Introduction section), baseline contrasts (main effects of condition), conjunction analysis and interaction analysis were run.

First, baseline contrasts were calculated to detect general activations for the four main conditions (aSP, cSP, aSH, cSH) as compared to baseline (fixation cross).

In a next step, main effects (SH vs. SP and a vs. c) as well as the interaction were calculated (t-contrasts) to show brain regions involved in the processing of different factors (directed general effects).

To test the hypothesis that perceptual category is processed in the same neural structures regardless of the language context, we performed conjunction analyses of difference contrasts (aSP *>* aSH ∩ cSP *>* cSH and aSH *>* aSP ∩ cSH *>* cSP). To test for general effects of abstractness independent of both space-related and shape-related contents, the same approach was used (aSP *>* cSP ∩ aSH *>* cSH and cSH *>* aSH ∩ cSP *>* aSP).
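One common way to read such a conjunction is as a minimum-statistic test: a voxel counts only if both difference contrasts exceed the threshold on their own. The text does not state which conjunction variant was used, so the sketch below is an assumption, with hypothetical names and toy *t*-values.

```python
def conjunction(t_map_a, t_map_b, threshold):
    """Keep the smaller of two t-values where both exceed the threshold; otherwise 0."""
    result = []
    for a, b in zip(t_map_a, t_map_b):
        smallest = min(a, b)
        result.append(smallest if smallest > threshold else 0.0)
    return result

# Toy t-values for four voxels in two difference contrasts,
# e.g. (aSH > aSP) and (cSH > cSP); the numbers are invented.
t_abstract = [3.5, 4.0, 1.0, 2.9]
t_concrete = [3.1, 0.5, 4.2, 3.0]

surviving = conjunction(t_abstract, t_concrete, threshold=2.8)
print(surviving)  # [3.1, 0.0, 0.0, 2.9]
```

Voxels two and three show a strong effect in only one of the two contexts and are therefore excluded, which is what makes the conjunction a test of context-independent processing.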

Finally, we performed two interaction analyses to test the hypothesis that abstractness significantly changes the processing of perceptual categories, space and shape: (1) = (aSP *>* cSP) *>* (aSH *>* cSH) masked for (aSP *>* cSP) and aSP; (2) = (aSH *>* cSH) *>* (aSP *>* cSP) masked for (aSH *>* cSH) and aSH. The masking procedure was applied to avoid the interpretation of deactivation in the concrete conditions and restrict the effects to increased activity for aSP vs. low-level baseline and its concrete derivative (cSP). Based on our hypothesis, this methodological approach enables us to find specific neural responses for semantic category (concrete/abstract) in space-related (1) and shape-related (2) perceptual contexts.
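The first masked interaction can be written out as contrast weights over the four conditions. This is a simplified sketch: it operates on hypothetical per-voxel parameter estimates and masks on the sign of the raw contrast value, whereas an actual analysis would mask on thresholded statistical maps. The condition order and all names are assumptions.

```python
# Assumed condition order for the weights below.
CONDITIONS = ("aSP", "cSP", "aSH", "cSH")

INTERACTION_1 = (1, -1, -1, 1)   # (aSP > cSP) > (aSH > cSH)
MASK_DIFFERENCE = (1, -1, 0, 0)  # aSP > cSP
MASK_BASELINE = (1, 0, 0, 0)     # aSP > low-level baseline

def contrast(betas, weights):
    """Weighted sum of the condition estimates for one voxel."""
    return sum(b * w for b, w in zip(betas, weights))

def masked_interaction(betas):
    """Interaction value if the voxel passes both inclusive masks, else None."""
    if contrast(betas, MASK_DIFFERENCE) > 0 and contrast(betas, MASK_BASELINE) > 0:
        return contrast(betas, INTERACTION_1)
    return None

# Invented parameter estimates (aSP, cSP, aSH, cSH) for three voxels.
genuine = (2.0, 0.5, 1.0, 1.0)        # aSP activation drives the interaction
deactivation = (-1.0, -3.0, 0.0, 0.0)  # positive interaction, but aSP is below baseline
reversed_shape = (0.5, 1.0, 1.0, 3.0)  # positive interaction driven by cSH > aSH

print(masked_interaction(genuine))         # 1.5
print(masked_interaction(deactivation))    # None
print(masked_interaction(reversed_shape))  # None
```

The second and third voxels both have a positive unmasked interaction value, but neither reflects increased activity for aSP, which is the case the masking is meant to exclude.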

#### **RESULTS**

#### **BEHAVIORAL RESULTS**

The average reaction time for the control task ("indicate the color of the spot on the actor's sweater") did not differ with regard to color or gesture condition [color: *F*(1,15) = 0.506, *P* = 0.488; condition: *F*(4,60) = 0.604, *P* = 0.604; interaction: *F*(4,60) = 1.256, *P* = 0.301; within-subjects two-factorial ANOVA; mean = 1.23 sec, *SD* = 0.94]. The participants showed an average accuracy rate of 99% which did not differ across conditions [*F*(4,60) = 0.273, *P* = 0.841, within-subjects ANOVA]. Thus, the attention control task indicated that participants did pay attention to the video clips.

#### **fMRI RESULTS**

#### *Baseline contrasts (aSP, cSP, aSH, cSH)*

To explore the general processing mechanisms for each condition and the high comparability between conditions, baseline contrasts were calculated (**Figure 2**, **Table 3**). We found activation patterns comparable to those reported in previous studies on speech and gesture stimuli (Straube et al., 2011a).

#### *Main effects for perceptual category*

To identify the general effect of speech-gesture information, the main effects for the factor perceptual category [space-related (SP) vs. shape-related (SH)] were calculated.

For the effect of space-related vs. shape-related information (SP *>* SH) we found an extended network of activations including left middle [Brodmann Area (BA) 6] and superior frontal (BAs 6/8) as well as temporo-parietal (BAs 21/39/40) brain regions (**Table 4**).

The processing of shape-related vs. space-related information (SH *>* SP) resulted in enhanced neural responses in bilateral occipital-parietal (BAs 18/37) and middle (BA 11) as well as inferior frontal (BA 45) gyri and left parietal (BA 40) brain region (**Table 4**).

#### *Main effects for abstractness*

Abstract vs. concrete speech-gesture information (a *>* c) revealed a widespread pattern of activation. A large cluster of activation was found in the left IFG extending to the temporal lobe, including the temporal pole and the middle temporal gyrus. Activations were also found in the right superior temporal gyrus, in the left precuneus and right cuneus as well as in the left precentral and superior medial gyri (BAs 6/9). Enhanced neural responses were also found in the middle cingulate, the left superior frontal and superior medial cortex as well as in the left angular gyrus (BA 39/40) (see **Table 5, Figure 3**).

For the reverse contrast (c *>* a) we found activations in the left and right parahippocampal and fusiform gyri (BA 36/37), in the left inferior frontal (BA 46) and in the temporo-occipital region (BA 37) as well as in the left superior occipital gyrus (BA 19) (see **Table 5**). Smaller clusters of activation were found in the right cerebellum, the middle frontal (BA 11) and in the precentral gyrus (BA 4).

#### *Interaction of perceptual categories and abstractness*

For the interaction of perceptual category and abstractness (aSP *>* cSP)*>*(aSH *>* cSH) we found superior medial frontal, left inferior frontal (BA45/44) and middle temporal and superior parietal brain regions (see **Table 6**).


*Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate. (cSP, concrete spatial; cSH, concrete descriptive; aSP, abstract spatial; aSH, abstract descriptive).*


**Table 4 | Main effects for space-related and shape-related semantic contents.**

*Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate.*


For the contrast in the opposite direction (aSH *>* cSH) *>* (aSP *>* cSP) we found a more distributed predominantly right hemispheric activation pattern including the occipital lobe, the middle frontal gyrus, the inferior parietal lobe, the precuneus, the IFG (BA44/45), the middle occipital gyrus and the bilateral fusiform gyri (see **Table 6**).

#### *Specific contrasts of interest*

*Brain areas sensitive for shape-related and space-related perceptual contents independent of abstractness.* A conjunction analysis for shape-related form descriptive perceptual contents irrespective of the level of abstractness (aSH *>* aSP ∩ cSH *>* cSP) revealed enhanced neural responses in the left middle occipital gyrus (BA 37; see supplementary material **Table 7**).

**FIGURE 3 | Significant brain activations for abstractness, concreteness as well as for shape-related co-verbal gesture processing (whole-brain, *p* < 0.005, cluster extent threshold = 8 voxels; MC corrected *p* < 0.05) (cSP, concrete spatial; cSH, concrete shape; aSP, abstract spatial; aSH, abstract shape).**

#### **Table 6 | Interaction of semantic categories and abstractness.**

No region was found to be significantly activated for space vs. shape-related processing on concrete and abstract level (aSP *>* aSH ∩ cSP *>* cSH) (see supplementary material **Table 8**).

*Brain areas sensitive for abstractness independent of perceptual category (shape/space).* Common activations for abstract as opposed to concrete co-verbal gestures, irrespective of descriptive or spatial information (aSH *>* cSH ∩ aSP *>* cSP), resulted in a large cluster of activation encompassing the left temporal pole and the middle temporal gyrus. Another cluster of activation was found in the right superior temporal gyrus and in the left IFG, including the pars Orbitalis as well as the pars Triangularis (BA 44; see supplementary material **Table 9**).

The imaging results for concreteness independent of the shape-related or space-related perceptual content (cSH *>* aSH ∩ cSP *>* aSP) revealed enhanced BOLD responses in the left parahippocampal gyrus (BA 35; see supplementary material **Table 10**).

*Specific neural responses for abstractness in space-related (1) as well as in shape-related (2) content domains.* The specifically masked interaction analyses (see Contrasts of Interest section) revealed distinct activation for abstractness on space-related information [(aSP *>* cSP) *>* (aSH *>* cSH) masked for (aSP *>* cSP) and aSP] within the left IFG (MNIxyz: −53, 28, 0; *t* = 4.77; 42 voxels) and the left pTL (MNIxyz: −60, −46, 4; *t* = 3.07; 10 voxels; see **Figure 4**). The other direction of contrasts did not reveal any significant results.

Taken together, significant main effects and interactions of brain activation with regard to the manipulated factors [type of communicated perceptual information (SP, SH) and abstractness (c, a)] revealed different patterns of activation. The specific contrasts indicated that subregions of the left IFG and the left pTL have common [conjunction analyses: IFG (MNIxyz: −39, 28, −4; *t* = 3.27; 11 voxels), pTL (MNIxyz: −53, −38, 0; *t* = 4.26; 196 voxels)] and distinct functions [interaction: IFG (MNIxyz: −53, 28, 0; *t* = 4.77; 42 voxels), pTL (MNIxyz: −60, −46, 4; *t* = 3.07; 10 voxels)] with regard to perceptual type and abstractness.

The same analysis, including only right-handed gesture stimuli of equal length (speech duration), revealed the same pattern of activation encompassing the left IFG as well as the left middle temporal gyrus, indicating that this effect is not based on irrelevant differences in stimulus material.

#### **DISCUSSION**

Space and shape are distinct perceptual categories. Words referring to space and shape also describe abstract concepts like "rising income" (space) or a "square personality" (shape). Gestures are an important part of human communication that underpin verbal utterances and can convey shape or space information even when accompanying abstract sentences. Recent studies have investigated the neural processing of speech and gesture (Willems and Hagoort, 2007; Willems et al., 2007, 2009; Dick et al., 2009, 2012; Green et al., 2009; Hubbard et al., 2009; Kelly et al., 2010; Kircher et al., 2009; Skipper et al., 2009; Straube et al., 2009; Holle et al., 2010). Despite the fact that the investigation of perceptual categories used in speech and gesture could give important answers with regard to the effect of abstractness on particular neural networks relevant for the processing of such perceptual information, the related effect is not known. Thus, the purpose of the current fMRI study was to investigate the neural processing of shape-related vs. space-related co-speech gesture information when presented with abstract or concrete utterances aiming at the question whether similar or distinct neural networks are involved.

In line with previous findings (Straube et al., 2011a), we found enhanced cortical activations for abstract (a) as opposed to concrete (c) utterances in the bilateral temporal lobes and in the left IFG for both space-related and shape-related sentences (aSP *>* cSP and aSH *>* cSH). The interaction of perceptual category and abstractness in a more anterior part of the left IFG and inferior part of the pTL indicates that abstractness strongly influenced the neural processing of space and shape information. Only the effect of shape- vs. space-related information revealed activation in a single cluster of the left inferior occipital gyrus independent of abstractness (cSH *>* cSP ∩ aSH *>* aSP). By contrast, the interaction resulted in enhanced BOLD responses in a more anterior part of the left IFG and inferior part of the pTL. Thus, we demonstrate the interaction of perceptual category and abstractness on the neural processing of speech accompanied by gestures. These data suggest a functional division of the pTL and left IFG being sensitive to the processing of both the level of abstractness and the type of categorical information. These imaging results further offer neural support for the traditional categorization of co-verbal gestures with regard to their content and abstractness (McNeill, 1992, 2005).

The imaging results for the abstract co-verbal gesture condition revealed BOLD enhancements in the left inferior frontal and the bilateral temporal regions, respectively. This finding is consistent with previous evidence of involvement of the left IFG and bilateral temporal lobes in the integration of gestures with abstract sentences (Kircher et al., 2009; Straube et al., 2009, 2011a). With regard to the underlying neuro-cognitive processes, we assume that the concrete visual gesture information (e.g., illustrating an arch of a bridge) is being interpreted in the context of the abstract sentence meaning ("the politician builds a bridge to the next topic"). Thus, correspondence of gesture and sentence meaning must be identified and figurative components of speech and gesture must be translated from their literal/concrete meanings. To build this relation between speech and gesture information on the level of abstractness, additional online unification processes within the IFG seem to be relevant (Straube et al., 2011a). Such processes might be similar to those responsible for making inferences (e.g., Bunge et al., 2009), relational reasoning (e.g., Wendelken et al., 2008), the building of analogies (e.g., Luo et al., 2003; Bunge et al., 2005; Green et al., 2006; Watson and Chatterjee, 2012), and unification (Hagoort et al., 2009; Straube et al., 2011a). Those processes may also be involved in the comprehension of novel metaphoric or ambiguous communications and consistently activate the left IFG (Rapp et al., 2004, 2007; Stringaris et al., 2007; Chen et al., 2008; Cardillo et al., 2012). Consequently, enhanced neural responses in the frontotemporal network may be evoked by the higher cognitive demand in an abstract metaphoric context which may have resulted in the recruitment of the left inferior frontal and middle temporal region (Kircher et al., 2009; Straube et al., 2011a).

Concrete speech accompanied by gestures revealed a pattern of enhanced BOLD responses in parahippocampal regions bilaterally as well as in the left superior occipital gyrus. Concrete co-verbal utterances such as "the workman builds a bridge over the river" evoke a comparatively transparent connection/relation to a familiar everyday event. Accordingly, an experience-based understanding of a scene may have resulted in the recruitment of the parahippocampal regions, whereas the direct imagery of concrete objects or actions may have resulted in enhanced neural responses in the left superior occipital region (Green et al., 2009), facilitating the understanding of the concrete co-verbal content.

The shape-related sentences accompanied by shape-related gestures revealed activations in the left middle occipital region. Similar to the activations found for the concrete condition (c *>* a), imagery of an experience-based perceptual representation resulted in the activations of the left occipital area. However, we did not observe common activation for the processing of spatial information in a concrete and abstract sentence context. Together these data do not support a universal neural processing of space and shape in a multimodal communication context.

By contrast, we found an interaction for perceptual category and abstractness, as spatial information on an abstract level (aSP) specifically (in contrast to all other conditions) activated a particular part of the left IFG and the left superior temporal region. This finding was robust and independent of both hand movement and speech duration. Thus, BOLD enhancements in these regions suggest that predominantly spatial information is processed differently in an abstract vs. concrete sentence context. Additional semantic information is retrieved from the left superior temporal region. The higher cognitive load together with the resulting enhanced effort with regard to information-specific abstract and spatial lexical retrieval may account for the recruitment of the fronto-temporal network. However, specific activation of the left IFG could also represent competition between meanings of spatial terms in the aSP condition, including at a minimum the concrete/literal and the abstract/metaphoric interpretations (Chatterjee, 2008; Chen et al., 2008).

For the processing of shape-related information we found common activation within the inferior temporal gyrus and the occipital lobe for concrete and abstract utterances, suggesting a common perceptual representation activated during comprehension of shape information. This perceptual representation probably compensated for the need of additional resources of the IFG and pTL, which were activated for space-related information in an abstract sentence context. Thus, this finding suggests that a concrete representation of shape is also activated in an abstract sentence context. This might have further facilitated the processing of the abstract representation of shape. For the processing of space-related information we found no common activation for concrete and abstract utterances, indicating different neural processing mechanisms for the two types of communication. The transformation of space-related gesture information in an abstract sentence context probably required higher order semantic processing mechanisms (Straube et al., 2011a), which probably inhibited the actual perceptual spatial representation of these gestures.

A limitation of this study is that the specific effects of gesture as well as integration processes cannot be disentangled. Distinguishing between speech and gesture was not the purpose of the current study. The problem with regard to the interpretation of our results for the main effect of abstractness, irrespective of perceptual category, might be that the activation patterns found for abstract speech accompanied by gestures in the left IFG and bilateral temporal lobes are produced by differences in the abstractness between the sentences, as demonstrated by several studies about metaphoric speech processing (Rapp et al., 2004, 2007; Eviatar and Just, 2006; Mashal et al., 2007, 2009; Nagels et al., 2013; Stringaris et al., 2007; Chen et al., 2008). However, in a previous study we observed increased activation in the left IFG for metaphoric co-verbal gestures in contrast to control sentences with the identical abstract semantic content (Kircher et al., 2009). Furthermore, there is evidence that activation of the left IFG is specifically related to the processing of novel and therefore unconventional metaphoric sentences (Rapp et al., 2004, 2007; Cardillo et al., 2012), in which abstract information must be interpreted online in terms of its non-literal meaning. However, the abstract sentences used in the current study were conventional and part of everyday communication, e.g., "The talk was on a high level." This is supported by our rating results, which revealed no differences between the conditions with regard to familiarity. Although we cannot exclude that differences between conditions might be explained by differences in difficulty due to our language manipulation (concrete vs. abstract), the lack of commonalities (e.g., SPa *>* SHa ∩ SPc *>* SHc) cannot be explained by these potential differences.
The robustness of the imaging results in the aforementioned regions is further supported by the separate control analyses encompassing a carefully matched subset of paired (hand movements and speech length) stimuli.

A further limitation is that the distinction between space- and shape-related information in the current experiment is artificial, and the two categories do not represent independent factors. Shape gestures include some spatial information. However, despite this intrinsic connection between space and shape, our data demonstrate that these perceptual categories can be distinguished by independent raters and produce distinct interacting activation patterns with regard to abstractness. Therefore, our data support the validity of this separation, which has been traditionally applied in terms of deictic or abstract deictic gestures (which refer to space) in contrast to iconic and metaphoric gestures (which rather refer to form or shape; e.g., McNeill, 1992).

With this study we demonstrate the interaction of perceptual category and abstractness in the neural processing of speech-gesture utterances. Besides abstractness, the type of information was relevant to the neural processing of speech accompanied by gestures. This finding illustrates the relevance of the interaction between language and cognition, which characterizes the complexity of natural interpersonal communication. Future studies should therefore consider the importance of perceptual type and abstractness for the interpretation of their imaging results. Our data suggest a functional subdivision of the pTL and left IFG with regard to the processing of space and shape-related information in an abstract sentence context. Such differences support the theoretically based traditional categorization of co-verbal gestures with regard to information type and abstractness (McNeill, 1992). Most likely the investigation of other types of co-verbal gestures will demonstrate further important differences in the processing of specific co-verbal gesture types, which will illuminate the fine-grained differences in the processing mechanisms that underlie the comprehension of multimodal natural communication.

#### **ACKNOWLEDGMENTS**

This research project is supported by a grant from the Interdisciplinary Center for Clinical Research "BIOMAT" (IZKF VV N68). Arne Nagels is supported by a grant from the "Deutsche Forschungsgemeinschaft" (DFG: Ki 588/6-1), and Benjamin Straube is supported by the BMBF (project no. 01GV0615). We thank Katharina Augustin, Bettina Freese and Simone Schröder for the preparation and evaluation of the stimulus material.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnbeh.2013.00181/abstract

**Table 7 | Brain areas sensitive for shape-related contents (independent of abstractness).** Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate. (cSP, concrete spatial; cSH, concrete shape; aSP, abstract spatial; aSH, abstract shape).

**Table 8 | Brain areas sensitive for space-related contents (independent of abstractness).** Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate. (cSP, concrete spatial; cSH, concrete shape; aSP, abstract spatial; aSH, abstract shape).

**Table 9 | Brain areas sensitive for abstractness (independent of content).**

Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate. (cSP, concrete spatial; cSH, concrete shape; aSP, abstract spatial; aSH, abstract shape).

#### **Table 10 | Brain areas sensitive for concreteness (independent of content).**

Significance level (t-value), size of the respective activation cluster (No. voxels; number of voxels > 8) at p < 0.005 MC corrected for multiple comparisons. Coordinates are listed in MNI space. BA is the Brodmann area nearest to the coordinate and should be considered approximate. (cSP, concrete spatial; cSH, concrete shape; aSP, abstract spatial; aSH, abstract shape).

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 02 July 2013; accepted: 11 November 2013; published online: 18 December 2013.*

*Citation: Nagels A, Chatterjee A, Kircher T and Straube B (2013) The role of semantic abstractness and perceptual category in processing speech accompanied by gestures. Front. Behav. Neurosci. 7:181. doi: 10.3389/fnbeh.2013.00181*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2013 Nagels, Chatterjee, Kircher and Straube. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Toward a self-organizing pre-symbolic neural model representing sensorimotor primitives

#### *Junpei Zhong<sup>1,2</sup>\*, Angelo Cangelosi<sup>3</sup> and Stefan Wermter<sup>1</sup>*

*<sup>1</sup> Department of Computer Science, University of Hamburg, Hamburg, Germany*

*<sup>2</sup> School of Computer Science, University of Hertfordshire, Hatfield, UK*

*<sup>3</sup> School of Computing and Mathematics, University of Plymouth, Plymouth, UK*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Matthew Schlesinger, Southern Illinois University, USA Stefano Nolfi, Institute of Cognitive Sciences and Technologies CNR, Italy*

#### *\*Correspondence:*

*Junpei Zhong, Department of Computer Science, Knowledge Technology, University of Hamburg, Vogt-Kölln-Straße 30, 22527 Hamburg, Germany e-mail: zhong@informatik.uni-hamburg.de*

The acquisition of symbolic and linguistic representations of sensorimotor behavior is a cognitive process performed by an agent when it is executing and/or observing its own and others' actions. According to Piaget's theory of cognitive development, these representations develop during the sensorimotor stage and the pre-operational stage. We propose a model that relates the conceptualization of the higher-level information from visual stimuli to the development of ventral/dorsal visual streams. This model employs a neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model. We exemplify this model through a robot passively observing an object to learn its features and movements. During the learning process of observing sensorimotor primitives, i.e., observing a set of trajectories of arm movements and its oriented object features, the pre-symbolic representation is self-organized in the parametric units. These representational units act as bifurcation parameters, guiding the robot to recognize and predict various learned sensorimotor primitives. The pre-symbolic representation also accounts for the learning of sensorimotor primitives in a latent learning context.

**Keywords: pre-symbolic communication, sensorimotor integration, recurrent neural networks, parametric biases, horizontal product**

#### **1. INTRODUCTION**

Although infants are not supposed to acquire the symbolic representational system at the sensorimotor stage, based on Piaget's definition of infant development, the preparation of language development, such as a pre-symbolic representation for conceptualization, has been set at the time when the infant starts babbling (Mandler, 1999). Experiments have shown that infants have established the concept of animate and inanimate objects, even if they have not yet seen the objects before (Gelman and Spelke, 1981). Similar phenomena also include the conceptualization of object affordances such as the conceptualization of containment (Bonniec, 1985). This conceptualization mechanism is developed at the sensorimotor stage to represent sensorimotor primitives and other object-affordance related properties.

During an infant's development at the sensorimotor stage, one way to learn affordances is to interact with objects using tactile perception, observe the object from visual perception and thus learn the causality relation between the visual features, affordance and movements as well as to conceptualize them. This learning starts with the basic ability to move an arm toward the visual-fixated objects in new-born infants (Von Hofsten, 1982), continues through object-directed reaching at the age of 4 months (Streri et al., 1993; Corbetta and Snapp-Childs, 2009), and can also be found during the object exploration of older infants (cf. Ruff, 1984; Mandler, 1992). From these interactions leading to visual and tactile percepts, infants gain experience through the instantiated "bottom-up" knowledge about object affordances and sensorimotor primitives. Building on this, infants at the age of around 8–12 months gradually expand the concept of object features, affordances and the possible causal movements in the sensorimotor context (Gibson, 1988; Newman et al., 2001; Rocha et al., 2006). For instance, they realize that it is possible to pull a string that is tied to a toy car to fetch it instead of crawling toward it. An associative rule has also been built that connects conceptualized visual feature inputs, object affordance and the corresponding frequent auditory inputs of words, across various contexts (Romberg and Saffran, 2010). At this stage, categories of object features are particularly learned in different contexts due to their affordance-invariance (Bloom et al., 1993).

Therefore the integrated learning process of the object's features, movements according to the affordances, and other knowledge is a globally conceptualized process through visual and tactile perception. This conceptualized learning is a precursor of a pre-symbolic representation of language development. This learning is the process to form an abstract and simplified representation for information exchange and sharing<sup>1</sup>. Conceptualization from visual perception usually includes a planning process: first the speaker receives and segments visual knowledge in the perceptual flow into a number of states on the basis of different criteria, then the speaker selects essential elements, such as the units to be verbalized, and last the speaker constructs certain temporal perspectives when the events have to be anchored and linked (cf. Habel and Tappe, 1999; von Stutterheim and Nuse, 2003). Assuming this planning process is distributed between ventral and dorsal streams, the conceptualization process should also emerge from the visual information that is perceived in each stream, associating the distributed information in both streams. As a result, the candidate concepts of visual information are statistically associated with the input stimuli. For instance, they may represent a particular visual feature with a particular class of label (e.g., a particular visual stimulus with an auditory wording "circle") (Chemla et al., 2009). Furthermore, the establishment of such links also strengthens the high-order associations that generate predictions and generalize to novel visual stimuli (Yu, 2008). Once the infants have learned a sufficient number of words, they begin to detect a particular conceptualized cue with a specific kind of wording. At this stage, infants begin to use their own conceptualized visual "database" of known words to identify a novel meaning class and possibly to extend their wording vocabulary (Smith et al., 2002). Thus, this associative learning process enables the acquisition and the extension of the concepts of domain-specific information (e.g., features and movements in our experiments) with the visual stimuli.

<sup>1</sup>For comparison of conceptualization between engineering and language perspectives, see (Gruber and Olsen, 1994; Bowerman and Levinson, 2001).

This conceptualization further results in a pre-symbolic way for infants to communicate when they encounter a conceptualized object and intend to execute a correspondingly conceptualized, well-practised sensorimotor action toward that object. For example, behavioral studies showed that when 8-to-11-month-old infants are unable to reach and pick up an empty cup, they may point it out to their parents and execute an arm movement as if bringing it to their lips. The conceptualized shape of a cup reminds infants of its affordance, and thus they can communicate in a pre-symbolic way. The emergence of pre-symbolic communication from conceptualized visual stimuli also gives rise to the different periods of learning nouns and verbs in infant development (cf. Gentner, 1982; Tardif, 1996; Bassano, 2000). This evidence suggests that the production of verbs and nouns is not tied to the same modality of sensory perception: experiments performed by Kersten (1998) suggest that nouns are more related to the movement orientation caused by the intrinsic properties of an object, while verbs are more related to the trajectories of an object. Thus we argue that such differences in the acquisition of lexical classes also relate to the conceptualized visual ventral and dorsal streams. This finding is consistent with the hypothesis of Damasio and Tranel (1993) that verb generation is modulated by the perception of conceptualization of movement and its spatio-temporal relationship.

For this reason, we propose that the conceptualized visual information, which is a prerequisite for pre-symbolic communication, is also modulated by perception in the two visual streams. Previous studies have modeled the functional modularity in the development of ventral and dorsal streams (e.g., Jacobs et al., 1991; Mareschal et al., 1999), proposed bilinear models of visual routing (e.g., Olshausen et al., 1993; Memisevic and Hinton, 2007; Bergmann and von der Malsburg, 2011), in which a set of control neurons dynamically modifies the weights of the "what" pathway on a short time scale, and developed transform-invariance models (e.g., Földiák, 1991; Wiskott and Sejnowski, 2002), which encourage neurons to fire invariantly while transformations are applied to their input stimuli. However, a model that explains the development of conceptualization from both streams and yields an explicit representation of conceptualization of both streams while the visual stimuli are presented is still missing in the literature. This conceptualization should be able to encode the same category for information flows in both ventral and dorsal streams, like "object files" in visual understanding (Fields, 2011), so that they can be discriminated in different contexts during language development.

On the other hand, this conceptualized representation, distributed over the two visual streams, is also able to predict the tendency of appearance of an action-oriented object in the visual field, which underlies sensorimotor phenomena such as object permanence (Tomasello and Farrar, 1986), showing that infants' attention is usually driven by the object's features and movements. For instance, when infants observe the movement of an object, recordings showed an increase in looking times when the visual information after occlusion is violated in either surface features or location (Mareschal and Johnson, 2003). Words and sounds also play a top–down role in infants' early visual attention (Sloutsky and Robinson, 2008). This could hint at the different development stages of the ventral and dorsal streams and their effect on the conceptualized prediction mechanism in the infant's consciousness. Accordingly, the model we propose for conceptualized visual information should also be able to explain the emergence of a predictive function in the sensorimotor system, e.g., the ventral stream attempts to track the object and the dorsal stream processes and predicts the object's spatial location, when the sensorimotor system is involved in an object interaction. We are aware that this built-in predictive function in a forward sensorimotor system is essential: neuroimaging research has revealed the existence of internal forward models in the parietal lobe and the cerebellum that predict sensory consequences from efference copies of motor commands (Kawato et al., 2003) and support fast motor reactions (e.g., Hollerbach, 1982). Since the probable position and the movement pattern of the action should be predicted on a short time scale, sensory feedback produced by a forward model with negligible delay is necessary in this sensorimotor loop.

Particularly, the predictive sensorimotor model we propose is suitable to work as one of the building modules that takes into account the predictive object movement in a forward sensorimotor system to deal with object interaction from visual stimuli input as **Figure 1** shows. This system is similar to Wolpert et al. (1995)'s sensorimotor integration, but it includes an additional sensory estimator (the lower brown block) which takes into account the visual stimuli from the object so that it is able to predict the dynamics of both the end-effector (which is accomplished by the upper brown block) and the sensory input of the object. This object-predictive module is essential in a sensorimotor system to generate sensorimotor actions like tracking and avoiding when dealing with fast-moving objects, e.g., in ball sports. We also assert that the additional inclusion of forward models in the visual perception of the objects can explain some predictive developmental sensorimotor phenomena, such as object permanence.

In summary, we propose a model that establishes links between the development of ventral/dorsal visual streams and the emergence of conceptualization in the visual streams, which further leads to the predictive function of a sensorimotor system. To validate this proof-of-concept model, we also conducted experiments in a simplified robotics scenario. Two NAO robots were employed in the experiments: one of them was used as a "presenter" and moved its arm along pre-programmed trajectories as motion primitives. A ball was attached at the end of the arm so that the other robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the "observer". In this way, the observer robot perceived the object movement passively through its vision, so that its network took the object's visual features and movements into account. Although we could also have used one robot and a human presenter to run the same tasks, we used two identical robots for the following reasons: (1) the object movement trajectories can be produced by pre-programmed machinery, so that their types and parameters can be adjusted; (2) the use of two identical robots allows the roles of the presenter and observer to be interchanged more easily. As in other humanoid robots, a sensorimotor cycle composed of cameras and motors also exists in NAO robots. Although the physical configurations and parameters of its sensory and motor systems are different from those of human beings or other biological systems, our model only handles the pre-processed information extracted from visual stimuli. Therefore it is sufficient to serve as a neural model, running on a robot CPU, to explain language development in the cortical areas.

#### **2. MATERIALS AND METHODS**

#### **2.1. NETWORK MODEL**

A similar forward model exhibiting sensory prediction for visual object perception was proposed in our recently published work (Zhong et al., 2012b), where we suggested an RNN implementation of the sensory forward model. Together with a CACLA-trained multi-layer network as a controller model, the forward model embodied in a robot receiving visual landmark percepts enabled smooth and robust robot behavior. However, one drawback of this work was its inability to store multiple sets of spatial-temporal input–output mappings, i.e., the learning did not converge if several spatial-temporal mapping sequences appeared in the training. Consequently, a simple RNN was not able to predict different sensory percepts for different reward-driven tasks. Another problem was that it assumed only one visual feature appeared in the robot's visual field, and that was the only visual cue it could learn during development. To solve the first problem, we further augment the RNN with parametric bias (PB) units. They are connected like ordinary biases, but their internal values are also updated through back-propagation. Compared with the generic RNN, the additional PB units in this network act as bifurcation parameters for the non-linear dynamics. According to Cuijpers et al. (2009), a trained RNNPB can successfully retrieve and recognize different types of pre-learned, non-linear oscillation dynamics. Thus, this bifurcation function can be regarded as an expansion of the storage capability of working memory within the sensory system. Furthermore, it adds the generalization ability of the PB units, in terms of recognizing and generating non-linear dynamics. To tackle the second problem, in order to realize sensorimotor prediction behaviors such as object permanence, the model should be able to learn objects' features and object movements separately in the ventral and dorsal visual streams, as we have shown in Zhong et al. (2012a).

Merging these two ideas, in the context of sensorimotor integration in hand-object interaction, the PB units can be considered as a small set of high-level conceptualized units that describe various types of non-linear dynamics of visual percepts, such as features and movements. This representation is more related to the "natural prototypes" from visual perception, for instance, than a specific language representation (Rosch, 1973).

The development of PB units can also be seen as the pre-symbolic communication that emerges during sensorimotor learning. The conceptualization, on the other hand, could also result in the prediction of future visual percepts of moving objects in sensorimotor integration.

In this model (**Figure 2**), we propose a three-layer, horizontal product Elman network with PB units. Similar to the original RNNPB model, the network can be executed in three running modes, according to the pre-known conditions of inputs and outputs: learning, recognition and prediction. In learning mode, the representations of object features and movements are first encoded in the weights of both streams, while the bifurcation parameters with a smaller number of dimensions are encoded in the PB units. This is consistent with the emergence of conceptualization at the sensorimotor stage of infant development.

Apart from the PB units, another novelty in the network is that the visual object information is encoded in two neural streams and is further conceptualized in PB units. The two streams share the same set of input neurons, where the coordinates of the object in the visual field are used as identities of the perceived images. The appearance of values in different layers represents different visual features: in our experiment, the color of the object detected by the yellow filter appears in the first layer whereas the color detected by the green filter appears in the second layer; the other layer remains zero. For instance, the input *((0, 0), (x, y))* represents a green object at *(x, y)* coordinates in the visual field. The hidden layer contains two independent sets of units representing dorsal-like "*d*" and ventral-like "*v*" neurons, respectively. These two sets of neurons are inspired by the functional properties of dorsal and ventral streams: (1) fast responding dorsal-like units predict object position and hence encode movements; (2) slow responding ventral-like units represent object features. The recurrent connections in the hidden layers also help to predict movements in layer *d* and to maintain a persistent representation of an object's feature in layer *v*. The horizontal product brings both pathways together again in the output layer with one-step ahead predictions. Let us denote the output layer's input from layer *d* and layer *v* as *x<sup>d</sup>* and *x<sup>v</sup>*, respectively. The network output *s<sup>o</sup>* is obtained via the horizontal product as

$$s^{o} = x^{d} \odot x^{v} \tag{1}$$

where ⊙ indicates element-wise multiplication, so each pixel is defined by the product of two independent parts, i.e., for output unit *k* it is *s<sup>o</sup><sub>k</sub>* = *x<sup>d</sup><sub>k</sub>* · *x<sup>v</sup><sub>k</sub>*.
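The input coding and Equation (1) can be illustrated with a minimal sketch. The flattening of the two input layers into one four-element list, the ordering (yellow first, green second) and the function names `encode_input` and `horizontal_product` are assumptions made for illustration only; they are not taken from the original implementation.

```python
def encode_input(color, x, y):
    """Put the centroid (x, y) in the slot of the detected colour.

    Assumed layout: [yellow_x, yellow_y, green_x, green_y]; the slot of
    the colour that was not detected stays at zero.
    """
    if color == "yellow":
        return [x, y, 0.0, 0.0]
    if color == "green":
        return [0.0, 0.0, x, y]
    raise ValueError("unknown colour: %r" % (color,))


def horizontal_product(x_d, x_v):
    """Eq. (1): element-wise product of dorsal and ventral contributions."""
    if len(x_d) != len(x_v):
        raise ValueError("both streams must feed the same number of outputs")
    return [a * b for a, b in zip(x_d, x_v)]
```

With this coding, a green ball at *(0.3, 0.7)* becomes `[0.0, 0.0, 0.3, 0.7]`, which mirrors the *((0, 0), (x, y))* example above.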

#### **2.2. NEURAL DYNAMICS**

We use *s<sup>b</sup>(t)* to represent the activation of the input units and *PB<sup>d/v</sup>(t)* to represent the activation of the dorsal/ventral PB units at time-step *t*. In some of the following equations, the time-index *t* is omitted if all activations are from the same time-step. The inputs to the hidden units, *y<sup>v</sup><sub>j</sub>* in the ventral stream and *y<sup>d</sup><sub>l</sub>* in the dorsal stream, are defined as

$$y_l^d(t) = \sum_i s_i^b(t) w_{li}^d + \sum_{l'} s_{l'}^d(t-1) v_{ll'}^d + \sum_{n_2} PB_{n_2}^{v}(t) \bar{w}_{ln_2}^d \tag{2}$$

$$y_j^v(t) = \sum_i s_i^b(t) w_{ji}^v + \sum_{j'} s_{j'}^v(t-1) v_{jj'}^v + \sum_{n_1} PB_{n_1}^{d}(t) \bar{w}_{jn_1}^v \tag{3}$$

where *w<sup>d</sup><sub>li</sub>* and *w<sup>v</sup><sub>ji</sub>* represent the weighting matrices between the dorsal/ventral layers and the input layer, *w̄<sup>d</sup><sub>ln<sub>2</sub></sub>* and *w̄<sup>v</sup><sub>jn<sub>1</sub></sub>* represent the weighting matrices between the PB units and the two hidden layers, and *v<sup>d</sup><sub>ll′</sub>* and *v<sup>v</sup><sub>jj′</sub>* indicate the recurrent weighting matrices within the hidden layers.

The transfer functions in both hidden layers and the PB units all employ the sigmoid function recommended by LeCun et al. (1998),

$$s_{l/j}^{d/v} = 1.7159 \cdot \tanh\left(\frac{2}{3} y_{l/j}^{d/v}\right) \tag{4}$$

$$PB_{n_1/n_2}^{d/v} = 1.7159 \cdot \tanh\left(\frac{2}{3} \rho_{n_1/n_2}^{d/v}\right) \tag{5}$$

where *ρ<sup>d/v</sup>* represent the internal values of the PB units.

The terms of the horizontal products of both pathways can be presented as follows:

$$x_k^v = \sum_j s_j^v u_{kj}^v; \quad x_k^d = \sum_l s_l^d u_{kl}^d \tag{6}$$

The output of the two streams composes a horizontal product for the network output as we defined in Equation (1).
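To make the data flow of Equations (1)–(6) concrete, the following is a minimal pure-Python sketch of one forward time-step. The dictionary keys, the function names and the row-major list-of-lists weight layout are hypothetical choices for illustration; the sketch follows the equations as printed (the dorsal hidden layer receives *PB<sup>v</sup>*, the ventral hidden layer receives *PB<sup>d</sup>*) and is not the authors' implementation.

```python
import math


def lecun_tanh(y):
    """Scaled tanh used in Eqs. (4) and (5)."""
    return 1.7159 * math.tanh(2.0 * y / 3.0)


def matvec(weights, vector):
    """Multiply a row-major weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def stream_input(s_b, s_prev, pb, w_in, v_rec, w_pb):
    """Eq. (2) or (3): input, recurrent and PB contributions of one stream."""
    return [a + b + c for a, b, c in zip(
        matvec(w_in, s_b), matvec(v_rec, s_prev), matvec(w_pb, pb))]


def forward_step(s_b, state, p):
    """One time-step; `p` holds weights and PB activations (assumed names)."""
    # Dorsal hidden layer, Eq. (2) followed by Eq. (4).
    s_d = [lecun_tanh(y) for y in stream_input(
        s_b, state["d"], p["pb_v"], p["w_d"], p["v_d"], p["wpb_d"])]
    # Ventral hidden layer, Eq. (3) followed by Eq. (4).
    s_v = [lecun_tanh(y) for y in stream_input(
        s_b, state["v"], p["pb_d"], p["w_v"], p["v_v"], p["wpb_v"])]
    x_d = matvec(p["u_d"], s_d)  # Eq. (6), dorsal term
    x_v = matvec(p["u_v"], s_v)  # Eq. (6), ventral term
    s_o = [a * b for a, b in zip(x_d, x_v)]  # Eq. (1)
    return s_o, {"d": s_d, "v": s_v}
```

Because the output is a product, a stream whose contribution is zero silences the corresponding output unit regardless of the other stream.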

#### *2.2.1. Learning mode*

The training progress is basically determined by the cost function:

$$C = \frac{1}{2} \sum_{t}^{T} \sum_{k}^{N} \left(s_k^b(t+1) - s_k^o(t)\right)^2 \tag{7}$$

where *s<sup>b</sup><sub>k</sub>(t + 1)* is the one-step ahead input (as well as the desired output), *s<sup>o</sup><sub>k</sub>(t)* is the current output, *T* is the total number of available time-step samples in a complete sensorimotor sequence and *N* is the number of output nodes, which is equal to the number of input nodes. Following gradient descent, each weight update in the network is proportional to the negative gradient of the cost with respect to the specific weight *w* that will be updated:

$$
\Delta w_{ij} = -\eta_{ij} \frac{\partial C}{\partial w_{ij}} \tag{8}
$$

where *η<sub>ij</sub>* is the adaptive learning rate of the weights between neuron *i* and *j*, which is adjusted in every epoch (Kleesiek et al., 2013). To determine whether the learning rate has to be increased or decreased, we compare the gradients with respect to the weight *w<sub>i,j</sub>* in consecutive epochs through their product:

$$
\sigma_{i,j} = \frac{\partial C}{\partial w_{i,j}}(e-1) \frac{\partial C}{\partial w_{i,j}}(e) \tag{9}
$$

The update of the learning rate is

$$\eta_{i,j}(e) = \begin{cases} \min(\eta_{i,j}(e-1) \cdot \xi^+, \eta_{\max}) & \text{if } \sigma_{i,j} > 0, \\ \max(\eta_{i,j}(e-1) \cdot \xi^-, \eta_{\min}) & \text{if } \sigma_{i,j} < 0, \\ \eta_{i,j}(e-1) & \text{else}. \end{cases}$$

where *ξ<sup>+</sup>* > 1 and *ξ<sup>−</sup>* < 1 represent the increasing/decreasing rate of the adaptive learning rates, with *η<sub>min</sub>* and *η<sub>max</sub>* as lower and upper bounds, respectively. Thus, the learning rate of a particular weight increases by *ξ<sup>+</sup>* to speed up the learning when the changes of that weight from two consecutive epochs have the same sign, and vice versa.
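The learning-rate rule built on Equation (9) can be sketched as a small function. The default values of `xi_plus`, `xi_minus`, `eta_min` and `eta_max` below are placeholders chosen for illustration, not the settings used in the experiments.

```python
def update_learning_rate(eta_prev, grad_prev, grad_curr,
                         xi_plus=1.2, xi_minus=0.5,
                         eta_min=1e-6, eta_max=1.0):
    """Adapt one per-weight learning rate from two consecutive gradients.

    All default constants are illustrative assumptions.
    """
    sigma = grad_prev * grad_curr  # Eq. (9)
    if sigma > 0:
        # Same sign in both epochs: speed up, but stay below eta_max.
        return min(eta_prev * xi_plus, eta_max)
    if sigma < 0:
        # Sign flip: slow down, but stay above eta_min.
        return max(eta_prev * xi_minus, eta_min)
    return eta_prev
```

The bounds keep a long run of same-sign gradients from growing the rate without limit and a run of sign flips from shrinking it to zero.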

Besides the usual weight update according to backpropagation through time, the accumulated error over the whole time-series also contributes to the update of the PB units. The update for the *i*-th unit in the PB vector for a time-series of length *T* is defined as:

$$\rho_i(e+1) = \rho_i(e) + \gamma_i \sum_{t=1}^{T} \delta_{i,j}^{PB} \tag{10}$$

where *δ<sup>PB</sup>* is the error back-propagated to the PB units, *e* indexes the *e*-th pass over the whole time-series (i.e., the epoch), and *γ<sub>i</sub>* is the adaptive updating rate of the PB units, which is proportional to the absolute mean value of the back-propagation error at the *i*-th PB node over the complete time-series of length *T*:

$$\gamma_i \propto \frac{1}{T} \left| \sum_{t=1}^{T} \delta_{i,j}^{PB} \right| \tag{11}$$

The adaptive technique is applied because the PB units were found to converge with difficulty. Usually a smaller learning rate is used in the generic version of RNNPB to ensure the convergence of the network. However, this results in a trade-off in convergence speed. The adaptive learning rate is an efficient technique to overcome this trade-off (Kleesiek et al., 2013).
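A minimal sketch of the PB update of Equations (10) and (11) for a single PB unit is given below. The argument `scale` stands for the unspecified proportionality constant in Equation (11); its default value and the function name are assumptions.

```python
def update_pb_internal(rho, deltas, scale=0.1):
    """Update one PB internal value from the errors of a whole sequence.

    `deltas` lists the back-propagated error at this PB unit for each
    time-step; `scale` is a hypothetical proportionality constant.
    """
    if not deltas:
        raise ValueError("at least one time-step is required")
    total = sum(deltas)
    gamma = scale * abs(total) / len(deltas)  # Eq. (11)
    return rho + gamma * total                # Eq. (10)
```

Errors of opposite sign cancel in the sum, so a sequence whose errors average to zero leaves the internal value unchanged.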

#### *2.2.2. Recognition mode*

The recognition mode is executed with a similar information flow as the learning mode: given a set of the spatio-temporal sequences, the error between the target and the real output is back-propagated through the network to the PB units. However, the synaptic weights remain constant and only the PB units will be updated, so that the PB units are self-organized as the pre-trained values after certain epochs. Assuming the length of the observed sequence is *a*, the update rule is defined as:

$$\rho_i(e+1) = \rho_i(e) + \gamma \sum_{t=T-a}^{T} \delta_{i,j}^{PB} \tag{12}$$

where *δ<sup>PB</sup>* is the error back-propagated from a certain sensory information sequence to the PB units and *γ* is the updating rate of the PB units in recognition mode, which should be larger than the adaptive rate *γ<sub>i</sub>* in learning mode.
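The recognition rule of Equation (12) can be sketched as a loop that leaves the weights untouched and only moves the PB internal value. The callback `pb_errors`, which is assumed to return the back-propagated errors over the last *a* time-steps for the current PB value, and the one-parameter toy model in the usage lines are hypothetical stand-ins for the frozen network.

```python
def recognise_pb(rho, pb_errors, gamma=0.05, epochs=200):
    """Iterate Eq. (12) with frozen weights; only rho changes.

    pb_errors(rho) is an assumed callback returning the list of
    back-propagated errors over the observation window.
    """
    for _ in range(epochs):
        rho = rho + gamma * sum(pb_errors(rho))
    return rho


# Toy stand-in for a trained network: output = rho * input, where the
# observed sequence was generated with rho = 0.8.
observed_inputs = [1.0, 2.0]
observed_targets = [0.8 * s for s in observed_inputs]


def toy_errors(rho):
    # Negative gradient of the squared error with respect to rho.
    return [(target - rho * s) * s
            for s, target in zip(observed_inputs, observed_targets)]


recovered = recognise_pb(0.0, toy_errors)
```

In this toy case the PB value settles at the value that generated the observed sequence, which is the behavior the recognition mode relies on.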

#### *2.2.3. Prediction mode*

The values of the PB units can also be manually set or obtained from recognition, so that the network can generate the upcoming sequence with one-step prediction.

#### **3. RESULTS**

In this experiment, as introduced above, we examined the network by implementing it on two NAO robots. They were placed face-to-face in a rectangular box of 61.5 cm × 19.2 cm, as shown in **Figure 3**. These distances were carefully adjusted so that the observer was able to keep track of the movement trajectories in its visual field during all experiments. The NAO robot has two cameras; we used the lower one to capture the images because its installation angle is more suitable for tracking the balls when they are held in the other NAO's hand.

Two balls of 3.8 cm diameter, one yellow and one green, were used for the following experiments. The presenter consecutively held each of the balls to present the object interaction. The original image, received from the lower camera of the observer, was pre-processed with thresholding in HSV color-space, and the coordinates of the ball's centroid were calculated from the image moments. Here we only considered two different colors as the only feature to be encoded in the ventral stream, as well as two sets of movement trajectories encoded in the dorsal stream. Although we have only tested a few categories of trajectories and features, we believe the results can be extrapolated to multiple categories in future applications.
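The pre-processing step (color thresholding followed by a centroid from image moments) can be sketched without any vision library. The image representation as nested lists of `(h, s, v)` tuples and the threshold bounds used in the usage lines are illustrative assumptions; the thresholds used on the robot are not given here.

```python
def centroid_of_color(hsv_image, lower, upper):
    """Centroid (x, y) of all pixels inside an HSV range, or None.

    hsv_image is a list of rows, each row a list of (h, s, v) tuples.
    The centroid is m10/m00, m01/m00 of the binary mask.
    """
    m00 = m10 = m01 = 0
    for row_idx, row in enumerate(hsv_image):
        for col_idx, pixel in enumerate(row):
            if all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper)):
                m00 += 1          # zeroth moment: mask area
                m10 += col_idx    # first moment along x
                m01 += row_idx    # first moment along y
    if m00 == 0:
        return None
    return (m10 / m00, m01 / m00)


# Usage with made-up values: two "yellow" pixels in a 3 x 3 image.
background, ball = (0, 0, 0), (30, 200, 200)
image = [[background, ball, background],
         [background, background, background],
         [background, ball, background]]
yellow_centroid = centroid_of_color(image, (20, 100, 100), (35, 255, 255))
```

Running one such filter per color yields the coordinates that fill the corresponding input layer, while the layer of the undetected color stays at zero.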

#### **3.1. LEARNING**

**FIGURE 3 | Experimental scenario: two NAOs are standing face-to-face within a rectangular box.**

The two different trajectories are defined as follows. The *cosine* curve,

$$x = 12 \tag{13}$$

$$y = 8 \cdot \left(-\frac{t}{2}\right) + 0.04 \tag{14}$$

$$z = 4 \cdot \cos(2t) + 0.10 \tag{15}$$

and the *square* curve,

$$x = 12 \tag{16}$$

$$y = \begin{cases} 0 & t \le -\frac{3\pi}{4} \\ \frac{16}{\pi}t + 12 & -\frac{3\pi}{4} < t \le -\frac{\pi}{4} \\ 8 & -\frac{\pi}{4} < t \le \frac{\pi}{4} \\ -\frac{16}{\pi}t + 12 & \frac{\pi}{4} < t \le \frac{3\pi}{4} \\ 0 & t > \frac{3\pi}{4} \end{cases} \tag{17}$$

$$z = \begin{cases} \frac{16}{\pi}t + 20 & t \le -\frac{3\pi}{4} \\ 14 & -\frac{3\pi}{4} < t \le -\frac{\pi}{4} \\ -\frac{16}{\pi}t + 10 & -\frac{\pi}{4} < t \le \frac{\pi}{4} \\ 6 & \frac{\pi}{4} < t \le \frac{3\pi}{4} \\ \frac{16}{\pi}t - 6 & t > \frac{3\pi}{4} \end{cases} \tag{18}$$

where the three-dimensional tuple *(x, y, z)* gives the coordinates (in centimeters) of the ball w.r.t. the torso frame of the NAO presenter, and *t* loops over (−π, π]. In each loop, we calculated 20 data points to construct the trajectories, with 4 s sleeping time between every two data points. Note that although we have defined the optimal desired trajectories, the arm movement was not identical to them due to the noisy position control of the end-effector of the robot. On the observer side, the *(x, y)* coordinates of the color-filtered moment of the ball in the visual field were recorded to form a trajectory with a sampling time of 0.2 s. Five trajectories, in the form of tuples *(x, y, z)* w.r.t. the torso frame of the NAO observer, were recorded with each color and each curve, so in total 20 trajectories were available for training.
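As an illustration of how the way-points of the *cosine* curve could be generated, the sketch below evaluates Equations (13)–(15) as printed at 20 values of *t* in (−π, π]. The uniform spacing of the samples and the function names are assumptions.

```python
import math


def cosine_curve(t):
    """Eqs. (13)-(15): ball position in cm in the presenter's torso frame."""
    x = 12.0
    y = 8.0 * (-t / 2.0) + 0.04
    z = 4.0 * math.cos(2.0 * t) + 0.10
    return (x, y, z)


def sample_loop(curve, n_points=20):
    """Evaluate `curve` at n_points values of t in (-pi, pi].

    Uniform spacing is an assumption; -pi is excluded and +pi included.
    """
    step = 2.0 * math.pi / n_points
    return [curve(-math.pi + step * (i + 1)) for i in range(n_points)]


waypoints = sample_loop(cosine_curve)
```

The last way-point corresponds to *t* = π, where cos(2*t*) = 1 and *z* therefore reaches its maximum of 4.10 cm.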

In each training epoch, these trajectories, in the form of tuples, were fed into the input layer one after another for training, with the tuples of the next time-step serving as a training target. The parameters are listed in **Table 1**. The final PB values were examined after the training was done, and the values are shown in **Figure 4**. It can be seen that the first PB unit, along with the dorsal stream, was approximately self-organized with the color information, while the second PB unit, along with the ventral stream, was self-organized with the movement information.

#### **3.2. RECOGNITION**

Another four trajectories were presented in the recognition experiment, in which the length of the sliding-window was equal to the length of the whole time-series, i.e., *T* = *a* in Equation (12). The updates of the PB units are shown in **Figure 5**. Although we used the complete time-series for the recognition, it should also be possible to use only part of the sequence, e.g., through the sliding-window approach with a smaller *a*, to fulfil real-time requirements in the future.

#### **3.3. PREDICTION**

In this simulation, the PB values obtained from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object. The one-step prediction from the output units was then applied again to the input at the next time-step, so that the whole time-series corresponding to the object's movements and features was obtained. **Figure 6** presents the comparisons between the true values (the same as used in recognition) and the predicted ones.
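The closed-loop generation described above can be sketched independently of the network details. The callback `step`, assumed to map the current input and hidden state to a one-step prediction and a new state with the PB values held fixed, and the toy step function in the usage lines are hypothetical.

```python
def generate_sequence(step, first_input, initial_state, n_steps):
    """Roll a one-step predictor forward by feeding its output back in.

    step(s_b, state) -> (s_o, new_state) is an assumed interface for a
    trained network whose PB values are fixed.
    """
    outputs = []
    s_b, state = first_input, initial_state
    for _ in range(n_steps):
        s_o, state = step(s_b, state)
        outputs.append(s_o)
        s_b = s_o  # the prediction becomes the next input
    return outputs


def toy_step(s_b, state):
    """Stand-in predictor: halves every input and counts the steps."""
    return [0.5 * value for value in s_b], state + 1


rollout = generate_sequence(toy_step, [8.0], 0, 3)
```

Only the first input comes from real sensing, so any early prediction error is carried into later steps through the feedback.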

From **Figure 6** and **Table 2**, it can be observed that the estimation deviated considerably from the true value within the first few time-steps, as the RNN needs to accumulate enough input values to access its short-term memory. However, the error became smaller and the estimation kept track of the true value in the following time-steps. Considering that the curves are automatically generated given the PB units and the values at the first time-step, the error between the true values and the estimated ones is acceptable. Moreover, this result shows clearly that the conceptualization affects the (predictive) visual perception.

**FIGURE 4 | Values of two sets of PB units in the two streams after training.** The square markers represent those PB units after the *square* curves training and the triangle markers represent those of the *cosine* curves training. The colors of the markers, yellow and green, represent the colors of the balls used for training.

#### **3.4. GENERALIZATION IN RECOGNITION**

To test whether our new computational model has the generalization ability that Cuijpers et al. (2009) proposed, we recorded another set of sequences of a circle trajectory. The trajectory is defined as:

$$x = 12 \tag{19}$$

$$y = 4 \cdot \sin(2t) + 0.04\tag{20}$$

$$z = 4 \cdot \cos(2t) + 0.10\tag{21}$$

The yellow and green balls were still used. We ran the recognition experiment again with the previously trained weights. The updates of the PB units are shown in **Figure 7**. Compared with **Figure 4**, we can observe that the positive and negative signs of the PB values are similar to those of the square trajectory. This is probably because the visual percepts of circle and square movements have more in common than those of circle and cosine movements.

#### **3.5. PB REPRESENTATION WITH DIFFERENT SPEEDS**

We further generated 20 trajectories with the same functions (Equations 13–18) but with a lower sampling rate. In other words, the movement of the balls appeared faster from the robot's point of view. The final PB values after training are shown in **Figure 8**.

It can be seen that the PB values were generally smaller compared with **Figure 4**, which was probably because less error was propagated during training. Moreover, the PB values corresponding to colors (green and yellow) and movements (cosine and square) were interchanged within the same PB unit (i.e., along the same axis) due to the different random initial parameters of the network. However, the PB unit along with the dorsal stream still encoded color information, while the PB unit along with the ventral stream encoded movement information. The network was still able to show properties of the spatio-temporal sequence data in the PB units' representation.

#### **4. DISCUSSION**

#### **4.1. NEURAL DYNAMICS**

An advancement of the HP-RNN model is that it can learn and encode the "what" and "where" information separately in two streams (more specifically, in two hidden layers). Both streams are connected through horizontal products, which means fewer connections than full multiplication (as in the conventional bilinear model) (Zhong et al., 2012a). In this paper, we further augmented the HP-RNN with the PB units. One set of units, connecting to one visual stream, reflects the dynamics of sequences in the other stream. This is an interesting result since it shows the neural dynamics in the hybrid combination of the RNNPB units and the horizontal product model. Taking the dorsal-like hidden layer for example, the error of the attached PB units is

$$\delta_{n_2}^{PB} = \sum_{l} \delta_l^d \cdot f'(\rho_{n_2}^{v}) \cdot \bar{w}_{ln_2}^d \tag{22}$$

$$= \sum_{l} \left[ \sum_{k} (s_k^b - s_k^o) g'(s_k^o) u_{kl}^d f'(s_l^d) \right] f'(\rho_{n_2}^{v}) \cdot \bar{w}_{ln_2}^d \tag{23}$$

where *g′(·)* and *f′(·)* are the derivatives of the linear and sigmoid transfer functions, respectively. Since the output is linear, according to the definition of the horizontal product, the equation becomes

$$\delta_{n_2}^{PB} = \sum_{l} \left[ \sum_{k} (s_k^b - s_k^o) \odot x_k^v \cdot u_{kl}^d f'(s_l^d) \right] f'(\rho_{n_2}^{v}) \cdot \bar{w}_{ln_2}^d \tag{24}$$

The update of the internal values of the PB units becomes

$$
\rho_{n_2}^{v}(e+1) = \rho_{n_2}^{v}(e) + \gamma \sum_{t=T-a}^{T} \delta_{n_2}^{PB}(t) \tag{25}
$$

$$
= \rho_{n_2}^{v}(e) + \gamma \sum_{t=T-a}^{T} \left\{ \sum_{l} \left[ \sum_{k} \left(s_k^b(t) - s_k^o(t)\right) \odot x_k^v(t) \cdot u_{kl}^d(t) f'(s_l^d) \right] f'(\rho_{n_2}^{v}(t)) \cdot \bar{w}_{ln_2}^d(t) \right\} \tag{26}
$$

where the *x<sup>v</sup><sub>k</sub>(t)* term refers to the contribution of the weighted summation from the ventral-like layer at time *t*. Note that the term *f′(ρ<sup>v</sup><sub>n<sub>2</sub></sub>(t))* is actually constant within one epoch and is only updated after each epoch with a relatively small updating rate. Therefore, from the experimental perspective, given the same object movement but different object features, the difference of the PB values mostly reflects the dynamic changes in the hidden layer of the ventral stream. The same holds for the PB units attached to the ventral-like layer. This brief analysis shows that the PB units attached to one module of an RNNPB network with horizontal product connections effectively accumulate the non-linear dynamics of the other module.

#### **4.2. CONCEPTUALIZATION IN VISUAL PERCEPTION**

Visual conceptualization and perception are intertwined processes. As experiments from Schyns and Oliva (1999) show, when the visual observation is not clear, the brain automatically extrapolates the visual percept and updates the categorization labels on various levels according to what has been gained from the visual field. On the other hand, this conceptualization also affects the immediate visual perception in a top–down predictive manner. For instance, the identity conceptualization of a human face predictively spreads conceptualizations in other levels (e.g., face emotion). This top–down process propagates from object identity to other local conceptualizations, such as object affordance, motion, edge detection and other processes at the early stages of visual processing. This can be tested by classic illusions, such as "the goblet illusion," where perception depends largely on top–down knowledge derived from past experiences rather than direct observation. This kind of illusion may be explained by the error in the first few time-steps of the prediction experiment of our model. Therefore, our model to some extent also demonstrates the integrated process between conceptualization and spatio-temporal visual perception. This top–down predictive perception may also give rise to other vision-based predictive behaviors such as object permanence.

Particularly, the PB units act as a high-level conceptualization representation, which is continuously updated with the partial sensory information perceived on a short time scale. The prediction process of the RNNPB is assisted by the conceptualized PB units of visual perception, which corresponds to the integration of conceptualization and (predictive) visual perception. This is the reason why the PB units were not processed as a binary representation, as Ogata et al. (2007) did for human-robot interaction; the original values of the PB units are more accurate in generating the prediction of the next time-step and performing generalization tasks. As we mentioned, this model is merely a proof-of-concept model that bridges conceptualized visual streams and sensorimotor prediction. For more complex tasks, besides expanding the network size, more complex networks that are capable of extracting and predicting higher-level spatio-temporal structures (e.g., predictive recurrent networks with large learning capacity by Tani and colleagues: Yamashita and Tani, 2008; Murata et al., 2013) can also be applied. It would be interesting to further investigate the functional modularity representation of these network models when they are also interconnected with horizontal products.

Furthermore, the neuroscience basis that supports this paper, in the context of the mirror neuron system based on object-oriented actions (grasping), can be found in "data-driven" models such as MNS (Oztop and Arbib, 2002) and MNS2 (Bonaiuto et al., 2007; Bonaiuto and Arbib, 2010), although the main hypothesis in our model is not taken from the mirror neuron system theory. In the MNS review paper by Oztop et al. (2006), the action generation mode of the RNNPB model was considered to be excessive as there is no evidence yet to show that the mirror neuron system participates in action generation. However, in our model the generation mode plays a key role for the conceptualized PB units in the sensorimotor integration of object interaction. Nevertheless, the similar network architecture (RNNPB) used in modeling mirror neurons (Tani et al., 2004) and our pre-symbolic sensorimotor integration models may imply a close relationship between language (pre-symbolic) development, object-oriented actions, and the mirror neuron theory.

#### **5. CONCLUSION**

In this paper a recurrent network architecture integrating the RNNPB model and the horizontal product model has been presented, which sheds light on the feasibility of linking the conceptualization of ventral/dorsal visual streams, the emergence of pre-symbolic communication, and the predictive sensorimotor system.

Based on the horizontal product model, here the information in the dorsal and ventral streams is separately encoded in two network streams and the predictions of both streams are brought together via the horizontal product while the PB units act as a conceptualization of both streams. These PB units allow for storing multiple sensory sequences. After training, the network is able to recognize the pre-learned conceptualized information and to predict the up-coming visual perception. The network also shows robustness and generalization abilities. Therefore, our approach offers preliminary concepts for a similar development of conceptualized language in pre-symbolic communication and further in infants' sensorimotor-stage learning.

#### **ACKNOWLEDGMENTS**

The authors thank Sven Magg, Cornelius Weber, Katja Kösters as well as reviewers (Matthew Schlesinger, Stefano Nolfi) for improvement of the paper, Erik Strahl for technical support and NAO assistance in Hamburg and Torbjorn Dahl for the generous allowance to use the NAOs in Plymouth.

#### **FUNDING**

This research has been partly supported by the EU projects RobotDoC under grant agreement 235065 ROBOT-DOC, KSERA under grant agreement n° 2010-248085, POETICON++ under grant agreement 288382, ALIZ-E under grant agreement 248116 and UK EPSRC project BABEL.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 July 2013; accepted: 14 January 2014; published online: 04 February 2014. Citation: Zhong J, Cangelosi A and Wermter S (2014) Toward a self-organizing presymbolic neural model representing sensorimotor primitives. Front. Behav. Neurosci. 8:22. doi: 10.3389/fnbeh.2014.00022*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience. Copyright © 2014 Zhong, Cangelosi and Wermter. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Temporal relation between top-down and bottom-up processing in lexical tone perception

#### *Lan Shuai<sup>1</sup>\* and Tao Gong<sup>2</sup>\**

*<sup>1</sup> Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA <sup>2</sup> Department of Linguistics, University of Hong Kong, Hong Kong, China*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### *Reviewed by:*

*Ryan Giuliano, University of Oregon, USA I-Fan Su, The University of Hong Kong, Hong Kong Wentao Gu, Nanjing Normal University, China*

#### *\*Correspondence:*

*Lan Shuai, Department of Electrical and Computer Engineering, Johns Hopkins University, North Charles Street 3400, Baltimore, MD 21218, USA e-mail: lshuai@jhu.edu; Tao Gong, Department of Linguistics, University of Hong Kong, Pokfulam Road, Hong Kong, China e-mail: gtojty@gmail.com*

Speech perception entails both top-down processing that relies primarily on language experience and bottom-up processing that depends mainly on instant auditory input. Previous models of speech perception often claim that bottom-up processing occurs in an early time window, whereas top-down processing takes place in a late time window after stimulus onset. In this paper, we evaluated the temporal relation of both types of processing in lexical tone perception. We conducted a series of event-related potential (ERP) experiments that recruited Mandarin participants and adopted three experimental paradigms, namely dichotic listening, lexical decision with phonological priming, and semantic violation. By systematically analyzing the lateralization patterns of the early and late ERP components that are observed in these experiments, we discovered that auditory processing of pitch variations in tones, as a bottom-up effect, elicited greater right hemisphere activation; in contrast, linguistic processing of lexical tones, as a top-down effect, elicited greater left hemisphere activation. We also found that both types of processing co-occurred in both the early (around 200 ms) and late (around 300–500 ms) time windows, which supported a parallel model of lexical tone perception. In contrast to the previous view that language processing is special and performed by dedicated neural circuitry, our study shows that language processing can be decomposed into general cognitive functions (e.g., sensory and memory) and shares neural resources with these functions.

#### **Keywords: lexical tone, ERP, lateralization, serial model, parallel model**

#### **INTRODUCTION**

Perception in general comprises two types of processing, bottom-up (or data-based) processing and top-down (or knowledge-based) processing, which are based, respectively, on incoming data and prior knowledge (Goldstein, 2009). For speech perception, the TRACE model (McClelland and Elman, 1986) claims that both types of processing are necessary, and the auditory sentence processing model (Friederici, 2002) proposes that the cognitive processes involved in speech perception proceed in a series of steps. Following these models, bottom-up processing such as acoustic processing of the incoming signal and generalization of speech features happens first, whereas top-down processing such as recognition based on knowledge of phonemes, semantics, or syntax takes effect at a later stage of perception. These models, as well as other theories or models of speech perception (e.g., Liberman and Mattingly, 1985; Fowler, 1986; Stevens, 2002; Diehl et al., 2004; Hickok and Poeppel, 2007), are based primarily on evidence from non-tonal languages. Two issues remain to be explored: (a) whether the processing of tonal languages, which make up about 60–70% of world languages (Yip, 2002), follows the same cognitive processes; and (b) what the role of lexical tone perception is in a general model of speech perception. In addition, considering that the lexical tone attached to a syllable is carried mainly by the vowel nucleus of the syllable, the temporal dimension of cognitive processes underlying lexical tone perception is of special interest.

In this paper, we discussed the temporal relationship between bottom-up and top-down processing in lexical tone perception, with the purpose of not only examining the underlying mechanisms of lexical tone perception but also shedding valuable light on the general models of speech perception concerning tonal languages. Lexical tone is a primary use of pitch variations to distinguish lexical meanings (Wang, 1967). Noting that pitch perception belongs to the general auditory perception that is also shared by other animals (Hulse et al., 1984; Izumi, 2001; Yin et al., 2010) and word semantics are acquired primarily through language learning, lexical tone perception also entails both bottom-up and top-down processing. In our study, we defined *bottom-up processing* as auditory processing and feature extraction of incoming acoustic signals, which referred specifically to pitch contour perception. By contrast, we defined *top-down processing* as recognition and comprehension of incoming signals according to language knowledge, which referred specifically to the influence of language experience on recognizing and comprehending a certain syllable in a tonal language. The recognition of a pitch contour as a certain tonal category was also ascribed to top-down processing.

Many available studies on lexical tone perception have focused on the lateralization patterns of lexical tone processing (e.g., Van Lancker and Fromkin, 1973; Baudoin-Chial, 1986; Hsieh et al., 2001; Wang et al., 2001; Gandour et al., 2002, 2004; Tervaniemi and Hugdahl, 2003; Luo et al., 2006; Zatorre and Gandour, 2008; Li et al., 2010; Krishnan et al., 2011; Jia et al., 2013), and reported mixed results even under the same experimental paradigms. For example, by employing the dichotic listening (DL) paradigm and materials from Mandarin, Baudoin-Chial (1986) reported no hemisphere advantages of lexical tone perception, but Wang et al. (2001) found a left hemisphere advantage. Using fMRI, Gandour and colleagues compared lexical tone processing with intonation or vowel processing. The study of lexical tone and intonation (Gandour et al., 2003b) revealed a left hemisphere advantage in the frontal lobe, whereas the study of lexical tone and segments (Li et al., 2010) discovered a right hemisphere advantage in the frontoparietal area for the perception of tones. A right lateralization of lexical tone perception was also reported in an ERP (Luo et al., 2006) and a DL experiment (Jia et al., 2013).

The inconsistent laterality effects could be due to different experimental conditions in these studies. For example, in the DL experiments reporting a left hemisphere advantage (Van Lancker and Fromkin, 1973; Wang et al., 2001), tonal language speakers participated in more difficult tasks than non-tonal language speakers, and the heavier load of these tasks (e.g., hearing trials at a faster pace) might enhance the left hemisphere advantage in tonal language speakers. By contrast, there were no hemisphere advantages in the study that had no task differences between tonal and non-tonal language speakers (Baudoin-Chial, 1986). In addition, in the DL tasks that involved meaningless syllables and hums, which could direct participants' attention toward pitch contours only, a right hemisphere advantage was shown (Jia et al., 2013). More importantly, whether language-related tasks are involved is the primary noticeable difference between studies showing an explicit right hemisphere advantage (e.g., Luo et al., 2006) and those reporting a left hemisphere advantage of lexical tone perception (Van Lancker and Fromkin, 1973; Hsieh et al., 2001; Wang et al., 2001; Gandour et al., 2002, 2004). For example, Luo et al. (2006) conducted a passive listening task in which participants were engaged in a silent movie, whereas the other studies carried out explicit language tasks such as lexical tone identification. Accordingly, the right lateralization reported in Luo et al. (2006)'s study could be attributed to the pure bottom-up effect without top-down influence, whereas the other studies did not address the underlying mechanisms of lexical tone perception. This could lead to the inconsistent results between these studies. These mixed results also reflect a multifaceted perspective on lexical tone processing and hemispheric lateralization. As stated in Zatorre and Gandour (2008)'s review, in tonal processing, "it appears that a more complete account will emerge from consideration of general sensory-motor and cognitive processes in addition to those associated with linguistic knowledge."

To our knowledge, among the available studies, there was only one work (Luo et al., 2006) that discussed these two types of processing in lexical tone perception and a few that examined the cognitive processes involved in lexical tone perception (Ye and Connine, 1999; Schirmer et al., 2005; Liu et al., 2006; Tsang et al., 2010). In Luo et al. (2006)'s study, a serial model of lexical tone processing was proposed, which suggested that bottom-up processing (i.e., pitch perception) took effect in an early time window around 200 ms and top-down processing (i.e., semantic comprehension) happened in a late time window around 300–500 ms. The first half of this model was based on their experimental results that phonemes with slow- (lexical tone) and fast-changing (stop-consonant) acoustic properties induced, respectively, right and left lateralization patterns of the MMN (Mismatch Negativity) component. The second half was proposed to address the conflict between their results and the previous literature that showed a general left hemisphere advantage of lexical tone perception. They proposed that during the late stage a left lateralization should be shown in the semantics-associated late ERP component, N400 (Kutas and Hillyard, 1980).

This serial model associated the right hemisphere advantage with bottom-up processing of lexical tones, and the left hemisphere advantage with top-down processing. In terms of lexical tone perception, there exists ample evidence in support of such association between the two types of processing and the two types of hemisphere advantage. For example, in studies of language experience and prosody, Gandour et al. (2004) dissociated linguistic processing in the left hemisphere and acoustic processing in the right hemisphere, by locating a left lateralization in certain brain regions in tonal language speakers and a right lateralization in non-tonal language speakers during speech prosody processing. Pitch processing has a right hemisphere advantage, as shown in behavior experiments such as DL (Sidtis, 1981), PET studies (Zatorre and Belin, 2001), and later fMRI studies (Boemio et al., 2005; Jamison et al., 2006); for review, see Zatorre et al. (2002). By contrast, compared to non-tonal language speakers, tonal language speakers have greater left hemisphere activities during lexical tone perception (Gandour et al., 1998, 2004; Hsieh et al., 2001; Wang et al., 2004), and multiple brain regions in the left hemisphere were believed to be the primary source of N400 (Lau et al., 2008). Noting these, we also adopted the lateralization pattern in our study to investigate top-down and bottom-up processing of lexical tones.

In addition, in Luo et al. (2006)'s study, there was insufficient direct evidence to demonstrate the top-down effect at the late stage of processing, because this study only explored the acoustic factor without involving explicit language-related tasks or any linguistic factor. Therefore, it is hard to comprehensively evaluate Luo et al. (2006)'s serial model. Considering these, in order to make sure that language knowledge (top-down) would take effect, we adopted a number of explicit language-related tasks, including DL, lexical decision with phonological priming, and semantic violation. Meanwhile, we manipulated both the acoustic (requiring bottom-up processing) and semantic (requiring top-down processing) factors in the experimental design and analyzed the ERP components at both the early (around 200 ms) and late (around 300–500 ms) processing stages to explore the temporal relationship of the two types of processing during lexical tone perception.

Our experimental results showed that both bottom-up (acoustic) processing and top-down (semantic) processing exist in both the processing stage around 200 ms and that around 300–500 ms, which inspired a parallel model of top-down and bottom-up processing in lexical tone perception. In the rest of the paper, we described the two ERP components traced in our experiments of Mandarin lexical tone perception (section ERP Components Reflecting Bottom-up and Top-down Processing), reported these experiments and their findings (section ERP Experiments of Lexical Tone Perception), discussed the lateralization patterns of the ERP components shown in these experiments and the derived parallel model of lexical tone perception (section General Discussions), connected language processing with general cognitive functions (section Language Processing and General Cognitive Functions), and finally, concluded the paper (section Conclusion).

#### **ERP COMPONENTS REFLECTING BOTTOM-UP AND TOP-DOWN PROCESSING**

We examine two ERP components in our experiments, namely auditory P2 and auditory N400, which occur, respectively, in the early and late time windows after stimulus onset.

Auditory P2 is the second positive-going ERP component. It usually has a central topographic distribution, and peaks in the early time window around 200 ms (Luck, 2005). The lateralization of P2 is subject to both acoustic properties and tasks (e.g., categorizing emotional words, Schapkin et al., 2000). The corresponding MEG component is P2m or M200. Previous research reported a general left lateralization of P2m in language-related tasks (e.g., perceiving consonants and vowels, Liebenthal et al., 2010), but acoustic properties of incoming signals also affect the lateralization of P2m (e.g., the voice onset time of consonants, Ackermann et al., 1999).

As a negative-going potential, the auditory N400 appears in the late time window (around 250–550 ms) when the target sound stimulus is incongruent with the context (Kutas and Federmeier, 2011). The semantic violation paradigm can elicit N400 (Kutas and Hillyard, 1980). The phonological priming experiment can also elicit N400, when comparing the control condition with the priming condition (Praamstra and Stegeman, 1993; Dumay et al., 2001). The auditory N400 usually has a more frontal topographic distribution than the visual N400 (Holcomb and Anderson, 1993; Kutas and Federmeier, 2011). In young populations, the auditory N400 tends to have a frontal distribution (Curran et al., 1993; Tachibana et al., 2002). The source of N400 is believed to lie in the frontal and temporal brain areas (Maess et al., 2006; Lau et al., 2008), starting from 250 ms in the posterior half of the left superior temporal gyrus, migrating forward and ventrally to the left temporal lobe by 365 ms, and then moving to the right anterior temporal lobe and both frontal lobes after 370 ms (Kutas and Federmeier, 2011).

#### **ERP EXPERIMENTS OF LEXICAL TONE PERCEPTION**

We designed three ERP experiments to explore the temporal relation of bottom-up and top-down processing in lexical tone perception. These experiments recruited Mandarin participants and traced the above two ERP components in three tasks, respectively, at the syllable, word, and sentence levels, which cover aspects of acoustics and phonetics, phonology, and semantics processing. According to the serial models (e.g., Friederici, 2002), these types of processing could be reflected by different ERP components shown at the early and late stages. However, a parallel model would predict a co-existence of these types of processing at both the early and late stages of lexical tone perception.

These experiments were designed primarily for the following two reasons. First, we were interested in clarifying whether top-down effects could happen at the early stage of a "lower-level" processing. For this purpose, we designed Experiment 1 using the DL task. Apart from the bottom-up effect on phoneme identification, we introduced a semantics factor to see whether a top-down effect induced by this factor could exist in the early stage of perception and whether such an effect could be reflected by the early ERP components (e.g., P2).

Second, we were interested in identifying bottom-up effects at the late stage of a "high-level" processing. To this purpose, we designed Experiment 2 using an auditory lexical decision task, which entailed a top-down semantic processing and a bottom-up processing induced by various types of phonological primes. We also designed Experiment 3 using a semantic violation task. In this task, semantic integration could be reflected by the late ERP component (e.g., N400). Meanwhile, phonemes bearing different acoustic properties could also induce the bottom-up acoustic processing at this stage.

Experiment 1 involved a DL task, which is a widely adopted paradigm in behavioral and ERP studies examining lateralization in the auditory modality. For example, Eichele et al. (2005) adopted a DL task using stop consonants as stimuli, and discovered that the latencies of the ERP waveforms in the left hemisphere were shorter than those in the right hemisphere, thus reflecting a quicker response of the left hemisphere in perceiving stop consonants. Wioland et al. (1999) explored pitch perception in a DL task, and found that the ERP waveforms had higher amplitudes when the tone change happened in the left ear than in the right ear, thus indicating that the right hemisphere was dominant in pitch discrimination. In our experiment, we adopted the DL paradigm to explore tone lateralization, and used the amplitude of auditory P2 as a temporal indicator, rather than the ear advantages used in previous behavioral studies with contradictory results (Van Lancker and Fromkin, 1973; Baudoin-Chial, 1986), to reflect hemispheric specialization. We compared the lateralization patterns under tones and stop consonants in both words and non-words. In terms of acoustic properties, the stop consonants have fast-changing properties, whereas the lexical tones in Mandarin have slow-changing properties.

We expected an increase in the activity of the hemisphere specialized for a certain type of processing when there was a heavier load of the corresponding information. For example, in dichotic trials containing two different lexical tones, there would be a relatively greater right hemisphere advantage (equivalent to a smaller left hemisphere advantage) than in dichotic trials containing two different stop consonants but the same lexical tones. Similarly, dichotic trials containing words should generate greater left hemisphere activity than dichotic trials containing non-words. In line with previous literature (Wioland et al., 1999; Luo et al., 2006), we examined the ERP waveforms in the C3 and C4 electrode groups.

Experiment 2 involved a lexical decision task with phonological priming. Priming refers to the phenomenon of acceleration in response after repetition. An early study of child language acquisition (Bonte and Blomert, 2004) adopted such a task. It used Dutch words and non-words as testing materials, and discovered different N400 reduction patterns in different language groups. Our experiment adopted a similar design, but used consonants and tones, as well as Chinese words and non-words as testing materials. In Experiment 2, consonant or tone primes appeared before target words, and we examined the auditory P2 and auditory N400 under the tone or consonant priming paradigm.

Unlike the enhancement effect of DL in Experiment 1, we expected that there would be a reduction of ERP components (smaller amplitude) due to the priming effect, and that semantic violation would induce a reduction of ERP amplitudes (as shown by the smaller amplitude in the positive component and greater amplitude in the negative component). These reductions could be greater in the hemisphere related to a certain type of processing. For example, the reduction caused by lexical tone priming should have a greater right hemisphere advantage (equivalent to a smaller left hemisphere advantage) compared to that of consonants, whereas the reduction caused by non-words should be greater in the left hemisphere compared to that of words. Considering the topographic distributions of auditory P2 and N400 (Curran et al., 1993; Tachibana et al., 2002; Luck, 2005) as well as the auditory brain regions involved in the tone priming tasks (Wong et al., 2008), in Experiment 2 we examined the ERP waveforms in the posterior (P3, P4) and frontal (F3, F4) electrode groups.

Experiment 3 involved a semantic violation task in sentences. N400 has been one of the most widely-explored ERP components in such studies, and we adopted the semantic violation paradigm to explore whether acoustic property affected the lateralization of auditory N400 occurring in the late time window during sentence comprehension, which was a high-level linguistic task. The violation was induced by changing either the stop consonant or the tone of the target syllable in a sentence. We expected a greater right lateralization of the N400 induced by lexical tone violation compared to consonant violation. Here we set the central (C3, C4) electrode groups as the regions of interest.

#### **PARTICIPANTS AND SETTINGS**

All these experiments were approved by the College Research Ethics Committee (CREC) of Hong Kong. Thirty-two university students (16 females, 16 males) volunteered for these experiments (age range: 19–29, mean = 27, *SD* = 4.2). In Experiment 1, data from all participants were analyzed. In Experiment 2, the data of one participant were excluded due to excessive eye movements, thus leaving 31 participants (age range: 19–29, mean = 25, *SD* = 2.3). In Experiment 3, the data of three participants were excluded, thus leaving 29 participants (age range: 19–29, mean = 26, *SD* = 2.8).

All these participants were native Mandarin speakers with no musical training. They had normal hearing (below 25 dB HL) in both ears and interaural differences of less than 10 dB HL at 125, 250, 500, 750, and 1000 Hz, according to the PTA (pure tone audiometry) test. They were all right-handed according to the Edinburgh handedness test (Oldfield, 1971), and reported no history of head injury or mental illness. They signed informed consent forms before each of these experiments, and received compensation at a rate of 50 HKD per hour after completing these experiments.
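The hearing criteria can be read as a simple per-participant check. The following is a minimal sketch under stated assumptions: the function name and data layout are hypothetical, and it assumes that the 25 dB HL limit is applied at the same five frequencies as the interaural-difference limit, which the text does not state explicitly.

```python
# Hypothetical screening helper; the name and data layout are illustrative only.
FREQS_HZ = (125, 250, 500, 750, 1000)

def passes_hearing_screen(left_db_hl, right_db_hl,
                          max_threshold=25, max_asymmetry=10):
    """Return True if both ears are below max_threshold (dB HL) and the
    interaural difference is below max_asymmetry (dB) at every frequency.

    left_db_hl / right_db_hl map frequency in Hz to the measured threshold.
    """
    for freq in FREQS_HZ:
        left, right = left_db_hl[freq], right_db_hl[freq]
        if left >= max_threshold or right >= max_threshold:
            return False  # hearing threshold outside the normal range
        if abs(left - right) >= max_asymmetry:
            return False  # ears differ too much at this frequency
    return True
```

A balanced interaural criterion matters here because the experiments compare responses across the two ears and hemispheres.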

These experiments were conducted on three separate days in a dimly lit, quiet room. During each experiment, participants were seated comfortably in front of a computer monitor, and the sound stimuli were presented via ER-3A air-conducting insert earphones, which attenuated the environmental noise by 20–30 dB. Sound pressure level, measured by a sound level meter, was set to 75 dB SPL during the experiments. The sound materials in these experiments were recorded from a female native Mandarin speaker. The recording was conducted in a sound-proof booth using a Shure SM10A microphone and a Sony PCM-2700A audio recorder. The adjustments to the recorded sound materials were implemented by the PSOLA (pitch-synchronous overlap add) algorithm in Praat (Boersma and Weenink, 2013), the experimental procedures were implemented using E-Prime (Psychology Software Tools, Pittsburgh, PA), and the statistical analyses were conducted using the SPSS software (version 18.0, SPSS Inc. Chicago, IL).

#### **EEG DATA RECORDING AND ERP PROCESSING**

The EEG (electroencephalography) data were collected by a 128-channel EEG system with Geodesic Sensor Net (EGI Inc., Eugene, OR, USA) (see **Figure 1**). The impedances of all electrodes were kept below 50 kΩ at the beginning of the recording. In all three experiments, participants were encouraged to avoid blinking or moving their body parts at certain points. Eye blinks and movements were monitored through electrodes located above and below each eye and outside of the outer canthi. The original reference point was the vertex. The ERPs were re-referenced to the average of all 129 scalp channels in data processing (average reference). During recording, signals were sampled at 250 Hz with a 0.01–100 Hz band-pass filter.
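Average re-referencing amounts to subtracting, at each time point, the mean voltage across all channels. A minimal sketch for a single time sample follows; the function name is hypothetical and this is not the software routine used in the study.

```python
def average_reference(sample):
    """Re-reference one time sample (a list of channel voltages in microvolts)
    by subtracting the mean across all channels from every channel."""
    mean = sum(sample) / len(sample)
    return [voltage - mean for voltage in sample]
```

After this step the channel values at each time point sum to zero, so no single electrode site (such as the vertex) acts as the reference.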

During offline ERP processing, the recorded continuous data were filtered by a 40 Hz low-pass filter and segmented from −100 to 900 ms relative to the stimulus onset. Segments with either an amplitude change exceeding 100 μV in the vertical eye channels and all electrodes, or a voltage fluctuation exceeding 50 μV in the horizontal eye channels, were excluded from analyses. In each experiment, at least half of the total trials were preserved for analysis in each condition and for every participant. The baseline correction was conducted from −100 to 0 ms.
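The segmentation, baseline correction, and artifact rejection steps can be sketched in plain Python. This is an illustration under stated assumptions, not the processing pipeline used in the study: all function names are hypothetical, "amplitude change" and "voltage fluctuation" are interpreted here as peak-to-peak range within a segment, and baseline correction is folded into the segmentation step (which does not affect the peak-to-peak criterion).

```python
SAMPLE_RATE_HZ = 250          # sampling rate reported for the recording
PRE_MS, POST_MS = 100, 900    # segment runs from -100 ms to +900 ms

def ms_to_samples(ms):
    """Convert a duration in milliseconds to a number of samples."""
    return ms * SAMPLE_RATE_HZ // 1000

def cut_epoch(signal, onset_index):
    """Slice one channel around stimulus onset and subtract the mean
    of the -100..0 ms baseline from every sample."""
    pre, post = ms_to_samples(PRE_MS), ms_to_samples(POST_MS)
    segment = signal[onset_index - pre: onset_index + post]
    baseline = sum(segment[:pre]) / pre
    return [value - baseline for value in segment]

def peak_to_peak(segment):
    """Range of voltages within a segment (assumed rejection measure)."""
    return max(segment) - min(segment)

def keep_epoch(scalp_and_veog, heog, limit_uv=100.0, heog_limit_uv=50.0):
    """Return False if any scalp or vertical eye channel exceeds limit_uv,
    or any horizontal eye channel exceeds heog_limit_uv."""
    if any(peak_to_peak(ch) > limit_uv for ch in scalp_and_veog):
        return False
    if any(peak_to_peak(ch) > heog_limit_uv for ch in heog):
        return False
    return True
```

At 250 Hz each sample spans 4 ms, so the baseline covers 25 samples and the full segment 250 samples.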

In the following sections, we reported the materials, procedures, and results of these three experiments.

#### **EXPERIMENT 1: MANDARIN TONE DICHOTIC LISTENING TASK**

#### *Materials*

The recorded stimuli included Mandarin real- and pseudo-syllables, which were formed by two stop consonants (/p/ and /t/ in IPA notation), two diphthongs (/au/ and /ua/ in IPA notation), and two Mandarin tones (tone 1, the high level tone; and tone 2, the high rising tone). Eight syllables were constructed using these phonemes and tonemes, among which four were real-syllables, having corresponding Chinese characters, whereas the other four were pseudo-syllables, made of valid consonants and diphthongs but having no corresponding Chinese characters. All these stimuli were cut to 350 ms based on their intensity profiles. **Figure 2** shows their waveforms and spectrograms, in which the pitch contours are also marked.

#### *Procedure*

In the DL task, participants simultaneously heard two distinct syllables, one in each ear, and were asked to report, according to the Chinese character (meaning left or right) shown on the screen, both the consonant and the tone of the syllable presented to the corresponding ear, by pressing the corresponding keys on the response pad.

#### **Table 1 | Experimental conditions in Experiment 1, each containing two syllables and four DL trials.**

*Words have corresponding Chinese characters (shown in brackets, together with their meanings). The pronunciations are annotated with the IPA characters, the numbers in which denote Mandarin tones.*

We adopted a two-by-two design, with word and non-word as two levels of the lexicality factor, and stop consonant and tone as two levels of the acoustic contrast factor. The eight syllables formed four experimental conditions, including the word, consonant condition; word, tone condition; non-word, consonant condition; and non-word, tone condition, each containing two syllables (see examples in **Table 1**). In the two word conditions, the words were formed by meaningful real-syllables in Chinese; in the two non-word conditions, the non-words were formed by pseudo-syllables having no meanings in Chinese. In the two consonant conditions, the two syllables had the same diphthong and tone, but different initial consonants; in the two tone conditions, however, the two syllables had the same initial consonant and diphthong, but different tones. To balance the two syllables, respectively, played to the left and right ears of participants and the two directions in participants' responses (left or right), each of these four conditions corresponded to four DL trials. In total, there were 16 DL trials.
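The count of 16 DL trials follows from crossing the design factors. The sketch below enumerates them under the assumption that the four trials per condition arise from two ear assignments of the syllable pair crossed with two cued sides; the labels and function name are hypothetical.

```python
from itertools import product

LEXICALITY = ("word", "non-word")
CONTRAST = ("consonant", "tone")
# Which syllable of the pair (A, B) goes to which ear (assumed balancing).
EAR_ASSIGNMENT = ("A-left/B-right", "B-left/A-right")
# Which ear the on-screen character asks participants to report.
CUED_SIDE = ("left", "right")

def build_dl_trials():
    """Enumerate every combination of the two design factors
    and the two balancing factors."""
    return [
        {"lexicality": lex, "contrast": con, "assignment": ears, "cue": cue}
        for lex, con, ears, cue in product(
            LEXICALITY, CONTRAST, EAR_ASSIGNMENT, CUED_SIDE)
    ]
```

Calling `build_dl_trials()` yields 2 × 2 × 2 × 2 = 16 entries, four for each of the four experimental conditions.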

In each trial, a fixation first appeared on the center of the screen and remained there. After 400 ms, the two syllables in a DL trial were simultaneously played to the left and the right ears of participants, respectively. Participants were encouraged not to blink or move their body parts during the appearance of the fixation. After 1000 ms, the fixation on the screen was replaced by the Chinese character that indicated left or right, and accordingly, participants reported the consonant and the tone of the syllable heard by their corresponding ears. The purpose of letting participants hear the stimuli before seeing the indication (left or right) was to avoid inducing prior bias in their attention. The indication stayed on the screen for 2000 ms, during which participants gave their responses. The presentation sequence of the stimuli was randomized, and the order of choices between the two consonants and between the two tones on the response box was counter-balanced across participants.

Participants first went through a practice session (16 trials) to familiarize themselves with the experimental paradigm. In the experimental session, a total of 256 trials were presented to participants, each lasting around 5 s. The experiment consisted of four blocks, each having 64 trials that lasted about 5 min. In each block, the 16 DL trials were repeated four times in random order. Participants could take a 2-min break after each block, and the whole experiment lasted approximately 30 min.
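The trial counts above follow from fully crossing the design factors. The sketch below is only an illustration of that arithmetic, not the authors' presentation script; the field names and the `"AB"`/`"BA"` ear-assignment labels are hypothetical.

```python
import itertools
import random

LEXICALITY = ("word", "non-word")
CONTRAST = ("consonant", "tone")
# Hypothetical labels: which syllable of the pair goes to the left/right ear.
EAR_ASSIGNMENT = ("AB", "BA")
# Which ear the participant is asked to report.
REPORT_SIDE = ("left", "right")


def build_dl_trials():
    """Cross all factors: 2 x 2 conditions, each with 2 x 2 balanced variants."""
    return [
        {"lexicality": lex, "contrast": con, "ears": ears, "report": side}
        for lex, con, ears, side in itertools.product(
            LEXICALITY, CONTRAST, EAR_ASSIGNMENT, REPORT_SIDE
        )
    ]


def build_block(trials, repeats=4, rng=None):
    """Repeat every DL trial `repeats` times and shuffle the result."""
    rng = rng or random.Random()
    block = [dict(trial) for trial in trials for _ in range(repeats)]
    rng.shuffle(block)
    return block


trials = build_dl_trials()
blocks = [build_block(trials, rng=random.Random(seed)) for seed in range(4)]
print(len(trials), len(blocks[0]), sum(len(b) for b in blocks))  # 16 64 256
```

The fixed seeds are used here only to make the illustration reproducible.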

#### *Data analysis and results*

As for the behavioral data, the overall rate of response was 95.7%. A three-way repeated-measures ANOVA of rates of correct response, with lexicality (word vs. non-word), acoustic contrast (consonant vs. tone), and hemisphere (left vs. right) as three factors, revealed a significant three-way interaction [*F*(1, 31) = 5.981, *p* < 0.024, η<sub>p</sub><sup>2</sup> = 0.239]. In addition, the *post-hoc* analysis revealed a significant left hemisphere (right ear) advantage in the non-word, consonant condition [*t*(19) = −2.280, *p* < 0.034]. Since participants needed to respond to both the consonant and the tone of the syllable in one ear, we did not analyze the reaction time.

As for the ERP data, considering the central distribution of auditory P2 (Luck, 2005) and previous literature (Luo et al., 2006), we averaged the data recorded by the four homolog pairs of adjacent central electrodes including C3 and C4 [electrodes 37 (C3), 38, 42, 43 in the left hemisphere, and 105 (C4), 88, 104, 94 in the right hemisphere, according to the EGI system] for analysis. Since the P2 peak appeared between 180 and 200 ms, we averaged the amplitude of P2 within this time range. A three-way repeated-measures ANOVA of P2 amplitudes, with lexicality, acoustic contrast, and hemisphere as three factors, revealed two significant interactions, one between acoustic contrast and hemisphere [*F*(1, 31) = 7.744, *p* < 0.0091, η<sub>p</sub><sup>2</sup> = 0.200] and the other between lexicality and hemisphere [*F*(1, 31) = 12.687, *p* < 0.0012, η<sub>p</sub><sup>2</sup> = 0.290], and two main effects, hemisphere [*F*(1, 31) = 14.393, *p* < 0.0006, η<sub>p</sub><sup>2</sup> = 0.317] and acoustic contrast [*F*(1, 31) = 22.024, *p* < 0.0001, η<sub>p</sub><sup>2</sup> = 0.415].
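As a reading aid, the effect sizes reported with these ANOVAs can be related to the *F* statistics through the standard identity for partial eta squared, where $df_1$ and $df_2$ denote the effect and error degrees of freedom. The source does not state how the values were computed, so this is a consistency check rather than a description of the authors' procedure:

$$
\eta_p^2 = \frac{F \cdot df_1}{F \cdot df_1 + df_2}
$$

For example, for the interaction between acoustic contrast and hemisphere, with $F(1, 31) = 7.744$:

$$
\eta_p^2 = \frac{7.744 \times 1}{7.744 \times 1 + 31} = \frac{7.744}{38.744} \approx 0.200
$$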

**Figure 3** shows the average ERP waveforms of the C3 and C4 electrode groups, **Figure 4** shows the topographies of the ERP component contrasts, and **Figure 5** shows the average P2 amplitudes between 180 and 200 ms in different conditions.
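The P2 measure described above is a mean amplitude over a time window, averaged across an electrode group. A minimal sketch of that computation follows. It assumes an inclusive 180 to 200 ms window, a hypothetical 250 Hz sampling grid, and synthetic waveforms; none of these details are stated in the source.

```python
def window_mean(waveform, times_ms, start_ms, end_ms):
    """Mean amplitude of one channel over an inclusive time window."""
    samples = [v for v, t in zip(waveform, times_ms) if start_ms <= t <= end_ms]
    if not samples:
        raise ValueError("no samples fall inside the requested window")
    return sum(samples) / len(samples)


def group_mean_amplitude(channels, times_ms, start_ms=180, end_ms=200):
    """Average the window means of all channels in an electrode group."""
    per_channel = [window_mean(w, times_ms, start_ms, end_ms) for w in channels]
    return sum(per_channel) / len(per_channel)


# Synthetic data for illustration only (one sample every 4 ms).
times_ms = list(range(0, 400, 4))
left_group = [[t / 100 for t in times_ms] for _ in range(4)]   # e.g., the C3 group
right_group = [[t / 200 for t in times_ms] for _ in range(4)]  # e.g., the C4 group

left_p2 = group_mean_amplitude(left_group, times_ms)
right_p2 = group_mean_amplitude(right_group, times_ms)
```

The resulting per-hemisphere values are the kind of input that would enter the repeated-measures ANOVA with hemisphere as a factor.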

The greater left lateralization of P2 shown by comparing the word conditions with the non-word conditions reflected top-down processing in the early time window around 200 ms. This indicates the involvement of language experience in the process. In order to differentiate words from non-words, participants needed prior language knowledge, which acted as a top-down effect, regardless of whether this effect belonged to word form recognition (Friederici, 2002) or semantic processing.

The greater left lateralization of P2 shown by comparing the consonant conditions with the tone conditions also reflected bottom-up processing in the same time window. The difference between these conditions was the speed of changes in acoustic cues. In line with previous results (Jamison et al., 2006), we found a greater left lateralization in perceiving fast changing acoustic cues (formant transition in stop consonants) and a weaker left lateralization in perceiving relatively slow changing acoustic cues (pitch changes in tones). We expected that the weaker left lateralization of tone processing compared to consonant processing was due to the greater right lateralization of tone processing compared to consonant and rhyme processing in certain brain regions, as reported by Li et al. (2010).

**FIGURE 3 | Average ERP waveforms of the C3 and C4 electrode groups under the four conditions of Experiment 1. (A)** Word, consonant condition; **(B)** Non-word, consonant condition; **(C)** Word, tone condition; **(D)** Non-word, tone condition.


#### **EXPERIMENT 2: LEXICAL DECISION TASK WITH PHONOLOGICAL PRIMING**

#### *Materials*

The recorded stimuli included monosyllabic words as primes and disyllabic words as targets. These words could be real- or non-words in Chinese. The mean duration of the primes was 383.06 ms (range: 251–591 ms, *SD* = 49.17), and that of the targets 619.58 ms (range: 510–751 ms, *SD* = 52.05). There were no significant differences between conditions in either the prime duration [*F*(5, 354) = 1.098, *p* = 0.3609, η<sub>p</sub><sup>2</sup> = 0.015] or the target duration [*F*(5, 354) = 1.244, *p* = 0.2878, η<sub>p</sub><sup>2</sup> = 0.017]. The onset asynchrony between the primes and the targets was fixed at 1000 ms. The sound intensity of the primes was set to 55 dB, and that of the targets 75 dB. The purpose of presenting primes at a lower intensity was to maximize the priming effect (Lau and Passingham, 2007).

#### *Procedure*

In the lexical decision task, participants were asked to judge whether the heard disyllabic words (targets) were words or non-words. They were instructed to ignore the monosyllabic words (primes) played before the disyllabic words and focus on the latter.

Similar to Experiment 1, we adopted a two-by-two design, with word and non-word as the two levels of the lexicality factor, and stop consonant and tone as the two levels of the priming condition factor. The materials in **Table 2** formed six experimental conditions. In the two consonant conditions, the syllable in the prime shared the initial consonant with the first syllable of the target; in the two tone conditions, the syllable in the prime shared the tone with the first syllable of the target; and in the two control conditions, the syllable in the prime and the first syllable of the target shared no phonemes or tonemes.

In each trial, participants first heard the prime. After 600 ms, a fixation appeared on the center of the screen and remained there. After another 400 ms, participants heard the target. After another 1000 ms, the fixation disappeared, and participants had 3000 ms to give their response, by pressing one of the two keys marked by "yes" and "no" on the keyboard. Participants were encouraged not to blink or move their body parts while the fixation was on the screen, and not to respond until the fixation disappeared. One half of the participants responded to the "yes" key with their left index finger and the "no" key with their right index finger. The other half did the reverse. The left or right response order was randomly assigned to participants.

*The pronunciations are annotated with the IPA characters, the numbers in which denote Mandarin tones. As for the primes, Chinese characters are shown in brackets. As for the targets, the Chinese words are shown in brackets, together with their meanings. Each non-word includes two real-syllables (shown in brackets), but their combination does not form a meaningful disyllabic word in Chinese (marked by \*).*

There were in total 360 trials, with 60 trials in each of the six conditions. Each trial lasted around 5 s. Participants first went through a practice session (36 trials) to familiarize themselves with the experimental paradigm. The experiment consisted of six blocks, each having 60 trials and lasting about 5 min. Trials were arranged in a random order. Participants could take a 2-min break after each block, and the whole experiment lasted approximately 40 min.

#### *Data analysis and results*

As for the behavioral data, the response accuracy was 96.0%. The average reaction time was 962.72 ms (*SD* = 159.99). For each participant, outliers more than three standard deviations from the mean were replaced with the mean value. A marginally significant priming effect was observed in the word, consonant priming condition [*t*(30) = −1.993, *p* < 0.055], while the tone priming conditions showed interference effects [as for the word, tone priming condition, *t*(30) = 1.822, *p* = 0.078; as for the non-word, tone priming condition, *t*(30) = 2.524, *p* < 0.017].
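The outlier rule for reaction times can be sketched as follows. The source does not specify whether the sample or population standard deviation was used, or whether the mean was computed before or after excluding outliers. This illustration assumes the sample standard deviation and a mean computed over all of a participant's trials; the reaction times are made up.

```python
import statistics


def replace_outliers(rts, n_sd=3.0):
    """Replace values more than n_sd standard deviations from the mean with the mean.

    Assumptions (not stated in the source): sample SD, and a mean
    computed over all trials including the outliers themselves.
    """
    mean = statistics.mean(rts)
    sd = statistics.stdev(rts)
    low, high = mean - n_sd * sd, mean + n_sd * sd
    return [rt if low <= rt <= high else mean for rt in rts]


# Hypothetical reaction times (ms) for one participant; the last is extreme.
rts = [900, 950, 1000, 920, 980, 940, 960, 910, 990, 930,
       970, 905, 955, 1005, 925, 985, 945, 965, 915, 3000]
cleaned = replace_outliers(rts)
```

With the sample standard deviation, no value can lie more than three standard deviations from the mean unless a participant has at least 11 trials, so the rule only has an effect on reasonably long trial lists.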

As for the ERP data, considering both the P2 topography (Luck, 2005) and the brain regions for tone priming (Wong et al., 2008), we averaged four homolog pairs of adjacent posterior electrodes including P3 and P4 [electrodes 53 (P3), 61, 54, 38 in the left hemisphere, and 87 (P4), 79, 80, 88 in the right hemisphere, according to the EGI system] for analysis. We calculated the priming effects of tones or consonants by subtracting the ERP waveforms in the word or non-word control conditions from those in the word or non-word experimental conditions. Similar to Experiment 1, based on a three-way repeated-measures ANOVA, with lexicality (word vs. non-word), priming condition (consonant vs. tone), and hemisphere (left vs. right) as three factors, we found that the average P2 amplitude in the early time window (200–220 ms, P2 peak values were within this time range) showed two significant interactions, one between lexicality and hemisphere [*F*(1, 30) = 5.618, *p* < 0.0244, η<sub>p</sub><sup>2</sup> = 0.158], and the other between priming condition and hemisphere [*F*(1, 30) = 8.515, *p* < 0.0066, η<sub>p</sub><sup>2</sup> = 0.221], and a main effect of lexicality [*F*(1, 30) = 8.242, *p* < 0.0074, η<sub>p</sub><sup>2</sup> = 0.216]. Similar to Experiment 1, these results indicated that both semantics and acoustic properties affected lateralization.
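The subtraction that defines the priming effects can be sketched as a difference wave per lexicality level. The code below is an illustration with made-up waveforms, not the authors' pipeline. The labels NCon, WTon, and NTon follow the abbreviations used in the figure captions, and WCon is assumed by analogy.

```python
def difference_wave(experimental, control):
    """Sample-by-sample difference: experimental minus control."""
    if len(experimental) != len(control):
        raise ValueError("waveforms must have the same number of samples")
    return [e - c for e, c in zip(experimental, control)]


# Assumed mapping from (lexicality, priming condition) to a short label.
LABELS = {
    ("word", "consonant"): "WCon",
    ("non-word", "consonant"): "NCon",
    ("word", "tone"): "WTon",
    ("non-word", "tone"): "NTon",
}


def priming_effects(waves):
    """Subtract the matching control waveform from each priming condition."""
    return {
        label: difference_wave(waves[(lex, prime)], waves[(lex, "control")])
        for (lex, prime), label in LABELS.items()
    }


# Hypothetical three-sample waveforms (microvolts) for one electrode group.
waves = {
    ("word", "control"): [1.0, 2.0, 3.0],
    ("word", "consonant"): [0.5, 1.5, 2.0],
    ("word", "tone"): [1.5, 2.0, 3.5],
    ("non-word", "control"): [1.0, 1.0, 1.0],
    ("non-word", "consonant"): [1.0, 0.5, 0.0],
    ("non-word", "tone"): [2.0, 2.0, 2.0],
}
effects = priming_effects(waves)
```

Each difference wave would then be averaged within the 200–220 ms window before entering the ANOVA.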

Apart from P2, we conducted another analysis of the ERP waveform in the late time window (500–550 ms) based on four homolog pairs of adjacent frontal electrodes including F3 and F4 [electrodes 25 (F3), 28, 29, 35 in the left hemisphere, and 124 (F4), 123, 118, 117 in the right hemisphere, according to the EGI system]. Rather than deriving auditory N400 by contrasting non-word and word conditions, we analyzed these conditions separately in order to preserve the lexicality factor and keep the factors in the statistical analysis consistent with those in the previous analysis based on P2, though the time windows of interest in these two analyses were different. The data for statistical analysis here were all from priming conditions without subtracting control conditions. By examining the same three factors as in the previous analysis, this analysis showed three main effects, priming condition [*F*(1, 30) = 6.564, *p* < 0.0157, η<sub>p</sub><sup>2</sup> = 0.180], lexicality [*F*(1, 30) = 7.892, *p* < 0.0087, η<sub>p</sub><sup>2</sup> = 0.208], and hemisphere [*F*(1, 30) = 9.193, *p* < 0.0050, η<sub>p</sub><sup>2</sup> = 0.235]. Priming condition interacted significantly with lexicality [*F*(1, 30) = 5.636, *p* < 0.0242, η<sub>p</sub><sup>2</sup> = 0.158]. More importantly, there was a significant interaction between lexicality and hemisphere [*F*(1, 30) = 10.729, *p* < 0.0027, η<sub>p</sub><sup>2</sup> = 0.263].

**Figure 6** shows the average ERP waveforms of the P3 and P4 electrode groups, and **Figure 7** shows those of the F3 and F4 electrode groups. **Figure 8** shows the topographies of the contrasts of the P2 component around 200 ms (200–220 ms), and **Figure 9** shows those of the late ERP component around 500 ms (500–550 ms). **Figure 10** shows the average amplitudes of P2, and **Figure 11** shows those of the late component in different conditions.

The lateralization pattern of P2 could be interpreted as follows. The greater the priming effect, the lower the amplitude of P2, due to the repetition effect. Since there was no main effect of priming condition, the priming effects of consonants and tones were not much different from each other. However, the left and right hemispheres showed different trends of these priming effects, as indicated by the significant interaction between hemisphere and priming condition. By examining the amplitude difference between the priming effects of consonants and tones (see **Figure 8A**) and those of words and non-words (see **Figure 8B**), we found that the priming effect of consonants relative to tones was stronger in the left hemisphere than in the right hemisphere around the centro-parietal region, and that there was a greater left hemisphere advantage in processing words compared to non-words. These results showed that the left hemisphere responded significantly differently in the consonant priming and tone priming conditions, as well as in the word and non-word conditions. In line with Experiment 1, these results illustrated that both bottom-up and top-down processing took place around 200 ms. The left hemisphere responded more strongly to fast changing acoustic cues than to slow changing ones, and more strongly to meaningful words than to non-words that had no meanings.

By comparing the word and non-word conditions, we found that the amplitudes of frontal region at the late component (around 500 ms) were lower in the non-word conditions compared to the word conditions (consistent with the main effect of lexicality), which was right lateralized (consistent with the interaction between lexicality and hemisphere) (see **Figure 9**). Such a right lateralization was also shown in Experiment 3 when comparing the tone-induced N400 with the consonant-induced N400.

**FIGURE 6 | Average ERP waveforms of the P3 and P4 electrode groups under the six conditions of Experiment 2. (A)** Word, consonant priming condition; **(B)** Non-word, consonant priming condition.

#### **EXPERIMENT 3: SEMANTIC VIOLATION IN SENTENCES**

#### *Materials*

The recorded stimuli included a number of Chinese sentences. Each sentence consisted of 11 syllables, and the last two were always a verb and its object. Semantic violation was induced by changing the tone or consonant of the last syllable of a sentence. The average duration of these sentences from the onset of the first syllable to the offset of the last one was 3206.1 ms (*SD* = 156.1). There was no significant difference in these durations between conditions [*F*(2, 177) = 0.697, *p* = 0.4996, η<sub>p</sub><sup>2</sup> = 0.008]. The intensity of these sentences was adjusted to 75 dB. **Table 3** shows examples of such sentences.

#### *Procedure*

In the sentence comprehension task, participants were asked to judge whether the last syllable in a sentence was consistent with the context or not. Since the violation appeared toward the end of the sentence, participants were encouraged not to blink or move their body parts toward the end of the sentences.

comparison between the word, consonant priming condition and the word, control condition, and "NCon" between the non-word, consonant priming condition and the non-word, control condition, "WTon" between the word, tone priming condition and the word, control condition, "NTon" between the non-word, tone priming condition and the non-word, control condition.

There were three experimental conditions (see **Table 3**): in the control condition, there was no semantic violation; in the consonant violation condition, the violation was induced by changing the initial of the last syllable of the sentence; and in the tone violation condition, the violation was induced by changing the tone of the last syllable of the sentence.

**FIGURE 11 | Average amplitudes of the late ERP component around 500–550 ms in the F3 and F4 electrode groups in Experiment 2.**

**Table 3 | Example sentences and experimental conditions in Experiment 3.**

*The last syllable of each sentence is the target word, and all the previous ones are the context. In each condition, the Chinese transcription of the sentence and its meaning are shown. The pronunciations of the last two syllables are annotated with the IPA characters, the numbers in which denote Mandarin tones. In the violation conditions, the last syllable induces violation, and the syllable after change is still a real-syllable in Chinese. For comparison, the syllables within the square brackets are consistent with the context.*

In each trial, a fixation first appeared on the screen and remained there. After 400 ms, one of the sentences was presented to participants. The fixation disappeared at 1000 ms after the onset of the last syllable in the sentence, and then participants had 2000 ms to respond by pressing one of the two keys marked by "yes" and "no" on the keyboard. One half of the participants pressed the "yes" key with their left index finger and the "no" key with their right index finger. The other half did the reverse. The left or right response order was randomly assigned to participants.

There were in total 180 testing sentences, with 60 sentences in each condition. We also added ten filler sentences, each having the same length as the testing sentences, no semantic violation, and a free structure. The purpose of incorporating filler sentences was to make the yes and no responses equally likely. The average length of each trial was 6606.1 ms (*SD* = 156.1). Participants first went through a practice session (30 trials) to familiarize themselves with the experimental paradigm. The experiment consisted of six blocks, each containing 40 sentences. These sentences included ten randomly chosen sentences from each of the three conditions, and ten filler sentences. The order of these sentences was randomized. Each block lasted about 4 min. Participants could take a 2-min break between blocks. The whole experiment lasted about 30 min.
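The block structure described above can be sketched as follows. This is an illustration under two assumptions that the source does not state explicitly: the testing sentences are drawn without replacement across blocks, and the same ten filler sentences recur in every block. The sentence identifiers are hypothetical.

```python
import random


def build_blocks(items_by_condition, fillers, n_blocks=6, per_condition=10, rng=None):
    """Split each condition's sentences across blocks, add fillers, and shuffle."""
    rng = rng or random.Random()
    blocks = [[] for _ in range(n_blocks)]
    for condition, items in items_by_condition.items():
        if len(items) != n_blocks * per_condition:
            raise ValueError(f"{condition}: expected {n_blocks * per_condition} items")
        shuffled = rng.sample(items, len(items))  # shuffled copy, input left intact
        for b in range(n_blocks):
            chunk = shuffled[b * per_condition:(b + 1) * per_condition]
            blocks[b].extend((condition, item) for item in chunk)
    for block in blocks:
        # Assumption: the same ten fillers are presented in every block.
        block.extend(("filler", item) for item in fillers)
        rng.shuffle(block)
    return blocks


CONDITIONS = ("control", "consonant_violation", "tone_violation")
items = {c: [f"{c}_{i:02d}" for i in range(60)] for c in CONDITIONS}
fillers = [f"filler_{i:02d}" for i in range(10)]
blocks = build_blocks(items, fillers, rng=random.Random(0))

# Expected "yes" trials (no violation) versus "no" trials (violation).
yes_trials = sum(1 for blk in blocks for cond, _ in blk if cond in ("control", "filler"))
no_trials = sum(1 for blk in blocks for cond, _ in blk if cond.endswith("violation"))
```

Under these assumptions, each block holds 40 sentences, and the fillers bring the number of expected "yes" responses up to the number of expected "no" responses.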

#### *Data analysis and results*

As for the behavioral data, the response accuracy was 94.3%. The average reaction time was 854.13 ms (*SD* = 186.61). For each participant, outliers more than three standard deviations from the mean were replaced with the mean value.

As for the ERP data, we referred to the data recorded by the four homolog pairs of adjacent electrodes including C3 and C4 [electrodes 37 (C3), 38, 42, 43 in the left hemisphere, and 105 (C4), 88, 104, 94 in the right hemisphere, according to the EGI system] for analysis. A two-way repeated-measures ANOVA, with violation type (consonant vs. tone) and hemisphere (left vs. right) as two factors, revealed a main effect of violation type [*F*(1, 28) = 9.622, *p* < 0.0044, η<sub>p</sub><sup>2</sup> = 0.256], and a significant interaction between hemisphere and violation type [*F*(1, 28) = 9.573, *p* < 0.0044, η<sub>p</sub><sup>2</sup> = 0.255]. A *post-hoc* *t*-test revealed a significant right lateralization of the N400 [*t*(28) = 2.164, *p* < 0.0391] in the tone violation condition, and no significant lateralization in the consonant violation condition. Similar to Experiment 2, this analysis took the control condition into account by subtracting its ERP waveforms from those in the experimental conditions.

**Figure 12** shows the average ERP waveforms of the C3 and C4 electrode groups, **Figure 13** shows the topographies and differences of auditory N400, and **Figure 14** shows the average amplitudes of N400 (300–350 ms) in different conditions.

The significant interaction between hemisphere and violation type reflected bottom-up processing. Notably, there was a right lateralization in the difference between the N400 induced by tone violation and that induced by consonant violation, which supported the view that bottom-up (acoustic) processing also existed in the late stage of perception.

#### **GENERAL DISCUSSIONS**

To sum up, in Experiment 1 and Experiment 2, we discovered both top-down (semantic) and bottom-up (acoustic) processing in the early time window around 200 ms. In the late stage around 300–500 ms, we only found the top-down effect in Experiment 2, probably because the phonological primes were presented too early and the bottom-up effect could not last long enough. However, in Experiment 3, the bottom-up effect was reflected by N400 in the late stage. As indicated by the late component in Experiment 2 and Experiment 3, we suggest that both top-down and bottom-up processing existed at the late stage. The N400 component had a shorter latency in Experiment 3 than that in Experiment 2 because of the context effect, and the topography of the earlier N400 in Experiment 3 had a more central distribution compared to the late frontal N400 in Experiment 2, which is consistent with the description of N400 in time and spatial domains (Kutas and Federmeier, 2011).

#### **RELATION BETWEEN TOP-DOWN AND BOTTOM-UP PROCESSING DURING LEXICAL TONE PERCEPTION**

During lexical tone perception, the prior knowledge formed by language experience helps match a large variety of pitch contours onto clear tonal categories, and the semantic representation requires combining tonal categories with carrying syllables. Therefore, the prior knowledge of tonal categories and the lexical semantics of linking tonal categories with carrying syllables become the primary top-down factors during lexical tone perception. Since semantic and categorical information of phonemes is processed dominantly in the left hemisphere (McDermotta et al., 2003; Liebentral et al., 2005), a general left lateralization pattern during lexical tone perception reflects top-down processing. Similarly, the primary acoustic cue of lexical tone is pitch variation, and processing of pitch variations is bottom-up. Since it is widely accepted that the right hemisphere is dominant for pitch processing (Sidtis, 1981; Tenke et al., 1993), a relative right lateralization pattern during lexical tone perception also reflects bottom-up processing.

By tracing the auditory ERP components in the early and late time windows, our experiments explore the general and relative lateralization patterns in conditions with or without lexical semantics and slow or fast changing acoustic cues. Though involving distinct tasks, these experiments reveal two consistent lateralization patterns of early (P2) and late (N400) ERP components: (a) manipulation of linguistic information in words modulates lateralization: meaningful words tend to generate a greater left lateralization; and (b) manipulation of the physical properties of auditory input modulates lateralization: faster changing cues generate a greater left lateralization. These two patterns reflect top-down (lexical semantic) and bottom-up (acoustic phonetic) processing, respectively. Since lexical tone perception concerns both acoustic properties and lexical semantics, the observed lateralization patterns are never a simple dichotomy of purely left or right lateralization, as observed in the previous studies focusing only on one aspect of lexical tone perception. The modulation effects on lateralization at both the early and late time windows suggest that both top-down and bottom-up processing exist at different stages of perception, which supports a parallel model of top-down and bottom-up processing.

#### **THREE-STAGE, PARALLEL LEXICAL TONE PROCESSING MODEL**

Previous explorations revealed that lexical tone differed from segmental cues (Ye and Connie, 1999; Lee, 2007), and that lateralization of lexical tone processing differed from that of segment processing (Li et al., 2010). Neuroimaging studies of lexical tone processing also revealed that separate brain regions were involved in perceiving lexical tones compared to segments (Gandour et al., 2003a). Apart from perception, differences between tone and segment processing were also found in lexical tone production (Liu et al., 2009). However, none of these explorations disentangled language experience as top-down factors from acoustic cues as bottom-up factors. Treating the processing of pitch information and semantic role as one undivided process makes it difficult to figure out the cognitive processes during tone processing (Zatorre and Gandour, 2008), since lexical tone processing involves both acoustic and linguistic factors.

In our study, we regard pitch processing as bottom-up processing in lexical tone perception, since it concerns acoustic cues, and semantic processing as top-down processing, since it involves language experience. In this way, we separate the two cognitive functions that tone perception calls upon. By exploring the temporal relation between bottom-up and top-down processing in the time windows around 200 ms and around 300–500 ms after stimulus onset, we confirm that both types of processing participate in tone perception during these early and late time periods.

Apart from these findings, there was also evidence showing a greater left lateralization of contour tones than level tones as well as a general left lateralization of Cantonese lexical tone perception in the N1 component around 100 ms after stimulus onset (Ho, 2010; Shuai et al., in press). The result of Cantonese perception reflected a bottom-up acoustic effect at around 100 ms, whereas the general left lateralization was consistent with the lateralization of top-down effect as observed in our experiments of Mandarin tone perception.

Based on the findings in our experiments and those previous studies, we propose a detailed, three-stage, parallel model of lexical tone processing. The three stages are defined based on the occurrences of different ERP components in our and previous experiments (e.g., N1, around 100 ms after stimulus onset; P2, around 100–300 ms; and N400, after 300 ms; among these components, N1 and P2 belong to the early stage and N400 belongs to the late stage).

At the first stage (before and around 100 ms after stimulus onset, as in Ho, 2010 and Shuai et al., in press), syllable initials are processed to provide the basic structure of the syllable. At this stage, if the syllable starts with a vowel or a sonorant consonant, pitch information is available; if it starts with a voiceless consonant, there is no pitch information. In either case, tonal category is not formed yet, since the recognition of pitch variation or pattern, as slow-varying acoustic cues, requires a time window longer than 100 ms. At this stage, top-down linguistic processing also occurs, no matter whether there is contextual information before the syllable initial. With contextual information, top-down processing would become stronger, though it may play distinct roles from bottom-up processing at this stage.

At the second stage (100–300 ms after stimulus onset, as in our experimental conditions), predictions about pitch patterns and following segmental information are made, based on the information gathered at the first stage and prior language experience. At this stage, semantic information at a gross level is also activated via top-down processing, which is initiated based on the information gathered at the first stage. Since lexical tone is not fully recognized, bottom-up processing is still ongoing. According to previous literature (Kaan et al., 2007, 2008), pitch variation in the middle portion of a syllable is the most important for recognizing contour tones for native tonal language speakers. At this stage, listeners keep integrating the incoming pitch information with the information gathered at the first stage in order to recognize the tone and the whole syllable. In this sense, both top-down prediction of tonal categories and bottom-up generalization of tonal categories are taking place at this stage.

At the third stage (after 300 ms, as in our experimental conditions), top-down processing becomes more prominent, helping listeners recognize the tone, the whole syllable, and its meaning, based on prior language experience and the information gathered at the first two stages. Detailed semantic information is recognized at this stage. However, bottom-up processing keeps taking effect, helping to confirm the recognized tone and syllable.

Compatible with our findings in the series of experiments involving various levels of language processing, this parallel model of lexical tone perception can shed important light on general speech perception models in many aspects.

First, this parallel model, as a cognitive model, proposes that top-down processing is available in both the early and the later stages of lexical tone perception, especially when contextual information is available.

Second, this parallel model refutes the claim that semantic processing (top-down) always occurs after acoustic processing (bottom-up), as in Friederici's general auditory processing model and Luo et al.'s serial model. The influence of previous language experience always exists during speech perception, especially in the case of auditory sentence processing. There are various types of cues and ample information that can serve as context for perceiving incoming syllables, and the neural system in humans always makes predictions. Even in the case of single syllable perception, if the task is linguistically relevant, top-down processing based on language experience is inevitable. As shown in our experiments, the greater left lateralization of P2 and N400 under the semantic conditions explicitly reflects such language-relevant, top-down processing.

Third, the bilateral lexical tone processing also complements the neuroimaging models of speech perception. For example, in the dorsal-ventral pathway hypothesis of speech perception (Hickok and Poeppel, 2007), phoneme perception was regarded as involving only the left hemisphere, since such generalization did not involve tonal languages. Considering that 60–70% of world languages are tonal languages (Yip, 2002), a speech perception model leaving out tones is incomplete.

#### **TOP-DOWN PROCESSING AT THE PREATTENTIVE STAGE**

Humans often predict incoming signals based on experience. Therefore, top-down processing could accompany the whole process of speech perception. In addition, information generated by bottom-up processing is also used to match the prediction coming from top-down processing. In this sense, top-down processing is a pre-determined process, preparing the relevant hemisphere or brain regions for the forthcoming task. When stimuli come in, they will evoke responses from the corresponding hemisphere or brain regions, and these responses may adjust or even alter the degrees of lateralization. A similar effect is shown in the attention or memory modulation of lateralization in DL tasks (Hugdahl, 2005; Saetrevik and Hugdahl, 2007). For example, by asking participants to attend to stimuli in either left or right ears (Hugdahl, 2005), the degree of lateralization is adjusted in favor of the side attended.

Even though long-term language experience keeps affecting automatic processing at the preattentive stage, it is generally hard to observe online top-down processing at this stage. It is only recently that automatic top-down processing has gained researchers' attention (Kherif et al., 2011; Wager et al., 2013). Considering that the automatic preattentive processing discovered via the MMN paradigm can be induced by either acoustic properties or long-term language experience, the MMN component is able to reflect not only bottom-up processing, but also top-down processing relevant for semantics and syntax at the preattentive stage (Pulvermüller, 2001; Pulvermüller et al., 2001a,b; Pulvermüller and Shtyrov, 2006; Penolazzi et al., 2007; Shtyrov and Pulvermüller, 2007; Gu et al., 2012). Top-down processing was also found as early as around 200 ms after stimulus onset with attention (Bonte et al., 2006). However, an MMN study of tone perception (Luo et al., 2006) found only a general right lateralization. Two factors may cause this. First, many of these experiments only concern acoustic processing, i.e., bottom-up processing at the preattentive stage (e.g., Luo et al., 2006). Second, without recruiting linguistic factors, top-down processing, which is primarily left-dominant and most evident in semantic processing, would have the least influence at the preattentive stage (e.g., Xi et al., 2010).

In our study, we consider both semantic roles that require linguistic top-down processing and pitch variations that require acoustic bottom-up processing in active, language-relevant tasks, which distinguishes our experiments from those MMN experiments. Via a series of tasks that involve different levels of language processing, we explicitly address top-down processing at both the early and late stages of lexical tone perception, and gather consistent evidence of the co-occurrence of top-down and bottom-up processing at those stages.

#### **LANGUAGE PROCESSING AND GENERAL COGNITIVE FUNCTIONS**

Separating lexical tone perception into bottom-up and top-down processing not only decomposes this language-specific function into general cognitive functions such as sensation or memory, but also reveals that speech processing shares similar mechanisms with other cognitive functions. For example, attention comprises both bottom-up attention (an automatic attention shift to an unexpected event, requiring neither executive processing nor any active engagement beforehand) and task-related top-down attention (Connor et al., 2004; Buschman and Miller, 2007; Pinto et al., 2013), both of which take part in information processing.

On the one hand, although previous work on speech perception focuses mainly on the left hemisphere, there are ample findings arguing against the existence of a centralized "core" in the left hemisphere dedicated exclusively to language processing. For example, the language function of intonation shows a right hemisphere advantage (Gandour et al., 2003b). Following a decompositional view, such right hemisphere advantage can be ascribed to consistent right hemisphere advantages of general cognitive components, including perception of slow-varying cues and emotions. Although lexical tone perception is special in the sense that it involves advantageous components in both the left hemisphere (semantic processing) and the right hemisphere (pitch processing), we can apply the same view to it. Similarly, this decompositional view can also be extended to aspects of semantics and syntax. Rather than arguing that there is no brain region that is specific for language processing, what the decompositional view emphasizes is that language must be supported by many general functions and share or recruit similar computational resources as those general functions.

On the other hand, it is not uncommon to conceptualize complex cognitive functions like language as a combination of general functions in terms of cognitive models and neural circuitry (Dehaene and Cohen, 2007; Hurley, 2007; Anderson, 2010). Take attention as an example: there is a heated debate on whether our attention is drawn voluntarily by top-down, task-dependent factors or involuntarily by bottom-up, saliency factors (Theeuwes, 1991; Buschman and Miller, 2007). Similarly, language processing also involves domain-general functions (Yip, 2002; Hurford, 2007; Fitch, 2010; Arbib, 2012). Although examining top-down and bottom-up mechanisms is already common in the study of other cognitive functions, such a separation has not been commonly practiced in the previous language processing literature.

Noting these, unlike the previous research that puts too much emphasis on discovering domain-specific cognitive or neural mechanisms for language processing, we advocate that decomposing language functions into more basic components (Fitch, 2010) and locating the neural networks that systematically marshal these functions (Sporns, 2011) can lead to rigorous views about the essential commonalities between language and other cognitive functions.

#### **CONCLUSION**

In this paper, we reported three ERP experiments that collectively illustrated that both bottom-up processing and top-down processing during lexical tone perception co-occurred in both the early (around 200 ms) and late (around 300–500 ms) time windows of processing. Based on these findings, we proposed a parallel lexical tone processing model that entailed both types of processing throughout various processing stages. This experimental study discussed not only the temporal relation between bottom-up and top-down processing during tone perception, but also the similarities between language processing and other cognitive functions, the latter of which pointed out an important direction in future research of language processing and general cognition.

#### **ACKNOWLEDGMENTS**

This work is supported in part by the Seed Fund for Basic Research of the University of Hong Kong. The preliminary results in this paper were first presented at the 3rd International Symposium on Tonal Aspects of Languages (TAL 2012). We thank William S.-Y. Wang for his generous support of this work. We are also grateful to the three anonymous reviewers for their useful comments on this work.

#### **REFERENCES**


Sporns, O. (2011). *The Networks of the Brain*. Cambridge, MA: MIT Press.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### *Received: 15 July 2013; paper pending published: 11 December 2013; accepted: 09 March 2014; published online: 25 March 2014.*

*Citation: Shuai L and Gong T (2014) Temporal relation between top-down and bottom-up processing in lexical tone perception. Front. Behav. Neurosci. 8:97. doi: 10.3389/fnbeh.2014.00097*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience.*

*Copyright © 2014 Shuai and Gong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### Left-right compatibility in the processing of trading verbs

#### **Carmelo M. Vicario<sup>1,2,3</sup>\* and Raffaella I. Rumiati<sup>2</sup>**

<sup>1</sup> School of Psychology, The University of Queensland, Brisbane, QLD, Australia

<sup>2</sup> Cognitive Neuroscience Sector, SISSA, Trieste, Italy

<sup>3</sup> School of Psychology, Bangor University, Bangor, UK

#### **Edited by:**

Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA

#### **Reviewed by:**

Daniel Casasanto, University of Chicago, USA

Feng Kong, Beijing Normal University, China

#### **\*Correspondence:**

Carmelo M. Vicario, School of Psychology, Bangor University, Brigantia Building, Bangor, Gwynedd, Wales LL57 2AS, UK e-mail: carmelo.vicario@uniroma1.it

Research investigating the nature of the cognitive processes involved in the representation of economic outcomes is growing. Within this research, the mental accounting model proposes that individuals may well use cognitive operations to organize, evaluate, and keep track of their financial activities (Thaler, 1999). Here we wanted to test this hypothesis by asking a group of participants to detect a syntax mistake in verbs indicating incoming and going out activities related to economic profit (trading verbs), swapping (swapping verbs) and thinking (thinking verbs). We found a left-right compatibility for trading verbs (i.e., participants were faster with their right hand while detecting verbs referring to a monetary gain with respect to a monetary loss, and faster with their left hand while detecting verbs referring to a monetary loss with respect to a monetary gain). However, this pattern of results was not found for swapping verbs. Results are discussed taking into account the mental accounting theory as well as the spatial mapping of valence hypothesis.

**Keywords: language, economics, SNARC effect, mental accounting theory, spatial valence hypothesis**

#### **INTRODUCTION**

Interest in the nature of the cognitive processes involved in the representation of economic outcomes has been growing in recent years (see e.g., Wu et al., 2012 for a recent review). Several studies in cognitive sciences and financial economics propose an inextricable interdependence between rationality and emotion (Grossberg and Gutowski, 1987; Damasio, 1994; Elster, 1998; Loewenstein, 2000; Harvey et al., 2010) in influencing human economic choices and behaviors (see also Glimcher and Rustichini, 2004). For instance, Loewenstein (2000) highlights the impact of immediate emotions, as well as a wide range of visceral factors associated with them, in determining systematic behaviors that could also be amenable to formal modeling. Moreover, a crucial role is played by the activation of reward-related brain areas, such as the striatum (Fehr and Camerer, 2007).

Nevertheless, other factors, beyond those mentioned above, might play a role in processing and representing economic outcomes. The mental accounting theory (Thaler, 1980), a model developed in the field of behavioral economics, offers some intriguing suggestions in this direction. This model attempts to describe the process whereby people code, categorize and evaluate economic outcomes. According to this model, individuals use cognitive operations to organize, evaluate, and keep track of their financial activities (Thaler, 1999). Since this model holds that accounting operations are engaged in evaluating economic outcomes, one could expect a specific role of the mathematical brain processes in representing financial meanings.

A way to test this hypothesis is provided by the study of language. In fact, one could argue that the same cognitive processes active while manipulating quantity might be involved in processing financial words. Thus, following this suggestion, linguistic items such as verbs referring to monetary *gain* and/or *loss* could be conceptualized in terms of mental shifts toward higher or lower quantities or as two mental accounting operations such as *addition* and *subtraction*. This proposal originates from the evidence that Western populations are endowed with a left-to-right Mental Number Line (MNL) for representing quantities, from lower to higher, respectively (Dehaene et al., 1990). Moreover, Knops et al. (2009) have shown that during non-symbolic addition, subjects preferentially selected numbers at the upper right location, whereas during non-symbolic subtraction, they were biased toward the upper left location.

In consideration of these findings, one could expect a similar left-to-right spatial encoding for the representation of linguistic terms which refer to monetary gain and loss. Accordingly, one can hypothesize that the same cognitive mechanisms underlying the representation of quantity are covertly engaged when people read verbs associated with a monetary gain or loss. Given the direct relation, in the cognitive system, between quantity (low vs. high), arithmetic operations (subtraction vs. addition) and spatial coordinates (left vs. right), one could expect faster reaction times (RTs) with the right hand while processing verbs related to a monetary *gain* with respect to verbs related to a monetary *loss* (namely trading verbs). On the other hand, one could expect faster RTs with the left hand while processing verbs associated with a monetary *loss* with respect to verbs associated with a monetary *gain*. Participants also performed a second block of stimuli (namely swapping verbs), consisting of verbs that describe incoming and going out outcomes perceived as an exchange. In fact, the main difference between trading and swapping verbs is that only trading verbs explicitly suggest the meaning of "economic profit", although a monetary outcome can be associated with both categories (e.g., money loss vs. money donation). We used swapping verbs to create an incoming vs. going out condition in the absence of high economic relevance (compared to the trading category). In this way, we could have more elements to understand whether the origin of the hypothesized left-right encoding is linked to the economic relevance of the linguistic term rather than to the incoming vs. going out meaning covertly suggested by all these verbs.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Twenty-two right-handed graduate students (10 men, 12 women, mean age: 26 ± 7.03 years), recruited from the University of Trieste, participated in the study after providing verbal informed consent. The experiment was performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki. Participants received a payment of 10 Euros for taking part in this study.

#### **PROCEDURE AND INSTRUMENTS**

Using their left and right index fingers, participants were required to establish, as quickly as possible and in two consecutive sessions (counterbalanced design), whether or not each of 108 verbal stimuli contained a syntax mistake. The task was identical for both trading and swapping verbs.

#### **Trading verbs block**

Fifty-four of 108 items were spelt correctly; of these, 18 (6 verbs × 3 trials) indicated a monetary gain and 18 (6 verbs × 3 trials) indicated a monetary loss. Moreover this block included 18 *thinking* verbs (6 verbs × 3 trials) as control items (see **Table 1** for the complete list).

#### **Swapping verbs block**

Fifty-four of 108 items were spelt correctly; of these, 18 (6 verbs × 3 trials) indicated a receiving action and 18 (6 verbs × 3 trials) indicated a giving action. This block also included 18 *thinking* verbs (6 verbs × 3 trials) as control items (see **Table 1** for the complete list).

All verbs were presented in the first person and in the simple present tense. Each trial was preceded by an alerting sentence (ready) lasting 500 ms and followed by a fixation cross lasting 500 ms. The within-subjects variable was the responding finger. Incorrect responses (Trading verbs: 2.82%; Swapping verbs: 3.05%) were not considered in the analysis.

#### **STATISTICAL ANALYSIS**

RTs for correctly written stimuli were analyzed using a repeated-measures ANOVA, with VERB (trading vs. swapping), MANUAL RESPONSE (left vs. right) and MEANING (incoming, thinking and going out) as main factors. Trading and swapping verbs were presented in separate blocks. *Post-hoc* comparisons were performed using the Duncan *post-hoc* test. For all statistical analyses, a *p*-value of 0.05 was considered to be significant. Data analysis was performed using Statistica software, version 8.0 (StatSoft, Inc., Tulsa, USA). We also performed a permutation analysis in which we relabelled and shuffled verbs across conditions. This analysis was conducted to approximate what could have happened if Swapping and Trading verbs had been randomly assigned. It was carried out using Matlab software, version R2013a. The number of permutations selected for this procedure was 1,000; a *p*-value of 0.05 was considered to be significant.
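The label-shuffling logic behind a permutation analysis can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Matlab procedure used in the study: the function name `permutation_p_value`, the RT values, and the choice of the absolute difference of means as the test statistic are all hypothetical. Only the number of permutations (1,000) is taken from the text.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Condition labels are shuffled across the pooled observations, and the
    shuffled mean differences are compared with the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of condition labels
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            extreme += 1
    # Add-one correction: the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical RTs in ms (illustrative only, not the study's data).
incoming_rts = [1040, 1075, 1062, 1090, 1051, 1068]
going_out_rts = [990, 1005, 1012, 985, 1001, 998]

p = permutation_p_value(incoming_rts, going_out_rts)
print(f"permutation p-value: {p:.4f}")
```

A small p-value indicates that a difference as large as the observed one rarely arises when labels are assigned at random, which is the sense in which a permutation analysis approximates random assignment of verbs to conditions.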

#### **RESULTS**

To evaluate how familiar participants were with the selected verbs, we asked them to rate their subjective experience/familiarity with each verb on a five-point scale. The higher the reported score, the higher the subjective experience/familiarity with a verb.

#### **TRADING VERBS**

We detected a significant difference (*F*(2,40) = 35.00, *p* < 0.001). *Post-hoc* comparisons revealed that both Incoming and Going out verbs significantly differed from the control category (*Thinking*: *M* = 4.753 ± 0.250 vs. *Incoming*: *M* = 3.531 ± 0.922 SD, *p* < 0.001; *Thinking*: *M* = 4.789 ± 0.234 vs. *Going out*: *M* = 3.515 ± 0.783 SD, *p* < 0.001), while no difference was observed between them (*p* = 0.100).



The italicized items are the verbs (in Italian) originally used in the task. The English translation is given in parentheses.

#### **SWAPPING VERBS**

We detected a significant difference (*F*(2,40) = 67.50, *p* < 0.001). *Post-hoc* comparisons showed that both types of *Swapping Verbs* significantly differed from the control category (*Thinking*: *M* = 4.734 ± 0.260 vs. *Incoming*: *M* = 3.174 ± 0.749, *p* < 0.001; *Thinking*: *M* = 4.734 ± 0.260 vs. *Going out*: *M* = 3.795, *p* < 0.001). A significant difference was also found between swapping verbs (*p* < 0.001), showing that going out verbs were perceived as more familiar than incoming verbs.

The analysis of RTs was conducted after excluding 2 participants from the original sample: one participant was excluded because of his low performance accuracy (i.e., <80%); the other participant was excluded because he did not complete the experiment (i.e., the swapping block). In line with our research hypothesis, we detected a significant VERB \* MEANING \* MANUAL RESPONSE interaction (*F*(2,38) = 5.24, *p* = 0.009). *Post-hoc* comparisons showed a double dissociation in RTs for verbs of the trading category. In particular, participants were significantly faster in detecting going out verbs (*M* = 998.7 ± 59.69) with respect to incoming verbs (*M* = 1063.6 ± 62.94) while using their left hand (*p* = 0.013), and incoming verbs (*M* = 1056.9 ± 67.02) with respect to going out verbs (*M* = 1109.9 ± 75.74) while using their right hand (*p* = 0.042). On the other hand, *post-hoc* comparisons concerning verbs of the swapping category only showed faster RTs in detecting going out items with respect to incoming items. This was significant for both left (*p* = 0.007) and right (*p* = 0.001) hand responses (see **Figure 1** for details).
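The double dissociation for trading verbs can be read directly off the reported means. The short Python sketch below, offered only as an illustration, recomputes the incoming-minus-going-out RT difference for each hand from the values given above; the dictionary layout and the function name `going_out_advantage` are assumptions introduced here.

```python
# Mean RTs (ms) for trading verbs, as reported in the text above.
trading_rt = {
    ("left", "incoming"): 1063.6,
    ("left", "going_out"): 998.7,
    ("right", "incoming"): 1056.9,
    ("right", "going_out"): 1109.9,
}


def going_out_advantage(rt, hand):
    """Incoming minus going out mean RT for one hand, in ms.

    A positive value means going out verbs were detected faster;
    a negative value means incoming verbs were detected faster.
    """
    return round(rt[(hand, "incoming")] - rt[(hand, "going_out")], 1)


left_effect = going_out_advantage(trading_rt, "left")
right_effect = going_out_advantage(trading_rt, "right")
print(f"left hand: {left_effect} ms, right hand: {right_effect} ms")
```

The sign flips between hands: roughly a 65 ms advantage for going out verbs with the left hand and a 53 ms advantage for incoming verbs with the right hand, which is the crossover pattern described as left-right compatibility.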

This pattern of results was confirmed by the permutation analysis in almost all cases. In particular, we found a significant RT difference for the trading category when comparing going out with incoming verbs for the left (*p* = 0.024) and the right (*p* = 0.005) hand. We also detected a significant RT difference for the swapping category when comparing going out with incoming verbs for the right hand (*p* < 0.001). However, this difference was not significant for the left hand (*p* = 0.1727).

**FIGURE 1 |** … detecting the correctly spelt items of both trading and swapping categories. Vertical bars indicate one standard error. The "\*" indicates significant post-hoc comparison differences.

Google Ngram Viewer was also used to estimate the frequency of use of these verbs in the Italian language. In particular, we focused on the interval between 1998 and 2008 (i.e., the most recent temporal range available with Google Ngram Viewer).

First, we performed a *t*-test comparing the Ngram Viewer output (i.e., the number of citations) for the 12 Trading verbs (*M* = 7921698.8) with that for the 12 Swapping verbs (*M* = 12908273.3). The difference was not significant (*t* = −0.91, *p* = 0.367). We also performed two repeated-measures ANOVAs in which we compared the Google Ngram Viewer output of Incoming, Going out and Thinking verbs for the Trading and Swapping categories separately. The analysis for the Trading category revealed a significant difference (*F*(2,10) = 5.99, *p* = 0.01). In particular, we found that Thinking verbs (*M* = 38387857.5) were more frequently cited than Incoming (*M* = 3805269.5; *p* = 0.009) and Going out (*M* = 12038128.3; *p* = 0.030) verbs. No difference was found between Incoming and Going out verbs (*p* = 0.448). On the other hand, we did not detect a significant difference for the swapping block, although there was a trend (*F*(2,10) = 3.67, *p* = 0.06). **Figure 2** shows a plot of the three verb categories.
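As a quick consistency check on the reported frequencies, the mean for the 12 Trading verbs should equal the weighted mean of its two 6-verb subgroups (Incoming and Going out). The Python sketch below recomputes it from the values given above; the variable names are assumptions introduced here, and the subgroup size of 6 follows from the description of the trading block.

```python
# Reported Ngram means for the two trading subgroups (6 verbs each).
incoming_mean = 3805269.5
going_out_mean = 12038128.3
verbs_per_subgroup = 6

# Weighted mean over all 12 trading verbs.
pooled_trading_mean = (
    incoming_mean * verbs_per_subgroup + going_out_mean * verbs_per_subgroup
) / (2 * verbs_per_subgroup)

reported_trading_mean = 7921698.8
discrepancy = abs(pooled_trading_mean - reported_trading_mean)
print(f"recomputed: {pooled_trading_mean:.1f}, discrepancy: {discrepancy:.1f}")
```

The recomputed value agrees with the reported mean to within rounding of the subgroup means.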

#### **DISCUSSION**

The purpose of this study was to address the question suggested by the mental accounting theory, that is, whether people use cognitive operations for processing financial activities (Thaler, 1999). This hypothesis was addressed by studying the participants' performance in a linguistic task.

Recently, Baroni et al. (2013) documented a left-right compatibility using financial words (e.g., *monopoly*, *salary*, *discount*), with faster left hand RTs for words indicating loss concepts (e.g., unemployment) and, *vice versa*, faster right hand RTs for words indicating gain concepts (e.g., salary). However, the effect on RTs was found only when participants were required to explicitly discriminate between gain and loss; it was not detected when they were required to discriminate between economic and non-economic terms. On the other hand, in a further experiment, in which participants were asked to arbitrarily allocate financial words along a line, the authors documented a left space preference in the spontaneous allocation of "loss" words and, *vice versa*, a right space preference in the spontaneous allocation of "gain" words. This last experiment suggests a left-to-right encoding for gain and loss meaning even in implicit tasks. Our study differs from the research conducted by Baroni et al. (2013) not only with respect to the adopted procedure (i.e., participants were asked to identify a syntax mistake), but also with respect to the verbal material [i.e., Baroni et al. (2013) used financial words, while we used verbs which referred to incoming vs. going out outcomes, with high (i.e., trading) vs. low (i.e., swapping) economic relevance].

In line with the initial prediction, we found a left-to-right spatial compatibility for trading verbs. In particular, participants were significantly faster in detecting verbs indicating a monetary loss (i.e., going out) with respect to verbs indicating a monetary gain (i.e., incoming) while using their left hand; on the other hand, participants were faster in detecting verbs indicating a monetary gain with respect to verbs indicating a monetary loss while using their right hand. However, we did not detect a left-to-right spatial compatibility for swapping verbs. This suggests that the incoming vs. going out meaning implicitly associated with both verb categories might not be relevant, *per se*, in explaining the left-right compatibility found for the trading category. In fact, if this were the case, a left-right compatibility would also have been detected for the swapping category.

In the light of this argument, at least two alternative explanations can be offered for the current result.

One possibility is that the left-right compatibility reported for trading verbs might reflect the higher economic relevance of this linguistic category, compared to the swapping category. As already discussed in the introduction, a dense literature (Dehaene et al., 1990; Loetscher et al., 2008; Vicario, 2012; Holmes and Lourenco, 2013; Shaki and Fischer, 2013) has repeatedly demonstrated a left-to-right mapping for low and high numbers, which is reflected in the so-called SNARC effect. Accordingly, the current results can be explained by assuming that when Western participants read verbs (with high economic relevance) associated with a monetary *loss*, their cognitive activation moves "leftward", as when detecting small numbers and/or performing arithmetic subtraction; *vice versa*, reading verbs (with high economic relevance) associated with a monetary *gain* activates a mental rightward shift, as when detecting high numbers and performing arithmetic addition (Knops et al., 2009). According to this interpretation, our data can be taken as support for the mental accounting theory (Thaler, 1999), which states that people use cognitive operations for processing financial activities. In fact, the current result suggests that linguistic terms referring to economics are spatially mapped similarly to numbers. This implies that the left-right compatibility reported for trading verbs might reflect a SNARC-like effect for this linguistic material as well as for magnitude processing (Vicario and Martino, 2010).

Several arguments can be provided in support of this hypothesis. First, a common Intraparietal Sulcus (IPS) activation has been observed when participants performed calculation, linguistic and saccadic movement tasks (Sereno et al., 2001). In fact, this area has been identified by these authors as the neural correlate of the interplay between mental accounting and linguistic competence. Second, learning difficulties in mathematics (i.e., developmental dyscalculia) frequently co-occur with impairments in reading (i.e., developmental dyslexia). This co-morbidity could be related to the malfunctioning of the left angular gyrus, a brain area that has been found to be affected in patients with Gerstmann syndrome (Gerstmann, 1940), who show not only acalculia but also left–right disorientation. Third, patients with cortico-basal degeneration (CBD) can show a severe difficulty in understanding small numbers as well as quantifier terms (McMillan et al., 2006). The same group provided further support for this view with a neuroimaging study investigating quantifier comprehension in healthy adults (McMillan et al., 2005). Semantic theorists (e.g., Szymanik and Zajenkowski, 2010) make a general distinction between first-order quantifiers, which identify a number state (e.g., "some" or "at least 3"), and higher-order quantifiers, which are those not expressible in first-order logic (e.g., "most" or "every other"). McMillan et al. (2005) reported that first-order and higher-order quantifiers both recruit right inferior parietal cortex, suggesting that a number processing component contributes to quantifier comprehension. In fact, parietal activation has also been widely reported in subjects asked to perform simple number processing (Cohen et al., 2000; Kazui et al., 2000; Pinel et al., 2001; Simon et al., 2002) or arithmetic tasks (Menon et al., 2000; Knops et al., 2009; Krueger et al., 2011).

An alternative, no less important, interpretation of the current results might refer to the *body-specificity hypothesis* (Casasanto, 2009, 2011; Kominsky and Casasanto, 2013; Kong, 2013), which states that people conceptualize bad and good in terms of left-right spatial encoding, according to their handedness. For example, Casasanto (2009) showed that right-handers tend to associate rightward space with positive ideas and leftward space with negative ideas (this pattern was reversed in left-handers). In fact, while trading verbs might be conceptualized as carrying emotionally positive (i.e., incoming) or negative (i.e., going out) meanings, all swapping verbs might be interpreted positively (e.g., "donation" is easily interpreted as "positive", although it indicates a going out outcome).

The results reported for the swapping category provide some support for the spatial mapping of valence hypothesis. In fact, the going out verbs of the swapping category such as "to donate" (which were the most positive) show the largest advantage for the right hand, and the moderately positive incoming verbs like "to receive" show an advantage in the same direction. Moreover, the permutation test did not confirm a significant difference between going out and incoming swapping verbs when using the left hand. This might be explained by the fact that all our participants were right-handed. In fact, the spatial mapping of valence hypothesis predicts a performance advantage with the dominant hand. Therefore, according to this view, one could argue that the left-right compatibility reported in our study might be ruled by the "value" (i.e., positive vs. negative) of the presented verbs. The spatial mapping of valence hypothesis might represent a valid interpretation of the current results. However, this study does not provide definitive evidence in support of this interpretation since we did not test left-handed people.

Worthy of some discussion is the difference in the familiarity scores provided by our participants for the three verb categories. In fact, going out swapping verbs were perceived as more familiar than incoming swapping verbs. This difference in familiarity scores for the swapping category might have played some role in the detection of these verbs, although we do not believe that this factor explains the absence of a left-right compatibility for this linguistic category.

Our study bears some important limitations that might be addressed in future work. First, we did not collect any rating of the economic relevance subjectively associated with the verbs presented in this research. Second, trading and swapping verbs were administered in separate blocks. This might represent an issue since verbs within blocks might have interacted, for example through cueing. In fact, thinking verbs, which were used as a control condition, were detected faster in the swapping block than in the trading block. However, the permutation analysis suggests that the reported effects for the trading category are not related to the blocked design. Finally, we did not test RT performance in left-handed participants, because our research goal was to test the existence of a SNARC-like effect for linguistic items associated with the economic profit category (i.e., trading verbs).

Further investigations, including brain imaging and non-invasive brain stimulation methods as well as left-handed participants, are needed to clarify how the current results relate to the linguistic representation of economic outcomes.

#### **ACKNOWLEDGMENTS**

We would like to thank Miss Anica Newman for her help in the checking of English grammar and Dr. Andrea Pavan for the assistance with the permutation analysis.

#### **REFERENCES**


**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 27 September 2013; paper pending published: 29 October 2013; accepted: 10 January 2014; published online: 28 January 2014*.

*Citation: Vicario CM and Rumiati RI (2014) Left-right compatibility in the processing of trading verbs. Front. Behav. Neurosci. 8:16. doi: 10.3389/fnbeh.2014.00016*

*This article was submitted to the journal Frontiers in Behavioral Neuroscience*.

*Copyright © 2014 Vicario and Rumiati. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms*.

### Why open-access publication should be nonprofit—a view from the field of theoretical language science

#### *Martin Haspelmath\**

*Department of Linguistics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany \*Correspondence: haspelmath@eva.mpg.de*

*Edited by:*

*Leonid Perlovsky, Harvard University, USA*

Many of my fellow theoretical linguistics researchers have not noticed the momentous changes in the world of science publication yet. When confronted with the idea that publication costs should be covered by author fees ("author processing charges," or APCs), they often react with disbelief and indignation.

But the signs of inefficiency of the old subscription-based system are just as clear in my field as elsewhere, so I see no reasonable alternative to Gold open access (i.e., freely accessible electronic publications on the publisher's website). Green open access is inefficient because of the duplication of efforts, and subscription is inefficient because it is very difficult to predict for an institution to what extent its members will want to use a journal or book. Moreover, the subscription-based model is even worse for scholars with low budgets: While a low-budget scholar can at least read the richer scholars' works on the APC-based open access model, not even that is possible on the traditional model, and usually one can publish in prestigious places only if one knows the relevant literature.

But is APC-based publication of scientific results by profit-oriented companies (such as Macmillan Publishers, which owns Nature Publishing Group, the partner of Frontiers) a good alternative to subscription? Clearly, the old author-pays model removes a major inefficiency of the subscription-based system, because the authors know that they want to publish, whereas the subscribers only suspect that they want to use the publications. According to Stuart Shieber, an open-access expert and theoretical linguist at Harvard University, subscription-based publication can lead to market dysfunction (unreasonably high publication prices) because science journals are not competitive goods: If you subscribe to one science journal, this doesn't mean that you don't need another one (see Shieber, 2013). But from the author's perspective, Shieber says, they are competitive goods: You just need to publish in one journal, and you can choose the cheapest one.

Shieber's article is very sophisticated from an economics perspective, but it completely leaves aside a crucial component of scientific publication that I will argue leads to market dysfunction also with the APC-based open-access model: Scientific publications serve both to disseminate research results and to build careers of scientists. The success of a scientist (and of groups of scientists) is routinely measured by the place of publication of the work. When evaluating a scientist, the evaluators look not only at the amount of research output and the number of citations, but also at the place of publication. Moreover, when deciding what to cite, scientists routinely privilege papers published in more prestigious journals and books published in more prestigious imprints. Thus, to be a successful scientist, one needs to publish in the same places as other successful scientists. Journals and imprints therefore have a significance for science that goes far beyond the purpose of dissemination of research results. The latter can nowadays be achieved much more easily, by archives such as arXiv.org, or by publishing in one's personal blog, or on Academia.edu. The primary purpose of peer review is actually peer selection: One needs to make a special effort to present one's results in such a way that one's peers recognize their value. It is only in this way that one's research is likely to have an impact on others. Being selected for publication in a particular place (journal or book imprint) means being successful.

One could imagine alternative models of establishing scientific credentials, e.g., by a rating system similar to the one found in online bookshops, but discussing these is beyond the scope of this note. The big advantage of anonymous peer review and selection that I see for my own field is that it gives younger scholars the chance to become more widely visible even without traveling to many conferences. In the following, I assume that peer selection of publications will be the prevalent mode of establishing scientific credentials also in the future.

Now crucially, the association of place of publication with prestige means that the market for APC-based journals does NOT provide for competition after all: I cannot simply submit my paper to a cheaper journal if the cheaper journal has much less prestige and will lead to far fewer citations of the article. I will quite likely submit my paper to the best journal in my subfield even if this means that I will pay higher APCs (as long as my budget still allows it). Publishers will be able to price their journals according to their prestige, not according to their services. But in the 21st century, the prestige of a journal is primarily the result of the work of the scientists who publish in it, who serve as editors and as reviewers, and not the result of the publisher's efforts. If I publish an excellent piece of research in a journal, or if I write a careful review of a submitted manuscript, I thereby enhance the prestige of this journal, and I thereby contribute to making the journal more expensive for future submitters. The publishers will reap the benefits of my excellent and conscientious work, because they can charge more without improving their services. This situation is clearly undesirable for science.

Journal and book publication has become very simple and cheap as a result of technological developments: One just needs typesetting, hosting and web presentation, as well as perhaps some kind of print-on-demand service (for open-access books). This can be done very easily without major investments, and as a result, journal publication in the less wealthy countries has increased dramatically over the last 20 years. For example, the Brazilian platform Scielo.org hosts over 1000 journals that are freely accessible and do not charge any author fees.

Of course, even nowadays journal and book publication does not come for free, and somebody has to pay for it. But in order to have a functioning market with reasonable prices, one needs real competition. My research institution can replace its cleaning company by another one, or it can buy its computers and printers from different companies if we are dissatisfied with the services and products. But we cannot simply replace journals and imprints, because we use these to build our careers and to measure our success.

A functioning model would be one where the scientists own the journal titles and book imprints, and where they choose typesetters, web design companies, and hosting companies that can be easily replaced by others if the prices are not right. Just as basic science itself is not a profit-oriented activity, publication of scientific results would not be a profit-oriented activity. APCs could be charged by the nonprofit organizations of the scientists (universities, scientific libraries, scholarly associations), but these would not increase as a result of excellent and high-impact work being published by the journals and imprints. On the contrary, since universities and scholarly associations derive their prestige in part from their publications, it is to be expected that the best work will be published without any APCs: These nonprofit organizations would benefit from their prestigious journals and imprints, so it would make sense for them to subsidize them in much the same way as they are subsidizing nonprofit-oriented basic research itself.

The alternative model, where APCs are charged by profit-oriented publishers, has another serious drawback: It creates a strong incentive to create journals and book imprints that function like "vanity presses," allowing authors to publish their low-quality work without significant risk of rejection. Vanity presses have long existed in the regular book market, and they have not been a problem because no public money went into them. Of course, everyone should be free to publish their bad novels or low-quality scientific articles if they desire. However, when it comes to scientific publications, the idea is that the APCs are covered by grants for scientific research, i.e., mostly by public money that would otherwise go into science. In the traditional system, grant holders are free to publish the results of their research wherever they want, but there used to be a limited set of possibilities, and scientific vanity publishers hardly existed. Nowadays, however, grant agencies are increasingly trying to impose the restriction that the publication should be open access, and with the for-profit approach, there is an unlimited set of possibilities. Anyone can easily found a new journal and offer publication for APCs, simply claiming that it is peer-reviewed. For example, I recently heard of two Chinese companies that are publishing a large number of open-access journals, some of them in my field of linguistics: Wuhan-based SCIRP (http://www.scirp.org/, over 250 journals) and Beijing-based MDPI (http://www.mdpi.com/, over 120 journals). The business model here is to start a large number of new journals and to hope that some of them will succeed and bring profit. For example, MDPI's journal *Languages* does not even have an editor yet. This is of course reminiscent of the business model of spam e-mail, and in fact, some observers have warned of the danger of "predatory journals." In particular, Jeffrey Beall noted in a *Nature* column in 2012 that there are hundreds of journals with this business model, and he writes:

*The competition for author fees among fraudulent publishers is a serious threat to the future of science communication. To compete in a crowded market, legitimate open-access publishers are being forced to promise shorter submission-to-publication times; this weakens the peer-review process, which takes time to do properly. To tackle the problem, scholars must resist the temptation to publish quickly and easily... (Beall, 2012)*

But the problem with Beall's argumentation is that it is difficult to say in what sense the business model of "predatory" publishers is "fraudulent." They are just exploiting a new niche that has been created by the notion that authors should pay for publication by profit-oriented companies. Clearly, given the current system, where not only quality, but also quantity of publication counts, scholars have an incentive to publish "quickly and easily." Moral exhortations to "resist the temptation" will not make this problem go away.

In order to prevent scholars from publishing their work in less than fully respectable venues, science funders will have to set up a new control system that monitors journal publishers and that prevents grant holders from using grant money to publish in these journals. It is difficult to see how this can be done efficiently and without unduly restricting the freedom of scientists. In any event, it will cost money that would be saved if publication costs were carried by the publishers (universities, libraries, scholarly associations), rather than by the authors.

Another argument that Shieber (2013) cites against toll-access publication is that traditional publishers typically use price bundling, so that canceling individual journal subscriptions does not significantly reduce the costs of the libraries. But is this different in the for-profit open-access model? Not at all: Once open-access publication becomes the norm, for-profit publishers will introduce price bundling for APCs: If your institution enters into an agreement with the publisher, you will pay only EUR 500 for publishing your paper instead of the usual EUR 1000. There are already signs that this is happening: In January 2013, De Gruyter and the Max Planck Society came to an agreement about open-access publication of Max Planck books by De Gruyter (see http://www.mpdl.mpg.de/news/pressrel\_2013/PM\_deGruyter\_MPG\_de.pdf).

To summarize, the major argument for open access is that toll access is inefficient because there can be no functioning market (Shieber, 2013) and because it is difficult for subscribers to predict their needs. How should open-access publication be funded? One common funding option is by public funds, i.e., publication is funded in the same way in which science is funded. The other major funding option is by for-profit companies, on the basis of APCs. The major argument against for-profit companies is again that there can be no functioning market: Scientific publications not only serve to disseminate research findings, but they also build scientific prestige and reputation. Thus, they should be owned by scientists and their institutions, not by companies whose main purpose is to make money. If scientific work is published by for-profit companies, they make money from the reputation that is built up by publicly-funded scientific work. This means that scientific work should be published by nonprofit organizations—those very organizations that are engaged in doing science. This is in fact the traditional model of the 19th century, when it was primarily the scholarly societies and academies that published scientific works. It turns out that this is also the best model for the future.

#### **REFERENCES**

Beall, J. (2012). Predatory publishers are corrupting open access. *Nature* 489, 179. doi: 10.1038/489179a

Shieber, S. (2013). *Why Open Access is Better for Scholarly Societies.* Available online at: http://blogs.law.harvard.edu/pamphlet/2013/01/29/why-open-access-is-better-for-scholarly-societies/

*Received: 08 May 2013; accepted: 16 May 2013; published online: 06 June 2013.*

*Citation: Haspelmath M (2013) Why open-access publication should be nonprofit—a view from the field of theoretical language science. Front. Behav. Neurosci. 7:57. doi: 10.3389/fnbeh.2013.00057*

*Copyright © 2013 Haspelmath. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.*

### The importance of Open Access publishing in the field of Linguistics for spreading scholarly knowledge and preserving languages diversity in the era of the economic financial crisis

#### *Nicola L. Bragazzi \**

*Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics, Maternal and Child Health (DINOGMI), Section of Psychiatry, and School of Public Health, Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy \*Correspondence: robertobragazzi@gmail.com*

#### *Edited by:*

*Leonid Perlovsky, Harvard University and Air Force Research Laboratory, USA*

#### **A commentary on**

#### **Why open-access publication should be nonprofit-a view from the field of theoretical language science**

*by Haspelmath, M. (2013). Front. Behav. Neurosci. 7:57. doi: 10.3389/fnbeh.2013.00057*

I read with great interest the article by Haspelmath recently published in *Frontiers in Behavioral Neuroscience*, which discusses Open Access (OA) with particular emphasis on the field of theoretical linguistics (Haspelmath, 2013). However, I disagree with him on some points, which I would like to discuss here; I would also like to put forth further arguments and add an Italian perspective to the topic.

First, the author claims that OA has barely been noticed in the field of theoretical linguistics, at least in his own academic context. Moreover, he describes the "disbelief and indignation" with which his colleagues react when they hear that they should pay a fee to have their manuscripts published. However, I maintain that this picture is not entirely accurate for OA journals (OAJs) devoted to linguistics: if we search for linguistics-related OAJs in the Directory of Open Access Journals (DOAJ), the largest database of OAJs, we discover that only 11 out of 213 journals (approximately 5.2%) charge a publishing fee, while for 4 of them (about 1.9%) no information is available.

OA seems to be a vibrant and expanding reality, and in the field of linguistics it is strongly supported by initiatives such as the OA Initiative in Linguistics (OALI), coordinated by the German linguist Stefan Müller (Müller, 2012) and by Haspelmath himself.

In Italy, as in Germany, toll-based or subscription-based publishing has revealed its inefficiency: costs have inflated out of all proportion (Giunta, 2010). This kind of publishing is no longer sustainable or affordable. OA seems more cost-effective.

Another point I would like to put forward is that OA could provide an exceptional opportunity and a platform for preserving language diversity. Article Processing Charges (APCs) for OAJs are (completely or partially) waived for researchers coming from underdeveloped or developing countries, even though, as we have already seen, the majority of linguistics-related OAJs charge no publishing fees. OA is growing especially in underdeveloped countries and developing nations (Bayry, 2013), since for them it represents a great opportunity to make their voices heard. The service offered to scholars in these countries is undoubtedly of value: they can freely access scientific content, distribute their own work, and communicate with other scholars. Researchers from emerging countries could exploit the benefits of OA to share information about languages at risk of extinction in a cost-effective way, uploading audio resources as well and creating a great digital archive. UNESCO has already recognized the importance of the Internet and of other new media in bridging the linguistic divide, preserving multiculturalism and fostering "pluralistic, equitable, open and inclusive knowledge societies" (UNESCO).

In conclusion, the importance of OA in an era of severe financial and economic crisis should be acknowledged. The current crisis is imposing funding cuts and hiring and turnover freezes, and some countries such as Greece have lost access to prestigious journals (Trachana, 2013). OA can overcome the hurdles and barriers to the spread and dissemination of knowledge. OA could provide an excellent platform for delivering and sharing scholarly data and results, contributing at the same time to making cyberspace a more multilingual reality.

#### **REFERENCES**


*Received: 23 June 2013; accepted: 09 July 2013; published online: 02 August 2013.*

*Citation: Bragazzi NL (2013) The importance of Open Access publishing in the field of Linguistics for spreading scholarly knowledge and preserving languages diversity in the era of the economic financial crisis. Front. Behav. Neurosci. 7:91. doi: 10.3389/fnbeh.2013.00091*

*Copyright © 2013 Bragazzi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
