ADVANCES IN VIRTUAL AGENTS AND AFFECTIVE COMPUTING FOR THE UNDERSTANDING AND REMEDIATION OF SOCIAL COGNITIVE DISORDERS

EDITED BY: Eric Brunet-Gouet, Ali Oker, Jean-Claude Martin, Ouriel Grynszpan and Philip L. Jackson

PUBLISHED IN: Frontiers in Human Neuroscience

#### *Frontiers Copyright Statement*

*© Copyright 2007-2016 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-787-3 DOI 10.3389/978-2-88919-787-3

## **ADVANCES IN VIRTUAL AGENTS AND AFFECTIVE COMPUTING FOR THE UNDERSTANDING AND REMEDIATION OF SOCIAL COGNITIVE DISORDERS**

#### Topic Editors:

**Eric Brunet-Gouet,** HandiResp, EA4047, Université de Versailles Saint-Quentin, Université Paris-Saclay, Pôle de Psychiatrie, Centre Hospitalier de Versailles, France

**Ali Oker,** HandiResp, EA4047, Université de Versailles Saint-Quentin, Université Paris-Saclay, Pôle de Psychiatrie, Centre Hospitalier de Versailles, France

**Jean-Claude Martin,** Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI-CNRS), UPR3251, Université Paris Sud, France

**Ouriel Grynszpan,** Institut des Systèmes Intelligents et de Robotique, Université Pierre et Marie Curie, France

**Philip L. Jackson,** Université Laval, Canada

Ekman and Friesen's emotional face seen through the eyes of an ill-parameterized OpenCV contour extraction algorithm (source picture: Unmasking the Face: A Guide to Recognizing Emotions From Facial Expressions, Ekman & Friesen, Malor Books, September 1, 2003)

Advances in modern science arise from discoveries within fields as well as from the confrontation of concepts and methods from separate, sometimes distant, domains of knowledge. For instance, psychology and psychopathology have benefited from the accumulated contributions of cognitive neuroscience, which, in turn, has received insights from molecular chemistry, cellular biology, physics (neuroimaging), statistics and computer science (data processing), etc. From the results of this research, one can argue that among the numerous cognitive phenomena supposedly involved in the emergence of human intelligence and organized behavior, some are specific to the social nature of our phylogenetic order. Scientific reductionism has made it possible to divide the social cognitive system into several components, such as emotion processing and regulation, mental state inference (theory of mind), and agency. New paradigms were progressively designed to investigate these processes within highly controlled laboratory settings. Moreover, the related constructs have helped to better understand psychopathological conditions such as autism and schizophrenia, and show partial relationships with illness outcomes.

Here, we would like to outline the parallel development of concepts in social neuroscience and in other domains such as computer science, affective computing, virtual reality development, and even hardware technologies. While several researchers in neuroscience have pointed out the necessity to consider naturalistic social cognition (Zaki and Ochsner, Ann N Y Acad Sci 1167, 16-30, 2009), the second person perspective (Schilbach et al., Behav Brain Sci 36(4), 393-414, 2013) and reciprocity (de Bruin et al., Front Hum Neurosci 6, 151, 2012), developments in both hardware and software have allowed increasingly realistic real-time models of our environment and of virtual humans capable of some interaction with users. As noted at the very beginning of this editorial, a new convergence between scientific disciplines might occur, and its outcomes in terms of new concepts, methods and uses are difficult to predict.

Although this convergence is motivated by the intuition that it fits ongoing societal changes well (increasing social demands on computer technologies, increasing funding), it comes with several difficulties, for which the current Frontiers Research Topic strives to bring some positive answers and to provide both theoretical arguments and experimental examples. The first issue concerns concepts and vocabulary, as the contributions described in the following are authored by neuroscientists, computer scientists, psychopathologists, etc. Special attention was given during the reviewing process to staying as close as possible to the publication standards in psychological and health sciences, and to avoiding purely technical descriptions. The second problem concerns methods: more complex computerized interaction models result in unpredictable and poorly controlled experiments. In other words, the assets of naturalistic paradigms may be offset by the difficulty of matching results between subjects, populations, and conditions. Of course, this practical question is extremely important for investigating pathologies that are associated with profoundly divergent behavioral patterns. Some of the contributions to this topic describe strategies that made it possible to solve these difficulties, at least partially.

The last issue concerns the heterogeneity of the objectives of the research presented here. While selection criteria focused on the use of innovative technologies to assess or improve social cognition, the fields of application of this approach were quite unexpected. In an attempt to organize the contributions, three directions of research can be identified: 1) how might innovation in methods improve the understanding and assessment of social cognition disorders or pathology? 2) within the framework of cognitive behavioral psychotherapies (CBT), how should we consider the use of virtual reality or augmented reality? 3) what are the benefits of these techniques for investigating severe mental disorders (schizophrenia or autism) and performing cognitive training?

The first challenging question is insightfully raised in the contribution of Timmermans and Schilbach (2014), who give orientations for investigating alterations of social interaction in psychiatric disorders by the use of dual interactive eye tracking with virtual anthropomorphic avatars. Joyal, Jacob and collaborators (2014) report the concurrent and construct validity of a newly developed set of virtual faces expressing six fundamental emotions. The relevance of virtual reality was exemplified with two contributions focusing on anxiety-related phenomena. Jackson et al. (2015) describe a new environment for investigating empathy for dynamic FACS-coded facial expressions including pain. Based on a systematic investigation of the impact of social stimuli modalities (visual, auditory), Ruch and collaborators (2014) are able to characterize the specificity of the interpretation of laughter in people with gelotophobia. On the issue of social anxiety, Aymerich-Franch et al. (2014) presented two studies in which public speaking anxiety was correlated with the similarity between avatars and participants' self-representations.

The second issue focuses on how advances in virtual reality may benefit cognitive and behavioral therapies in psychiatry. These interventions share a common framework that articulates thoughts, feelings or emotions, and behaviors, and proposes gradual modification of each of these levels through thought and schema analysis, stress reduction procedures, etc. They have been observed to be somewhat useful for the treatment of depression, stress disorders, and phobias, and are gaining some authority in personality disorders and addictions. The main asset of new technologies is the possibility to control the characteristics of symptom-eliciting stimuli/situations, and more precisely the degree to which immersion is enforced. For example, Baus and Bouchard (2014) provide a review on the extension of virtual reality exposure-based therapy toward recently described augmented reality exposure-based therapy in individuals with phobias. Concerning substance dependence disorders, Hone-Blanchet and collaborators (2014) present another review on how virtual reality can be an asset for both therapy and craving assessment, stressing the possibilities to simulate social interactions associated with drug-seeking behaviors and even peer pressure to consume.

The last issue this Frontiers Research Topic deals with encompasses the questions raised by social cognitive training or remediation in severe and chronic mental disorders (autistic disorders, schizophrenia). Here, therapies are based on drill and practice or strategy shaping procedures, and, most of the time, rely on errorless learning through repeated cognitive challenges. Computerized methods were proposed early on because they deliver repetitive stimulation effortlessly and at limited cost. While repetition used to be incompatible with realism in the social cognitive domain, recent advances provide both immersion and full control over stimuli. Georgescu et al. (2014) exhaustively review the use of virtual characters to assess and train non-verbal communication in high-functioning autism (HFA). Grynszpan and Nadel (2015) present an original eye-tracking method to reveal the link between gaze patterns and pragmatic abilities, again in HFA. Regarding schizophrenia, Oker and collaborators (2015) discuss and report some insights on how affective and reactive virtual agents might be useful to assess and remediate several deficits associated with social cognitive disorders. Regarding assessment with virtual avatars in schizophrenia, Park et al. (2014) focused on the effect of perceived intimacy on social decision making in patients with schizophrenia. Regarding schizophrenia remediation, Peyroux and Franck (2014) presented a new method named RC2S, a cognitive remediation program to improve social cognition in schizophrenia and related disorders.

To conclude briefly, while it is largely acknowledged that social interaction can be studied as a topic of its own, all the contributions demonstrate the added value of expressive virtual agents and affective computing techniques for experimentation. It also appears that the use of virtual reality is at the very beginning of a new scientific endeavor in cognitive sciences and medicine.

**Citation:** Brunet-Gouet, E., Oker, A., Martin, J-C., Grynszpan, O., Jackson, P. L., eds. (2016). Advances in Virtual Agents and Affective Computing for the Understanding and Remediation of Social Cognitive Disorders. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-787-3

# Table of Contents


*Investigating alterations of social interaction in psychiatric disorders with dual interactive eye tracking and virtual faces*

Bert Timmermans and Leonhard Schilbach

*Virtual faces expressing emotions: An initial concomitant and construct validity study*

Christian C. Joyal, Laurence Jacob, Marie-Hélène Cigna, Jean-Pierre Guay and Patrice Renaud

Willibald F. Ruch, Tracey Platt, Jennifer Hofmann, Radosław Niewiadomski, Jérôme Urbain, Maurizio Mancini and Stéphane Dupont

Antoine Hone-Blanchet, Tobias Wensing and Shirley Fecteau

*The use of virtual characters to assess and train non-verbal communication in high-functioning autism*

Alexandra Livia Georgescu, Bojana Kuzmanovic, Daniel Roth, Gary Bente and Kai Vogeley

Ali Oker, Elise Prigent, Matthieu Courgeon, Victoria Eyharabide, Mathieu Urbach, Nadine Bazin, Michel-Ange Amorim, Christine Passerieux, Jean-Claude Martin and Eric Brunet-Gouet

*Effect of perceived intimacy on social decision-making in patients with schizophrenia*

Sunyoung Park, Jung Eun Shin, Kiwan Han, Yu-Bin Shin and Jae-Jin Kim

*RC2S: A cognitive remediation program to improve social cognition in schizophrenia and related disorders*

Elodie Peyroux and Nicolas Franck

# Editorial: Advances in Virtual Agents and Affective Computing for the Understanding and Remediation of Social Cognitive Disorders

Eric Brunet-Gouet<sup>1,2</sup>\*, Ali Oker<sup>1,2</sup>, Jean-Claude Martin<sup>3</sup>, Ouriel Grynszpan<sup>4</sup> and Philip L. Jackson<sup>5</sup>

<sup>1</sup> Faculté de Médecine, Université Versailles Saint-Quentin, Versailles, France, <sup>2</sup> Pôle de Psychiatrie, Centre Hospitalier de Versailles, Le Chesnay, France, <sup>3</sup> Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur-Centre National de la Recherche Scientifique, Orsay, France, <sup>4</sup> Institut des Systèmes Intelligents et de Robotique, Centre National de la Recherche Scientifique, Université Pierre et Marie Curie, Paris, France, <sup>5</sup> École de Psychologie, Université Laval, Québec, QC, Canada

Keywords: virtual reality, social cognition, affective computing, psychotherapy, social cognition in clinical groups

**The Editorial on the Research Topic**

#### **Advances in Virtual Agents and Affective Computing for the Understanding and Remediation of Social Cognitive Disorders**

The present topic emphasizes the parallel development of concepts in social neuroscience and in other domains such as computer science, affective computing, virtual reality development, and even hardware technologies. While several researchers in neuroscience have pointed out the necessity to consider naturalistic social cognition (Zaki and Ochsner, 2009), the second person perspective (Schilbach et al., 2013), social interaction (Pfeiffer et al.) and reciprocity (de Bruin et al.), developments in both hardware and software have allowed increasingly realistic real-time models of our environment and of virtual humans capable of some interaction with users. A new convergence between scientific disciplines might occur, and its outcomes in terms of new concepts, methods, and uses are difficult to predict.

Although this convergence meets ongoing societal changes (increasing social demands on computer technologies, increasing funding), it comes with several difficulties, for which the current Frontiers Research Topic strives to bring some positive answers and to provide both theoretical arguments and experimental examples. The first problem concerned concepts and vocabulary, as the contributions described in the following were authored by neuroscientists, computer scientists, and psychopathologists, each coming with separate knowledge and key literature. Special attention was given to avoiding purely technical descriptions and to focusing on the added value of virtual reality in neuroscience and psychopathology. Another problem concerned methods: more complex computerized interaction models result in unpredictable and poorly controlled experiments. In other words, the assets of naturalistic paradigms may be offset by the difficulty of matching results between subjects, populations, and conditions. It is crucial to consider this question when investigating pathologies that are associated with profoundly divergent behavioral patterns. The last issue we encountered concerned the heterogeneity of the objectives of the research presented here. While selection criteria focused on the use of innovative technologies to assess or improve social cognition, the fields of application of this approach were quite unexpected.

Edited and reviewed by: Srikantan S. Nagarajan, University of California, San Francisco, USA

> \*Correspondence: Eric Brunet-Gouet ebrunet@ch-versailles.fr

Received: 01 September 2015 Accepted: 11 December 2015 Published: 07 January 2016

#### Citation:

Brunet-Gouet E, Oker A, Martin J-C, Grynszpan O and Jackson PL (2016) Editorial: Advances in Virtual Agents and Affective Computing for the Understanding and Remediation of Social Cognitive Disorders. Front. Hum. Neurosci. 9:697. doi: 10.3389/fnhum.2015.00697

The first group of contributions exemplifies how innovation in methods improves the understanding and assessment of social cognition disorders or pathology. Timmermans and Schilbach provide technological orientations for investigating alterations of social interaction in psychiatric disorders by the use of dual interactive eye tracking with virtual anthropomorphic avatars. Joyal et al. report the concurrent and construct validity of a newly developed set of virtual faces expressing six fundamental emotions. The relevance of virtual reality was shown with two contributions focusing on anxiety-related phenomena. Jackson et al. describe a new environment for investigating empathy for dynamic FACS-coded facial expressions including pain. Based on a systematic investigation of the impact of social stimuli modalities (visual, auditory), Ruch and collaborators are able to characterize the specificity of the interpretation of laughter in people with gelotophobia (Ruch et al.). On the related issue of social anxiety, Aymerich-Franch et al. presented two studies in which public speaking anxiety was correlated with the similarity between avatars and participants' self-representations. Finally, three contributions demonstrate the feasibility and the usefulness of virtual reality settings in chronic developmental and psychiatric conditions like high-functioning autism (HFA) or schizophrenia. Grynszpan and Nadel present an original eye-tracking method to reveal the link between gaze patterns and pragmatic abilities in HFA. Regarding schizophrenia, Oker et al. discuss and report some insights on how affective and reactive virtual agents might be useful to assess and remediate several deficits associated with social cognitive disorders. Regarding assessment with virtual avatars in schizophrenia, Park et al. focused on the effect of perceived intimacy on social decision making in patients with schizophrenia.

The second set of contributions focuses on therapeutic intervention within either a cognitive behavioral therapy (CBT) framework or cognitive remediation/training procedures. CBT interventions share common principles and take into consideration thoughts, feelings or emotions, and behaviors in order to generate gradual modification of each of these levels through thought and schema analysis, stress reduction procedures, etc. They have been observed to be somewhat useful for the treatment of depression, stress disorders, and phobias, and are gaining some authority in personality disorders and addictions. The main asset of new technologies is the possibility to control the characteristics of symptom-eliciting stimuli/situations, and more precisely the degree to which immersion is enforced. For example, Baus and Bouchard provide a review on the extension of virtual reality exposure-based therapy toward recently described augmented reality exposure-based therapy in individuals with phobias. Concerning substance dependence disorders, Hone-Blanchet et al. present another review on how virtual reality can be an asset for both therapy and craving assessment, stressing the possibilities to simulate social interactions associated with drug-seeking behaviors and even peer pressure to consume.

The last contributions of the topic concern social cognitive training or remediation in severe and chronic mental disorders (autistic disorders, schizophrenia). Here, therapies are based on drill and practice or strategy shaping procedures, and, most of the time, rely on errorless learning through repeated cognitive challenges. Computerized methods were proposed early on because they deliver repetitive stimulation effortlessly and at limited cost. While repetition used to be incompatible with realism in the social cognitive domain, recent advances provide both immersion and full control over stimuli. Georgescu et al. exhaustively review the use of virtual characters to assess and train non-verbal communication in HFA. Regarding schizophrenia remediation, Peyroux and Franck presented a new method, named RC2S, which is a cognitive remediation program to improve social cognition in schizophrenia and related disorders.

To conclude briefly, while it is largely acknowledged that social interaction can be studied as a topic of its own, all the contributions demonstrate the added value of expressive virtual agents and affective computing techniques for experimentation. It also appears that the use of virtual reality is at the very beginning of a new scientific endeavor in cognitive sciences and medicine.

### FUNDING

Authors EB, AO, JM received funding from ANR ANR-11-EMCO-0007.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Brunet-Gouet, Oker, Martin, Grynszpan and Jackson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

OPINION ARTICLE published: 23 September 2014 doi: 10.3389/fnhum.2014.00758

## Investigating alterations of social interaction in psychiatric disorders with dual interactive eye tracking and virtual faces

#### **Bert Timmermans <sup>1</sup>\* and Leonhard Schilbach<sup>2</sup>**

<sup>1</sup> Social Interaction and Consciousness Lab (SINC), School of Psychology, University of Aberdeen, Aberdeen, UK

<sup>2</sup> Neuroimaging Group, Psychiatry and Psychotherapy Clinic, University Hospital of Cologne, Cologne, Germany

\*Correspondence: bert.timmermans@abdn.ac.uk

#### **Edited by:**

Ouriel Grynszpan, Université Pierre et Marie Curie, France

#### **Reviewed by:**

Daniel Belyusar, Massachusetts Institute of Technology, USA

Gnanathushran Rajendran, Heriot-Watt University, UK

**Keywords: eye tracking, social interaction, anthropomorphic avatars, autism, schizophrenia**

#### **PSYCHIATRIC DISORDERS AS DISORDERS OF SOCIAL INTERACTION**

Impairments of social interaction and communication are an important, if not essential, component of many psychiatric disorders. In the context of psychopathology, one tends to think predominantly of autism spectrum disorders. However, many psychopathologies are to some degree characterized by alterations or impairments of interpersonal functioning in the DSM-5 (American Psychiatric Association, 2013), for instance schizophrenia [even auditory hallucinations have been linked to social cognition (Bell, 2013)], or personality disorders such as borderline personality disorder (Wright et al., 2013). For different pathologies, the difficulties in social interaction may originate in different impairments; for instance, in schizophrenia they may be related to a deficit in context processing (Cohen et al., 1999). Still, irrespective of the specific place that social interaction impairments take within different etiologies, it is clear that the systematic study of interaction patterns could teach us a lot about how they manifest themselves in patients, how healthy people with whom the patients interact engage with these patterns, and how they relate to underlying neurobiology. Here, we argue why this should and how this could be accomplished.

One important aspect of social interaction that is increasingly shown to be impaired in psychiatric disorders is the recognition and production of gaze behavior, often related to disorder-specific attentional bias (Armstrong and Olatunji, 2012). Schizophrenia has been associated with gaze-related attention deficits (Tso et al., 2012; Dalmaso et al., 2013). A recent study shows that patients with schizophrenia can be distinguished from neurotypical controls with astonishing accuracy on the basis of abnormal eye-tracking patterns on simple tasks such as fixation and smooth pursuit (Benson et al., 2012). Depression and bipolar disorder have been associated with prefrontal and cerebellar disturbances of oculomotor control during episodes of major depression, problems with antisaccade tasks (production of saccades away from a cue), and delayed initiation of saccades made on command (Sweeney et al., 1998). Finally, it is well known that people with autism orient to different kinds of contingencies (Gergely, 2001; Klin et al., 2009).

However, most of the experimental paradigms used to establish gaze anomalies are essentially non-interactive and focus on how particular clinical populations differentially perceive stimuli or social scenes, passively. Likewise, the study of social cognition has only in the last decade begun to incorporate social interaction into its explanation of how we come to understand others and how we manage to navigate a complex social world (Schilbach et al., 2013). This "interactive turn" marks a departure from more traditional approaches, which have emphasized the importance of being able to think about the mental states of others. We have argued that the core problem with social interaction in clinical populations may not only lie in passive perception of social cues, but rather in a skewed experience of how one's own actions influence the social world and in patients' abilities to automatically and rapidly generate behavioral adjustments in response to social stimuli (Schilbach et al., 2012, 2013). Recently, methodological advances have allowed for the study of real-time dynamic social coordination in, for instance, children with autism (Fitzpatrick et al., 2013). While such full-body social interaction is undoubtedly rich, the problem is precisely that it is so rich, which makes it very difficult to operationalize in a way that quantifies aspects of interpersonal coordination. Furthermore, fully interactive approaches become problematic if one wants to use neuroimaging techniques that can access deeper brain structures, such as fMRI, to investigate the neural correlates of interpersonal coordination in on-going social interactions. Indeed, we have shown that the experience of self-initiated (gaze-based) contingencies is linked to activity in the brain's reward system, notably the ventral striatum [Pfeiffer et al. (2014); **Figure 1A**], and it has been suggested that, for instance, individuals with autism may have difficulties with precisely those rewarding aspects of social interaction (Schmitz et al., 2008; Kohls et al., 2012).

#### **GAZE AND AVATARS TO STUDY REAL-TIME SOCIAL INTERACTION**

**Figure 1. … anthropomorphic avatars**. **(A)** Interactive eye-tracking setup operationalized for fMRI: a virtual character is shown on screen and can be made "responsive" to the participant's looking behavior by means of an algorithm-based, real-time analysis of the eye-tracking data obtained from the study participant. **(B)** Schematic setup: two eye-tracking devices are linked via a local area network (LAN), which makes it possible to simultaneously measure two study participants engaged in a … character for the respective other). **(C)** Two participants engaged in a two-person perceptual decision-making task, in which both are asked to discriminate stimuli while the gaze behavior of the respective other participant is visualized on the stimulus screen as well. Importantly, people do not just see where the other is looking via a cursor or similar, but actually experience the other's gaze, to which they can dynamically adapt.

Interactive and even dual interactive eye tracking have been around for a couple of years (Richardson and Dale, 2005; Sangin et al., 2008; Carletta et al., 2010; Neider et al., 2010). Interactive eye tracking is a method whereby a person's eye gaze is tracked and fed back into the on-going experiment, not so much as a behavioral response akin to a button press but rather as a way of making the course of the trial or experiment in some way contingent upon the person's gaze. In dual interactive eye tracking, the gaze of two participants is simultaneously tracked and not only fed into their own experimental course, but also into that of the other person. Due to different experimental questions, all dual eye-tracking setups have either simply collected joint gaze data (non-interactive), or used them to display for one person where the other was looking or reading, by means of a pointer or a little rectangle. While we do not deny the merits of these methods, in social interaction one does not see where others are looking via a rectangle overlaid on a scene (though probably with Google Glass this is not so far away). Instead, what is minimally needed to emulate social interaction is the *visibility of one person's social cue to the other*. One logical option is to have people watch live videos of one another (Redcay et al., 2010), but the disadvantage of this is that facial features provide massive social cues that are not always controllable, and that live videos only allow for delaying the video or playing back an unrelated recorded sequence, but do not allow a systematic manipulation of interaction contingencies.
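To make the notion of a gaze-contingent trial more concrete, the following is a minimal sketch of one possible trigger rule, assuming a stream of timestamped gaze samples and a rectangular area of interest (AOI) on the screen. The names `GazeSample`, `in_aoi` and `first_dwell`, the rectangular AOI, and the 500 ms dwell threshold are all hypothetical choices made for illustration; they are not taken from the setups cited above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class GazeSample:
    t_ms: float  # timestamp in milliseconds
    x: float     # horizontal screen coordinate (pixels)
    y: float     # vertical screen coordinate (pixels)


# Hypothetical AOI format: (left, top, right, bottom) in screen pixels.
Rect = Tuple[float, float, float, float]


def in_aoi(sample: GazeSample, aoi: Rect) -> bool:
    """True if the gaze sample falls inside the rectangular AOI."""
    left, top, right, bottom = aoi
    return left <= sample.x <= right and top <= sample.y <= bottom


def first_dwell(samples: Iterable[GazeSample], aoi: Rect,
                min_dwell_ms: float = 500.0) -> Optional[float]:
    """Return the timestamp at which gaze has stayed inside the AOI
    continuously for at least min_dwell_ms, or None if that never happens.
    The experiment could use this moment to trigger the next event."""
    entered_at: Optional[float] = None
    for s in samples:
        if in_aoi(s, aoi):
            if entered_at is None:
                entered_at = s.t_ms  # gaze has just entered the AOI
            if s.t_ms - entered_at >= min_dwell_ms:
                return s.t_ms
        else:
            entered_at = None  # leaving the AOI resets the dwell timer
    return None
```

In such a sketch, the experiment's next event (for example, a change in what is displayed) would be scheduled at the returned timestamp rather than at a fixed time, which is what makes the trial contingent on the participant's looking behavior rather than on a button press.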

In order to combine both the experimental controllability of depicting the other's gaze via an on-screen stimulus and the social aspect of perceived gaze, we developed a setup in which a person's eye gaze either influences an avatar's gaze behavior [simple interactive; (Wilms et al., 2010; Pfeiffer et al., 2011); **Figure 1A**], or is displayed onto the eyes of the avatar on another person's screen and vice versa [dual interactive; (Barišic et al., 2013); **Figure 1B**]. It has been shown that virtual avatars can robustly elicit social effects comparable to those elicited by real faces, for instance, social inhibition and facilitation, interpersonal distance regulation and social presence, empathy, and pro-social behavior (Bailenson et al., 2003; Hoyt et al., 2003; Bente et al., 2007; Gillath et al., 2008; Slater et al., 2009). Therefore, using anthropomorphic virtual characters and making them interactive provides an excellent compromise between ecological validity and experimental control. Using the dual eye-tracking setup, in particular, one can generate two-person tasks, during which an integration of the interaction partner's gaze behavior may (or may not) become relevant for task performance and measures of subjective experience (**Figure 1C**).
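The simple interactive case can be illustrated with a gaze-contingent update rule. The sketch below is illustrative only: the region names, the `follow_probability` parameter, and the functions `avatar_response` and `run_block` are assumptions made for this example and do not reproduce the algorithm of the cited studies.

```python
import random

# Hypothetical screen regions that the participant or the avatar can look at.
REGIONS = ("left_object", "right_object")


def avatar_response(participant_target, follow_probability, rng):
    """Return the region the avatar looks at after the participant fixates one.

    With probability `follow_probability` the avatar follows the participant's
    gaze (joint attention); otherwise it looks at the other region.
    """
    if participant_target not in REGIONS:
        raise ValueError(f"unknown region: {participant_target}")
    if rng.random() < follow_probability:
        return participant_target
    return next(r for r in REGIONS if r != participant_target)


def run_block(fixations, follow_probability, seed=0):
    """Simulate one block and return the proportion of gaze-following trials."""
    rng = random.Random(seed)
    followed = sum(
        avatar_response(target, follow_probability, rng) == target
        for target in fixations
    )
    return followed / len(fixations)
```

Varying `follow_probability` between blocks is one way such a setup could manipulate interaction contingencies systematically, which live video does not permit.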

#### **EMPIRICAL QUESTIONS AND PATHOLOGIES**

We see four ways in which interactive and dual setups as described above can be useful for psychiatry. First, a simple interactive eye-tracking setup, which allows for control of the algorithm by which the avatar behaves in response to the person's gaze, could be used for diagnosis, much like the setup used by Benson and colleagues (Benson et al., 2012), which had people perform three simple tasks (smooth pursuit, fixation stability, and free viewing), but along a more social dimension: it would tell us to what degree persons are sensitive to action contingencies, or the disruption thereof.

Second, dual interactive setups would allow us to start looking at whether and how particular psychopathologies are associated with skewed interaction patterns. Indeed, the major advantage of a dual interactive setup is that it allows for a precise quantification of the gaze-interaction dynamics, using non-linear methods such as cross-recurrence quantification analysis. Such quantified interaction dynamics have been shown to correlate with person perception (Miles et al., 2009) and social motives (Lumsden et al., 2012), and have shown a deficit in simultaneous movement synchronization in children with Autism Spectrum Disorder (Fitzpatrick et al., 2013). Thus, it would be possible to tease apart the degree to which patients (a) elicit gaze patterns that differ from controls (and entrain controls), (b) are differentially sensitive to controls' gaze patterns, (c) are differentially sensitive to how their gaze impacts a control person's and vice versa, and (d) are differentially sensitive to the communicative signals that certain variance in the other's gaze or in the dyadic gaze patterns entails. Establishing such measures would lend itself to neuroimaging purposes, which could investigate the neural correlates of social interaction dynamics in one or both brains of the interaction partners. Also, the fact that the setup is virtual means that it is possible to manipulate this virtual environment in such a way that interactors perceive different scenes and one can study the degree to which communication breaks down in certain cases.
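One simple measure in the spirit of cross-recurrence quantification analysis is the diagonal cross-recurrence profile of two categorical gaze series. The following is a minimal sketch, assuming that each participant's gaze has already been coded as one region label per time sample; the function name and the coding scheme are hypothetical and do not represent a full CRQA pipeline.

```python
def cross_recurrence_profile(gaze_a, gaze_b, max_lag):
    """Diagonal-wise cross-recurrence rate for two categorical gaze series.

    gaze_a and gaze_b hold one region label per time sample. For each lag in
    [-max_lag, max_lag], the function returns the proportion of samples at
    which person B looks at the region that A looked at `lag` samples earlier
    (positive lag: B follows A; negative lag: B leads A).
    """
    if len(gaze_a) != len(gaze_b):
        raise ValueError("series must have the same length")
    n = len(gaze_a)
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        # Pair each sample of A with the sample of B shifted by `lag`.
        pairs = [
            (gaze_a[t], gaze_b[t + lag])
            for t in range(n)
            if 0 <= t + lag < n
        ]
        matches = sum(a == b for a, b in pairs)
        profile[lag] = matches / len(pairs) if pairs else 0.0
    return profile
```

A peak of the profile at a positive lag would suggest that B tends to follow A's gaze, and a peak at a negative lag that B tends to lead; comparing such profiles between patient-control and control-control dyads is one conceivable use.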

Third, a dual setup would allow for quantification not simply of a clinically significant aberration in gaze pattern, but rather of the diagnostic intuition: what does a clinician do and how does the patient have to react in order to be diagnosed as belonging to a particular clinical group?

Finally, a dynamically interactive setup could be implemented therapeutically. For instance, the currently existing VIGART system [virtual interactive system with gaze-sensitive adaptive-response technology, (Lahiri et al., 2011a,b)] has participants interact with a virtual avatar while their gaze is monitored in real-time. Following the interaction, participants receive feedback about their gaze behavior, which helps adolescents with Autism Spectrum Disorder improve their eye gaze patterns. A fully dual interactive eye-tracking setup would allow such feedback in real time via calculated indices not just of gaze behavior but of gaze contingencies.

Thus, just as from a research point of view dual setups will allow us to study social cognition in a truly social setting, such setups, particularly when implemented with eye tracking and virtual avatars, would allow us to look at psychopathology in terms of the clinical symptoms being embedded (and perhaps reinforced) by the social environment, as both try and engage in a social interaction for which each has a different "sketchbook." Indeed, persons with autism often report problems in interaction with non-autistic persons, but not so much with other persons with autism. Such questions can only be addressed in interactive setups, whereby the use of virtual avatars provides many advantages.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 April 2014; accepted: 08 September 2014; published online: 23 September 2014.*

*Citation: Timmermans B and Schilbach L (2014) Investigating alterations of social interaction in psychiatric disorders with dual interactive eye tracking and virtual faces. Front. Hum. Neurosci. 8:758. doi: 10.3389/fnhum.2014.00758*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Timmermans and Schilbach. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# HUMAN NEUROSCIENCE

## Virtual faces expressing emotions: an initial concomitant and construct validity study

#### **Christian C. Joyal<sup>1,2</sup>\*<sup>†</sup>, Laurence Jacob<sup>1†</sup>, Marie-Hélène Cigna<sup>3</sup>, Jean-Pierre Guay<sup>2,3</sup> and Patrice Renaud<sup>2,4</sup>**

<sup>1</sup> Department of Psychology, University of Quebec at Trois-Rivières, Trois-Rivières, QC, Canada

<sup>2</sup> Research Center, Philippe-Pinel Institute of Montreal, Montreal, QC, Canada

<sup>3</sup> Department of Criminology, University of Montreal, Montreal, QC, Canada

<sup>4</sup> Department of Psychology, University of Quebec in Outaouais, Gatineau, QC, Canada

#### **Edited by:**

Ali Oker, Université de Versailles, France

#### **Reviewed by:**

Ouriel Grynszpan, Université Pierre et Marie Curie, France

Jean-Claude Martin, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI-CNRS), France

#### **\*Correspondence:**

Christian C. Joyal, University of Quebec at Trois-Rivières, 3351 Boul. Des Forges, C.P. 500, Trois-Rivières, QC G9A 5H7, Canada

e-mail: christian.joyal@uqtr.ca

†Christian C. Joyal and Laurence Jacob have contributed equally to this work.

**Background:** Facial expressions of emotions represent classic stimuli for the study of social cognition. Developing virtual dynamic facial expressions of emotions, however, would open up possibilities, both for fundamental and clinical research. For instance, virtual faces allow real-time Human–Computer retroactions between physiological measures and the virtual agent.

**Objectives:** The goal of this study was to initially assess concomitants and construct validity of a newly developed set of virtual faces expressing six fundamental emotions (happiness, surprise, anger, sadness, fear, and disgust). Recognition rates, facial electromyography (zygomatic major and corrugator supercilii muscles), and regional gaze fixation latencies (eyes and mouth regions) were compared in 41 adult volunteers (20 ♂, 21 ♀) during the presentation of video clips depicting real vs. virtual adults expressing emotions.

**Results:** Emotions expressed by each set of stimuli were similarly recognized, both by men and women. Accordingly, both sets of stimuli elicited similar activation of facial muscles and similar ocular fixation times in eye regions in male and female participants.

**Conclusion:** Further validation studies can be performed with these virtual faces among clinical populations known to present social cognition difficulties. Brain–Computer Interface studies with feedback–feedforward interactions based on facial emotion expressions can also be conducted with these stimuli.

**Keywords: virtual, facial, expressions, emotions, validation**

#### **INTRODUCTION**

Recognizing emotions expressed non-verbally by others is crucial for harmonious interpersonal exchanges. A common approach to assess this capacity is the evaluation of facial expressions. Presentations of photographs of real faces allowed the classic discovery that humans are generally able to correctly perceive six fundamental emotions (happiness, surprise, fear, sadness, anger, and disgust) experienced by others from their facial expressions (Ekman and Oster, 1979). These stimuli also helped document social cognition impairment in neuropsychiatric disorders such as autism (e.g., Dapretto et al., 2006), schizophrenia (e.g., Kohler et al., 2010), and psychopathy (Deeley et al., 2006). Given their utility, a growing number of sets of facial stimuli were developed during the past decade, including the Montreal Set of Facial Displays of Emotion (Beaupré and Hess, 2005), the Karolinska Directed Emotional Faces (Goeleven et al., 2008), the NimStim set of facial expressions (Tottenham et al., 2009), the UC Davis set of emotion expressions (Tracy et al., 2009), the Radboud faces database (Langner et al., 2010), and the Umeå University database of facial expressions (Samuelsson et al., 2012). These sets, however, have limitations. First, they consist of static photographs of facial expressions from real persons, which cannot be readily modified to fit the specific requirements of particular studies (e.g., presenting elderly Caucasian females). Second, static facial stimuli elicit weaker muscle mimicry responses, and they are less ecologically valid than dynamic stimuli (Sato et al., 2008; Rymarczyk et al., 2011). Because recognition impairments encountered in clinical settings might be subtle, assessment of different emotional intensities is often required, which is better achieved with dynamic stimuli (incremental expression of emotions) than static photographs (Sato and Yoshikawa, 2007).

Custom-made video clips of human actors expressing emotions have also been used (Gosselin et al., 1995), although producing them is a time-consuming and costly process. Recent sets of validated video clips are available (van der Schalk et al., 2011; Bänziger et al., 2012), but again, important factors such as personal expressive style and physical characteristics (facial physiognomy, eye–hair color, skin texture, etc.) of the stimuli are fixed and difficult to control. Furthermore, video clips are not ideal for novel treatment approaches that use Human–Computer Interfaces (HCI; Birbaumer et al., 2009; Renaud et al., 2010).

A promising avenue to address all these issues is the creation of virtual faces expressing emotions (Roesch et al., 2011). Animated synthetic faces expressing emotions allow control of a number of potential confounds (e.g., equivalent intensity, gaze, physical appearance, socio-demographic variables, head angle, ambient luminosity), while giving experimenters a tool to create specific stimuli corresponding to their particular demands. Before being used with HCI in research or clinical settings, sets of virtual faces expressing emotions must be validated. Although avatars expressing emotions are still rare (Krumhuber et al., 2012), interesting results emerged from previous studies. First, basic emotions are well recognized from simple computerized line drawings depicting facial muscle movements (Wehrle et al., 2000). Second, fundamental emotions expressed by synthetic faces are recognized as well as, if not better than, those expressed by real persons (except maybe for disgust; Dyck et al., 2008). Third, virtual facial expressions of emotions elicit sub-cortical activation of a magnitude equivalent to that observed with real facial expressions (Moser et al., 2007). Finally, clinical populations with deficits of social cognition also show impaired recognition of emotions expressed by avatars (Dyck et al., 2010). In brief, virtual faces expressing emotions represent a promising approach to evaluate aspects of social cognition both for fundamental and clinical research (Mühlberger et al., 2009).

We recently developed a set of adult (males and females) virtual faces from different ethnic backgrounds (Caucasian, African, Latin, or Asian), expressing seven facial emotional states (neutral, happiness, surprise, anger, sadness, fear, and disgust) with different intensities (40, 60, 100%), from different head angles (90°, 45°, and full frontal; Cigna et al., in press). The purpose of this study was to validate a dynamic version of these stimuli. In addition to verifying convergent validity with stimuli of dynamic expressions from real persons, the goal of this study was to demonstrate construct validity with physiological measures traditionally associated with facial emotion recognition of human expressions: facial electromyography (fEMG) and eye-tracking.

Facial muscles of an observer generally react with congruent contractions while observing the face of a real human expressing a basic emotion (Dimberg, 1982). In particular, the zygomatic major (lip corner pulling movement) and corrugator supercilii (brow lowering movement) muscles are rapidly, unconsciously, and differentially activated following exposure to pictures of real faces expressing basic emotions (Dimberg and Thunberg, 1998; Dimberg et al., 2000). Traditionally, these muscles are used to distinguish between positive and negative emotional reactions (e.g., Cacioppo et al., 1986; Larsen et al., 2003). In psychiatry, fEMG has been used to demonstrate sub-activation of the zygomatic major and/or the corrugator supercilii muscles in autism (McIntosh et al., 2006), schizophrenia (Mattes et al., 1995), personality disorders (Herpertz et al., 2001), and conduct disorders (de Wied et al., 2006). Interestingly, virtual faces expressing basic emotions induce the same facial muscle activation in the observer as do real faces, with the same dynamic > static stimulus advantage (Weyers et al., 2006, 2009). Thus, recordings of the zygomatic major and the corrugator supercilii muscle activations should represent a good validity measure of computer-generated faces.

Eye-trackers are also useful in the study of visual emotion recognition because gaze fixations on critical facial areas (especially mouth and eyes) are associated with efficient judgment of facial expressions (Walker-Smith et al., 1977). As expected, different ocular scanning patterns and regional gaze fixations are found among persons with better (Hall et al., 2010) or poorer recognition of facial expressions of emotions (e.g., persons with autism, Dalton et al., 2005; schizophrenia, Loughland et al., 2002; or psychopathic traits, Dadds et al., 2008). Very few eye-tracking studies have used virtual expressions of emotions, although the available data seem comparable to those obtained with real stimuli (e.g., Wieser et al., 2009). In brief, fEMG and eye-tracking measures could serve not only to validate virtual facial expressions of emotions, but also to demonstrate the possibility of using peripheral input (e.g., muscle activation and gaze fixations) with virtual stimuli for HCI. The main goal of this study was to conduct three types of validation with a new set of virtual faces expressing emotions: (1) primary (face) validity with recognition rates; (2) concurrent validity with another, validated instrument; and (3) criterion validity with facial muscle activation and eye gaze fixations. This study was based on three hypotheses. H1: the recognition rates would not differ significantly between the real and virtual conditions for any of the six expressed emotions; H2: real and virtual conditions would elicit similar mean activation of the zygomatic major and corrugator supercilii muscles for the six expressed emotions; H3: the mean time of gaze fixations on regions of interest would be similar in both conditions (real and virtual).

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Forty-one adult Caucasian volunteers participated in the study (mean age: 24.7 ± 9.2, range 18–60; 20 males and 21 females). They were recruited via Facebook friends and university campus advertisement. Exclusion criteria were a history of epileptic seizures, having received a major mental disorder diagnosis, or suffering from motor impairment. Each participant signed an informed consent form and received a \$10 compensation for their collaboration. This number of participants was chosen based on previous studies concerned with facial expressions of emotion (between 20 and 50 participants; Weyers et al., 2006, 2009; Dyck et al., 2008; Likowski et al., 2008; Mühlberger et al., 2009; Roesch et al., 2011; Krumhuber et al., 2012).

#### **MATERIALS AND MEASURES**

Participants were comfortably seated in front of a 19″ monitor in a sound-attenuated, air-conditioned (19°C) laboratory room. The stimuli were video clips of real Caucasian adult faces and video clips of avatar Caucasian adult faces dynamically expressing a neutral state and the six basic emotions (happiness, surprise, anger, sadness, fear, and disgust). Video clips of real persons (one male, one female) were obtained from computerized morphing (FantaMorph software, Abrasoft) of two series of photographs from the classic Pictures of Facial Affect (POFA) set (Ekman and Friesen, 1976; from neutral to 100% intensity). Video clips of virtual faces were obtained from morphing (neutral to 100% intensity) static expressions of avatars from our newly developed set (one male, one female; Cigna et al., in press; **Figure 1**). The stimuli configurations were based on the POFA (Ekman and Friesen, 1976) and the descriptors of the Facial Action Coding System (Ekman et al., 2002). In collaboration with a professional computer graphic designer specialized in facial expressions (BehaVR solution)<sup>1</sup>, virtual dynamic facial movements were obtained by gradually moving multiple facial points (action units) along vectors involved in the 0–100% expressions (Rowland and Perrett, 1995). For the present study, 24 video clips were created: 2 (real and virtual) × 2 (man and woman) × 6 (emotions). A series example is depicted in **Figure 2**. Video clips of 2.5, 5, and 10 s were obtained, and pilot data indicated that 10 s presentations were optimal for eye-tracking analyses. Therefore, real and synthetic expressions were presented for 10 s, preceded by a 2 s central fixation cross. During the inter-stimulus intervals (max 10 s), participants had to select (mouse click) the emotion expressed by the stimulus from a multiple-choice questionnaire (Acrobat Pro software) appearing on the screen. Each stimulus was presented once, pseudo-randomly, in four blocks of six emotions, counterbalanced across participants (Eyeworks presentation software, Eyetracking Inc., CA, USA).

<sup>1</sup>http://www.behavrsolution.com
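The 2 × 2 × 6 design and its blocked presentation can be sketched as follows. This is an illustration under stated assumptions, not the Eyeworks configuration used in the study: the block composition (one stimulus type and one model per block), the rotation of block order by participant index, and the function name `build_blocks` are all hypothetical.

```python
import itertools
import random

STIMULUS_TYPES = ("real", "virtual")
MODEL_SEX = ("man", "woman")
EMOTIONS = ("happiness", "surprise", "anger", "sadness", "fear", "disgust")


def build_blocks(participant_index, seed=0):
    """Return four blocks of six clips; each of the 24 clips appears once.

    Assumption: a block holds the six emotions of one stimulus type and one
    model; block order is rotated by participant index as a simple
    counterbalancing scheme.
    """
    rng = random.Random(seed + participant_index)
    blocks = []
    for stim_type, sex in itertools.product(STIMULUS_TYPES, MODEL_SEX):
        emotions = list(EMOTIONS)
        rng.shuffle(emotions)  # pseudo-random emotion order within the block
        blocks.append([(stim_type, sex, emotion) for emotion in emotions])
    shift = participant_index % len(blocks)
    return blocks[shift:] + blocks[:shift]
```

Seeding the generator per participant keeps each presentation order reproducible, which is convenient when recognition responses are later matched to clips.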

Fiber contractions (microvolts) of the zygomatic major and the corrugator supercilii muscles (left side) were recorded with 7 mm bipolar (common mode rejection) Ag/AgCl pre-gelled adhesive electrodes<sup>2</sup>, placed in accordance with the guidelines of Fridlund and Cacioppo (1986). The skin was exfoliated with NuPrep (Weaver, USA) and cleansed with sterile alcohol prep pads (70%). The raw signal was pre-amplified through a MyoScan-Z sensor (Thought Technology, Montreal, QC, Canada) with built-in impedance check (<15 kΩ), referenced to the upper back. Data were relayed to a ProComp Infinity encoder (range of 0–2000 µV; Thought Technology) set at 2048 Hz, and post-processed with the Physiology Suite for Biograph Infinity (Thought Technology). Data were filtered with a 30 Hz high-pass filter, a 500 Hz low-pass filter, and a 60 Hz notch filter. Baseline EMG measures were obtained at the beginning of the session, during eye-tracking calibration. Gaze fixations were measured with a FaceLab5 eye-tracker (SeeingMachines, Australia), and regions of interest were defined as commissures of the eyes and the mouth (Eyeworks software; **Figure 3**). Assessments were completed in approximately 30 min.
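The filter chain described above (30 Hz high-pass, 500 Hz low-pass, 60 Hz notch, at a 2048 Hz sampling rate) can be approximated in a few lines. The sketch below is a stand-in, not the Biograph Infinity implementation: the use of second-order sections, the Q values, and all function names are assumptions, with coefficients taken from the widely used "Audio EQ Cookbook" biquad formulas.

```python
import math


def biquad(kind, f0, fs, q):
    """Return normalized (b, a) coefficients for a second-order filter."""
    w0 = 2.0 * math.pi * f0 / fs
    cos_w0, alpha = math.cos(w0), math.sin(w0) / (2.0 * q)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "notch":
        b = [1.0, -2 * cos_w0, 1.0]
    else:
        raise ValueError(kind)
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    return [c / a[0] for c in b], [c / a[0] for c in a]


def apply_filter(b, a, x):
    """Filter sequence x with a direct-form I biquad (a[0] must be 1)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def preprocess_emg(x, fs=2048.0):
    """Apply a 30 Hz high-pass, 500 Hz low-pass, and 60 Hz notch in series."""
    butter_q = 1 / math.sqrt(2)  # Butterworth-like response (assumption)
    for kind, f0, q in (("highpass", 30.0, butter_q),
                        ("lowpass", 500.0, butter_q),
                        ("notch", 60.0, 30.0)):
        b, a = biquad(kind, f0, fs, q)
        x = apply_filter(b, a, x)
    return x


def rms(x):
    """Root-mean-square amplitude of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))
```

In this sketch a 60 Hz test tone is almost entirely removed once the notch has settled, while a tone inside the pass band (for example 100 Hz) is left essentially unchanged.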

#### **STATISTICAL ANALYSES**

Emotion recognition and physiological data from each participant were recorded in Excel files and converted into SPSS for statistical analyses. First, recognition rates (%) for real vs. avatar stimuli from male and female participants were compared with Chi-square analyses, corrected (*p* < 0.008) and uncorrected (*p* < 0.05) for multiple comparisons. The main goal of this study was to demonstrate that the proportion of recognition of each expressed emotion would be statistically similar in both conditions (real vs. virtual). To this end, effect sizes (ES) were computed using the Cramer's *V* statistic. Cramer's *V* values of 0–0.10, 0.11–0.20, 0.21–0.30, and 0.31 and above are considered null, small, medium, and large, respectively (Fox, 2009). Repeated measures analyses of variance (ANOVAs) between factors (real vs. virtual) with the within-subject factor emotion (happiness, surprise, anger, sadness, fear, or disgust) were also conducted on the mean fiber contractions of the zygomatic major and the corrugator supercilii muscles, as well as the mean time spent looking at the mouth, eye, and elsewhere. For these comparisons, ES were computed with the *r* formula; values of 0.10, 0.30, and 0.50 were considered small, medium, and large, respectively (Field, 2005).

<sup>2</sup>http://www.bio-medical.com
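For a 2 × 2 comparison (condition by correct vs. incorrect recognition), the chi-square statistic and Cramer's *V* can be computed as in the sketch below. The counts are hypothetical and serve only to show the arithmetic; they are not data from this study. The corrected threshold is consistent with dividing 0.05 by six emotion-wise comparisons, which is an assumption about how *p* < 0.008 was derived.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramer's V; for a 2x2 table this reduces to sqrt(chi2 / n)."""
    chi2, n = chi_square_2x2(table)
    return math.sqrt(chi2 / n)


# Bonferroni-style threshold for six emotion-wise comparisons (assumption).
ALPHA_CORRECTED = 0.05 / 6  # about 0.008

# Hypothetical counts: rows = real / virtual, columns = correct / incorrect.
example = [[10, 20], [20, 10]]
```

With these illustrative counts every expected cell count is 15, so the chi-square statistic is 4 × 25/15 ≈ 6.67 and *V* = √(6.67/60) ≈ 0.33, which would fall in the "large" band defined above.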

#### **ETHICAL CONSIDERATION**

This study was approved by the ethical committee of the University of Quebec at Trois-Rivières (CER-12-186-06.09).

#### **RESULTS**

No significant difference emerged between male (90%) and female (92.1%) raters (data not shown). In accordance with H1, recognition rates of the whole sample did not differ significantly between real and virtual expressions, neither overall [90.4 vs. 91.7%, respectively; *χ*<sup>2</sup>(1) = 0.07, *p* = 0.51] nor for each emotion (**Table 1**). ES was small between conditions for all emotions, including joy (0.10), surprise (0.08), anger (0.07), sadness (0.04), fear (0.12), and disgust (0.07) (**Table 1**). In accordance with H2, no difference emerged between the mean contractions of the zygomatic major or the corrugator supercilii muscles between both conditions for any emotions, with all ES below 0.19 (**Table 2**). Finally, in partial accordance with H3, only the time spent looking at the mouth differed significantly between conditions [Real > Virtual; *F*(1,29) = 3.84, *p* = 0.001, ES = 0.58; **Table 3**]. Overall, low ES demonstrate that very few differences exist between the real and virtual conditions. However, such low ES also generated weak statistical power (0.28 with an alpha set at 0.05 and 41 participants). Therefore, the possibility remains that these negative results reflect a type-II error (1 − power = 0.72).



ES, effect size (Cramer's V).

**Table 2 | Comparisons of mean (SD) facial muscle activations during presentations of real and virtual stimuli expressing the basic emotions**.


ES, effect size (r).

#### **DISCUSSION**

The main goal of this study was to initially assess concomitants and construct validity of computer-generated faces expressing emotions. No difference was found between recognition rates, facial muscle activation, and gaze time spent on the eye region of virtual and real facial expression of emotions. Thus, these virtual faces can be used for the study of facial emotion recognition. Basic emotions such as happiness, anger, fear, and sadness were all correctly recognized with rates higher than 80%, which is comparable to rates obtained with other virtual stimuli (Dyck et al., 2008; Krumhuber et al., 2012). Interestingly, disgust expressed by our avatars was correctly detected in 98% of the cases (compared with 71% for real stimuli), an improvement from older stimuli (Dyck et al., 2008; Krumhuber et al., 2012). The only difference we found between the real vs. virtual conditions was the time spent looking at the mouth region of the real stimuli, which might be due to an artifact. Our real stimuli were morphed photographs, which could introduce unnatural saccades or texture-smoothing from digital blending. In this study, for instance, the highest time spent looking at the mouth of real stimuli was associated with a jump in the smile of the female POFA picture set (abruptly showing her teeth). Thus, comparisons with video clips of real persons expressing emotions are warranted (van der Schalk et al., 2011). Still, these preliminary data are encouraging. They suggest that avatars could eventually serve alternative clinical approaches such as virtual reality immersion and HCI (Birbaumer et al., 2009; Renaud et al., 2010). It could be hypothesized, for instance, that better detection of others' facial expressions would be achieved through biofeedback based on facial EMG and avatars reacting with corresponding expressions (Allen et al., 2001; Bornemann et al., 2012).

**Table 3 | Comparisons of mean (SD) duration (ms) of gaze fixations during presentations of real and virtual stimuli expressing the basic emotions**.

Some limits associated with this study should be addressed by future investigation. First, as mentioned above, using video clips of real persons expressing emotions would be preferable to using morphed photographs. It would also allow presentation of colored stimuli in both conditions. Second, and most importantly, the small number of participants in the present study prevents demonstrating that the negative results were not due to a type-II statistical error related to a lack of power. Most studies using avatars expressing emotions are based on sample sizes ranging from 20 to 50 participants (Weyers et al., 2006, 2009; Dyck et al., 2008; Likowski et al., 2008; Mühlberger et al., 2009; Roesch et al., 2011; Krumhuber et al., 2012), because recognition rates are elevated, physiological effects are strong, and effect sizes are high. Although demonstrating an absence of difference is more difficult, these and the present results suggest that no significant difference exists between recognition of and reaction to real and virtual agent expressions of emotions. Only the addition of more participants in future investigations with our avatars will allow discarding the possibility of a type-II error.

Finally, with the increasing availability of software enabling the creation of whole-body avatars (Renaud et al., 2014), these virtual faces could be used to assess and treat social cognition impairment in clinical settings. We truly believe that the future of social skill evaluation and training resides in virtual reality.

#### **AUTHOR NOTE**

This study was presented in part at the 32nd annual meeting of the Association for Treatment of Sexual Abusers (ATSA), Chicago, 2013.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 18 April 2014; paper pending published: 03 July 2014; accepted: 16 September 2014; published online: 30 September 2014.*

*Citation: Joyal CC, Jacob L, Cigna M-H, Guay J-P and Renaud P (2014) Virtual faces expressing emotions: an initial concomitant and construct validity study. Front. Hum. Neurosci. 8:787. doi: 10.3389/fnhum.2014.00787*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Joyal, Jacob, Cigna, Guay and Renaud. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## EEVEE: the Empathy-Enhancing Virtual Evolving Environment

#### *Philip L. Jackson<sup>1,2,3</sup>\*, Pierre-Emmanuel Michon<sup>2,3</sup>, Erik Geslin<sup>1</sup>, Maxime Carignan<sup>1</sup> and Danny Beaudoin<sup>1</sup>*

*<sup>1</sup> Faculté des Sciences Sociales, École de Psychologie, Université Laval, Québec, QC, Canada*

*<sup>2</sup> Centre for Research in Rehabilitation and Social Integration (CIRRIS), Université Laval, Québec, QC, Canada*

*<sup>3</sup> Institut Universitaire en Santé Mentale de Québec (CRIUSMQ), Université Laval, Québec, QC, Canada*

#### *Edited by:*

*Leonhard Schilbach, University Hospital Cologne, Germany*

#### *Reviewed by:*

*Sebastian Korb, International School for Advanced Studies (SISSA), Italy*

*Martin Schulte-Rüther, University Hospital RWTH Aachen, Germany*

#### *\*Correspondence:*

*Philip L. Jackson, École de Psychologie, Université Laval, Pavillon Félix-Antoine-Savard, 2325, Rue des Bibliothèques, Bureau 1116, Québec, QC G1V 0A6, Canada*

*e-mail: philip.jackson@psy.ulaval.ca*

Empathy is a multifaceted emotional and mental faculty that is often found to be affected in a great number of psychopathologies, such as schizophrenia, yet it remains very difficult to measure in an ecological context. The challenge stems partly from the complexity and fluidity of this social process, but also from its covert nature. One powerful tool to enhance experimental control over such dynamic social interactions has been the use of avatars in virtual reality (VR); information about an individual in such an interaction can be collected through the analysis of his or her neurophysiological and behavioral responses. We have developed a unique platform, the Empathy-Enhancing Virtual Evolving Environment (EEVEE), which is built around three main components: (1) different avatars capable of expressing feelings and emotions at various levels based on the Facial Action Coding System (FACS); (2) systems for measuring the physiological responses of the observer (heart and respiration rate, skin conductance, gaze and eye movements, facial expression); and (3) a multimodal interface linking the avatar's behavior to the observer's neurophysiological response. In this article, we provide a detailed description of the components of this innovative platform and validation data from the first phases of development. Our data show that healthy adults can discriminate different negative emotions, including pain, expressed by avatars at varying intensities. We also provide evidence that masking part of an avatar's face (top or bottom half) does not prevent the detection of different levels of pain. This innovative and flexible platform provides a unique tool to study and even modulate empathy in a comprehensive and ecological manner in various populations, notably individuals suffering from neurological or psychiatric disorders.

**Keywords: emotions, empathy, pain, FACS, avatar, virtual reality, affective computing**

#### **INTRODUCTION**

Imagine a hospital ward for burn patients, where a nurse comes in to change the dressings of a patient. As she removes the old dressing, the face of the patient writhes in pain, his body stiffens and he moans. Yet the nurse seems insensitive to his pain and continues her work. This fictional scenario illustrates an extreme situation where healthcare professionals are constantly faced with the pain and suffering of other people, and yet, healthcare professionals are mostly able to function even in this highly discomforting environment. Does this imply that they lack empathy? Surely not, but it suggests that the response that would typically be observed in non-medical personnel, aversion, is somewhat reduced. A number of recent studies in cognitive neuroscience have examined the brain response of people confronted with the pain of others (Jackson et al., 2006; Lamm et al., 2011; Guo et al., 2013) and suggested that observing pain engages brain regions that are also involved in processing painful stimuli, a phenomenon often referred to as "resonance." Such resonant patterns have been shown to be altered in medical personnel (Cheng et al., 2007; Decety et al., 2010), and also in a number of psychopathologies (Moriguchi et al., 2007; Bird et al., 2010; Marcoux et al., 2014). While this experience-related change in brain response (and in physiological responses, e.g., Hein et al., 2011) can be well-adapted in some contexts, it might also lead in other circumstances to suboptimal interpersonal interactions. For instance, medical personnel should remain empathic, by regulating their own distress without changing their caregiving abilities. In the context of schizophrenia, a reduction of empathy could stem from dysfunctions at different cerebral levels, and could interfere with rehabilitation and therapeutic processes. Thus, having tools to better study the physiological correlates of empathy could lead to new intervention avenues.
These examples highlight the need for an innovative tool, which could benefit from the growing literature on the cognitive neuroscience of empathy and the rapid technological advances in both computer animation and measures of neurophysiological responses.

While empathy was initially designated as an ability to "put oneself in the place of another," as a transposition or mental projection (Dilthey, 1833-1911), this view involves a dissociation between the perception of emotion and cognition (Descartes, 1649). Most contemporary scholars now agree that empathy is the product of both cognitive and emotional processes, and that the line between the two is indeed difficult to draw. Empathy is currently defined as "*The naturally occurring subjective experience of similarity between the feelings expressed by self and others without losing sight of whose feelings belong to whom*" (Decety and Jackson, 2004). The empathy model on which this work is founded is an extension of previously described models, based on multiple fields of research which include developmental, comparative, cognitive and social psychology, as well as psychiatry and neuroscience (Decety and Jackson, 2004; Decety et al., 2007). This model comprises three major components: (1) an automatic and emotional component referred to as *affective sharing* (resonance), (2) a deliberate and controlled cognitive component called *perspective-taking*, and (3) a hybrid component with automatic and controlled processes, dubbed *executive regulation*, which modulates the other components.

One obvious challenge in the study of empathy is that this mental faculty relies largely on inner emotional and cognitive processes, which can only be measured indirectly through largely subjective verbal responses and behavioral measures (Lawrence et al., 2004). Other measures such as psychophysiological responses (e.g., skin conductance) and changes in brain activity levels and patterns [e.g., as measured with functional magnetic resonance imaging (fMRI) or electroencephalography (EEG)] have been identified as potential markers of the empathic response (Wicker et al., 2003; Jackson et al., 2005; Marcoux et al., 2013) and have the advantage of being objective. Studies combining several of these markers are scarce, yet the convergence of multiple sources of information seems to be the most promising route to grasp the full complexity of empathy. Most studies still use relatively simple visual stimuli and experiments based on a series of independent short events, in which there is no feedback to the participant. Participants in these studies observe a situation and are asked to rate it on different features, but rarely does the situation change according to the participant's response. Such a form of interactivity seems essential to the ecological study of empathy (Zaki and Ochsner, 2012; Achim et al., 2013; Schilbach et al., 2013). One way to improve this interactive factor in an experimental setting would be to alter the "other," i.e., the person with whom one should empathize, in real time to provide feedback. Research in computer science, through developments in virtual reality, social gaming and affective computing research, is making progress toward improved interactivity in experimental settings (e.g., Vasilakos and Lisetti, 2010; Tsai et al., 2012; de Melo et al., 2014; Gaffary et al., 2014). Affective computing research aims to develop software that can better recognize and display emotions.
This type of technology can now be used in combination with the objective neurophysiological markers of human empathy to pinpoint the most relevant ones and to help study online changes in these markers during social interactions.

The systematic study of empathy is limited by the fact that this process may change based on context and on the individuals present in the interaction. Indeed, the literature suggests that empathic responses are different depending on the nature of the relationship between the observer and the target (Monin et al., 2010a,b). Thus, studying empathy systematically requires flexibility both in the choice of markers and in the level of experimental control over the context.

The aim of this article is to describe the recent methodological progress arising from the development of a uniquely powerful interactive virtual reality platform for empathy research. This platform, EEVEE, the Empathy-Enhancing Virtual Evolving Environment, was developed in order to better understand the different behavioral and physiological markers of human empathy. EEVEE was designed based on three objectives: (1) to provide a means to study empathy within an interactive, yet controlled, social environment, (2) to identify the biomarkers that reflect the different facets of the empathic response, and ultimately, (3) to use this technology to train and improve empathy. EEVEE uses avatars that dynamically respond to the user's physiological feedback through the production of emotional facial expressions. At this stage of development, EEVEE is geared toward expressions of pain, as they have been a good model for the study of empathy from a social and cognitive neuroscience perspective (Decety and Jackson, 2004; Coll et al., 2011; Lamm et al., 2011). The use of avatars instead of pre-recorded videos of facial expressions has the advantage of being highly controllable. EEVEE can independently change several features of the facial expression, thereby producing different combinations of expressions of varying (and measurable) intensity, duration, and context. EEVEE can also alter these features in real time based on individual responses, i.e., behavioral, neural and physiological markers. EEVEE has a modular architecture enabling additional instruments and types of neurophysiological measures to be added as they become available.

The first part of this article describes EEVEE and its different components. Then, a two-stage validation process is described, in which the properties of different facial expressions of emotions are demonstrated. A sample experiment follows, showing how EEVEE can be used to test specific hypotheses regarding pain and empathy. Finally, we discuss and provide examples of how EEVEE can be used in real time.

#### **DEVELOPMENT OF EEVEE**

EEVEE consists of three main components (see **Figure 1**): (1) the production, animation and display of responsive human avatars, (2) the measure of neurophysiological and motor responses, and (3) a multimodal interface for setting the parameters of the computational integration and of the interaction between avatar and participant within a scenario.

#### **PRODUCTION, ANIMATION AND DISPLAY OF RESPONSIVE HUMAN AVATARS**

This component allows the production and animation of high-resolution avatars to produce distinct sets of emotional expressions based on the Facial Action Coding System (FACS), which encompasses 45 facial muscle movements and 10 head movements (Ekman et al., 2002; for pain, see Prkachin and Solomon, 2008). Six different avatars (male and female adults) can currently be selected (see **Figure 2**). Future versions of EEVEE will allow users to change the avatar's gender, age, and ethnicity independently.

[Figure caption fragment: … and Autonomic Nervous System (ANS) responses via heart rate Beat per Minute (BPM), Electrocardiography (ECG), respiration (RESP), Electro-Dermal Activity (EDA), and Automatic Facial Expression …]

[Figure caption fragment: … curve (EDA), facial action unit intensities (FACS). All of these data are gathered by a system for physiological measurement (MP150, Biopac Systems Inc.) and the FaceReader™ software encodes FACS information.]

**FIGURE 2 | Examples of avatars' expressions of emotions using the Facial Action Coding System principles. Top row** (left to right): Fear, Sadness, Pain, Anger, Joy, Disgust. **Middle row**: Neutral. **Bottom row** (left to right): Joy, Disgust, Fear, Anger, Sadness, Pain.

We implemented a methodology for scanning the head of a real person using a high-resolution 3D scanner (Creaform Gemini™ with the Geomagic Wrap® software). Once scanned, these high-resolution 3D models are transferred by wrap iterations on a single basic low-polygon model. This model is able to express a variety of emotions by the use of motion captures (to simulate natural movements of intermediate idle phases between expressions), and blend shapes. Blend shapes (morph target animations) are convex combinations of *n* base vectors and mesh formed in the same topology. The movements of vertex points in xyz space generate animations of shapes. In this first stage of development, on the basis of the 3D model of a team member's head, we used 3ds Max® 2014 (Autodesk) to recreate a low-polygon mesh model for real-time 3D use. Based on this, we created a highly detailed and triangulated model in ZBrush® (Pixologic). This high-resolution model was then used to generate a projection texture normal map (Sander et al., 2001), simulating facial detail in the Unity 3D engine (Unity Technologies). The skin of the avatars was produced by combining a Microsoft DirectX® 11 Shader Model 5, using a diffuse texture of 4096 × 4096 pixels based on RGB 173.100.68, and a texture normal map of 4096 × 4096 pixels integrating wrinkles to the expressions. A third 4096 × 4096 map contains full texture in each RGB channel: three textures refer to levels of specular, glossiness and depth sub surface scattering (SSS) defining skin translucence. EEVEE display support includes tessellations with displacement map and Phong smoothing, diffuse scattering with separate weighted normal, Fresnel specular reflectance, translucency, rim lighting and real time shadows from two light sources (see **Figure 3**). Tessellation simulates a mesh of a large volume on a low-polygon mesh, which results in the facial shape of the avatar appearing less angular. Specular reflectance and translucence simulate various aspects of skin gloss, grain, depth and light absorption. Based on the modeling of a first avatar generated with blend shapes, five other avatars were created with very different physical characteristics, but all having the same basic triangulation meshes with 5126 triangles. Importantly, this technique allows the same blend shapes to be used for the expressions of different avatars, while maintaining their idiosyncratic aspect based on their own morphology. We also used a skinning face of the basic avatar, on a bone network for the application of facial motion captures made with a Vicon Bonita™ B10 system, using a 1.0 megapixel camera at 250 frames per second. This skinning face allows the use of motion captures combined with changes in vertex blend shapes. The avatars display special wrinkles only when expressing emotions. Action units (AUs) were created by modifying the vertex positions in our main avatar by a developer with a FACS Coder certification. These blend shapes were then exported to the Unity 3D engine to be dynamically used on all avatars. The blend shapes produce muscular facial movements, based on the muscular movement anatomy described in the FACS Manual (Ekman and Friesen, 1978), and use dynamic normal maps to generate wrinkles associated with the expression of emotions. These wrinkles were created in ZBrush® from a 2048 × 2048 pixel normal map texture, based on Ekman's FACS. The intensity mapping for maximum intensity (E) was determined with a top-down approach using the FACS manual, interpreted for the bone and muscle structure of the human face used as a model in the original avatar. Expression levels Neutral to E were linearly distributed between a completely relaxed state (0%) and maximum expression intensity (100%). The dynamic normal maps display the blending of each of the two RGBa textures that produces four different masks.
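The description of blend shapes as convex combinations of meshes sharing one topology can be illustrated with a minimal sketch. It reads "convex combination" literally: the neutral mesh receives whatever weight is not assigned to the targets, so all coefficients are non-negative and sum to 1. The function name `blend_vertices` and the toy two-vertex mesh are illustrative assumptions; this is not the EEVEE or Unity implementation.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def blend_vertices(
    neutral: Sequence[Vertex],
    targets: Sequence[Sequence[Vertex]],
    weights: Sequence[float],
) -> List[Vertex]:
    """Convex combination of a neutral mesh and n target meshes (same topology).

    The neutral mesh receives the leftover weight 1 - sum(weights), so all
    coefficients are non-negative and sum to 1.
    """
    if len(targets) != len(weights):
        raise ValueError("one weight per target mesh is required")
    if any(w < 0 for w in weights) or sum(weights) > 1 + 1e-9:
        raise ValueError("weights must be non-negative and sum to at most 1")
    if any(len(t) != len(neutral) for t in targets):
        raise ValueError("all meshes must share the same vertex count")

    rest = 1.0 - sum(weights)
    blended = []
    for i, base in enumerate(neutral):
        xyz = [rest * c for c in base]
        for target, w in zip(targets, weights):
            for axis in range(3):
                xyz[axis] += w * target[i][axis]
        blended.append((xyz[0], xyz[1], xyz[2]))
    return blended


# Toy two-vertex "mesh" (hypothetical data): the target lowers the first vertex.
neutral_mesh = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
brow_lowered = [(0.0, 0.5, 0.0), (1.0, 1.0, 0.0)]
half_way = blend_vertices(neutral_mesh, [brow_lowered], [0.5])
```

Because every avatar shares the same triangulation, the same list of target weights could in principle be applied to each avatar's own neutral mesh and targets, which is the property the text highlights.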

In this first phase of development, we created two different environments within EEVEE: a hospital room (see **Figure 4**) and a park (see Video 1 in Supplementary Material). These environments can easily be changed to other settings, depending on the user's objectives. They are entirely in 3D, are real-time ready, and can be displayed through the Unity engine on a variety of screens, or even in a more immersive display system such as an HMD (head-mounted display) or a CAVE (computer assisted virtual environment). A version of EEVEE was developed for the Oculus Rift™ HMD (Oculus VR), which improves visual immersion, but also comes with other restrictions, notably for the possibility of using eye-tracking systems.

#### **PAIN FACIAL ACTION UNITS**

The flexibility of the emotional expressions in EEVEE is a fundamental part of its development. The facial expressions of the avatars are based on the Facial Action Coding System (FACS; Ekman and Friesen, 1978), which is based in part on earlier research by Carl-Herman Hjortsjö on facial imitation (Hjortsjö, 1969). Contractions and relaxations of the different facial muscles create variations that are encoded as Action Units (AUs). The FACS describes 46 numbered AUs, each AU corresponding to the contraction and relaxation of a muscle or muscle group. In the case of pain for example, several AUs are mobilized: AU4 + (AU6, AU7) + (AU9, AU10) + AU43 (Prkachin, 1992, 2009; Lucey et al., 2011a). AU4 corresponds to the brow lowerer, glabellae, depressor supercilii, corrugator supercilii; AU6 to the cheek raiser, orbicularis oculi (pars orbitalis); AU7 to the lid tightener, orbicularis oculi (pars palpebralis); AU9 to the nose wrinkler, levator labii superioris alaeque nasi; AU10 to the upper lip raiser, levator labii superioris, caput infraorbitalis; AU12 to the lip corner puller, zygomaticus major; AU25 to lips part, depressor labii inferioris, or relaxation of mentalis or orbicularis oris; and AU43 to the eyes closing, relaxation of levator palpebrae superioris. The AUs can be scored according to their intensity by appending letters A–E (from minimal to maximal intensity) following this scale: A = trace; B = slight; C = marked or pronounced; D = severe or extreme; E = maximum. Using this system, a face with the FACS values AU4B + AU6E + AU7D + AU9C + AU10D + AU12D + AU25E + AU43A would result in a pain expression. We also use the units AU51 (head turn left), M60 (head-shake side to side), and M83 (head upward and to the side), to produce natural head movements of the avatars (see **Figures 2**, **5**). The timing of the different AUs is currently the same, but further developments of the system are expected to introduce variable timing of the different AUs of one emotion, which should help approximate the natural dynamics of facial expression (Jack et al., 2014), as well as allowing the study of the impact of changing inter-AU timing.
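The AU-plus-letter notation above lends itself to a small parser. The sketch below is only an illustration: the function name `parse_facs` is hypothetical, and the letter-to-percentage mapping assumes the linear spacing (A = 20% up to E = 100% of maximum intensity) that the article uses for its expression levels.

```python
import re
from typing import Dict

# Assumed linear mapping of FACS intensity letters to % of maximum intensity.
LEVEL_TO_PERCENT = {"A": 20, "B": 40, "C": 60, "D": 80, "E": 100}

# One scored action unit, e.g. "AU10D": AU number followed by a letter A-E.
_AU_TOKEN = re.compile(r"^AU(\d+)([A-E])$")


def parse_facs(expression: str) -> Dict[int, int]:
    """Turn 'AU4B + AU6E' into {4: 40, 6: 100} (AU number -> intensity in %)."""
    result = {}
    for token in expression.split("+"):
        match = _AU_TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"not a scored action unit: {token.strip()!r}")
        result[int(match.group(1))] = LEVEL_TO_PERCENT[match.group(2)]
    return result


# The pain expression given as an example in the text.
pain = parse_facs("AU4B + AU6E + AU7D + AU9C + AU10D + AU12D + AU25E + AU43A")
```

A dictionary keyed by AU number would make it straightforward to drive each blend shape separately, for instance once per-AU timing is introduced.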

According to the FACS Manual, wrinkles are often one of the first indications of muscular contractions (Ekman et al., 2002). For most AUs, wrinkles appear at the B intensity level. In order to respect this natural dynamic, and in order to avoid the so-called "Uncanny Valley" (Mori, 1970; Seyama and Nagayama, 2007) due to cognitive incoherence during observation of emotional expressions (wrinkles produced by the normal map would be seen very late after users cognitively detected the movements produced by the vertex blend shapes), blend shapes (BS) and dynamic normal maps (DNM) were created following this rule: AU-A: 20% BS + 20% DNM; AU-B: 40% BS + 60% DNM; AU-C: 60% BS + 80% DNM; AU-D: 80% BS + 100% DNM; AU-E: 100% BS + 100% DNM.
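The BS/DNM rule can be written as a simple lookup table. The percentages below are copied from the rule stated above; the names `LayerWeights`, `INTENSITY_RULE` and `layer_weights` are hypothetical and do not come from EEVEE.

```python
from typing import NamedTuple


class LayerWeights(NamedTuple):
    blend_shape: float         # fraction of the vertex blend shape (0..1)
    dynamic_normal_map: float  # fraction of the wrinkle normal map (0..1)


# Values taken from the rule in the text; the table layout is illustrative.
INTENSITY_RULE = {
    "A": LayerWeights(0.20, 0.20),
    "B": LayerWeights(0.40, 0.60),
    "C": LayerWeights(0.60, 0.80),
    "D": LayerWeights(0.80, 1.00),
    "E": LayerWeights(1.00, 1.00),
}


def layer_weights(level: str) -> LayerWeights:
    """Return blend-shape and normal-map weights for a FACS intensity letter."""
    try:
        return INTENSITY_RULE[level.upper()]
    except KeyError:
        raise ValueError(f"unknown FACS intensity level: {level!r}") from None
```

From level B onward the wrinkle map runs ahead of the vertex displacement, which matches the stated aim of showing wrinkles as an early sign of contraction.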

#### **MEASURES OF NEUROPHYSIOLOGICAL AND MOTOR RESPONSES**

This component of EEVEE consists of a series of apparatus that will allow real time measurement of behavioral and physiological responses of the participants. A number of inputs have already been integrated in EEVEE, and the number and types of devices can be changed and optimized over time. The current version of EEVEE uses an emotional face recognition tool (Noldus FaceReader™), measures of heart electrical activity, respiration rate, and skin conductance (MP150, Biopac Systems Inc.), as well as eye-tracking and pupillometry (Smart Eye Pro, Smart Eye®).

The instant values for 19 of the FACS AUs and aggregated values for 6 mood expressions detected on the observed face are acquired using Noldus FaceReader™ with the action units and external data modules. This information is transmitted by TCP protocol at 15 Hz. The MP150, Biopac Systems Inc. system is used to acquire: electrocardiographic activity recorded with 3 AgCl electrodes with 10% saline gel positioned using Einthoven's triangle; thoracic dilatation recorded by a stretch mesh strap and transducer; and skin conductance recorded by 2 AgCl electrodes with 0.5% isotonic gel positioned on the index and middle fingertips. These three signals are sent by TCP protocol at a 120 Hz sampling rate, then transformed respectively into a heart rate variability measure (difference between the last R-R interval and its previous value), a respiration rate variability measure (difference between the last respiratory cycle duration and its previous value), and a galvanic skin response measure (area under the curve for the last 4 s of a high pass 0.05 Hz filter of the skin conductance). Smart Eye Pro, Smart Eye® is used to acquire eye tracking data and pupil diameter for both eyes at 120 Hz through TCP protocol, with the latter transformed into a pupil diameter variation during the last second. All the physiological data arrive independently to a dedicated server.
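The feature definitions above can be sketched in a few lines. Several points are assumptions rather than details from the article: the high-pass filter is implemented as a first-order RC filter (the filter design is not specified), the area under the curve uses the trapezoidal rule, and "pupil diameter variation during the last second" is read as the last sample minus the sample one second earlier. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def last_interval_change(event_times: Sequence[float]) -> float:
    """Difference between the last inter-event interval and the one before it.

    With R-peak times this gives the heart rate variability measure described
    above; with breath-onset times, the respiration variability measure.
    """
    if len(event_times) < 3:
        raise ValueError("need at least three events for two intervals")
    last = event_times[-1] - event_times[-2]
    previous = event_times[-2] - event_times[-3]
    return last - previous


def high_pass(samples: Sequence[float], rate_hz: float, cutoff_hz: float) -> List[float]:
    """First-order RC high-pass filter (an assumed design, not from the article)."""
    if not samples:
        return []
    dt = 1.0 / rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    out = [0.0]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def gsr_feature(conductance: Sequence[float], rate_hz: float = 120.0,
                cutoff_hz: float = 0.05, window_s: float = 4.0) -> float:
    """Trapezoidal area under the high-passed signal over the last window_s seconds."""
    filtered = high_pass(conductance, rate_hz, cutoff_hz)
    n = int(window_s * rate_hz) + 1
    window = filtered[-n:]
    dt = 1.0 / rate_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(window, window[1:]))


def pupil_variation(diameters: Sequence[float], rate_hz: float = 120.0,
                    window_s: float = 1.0) -> float:
    """Change in pupil diameter across the last window_s seconds (last minus first)."""
    n = int(window_s * rate_hz) + 1
    window = diameters[-n:]
    return window[-1] - window[0]
```

For example, R-peaks at 0.0, 0.8 and 1.7 s give R-R intervals of 0.8 and 0.9 s, so `last_interval_change` returns approximately 0.1 s.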

#### **MULTIMODAL INTERFACE AND ONLINE COMPUTATIONAL INTEGRATION**

This component of EEVEE translates the different responses (behavioral, physiological, neurophysiological) into vectors that can modulate the motor responses of the avatars (facial expression, eye/head movements) thereby producing a truly interactive task. The signal processing, feature extraction and scenario for the avatars' responses are configurable through a scenario designer implemented in a standalone EEVEE application. All of these parameters are being built into a modifiable, simple, and intuitive drag-and-drop interface. It is also possible to program more complex scenarios, thus enabling a great variety of contexts. The system is currently designed to project one avatar at a time, but the projection of two avatars or more will be part of the next development phase. Depending on the interaction mode selected, the avatar's animation (facial features and posture) can be predetermined or triggered in real time to respond to the observer's behavior, posture or physiology. It is also possible to activate a mirror mode for the avatar: recognizing and imitating the facial expressions of the user (see description of the modes below; see Video 1 in the Supplementary Material for a demo of the mirror mode). A dedicated EEVEE server is used to collect, over the network, all the data acquired through different application program interfaces (APIs) and software development kits (SDKs), making it possible to display the avatar and collect the physiological measures at different sites. The signals are then processed with mathematical routines to extract their salient features and reduce their bulk. Such signal processing includes Fourier transforms, low and high pass filters, smoothing, interpolation, etc.

**FIGURE 5 | Facial expressions of emotions displayed by one of the male avatars used in Experiment 1a. Top row** (left to right): Neutral, Joy, Pain, Sadness. **Bottom row** (left to right): Neutral, Fear, Disgust, Anger.

EEVEE can be used in different modes. In the Offline mode, the animations for emotional behavior can be set in advance in the scenario along with its time course and the multiple cues (responses) from an observer watching the avatar can be recorded for post-experiment analysis. In the Mirror mode, EEVEE can be used to reproduce the movements and facial expressions of the observer (by mimicking the AUs detected). The Real-time EEVEE mode will use different combinations of the observer's responses to guide the avatar's behavior (facial expression, posture, eye activity, animation) and create an interactive exchange based on information not typically accessible to people.

#### **VALIDATION OF EEVEE**

In order to validate the capacity for EEVEE's avatars to convey specific emotional content through facial expressions, two experiments were conducted. A first two-step experiment was run in which participants had to evaluate the avatars' expressions. A second experiment was conducted in order to test whether participants could assign the target emotion, in this case pain, to avatars whose face was partly masked. This latter experiment also included facial expressions of emotions produced by real actors, which allowed the comparison of pain detection in humans and avatars.

#### **EXPERIMENT 1: VALIDATION OF THE AVATARS' EXPRESSIONS**

#### *General objective*

Our first experiment was twofold: Experiment 1a (negative emotion discrimination) consisted in asking healthy adults to discriminate between negative emotions depicted by four different avatars (2 males; 2 females) displaying dynamic emotional expressions; Experiment 1b (pain level discrimination) consisted in asking the same group of adults to specifically evaluate pain intensity in different pain expressions from the same avatars.

#### *Experiment 1a: Negative emotion discrimination*

Our goal in Experiment 1a was to determine whether people could discriminate between four negative emotions (pain, disgust, anger, and fear) depicted by avatars. Pain was the main emotion of interest, but we used fear, anger and disgust as distractors with similar negative valence. These other expressions have the advantage of containing part of the same set of action units (AUs) as facial expressions of pain, such as nose wrinkler, upper lip raiser, brow lowerer, and upper lid raiser (Kappesser and Williams, 2002; Simon et al., 2008; see **Table 1**).

We expected that the participants would differentiate the expressions by attributing more intensity to the target emotions. We also expected that most expressions would lead to the attribution of more than one emotion, as emotions are less prototypical than is often believed (see for instance: Du et al., 2014; Roy et al., unpublished manuscript).

#### *Methods*

*Participants.* The participants in this study were 19 adults (10 women) recruited through advertisement on the campus of Université Laval. They were aged between 20 and 32 years (*M* = 22.6 years; *SD* = 2.93 years). Exclusion criteria consisted in having a neurological or psychiatric disorder, a medical condition causing pain, working in healthcare or with people suffering from painful conditions, or having previously participated in a study on pain expressions. The study was approved by the Research Ethics Committee of the Institut de réadaptation en déficience physique de Québec. Written informed consent was obtained from all participants and they received 10\$ for their participation.

*Material/Task.* Participants were presented video clips of four different avatars showing dynamic facial expressions. The clips displayed the upper body, from the shoulders up, of avatars facing the camera at a 5–10° angle (see **Figures 4**, **5**), dressed neutrally, without hats or accessories. Each clip lasted 3 s, and displayed either neutral expressions or one of the following four negative emotions: pain, disgust, anger and fear. Each emotion was shown at 5 levels (A, B, C, D, and E of the FACS; Ekman and Friesen, 1978), and the neutral clip showed no facial contraction for the whole 3-s clip. In each non-neutral clip, the expression (AUs levels) linearly increased for 2 s from a relaxed state (neutral FACS) to reach the target expression level (either A, B, C, D, or E), which was maintained for 1 s (see examples in Supplementary Material Videos 2–6).
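The time course of a non-neutral clip (a 2-s linear rise followed by a 1-s plateau) can be expressed as a piecewise function. This is a minimal sketch: the function name `clip_intensity` is hypothetical, and the target percentages assume the linear A = 20% to E = 100% spacing of the expression levels.

```python
# Assumed linear mapping of FACS intensity letters to % of maximum intensity.
LEVEL_TO_PERCENT = {"A": 20.0, "B": 40.0, "C": 60.0, "D": 80.0, "E": 100.0}

RAMP_S = 2.0  # linear rise from the relaxed state
HOLD_S = 1.0  # plateau at the target level


def clip_intensity(t: float, level: str) -> float:
    """Expression intensity (% of maximum) at time t seconds into a 3-s clip."""
    if not 0.0 <= t <= RAMP_S + HOLD_S:
        raise ValueError("t must lie within the 3-s clip")
    target = LEVEL_TO_PERCENT[level]
    if t >= RAMP_S:
        return target
    return target * t / RAMP_S
```

One second into a level C clip, for instance, the expression is at half of its 60% target, i.e., 30% of maximum intensity.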

*\*Note that this list is inclusive and that there are individual differences in the degree to which each AU is involved.*

*Procedure.* During the experiment, which lasted about 45 min, participants were gathered in small groups of 1–8 individuals in the front rows of a classroom and were asked to rate a series of video clips displayed on a large screen (175 × 175 cm) placed at the front of the classroom, 3–5 m from the participants. They were asked to rate each video on answer sheets that provided four Visual Analogue Scales (VAS) for each video. After a short tutorial block of 8 trials to familiarize participants with the clip presentation pace and the rating scales, they were presented with four blocks of 21 trials. Within each block (lasting about 7 min), the order of presentation was pseudo-randomized (i.e., constrained to avoid the repetition of three successive clips of the same gender, emotion or level). During each trial, the 3-s clip was repeated 4 times, with an inter-stimulus interval (ISI) of 500 ms. In order to identify the emotion or set of emotions they thought was expressed in each clip, participants were instructed to rate the intensity of each of four emotions displayed by the avatar by making small vertical marks on four separate VAS labeled respectively "Anger," "Disgust," "Pain," and "Fear." The order of the scales within subject was kept constant for each item during the experiment. The order was also the same across participants. This order could have been varied across subjects to avoid potential order effect from the list, but the clips themselves were randomized and this is where an order effect would be most probable, if any.

Each VAS was 10 cm long, with anchors labeled 1 (extreme left) and 100 (extreme right). Participants were explicitly instructed to leave the VAS blank if they thought one emotion was not expressed in the clip, and leave all VAS blank if they did not detect any of the four target emotions. For each VAS, the distance from the left end of the line to the mark made by the participant was measured in millimeters to provide an intensity score (out of 100), with any blank scale attributed a score of 0. A composite score of Total Intensity was computed for each item by adding the four scores provided for this item. An accuracy score was computed for each item as the intensity on the target emotion scale divided by the Total Intensity score for this item. Additionally, a binary concordance score was computed for each stimulus by comparing the maximum score on the four scales and the emotion intended to be expressed by the avatar, independently of the FACS expression level, giving a 1 if the scale with the highest score was the intended target emotion and a 0 in any other case. The proportion of concordant items was then computed for each category of stimuli (Anger, Disgust, Pain, and Fear). For each stimulus category, the 4 VAS scores were compared with a repeated measures ANOVA, taking together all expression levels and using a Bonferroni correction for the *post-hoc* comparison of mean VAS scores.
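The three item-level scores described above (Total Intensity, accuracy, and concordance) can be sketched as follows. The function name `score_item` and the example ratings are hypothetical. Two choices are assumptions, since the text does not say how such cases were handled: an item with all four scales left blank gets an accuracy of 0, and a tie for the highest score counts as non-concordant.

```python
from typing import Dict

SCALES = ("Anger", "Disgust", "Pain", "Fear")


def score_item(vas_mm: Dict[str, float], target: str) -> Dict[str, float]:
    """Score one clip from its four VAS marks (mm from the left end; blank = 0)."""
    scores = {scale: float(vas_mm.get(scale, 0.0)) for scale in SCALES}
    total = sum(scores.values())
    # Accuracy is undefined when every scale was left blank; report 0.0 here
    # (an assumption, not a rule stated in the article).
    accuracy = scores[target] / total if total > 0 else 0.0
    # Concordant only if the target scale alone holds the highest score
    # (tie handling is also an assumption).
    top = max(scores.values())
    winners = [s for s in SCALES if scores[s] == top]
    concordant = 1 if total > 0 and winners == [target] else 0
    return {"total": total, "accuracy": accuracy, "concordance": concordant}


# Hypothetical ratings for one pain clip; the Fear scale was left blank.
example = score_item({"Pain": 40, "Disgust": 25, "Anger": 10}, target="Pain")
```

In this made-up example the Total Intensity is 75, the accuracy is 40/75, and the item is concordant because Pain holds the highest score.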

#### *Results*

The four categories of stimuli were scored (0–100) on average higher on their respective (target) scale (Anger: *M* = 28.3, *SD* = 16.5; Disgust: *M* = 21.9, *SD* = 10.8; Pain: *M* = 31.3, *SD* = 4.3; Fear: *M* = 55.0, *SD* = 4.4) than on the other scales (see **Table 2** for scores of each emotion), confirming that the dominant emotion was correctly detected in the stimulus set. However, subjects often attributed some level of Disgust to both the Anger and Pain stimuli (respectively *M* = 25.7 and *M* = 25.3), showing that Disgust was the emotion most susceptible to being misread in the stimulus set. Fear was the least ambiguous emotion; 73% of the total intensity attributed to fear stimuli loaded on the fear scores, while the accuracy scores were 40% for anger, 43% for disgust, and 40% for pain.

#### **Table 2 | Mean intensity scores on the 4 VAS for all 4 types of stimuli.**

*Stars mark average scores on a VAS significantly different (Bonferroni corrected) from the targeted Stimulus emotion, grayed on the same line.*

The correlations between the mean ratings of emotions provided by the participants and the targeted intensity defined as the percentage of the maximum FACS intensity (A = 20, B = 40, C = 60, D = 80, E = 100) were high and significant for all emotions (Anger: *r* = 0.77, *p* = 0.00005; Disgust: *r* = 0.88, *p* < 0.00001; Pain: *r* = 0.85, *p* < 0.00001; Fear: *r* = 0.94, *p* < 0.00001). The total intensity score captured some extra variance not captured by the specific intensity, as most participants scored each clip on several emotion scales rather than only one, and sometimes misattributed the dominant emotion. This led to higher correlations between the total rating scores and the targeted intensity of the facial expression (Anger: *r* = 0.88, Disgust: *r* = 0.92, Pain: *r* = 0.93, Fear: *r* = 0.90; all *p* < 0.00001).

The proportions of concordant items were very different across the four emotions, ranging from 45 to 100% (Anger: 55%, Disgust: 65%, Fear: 100%, Pain: 45%), suggesting that fear stimuli were all unequivocal, while over half of the Pain stimuli elicited another emotion (Disgust) more intensely than Pain. Overall, Fear was the clearest facial expression, being always detected when present and rarely detected when other facial expressions were displayed. In contrast, Disgust was the most confused emotion overall and it was the emotion most often incorrectly attributed to the Pain clips. Conversely, Pain was the emotion most frequently incorrectly attributed to the Disgust clips. This implies that, for experiments in which it is important to discriminate pain from other emotions, some additional tuning of the avatars will be necessary. As only one AU is common to these two emotions (see **Table 1**), a detailed analysis of the time-course and intensity of this AU is warranted. Moreover, one potential source of ambiguity between emotions could be related to the linearity applied to each emotion. It might be the case that the differential timing between AUs is essential for some emotions but not others.

#### *Experiment 1b: pain level discrimination*

In Experiment 1b, our goal was to establish the relative accuracy of the levels of pain expression as modeled from levels of the Facial Action Coding System (Ekman et al., 2002).

#### *Methods*

After Experiment 1a, participants were asked to complete an assessment of only the Pain clips. Thus, the same 20 pain clips (four avatars, five pain levels each) previously presented were shown again using the same procedure. This time participants were asked to rate only the level of pain they detected in each clip, following the same procedure as in the first part of the experiment, but using only one VAS. This second step in the validation procedure was undertaken because pain will be the target emotion used by EEVEE in interactive paradigms. First, a correlation between the mean pain ratings and the target pain intensity was computed. Then, we entered the Pain evaluations into a mixed-effects ANOVA with Level (5 pain intensities: A, B, C, D, and E) and Stimulus Gender (female or male) as within-subject factors, and Participant Gender (female or male) as between-subject factor. *Post-hoc* comparisons were performed with Student's *t*-test with Bonferroni correction, unilateral for Pain Levels tests, and bilateral for Gender tests.

#### *Results*

The Three-Way ANOVA revealed a significant effect of Pain Level [A: *M* = 7.4, *SD* = 8.4; B: *M* = 23.3, *SD* = 13.5; C: *M* = 32.6, *SD* = 15.7; D: *M* = 58.5, *SD* = 23.4; E: *M* = 78.3, *SD* = 19.8; *F*(4, 17) = 265.9, *p* < 0.001], and all 5 levels were statistically distinct [A vs. B: *t*(18) = 21.8, *p* < 0.00001; B vs. C: *t*(18) = 4.9, *p* = 0.0001; C vs. D: *t*(18) = 17.8, *p* < 0.00001; D vs. E: *t*(18) = 8.5, *p* < 0.00001]. No main effect of Participant Gender on the pain ratings was observed [female: *M* = 37.5, *SD* = 29.1; male: *M* = 42.7, *SD* = 31.7; *F*(1, 17) = 1.4, *p* = 0.26]. There was, however, a significant main effect of Stimulus Gender showing that the pain of male avatars was rated higher than that of female avatars [female: *M* = 36.4, *SD* = 29.5; male: *M* = 43.5, *SD* = 31.0; *F*(1, 17) = 15.3, *p* < 0.001], but there was no interaction between Participant Gender and Stimulus Gender [*F*(1, 17) = 0.68, n.s.], or between Participant Gender and Pain Level [*F*(1, 17) = 1.9, n.s.; see **Figure 6**].

However, the interaction between Pain Level and Stimulus Gender was significant [*F*(4, 17) = 5.1, *p* = 0.001], showing that the pain of male avatars was rated significantly higher than that of female avatars at FACS levels B and D [A: *t*(18) = 0.08, *p* = 0.94; B: *t*(18) = 3.39, *p* = 0.003; C: *t*(18) = 0.45, *p* = 0.66; D: *t*(18) = 3.58, *p* = 0.002; E: *t*(18) = 1.98, *p* = 0.06]. Finally, the interaction between Pain Level, Stimulus Gender and Participant Gender was not significant [*F*(4, 17) = 0.59, *p* = 0.57].

**Gender (2 levels: female, male) [***F***(4, 17) = 5.1,** *p* **= 0.001].** Asterisks mark pain levels for which the male avatar stimuli receive significantly higher pain evaluation than female avatar stimuli (5 *post-hoc* tests, unilateral Bonferroni-corrected *t*-tests, α = 0.01).
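As a rough illustration of the correlation step described in the Methods, the sketch below computes a Pearson coefficient between the targeted FACS intensity (A = 20 to E = 100) and the five level means reported in the Results of Experiment 1b. It is only a sketch: it works on the five level means rather than on clip-wise or participant-wise ratings, so it is not the statistic computed by the authors, and the helper name `pearson_r` is introduced here purely for illustration.

```python
import math

# Targeted intensity (% of maximum FACS intensity) for levels A to E.
target = [20, 40, 60, 80, 100]
# Mean pain ratings per level, as reported in the Results of Experiment 1b.
mean_rating = [7.4, 23.3, 32.6, 58.5, 78.3]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(target, mean_rating)
print(f"r = {r:.3f}")
```

On these five means the coefficient comes out at about 0.99, which mainly reflects the monotonic increase of the ratings across levels A to E.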

#### *Discussion of Experiment 1*

This experiment showed that while pain intensities are correctly estimated both when pain clips are presented alone (Experiment 1b) and among clips of other emotions (Experiment 1a), participants detected on average a mixed set of emotions in all clips. Although the accuracy of emotion detection was somewhat lower than anticipated for all emotions other than fear, the ambiguity caused by the Disgust expression is consistent with what has been reported in the literature about this emotion when expressed either by humans or avatars (Noël et al., 2006; Dyck et al., 2008; Sacharin et al., 2012; Roy et al., 2013; Roy et al., unpublished manuscript). The difference between emotion detections could be related to the timing of the different action units as much as the distinct AUs. This will need further investigation for which the platform will be most useful. The study design, which proposed four rating scales at once, might have contributed to the attribution of multiple types of emotion to most clips. However, recent research suggests that human emotions are complex (not one-dimensional), and this is often reflected in facial expressions. Someone can be joyfully surprised, for instance, or angrily surprised, and not show exactly the same pattern of AUs (Du et al., 2014). This suggests that multidimensional evaluation of emotions and facial expressions could be more ecological, and reflects the fine-grained analysis needed in social interaction situations. Overall, the validation of the avatars was shown to be rather accurate but not entirely specific, suggesting that the intensities of the different AUs composing each emotion can be extracted even in the presence of extra AUs (noise). By introducing different contraction levels on different parts of a face, we may be able to accentuate the facial features that are specific to an emotion over the features that are common to multiple emotions, and attain greater specificity in future experiments. 
Furthermore, we could use software such as Noldus FaceReader™ to help validate the avatar expressions and refine them in an iterative fashion.

#### **EXPERIMENT 2**

#### *Objective and hypotheses*

The specific objective of this second experiment was to evaluate whether different parts of the face, namely the eyes and the mouth region, have the same efficacy in communicating pain information. We expected that masking part of the facial expressions would result in observers attributing less intensity to the pain displayed in both human and avatar models. We also expected that masking the eyes would result in lower estimates of pain intensity than masking the mouth region, based on at least one study suggesting that the eyes communicate mostly the sensory component of pain while the mouth region is more associated with its affective component (Kunz et al., 2012).

#### *Methods*

#### *Participants*

A total of 36 participants (20 women) were enrolled in this experiment through emails sent to Université Laval students and personnel. Their ages ranged from 20 to 35 years old (*M* = 24.2, *SD* = 4.3). Exclusion criteria included having a neurological or psychiatric disorder, a medical condition causing pain, working in healthcare or with people suffering from pain, or having previously been enrolled in a study conducted by a member of our laboratory. The study was approved by the Research Ethics Committee of the Institut de Réadaptation en Déficience Physique de Québec. Written informed consent was obtained from all participants, who received a \$10 compensation for their participation in the study.

#### *Material/Task*

Human clips were taken from a validated set (Simon et al., 2008). Four models (2 males, 2 females) each depicted 3 levels of pain intensity (Low, Medium, High) plus a Neutral expression, for a total of 16 different clips. Based on Experiment 1's results and previous work by others (e.g., Simon et al., 2008), Low, Medium, and High pain from the human clips were matched by 4 avatars (2 males, 2 females) depicting equivalent expressions, which corresponded to FACS levels B, C, and D respectively, plus a Neutral clip. The minimal and maximal expressions (FACS levels A and E) were not used, as they are less frequently encountered in real life and are not well represented either in this human data set or in naturalistic data sets involving patients (e.g., Lucey et al., 2011b). Thus, the experiment comprised 16 clips for each model type (Avatar, Human). This set corresponded to the Unmasked condition (see examples in Supplementary Material Videos 7–10). The Mouth Mask and Eyes Mask conditions were produced by adding a static gray rectangle (mask) over the mouth or the eye regions, respectively, to each clip of the original unmasked set (see **Figure 7**).

#### *Procedure*

The four categories of clips, namely Human-Unmasked (16 clips), Human-Masked (32 clips: 16 clips with eyes masked, 16 clips with mouth masked), Avatar-Unmasked (16 clips), Avatar-Masked (32 clips: 16 clips with eyes masked, 16 clips with mouth masked) were evaluated separately in four different blocks. The block order was counterbalanced: about half the participants started with the unmasked blocks (21 out of 36 participants) while the other half started with the masked blocks. Also, the model type was counterbalanced across participants, with half of the participants starting with the human blocks and the other half starting with the avatar blocks. The order of presentation of the clips within a block was pseudo-randomized, and constrained by rules to avoid more than three successive clips of the same gender, level, or mask within the same block. In this experiment, each clip was presented twice in each trial, with a 2-s black screen following each presentation. As with Experiment 1b, in the non-neutral clips the pain expression linearly increased in intensity for 2 s from the relaxed state (FACS neutral) to reach the maximum expression level (either Low, Medium or High pain), which was maintained for 1 s. Participants were instructed to rate the level of pain they detected in the clip, using the same procedure and pain visual analog scale (VAS) described in Experiment 1b.

A Two-Way ANOVA with Model Order (2 levels: Avatars first or Humans first) and Mask Order (2 levels: non-masked or masked blocks first) as within-subjects factors was conducted to rule out order effects. A mixed-design ANOVA was conducted to rule out participant or stimulus gender effects, with Participant Gender (2) as between-subject factor and Stimulus Gender (2) as within-subject factor. To rule out possible primacy effects, a mixed-design ANOVA was conducted only on the first block from each participant, with Pain Level (Low, Medium, High) as a repeated measure, and Type order (Avatars or Humans first) and Mask order (unmasked stimuli or masked stimuli first) as between-subjects factors. A mixed-design ANOVA was then conducted using a Type(2)∗Level(3)∗Mask(3) design to compare the effect of the three types of masks (No Mask, Eyes Mask, Mouth Mask) on pain evaluations at low, medium and high pain, in avatars and humans. For this last ANOVA, *post-hoc* Student's *t*-tests were conducted to compare pain intensity ratings in the three mask conditions at each pain level, for both human and avatar models, using a Bonferroni-corrected threshold of *p* < 0.0028 (18 tests, *p* < 0.05 family-wise error), unilateral.

**FIGURE 7 | Stimuli used in Experiment 2 showing facial expressions of pain from one of the male avatars.** All stimuli were presented in three conditions: No Mask (**top row**), Eyes Mask (**middle row**), and Mouth Mask (**bottom row**).
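The ordering rule in the Procedure (no more than three successive clips of the same gender, level, or mask within a block) can be sketched as follows. The article does not say how the constrained orders were generated, so the rejection-sampling approach used here (shuffle, check, retry) is an assumption, as are the function names `max_run` and `pseudo_randomize` and the model labels `F1`, `F2`, `M1`, `M2`. The example builds one masked block of 32 clips (4 models × 4 expression levels × 2 masks).

```python
import itertools
import random


def max_run(values):
    """Length of the longest run of identical consecutive values."""
    longest = current = 1
    for prev, cur in zip(values, values[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def pseudo_randomize(clips, keys, max_repeat=3, rng=None, max_attempts=100_000):
    """Shuffle clips until no attribute in `keys` repeats more than
    `max_repeat` times in a row (simple rejection sampling)."""
    rng = rng or random.Random()
    order = list(clips)
    for _ in range(max_attempts):
        rng.shuffle(order)
        if all(max_run([clip[k] for clip in order]) <= max_repeat for k in keys):
            return list(order)
    raise RuntimeError("no valid order found within max_attempts")


# Hypothetical model labels; gender is derived from the first letter.
models = ["F1", "F2", "M1", "M2"]
levels = ["Neutral", "Low", "Medium", "High"]
masks = ["Eyes", "Mouth"]

masked_block = [
    {"model": m, "gender": m[0], "level": lvl, "mask": mask}
    for m, lvl, mask in itertools.product(models, levels, masks)
]

order = pseudo_randomize(
    masked_block, keys=("gender", "level", "mask"), rng=random.Random(0)
)
```

For an unmasked block the `mask` key would simply be left out of `keys`, since every clip in that block shares the same mask condition.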

#### *Results*

First, an ANOVA on pain evaluations with Mask Order and Model Order as within-subjects factors was conducted, and showed no effect of Mask Order [*F*(1, 32) = 0.31, n.s.], Model Order [*F*(1, 32) = 0.44, n.s.] or interaction between these 2 factors [*F*(1, 32) = 1.1, n.s.]. Then a second ANOVA with Participant Gender as between-subjects factor and Stimulus Gender as within-subject factor showed no effect of Participant Gender [*F*(1, 34) = 0.17, n.s.], but a significant effect of Stimulus Gender [female pain: *M* = 34.8, *SD* = 28.4, male pain: *M* = 38.2, *SD* = 29.5; *F*(1, 34) = 56.9, *p* < 0.001]. No significant interaction was found between these two factors [*F*(1, 34) = 0.1, n.s.]. The Type∗Level∗Mask ANOVA on the first blocks showed no effect of Mask Order [*F*(1, 32) = 0.03, n.s.], Model Order [*F*(1, 32) = 0.68, n.s.] or interaction between these 2 factors [*F*(1, 32) = 0.3, n.s.].

The Type(2)∗Level(3)∗Mask(3) ANOVA yielded a significant main effect of Pain Level [Low: *M* = 30.2, *SD* = 19.3; Medium: *M* = 47.6, *SD* = 20.8; High: *M* = 63.9, *SD* = 22.8; *F*(2, 70) = 380.8, *p* < 0.00001]. The main effect of Mask was not significant [No Mask: *M* = 46.5, *SD* = 25.4; Mouth Mask: *M* = 48.3, *SD* = 24.7; Eyes Mask: *M* = 47.0, *SD* = 25.2; *F*(2, 70) = 1.40, n.s.], nor was the main effect of Model Type [avatars: *M* = 47.4, *SD* = 23.5; humans: *M* = 47.1, *SD* = 26.7; *F*(1, 35) = 0.2, n.s.]. The interaction between Model Type and Pain Level was marginally significant [*F*(2, 70) = 3.0, *p* = 0.073]. The way the stimuli were masked also interacted significantly with the type of model [Type∗Mask: *F*(2, 70) = 8.7, *p* < 0.001] and the level of pain presented [Level∗Mask: *F*(4, 140) = 3.1, *p* = 0.02]. There was also a significant three-way interaction [Type∗Level∗Mask: *F*(4, 140) = 6.5, *p* < 0.001].

The effect of masking on pain intensity ratings depended on both the type of model and the pain level presented (see **Figure 8**). Because of the large number of *t*-tests performed (18), only significant differences are presented here. For avatars, intensity ratings for the Low pain level stimuli were significantly higher with the Eyes Mask than with the Mouth Mask [*t*(35) = 3.28, *p* = 0.002]. For the Medium pain stimuli, intensity ratings for avatars were lower in the Mouth Mask than in the No Mask condition [*t*(35) = 4.43, *p* = 0.0001]. For human stimuli, at Low and Medium pain levels, intensity ratings in the Mouth Mask condition were significantly higher than in the No Mask condition [Low pain: *t*(35) = 4.45, *p* = 0.0001; Medium pain: *t*(35) = 3.54, *p* = 0.001]. At Medium pain, ratings in the Mouth Mask condition were also significantly higher than in the Eyes Mask condition [*t*(35) = 3.37, *p* = 0.002]. Masking did not have any significant effect on pain evaluation in the High pain condition for either avatar or human models. All other comparisons were non-significant. Thus, overall, masking the eyes or mouth region had different effects on human and avatar pain expressions; the pain evaluation in humans tended to increase when the mouth was masked, while no clear pattern emerged in avatars.
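The Bonferroni threshold used for these *post-hoc* tests can be checked with a few lines of arithmetic. The sketch below assumes that the 18 tests are the three pairwise mask comparisons at each of the three pain levels for each of the two model types (2 × 3 × 3), which matches the count given in the Methods but is our reading of the design. The dictionary holds the p-values reported above for the significant comparisons, and the names `bonferroni_threshold` and `reported_p` are illustrative only.

```python
from itertools import combinations

MODEL_TYPES = ("Avatar", "Human")
PAIN_LEVELS = ("Low", "Medium", "High")
MASKS = ("No Mask", "Eyes Mask", "Mouth Mask")

# One pairwise mask comparison per model type and pain level (assumed layout).
comparisons = [
    (model, level, pair)
    for model in MODEL_TYPES
    for level in PAIN_LEVELS
    for pair in combinations(MASKS, 2)
]


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, len(comparisons))

# p-values of the comparisons reported as significant in the Results.
reported_p = {
    ("Avatar", "Low", "Eyes Mask vs. Mouth Mask"): 0.002,
    ("Avatar", "Medium", "Mouth Mask vs. No Mask"): 0.0001,
    ("Human", "Low", "Mouth Mask vs. No Mask"): 0.0001,
    ("Human", "Medium", "Mouth Mask vs. No Mask"): 0.001,
    ("Human", "Medium", "Mouth Mask vs. Eyes Mask"): 0.002,
}
significant = {key: p < threshold for key, p in reported_p.items()}

print(len(comparisons), round(threshold, 4))  # prints: 18 0.0028
```

All five reported p-values fall below the corrected threshold of roughly 0.0028, consistent with their being listed as significant.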

#### *Discussion of experiment 2*

Based on previous literature, it was expected that masking part of the facial expressions of pain would reduce intensity ratings for all stimuli, with eyes masks leading to the lowest evaluations. However, the main findings from Experiment 2 suggest a different and more complex pattern, with a significant three-way interaction between Mask, Model Type and Pain Level indicating that the effect of masking on perceived pain intensity varied according to both the type of model (avatar or human) and the level of pain displayed (low, medium, high). For high pain stimuli, masking part of the model's face (either the eyes or the mouth) had no effect on the participants' evaluations of pain intensity displayed, regardless of whether the model was an avatar or a human. For low and medium pain stimuli, while there was an effect of masking on perceived pain intensity, it was different for avatars and humans. For human stimuli, in line with our hypotheses, masking the mouth region resulted in higher intensity ratings than when the whole facial expression was perceived. For avatars however, masking the mouth led to lower perceived intensity compared to the non-masked presentation for medium pain stimuli, while no difference was found between these two conditions for low pain stimuli. Taken together, these results suggest that different facial areas could convey information best for different levels of pain. In this study, the eyes appeared to convey more intensity in human, but not in avatar models.

The differences observed between the two model types are not completely unexpected, as the facial expressions of the avatars were not based on the human models that were used in this study, but were created from the general and analytic principles of the FACS. Furthermore, the intensity was modified by changing all AUs equally, which may not be how the intensity of pain expressions varies in humans. In fact, it is likely that the different AUs involved in a given expression follow different time courses and that different emotions have different level spans for each AU, which would guide our attention to the specific aspects of emotional facial expressions. EEVEE is a valuable tool in the exploration of how fine-grained changes in facial expressions and micro-expressions can affect the communication of pain and other emotions. One caveat to note is that the human stimuli used in the current study may not be the best standard for natural facial expressions, as they show actors pretending to be in pain. While these stimuli are well validated, it will also be important to examine how the avatar stimuli created for EEVEE compare to natural facial expressions of emotions. While a set of well-validated stimuli presenting natural facial expressions of pain is available (Lucey et al., 2011b), this step may require improved modeling of facial activity in ecological facial expressions of other emotions. The finding that pain intensity ratings were significantly higher for male than for female models will need to be explored further, because even if this pattern was found in actors (Simon et al., 2006), it also seems to be inconsistent with some data suggesting that male pain is often underestimated compared to female pain (Robinson and Wise, 2003).

level are significantly different from pain evaluation at other mask levels (18 *post-hoc* tests, with unilateral Bonferroni-corrected *t*-tests, α = 0.0028).

#### **GENERAL DISCUSSION**

The objective of this article was to present the development of a new VR platform, the Empathy-Enhancing Virtual Evolving Environment (EEVEE) to study empathy in an interactive environment and to foster the interest for collaborative work in this domain. At this stage of development of EEVEE, we have successfully created a novel avatar architecture that can independently modulate in real time different components of its facial expression based on the FACS (Ekman et al., 2002). The first series of validation experiments showed that the avatars can produce distinct negative emotions that people can correctly identify, including pain, which is not always part of experimental emotion stimuli sets. Moreover, people readily recognized changes in intensity levels implemented through a nonlinear combination of blend shapes and animation. Not surprisingly, the more intense the emotion, the better it was identified. Note that informal debriefing about the avatars also taught us that the use of expression lines and wrinkles added realism to the avatars. Some refinements are still required, notably between disgust and other emotions, for which it will be important to conduct further experiments. Single emotion experiments can readily be launched with high confidence in the ability of avatars to express a specific emotion. For instance, the last validation experiment provided sample data on the detectability of pain faces when part of the face is covered. The first interesting finding was that the gender of the model affected pain intensity ratings in that higher ratings were provided for male than female avatars and actors. Interestingly, this finding is consistent with brain imaging data showing that male facial expressions of pain yield more activation in emotion related circuits (amygdala) than female facial expressions of pain (e.g., Simon et al., 2006) and extends it to avatars. 
The fact that the male and female avatars were based on the same mesh and used the same scale to modify the facial expression adds to the argument that this difference between intensity ratings of FACS-based expressions cannot be attributed to differences in the intensity of the expression *per se*, but rather points to a socio-cultural bias through which we attribute, for equivalent pain faces, more pain to male models. The main findings related to the masking procedure showed that at high pain levels, masking the eyes or the mouth does not change the accuracy of the evaluation, either for avatars or for real actors mimicking pain faces. At low and medium levels of pain the pattern was more complex; for instance at medium levels of pain, masking the mouth area yielded lower pain ratings for avatars, but higher pain ratings for human faces, compared to the unmasked condition.

Overall, these validation experiments are encouraging and confirm the potential of EEVEE for a number of experiments in which the parametric modulation of dynamic facial expressions is essential. Future developments will include refinements in the avatars' expressions, the creation of novel avatars (of different age, gender, and ethnic background) and environments. These improvements will be based on a series of planned experiments, which will target different questions such as the influence of context (such as hospital room versus office) on the perception of an avatar's emotions. EEVEE also offers much more than new avatars as the platform is now ready to record a number of behavioral and physiological responses of the observer, synchronized with the expression of the avatars. This allows the deployment of experiments documenting the different markers that are associated with the resonance response when observing emotions in other people. Ongoing pilot work simultaneously recording heart rate, respiration rate, skin conductance, facial expressions and eye movements (via eye-tracking) in participants observing the emotional avatars will provide a first set of data to explore interindividual similarities and differences in the physiological basis of empathy (Jauniaux et al., 2014, MEDTEQ meeting, Quebec City, October 2014).

#### **EEVEE IN REAL TIME: PROOF OF CONCEPT**

In parallel to the refinement of the avatars, we have also begun the next development stage of EEVEE, which involves the modulation in real-time of an avatar's facial expression based on an observer's physiological responses. The current version of EEVEE can process online data recorded from a participant and modulate the facial expression of the avatar in response to it (see Video 1 for an example of changes in the avatar based on a participant's facial expression). Such a set-up can be used for instance to train people to react more (or less depending on the context) to emotional facial expressions (see part III in Future Directions).

#### **FUTURE DIRECTIONS**

In our laboratory, at least three main research themes are currently benefiting from the development of this platform: (1) pain communication and empathy, (2) the study of social cognition deficits in people with psychiatric and neurological disorders, and (3) the development of means to optimize empathy.

#### *Pain communication and pain empathy*

The added value of EEVEE for research on pain communication (Hadjistavropoulos et al., 2011) is immense. For instance, EEVEE will enable the systematic investigation of neurophysiological parameters of pain decoding by making it possible to manipulate and adapt the model in pain online. This will allow the identification of the most robust biomarkers of empathy for pain, and reveal different combinations of markers for different groups of individuals. EEVEE will also enable changes in the relationship between the observer and the target in pain by using specific avatars of known people (spouse, family member, friend, etc.) or people with a specific role toward the observer (physician, nurse, psychologist, etc.). We will be able to control several important variables by using avatars to compare for instance the neurophysiological and prosocial responses to seeing one's spouse compared to a stranger in pain (Grégoire et al., 2014). Finally, EEVEE will help the systematic study of inconsistencies between multiple emotional cues (e.g., voice vs. facial expression) and the detection of very low levels of expressions found in certain clinical populations (e.g., premature newborns, people with dementia), which can have a huge clinical significance.

#### *Social cognition deficits*

A second research axis where EEVEE could lead to important advances is the study of empathy in clinical populations showing empathy deficits, including schizophrenia, autism, personality disorders and traumatic brain injury. This axis is complementary to the first as it is based on the same models of empathy and also uses the pain of others as a model to trigger emotional responses. Initial studies have led to a number of discoveries, notably that people with high traits of psychopathy show a greater resonance response during the observation of pain (but still less empathy) than people with low traits (Marcoux et al., 2013, 2014), suggesting that they do detect pain in others. EEVEE could help determine, through controlled social scenarios, which component of empathy is specifically affected in this population. Research conducted with people having a first episode of psychosis shows that they have low to moderate levels of social cognition deficits on pencil and paper tests, which does not seem to reflect the full range of social interaction difficulties from which they typically suffer. EEVEE would help demonstrate more specifically which neuro-cognitive processes underlie these deficits and how they impact actual social interactions. EEVEE will provide several advantages, such as the possibility to introduce conflicting information in a controlled manner (for example, the avatar saying something but displaying an incongruent facial expression), to modulate affective content in a supervised environment, and to place participants in varied and complex ecologically relevant social situations (such as familiar or less familiar, formal or informal) to study the full range of human social interactions.

#### *Optimizing empathy*

EEVEE can also be used as an intervention tool by exploiting the changes in a participant's physiological and neurophysiological responses to modify the avatar's facial expression, as well as their behavioral and communicative responses, so as to steer the participant toward an empathic response (a form of bio/neurofeedback). For instance, a participant could see the avatar of his spouse, who suffers from chronic pain, displaying different levels of facial expressions of pain, which will be modulated according to a predetermined combination of neurophysiological parameters (e.g., gaze directed at the right facial features, skin conductance showing elevated affective response, etc.). The avatar would then express relief only when the participant's responses are compatible with an empathic state, but not with high levels of anxiety and distress. Verbal cues can be added periodically to add reinforcement. The same type of design could also be applied to parent-child dyads or healthcare professionals and patients. EEVEE will thus make an excellent platform to train healthcare professionals to detect and manage pain. Other groups have recently reported encouraging findings using cognitive strategies with nurses (Drwecki et al., 2011) and computerized training with physicians (Tulsky et al., 2011). In combination with neurostimulation techniques (see Hétu et al., 2012), EEVEE will allow for a greater range of behavioral changes, by targeting specific behaviors and triggering online stimulation to modify the cortical excitability of subjects. A simple design has shown that transcranial Direct Current Stimulation (tDCS) applied to the dorsolateral prefrontal cortex can change the way people rate static pictures of pain (Boggio et al., 2009). A better understanding of this process, with a controlled and ecological interface such as EEVEE, would allow the systematic investigation of different stimulation paradigms that can promote certain behaviors and reduce others. 
These examples underline the great potential of EEVEE to lead to personalized training programs complementary to cognitive approaches.

#### **CONCLUSION**

EEVEE was designed as a flexible and powerful tool to study the different processes underlying human social interaction, with special emphasis on empathy. While EEVEE will help uncover the neurophysiological basis of these complex processes, it also has great potential for the study of human-machine interactions. Although considerable work has been done in the development of valid and finely modeled static and dynamic avatar facial expressions (see for example FACSGen, in Krumhuber et al., 2012), no freely accessible system, to the best of our knowledge, has combined physiological measures and facial expressions of avatars into a dynamic social interactive tool. The addition of new inputs (e.g., postural analysis and speech recognition) and outputs (speech production), as well as the implementation of machine learning algorithms on the large quantity of data that will be generated, will bring EEVEE to the next level. Already, EEVEE is a unique platform for studying empathy in a number of populations suffering from neurological and psychiatric disorders. Currently, the platform allows the production of different levels of facial expressions of the following emotions for four different avatars (2 males, 2 females): Pain, Anger, Disgust, Fear, Joy, Surprise, Sadness. These avatars can currently be shared with the scientific community as separate video clips by contacting the corresponding author. Once the complete platform is compiled in an executable format, it will be made available through the corresponding author's website. Making this platform available to the scientific community is a priority for our team as this will propel its development and the likelihood that it will contribute directly to improving social interactions in humans, which in turn can improve the quality of life of many different clinical populations.

#### **ACKNOWLEDGMENTS**

This work was made possible thanks to funding from the Canadian Foundation for Innovation and Natural Sciences and Engineering Research Council of Canada to PLJ. PLJ was also supported by salary-grants from the Canadian Institutes of Health Research and the Fonds de recherche du Québec – Santé. We also extend our gratitude to all the people from the lab who participated in many challenging discussions throughout the development of EEVEE and in particular to Marie-Pier B. Tremblay whose help was key in conducting the validation experiments, and to Sarah-Maude Deschênes and Fanny Eugène for their helpful comments on the manuscript.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2015.00112/abstract

**Demo |** Video 1 demonstrates the interactions between an avatar in the EEVEE platform and the FaceReader™ software. The avatar mirrors the pain action units (AU 6, 7, 9, 10, and 43) of the participant as detected by FaceReader™. As described in Section Pain Facial Actions Units, each blendshape's value increases linearly with the FACS intensity and is interpolated over time.
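The mirroring described for Video 1 can be illustrated with a minimal sketch. It assumes that each detected action unit arrives as a FACS intensity letter (A to E, or `None` when absent), that letters map linearly to blendshape weights using the 20 to 100% scale of Experiment 1, and that weights move toward their targets by a bounded step per frame. None of this reflects the FaceReader™ interface or the actual EEVEE code, and the names `target_weights` and `step_toward` are hypothetical.

```python
# Pain-related action units mirrored by the avatar (from the Demo description).
PAIN_AUS = (6, 7, 9, 10, 43)

# Linear mapping from FACS intensity level to blendshape weight (0 to 1).
LEVEL_TO_WEIGHT = {"A": 0.2, "B": 0.4, "C": 0.6, "D": 0.8, "E": 1.0}


def target_weights(detected):
    """Map detected AU levels (AU number -> letter or None) to target weights.

    Action units that are missing or undetected get a weight of 0.0."""
    return {au: LEVEL_TO_WEIGHT.get(detected.get(au), 0.0) for au in PAIN_AUS}


def step_toward(current, target, max_delta):
    """Move each weight toward its target by at most max_delta (one frame)."""
    updated = {}
    for au in PAIN_AUS:
        diff = target[au] - current[au]
        updated[au] = current[au] + max(-max_delta, min(max_delta, diff))
    return updated


# Example: the participant shows AU 6 at level D and AU 43 at level B.
weights = {au: 0.0 for au in PAIN_AUS}
goal = target_weights({6: "D", 43: "B"})
for _ in range(10):  # ten frames, at most 0.1 of change per frame
    weights = step_toward(weights, goal, max_delta=0.1)
```

Bounding the per-frame change is one simple way to obtain a smooth transition; other interpolation schemes would fit the same description.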

**Examples of emotions |** This video set illustrates the different emotions used in Experiment 1 (anger, disgust, fear, pain, plus neutral) expressed by one of the avatars at level D (80% of maximum Action Units intensity): Video 2 = Anger Level D; Video 3 = Disgust Level D; Video 4 = Fear Level D; Video 5 = Pain Level D; Video 6 = Neutral.

**Pain levels |** This video set illustrates different levels of pain (B = 40%, C = 60%, D = 80% of maximum Action Units intensities) used in Experiment 2 and expressed by one avatar: Video 7 = Pain Level B; Video 8 = Pain Level C; Video 9 = Pain Level D; Video 10 = Neutral.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 October 2014; accepted: 15 February 2015; published online: 10 March 2015.*

*Citation: Jackson PL, Michon P-E, Geslin E, Carignan M and Beaudoin D (2015) EEVEE: the Empathy-Enhancing Virtual Evolving Environment. Front. Hum. Neurosci. 9:112. doi: 10.3389/fnhum.2015.00112*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2015 Jackson, Michon, Geslin, Carignan and Beaudoin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Gelotophobia and the challenges of implementing laughter into virtual agents interactions

#### **Willibald F. Ruch<sup>1</sup>\*, Tracey Platt<sup>1</sup>, Jennifer Hofmann<sup>1</sup>, Radosław Niewiadomski<sup>2</sup>, Jérôme Urbain<sup>3</sup>, Maurizio Mancini<sup>2</sup> and Stéphane Dupont<sup>3</sup>**

<sup>1</sup> Personality and Assessment, Department of Psychology, University of Zurich, Zurich, Switzerland

<sup>2</sup> InfoMus Laboratory, Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genova, Genova, Italy

<sup>3</sup> TCTS Laboratory, Numediart and InforTech Research Institutes, Faculté Polytechnique de Mons, University of Mons, Mons, Belgium

#### **Edited by:**

Eric Brunet-Gouet, Centre Hospitalier de Versailles, France

#### **Reviewed by:**

Eric Brunet-Gouet, Centre Hospitalier de Versailles, France

Albert Moukheiber, INSERM, France

#### **\*Correspondence:**

Willibald F. Ruch, Personality and Assessment, Department of Psychology, University of Zurich, Binzmuhlestrasse 14/7, Zurich CH8050, Switzerland e-mail: w.ruch@psychologie.uzh.ch

This study investigated which features of AVATAR laughter are perceived as threatening by individuals with a fear of being laughed at (gelotophobia) and by individuals without gelotophobia. Laughter samples were systematically varied (e.g., intensity, pitch, and energy for the voice; intensity of facial actions for the face) in three modalities: animated facial expressions, synthesized auditory laughter vocalizations, and motion capture generated puppets displaying laughter body movements. In the online study, 123 adults completed the GELOPH<15> (Ruch and Proyer, 2008a,b) and rated randomly presented videos of the three modalities for how malicious, how friendly, and how real the laughter was (0 = not at all to 8 = extremely). Additionally, an open question asked which markers led to the perception of friendliness/maliciousness. The current study identified features in all modalities of laughter stimuli that were perceived as malicious in general, and some that were gelotophobia specific. For facial expressions of AVATARS, medium intensity laughs triggered the highest maliciousness ratings in the gelotophobes. In the auditory stimuli, the fundamental frequency modulations and the variation in intensity were indicative of maliciousness. In the body, backwards and forward movements and rocking vs. jerking movements distinguished the most malicious from the least malicious laugh. From the open answers, the shape and appearance of the lips curling induced feelings that the expression was malicious for non-gelotophobes, whereas the movement round the eyes made the face appear friendly. This was opposite for gelotophobes. The laughter of gelotophobia-savvy AVATARS should be of high intensity, contain lip and eye movements, and be a fast, non-repetitive, voiced vocalization that is variable and of short duration. It should not contain any features that indicate a down-regulation in the voice or body, or indicate voluntary/cognitive modulation.

**Keywords: gelotophobia, laughter, social phobia, virtual agent**

#### **INTRODUCTION**

Virtual environments and virtual agents are becoming increasingly popular in various domains, such as e-learning [e.g., Zaíane (2002)], intervention programs [e.g., Rinck et al. (2010)], games [e.g., Adobbati et al. (2001)], and websites [i.e., in online shops; e.g., Grenci and Todd (2002)]. Thus, various fields are concerned with how such environments and the virtual characters that interact within them can be made more natural and rewarding. In an attempt to increase the pleasurable component of interacting with a virtual agent, increase engagement, and enhance the communication outcome, smiling and laughter have been implemented as expressive features of positive affect [e.g., Niewiadomski et al. (2010), Ochs et al. (2012), Hofmann (2014), Hofmann et al. (under review)]. The functions of smiling and laughter are highly significant for social interactions and communication [e.g., Chapman (1983), Glenn (2003), Holt (2010)]. Thus, any virtual interface performing the role of a companion or tutor, or simply programmed to interact with human beings, will benefit considerably from being able to utilize smiling and laughter correctly, as well as to detect such displays and respond to them adequately.

While for most individuals the interaction with a laughing virtual agent will be conducive to positive affect (through mimicry and emotional contagion), for some individuals negative effects are to be expected. Individuals with a fear of being laughed at [gelotophobia, a universal phenomenon related to, but sufficiently distinct from, social phobia; see Ruch and Proyer (2008a,b)] do not appreciate laughter, but see it as threatening and as a weapon [for a review, see Ruch et al. (2014)]. Thus, it is assumed that the same would hold true for gelotophobes when confronted with the laughter of a virtual agent. As such, the attempts to make virtual encounters more natural and rewarding by implementing smiles and laughter may lead to an aversive experience and a breakdown of the interaction for gelotophobes. Therefore, the responses of gelotophobes to virtual laughter should be studied to (a) find out how virtual laughter can be programmed to be "gelotophobia-friendly," (b) specify the fear triggers in virtual agent laughter for a better insight into the phenomenon of gelotophobia, and (c) eventually develop programs that help gelotophobes to re-evaluate laughter by training them with laughing virtual agents. Indeed, using virtual agents to investigate social cognitive disorders has proven useful. For example, a study on social anxiousness (Vrijsen et al., 2010) showed that, in general, individuals high in social anxiety did not appreciate the subtle mimicry behavior of virtual agents. Conversely, a programmable AVATAR with precisely known trigger levels that can be decreased over time as a desensitization tool may be useful in cognitive behavioral therapies for the treatment of gelotophobia. Therefore, the current study investigates the fear triggers in virtual agents across different modalities (face, voice, and body) for gelotophobes and individuals without a fear of being laughed at.

As yet, the current DSM (DSM-IV TR, 2000) does not recognize gelotophobia as a diagnosable condition. It was primarily observed in interactions between therapist and patient. An article by Titze (2009) described vignettes of interactions with clients in a clinical setting who expressed concerns relating to fearing laughter and who held the belief that they were indeed credible objects of derision. These descriptive criteria allowed Ruch and Proyer (2008a,b) to build an efficient 15-item self-report instrument (GELOPH<15>) that allowed for identifying gelotophobes at four different levels with none, slight, pronounced, and extreme fear of being laughed at. This shows gelotophobia is best conceptualized on a continuum where at the higher levels of this continuum gelotophobes consistently anticipate the shame induced by all laughter, even friendly laughter, in a fearful and negative way, making the assumption that it is malicious. Cross-cultural investigations show gelotophobia to be a universal phenomenon (Ruch et al., 2014).

For example, Forabosco et al. (2009) investigated gelotophobia in psychiatric patients. The authors went through the literature including the terms *humor/laughter* and *psychiatric patients* and identified very few studies dealing with both phenomena. The authors thus concluded, "the capability to positively experience humor and laughter are often compromised in psychiatric disorders, though in a somewhat different way" (p. 236). They then tested patients with (1) personality disorders; (2) schizophrenic disorders; (3) mood disorders; (4) anxiety disorders; and (5) eating disorders with the Italian version of the GELOPH<15> to investigate whether gelotophobia was more prominent in psychiatric patients. The overall prevalence of gelotophobia was found to be higher among the psychiatric patients and highest for personality disorders and schizophrenic disorders. This heightened level of gelotophobia among psychiatric patients was also found in a Russian sample (Ivanova et al., 2012). Furthermore, the patients who had been in psychiatric care for longer than 5 years were found to have significantly higher levels of gelotophobia. Broadening the scope of psychiatric disorders, Weiss et al. (2012) found a partial overlap between Cluster A personality disorders and also schizoid and schizotypal personality disorders.

Carretero-Dios et al. (2010) investigated the fear of being laughed at and social anxiety. They showed that there were definite overlaps, yet there were still unique qualities to gelotophobia, which could not be accounted for by the measures of social phobia. Similarly, Edwards et al. (2010) found that gelotophobia was strongly related to, but still distinct from, social phobia. By taking a closer look at a sample of extreme gelotophobes, three distinct components within gelotophobia were separated. Platt et al. (2012) were able to clearly demonstrate what the distinctions were, as well as what the similarities were, compared to social phobia. The first factor was "coping with derision." Gelotophobes cope by controlling their environment and the situations they are in to ensure that no one is laughing at them, by internalizing the belief that they are a valid object of derision and thus reconciling themselves to being laughed at, or by social withdrawal. This latter facet is the one that links to known behaviors associated with social phobia (Rapee and Spence, 2004). However, gelotophobia also has two further components. The second is a "paranoid sensitivity to anticipated ridicule," with paranoid sensitivity here referring to gelotophobes' suspicion and belief that they will be the targets of laughter, even when this is unsubstantiated. The third factor is "disproportionately negative responses toward being laughed at," which is also specific to gelotophobia, just like the factor "coping with derision."

Therefore, it is legitimate to investigate this construct without the need to further investigate other aspects of social phobia, especially with the advent of computerized systems where laughter will be integrated into the interface. Still, we included a measure of social anxiety in the current study to see whether the effects of gelotophobia would still remain after controlling for social anxiety.

Gelotophobes have already been shown to perceive laughter differently from individuals with no fear of being laughed at (Ruch et al., 2014, 2015). They have also been shown to display fewer facial expressions of joy (Duchenne smiles and laughs) in an interview on positive emotions compared to non-gelotophobes (Platt et al., 2013). This effect was found to be stronger for both the frequency and the intensity of joy smile responses toward laughter-eliciting enjoyable emotions than for the non-laughter-eliciting enjoyable emotions. Those who do not have gelotophobia responded positively, displaying smiles more strongly to laughter-eliciting than to non-laughter-eliciting enjoyable emotion expressions. However, individuals with a pronounced level of gelotophobia showed the reverse pattern, displaying fewer joy smiles to the laughter-eliciting emotions. Therefore, those with gelotophobia may indeed jeopardize situations where the elicitation of positive affect or laughter is required. Furthermore, gelotophobes lack the ability to clearly judge the positive role of laughter and are overly inclined to judge facially expressed joy as containing contempt, compared to individuals with no fear (Hofmann et al., under review). Yet, nothing is known about which features of the expression of laughter are perceived as malicious, as laughter is far more than simply a facial expression and also includes a host of body movements and vocalizations (Ruch and Ekman, 2001).

For the purpose of the current study, we investigated laughter expressed through face, voice, and body to identify which laughter features are perceived as malicious and consequently identify features that should be avoided when creating gelotophobia-friendly laughter. We manipulated synthesized laughter along theoretically derived dimensions and showed the resulting stimuli to a sample of individuals with and without gelotophobia. The participants rated the laughter on the dimensions of maliciousness, friendliness, and realness, and described in an open answer format which features make a given laugh friendly or malicious. In the next paragraphs, we describe our hypotheses on each laughter modality separately.

The most extensively studied aspects of gelotophobes' responses to laughter have been with facial expressions (e.g., Ruch et al., 2014; Hofmann et al., under review). Yet, facial displays of joy were always investigated holistically, and only one study targeted single facial features [and their intensity; see Hofmann (2014)]. In the current approach, joyful laughter stimuli in different intensities are utilized. For the gelotophobes, we expect that low and medium intensity laughs are perceived as more malicious compared to high-intensity laughs, as the former may give the impression of being cognitively regulated. This is because spontaneous and cognitively regulated expressions originate in different areas of the brain (Rinn, 1984). Emotion expressions, namely, the observable verbal and non-verbal behaviors that communicate an internal emotional or affective state, are quicker to appear than those representing cognitively motivated displays. Therefore, one can assume that the slower the modality, the more it will appear as cognitive. In this instance, cognitive laughter expression refers to an expression that is voluntarily produced. For the non-gelotophobes, we expect that the ratings are primarily a function of the laughter intensity (stemming from perceptual studies and FACS codes). Thus, we expect that the effects of the stimulus intensity (low, medium, and high) on the ratings of maliciousness, friendliness, and realness will be different for the gelotophobia and non-gelotophobia groups.

In terms of body movements, no psychological study has yet investigated the perception of laughter body movements in gelotophobia. Wallbott (1998) investigated the effect of movement intensity on perceived valence/intensity and found that the ratings were predicted by the movement intensity. Thus, we chose intensity as an initial differentiating dimension to judge body movements of laughter. Five stimuli of low intensity and five of high intensity were rated. We assume that gelotophobes will generally perceive body movements of laughter as more malicious and less friendly than non-gelotophobes. Concerning the qualitative analysis of the laughter body movement features, we expect that movements that seem cognitively modulated, regulated, or restrained lead to higher perceived maliciousness.

For the auditory laughter displays, we generated hypotheses based on the results of a pilot study (gelotophobes were asked to identify maliciousness indicators of laughter and they nominated features like laughter energy and pitch) and knowledge on vocal laughter perception in general. We hypothesize that, in general, the modifications that give the impression that the synthesized laughter sound is cognitively regulated will be the ones that lead to higher maliciousness ratings, compared to the original laugh. In line with past research, we expect that the variability in pitch and rhythm influences the maliciousness, friendliness, and realness ratings [see also Bryant and Aktipis (2014) and Kipper and Todt (2001, 2003)]. We expect that sounds with modulated fundamental frequency (F0) variability, "slower" sounds, and sounds with decreased intensity should be perceived as more malicious, less friendly, and less real compared to the original [see also Bryant and Aktipis (2014)]. Furthermore, we expect a general trend of gelotophobes rating auditory laughter stimuli as more malicious and less friendly [see also Ruch et al. (2009)].

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

The sample consisted of 123 participants who completed at least part of the survey (at least one modality); 71 of them completed all three modalities. All participants were English-speaking adults. The overall gender distribution was 78 males and 122 females. Ages ranged from 18 to 73 years (*M* = 33.80; SD = 13.37). Ten percent of participants had a secondary school education, and 21% had a gymnasium/high-school education. The largest group (37%) had a university degree and 23% had a post-graduate degree. Those who had completed an apprenticeship or had been educated in a technical college accounted for 3% of the sample. Participants took part in the survey through their personal Internet access.

#### **INSTRUMENTS**

The *GELOPH<15>* (Ruch and Proyer, 2008a,b) is a questionnaire assessing the level of the fear of being laughed at (i.e., gelotophobia) consisting of 15 items in a 4-point answer format (1 = *strongly disagree* to 4 = *strongly agree*). A sample item is "When others laugh in my presence I get suspicious." Cronbach's alpha (0.89) in the present sample is comparable to the English norm sample (α = 0.90; Platt et al., 2009).

The *social phobia inventory* (SPIN; Connor et al., 2000) is a 17-item self-report questionnaire, which assesses symptoms of social phobia on 3 dimensions (fear, avoidance, and physiological arousal). Respondents are required to answer how much they were bothered by particular symptoms during the past week, measured on a 5-point scale ranging from 0 (*not at all*) to 4 (*extremely*). The SPIN has good internal consistency and discriminant validity.

#### **PROCEDURE**

*Material production.* Stimuli were produced for each modality separately (face, voice, and body). The basis for all animations was joyful laughs elicited by amusement [corpora by Urbain et al. (2010) and Niewiadomski et al. (2013)]. The specific criteria for each modality are reported in the respective section.

*Face.* In order to generate the virtual agents' facial expressions, several steps were required. The first step was the selection, by three authors, of six episodes of joyful laughter (two low, two medium, and two high intensity) from the freely available AVLC corpus (Urbain et al., 2010). The selection criteria were (a) equal distribution of intensity levels [as rated by naive participants on the laughter globally, see Niewiadomski et al. (2012a,b), and assessed by the intensities of the AUs], (b) smooth animation, and (c) good visibility of the face (i.e., no extreme head turns or head up/down). Each episode contains precisely one laugh. All laughs were voiced laughs of durations between 2 and 7 s. All the episodes were of one female subject ("Subject 5" in the AVLC corpus).

Next, the facial expressions of the chosen videos were processed with a freely available facial tracker (Saragih et al., 2011) and the tracked 2D data were retargeted onto the mesh of a virtual character [details of the procedure can be found in Qu et al. (2012)]. The animations resulting from the retargeting were additionally modified to enhance the visibility of the AU6 (cheek raiser), a marker of amusement. In particular, the intensity of AU6 in the final animation is proportional to the intensity of AU12 [details of the approach can be found in Niewiadomski and Pelachaud (under review)]. In the last step, the wrinkles associated to action units were applied to the virtual model according to the model proposed in Niewiadomski et al. (2012b) [see also Niewiadomski and Pelachaud (under review)]. The appearance and intensity of an expressive wrinkle depends on the intensity of the corresponding AU. **Figure 1** shows the six facial laughter stimuli, depicting the apex of the laughter and its respective FACS codes.

**Figure 1** shows the apexes of each of the facial expression stimuli separately.

*Voice.* The acoustic laughter synthesis process used for this study follows the approach described in Urbain et al. (2013). It relies on hidden Markov models (HMMs) to capture the statistical distributions of audio features (characterizing the shape of the sound wave) for each laughter phone (e.g., "h," "a," "e," etc.). HMMs have the advantage of modeling the evolution of acoustic features both during each phone (thanks to the use of several states to model each phone) and across the laugh (thanks to the incorporation of derivatives in the feature set). Furthermore, the HTS toolbox (Oura, 2011) provides convenient ways to train contextual HMMs, meaning that different statistical distributions will be computed for each context the considered phone can be in; for example, the phone "h" will be associated with different distributions when it is followed by "a" or "e." To synthesize a laugh, its phonetic transcription is provided to the trained HMMs. Although the phonetic transcription is the only required parameter, HMM-based laughter synthesis also makes it easy to control the duration of each phone as well as the fundamental frequency (pitch) pattern<sup>1</sup>.

**FIGURE 1 | Apex of the laughter events of the six stimuli**. Top row: two low-intensity laughs (left side: AU6B, AU12B, and AU25C; right side: AU6B, AU12B, and AU25B). Mid row: two medium intensity laughs (left side: AU6C, AU12C, and AU25C; right side: AU6C, AU12C, and AU25C). Bottom row: two high-intensity laughs (left side: AU4C, AU6D, AU12D, and AU25D; right side: AU6E, AU12E, and AU25E).

For this study, laughs varying along four dimensions were synthesized: intensity, rhythm, fundamental frequency, and number of syllables. For each dimension, the starting point is a laugh synthesized from a human phonetic transcription and respecting the duration of each of the phones from the human laugh. Two female laughs (around 6.5 s each) and two male laughs (lasting around 2 and 5 s) were created that way, and were used for the four types of modifications presented below. They are called "base synthesis" laughs. All the laughs were voiced. Thus, 4 original laughs and 32 modified laughs were created<sup>2</sup>.
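Three of the four manipulations (intensity, rhythm, and F0, as specified in footnote 2) reduce to simple array operations and can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that the laugh is available as a NumPy array of audio samples, a list of phone durations, and an F0 contour with a value for every frame, and the function names are hypothetical. Adding or removing syllables is not covered.

```python
import numpy as np


def apply_intensity_ramp(samples, end_weight, decreasing=True):
    """Multiply each audio sample by a weight that changes linearly over the laugh.

    The weight runs from 1 down to `end_weight` (decreasing) or from
    `end_weight` up to 1 (increasing); `end_weight` plays the role of the
    minimal value I in footnote 2.
    """
    samples = np.asarray(samples, dtype=float)
    ramp = np.linspace(1.0, end_weight, num=samples.size)
    if not decreasing:
        ramp = ramp[::-1]
    return samples * ramp


def scale_rhythm(phone_durations, r):
    """Multiply every phone duration by the same factor R (R < 1 is faster)."""
    return np.asarray(phone_durations, dtype=float) * r


def scale_f0_variability(f0_curve, f):
    """Scale deviations from the mean F0 by F (F = 0 gives a constant F0).

    Assumes every frame is voiced, i.e. the contour has no missing values.
    """
    f0_curve = np.asarray(f0_curve, dtype=float)
    mean_f0 = f0_curve.mean()
    return mean_f0 + f * (f0_curve - mean_f0)


if __name__ == "__main__":
    # Toy F0 contour in Hz (made-up numbers, for illustration only).
    contour = [220.0, 260.0, 240.0, 200.0]
    print(scale_f0_variability(contour, 0.1))   # attenuated variability
    print(scale_f0_variability(contour, 2.0))   # amplified variability
    print(scale_rhythm([0.10, 0.25, 0.08], 0.7))
```

In the study the rhythm and F0 changes were applied to the phonetic transcription and the F0 curve before synthesis, so the last two functions stand in for a preprocessing step rather than for the synthesis itself.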

*Body.* Animations of laughter body movements were created using motion capture data of the multimodal multiperson corpus of laughter in interaction (MMLI corpus; Niewiadomski et al., 2013)<sup>3</sup> and the Eyesweb XMI software platform (Eyesweb). For the purpose of this study, 10 episodes lasting between 11 and 30 s and involving 4 participants were chosen (5 annotated as low-intensity and 5 as high-intensity laughter).

Data were collected using Xsens MVN Biomech System bodysuits (xsens), each of them consisting of 17 inertial sensors placed on Velcro straps. Data were captured at 120 frames per second, each frame consisting of the location and rotation of 22 body joints in a 3D reference space. An animated stick figure was created with the freely available Eyesweb XMI software platform (Eyesweb), starting from the 3D Xsens position data of the corresponding episodes. The advantage of such a simple body visualization is that body movements are displayed precisely, while avoiding that the lack of other modalities (e.g., facial expressions) is perceived as awkward. In the stick figure animations, a frontal camera was used and the whole body is visible throughout the entire animation.

<sup>1</sup>*Pitch*: note that in speech processing literature, and given the nature of the signals, pitch is assimilated with fundamental frequency or the vibration with the lowest frequency.

<sup>2</sup>*Intensity*: to modify the intensity pattern of the laugh, the amplitude of each audio sample was multiplied by a weighting factor. This factor was either decreasing or increasing linearly over the laugh episode, with a maximum value of 1 (at the beginning or end of the laugh, respectively) and a minimal value of I (at the end or beginning of the laugh, respectively). Laughs have been synthesized for the factor 10 (increase and decrease). The unmodified original laugh corresponds to I = 1. *Rhythm*: rhythm was modified by multiplying the duration of each phone in the base phonetic transcription by the same factor *R*, then synthesizing the obtained phonetic transcription. Factors smaller than 1 shorten the laugh (faster rhythm). Laughs were synthesized with values of *R* equaling 70 or 130% (reference laugh corresponding to *R* = 1). *F0*: to investigate the impact of fundamental frequency (F0) patterns, the F0 curves from the base synthesis laughs have been altered. The average F0 of each base synthesis laugh was computed, and the deviations from the average value were multiplied by a factor *F*. Fundamental frequency variations are amplified if *F* > 1 and attenuated if *F* < 1 (*F* = 0 corresponding to a constant fundamental frequency over the laugh). For each base synthesis laugh, F0 was altered by the factor 10% (reduction) or 200% (amplification). *Syllables*: to vary the number of syllables, we looked for the longest series of "fricative-vowels" syllables (FV series) in the base laugh transcription. To obtain laughs with smaller numbers of syllables, we deleted syllables in the series in a uniform way: if only 1 syllable has to be deleted, we select the middle syllable from the base FV series; if 2 syllables have to be dropped, we take them at 1/3 and 2/3 of the base FV series; if 3 syllables are removed, we take them around 1/4, 1/2, and 2/3 of the base FV series. To obtain laughs with higher numbers of syllables, we added syllables in the FV series in a uniform way. The duration of the inserted syllable is the average between the durations of the preceding and following syllables, while the vowel of the inserted syllable is copied from the preceding syllable. For each base synthesis laugh, five syllables were either added or removed.

<sup>3</sup>The MMLI corpus consists of approximately 500 laugh episodes from 16 participants. It contains both induced and interactive laughs from human triads. The motion capture (mocap) data consist of 3D body position information, multiple audio and video channels, as well as respiration data. The intensity of MMLI laugh episodes was annotated by 2 coders who used a 3-step scale (low, medium, high).

*Study procedure.* Data were collected online with the data collection platform "Unipark." Participants were recruited over mailing lists of universities, forums, and social media platforms. After a welcome page that thanked people for their participation, instructions were given that advised the participant to allow cookies and to ensure having the brightness of the screen and the sound turned up. Further instruction was given to wear headphones for the duration of the study. All participants had to tick a confirmation box that they (a) had read the instructions, and (b) were wearing headphones, before being asked to continue with the study.

The participants were next given instructions on the nature of the study: "You will now be presented with an AVATAR. The laughter the AVATAR displays has been animated directly from a real situation where laughter occurred during a conversation between two people. The clip you are about to see shows a laugh, which occurred specifically during a conversation. This type of laughter happens when the person speaking is about to change the topic of conversation. This laugh is very common when people are talking either face-to-face or on the telephone." Please focus on the facial expressions of the laughing AVATAR. The example of the instruction for participants in the hilarious laughter condition is "You will now be presented with an animated AVATAR. The laughter the AVATAR displays has been animated from a recording of a real situation where laughter occurred when the person laughing has found something to be very funny. The clip you are about to see shows a laugh, which occurred specifically during a conversation. For example, when the person was told a funny joke. This laughter is not so common and only happens in response to something that is hilariously funny." All participants were then told to "Please focus on the facial expressions (auditory laughter/body movements) of the laughing AVATAR."<sup>4</sup>

Next, the demographic questions relating to gender, age, education, and mother tongue were asked. Two further questions asked about hearing problems and about previous participation in a laughter perception study. Next, participants completed the GELOPH<15> and the SPIN. The stimuli of each modality (i.e., the 6 faces, 32 vocalizations, or 10 body movements) were presented in blocks (and randomized within blocks), and each participant was randomly assigned to being presented with either the face, voice, or body stimuli block first (with the other two modalities presented after completion of the first, also in randomized order). Each video or audio clip was presented on a single page. A set of questions relating to how the clip was perceived was presented on the same page as each clip.
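The blocked and randomized presentation scheme can be sketched in a few lines. This is only an illustration of the design, not the configuration of the survey platform: the function name is hypothetical, the stimulus counts are those given in the text, and a seeded generator is used purely to make the example reproducible.

```python
import random

# Stimulus counts per modality as given in the text.
STIMULI = {"face": 6, "voice": 32, "body": 10}


def presentation_order(rng):
    """Return one participant's stimulus sequence as (modality, index) pairs.

    `rng` is a random.Random instance. Modalities are presented as
    contiguous blocks in random order, and stimuli are shuffled within
    each block.
    """
    modalities = list(STIMULI)
    rng.shuffle(modalities)              # random order of the three blocks
    sequence = []
    for modality in modalities:
        indices = list(range(STIMULI[modality]))
        rng.shuffle(indices)             # random order within the block
        sequence.extend((modality, i) for i in indices)
    return sequence


if __name__ == "__main__":
    order = presentation_order(random.Random(2015))
    print(order[:5])
```

Passing a generator into the function keeps the randomization of different participants independent and makes a given sequence reproducible from its seed.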

These were (1) How malicious (with bad intentions) is this laughter? (2) How friendly (with good intentions) is this laughter? (3) How real is this laughter? These were all rated on a 9-point Likert scale from 0 (not at all) to 8 (extremely). An open question was also asked: (4) Which markers in the face/voice/body lead to your perception of friendliness or maliciousness? Here the participant could type a response of up to 200 characters in a text box. The video or audio clip was automatically played as the participant clicked through each page. However, the clip could be replayed as many times as required by the participants. After having rated the stimuli of all modalities, participants answered some control questions and were thanked for their participation. An email address was offered for anyone requiring further information on the study or a brief post-study report. This study conformed to the requirements for the approval of the University of Zurich ethics committee.

#### **RESULTS**

#### **ANALYSIS OF GELOTOPHOBIA**

The averaged GELOPH<15> total scores were computed. Participants' gelotophobia scores ranged from 1.00 to 4.00. The distribution of scores for gelotophobia was *M* = 1.73, SD = 0.65. Applying the cut-off for gelotophobia (i.e., 2.5) yielded 71% (*n* = 87) with no fear, 22% (*n* = 27) borderline, 6% (*n* = 8) with slight, and 0.8% (*n* = 1) with marked gelotophobia. Furthermore, gender was not related to gelotophobia, but age was negatively related, *r*(123) = −0.241, *p* < 0.01, in the present sample.
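The scoring step can be made explicit with a short sketch. It assumes only what is stated in the text (15 items answered from 1 to 4, an averaged total score, and a cut-off of 2.5); the function names are hypothetical, and treating a score exactly at the cut-off as exceeding it is an assumption.

```python
def geloph15_score(item_responses):
    """Average the 15 item responses (each between 1 and 4) into a total score."""
    if len(item_responses) != 15:
        raise ValueError("GELOPH<15> has 15 items")
    if any(not 1 <= r <= 4 for r in item_responses):
        raise ValueError("responses must lie between 1 and 4")
    return sum(item_responses) / 15


def reaches_cutoff(score, cutoff=2.5):
    """True if the averaged score reaches the cut-off quoted in the text.

    Using >= (rather than >) at the boundary is an ASSUMPTION.
    """
    return score >= cutoff


if __name__ == "__main__":
    # Made-up response pattern, for illustration only.
    responses = [2, 1, 3, 2, 2, 1, 1, 2, 3, 2, 1, 2, 2, 1, 2]
    score = geloph15_score(responses)
    print(round(score, 2), reaches_cutoff(score))
```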

#### **ANALYSIS OF THE PERCEPTION OF FACIAL LAUGHTER EXPRESSIONS**

We computed three separate ANOVAs with the gelotophobia as a group factor (gelotophobia vs. no gelotophobia), the stimulus intensity as repeated measures (low, medium, and high) and maliciousness, friendliness, and realness as the dependent variables. Depending on the kind of expectations, main effects for intensity (for the two groups separated) and subsequent *post hoc* tests, or trend analyses (linear and quadratic trends) were computed.

The two stimuli of each intensity level were averaged. We investigated the perception of maliciousness first. As expected, among non-gelotophobes the intensity of the facial display did not affect the level of perceived maliciousness, *F*(2, 140) = 0.15, *p* = 0.859 (*M*low = 3.22, SDlow = 1.47, *M*med = 3.32, SDmed = 1.76, *M*high = 3.22, SDhigh = 1.53). Gelotophobes, however, were sensitive to the intensity of the display, *F*(2, 58) = 4.77, *p* = 0.012, η<sub>p</sub><sup>2</sup> = 0.141. *Post hoc* tests revealed that the medium intensity (*M*med = 4.03, SDmed = 1.61) was perceived as more malicious than both low (*M*low = 3.35, SDlow = 1.42) and high (*M*high = 3.28, SDhigh = 1.71) intensity, *p* = 0.019 and *p* = 0.012, respectively, while the latter two did not differ from each other, *p* = 0.792. Thus, gelotophobes did perceive the medium intensity facial expression of the AVATAR as malicious (see medium row on **Figure 1**). A partial correlation of gelotophobia with maliciousness of the medium intensity expression remained significant even after social phobia (i.e., the SPIN) was partialed out (*r* = 0.18, *p* = 0.04).
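As a cross-check on the effect size reported for the gelotophobes, partial eta squared can be recovered from an *F* statistic and its degrees of freedom via the standard identity η<sub>p</sub><sup>2</sup> = *F* · df<sub>effect</sub> / (*F* · df<sub>effect</sub> + df<sub>error</sub>). The helper name below is hypothetical; the input values are those reported in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Gelotophobes, effect of facial intensity on maliciousness: F(2, 58) = 4.77
print(round(partial_eta_squared(4.77, 2, 58), 3))  # 0.141
```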

Next, we investigated the perception of friendliness. For non-gelotophobes, the intensity of the laughter stimuli affected the level of perceived friendliness, *F*(2, 140) = 3.08, *p* = 0.049, η<sub>p</sub><sup>2</sup> = 0.042. A trend analysis showed that only the linear trend was significant, *F*(1, 70) = 6.62, *p* = 0.012, η<sub>p</sub><sup>2</sup> = 0.086; the friendliness increased with the intensity (*M*low = 4.19, SDlow = 1.54, *M*med = 4.56, SDmed = 4.56, *M*high = 4.66, SDhigh = 4.65). No significant effect of intensity was found for the gelotophobes, *F*(2, 58) = 1.77, *p* = 0.180. For them, only the high intensity was perceived as more friendly (*M*high = 4.52, SDhigh = 1.48), but the difference to the low and medium (*M*low = 4.08, SDlow = 1.27, *M*med = 4.08, SDmed = 1.37) failed to be significant, *p* = 0.055 and *p* = 0.055 (one-tailed), respectively. Again, a partial correlation between gelotophobia and friendliness of the medium intensity expression was computed (controlling for the SPIN, i.e., social phobia) and it was significant (*r* = 0.21, *p* = 0.03).

For the perception of realness, gelotophobia mattered as well. For non-gelotophobes, the intensity of the laughter stimuli affected the level of perceived realness, *F*(2, 140) = 8.95, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.113, and both the medium (*M*med = 4.37, SDmed = 1.68, *p* < 0.001) and high (*M*high = 4.23, SDhigh = 1.63, *p* = 0.003) intensity levels were perceived as higher in realness than the low (*M*low = 3.57, SDlow = 1.46) intensity, with the former two not differing from each other, *p* = 0.487. For the gelotophobes, there was a linear increase of realness with the level of intensity, *F*(1, 29) = 5.80, *p* = 0.023, η<sub>p</sub><sup>2</sup> = 0.167, with only the high (*M*high = 4.18, SDhigh = 1.49) (but not the medium; *M*med = 3.88, SDmed = 1.38) intensity level being significantly more real than the low (*M*low = 3.68, SDlow = 1.44) intensity level, *p* = 0.023. In other words, once a facial expression was perceived to be "real" (i.e., exceeding the scale midpoint of 4.0), it was rated as significantly more real than the expressions of lower intensity. This was the case for both medium and high intensity for non-gelotophobes and high intensity only for gelotophobes.

<sup>4</sup>The two instruction conditions did not impact on the ratings (no main effects or interactions) and were thus neglected in further analyses.

#### **ANALYSIS OF THE PERCEPTION OF LAUGHTER BODY MOVEMENTS**

Next, we examined the level of realness of the body movements. While the low-intense body movements were considered to be less real (*M*low = 3.03, SDlow = 1.29) than the high-intensity ones (*M*high = 5.03, SDhigh = 1.43), *F*(1, 90) = 6.92, *p* = 0.010, η<sub>p</sub><sup>2</sup> = 0.071, the interaction between intensity and gelotophobia failed to be significant, *F*(1, 90) = 1.94, *p* = 0.167. Separate inspection of the five high and five low-intensity examples showed that they varied in intensity and hence the lowest and highest were chosen for further study. Now, the interaction between gelotophobia and intensity of body movement was significant, *F*(1, 89) = 8.13, *p* = 0.005, η<sub>p</sub><sup>2</sup> = 0.084. While the low-intense body movements were considered to be less real by non-gelotophobes (*M*low = 2.64, SDlow = 1.65) and gelotophobes (*M*low = 2.73, SDlow = 1.82) equally, the non-gelotophobes (*M*high = 5.81, SDhigh = 1.88) found the high-intense body movement more real than the gelotophobes (*M*high = 4.54, SDhigh = 2.04). For perceived maliciousness, the interaction between intensity and gelotophobia was also significant, *F*(1, 89) = 4.81, *p* = 0.031, η<sub>p</sub><sup>2</sup> = 0.051. The low-intense body movements were rated as equally malicious by non-gelotophobes (*M*low = 3.15, SDlow = 1.69) and gelotophobes (*M*low = 3.15, SDlow = 2.11), and while the gelotophobes (*M*high = 3.31, SDhigh = 1.59) perceived the laughs with the high-intense body movement to be malicious, the non-gelotophobes (*M*high = 2.33, SDhigh = 1.27), i.e., those who found it real, also indicated that it was not malicious. Finally, the high-intense body display was perceived as friendlier than the laugh with the low-intense body movement, *F*(1, 89) = 8.31, *p* = 0.005, η<sub>p</sub><sup>2</sup> = 0.085, and the non-gelotophobes found the laughs more friendly than the gelotophobes did, *F*(1, 89) = 9.69, *p* = 0.002, η<sub>p</sub><sup>2</sup> = 0.098.
While the non-gelotophobes found the laugh involving the high-intense body movement disproportionately more friendly, the interaction between intensity and level of gelotophobia just failed to be significant, *F*(1, 89) = 3.42, *p* = 0.068. Thus, compared to those without a fear of being laughed at, the gelotophobes found the high-intense body movement less real, more malicious, and less friendly. While the GELOPH correlated at *r* = 0.35 (*p* < 0.001) with the perceived maliciousness of the high-intense body movement, controlling for social phobia (i.e., the SPIN) reduced the correlation to a non-significant one (*r* = 0.13, *p* = 0.223).
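The drop from a zero-order to a partial correlation can be illustrated with the standard first-order partial correlation formula. This is a sketch only: the correlations of the SPIN with the GELOPH and with the maliciousness ratings are not reported in this section, so the input values below are purely illustrative and are not study data.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after partialing out z (first-order)."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Illustrative values only: x = gelotophobia, y = maliciousness rating,
# z = social phobia. When z correlates with both x and y, the partial
# correlation is smaller than the zero-order correlation.
print(round(partial_correlation(0.5, 0.5, 0.5), 3))  # 0.333
```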

To explore which body movement features are linked to perceived maliciousness in gelotophobes, we next present the two laughter body movement animations that were perceived as least and most malicious, respectively, by the gelotophobes. Interestingly, they both came from the high-intensity body movement category (i.e., they even had the same maximal intensity rating in a pretest). This allows for a first comparison that is independent of level of intensity. The means and SD for most malicious were *M* = 3.36, SD = 2.25 and for least malicious *M* = 2.55, SD = 1.47. **Table 1** shows the animations that were perceived as least malicious and most malicious by the gelotophobes and lists all the body movements that were entailed in these two stimuli. In Supplementary Material, the full video animation can be watched.

**Table 1** and Supplementary Material show that, compared to the least malicious (but intense) laughter, there are more jerking than rocking movements in the most malicious laugh, and the movement direction is more often forward–backward rather than left–right. Moreover, in the most malicious animation, additional weight shifts to the left and to the right were observed that were not seen in the least malicious animation. In general, the movements in the most malicious animation appear to be quicker: arms, legs, and head are jerking backward and forward, or to the left/right, while chest and abdomen are contracted backward or forward. In the least malicious animation, the whole body is contracted while trunk and knees are rocking, and legs are tilting to the left or right.

More precisely, **Table 1** and Supplementary Material show that the least malicious body movements involve the bending of the knees, which appear to rock backwards and forwards. The abdomen contracts and moves sideways. The trunk moves in a rocking motion; the arms move left to right, contract upwards, then move down again. The legs tilt backwards and forwards, the whole body contracts, and the head moves from left to right and right to left. In contrast, **Table 1** shows that the most malicious laughter involves many weight-shifting movements in both left-to-right and right-to-left directions. The knees were seen to bend backwards and forwards in a jerking fashion. The abdomen contracted in a vibrating way. The trunk tilted sideways, and the legs moved not only left to right and right to left but also backwards and forwards. The chest contracted forwards and backwards and the head tilted backwards and forwards. The fact that both are equally intense is also underscored by the fact that seven different body movements were coded for the least malicious movement, compared to eight for the most malicious.


**Table 1 | The body movement, general direction, action type, and action direction for laughter body movement stimuli being perceived as least and most malicious by the gelotophobes**.

Yes, movement coded; No, no movement coded; AT, action type; AD, action direction. The following codes are given for the movements and movement directions. Action types: AT1, exhaling; AT2, vibrating; AT3, contracting; AT4, shaking; AT5, tilting; AT6, straightening; AT7, throwing; AT8, jerking; AT9, turning; AT10, rocking; AT11, twitching; AT12, trembling; AT13, convulsing. Body action directions: AD1, backwards; AD2, forwards; AD3, backwards and forwards; AD4, upwards; AD5, downwards; AD6, circular (360°); AD7, curved. If no AD is specified for a BM, then only the general direction occurs.
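The coding scheme in this legend lends itself to a simple lookup structure. The sketch below transcribes the AT and AD codes as listed above; the `describe` helper and the example call are hypothetical and do not reproduce a specific cell of Table 1.

```python
# Action type (AT) and action direction (AD) codes from the table legend.
ACTION_TYPES = {
    "AT1": "exhaling", "AT2": "vibrating", "AT3": "contracting",
    "AT4": "shaking", "AT5": "tilting", "AT6": "straightening",
    "AT7": "throwing", "AT8": "jerking", "AT9": "turning",
    "AT10": "rocking", "AT11": "twitching", "AT12": "trembling",
    "AT13": "convulsing",
}
ACTION_DIRECTIONS = {
    "AD1": "backwards", "AD2": "forwards", "AD3": "backwards and forwards",
    "AD4": "upwards", "AD5": "downwards", "AD6": "circular (360°)",
    "AD7": "curved",
}


def describe(body_part, action_type, action_direction=None):
    """Turn a coded body movement into a readable description.

    If no action direction is given, only the action type is reported,
    mirroring the legend's rule for movements without an AD code.
    """
    text = f"{body_part}: {ACTION_TYPES[action_type]}"
    if action_direction is not None:
        text += f", {ACTION_DIRECTIONS[action_direction]}"
    return text


print(describe("knees", "AT8", "AD3"))  # knees: jerking, backwards and forwards
print(describe("trunk", "AT10"))        # trunk: rocking
```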

Interestingly, in both cases (i.e., the least and the most malicious animation), most of the body parts are involved in the laughter expressions, including legs, knees, trunk, arm, and head.

#### **ANALYSIS OF AUDITORY STIMULI**

Owing to the complexity of the variations of the material, we adopted a sequential strategy. First, we computed three repeated measures ANOVAs with the modification methods (the original and the eight modifications) as repeated measures (averaged over four stimuli), and the maliciousness, friendliness, and realness ratings as dependent variables for the non-gelotophobia group. If the main effect of the modification methods was significant, *post hoc* tests were used to examine the differences between each of the eight modifications and the original (tests at *p* < 0.05 level). Next, the ANOVA and the *post hoc* tests were repeated in the group of gelotophobes and it was examined whether the same differences exist. Finally, it was examined whether there was a main effect of gelotophobia.

For the non-gelotophobes, the effect of the laughter variations on perceived realness was significant, *F*(8, 448) = 2.93, *p* = 0.003, η<sub>p</sub><sup>2</sup> = 0.046. The *post hoc* tests revealed that adding syllables led to a higher degree of perceived "realness" of laughter (*p* < 0.001) and that reduction of the fundamental frequency by the factor 10 led to a reduction of the realness of the laughter. For the gelotophobes, the effect on perceived realness was significant too, *F*(8, 176) = 6.60, *p* = 0.001, η<sub>p</sub><sup>2</sup> = 0.231. The following variations lowered the realness of the laughter compared to the original: adding syllables, reducing the fundamental frequency by the factor 10, and stretching all phone durations by a factor of 130%. Thus, adding a syllable has opposite effects for gelotophobes and non-gelotophobes. Finally, the non-gelotophobes gave higher ratings of realness, a difference which, however, failed to be significant, *F*(1, 83) = 1.06, *p* = 0.307.

Next, we investigated the perceived maliciousness. The ANOVAs showed that the differences in the auditory laughter stimuli had an impact on perceived maliciousness among both the non-gelotophobes, *F*(8, 664) = 2.93, *p* = 0.003, η<sub>p</sub><sup>2</sup> = 0.046, and the gelotophobes, *F*(8, 176) = 2.12, *p* = 0.036, η<sub>p</sub><sup>2</sup> = 0.088. *Post hoc* tests revealed that for both groups an increased duration (durations of phones all scaled by factor 130%) was perceived as more malicious compared to the original laughter. Then, for non-gelotophobes, it was the *reduction* (compared to the original) of the variation in the fundamental frequency (F0) that yielded an increase in perceived maliciousness, while for gelotophobes, it was the *amplification* (by 200%) of the variation in the fundamental frequency that was perceived as more malicious. Furthermore, for gelotophobes (but not non-gelotophobes), the linear decrease of the intensity over the laugh episode was perceived as malicious. Finally, the gelotophobes did not generally rate the maliciousness higher than individuals with no fear, *F*(1, 83) = 0.04, *p* = 0.840.
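To make the three kinds of acoustic modification named above more concrete, the following sketch shows one plausible way to express them on simple parameter lists. It is an assumption-laden illustration: this section does not describe how the stimuli were actually synthesized, and the function names, the choice of scaling F0 variation around the contour mean, and the toy values are all hypothetical.

```python
from statistics import mean


def scale_durations(phone_durations, factor):
    """Stretch (factor > 1) or shorten (factor < 1) every phone duration."""
    return [d * factor for d in phone_durations]


def scale_f0_variation(f0_contour, factor):
    """Amplify (factor > 1) or reduce (factor < 1) F0 variation.

    Assumption of this sketch: variation is scaled around the contour mean,
    so the average pitch stays unchanged.
    """
    center = mean(f0_contour)
    return [center + factor * (f0 - center) for f0 in f0_contour]


def linear_gain_ramp(n_frames, start_gain, end_gain):
    """Gain values changing linearly over the laugh episode."""
    if n_frames == 1:
        return [start_gain]
    step = (end_gain - start_gain) / (n_frames - 1)
    return [start_gain + i * step for i in range(n_frames)]


# Toy values (seconds, Hz, gain), not taken from the study stimuli.
print(scale_durations([0.10, 0.20], 1.3))            # durations scaled to 130%
print(scale_f0_variation([200, 220, 180], 2.0))      # F0 variation at 200%
print(linear_gain_ramp(5, 1.0, 0.0))                 # linear intensity decrease
```

With a factor of 2.0, the toy contour `[200, 220, 180]` keeps its mean of 200 Hz while its excursions double.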

**Figures 2** and **3** present the spectrograms of the two stimuli that were perceived as least and most malicious by gelotophobes, respectively.

The least malicious laugh (as rated by gelotophobes) is shown in **Figure 2**. Here, a less stereotypical pattern that indicates a natural uninhibited laugh is shown (in this laughter stimulus, the F0 variability was modified). Thus, this finding is in line with former research that indicates the influences of F0 variability on the perception. **Figure 3** shows that the stimuli with an increased duration (by stretching all phones by multiplying their duration by the factor 130%) were perceived as most malicious by the gelotophobes. We assume that the stretching of the phones gives the impression of a voluntary regulation/modulation of the laugh and thus makes it sound cognitive.

Finally, the perception of friendliness was investigated. The stimuli differed in the level of perceived friendliness among non-gelotophobes, *F*(8, 448) = 3.54, *p* = 0.001, η<sub>p</sub><sup>2</sup> = 0.055, but failed to have a significant overall effect for gelotophobes, *F*(8, 176) = 1.53, *p* = 0.149. *Post hoc* tests revealed that for non-gelotophobes both an increased duration (durations of phones all scaled by factor 130%) and the reduction of the variation in the fundamental frequency (F0) were perceived as less friendly. It should be mentioned that the shortening of the duration (durations of phones all scaled by factor 70%) and the linear increase of the intensity over the laugh episode were also perceived as less friendly but just failed to be significant (*p* = 0.056). For the gelotophobes, the reduction of the variation in the fundamental frequency (F0) led to a lower perceived friendliness. The amplification of the variation in the fundamental frequency yielded a decrease in friendliness that failed to be significant (*p* = 0.067). Finally, the two groups did not generally differ in their perceived friendliness, *F*(1, 83) = 1.23, *p* = 0.271, although gelotophobes rated all stimuli numerically lower.

#### **OPEN ANSWERS**

The responses to the open question "Which markers in the face/voice/body lead to your perception of friendliness or maliciousness?" were investigated. The first step was to sort each clip of laugh for all modalities by the rating score that had been given by the participant. The laugh was then assigned to group A (more friendly than malicious), group B (more malicious than friendly) or group C (scored equally malicious and friendly).
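The grouping rule described above can be written as a small function. This is a minimal sketch of the rule as stated in the text; the function name is hypothetical, and the ratings are assumed to be the 0 to 8 friendliness and maliciousness ratings a participant gave to one clip.

```python
def assign_group(friendliness, maliciousness):
    """Assign a rated laugh clip to group A, B, or C.

    A: more friendly than malicious
    B: more malicious than friendly
    C: rated equally friendly and malicious
    """
    if friendliness > maliciousness:
        return "A"
    if maliciousness > friendliness:
        return "B"
    return "C"


# Hypothetical ratings on the 0-8 scale.
print(assign_group(6, 2))  # A
print(assign_group(1, 5))  # B
print(assign_group(4, 4))  # C
```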

The analysis of the answers related to the animated facial laughter expression showed that the markers of friendliness for the laughter in group A were reported to be the broadness of the mouth, especially where the teeth were showing, and the raising of the cheeks. In the six animated AVATAR facial expressions that had been classified as group B, the upper lip curling of the AVATAR and "the eyes" were most often reported as being the markers of maliciousness; these aspects were irrespective of the rater being gelotophobic or not. When participants judged a laugh as being equally friendly and malicious (group C), the reason for this was often the "fakeness" of the virtual agents' laughter and the duration of the laughter facial expression animation.

For the auditory stimuli, the findings of the open answer analysis were in line with the hypotheses: the reasons why laugh sounds were deemed malicious (group B) were given as the slowness or the monotone "ha-ha-ha" sounds. Other examples of the reasons were the "expressionless intonation and lack of variability" and "the tone that sounded controlled." The group A laughter sounds were described by the participants as being friendly because the laugh sounded "natural" or had a "warm sounding tone." Additionally, laughs that had a natural trajectory going from high speed to low, which would occur in a usual laughter event, were described as indicating friendliness. As with the facial expression animation, when people judged the laughter sounds as equally malicious and friendly (group C), the reason given was often that they had a "fake" or "robotic sound."

The body movements, which overall were seen as more friendly than malicious, did, however, show that if the clips contained movements, which appeared to be a "pointing" movement of the hand/arm, they were classified as more malicious. When the whole body was moving and the body "leaned backwards" participants rated the laugh as being friendlier.

Second, we looked at the answers of gelotophobes in comparison to individuals with no fear to see whether they nominated any different or additional maliciousness markers beyond the ones on which both groups had agreed. For the facial expression, while the non-gelotophobes rated the lip curls as markers of maliciousness and the eyes as markers of friendliness, the gelotophobes nominated the eyes as markers of maliciousness and the lips, mouth, and teeth as markers of friendliness. This indicates a reversed pattern of the features that generate the perception of maliciousness and friendliness.

Furthermore, for the auditory laughter the non-gelotophobes often gave the high pitch of the laughter as being the indicator of maliciousness. The gelotophobes commented that the "fakeness," so the artificial sound (particularly slow, particularly long), made it appear malicious. Friendliness was often based on the brevity or shortness of the laughter sound for the non-gelotophobes and its "fastness" was indicated as making the sound friendly for the gelotophobes.

For the body movements, non-gelotophobes said that when the body appeared "stiff" it appeared as malicious. The gelotophobes saw the "stillness" of the body as being malicious. When it was rated friendly, the shoulder and head movements were nominated as markers for the non-gelotophobes whereas the gelotophobes did not nominate any observable features of what determines friendliness in laughter body movements.

#### **DISCUSSION**

Gelotophobia is a specific disposition that biases the perception of joyful stimuli (expressed not only by laughter but also beyond). As laughter is an integral part of interaction, it is important to create virtual agents that can account for such biases by producing laughter that is perceived as non-malicious also by those individuals with gelotophobia. Furthermore, there are features of laughter that are perceived as malicious by both gelotophobes and non-gelotophobes. If the desired encoding of the virtual laughter is non-malicious, then these features should be significantly reduced.

The current study identified features of facial, vocal, and body laughter stimuli that were perceived as malicious in general, and for gelotophobes specifically. In general, our hypothesis that "cognitive" laughs would be perceived as more malicious was confirmed for the face and the voice. While for non-gelotophobes, maliciousness did not vary with intensity of the facial expression, the gelotophobes were sensitive to the intensity of the display. The analysis of the laughter facial displays showed that gelotophobes mostly perceived the mid-intensity stimuli as malicious. The mid-intensity laughs show that amusement is present but probably down-regulated and not at maximum. This might indicate a cognitive element of attempting to dampen (or imperfectly hide) amusement. For the perception of realness, gelotophobia mattered as well. The low-intensity laugh was considered to be the least real, and the highest intensity as most real. For gelotophobes, there was a linear increase, and for non-gelotophobes the middle intensity was as real as the high intensity. Once a facial expression was perceived to exceed the scale midpoint of 4.0 (i.e., was perceived as "real"), it was rated as significantly more real than the expressions of lower intensity. Overall, it has to be mentioned that the gelotophobes did not differ from the individuals with no fear in the friendliness and maliciousness ratings for the facial laughter stimuli. However, it was clear from the open answers that the triggers for those ratings did differ; in fact, they were the opposite of each other. Namely, the shape and appearance of the lips curling induced feelings that the expression was malicious for non-gelotophobes, and the movement round the eyes made the face appear friendly. The converse was true for the gelotophobes: the lips were what made the appearance of the virtual agent friendlier, and the eyes made the appearance seem malicious.
This is interesting, as it is speculated that gelotophobes are "laugh blind" inasmuch as they do not have a feeling for what the sender of the expression is trying to relate. As the contraction of the orbicularis oculi muscle is what differentiates a cognitive from a real expression of enjoyable emotion (Ekman et al., 1990) or willingness to cooperate (Schug et al., 2010), the fact that movement around the eye is deemed malicious would be problematic and lead to misinterpretation of facially expressed communication, and may be one reason gelotophobes find it more difficult to form or maintain long-term adult relationships than non-gelotophobes do (Platt et al., 2010; Platt and Forabosco, 2012). Most importantly, the effect found for maliciousness cannot be explained by social anxiety, i.e., it is specific to gelotophobia. A different but plausible pattern emerged for friendliness, which increased linearly with intensity for non-gelotophobes. For the gelotophobes, only the high intensity was perceived as more friendly.

For the body laughter stimuli, the intensity of body movement did play a role. While the low-intense body movements were considered to be less real, there was an interaction between gelotophobia and intensity of body movement (i.e., the lowest and highest in intensity), with the non-gelotophobes finding the high-intense body movement more real than the gelotophobes. The non-gelotophobes also found the high-intense body movement less malicious (and friendlier) than the gelotophobes did, whereas both groups perceived the low-intensity laughs alike; i.e., those who found the high-intense movement real also indicated that it was not malicious. Thus, compared to those without a fear of being laughed at, the gelotophobes found the high-intense body movement less real, more malicious, and less friendly. Furthermore, to modify AVATAR laughter displays to be suitable for gelotophobic individuals, we looked at the single stimuli in detail and studied the laughter body movement animations that had received the lowest and highest maliciousness ratings, respectively (gelotophobic participants only). This qualitative, descriptive analysis showed that the laughter stimulus that displayed an uninhibited, strong laugh was perceived as least malicious. The stimulus that went along with fewer and slower body movements and more restrained body movements was perceived as most malicious. This is in line with our hypothesis that perceived cognitive modulation increases the perceived maliciousness (laughter as an explicit communication attempt, not felt emotion). Additionally, the weight shift and the more frequent back and forth movements might be perceived as more threatening than the sideward directions found for the low-malicious laughter. In the open answers, the non-gelotophobes stated that for them the quality that gave the body movement its maliciousness was stiffness. This was not picked up by the gelotophobes. Yet, going stiff and "feeling paralyzed" is an item on the GELOPH<15> (Ruch and Proyer, 2008a). It has also been reported by extreme gelotophobes in case-study responses. When asked the question "What do you experience when you feel being laughed at?", among other things, they often reported body stiffness (Platt, under review). It could be that for the gelotophobes, this body movement relates more to being fearful than to being malicious.

For the perception of auditory laughter features, stretching the duration of the syllables compared to the original laughter was perceived as more malicious irrespective of gelotophobia level. While for the non-gelotophobes maliciousness was because of the *reduction* of the variation in the fundamental frequency, maliciousness was because of the *amplification* for gelotophobes. Similarly, the gelotophobes saw maliciousness in the linear decrease of the intensity over the laugh episode. Variations in the acoustics made a difference for friendliness among non-gelotophobes only. A reduction of friendliness could be obtained by an increase (and decrease) in duration, a reduction of the variation in the fundamental frequency, and a linear increase of the intensity over the laugh episode. For the gelotophobes, only the reduction of the variation in the fundamental frequency (F0) tended to lower perceived friendliness. It seems that several deviations from the sound of spontaneous laughter give the impression that an "evil mind" behind the laughter interferes, and that presumably disparaging thoughts (about the gelotophobe) add, in the gelotophobe's view, volitional elements that turn an emotional laughter into one that carries the "you are ridiculous" message. The parameters reported above need to be considered when designing a laughing AVATAR in the future. Right now, we do not know how much of these deviations from the normal laugh is tolerable, and there was also no test of the interaction between the different parameters. Furthermore, it should also be remembered that even the normal laughter did yield an average level of maliciousness (*M*non-g = 3.06, SDnon-g = 1.27, *M*g = 3.12, SDg = 1.16).

To summarize, a gelotophobia-friendly laugh should consist of a high-intensity, uninhibited facial expression containing the Duchenne markers [see Ekman et al. (1990) and Ruch and Ekman (2001)] and a voiced vocalization that is fast, non-repetitive, variable, and of short duration. It should not contain any features that indicate a down-regulation in the voice or body, or that indicate cognitive modulation.

#### **LIMITATIONS**

This is the first study to investigate the fine-grained laughter features in different modalities to identify fear triggers of AVATAR laughter in gelotophobes. Still, due to the large number of stimuli, we were limited to only 6 (respectively 10) stimuli for the face and body, with the stimuli only being distinguished by the dimension of intensity. For the face, only one AVATAR was utilized, which was only of female gender, and thus any effects of the AVATAR appearance or gender on the perception could not be investigated. Concerning the analysis of laughter body movements, this was to our knowledge the first psychological study that investigated the laughter body movement perception of gelotophobes. Future studies should attempt more fine-grained analysis to identify which exact features and movement trajectories are linked to perceived maliciousness. Automatic feature analysis with many samples of laughter stimuli should be correlated to the subjective ratings. Here, we could only attempt a descriptive analysis of the feature qualities. Still, our results deliver first evidence of which features may be modified when generating gelotophobia-friendly AVATAR laughter body movements. Furthermore, this study considered the three modalities independently. Future studies should also focus on the importance of audiovisual integration in the perception of laughter friendliness/maliciousness.

One has to consider that people with a fear of being laughed at are rare. For example, in Switzerland only 5% of the population measured is gelotophobic. So, to get a sample of 20, one needs to sample over 400 participants. Additionally, gelotophobes are difficult to find for studies relating to laughter, as this is the trigger of fear, panic, and feelings of shame. Getting gelotophobes, especially those at the more pathological levels, to fully commit to such a study, even with the guarantee of absolute anonymity, is not easily achieved. Building trust by hosting face-to-face sessions, rather than testing online, could encourage more participation, but this would limit wider participation. Presenting the modalities in the lab may encourage more of the rare extreme gelotophobe cases to undertake the task, as they could be reassured and encouraged.
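The recruitment arithmetic above can be made explicit. The sketch below assumes independent sampling with a constant 5% prevalence (a simplification) and uses hypothetical function names; it computes the screening size at which 20 gelotophobes are expected on average, and the binomial probability of actually reaching 20 at that size.

```python
from fractions import Fraction
from math import ceil, comb


def expected_screening_size(target_cases, prevalence):
    """Smallest sample whose expected number of cases reaches the target."""
    return ceil(Fraction(target_cases) / prevalence)


def prob_at_least(target_cases, n_screened, prevalence):
    """Binomial probability of finding at least target_cases among n_screened."""
    p = float(prevalence)
    below = sum(
        comb(n_screened, k) * p ** k * (1 - p) ** (n_screened - k)
        for k in range(target_cases)
    )
    return 1 - below


prevalence = Fraction(5, 100)  # 5% as stated in the text
n = expected_screening_size(20, prevalence)
print(n)  # 400
print(prob_at_least(20, n, prevalence))
```

Under these assumptions, screening exactly 400 people reaches the target only in roughly half of all cases, which is consistent with the statement that more than 400 participants need to be sampled.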

#### **ACKNOWLEDGMENTS**

The research leading to these results has received funding from the European Union Seventh Framework Program (FP7/2007-2013) under grant agreement no. 270780 (ILHAIRE project).

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fnhum.2014.00928/abstract

#### **REFERENCES**


Wallbott, H. G. (1998). Bodily expression of emotion. *EJSP* 28, 6.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 04 July 2014; accepted: 31 October 2014; published online: 18 November 2014.*

*Citation: Ruch WF, Platt T, Hofmann J, Niewiadomski R, Urbain J, Mancini M and Dupont S (2014) Gelotophobia and the challenges of implementing laughter into virtual agents interactions. Front. Hum. Neurosci. 8:928. doi: 10.3389/fnhum.2014.00928 This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Ruch, Platt, Hofmann, Niewiadomski, Urbain, Mancini and Dupont. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## The relationship between virtual self similarity and social anxiety

#### **Laura Aymerich-Franch\*, René F. Kizilcec and Jeremy N. Bailenson**

Virtual Human Interaction Lab, Department of Communication, Stanford University, Stanford, CA, USA

#### **Edited by:**

Ali Oker, Université de Versailles, France

#### **Reviewed by:**

Ilaria Bufalari, Sapienza University of Roma, Italy; Philip L. Jackson, Universite Laval, Canada; Léonor Philip, CNRS, France

#### **\*Correspondence:**

Laura Aymerich-Franch, Virtual Human Interaction Lab, Department of Communication, Stanford University, McClatchy Hall, Room 411, Stanford, CA 94305-2050, USA e-mail: laura.aymerich@gmail.com

In virtual reality (VR), it is possible to embody avatars that are dissimilar to the physical self. We examined whether embodying a dissimilar self in VR would decrease anxiety in a public speaking situation. We report the results of an observational pilot study and two laboratory experiments. In the pilot study (N = 252), participants chose an avatar to use in a public speaking task. Trait public speaking anxiety correlated with avatar preference, such that anxious individuals preferred dissimilar self-representations. In Study 1 (N = 82), differences in anxiety during a speech in front of a virtual audience were compared among participants embodying an assigned avatar whose face was identical to their real self, an assigned avatar whose face was other than their real face, or an avatar of their choice. Anxiety differences were not significant, but there was a trend for lower anxiety with the assigned dissimilar avatar compared to the avatar looking like the real self. Study 2 (N = 105) was designed to explicate that trend, and further investigated anxiety differences with an assigned self or dissimilar avatar. The assigned dissimilar avatar reduced anxiety relative to the assigned self avatar for one measure of anxiety. We discuss implications for theories of self-representation as well as for applied uses of VR to treat social anxiety.

**Keywords: virtual reality, virtual environment, social anxiety, public speaking, virtual self, self-representation, self-image, virtual classroom**

#### **INTRODUCTION**

Virtual reality (VR) enables people to experience an alternate reality. Transforming the appearance of one's self is a particularly powerful application of VR. This can be achieved by modifying the appearance of one's avatar (Biocca, 1997; Bailenson and Blascovich, 2004), thereby providing the person with a new self-representation. People are known to psychologically identify with virtual representations that do not necessarily reflect their actual appearances (Kim, 2011). Thus, in virtual worlds, people can explore different versions of their self and become someone else (Turkle, 1995). Moreover, the appearance of avatars can cause behavioral and attitudinal shifts (Yee and Bailenson, 2006, 2007, 2009; Vasalou et al., 2007; Groom et al., 2009; Ahn and Bailenson, 2011; Hershfield et al., 2011; Peck et al., 2013). Also, observing the behavior of doppelgangers – virtual representations that resemble the self in appearance but behave independently – can influence attitudes and behavior (Fox and Bailenson, 2009, 2010; Fox et al., 2012; Aymerich-Franch and Bailenson, 2014).

In light of its psychological effects, VR is uniquely positioned to support the treatment of phobias and other anxiety disorders (Wiederhold and Wiederhold, 2000). In particular, the possibility of transforming self-appearance offers a unique opportunity to treat social anxiety, one of the most common anxiety–mood disorders worldwide (Kessler et al., 2012). Social anxiety is an intense fear of negative evaluation from others in social or performance situations (4th ed., text rev.; DSM–IV–TR; American Psychiatric Association, 2000). According to cognitive models of social phobia, negative self-images play an important role in maintaining social anxiety (Izgiç et al., 2004; Stopa and Jenkins, 2007) as socially anxious individuals create a negatively distorted mental representation of their ostensible appearance toward others. People who suffer from social anxiety selectively attend to and magnify negative aspects of their ostensible public image, which may also be influenced by past failures in social situations (Clark and Wells, 1995). However, previous studies that have used VR to address social anxiety have focused on manipulating features of the audience (Pertaub et al., 2001, 2002; James et al., 2003; Garau et al., 2005; Slater et al., 2006; Rinck et al., 2010; Wieser et al., 2010; Cornwell et al., 2011; Pan et al., 2012), but have not transformed the self in a social situation. Similarly, studies that have used virtual reality exposure therapy (VRET) to treat social phobia generally recreate social virtual environments for exposure (Harris et al., 2002; Roy et al., 2003; Anderson et al., 2005; Klinger et al., 2005; Wallach et al., 2011), without attempting to modify patients' self-representations.

Current VR therapies for social anxiety could incorporate transformations of the virtual self in order to restructure patients' distorted self-image in combination with exposure. We believe that having a virtual self-representation dissimilar to the real self in a social situation might decrease anxiety, because (a) virtual embodiment through an avatar can significantly alter a person's body schema and social role (Biocca, 1997; Kilteni et al., 2012), and (b) a dissimilar virtual self provides anonymity, which reduces inhibition and anxiety and facilitates self-expression. Embodying a dissimilar self could thereby neutralize some of the factors that contribute to social anxiety.

In this paper, we examine the effect of an altered self-representation in VR on social anxiety and present a new approach to treat social anxiety using VR. In order to explore the effect of the appearance of the virtual self on social anxiety, we conducted three studies. In a pilot study, we examined the relationship between avatar similarity preference and trait public speaking anxiety during an imagined virtual speech task. We expected that higher social anxiety would be associated with a stronger preference for a dissimilar avatar. Based on the results of the pilot study, we designed an experiment (Study 1) in which we used immersive VR to manipulate appearance similarity of participants' virtual self-representations to their physical appearances in a public speaking context. Since multisensory correlations are essential to experience embodiment (Botvinick and Cohen, 1998), we added a virtual mirror in the virtual environment and synchronized participants' body movements to the avatar reflection in the mirror in order to create embodiment and identification with the virtual body. Prior work using VR suggests that real-time virtual mirror reflections of upper body movements contribute to feelings of body ownership (González-Franco et al., 2010). We expected that participants with an avatar matching their real appearance would experience higher anxiety compared to participants with a dissimilar self-representation. In Study 1, we also examined the effect on anxiety of choosing the appearance of the virtual self in comparison to being assigned an avatar in a public speaking situation. Prior work has examined the effects of avatar choice in online environments. The findings suggest that choosing the appearance of the virtual self in virtual environments affects constructs related to anxiety and self-consciousness.
In particular, previous work has found increased self-awareness during online interactions with other people for users who were represented by an avatar that matched their appearance and preferences compared to users without an avatar representation (Vasalou et al., 2007). In addition, giving players the possibility of choosing the character that represents them during an online game was found to induce greater arousal compared to not having the option to choose (Lim and Reeves, 2009). In the first two experimental conditions, participants were assigned the avatar. In order to examine the effect on anxiety of choosing an avatar relative to being assigned one, we added a third condition, in which participants were able to choose their self-representation. To test our hypothesis and research questions, we compared anxiety outcomes among participants embodying an assigned avatar looking like the real self, an assigned avatar looking dissimilar to the real self, or an avatar of their choice, during a speech in front of a virtual audience. In Study 2, we partially replicated Study 1, comparing participants with an assigned similar versus dissimilar appearance with a larger sample and a more established measure of anxiety. In our experimental studies, we also examined effects on the sense of presence, the psychological state in which virtual objects are experienced as actual objects in either sensory or non-sensory ways (Lee, 2004). Presence is an important factor to consider in studies that explore phobia-related issues using VR, since it contributes to the experience of anxiety in a virtual environment (Price and Anderson, 2007).

#### **PILOT STUDY**

We designed a survey in which participants had to choose an avatar to embody if they were going to give a speech in VR. They indicated how similar the avatars would be to their physical selves. We predicted that participants with higher levels of trait social anxiety would prefer to embody dissimilar avatars compared to participants with lower levels of trait social anxiety:

H1. Avatar similarity and social anxiety correlate negatively, i.e., higher social anxiety is associated with a stronger preference for a dissimilar avatar.

#### **METHOD**

#### **Participants**

A total of 252 participants from the United States completed the survey. The sample was composed of 64% males, aged between 18 and 74 years, with 49% of the sample between the ages of 25 and 34 years. Participants were recruited through Amazon's Mechanical Turk crowdsourcing service and received a \$1 payment for completing the survey. Mechanical Turk has been widely used in previous studies to recruit participants and has been shown to provide data comparable to more traditional methods of recruitment (Kittur et al., 2008; Golbeck and Fleischmann, 2010; Sprouse, 2011; Liu et al., 2012; Aker et al., 2013).

#### **Design**

In the survey, participants first answered questions regarding socio-demographic variables and interactive media habits (videogaming, virtual worlds, VR). Public speaking anxiety was measured using the *Personal Report of Communication Apprehension* (PRCA-24; McCroskey, 1982), a 24-item scale. Participants rated their answers on a five-point scale ranging from *strongly disagree* (1) to *strongly agree* (5). The reliability of this measure was α = 0.95 and the average score on this test was *M* = 69.56 (SD = 22.27).
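The reliability coefficient reported here is Cronbach's α. As a minimal sketch of how such a coefficient is computed from item-level ratings, the following uses only the Python standard library. The function name `cronbach_alpha` and the toy ratings are illustrative assumptions, not the study's data or analysis code.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participants, each a list of k item ratings."""
    k = len(responses[0])
    # Sample variance of each item across participants
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each participant's summed score
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings from five participants on a three-item scale (not study data)
toy_ratings = [[1, 2, 1], [2, 3, 2], [3, 3, 4], [4, 5, 4], [5, 5, 5]]
alpha = cronbach_alpha(toy_ratings)
```

Values close to 1 indicate that the items vary together, which is what the reported α = 0.95 suggests for the 24 PRCA-24 items.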

Next, participants reviewed a passage describing VR, an avatar, technologies such as head-mounted displays (HMD), the content of a virtual scene, and the concept of similarity in self-representation. The passage included both verbal description and images.

Participants were informed that they would be giving a speech on two different topics that may or may not be socially sensitive in nature. For the first topic, participants would discuss their favorite vacation (i.e., "imagine that you are in this virtual classroom full of people. You are required to deliver a speech about your favorite vacation in front of the virtual audience"). For the other topic, they would deliver a speech on a sensitive social issue (i.e., "imagine that you are in this virtual classroom full of people. You are required to give a speech about a sensitive social issue in front of the virtual audience").

Participants rated each situation with an avatar similarity question (i.e., "if you had to design your own avatar for this task, how similar to your real appearance would you make your avatar?"). The question was rated on a five-point scale ranging from *extremely similar* (1) to *not at all similar* (5). We included a picture of a virtual classroom to accompany these questions. Across the two situations, the two ratings correlated highly (*r* = 0.78, *t*<sub>252</sub> = 19.8, *p* < 0.001), so we created an index of avatar similarity by averaging across the two situations.

#### **RESULTS AND DISCUSSION**

There was a significant negative correlation between avatar similarity and social anxiety. Avatar similarity correlated significantly with the PRCA-24 across the two situations (*r* = −0.43, *t*<sub>252</sub> = 7.6, *p* < 0.001) and for each situation individually: favorite vacation (*r* = −0.37, *t*<sub>252</sub> = 6.4, *p* < 0.001) and sensitive social issue (*r* = −0.44, *t*<sub>252</sub> = 7.8, *p* < 0.001). As predicted, the higher the social anxiety, the stronger the preference for embodying a dissimilar rather than a similar avatar during a speech in VR.
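The *t* statistics that accompany these correlations can be approximately recovered from *r* and the sample size. The sketch below uses the standard test of a Pearson correlation against zero, which has n − 2 degrees of freedom. The function name `t_from_r` is an illustrative assumption, and because the inputs are the rounded values printed in the text, the results match the reported statistics only approximately.

```python
import math


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Reported pilot values: r = -0.43 across both situations, N = 252
t_overall = t_from_r(-0.43, 252)      # about -7.5; the text reports the magnitude, 7.6
# Correlation between the two similarity ratings: r = 0.78
t_between_items = t_from_r(0.78, 252)  # about 19.7; the text reports 19.8
```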

#### **STUDY 1**

Since the results of our pilot study suggested that higher social anxiety is associated with a preference for a dissimilar virtual self in a public speaking task in VR, we designed an experiment to examine whether becoming someone else would reduce anxiety in a public speaking situation. In order to examine the consequences of avatar choice on public speaking anxiety in comparison to an assigned avatar, we added a third condition to the experimental design in which participants chose the appearance of their virtual self. We hypothesized that participants embodying a self avatar during a public speaking task in VR would experience greater levels of anxiety than participants with a dissimilar avatar. In particular, we made the following prediction:

H1. During a speech in front of a virtual audience, participants with an avatar of their own face (*self* condition) will experience higher levels of self-perceived physiological sensations and state anxiety than participants with an avatar of a face dissimilar to their own (*other* condition).

We also formulated the following research question regarding the possibility of choosing the avatar:

RQ1. Do participants in the *choice* condition experience different levels of anxiety than those in the *self* and *other* conditions?

Regarding presence, we formulated the following research question:

RQ2. Do participants in the *self*, *other*, and *choice* condition experience different levels of self, social, and spatial presence?

#### **METHOD**

#### **Participants**

Eighty-eight participants attending an American university took part in the experiment. We discarded six due to technical failure or motion sickness. The final sample consisted of 82 experimental subjects (51 females and 31 males) aged 18–32 years (*M* = 20.18, SD = 1.88).

#### **Design**

Participants were assigned to one of three experimental conditions: *self*, *other*, or *choice*. In the *self* condition, participants embodied an avatar with their own face modeled after a photograph (**Figure 1**). In the *other* condition, participants were assigned a dissimilar face modeled after a previous participant's photograph. We ensured that the avatar face in the *other* condition matched participants' sex and skin color by pairing each participant in the *other* condition with a previous participant from the same study of the same sex and with similar skin color. The avatar faces did not vary across conditions, as faces from the *self* condition were reused in the *other* condition. Faces in the *other* condition were paired with ones from the *self* condition in one of eight categories defined by sex (male/female) and skin color (lightest to darkest). This ensured that faces had comparable objective features in the *self* and *other* conditions. In the *choice* condition, we showed participants a chart that contained 18 photographs of people of their same sex and asked them which avatar they would choose to represent them if they had to give a speech in front of a virtual audience. Participants were assigned the face they chose.

**FIGURE 1 | A participant and her avatar modeled from a photograph of her, in the self face condition**.

#### **Procedure**

Participants completed the experiment individually. When they arrived, we took a picture of each participant's face. In the *self* condition, we modeled the picture of the participant's face to become their avatar's head. In the *choice* condition, participants looked at a chart with faces and had to choose which person they would like to become their avatar if they had to give a speech in front of an audience. Then, in all conditions, they filled out a pre-survey. After that, we required them to improvise a 3-min speech in front of a virtual audience. They were able to decide the topic. As a possibility, we suggested that they talk about a hobby or interest. Once in the experimental room, participants wore an HMD and tracking sensors on their head and wrists. In the virtual world, participants saw a curtain that opened and an empty classroom appeared. We told subjects that an avatar would represent them in the virtual environment and asked them to look at a virtual mirror placed on the back wall of the room. We told them to lift their arms one at a time to make sure that they were aware of their avatar's self-representation, which moved its hands accordingly in real time. Participants were also asked to describe their avatar briefly. Then, the curtain closed and they were told that the audience would arrive at the classroom shortly. We asked them to rate their anxiety before the speech. After a few seconds, the curtain opened again to reveal the seated virtual audience, watching the participant. The audience was composed of 12 agents (6 males and 6 females) of various races as depicted in **Figure 2**. The agents in the audience kept neutral faces during the speech. They looked at the participant most of the time. Also, we programmed them to perform some stock idling gestures such as slightly moving their heads or arms from time to time for a realistic appearance. Once the curtain was fully open, participants started their speech.
Participants were able to see their virtual representation at all times during the performance, which mirrored their head, arm, and body movements. After they concluded, the curtain closed. We asked them to rate how anxious they felt during the speech. Then, we helped them to take off the HMD. Finally, they completed a post-survey and were thanked for their participation.

#### **Apparatus**

We created the virtual classroom using Worldviz's Vizard VR Toolkit. Participants wore an nVisor SX111 head-mounted display (NVIS, Reston, VA, USA) with a resolution of 2560 horizontal and 1024 vertical and a refresh rate of 60 frames per second to visualize the virtual world. An optical tracking system (Worldviz PPT-E) combined with an orientation sensor (Intersense3 Inertial Cube) provided total tracking of six degrees of freedom (x, y, z position and pitch, yaw, and roll) for the head. The participants wore trackers on the hands that tracked the x, y, z position of each hand (but not orientation) as well (**Figure 3**).

#### **MEASURES**

#### **Pre-screen**

We only invited participants who scored six or higher on the Mini-SPIN test (Connor et al., 2001), a screening test for social anxiety consisting of three questions.

#### **Pre-test survey measures**

Trait public speaking anxiety was assessed using the Personal Report of Communication Apprehension (PRCA-24; McCroskey, 1982). The reliability of this test in the study was α = 0.90 and the average score was *M* = 71.73 (SD = 14.39). PRCA-24 scores were not significantly different across conditions: *M*<sub>self</sub> = 70.93, *M*<sub>other</sub> = 71.31, *M*<sub>choice</sub> = 72.93 (SD<sub>self</sub> = 12.63, SD<sub>other</sub> = 15.75, SD<sub>choice</sub> = 15.11); *F*<sub>2, 79</sub> = 0.15, *p* = 0.86 based on an ANOVA.

In addition, trait social anxiety was measured using the *Brief Fear of Negative Evaluation Scale* (B-FNE; Leary, 1983). This scale is often used to assess fear of negative evaluation, the core feature of social anxiety disorder (Weeks et al., 2005). This measure yielded a reliability of α = 0.86 and the average score was *M* = 3.26 (SD = 0.64). B-FNE scores were not significantly different across conditions: *M*<sub>self</sub> = 3.33, *M*<sub>other</sub> = 3.15, *M*<sub>choice</sub> = 3.29 (SD<sub>self</sub> = 0.64, SD<sub>other</sub> = 0.65, SD<sub>choice</sub> = 0.65); *χ*<sup>2</sup><sub>*df* = 2</sub> = 1.5, *p* = 0.48 based on a Kruskal–Wallis test (residual errors not normally distributed).
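For readers unfamiliar with the Kruskal–Wallis test, the following is a minimal standard-library sketch of the rank-based H statistic (with the usual tie correction) and of the p-value for the three-group case, where the chi-square reference distribution with 2 degrees of freedom has the closed-form tail probability exp(−H/2). The function names and toy groups are illustrative assumptions, not the authors' analysis code or data.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected H statistic (undefined if every pooled value is identical)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    return h / correction


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2)


# Hypothetical scores for three conditions (not study data)
h_toy = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Reported statistic of 1.5 with df = 2
p_reported = chi2_sf_df2(1.5)  # about 0.47, close to the reported p = 0.48
```

The small gap between 0.47 and the reported 0.48 is what one would expect from feeding a rounded test statistic back into the formula.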

**FIGURE 3 | Participant wearing the HMD (1), tracking sensors on the head and wrists (2), cameras (3) to detect the position of the trackers, and orientation device (4), during the speaking task**.

#### **Post-test survey measures**

Participants rated how anxious they felt before and during the speech on a scale from 0 (no anxiety) to 100 (extreme anxiety) (Stopa and Jenkins, 2007). Participants answered these questions while they were in the virtual world. The average score on this measure was *M* = 42.68 (SD = 24.97) for anxiety before the speech and *M* = 46.21 (SD = 25.71) for anxiety during the speech.

The *Body Sensations Questionnaire* (BSQ; Chambless et al., 1984) was used to measure self-perceived physiological sensations. This measure is a 17-item scale that comprises items concerning sensations associated with autonomic arousal. Participants rate how intensely they experienced each sensation (e.g., heart palpitations or dry throat) during the speech on a five-point scale, ranging from *not at all* (1) to *extremely* (5). The BSQ has been previously used in public speaking anxiety studies (McCullough et al., 2006). The BSQ yielded a reliability of α = 0.88 and the average score was *M* = 1.73 (SD = 0.58).

A 15-item presence scale consisting of five items for self-presence (e.g., to what extent did you feel that the avatar's body was your own body?), five items for social presence (e.g., to what extent did you feel that the audience was present?), and five items for spatial presence (e.g., to what extent did you feel that the virtual classroom seemed like the real world?) was adapted from presence scales used in previous studies (Nowak and Biocca, 2003; Bailenson and Yee, 2007; Fox et al., 2009). The items were rated on a five-point scale ranging from *very highly* (1) to *not at all* (5). For self-presence, social presence, and spatial presence, the reliability was α = 0.91, α = 0.93, and α = 0.87, respectively. Overall, presence was computed by averaging over the three presence dimensions. The reliability of the overall presence measure was α = 0.92. Presence scores were reversed for better interpretability, such that high presence scores indicate a strong sense of presence (scores range from 1 to 5). The average score on these measures was *M* = 2.13 (SD = 0.90) for self-presence, *M* = 3.27 (SD = 0.96) for social presence, *M* = 2.81 (SD = 0.83) for spatial presence, and *M* = 2.74 (SD = 0.73) for overall presence.
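The reversal and averaging steps described above can be sketched as follows. The helper names are illustrative assumptions. The check at the end uses the three dimension means as printed, which reproduces the reported overall mean when every participant has all three dimension scores.

```python
def reverse_score(rating, low=1, high=5):
    """Flip a rating so that the top of the scale means strong presence."""
    return low + high - rating


def overall_presence(self_p, social_p, spatial_p):
    """Overall presence as the mean of the three presence dimensions."""
    return (self_p + social_p + spatial_p) / 3


# A raw answer of 1 ("very highly") becomes 5 after reversal
reversed_top = reverse_score(1)

# Study 1 dimension means as reported (already reversed)
overall = overall_presence(2.13, 3.27, 2.81)  # rounds to 2.74, the reported overall mean
```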

Participants also rated the similarity of their avatar's face with their own face as a manipulation check. The exact question wording was "when you looked at your avatar in the mirror, how similar was its face to yours?" A five-point scale from *extremely similar* (1) to *not at all similar* (5) was used, so lower scores indicate greater similarity. Participants rated their avatar as significantly more similar to themselves in the *self* condition (*M* = 1.9, SD = 0.89) than in the *other* condition (*M* = 4.3, SD = 0.74; *t*<sub>51</sub> = 11, *p* < 0.001, *d* = 3.0). Ratings in the *choice* condition were closer to the *other* than *self* condition (*M* = 3.9, SD = 0.89).
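The size of the manipulation check effect can be roughly verified from the summary statistics alone. This sketch computes Cohen's *d* with a pooled standard deviation that assumes equal group sizes, which is an assumption made here because per-condition sample sizes are not stated in the text. The function name is illustrative, and the result differs slightly from the reported value because the inputs are rounded.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled SD that assumes equal group sizes."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Study 1 manipulation check: other (M = 4.3, SD = 0.74) vs. self (M = 1.9, SD = 0.89)
d_check = cohens_d(4.3, 0.74, 1.9, 0.89)  # about 2.9; the text reports d = 3.0
```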

#### **RESULTS**

Descriptive and inferential statistics for anxiety and presence are summarized in **Table 1**. Differences between experimental conditions were tested using ANOVAs where the condition that residual errors are normally distributed was not significantly violated. The assumption was violated for measures of anxiety before and during the speech, BSQ, and self-presence. We tested differences using the non-parametric Kruskal–Wallis test for these measures. Trait anxiety (PRCA-24 and B-FNE) and state anxiety (BSQ, anxiety before and during the speech) were all correlated, except for PRCA-24 with BSQ. Presence measures (self, social, spatial, and overall) were also all correlated with one another (see Table S1 in the Supplementary Material for correlations between all measures).

We tested H1 and addressed RQ1 about differences in anxiety and BSQ with a simple test of unadjusted means (Kruskal–Wallis tests in **Table 1**) and a covariate-adjusted regression model (**Table 2**). Similar studies (Felnhofer et al., 2012; Aymerich-Franch and Bailenson, 2014) highlighted the relevance of sex and trait social anxiety (B-FNE) as moderators of the effect of virtual experiences on anxiety-related measures. Accordingly, we report results from two regressions for each outcome, one without covariates and one with sex and B-FNE in the model. The data provided some evidence for H1 that anxiety is lower with a dissimilar avatar than a self avatar based on BSQ scores (*p* < 0.10), but not for anxiety measured before and during the speech. B-FNE was a significant covariate in all regressions of anxiety and BSQ, though sex was not significant (**Table 2**). Regarding RQ1, anxiety measures were not significantly different in the *choice* condition, based on either unadjusted tests (**Table 1**) or covariate-adjusted regressions (**Table 2**). Average levels of anxiety in the *choice* condition were between those in the *other* and *self* condition based on descriptive statistics only. As PRCA-24 was highly correlated with B-FNE, only one could be included in the regression model, but results were qualitatively similar with PRCA-24 as a covariate in the model.

Notes to **Table 1**: \*\*Significantly different from the self condition at p < 0.05, \*p < 0.10. <sup>a</sup>Residual errors were not normally distributed, hence a non-parametric test was employed instead of an ANOVA.

Notes to **Table 2**: Residual errors were normally distributed in all covariate-adjusted models. BSQ was log-transformed to fit a linear model, and B-FNE was centered for interpretability. \*\*Significant coefficient with p < 0.01, \*significant coefficient with p < 0.10.

We examined RQ2 about differences in types of perceived presence between conditions with ANOVAs and non-parametric tests depending on the distribution of the data (**Table 1**). Social, spatial, and overall presence were significantly lower in the *choice* condition than in the *self* condition (*t*<sub>54</sub> > 2.2, *p* < 0.05, Cohen's *d* = 0.79, 0.59, 0.66, respectively), but there were no significant differences in self-presence. Only social presence was lower in the *other* condition than the *self* condition, *t*<sub>52</sub> = 2.3, *p* = 0.03, *d* = 0.62.

In sum, Study 1 showed that assigning participants a dissimilar face did not significantly reduce their anxiety during a speech in front of a virtual audience. However, differences in all three anxiety measures were in the hypothesized direction, i.e., lower anxiety with a dissimilar face than with their own face, and the difference was marginally significant for BSQ. Participants who chose their avatar experienced significantly lower levels of social, spatial, and overall presence than those who were assigned their own face. Yet, choosing an avatar was not found to induce significantly different levels of anxiety than being assigned a *self* or *other* avatar. We attempted to replicate the effect of assigning their own face or a dissimilar face in Study 2 with a larger sample size and a more established measure of anxiety to test if embodying a new self could reduce social anxiety.

#### **STUDY 2**

In this study, we partially replicated Study 1, in which exploratory results indicated that participants who were assigned a dissimilar face experienced marginally lower anxiety than participants who were assigned their own face, although the difference in anxiety was not significant.

In order to improve the design of Study 1, a series of modifications were made in Study 2. First, a larger sample size was used to gain more statistical power to identify significant differences between conditions. A power calculation suggests that the sample size used in Study 2 could identify an effect size of 0.57 SD with 80% power and 95% confidence. Moreover, since reported anxiety before and during the public speaking situation was potentially not sensitive enough to detect significant differences in anxiety, we opted for a more established measure of anxiety, namely the *State-Trait Anxiety Inventory* (STAI; Spielberger et al., 1970, 1983). Also, participants were required to give a longer speech and had time to prepare it in order to ensure that the experience was long enough to provoke anxiety.

In line with Study 1, we hypothesized significant differences between the *self* and *other* conditions:

H1. During a speech in front of a virtual audience, participants with an avatar of their own face will experience higher levels of self-perceived physiological sensations and state anxiety than participants with an avatar of a face dissimilar to their own.

We also explored differences in the sense of presence between the two conditions. Accordingly, we formulated the following research question:

RQ1. Do participants in the *self* and *other* condition experience different levels of self, social, and spatial presence?

#### **METHOD**

#### **Participants**

One hundred and fourteen participants attending an American university took part in the experiment. Nine participants were omitted from analysis due to technical failure or motion sickness. The final sample consisted of 105 experimental subjects (61 males and 44 females) aged 18–39 years (*M* = 20.41, SD = 2.43).

#### **Design**

Participants were assigned an avatar with either their own face or a dissimilar face. Participants in the *self* condition were assigned an avatar with a face that was modeled after their photograph, while those in the *other* condition were assigned an avatar with the face of a previous participant. The procedure was identical to the one described in Study 1.

#### **Procedure**

Participants completed the experiment individually. First, they filled out a pre-survey. Then, we read aloud a set of instructions, which required them to give a 5-min speech about their university in front of a virtual audience. We gave them 5 min to prepare the speech. Once in the experimental room, participants wore an HMD and tracking sensors on their head and wrists, which tracked orientation and translation of head position as they moved about the room and translation of their hands. In the virtual world, participants saw a curtain that opened and an empty classroom appeared. We told subjects that an avatar would represent them in the virtual environment and asked them to look at a virtual mirror placed on the back wall of the room. We told them to lift their arms one at a time to make sure that they were aware of their avatar's self-representation, which moved its hands accordingly in real time. Then, the curtain closed and we told them that the audience would arrive at the classroom shortly. After a few seconds, the curtain opened again and the virtual audience was sitting, watching the participant. The audience was the same as from Study 1. Once the curtain was fully open, participants delivered their speech. Participants were able to see their virtual representation at all times during the performance, which mirrored their head, arm, and body movements. After they concluded, the curtain closed and we helped them to take off the HMD. Finally, they completed a post-survey and were thanked for their participation and debriefed.

#### **Apparatus**

We used the same apparatus described in Study 1.

#### **Measures**

*Pre-test survey measures.* Trait public speaking anxiety was assessed with the *Personal Report of Communication Apprehension* (PRCA-24; McCroskey, 1982) scale. Participants rated 24 items on a 5-point scale ranging from *strongly disagree* (1) to *strongly agree* (5). The reliability of this measure was α = 0.96 and the average score was *M* = 66.36 (SD = 17.54). PRCA-24 scores were not significantly different across conditions: *M*<sub>self</sub> = 65.33, *M*<sub>other</sub> = 67.64 (SD<sub>self</sub> = 16.60, SD<sub>other</sub> = 18.73); *F*<sub>1, 103</sub> = 0.45, *p* = 0.51 based on an ANOVA.

In addition, trait social anxiety was measured using the Brief Fear of Negative Evaluation Scale (B-FNE; Leary, 1983). Participants rated 12 items on a 5-point scale ranging from *not at all characteristic of me* (1) to *extremely characteristic of me* (5). This measure yielded a reliability of α = 0.92 and an average score of *M* = 37.97 (SD = 10.22). B-FNE scores were not significantly different across conditions: *M*<sub>self</sub> = 37.19, *M*<sub>other</sub> = 38.94 (SD<sub>self</sub> = 10.05, SD<sub>other</sub> = 10.45); *F*<sub>1, 103</sub> = 0.76, *p* = 0.39 based on an ANOVA.

*Post-test survey measures.* State anxiety was measured using the STAI – Form Y-1 (Spielberger et al., 1970, 1983). Participants rated how they felt (e.g., calm, tense) in a particular situation (i.e., during a speech) on a four-point scale ranging from *not at all* (1) to *very much so* (4). This portion of the scale was designed to assess transitory anxiety and it is the most commonly used measure of public speaking state anxiety in empirical studies published in Communication (Behnke and Sawyer, 2004). The reliability of the STAI was α = 0.93 and the average score was *M* = 42.6 (SD = 11.38).

Self-perceived physiological sensations were assessed using the *Body Sensations Questionnaire* (BSQ; Chambless et al., 1984). The BSQ yielded a reliability of α = 0.90 and the average score was *M* = 25.42 (SD = 9.16).

The same 15-item presence scale used in Study 1 was administered in this study. For self-presence, social presence, spatial presence, and overall presence, the reliability was α = 0.89, α = 0.90, α = 0.88, and α = 0.93, respectively. The average score was *M* = 2.43 (SD = 0.88) for self-presence, *M* = 3.42 (SD = 0.87) for social presence, *M* = 3.20 (SD = 0.86) for spatial presence, and *M* = 3.02 (SD = 0.75) for overall presence.

Participants also rated the similarity of their avatar's face with their own face as a manipulation check on the same scale used in Study 1. Similarity ratings were significantly higher in the *self* condition (*M* = 2.1, SD = 1.03) than in the *other* condition (*M* = 4.0, SD = 0.88; *t*(103) = 9.7, *p* < 0.001, *d* = 1.9).

#### **RESULTS**

Descriptive and inferential statistics for anxiety and presence are summarized in **Table 3**. Differences between experimental conditions were tested using ANOVAs where the assumption that residual errors are normally distributed was not significantly violated (all measures but BSQ and self-presence). A non-parametric Mann–Whitney test was used instead for these two measures. Trait public speaking anxiety (PRCA-24) correlated significantly with all types of presence. Trait social anxiety (B-FNE) correlated with spatial presence, but did not correlate significantly with self, social, or overall presence. STAI and BSQ did not correlate with any type of presence. Trait anxiety (PRCA-24 and B-FNE) and state anxiety (STAI and BSQ) measures were all correlated with one another (see Table S2 in the Supplementary Material for correlations between all measures).

Hypothesis 1, that participants with their own face would experience greater anxiety than those with a dissimilar face, was examined with unadjusted and covariate-adjusted comparisons of STAI and BSQ levels. Due to the highly skewed distribution of BSQ scores, we employed a non-parametric test in the unadjusted comparison and resorted to a negative binomial model for the covariate-adjusted regression, as a simpler log-linear or Poisson model did not fit the data sufficiently well. The same set of covariates as in Study 1 was included: sex and B-FNE.

Anxiety measured by BSQ was significantly reduced by 14% [95% CIs = (0.9%, 27%)] in the *other* condition based on the covariate-adjusted regression model (**Table 4**). Sex and B-FNE were significant covariates, with lower anxiety for males than females at the same order of magnitude as the experimental manipulation. B-FNE contributed positively to BSQ in the model. There was no significant interaction effect between the experimental assignment and sex or B-FNE (*z* < 1.0, *p* > 0.30). In contrast to BSQ, anxiety measured by STAI was not significantly lower with a dissimilar face than with the participant's own face, neither in the unadjusted test (**Table 3**) nor in the covariate-adjusted test (**Table 4**).
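A percentage effect such as the 14% reduction reported above is usually obtained by exponentiating a regression coefficient. The following is a minimal sketch of that conversion, assuming the negative binomial model uses the conventional log link; the text does not state the link function. The function names, the coefficient, and the standard error in the example are hypothetical illustrations and are not the values in Table 4.

```python
import math


def percent_change(beta):
    """Percent change in the expected outcome for a one-unit change in a
    predictor, given a coefficient `beta` from a log-link count model."""
    return (math.exp(beta) - 1.0) * 100.0


def percent_change_ci(beta, se, z=1.96):
    """Approximate 95% confidence interval on the percent-change scale,
    obtained by transforming the Wald interval beta +/- z * se."""
    return percent_change(beta - z * se), percent_change(beta + z * se)


# Hypothetical coefficient chosen so that the rate ratio is 0.86,
# i.e. a 14% reduction; the standard error is purely illustrative.
beta_other = math.log(0.86)
se_other = 0.07
point = percent_change(beta_other)
lower, upper = percent_change_ci(beta_other, se_other)
```

Because the transformation is non-linear, the resulting interval is asymmetric around the point estimate on the percent scale, which is consistent with the asymmetric interval reported above.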

To address RQ1, we compared levels of self, social, spatial, and overall presence between conditions using *t*-tests or the non-parametric Mann–Whitney test, depending on the distribution of residual errors (**Table 3**). Self-presence scores were significantly higher with the participant's own face than with a dissimilar face, *W* = 1011, *p* = 0.023, *d* = 0.44. Other types of presence and overall presence were not significantly different (see **Table 3**).

#### **GENERAL DISCUSSION**

In the pilot study, we explored the idea that socially anxious individuals would prefer to become someone else in a social situation. Social anxiety correlated significantly with a preference for embodying a dissimilar avatar. In Study 1, we compared levels of anxiety in three experimental conditions: participants were assigned the real face, a dissimilar face, or given a face of their choice. While this study yielded no statistically significant differences in levels of anxiety, it suggested that participants embodying an assigned self avatar tended to exhibit higher levels of anxiety, followed by participants in the *choice* condition. Participants who were assigned a dissimilar avatar tended to experience the least anxiety of the three groups. Also, we identified significant differences in the sense of presence between the *self* and the *choice* conditions. Participants in the *self* condition experienced a greater sense of presence. Finally, in Study 2, we partially replicated Study 1 focusing on the *self* and *other* avatar conditions. We found significant differences in anxiety in the same direction as in Study 1: participants who were assigned a self avatar experienced 14% higher levels of anxiety measured by BSQ than participants assigned a dissimilar avatar when accounting for differences in sex and B-FNE. Yet, anxiety levels measured by STAI remained unchanged. Regarding presence, participants in the *self* condition experienced greater self-presence than those in the *other* condition.

**Table 3 | Means (SD) and unadjusted statistical tests for each dependent variable**.

<sup>a</sup>Residual errors were not normally distributed, hence a non-parametric test was employed instead of a t-test.

**Table 4 | Regression coefficients (robust standard errors) for anxiety dependent variables with two models**.

Residual errors were normally distributed in (2)-(4). B-FNE was centered for interpretability.

<sup>a</sup>Test of residual deviance indicates a good fit if the p value is not significant (log-linear and Poisson models did not fit the BSQ data sufficiently well, but the negative binomial model was a good fit).

\*Significant coefficient with p < 0.05.

We believe that embodying a dissimilar avatar helped participants reduce their anxiety to some extent. While the pilot study provided strong support for our hypothesis, the results of two experimental studies were more mixed. Thus, follow-up studies with a different procedure, design, or technique need to further investigate whether embodying a different self can in fact reduce anxiety. A possible explanation is that, in general, participants experienced low self-presence both in Study 1 and 2. Thus, it is possible that the process of embodiment and identification with the avatar was not strong enough to make the differences between conditions significant. In connection to this, several limitations can be pointed out. Principally, the avatars were fairly limited in terms of range of movements and face modeling. It would be preferable to render more joints such as elbows or leg movements to provide a more natural body movement to the avatar and make the reflection in the mirror appear more natural. Moreover, we used a generic male or female body for all participants, which sometimes was very different from the participant's real body. Body shape should be taken into account in future experiments. Finally, synchronization with the movement of the avatar in the mirror was done before the virtual audience entered the virtual room. Due to technical failure of the orientation-tracking device, we did not have reliable recordings of the percentage of time participants looked at their mirror image. Future studies should use gaze behavior as a proxy for attention to the mirror image, and examine its mediating role on self-presence.

There are other limitations in the current work. In the pilot study, the avatar similarity measure that we developed should be expanded into a more complete scale. In addition, the manipulation check for facial similarity included in our questionnaire pointed at some issues with the manipulation of avatar similarity. While facial similarity was significantly higher in the *self* than *other* condition in both studies, some participants' ratings were in the opposite direction and inconsistent with open-ended comments provided at the end of the study. The specific question wording may have confused some participants. We therefore decided not to exclude participants based on the manipulation check. Similar issues with explicit ratings were encountered in prior work on doppelgangers (Fox and Bailenson, 2009) and highlight discrepancies between survey measures of perceived similarity and actual avatar similarity. Future research should explore a better measure for manipulation check. Also, we considered trait social anxiety in all our analyses, but we used different strategies across our studies to select our participants regarding prescreening them or not for social anxiety. Other studies should examine this further and perhaps repeat similar experiments with patients diagnosed with social phobia. Study 1 presented other limitations that were fixed in Study 2 as described above.

Our findings have important theoretical and practical implications and future studies are encouraged to continue the line of research presented here. For theories of social anxiety and self-representation, the results of our study help to better understand the mechanisms underlying social anxiety. Also, more research should investigate whether alterations of self-representation should be considered as a potential positive contribution to VR exposure therapy (VRET) for the treatment of social phobia. For instance, further research could examine the effectiveness of progressively increasing the resemblance of the patient's avatar to the real self across VRET sessions. Therapists can leverage the findings to include the virtual self as part of the treatment. Most therapy for overcoming anxieties in VR is focused on exposure. Here, we provide a different approach based on the assumption that a negatively distorted self is at the core of social anxiety. Following this approach, we developed a technique to treat social phobia using VR based on modification of self-appearance. With this tool, therapists can help patients understand their phobia from a different perspective and work on correcting their self-image and improving their confidence in social situations.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fnhum.2014.00944/abstract


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 05 July 2014; accepted: 05 November 2014; published online: 19 November 2014.*

*Citation: Aymerich-Franch L, Kizilcec RF and Bailenson JN (2014) The relationship between virtual self similarity and social anxiety. Front. Hum. Neurosci. 8:944. doi: 10.3389/fnhum.2014.00944*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Aymerich-Franch, Kizilcec and Bailenson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Moving from virtual reality exposure-based therapy to augmented reality exposure-based therapy: a review

### **Oliver Baus<sup>1</sup> and Stéphane Bouchard<sup>1,2</sup>\***

<sup>1</sup> School of Psychology, University of Ottawa, Ottawa, ON, Canada

<sup>2</sup> Department of Psychoeducation and Psychology, Université du Québec en Outaouais, Gatineau, QC, Canada

#### **Edited by:**

Eric Brunet-Gouet, Université de Versailles Saint-Quentin-en-Yvelines, France

#### **Reviewed by:**

Ali Oker, Université de Versailles Saint-Quentin-en-Yvelines, France; Laura De Lacombe, Centre Hospitalier de Versailles, France

#### **\*Correspondence:**

Stéphane Bouchard, Université du Québec en Outaouais, Département de Psychoéducation et de Psychologie, CP 1250, Succ Hull, Gatineau, QC J8X 3X7, Canada e-mail: stephane.bouchard@uqo.ca

This paper reviews the move from virtual reality exposure-based therapy to augmented reality exposure-based therapy (ARET). Unlike virtual reality (VR), which entails a complete virtual environment (VE), augmented reality (AR) limits itself to producing certain virtual elements and merging them into the view of the physical world. Although the general public may only have become aware of AR in the last few years, AR-type applications have been around since the beginning of the twentieth century. Since then, technological developments have enabled an ever-increasing level of seamless integration of virtual and physical elements into one view. Like VR, AR allows exposure to stimuli which, for various reasons, may not be suitable for real-life scenarios. As such, AR has proven itself to be a medium through which individuals suffering from specific phobia can be exposed "safely" to the object(s) of their fear, without the costs associated with programing complete VEs. Thus, ARET can offer an efficacious alternative to some less advantageous exposure-based therapies. Above and beyond presenting what has been accomplished in ARET, this paper covers some less well-known aspects of the history of AR, raises some ARET-related issues, and proposes potential avenues to be followed. These include the type of measures to be used to qualify the user's experience in an augmented reality environment, the exclusion of certain AR-type functionalities from the definition of AR, as well as the potential use of ARET to treat non-small animal phobias, such as social phobia.

#### **Keywords: virtual reality, augmented reality, phobia, exposure therapy, synthetic environments**

According to Moore's law, the number of transistors on integrated circuits doubles approximately every 2 years (Moore, 1965). This growth leads to an exponential growth of technological capabilities. Innovative minds are applying the potential of these new technologies in what, historically, may have been technology aversive fields; mental health was one of those fields. Today, however, it is widely recognized that new technologies such as virtual and augmented realities are showing strong potential in that same field, and more specifically, in the treatment of phobia (Wrzesien et al., 2011a).

The objective of this paper is twofold. First, it reviews the move from virtual reality (VR) systems to augmented reality (AR) systems in the treatment of phobias. Second, it highlights four issues relating to AR: (a) qualifying an AR experience necessitates a set of AR specific instruments [not necessarily those used to qualify a virtual environment (VE) experience]; (b) historically, AR applications have been around a long time before the term "AR" was assigned to the concept; (c) presently, certain AR-type functionalities are excluded from the definition of AR; and (d) the use of augmented reality exposure-based therapy (ARET) has advantages over virtual reality exposure-based therapy (VRET), but these advantages could be exploited beyond the treatment of small animal phobia.

To this aim, the article first addresses some of the evolutions that have led to the use of AR in the treatment of specific phobias. To establish the framework of AR, it is useful to distinguish it from VR. Thus, the paper presents some definitions relating to the technology (VE, VR, and immersion), some of the concepts commonly used to quantify and qualify a user's experience of virtual stimuli (presence, realism, and reality), as well as some of the non-mental health applications of VR. After having covered these basics, the focus shifts toward mental health. More specifically, the implications of suffering from a phobia and two of the possible (traditional) treatments, *in imago* and *in vivo* exposure-based therapies, are presented. Next, VRET, the "direct ancestor" to ARET, is introduced; its documented successes, as well as some of its advantages over traditional exposure-based methods, are presented. After this overview, the focus shifts toward AR, including how it distinguishes itself from VR, some of its advantages over VR, and what criteria must be met to consider a functionality as AR. At this point, the instruments presently used to measure an AR user's experience are discussed, and some concepts to be measured in AR are proposed. The next section addresses the history of AR. This is accomplished in two parts. While the first covers some of the major events that occurred after the coining of the phrase "augmented reality," the second addresses the period going back to the roots of AR, a time-frame less covered by previous publications. From this historic account will emerge an issue relating to the definition of AR: the present one leaves certain AR-type functionalities nameless. A solution will be proposed to close this semantic gap. Next, the AR enabling technologies, and some of the technical challenges faced by the developers are put forward. A variety of AR applications are listed. In particular, publications pertaining to the use of ARET are reviewed; these eight studies either test the efficacy of ARET protocols, compare ARET protocols to other types of exposure-based protocols, compare ARET technologies, or simply quantify users' experiences in an ARET environment. Finally, the last discussion point addresses the limited use of ARET in the treatment of phobias other than small animal phobias. One of the plausible reasons behind this self-imposed restriction, a possible way to break free of it, as well as the potentially resulting opportunity to expand ARET to the treatment of social phobia are discussed. To close, a conclusion reiterates the major points of the paper.

#### **VIRTUAL REALITY**

#### **VIRTUALITY AND ASSOCIATED CONCEPTS**

#### **Virtual environment**

The exact definition of the word "virtuality" depends on the context of its use. However, in the domain of VEs, Theodore Nelson's definition is pertinent; he defines the virtuality of a thing as the "seeming" of that thing (Skagestad, 1998). Indeed, a VE consists of objects or entities seemingly "real" because they share at least one attribute of the "real thing" (usually the appearance), without sharing all of its physical characteristics (volume, weight, surface friction, etc.).

A VE can be defined as a 3D digital space generated by computing technology (e.g., the scenario of a video game). It is comprised of visual stimuli projected on a surface (e.g., a wall, a computer screen, screens of a head mounted display) and, generally, acoustic stimuli produced by an electronic device (e.g., a headset, speakers). Further to these, a VE may also expose the user to haptic (contact), olfactory, or even gustatory stimuli (Sundgren et al., 1992; Burdea and Coiffet, 1994; Kalawsky, 2000; Fuchs et al., 2006). A VE aims to "extract" the user from the "physical" world and "insert" him into a synthetic world; this is accomplished by exposing him to synthetic sensory information that emulates real life stimuli (see **Figure 1**).

#### **Virtual reality**

Virtual reality is an application that, in very near real time, allows a user to navigate through, and interact with, a VE (Pratt et al., 1995). Depending on the type of system and programing, the user may interact with the environment from an egocentric point of view (also known as "first person point of view") or an allocentric point of view (also known as "third person point of view"); in the case of the latter, the user moves a virtual representation of himself (called "avatar"). The user may also act upon virtual objects, and even interact with virtual beings (e.g., persons, animals). Compared to more passive media such as radio and television, the higher levels of cognitive, social, and physical interactivities of VR can boost the effect of the VE on the user (Fox et al., 2009). In more immersive egocentric VR systems, the user can interact with the VE via his own movements by wearing at least one input device (known as "tracker"). The latter detects its own position in space and transmits it continuously to a computer which: (a) continuously compares this data with the database associated to the VE; (b) determines the synthetic stimuli to be triggered; and (c) triggers the output devices to deploy them. All of this is accomplished in very near real time and can implicate multiple sensory modalities. Generally worn at head level, a tracker may support three degrees of freedom (3-DOF) or six degrees of freedom (6-DOF) tracking; while a 3-DOF tracker allows tracking of head rotation only, the 6-DOF version tracks head rotation as well as horizontal and vertical displacements. Trackers may also be used to track specific body parts (e.g., the hands). Generally, the user of a 6-DOF system uses his own body movements for small positional adjustments (turning around, bending down, repositioning, etc.) and a hand-held device (e.g., joy-stick, space ball, 3D mouse) to move over greater distances, such as walking from room to room in an apartment.

**FIGURE 1 | Example of a virtual environment**. Credit, Laboratory of Cyberpsychology, Université du Québec en Outaouais.
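The tracking cycle described in this subsection (read the tracker, compare the data with the VE database, determine the stimuli, trigger the outputs) can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the architecture of any particular VR system: the `Pose` type, the `SCENE` dictionary, the trigger radii, and all function names are hypothetical. It also shows the practical difference between 3-DOF tracking (rotation only) and 6-DOF tracking (rotation plus displacement).

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    """Head pose: rotation in degrees, position in meters."""
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def read_pose(sample, dof):
    """Keep only what a tracker with `dof` degrees of freedom can report."""
    if dof == 6:
        return sample  # rotation and displacement
    if dof == 3:
        return replace(sample, x=0.0, y=0.0, z=0.0)  # rotation only
    raise ValueError("dof must be 3 or 6")


# Hypothetical VE database: stimulus name -> (position, trigger radius).
SCENE = {
    "door_creak": ((2.0, 0.0, 0.0), 0.5),
    "ambient_hum": ((0.0, 0.0, 0.0), 0.5),
}


def stimuli_to_trigger(pose, scene):
    """Steps (a) and (b): compare the pose with the database and select
    the stimuli whose trigger zone contains the user."""
    position = (pose.x, pose.y, pose.z)
    return sorted(
        name
        for name, (center, radius) in scene.items()
        if math.dist(position, center) <= radius
    )


def run_frame(sample, dof, scene, outputs):
    """One cycle of the loop; step (c) is modeled by appending to `outputs`."""
    pose = read_pose(sample, dof)
    outputs.extend(stimuli_to_trigger(pose, scene))
    return pose


# The same physical movement yields different results on 3-DOF and 6-DOF systems.
moved = Pose(yaw=90.0, x=2.0)
six_dof_outputs, three_dof_outputs = [], []
run_frame(moved, 6, SCENE, six_dof_outputs)
run_frame(moved, 3, SCENE, three_dof_outputs)
```

In a real system this cycle would run once per rendered frame, and the outputs would be visual, acoustic, or haptic devices rather than a list.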

#### **Immersion**

Factors such as the number of senses stimulated, the number and level of interactions, as well as the fidelity of the synthetic stimuli contribute to a VR system's level of immersion (Slater et al., 2009). This concept corresponds to the quality and the quantity of the stimuli employed to simulate the environment; it is an objective characterization of the system (Sanchez-Vives and Slater, 2005). At the same time, the level of immersion is also dependent on the ability of the system to isolate the user from stimuli foreign to the VE (e.g., room lights and external noise). Ma and Zheng (2011) use the following guidelines to distinguish between non-immersive, semi-immersive, and immersive VR systems: a non-immersive VR system employs a conventional graphics workstation with a monitor, a keyboard and a mouse; a semi-immersive system uses a relatively high performance graphics computing system coupled with a large surface to display the visual scene; and an immersive VR system projects the visual scene into some kind of head mounted device – or large projection surfaces "encasing" the user – completely filling the user's field of view (see **Figure 2**). The level of immersion, in turn, affects the user's experience in the VE. Three of the dominant concepts used to measure the quality of the user's experience are: the feeling of presence, the level of realism, and the degree of reality.

#### **Presence, realism, and reality**

While various definitions of presence have been proposed, Heeter (1992) views presence as a complex feeling composed of three dimensions: (a) personal presence refers to the feeling of actually being in the VE (versus in the physical room where the immersion takes place); (b) environmental presence refers to the feeling that the VE seems to acknowledge the user by reacting to his actions; and (c) social presence refers to the feeling of not being alone in the VE. Compared to other definitions, the strengths of this one include its fidelity to the actual term "presence," its simplicity, as well as its ability to account for the interactions between the user and the virtual location, objects, and animated entities; thus, Heeter's conceptualization of presence will serve as the reference for this article. The concept of presence, however, is not unique to VEs: watching a movie or a play, looking at a painting, reading a text, or listening to the radio can induce a feeling of presence (Nash et al., 2000).

The level of realism corresponds to the degree of convergence between the expectations of the user and the actual experience in the VE (Baños et al., 2000). Thus, a virtual stimulus that meets the expectations of the user, such as an orange that smells like an orange, is likely to be rated as more realistic than that same orange if it smelled like nothing at all, or if it smelled like fish.

The level of reality refers to the level by which the user experiences the immersion as authentic (Baños et al., 2000). It is felt in response to the stimuli. Thus, a higher level of realism should be associated with a higher level of reality.

#### **VR APPLICATIONS**

Today, the fields in which VR is used are numerous; they include education, health care, communication, engineering, and entertainment (Schuemie, 2003). Within these fields, VR applications may be used for a variety of purposes including pain management (Gold et al., 2007; Hoffman et al., 2008), virtual "visits" of construction projects in development (Brooks et al., 1992), the development of virtual classrooms (Moreno and Mayer, 2007), and collaborative work environments in which the users interact via avatars (Normand et al., 1999; Benford et al., 2001; Joslin et al., 2004; Reeves et al., 2008).

Often, VR is employed as a training tool. In such a function, its advantages include reduced cost, interactivity, and safety. Indeed, VR can offer financially advantageous active learning experiences involving scenarios that are too difficult and/or too dangerous to practice in the "real world." Furthermore, its interactivity (Bailenson et al., 2005) as well as the possibility to pre-program a variety of training scenarios at multiple levels of difficulty can facilitate better learning. VR training applications include: visual inspections of aircraft with various structural flaws (e.g., Vora et al., 2002), the operation of various vehicles (e.g., Tichon et al., 2006), rapid and efficacious decision making by medical doctors (e.g., de Leo et al., 2003; Mantovani et al., 2003; Johnsen et al., 2006; Kenny et al., 2007) and by soldiers (e.g., Hill et al., 2003) in stressful situations, inter-cultural communication training prior to military deployments (e.g., Deaton et al., 2005), emergency management (e.g., Viciana-Abad et al., 2004), surgical procedures (e.g., O'Toole et al., 1998; Harders et al., 2008; Spitzer and Ackerman, 2008), rehabilitation (Rose et al., 1996; Jaffe et al., 2004; Crosbie et al., 2007), stress management (e.g., Bouchard et al., 2012b), and fear management training in the face of a phobia inducing stimulus (e.g., Côté and Bouchard, 2005). The latter form of "training" is more commonly known as VRET. Indeed, VRET is essentially a training activity during which an individual learns to master a task that he is incapable of carrying out: facing a particular stimulus without experiencing unwanted psychological and/or physiological reactions.

## **PHOBIA**

#### **DEFINITION**

While about 9% of the citizens of the United States were reported to suffer from a specific phobia (Gadermann et al., 2012), 60–80% of those affected have been reported not to seek treatment (Agras et al., 1969; Boyd et al., 1990; Magee et al., 1996; Essau et al., 2000). Suffering from a phobia means an individual experiences excessive anxiety when exposed to a certain stimulus; the trigger stimulus may be a specific entity (e.g., an animal species) or a situation (e.g., addressing a group of people, driving). In association with the elevated stress and anxiety, the individual may experience increased heartbeat, sweating, and dry mouth (Abate et al., 2011). In either case, the unrealistic and excessive fear of the stimulus can lead to avoidance behaviors that interfere with the subject's life. Numerous studies suggest that exposure-based treatment is effective in treating phobic fear and avoidance behavior (e.g., Öst, 1989; Öst et al., 1991a, 1997). A lack of treatment can lead to a self-feeding spiral where increasing unrealistic fear feeds avoidance behaviors which, in turn, feed further fear. Untreated, this condition can lead to significant social and economic costs to society (Kessler and Greenberg, 2002; Kessler et al., 2008).

#### **TREATMENT OF PHOBIA**

#### **In imago and in vivo exposure-based therapies**

Years of empirical work point to the efficacy of exposure-based therapy across a variety of anxiety disorders (Richard et al., 2007), and various theories have been proposed to explain its mechanisms of action. These include: the Two-Factor Theory of Fear Acquisition and Maintenance (Mowrer, 1960), the Bioinformational Theory (Lang, 1977), the Emotional Processing Theory (Rachman, 1980), the Emotional Processing Theory Model (Foa and Kozak, 1986), a revised version of the Emotional Processing Theory (Foa and McNally, 1996), the Perceived Control and Self-Efficacy Theory (Mineka and Thomas, 1999), as well as various Neural Networking Models (e.g., Tryon, 2005). Exposure-based treatments do not limit themselves to exposure sessions: the exposure is just the behavioral component of what usually amounts to a cognitive-behavioral protocol. Thus, an exposure-based treatment includes a broader set of behavioral and cognitive therapeutic techniques, including case formulation, cognitive restructuring, relapse prevention, etc.

The exposure component generally implies a gradual hierarchical exposure to the object of the fear in a safe and controlled way. The exposure aims to help the patient convincingly learn that the consequences he fears do not necessarily happen. According to the Emotional Processing Theory (Rachman, 1980; Foa and Kozak, 1986; Foa and McNally, 1996), the exposure works because it allows the patient to fully experience the activation and subsequent natural reduction of fear in the presence of the phobia inducing stimulus (Abramowitz, 2013). Thus, the use of "crutches" (e.g., relaxation exercises) or downright avoidance behaviors (e.g., behaviorally or cognitively ignoring the stimulus) can be detrimental to the clinical efficacy of the exposure (Abramowitz, 2013). More recent models explaining the therapeutic mechanisms of exposure (e.g., Bouton and King, 1983; Craske et al., 2008) propose that the result of a successful exposure-based treatment is not the disappearance of the previously learned association between the stimulus and perceived threat, but the creation of a newly learned association that competes with the old dysfunctional one; repeated exposures and non-avoidance behaviors are meant to establish, strengthen, and maintain the functional response such that it may "overpower" the dysfunctional response, and continue to do so in the long term.

Historically, exposure has been accomplished *in vivo* (facing the actual stimulus or a physical representation of it; see **Figure 3**) and *in imago* (mental imaging of the stimulus). However, each of these techniques has major drawbacks: while a patient may be unwilling to face the actual threat *in vivo*, it might prove too difficult for a patient to mentally visualize the anxiety inducing threat. In fact, it has been reported that when patients find out that the therapy entails facing the threat, about 25% of them either refuse the therapy or terminate it (Marks, 1978, 1992; García-Palacios et al., 2001, 2007).

#### **Virtual reality exposure-based therapy**

Enabled by technological progress, the search for a less threatening and more practical alternative to *in vivo* exposure-based therapy (IVET) has led to the introduction of VRET (see **Figure 3**). During VRET, the patient is immersed in a VE where he faces a virtual representation of the threat. While the patients' acceptance of such a protocol is generally higher than that of IVET (García-Palacios et al., 2001), the efficacy of the exposure-based treatment is not sacrificed.

Indeed, the use of VRET has proved itself effective in treating specific phobias such as acrophobia (Emmelkamp et al., 2001, 2002; Krijn et al., 2004), arachnophobia (García-Palacios et al., 2002), aviophobia (Wiederhold, 1999; Rothbaum et al., 2000, 2002; Maltby et al., 2002; Mühlberger et al., 2003; Botella et al., 2004), claustrophobia (Botella et al., 2000), spider phobia (Michaliszyn et al., 2010), and driving phobia (Wald and Taylor, 2003; Walshe et al., 2003). In fact, a meta-analysis by Powers and Emmelkamp (2008) suggests that, in the domain of phobias and anxiety disorders, VRET is slightly, but significantly, more effective than IVET.

Virtual reality exposure-based therapy also enjoys other advantages over IVET (Botella et al., 2005). These include better control of the anxiety inducing stimulus which, of course, poses no real threat (i.e., a virtual dog can't bite). Thus, the patient need not fear being hurt. The exposure scenarios, however complex they may be, can be stopped, paused, restarted, and repeated whenever and as many times as deemed necessary. Furthermore, the entire exposure process can be completed in the safety and privacy of the practitioner's office. In the case of animal phobia, VRET relieves the therapist of the problems associated with finding, taking care of, and handling live animals. Finally, some therapists find VRET more acceptable, helpful, and ethical than IVET (Richard and Gloster, 2007).

#### **AUGMENTED REALITY**

#### **DISTINGUISHING AUGMENTED REALITY FROM VIRTUAL REALITY**

With time, further technological advances led to the development of another method of exposure: ARET (see **Figure 3**). In contrast to VR systems, which generate a complete VE, AR systems enhance the non-synthetic environment by introducing synthetic elements to the user's perception of the world (see **Figure 4**). While VR substitutes the existing physical environment with a virtual one, AR uses virtual elements to build upon the existing environment (Azuma, 1997; Azuma et al., 2001). Milgram and Kishino (1994) present AR as a form of mixed reality (MR), that is, a "particular subclass of VR related technologies" (Milgram and Kishino, 1994, p. 1321), which, via a single display, expose the user to electronically merged synthetic and non-synthetic elements. Milgram's Reality–Virtuality Continuum serves to illustrate where MR situates itself in comparison to real environments and VEs (see **Figure 5**). Between these two poles exist various combination levels of synthetic and non-synthetic elements: to the right of center are the environments where virtuality provides the surrounding environment (augmented virtuality), and to the left of center are the environments where reality provides the surrounding environment (AR). It is important to note that AR does not limit itself to introducing virtual elements into the physical world; it may also inhibit the perception of physical objects by overlaying them with virtual representations, such as virtual objects or even virtual empty spaces. Although AR can be extended to hearing, touch, as well as smell (Azuma et al., 2001), this article will limit itself to the sense of vision.

In contrast to a VR user, the user of AR does not "depart" the space he occupies, thus he "maintains his sense of presence" in the non-synthetic world (Botella et al., 2005). He is, however, put in co-presence with virtual elements that are blended into the non-synthetic world. Azuma et al. (2001) propose that, to be considered AR, a system must: (1) combine real and virtual objects in a real environment; (2) run interactively, and in real time; and (3) register (align) real and virtual objects with each other. The purposes of the virtual elements include enhancing the experience and/or the knowledge of the user (Berryman, 2012). They could represent advisories (e.g., name of a building, distance to destination) or entities (e.g., an object, a person).

**the laboratory) augmented by a synthetic element (the small person standing on a non-synthetic table)**. Credit, Laboratory of Cyberpsychology, Université du Québec en Outaouais.

#### **ABOUT QUALIFYING THE AR EXPERIENCE**

Thus, experiencing an AR environment is fundamentally different from experiencing a VE: unlike the user of a VE, the user of an augmented reality environment (ARE) is not "transported" to a different location, and consequently, there is no immersion *per se*. Instead, it is the virtual elements that are transported into, and aligned with, the user's world. It could be said that in a VE, the user "intrudes" in the virtual world, while in an ARE, it is the virtual objects that "intrude" in the user's world. Thus, the means by which the quality of a user's experience is measured may need to be modified slightly. However, as **Table 1** suggests, the instruments used to qualify a user's experience in AR are often the same as those used in VR.

In an ARE, measures of realism (degree of convergence between the expectations of the user and the actual experience in the VE) and reality (level to which the user experiences the hybrid environment as authentic) are still pertinent. However, this may not be the case for presence. If Heeter's (1992) conceptualization of presence is used as reference, it can be argued that measures of the environmental presence (the feeling that the environment seems to acknowledge the user's movements by reacting to his actions; Heeter, 1992) and social presence (the feeling of not being alone in the environment; Heeter, 1992) can also be pertinent to qualify the experience of the user in an ARE. However, unlike VR where social presence measures the level of "togetherness" between the user and virtual agents, in AR, the level of "togetherness" between the user and individuals physically present in the environment may also be of interest. On the other hand, a measure of personal presence does not seem pertinent in an ARE; indeed, the user is not "transported" to a different location, and thus, the value of measuring the level of personal presence in a location the user never left may be questionable.

#### **Table 1 | Clinical and experience related measures taken during past ARET studies**.

On the other hand, a measure addressing the alignment of real and virtual elements could contribute to an overall assessment of the quality of a user's experience in an ARE; this could be in the form of a measure of co-existence between the virtual and the non-virtual elements. Further co-existence measures could assist in qualifying the experience of an ARE. These could include co-existence measures between the user and virtual elements, as well as between the user and non-virtual elements (this last measure could be used as a baseline to put the level of co-existence between user and virtual elements into context).

#### **History of augmented reality**

The term "augmented reality" was introduced in 1990 by Tom Caudell while working on Boeing's Computer Services' Adaptive Neural Systems Research and Development project in Seattle (Carmigniani et al., 2011). There, alongside David Mizell, he developed an application that displayed a plane's schematics on the factory floor (Vaughan-Nichols, 2009), thereby saving the mechanics the difficult task of interpreting abstract diagrams in manuals (Berryman, 2012). Two further AR pioneering projects were Rosenberg's Virtual Fixtures and Feiner and colleagues' knowledge-based augmented reality for maintenance assistance (KARMA). Results of the Virtual Fixtures project suggested that teleoperator performance can be enhanced by overlaying abstract sensory information in the form of virtual fixtures on top of sensory feedback from a remote environment (Rosenberg, 1993). KARMA used 3D graphics to guide a user through the steps to carry out some of the complex tasks of printer maintenance/repair (Feiner et al., 1993). In 1993, Loral Western Development Laboratories took AR to a new level by introducing AR to live training involving combat vehicles (Barilleaux, 1999), and in 1994, in a completely different field, Julie Martin created "Dancing in Cyberspace," the first AR theater production featuring dancers and acrobats interacting with virtual objects in real time (Cathy, 2011).

Some of the other important developments for AR include: Kato and Billinghurst (1999) created AR Toolkit, the first widely adopted software to solve tracking and object interaction; the next year, Thomas et al. (2000) developed ARQuake, the first outdoor mobile AR video game; the year 2008 saw the development of applications such as Wikitude, which uses a smartphone's camera view, internet, and GPS (or Wi-Fi) positioning to display information about the user's surroundings (Perry, 2008); in 2009, AR Toolkit was brought to the web browser by Saqoosha (Cameron, 2010), and SiteLens, an application that allows visualization of relevant virtual data directly in the context of the physical site, was introduced (White and Feiner, 2009); in 2011, Laster Technologies incorporated AR in ski goggles (e.g., ITR News, 2011), while Total Immersion created D'Fusion, a platform to design AR projects for mobile, web based, and professional applications (Maurugeon, 2011); and finally, in 2013, Google began to test Google Glass, a pair of AR glasses connected wirelessly to the internet via the user's cellphone wireless service.

While the coining of the phrase "augmented reality" is an important historical reference, the concept at the source of the phrase had made its mark long before 1990. In fact, it was in 1901 that Lyman Frank Baum, an American author of children's books, put on paper what may have been the first idea for an AR application. In his novel titled *The Master Key* (1901), he wrote:

"The third and last gift of the present series," resumed the Demon, "is one no less curious than the Record of Events, although it has an entirely different value. It is a Character Marker."

"What's that?" inquired Rob.

"I will explain. Perhaps you know that your fellow-creatures are more or less hypocritical. That is, they try to appear good when they are not, and wise when in reality they are foolish. They tell you they are friendly when they positively hate you, and try to make you believe they are kind when their natures are cruel. This hypocrisy seems to be a human failing. One of your writers has said, with truth, that among civilized people things are seldom what they seem."

"I've heard that," remarked Rob.

"On the other hand," continued the Demon, "some people with fierce countenances are kindly by nature, and many who appear to be evil are in reality honorable and trustworthy. Therefore, that you may judge all your fellow-creatures truly, and know upon whom to depend, I give you the Character Marker. It consists of this pair of spectacles. While you wear them every one you meet will be marked upon the forehead with a letter indicating his or her character. The good will bear the letter "G," the evil the letter "E." The wise will be marked with a "W" and the foolish with an "F." The kind will show a "K" upon their foreheads and the cruel a letter "C." Thus you may determine by a single look the true natures of all those you encounter." (Baum, 1901, pp. 37–38)

Although potentially useful, the "character marker" was not developed into a concrete application. However, around that same timeframe, an AR-like application saw the light of day: the reflector ("reflex") gunsight. The concept behind the gunsight was published by Grubb (1901): "the sight which forms the subject of this paper attains a similar result not by projecting an actual spot of light or an image on the object but by projecting what is called in optical language a virtual image upon it" (Grubb, 1901, p. 324). Although its first employment is difficult to date exactly, the reflector gunsight was operational in German fighter aircraft by 1918 (Clarke, 1994). Installed in front of the pilot, and in line with the aircraft's gun(s), the reflector gunsight consisted of a 45° angle glass beam splitter on which an image (e.g., an aiming reticle) was projected (Clarke, 1994); thus, it superposed virtual elements on real world elements. Its purpose was to assist pilots in hitting their targets by providing them with a reference aiming point. However, according to Azuma and colleagues' definition, this type of concept cannot be considered an AR system. Indeed, it only met two of their three criteria of AR: it combined real and virtual objects in a real environment and it ran interactively, and in real time; it did not, however, align real and virtual objects with each other. On the other hand, the reflector gunsight does meet Berryman's stated purpose of AR: the enhancement of the experience and/or the knowledge of the user (Berryman, 2012).

The difficulty of categorizing this type of application as AR persists for the follow-on systems. As it was known that the trajectory of the bullets was influenced by the shooting aircraft's flight parameters, the newer generation of gunsights started to take these into account when displaying the aiming reticle. This type of application does, to a certain extent, take real world parameters into consideration, but it still does not align real and virtual objects. Thus, it does not quite meet Azuma and colleagues' definition of AR.

The first operational system that did meet all three of Azuma and colleagues' premises of AR seems to have been the AI Mk VIII Projector System (earlier variants had been successfully tested but never entered service). As the name suggests, the picture of the AI Mk VIII radar was projected onto the pilot's windscreen, thereby superimposing the virtual cue onto the real world position of the target aircraft (as seen from the cockpit; Clarke, 1994). Although the alignment of virtual and real elements may have been somewhat rudimentary, the AI Mk VIII seems to have been, long before the term "augmented reality" was coined, the first AR system.

The concept of projecting flight parameters and target information on a see-through surface eventually led to the development of the head-up display (HUD), the helmet mounted sight (HMS), and the helmet mounted display (HMD). The latter, invented in the 1960s by Professor Ivan Sutherland and his graduate student Bob Sproull, can be considered one of the major technological breakthroughs that furthered the development of AR (and VR; Berryman, 2012). Nicknamed "The Sword of Damocles" (see **Figure 6**) due to the fact it was suspended from the ceiling over the user, their see-through head mounted display was able to present simple 3D wireframe models of generated environments (Sutherland, 1968). In the 1970s and 1980s, the United States Air Force and the National Aeronautics and Space Administration were among the organizations that further researched AR and its potential applications (Feiner, 2002). The integration of HMSs (in the 1970s) and HMDs (in the 1980s) in fighter aircraft was among the concrete results of this research; today's Google Glass can be seen as a technological offspring of the HMD. Although military applications may have been an important motor in the development of AR technologies, entertainment oriented applications, such as Myron Krueger's Videoplace, also occupy important places in the history of AR (Dinkla, 1997).

#### **ABOUT THE DEFINITION OF AR**

**FIGURE 6 | The Sword of Damocles (circa 1968)**. Reprinted from Sherman and Craig (2003), with permission from Elsevier.

Thus, AR applications had been around long before a term was assigned to the concept. However, a look back at history and forward to the future of AR also reveals that the present definition of AR excludes some AR-like functionalities, such as the display of information that is overlaid onto, but not merged with, the real world (e.g., speed of car projected onto the inside of the windshield). As AR functionalities may co-exist with such AR-like functionalities (e.g., an arrow to indicate where to turn and an indication of the distance to go before that turn), it could be useful to find a term that describes functionalities that don't register real and virtual objects with each other. Using Azuma and colleagues' widely accepted definition of AR as an anchor, the authors of the present paper propose the term "non-registered augmented reality" (NRAR) to describe functionalities that: (1) combine real and virtual objects in a real environment; and (2) run interactively, and in real time; but (3) don't register (align) real and virtual objects with each other.
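The distinction between AR and NRAR can be summarized as a small decision rule. The following Python sketch is illustrative only: the `SystemTraits` class, its field names, and the `classify` function are hypothetical names introduced here, and encoding each of Azuma and colleagues' criteria as a simple boolean is a simplifying assumption. The two example systems use the traits attributed to them earlier in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemTraits:
    """Hypothetical record of the three criteria of Azuma et al. (2001)."""
    combines_real_and_virtual: bool   # criterion 1
    interactive_real_time: bool       # criterion 2
    registers_real_and_virtual: bool  # criterion 3 (alignment)


def classify(traits: SystemTraits) -> str:
    """Return 'AR', 'NRAR', or 'neither' for a set of traits."""
    # Criteria 1 and 2 are shared by AR and NRAR.
    if not (traits.combines_real_and_virtual and traits.interactive_real_time):
        return "neither"
    # Criterion 3 (registration) separates AR from NRAR.
    return "AR" if traits.registers_real_and_virtual else "NRAR"


# Traits as described in the text (boolean encoding is an assumption).
reflector_gunsight = SystemTraits(True, True, False)
ai_mk_viii_projector = SystemTraits(True, True, True)

print(classify(reflector_gunsight))    # NRAR
print(classify(ai_mk_viii_projector))  # AR
```

Under this reading, a system such as the reflector gunsight, which meets the first two criteria but not the third, would fall under the proposed NRAR term rather than being excluded altogether.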

#### **ENABLING TECHNOLOGIES**

While the exact configurations of AR systems vary, their common elements include: (a) a means of providing a geospatial datum to the synthetic elements; (b) a surface to project the environment to the user; (c) sufficient processing power to generate the 3-D synthetic elements and merge them with the pointing device's input; and (d) adequate graphics power to animate the scene on the display (see **Figure 7**). A detailed review of each of the hardware pieces of AR systems is beyond the scope of this paper (for a more detailed overview of AR technologies, see Carmigniani et al., 2011), but it is worthwhile to mention some details about possible methods of geospatial referencing and the types of visual displays.

In order to achieve a near seamless integration of the virtual elements in the non-synthetic environment, 3D tracking must be able to define accurately the orientation and position of the user relative to the scene. To this end, magnetic, mechanical, acoustic, inertial, optical, or hybrid technologies have been used (Bowman et al., 2005). These technologies may provide: (a) the user's and the virtual elements' respective positions and orientations on a geospatial grid (e.g., GPS); or (b) the position and orientation of the user relative to a reference point recognizable by the pointing device (e.g., a visual marker). In the case of markers, these may be visible or invisible to the human eye.
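Option (b) can be illustrated with a minimal sketch of marker-relative placement. The example below is simplified to two dimensions (a planar position plus a heading) and does not reproduce any particular tracking library; the names `Pose2D` and `place_virtual_element` are hypothetical. It assumes the tracker reports the marker's pose in the viewing device's frame, or nothing at all when the marker is not detected, in which case no virtual element can be placed.

```python
import math
from typing import Optional, Tuple

# (x, y, heading in radians) of the marker, expressed in the device's frame.
Pose2D = Tuple[float, float, float]


def place_virtual_element(
    marker_pose: Optional[Pose2D],
    offset_in_marker: Tuple[float, float],
) -> Optional[Tuple[float, float]]:
    """Return the element's position in the device frame, or None.

    The element is defined by a fixed offset in the marker's own frame;
    that offset is rotated by the marker heading and then translated by
    the marker position.
    """
    if marker_pose is None:
        # Marker not detected: the element cannot be registered.
        return None
    mx, my, theta = marker_pose
    dx, dy = offset_in_marker
    x = mx + dx * math.cos(theta) - dy * math.sin(theta)
    y = my + dx * math.sin(theta) + dy * math.cos(theta)
    return (x, y)


# Illustrative values: marker 1 m right and 2 m ahead, rotated 90 degrees.
position = place_virtual_element((1.0, 2.0, math.pi / 2), (1.0, 0.0))
print(position)
```

A full 3D system would use a rotation matrix or quaternion in place of the single heading angle, but the composition of a marker pose with a fixed offset follows the same pattern.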

The visual displays used in AR may be categorized as projective, handheld, and head-level devices (Azuma et al., 2001). The latter can be as bulky as head mounted displays, as light as eyeglasses, and as inconspicuous as contact lenses. Two types of systems may display the composite environment to the user: (a) a video see-through (VST) AR system (Botella et al., 2005; Juan et al., 2005); and (b) an optical see-through (OST) system (Juan et al., 2007). While a VST system exposes the user to images composed of a video feed of the non-synthetic environment merged with synthetic elements, an OST system overlays the synthetic elements on a transparent surface (e.g., glass) through which the user sees the non-synthetic environment. It is worthwhile to note that, unlike an OST system, a VST system requires a means to capture the non-synthetic environment (e.g., a web cam). In terms of user experience, a major difference between these two systems is the effect of computer graphics latency. Indeed, the user of an OST system may detect a lack of synchronization between the environment (observed in real time) and the view of the synthetic elements (displayed after some degree of graphics latency). A VST system, on the other hand, can delay the display of the video feed to synchronize video and graphics; as a result, the user detects no delay between the video of the physical world and the virtual elements, but may detect a delay between actual head movement and the head movement shown in the video of the physical world. Thus, in choosing the appropriate AR system for a particular application, one of the choices to be made is whether it is less adverse to have: (a) a slight lack of synchronicity between the environment and the synthetic elements (as in an OST); or (b) the graphics latency applied to the video transmission of the environment, thus creating a very slight time lag between actual movement felt by the body and movement detected by the user's vision (as in a VST).
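The trade-off between the two display types can be made concrete with a back-of-the-envelope sketch. It relies on a simple first-order model that is an assumption and is not given in the text: in an OST display, the apparent angular offset between real and virtual elements is taken to be the head's angular velocity multiplied by the graphics latency, whereas in a VST display the same latency is applied to the whole view. The names `LatencyEffect`, `ost_effect`, and `vst_effect`, as well as the numeric values, are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyEffect:
    misregistration_deg: float  # offset between real and virtual elements
    view_delay_s: float         # lag between head movement and the view of the world


def ost_effect(head_speed_deg_s: float, graphics_latency_s: float) -> LatencyEffect:
    """OST: the world is seen in real time, the graphics lag behind it."""
    return LatencyEffect(
        misregistration_deg=head_speed_deg_s * graphics_latency_s,
        view_delay_s=0.0,
    )


def vst_effect(graphics_latency_s: float) -> LatencyEffect:
    """VST: the video is delayed to match the graphics, so both lag together."""
    return LatencyEffect(misregistration_deg=0.0, view_delay_s=graphics_latency_s)


# Illustrative values: a 50 deg/s head turn with 100 ms of graphics latency.
print(ost_effect(head_speed_deg_s=50.0, graphics_latency_s=0.1))
print(vst_effect(graphics_latency_s=0.1))
```

In this simplified model, the OST user sees the virtual elements trail the real scene by a few degrees during the head turn, while the VST user sees a correctly aligned scene that arrives slightly late; which of the two is less adverse depends on the application.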

#### **TECHNICAL CHALLENGES**

One of the important technical challenges of AR is to make the integration of the virtual elements into the non-synthetic environment as seamless as possible, thus giving the user the illusion of the co-existence of virtual and non-synthetic elements in a "unique world" (Botella et al., 2010, p. 402). The illusion must be maintained during the entire exposure, regardless of the angle or height from which the user observes the virtual elements. This requirement of complete fusion of the virtual elements into the non-synthetic world implies significant programing, and this challenge is further accentuated if the virtual elements are not stationary (e.g., a group of virtual spiders moving about a non-virtual table).

#### **AUGMENTED REALITY APPLICATIONS**

As the technology supporting AR developed, AR has been researched and used in various fields such as education (Kerawalla et al., 2006; Arvanitis et al., 2009), medicine (De Buck et al., 2005), architecture (Grasset et al., 2001), maintenance (Schwald and Laval, 2003), entertainment (Özbek et al., 2004), and disaster management (Leebmann, 2006). In the field of mental health, the use of new technologies holds many promises (e.g., Botella et al., 2010). Like VR, AR allows patients to have easier access to mental health services and, due to the strong representational and immersion capability of these technologies, AR can enhance the patients' engagement in the treatments (Coyle et al., 2007).

#### **Augmented reality in exposure-based therapy against phobia**

*Advantages of ARET over VRET.* In the treatment of phobias via exposure-based treatments, ARET enjoys the same advantages over IVET as VRET does (e.g., control over the scenario, safety, variety of stimuli, confidentiality, repetition, and self-training). However, as AR requires that only a few virtual elements be designed, the cost of producing the environment is reduced. Furthermore, unlike VR, AR does not "extract" the user from the real world (Dünser et al., 2011). Thus, the AR user's experience of the environment does not hinge on his ability to "build" a sense of presence. Furthermore, the user of an ARE is able to see his own body interact with the virtual elements (versus seeing a virtual representation of his body; see **Figure 8**). By embedding the virtual fear element in the real environment and allowing a direct "own-body" perception of that environment, the ecological validity of the scenario is increased (Dünser et al., 2011). The implications of these advantages of AR over VR include a less costly system that could elicit a greater sense of presence and better reality judgment of the objects (Botella et al., 2005).

*Efficacy of ARET in treating small animal phobia.* Botella et al. (2005) seem to have published the first study regarding the treatment of a specific phobia using ARET. In a single subject study applying the "one-session treatment" guidelines of Öst et al. (1991b), they successfully treated a participant who initially met the DSM-IV (American Psychiatric Association, 2000) diagnosis of small animal phobia (in this case, cockroaches). During the course of the study, they demonstrated not only the ability of the virtual cockroaches to activate a patient's anxiety, but also a reduction in anxiety as the 1-h period of exposure progressed. More specifically, important decreases in the scores of fear, avoidance, and belief in catastrophic thought were measured (the types of measures are shown in **Table 1**). Furthermore, after the treatment, the participant was capable of approaching, interacting with, and killing live cockroaches. The results were maintained in a follow-up conducted 1 month after the termination of the treatment. Although this study showed promising results, the authors remark that they needed to be confirmed with bigger samples and other pathologies (Botella et al., 2005).

That same year, Juan et al. (2005) published a similar study involving nine participants who met the DSM-IV-TR's (American Psychiatric Association, 2000) criteria for a specific phobia (five participants feared cockroaches and four feared spiders). Using the "one-session treatment" guidelines developed by Öst et al. (1991b), the ARET protocol followed four distinct steps: (a) simple exposure to a progressively increasing number of animals (cockroaches or spiders, as applicable); (b) approaching a progressively increasing number of the animals with the hand; (c) looking under four boxes to uncover, or not, the feared animal(s); and (d) observing the therapist repeatedly crush spiders or cockroaches and throw them into a box, before doing so oneself. The study demonstrated that the AR system was able to induce anxiety in individuals suffering from spider or cockroach phobia. In all cases, the treatment successfully reduced the participants' fear and avoidance of the target animal (the types of measures are shown in **Table 1**). In fact, after the treatment, all of the participants were able to approach the live animals, interact with them, and kill them by themselves. The authors point out that their results are positive for the future of AR in psychology, but that follow-on studies should include a larger sample and a control group.

In 2010, Botella and colleagues published the results of another study testing an AR system for the treatment of cockroach phobia (Botella et al., 2010). Compared to the previous studies on ARET, this one introduced a longer period of post-treatment retest (3, 6, and 12 months). The six participants met the DSM-IV-TR (American Psychiatric Association, 2000) criteria for Specific Phobia animal type (Cockroach Phobia), and the treatment was preceded by two 60 min assessment periods during which: (a) the ADIS-IV for specific phobia was administered; (b) the target behaviors as well as the exposure hierarchy were established; and (c) the participants completed other self-report measures. The intensive exposure-based treatment, lasting up to 3 h, followed the "one-session treatment" guidelines developed by Öst et al. (1991b). Various measures of anxiety, avoidance, and belief in negative thoughts were taken pre-, per-, and post-treatment (the types of measures are shown in **Table 1**). The data collected in this study indicate that the AR system was able to induce anxiety in all participants. Post-treatment, all of the patients: (a) had improved significantly in the level of fear, avoidance, and belief in negative thoughts related to the main target behavior (the gains were maintained at 3, 6, and 12-month follow-up periods); and (b) were able to interact with real cockroaches (an act they were unable to carry out pretreatment). Thus, the results of this study support the findings of the aforementioned ones, that is, ARET can be efficacious against a specific animal phobia. However, the authors point to some of the limitations of their study, namely, the small number of participants, the absence of a control group, and the absence of a formal test for cybersickness; the latter refers to a form of motion sickness that can be experienced by the user of an immersive synthetic environment.

That same year, Breton-Lopez and colleagues published a study aiming to explore the ability of an AR system to induce anxiety in six participants who met the DSM-IV-TR (American Psychiatric Association, 2000) criteria for cockroach phobia. As a secondary objective, the authors aimed to verify their system's ability to elicit a sense of presence and reality judgment. In the ARE, the participants were exposed, in an order established according to each individual's hierarchy of fears, to various elements programed in the AR system. Throughout this process, the participants rated their levels of anxiety, presence, and reality judgment (the types of measures are shown in **Table 1**). Regarding the level of anxiety, the results confirmed that the system is capable of inducing anxiety in all participants, and that the levels of anxiety decreased progressively during a prolonged exposure to the anxiety inducing stimuli. The novel aspect of the findings is that the exposures to "one insect in movement" and "more insects in movement" elicited, in all participants, higher levels of anxiety than stationary insects. This result suggests that the movement of the animal may be an important element to integrate in this type of application. Regarding presence, the authors report that all participants were able to "immerse themselves" in the AR environment and that they attributed a high level of reality to the cockroaches. Overall, the authors conclude that their results confirm the ability of their AR system to contribute to the treatment of cockroach phobia.

In 2011, Wrzesien and colleagues evaluated the Human Computer Interface and clinical aspects of their AR system for cockroach phobia (Wrzesien et al., 2011b). To this end, five "clients" (neither the diagnosis nor the instrument used for the diagnosis is reported) were treated individually following one-session (Öst, 2000) ARET clinical guidelines. The data collected showed posttreatment improvements in the levels of anxiety, avoidance, and belief in catastrophic thoughts (the types of measures are shown in **Table 1**). More specifically, while the clients had not been able to get closer than 1 or 2 m to a real cockroach prior to the treatment, after the therapy, they were able to put a hand into a terrarium with a real cockroach. The authors conclude that, although the ARET system was effective in these clinical cases, the small size of the sample and the absence of a control group should be addressed to confirm the results.

That same year, Wrzesien and colleagues published what seem to be the first (preliminary) results concerning a comparative study between IVET and ARET (Wrzesien et al., 2011a). For the purpose of this study, 12 participants who met the DSM-IV-TR (American Psychiatric Association, 2000) criteria for a specific phobia to small animals (spiders and cockroaches) were randomly assigned to an IVET or an ARET group. The therapeutic sessions, which followed the "one-session treatment" protocol (Öst, 2000), included a single intensive exposure session of up to 3 h; the exposure exercises had been defined previously and were ordered according to each participant's hierarchy of fears. Measures of avoidance, anxiety, and irrational thoughts were taken throughout the protocol (the types of measures are shown in **Table 1**). While the results of this pilot study suggest that both ARET and IVET are clinically effective, some differences were noted between the groups. For both groups, the clinical measures of anxiety, avoidance, and avoidance behavior decreased significantly after the therapeutic session. However, the clinical measure of belief in catastrophic thought only improved significantly in the ARET group. Between the groups, the authors report a significantly greater improvement in the avoidance score of the IVET group, but no differences in improvement in the anxiety, belief in catastrophic thought, or behavior avoidance measures. The authors suggest that the small size of the clinical sample may have played a role in the differences between the groups.

Botella et al. (2011) published the results of another single case study combining a serious game on a mobile phone with ARET. As this study involved AR in the treatment of a phobia (in this case, cockroach phobia), it is mentioned here. However, the combination of protocols goes beyond the scope of this paper; thus, this study is not reviewed.

In 2013, Wrzesien and colleagues tested a new display technology they called therapeutic lamp (TL), a projection-based AR system for therapy for small-animal phobia (Wrzesien et al., 2013). Unlike the head-level AR systems, their system has the advantage of not requiring the use of a head mounted display. The non-clinical sample of 26 volunteers underwent a single exposure-based therapy protocol composed of 12 exercises (from least to most anxiety inducing). The results indicated that anxiety scores, although relatively high at the beginning of each exercise, dropped by the end and after the session. Furthermore, the participants' belief in their capacity to face a cockroach had increased significantly after the session (the types of measures are shown in **Table 1**). The authors conclude that TL can be a useful therapeutic tool for other psychological disorders, but that their results need to be validated with phobia patients.

*About the types of phobias treated by ARET.* All of the cases of ARET research projects found in preparation for this paper involved small animal phobia. One of the factors behind this restricted use of ARET may be related to the use of visual markers to track the orientation and position of the user relative to the scene. Indeed, the use of visual markers implies that as soon as part of the marker is not in the user's field of view, the virtual stimulus disappears completely. This technological limitation prevents the use of ARET in treating certain phobias (e.g., larger animal phobias, the fear of public speaking, the fear of thunder and lightning). While it may be difficult for cyberpsychology laboratories working without significant technical support to implement alternative tracking technologies, it could be constructive for those who do benefit from such support to experiment with ARET protocols using alternative tracking methods (for examples, refer to the section titled "enabling technologies"); such developments may unlock ARET's access to many potentially useful treatment protocols. One of these may be the treatment of social phobia. Presently, some of the VEs destined to provide support in the treatment of social phobia rely on public speaking tasks. Depending on the target population, the environment often consists of public speaking rooms such as auditoriums, conference rooms, and classrooms. In this type of situation, one advantage of ARET is that the exposure could take place in the actual places where the patient encounters his difficulties (e.g., an accounting officer who finds it difficult to present the financial results to the board members, or a child who isn't able to present in class). Thus, assuming that the required tracking technology is developed, AR could provide such *in situ* training.

#### **CONCLUSION**

The purpose of this paper was to review the move from VRET to ARET. Unlike VR, which entails a complete VE, AR limits itself to producing certain virtual elements and merging them into the view of the physical world. Although the general public may only have become aware of AR in the last few years, AR type applications have been around since the beginning of the twentieth century. Since then, technological developments have enabled an ever increasing level of seamless integration of virtual and physical elements into one view. Like VR, AR allows exposure to stimuli which, for various reasons, may not be suitable for real-life scenarios. As such, AR has proven itself to be a medium through which individuals suffering from specific phobia can be exposed "safely" to the object(s) of their fear, without the costs associated with programming complete VEs. Thus, ARET can offer an efficacious alternative to some less advantageous exposure-based therapies. Above and beyond presenting what has been accomplished in ARET, this paper also raised some AR related issues and proposed potential avenues to be followed. These include the definition of an AR related term, the type of measures to be used to qualify the experience of ARE users, as well as the development of alternative geospatial referencing systems, which may themselves open the door to other ARET applications, such as the treatment of social phobia. Overall, it may be said that the use of ARET, although promising, is still in its infancy but that, given continued cooperation between clinical and technical teams, ARET has the potential of going well beyond the treatment of small animal phobia.

#### **REFERENCES**


Azuma, R. T. (1997). A survey of augmented reality. *Presence (Camb.)* 6, 355–385.


of stress management skills in soldiers. *PLoS ONE* 7:e36169. doi:10.1371/journal.pone.0036169



Burdea, G., and Coiffet, P. (1994). *Virtual Reality Technology*. New York, NY: Wiley.

Feiner, S., MacIntyre, B., and Seligman, D. (1993). Knowledge-based augmented reality. *Commun. ACM* 36, 53–62. doi:10.1145/159544.159587


Marks, I. M., and Mathews, A. M. (1979). Case histories and shorter communication. *Behav. Res. Ther.* 17, 263–267. doi:10.1016/0005-7967(79)90041-X


*Proceedings of the Human-Computer Interaction – INTERACT 2011* – *Part I*, Vol. 6946, eds P. F. Campos, T. C. N. Graham, J. A. Jorge, N. J. Nunes, P. A. Palanque, and M. Winckler (Lisbon, Portugal: Springer), 523–540.

Wrzesien, M., Burkhardt, J.-M., Alcañiz Raya, M., and Botella, C. (2011b). Mixing psychology and HCI in evaluation of augmented reality mental health technology. *Paper Presented at the ACM CHI Conference on Human Factors in Computing Systems*, Vancouver, BC. doi:10.1145/1979742.1979898

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 19 December 2013; accepted: 13 February 2014; published online: 04 March 2014.*

*Citation: Baus O and Bouchard S (2014) Moving from virtual reality exposure-based therapy to augmented reality exposure-based therapy: a review. Front. Hum. Neurosci. 8:112. doi: 10.3389/fnhum.2014.00112*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Baus and Bouchard. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# HUMAN NEUROSCIENCE

REVIEW ARTICLE published: 17 October 2014 doi: 10.3389/fnhum.2014.00844

## The use of virtual reality in craving assessment and cue-exposure therapy in substance use disorders

#### **Antoine Hone-Blanchet<sup>1</sup>\*, Tobias Wensing<sup>1</sup> and Shirley Fecteau<sup>1,2</sup>**

<sup>1</sup> Laboratory of Canada Research Chair in Cognitive Neuroscience, Centre Interdisciplinaire de Recherche en Réadaptation et Intégration Sociale, Centre de Recherche l'Institut Universitaire en Santé Mentale de Québec, Faculté de Médecine, Université Laval, Quebec, QC, Canada

<sup>2</sup> Berenson-Allen Center for Non-invasive Brain Stimulation, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

#### **Edited by:**

Ali Oker, Université de Versailles, France

#### **Reviewed by:**

Eric Brunet-Gouet, Centre Hospitalier de Versailles, France

Yasser Khazaal, Geneva University Hospitals, Switzerland

#### **\*Correspondence:**

Antoine Hone-Blanchet, Centre Interdisciplinaire de Recherche en Réadaptation et Intégration Sociale, 525, Boulevard Hamel, Bureau H-1312, Québec, QC G1M 2S8, Canada e-mail: antoine.hone-blanchet.1@ulaval.ca

Craving is recognized as an important diagnostic criterion for substance use disorders (SUDs) and a predictive factor of relapse. Various methods to study craving exist; however, suppressing craving to successfully promote abstinence remains an unmet clinical need in SUDs. One reason is that social and environmental contexts recalling drug and alcohol consumption in the everyday life of patients suffering from SUDs often initiate craving and provoke relapse. Current behavioral therapies for SUDs use the cue-exposure approach to suppress the salience of social and environmental contexts that may induce craving. They facilitate learning and cognitive reinforcement of new behavior and train craving suppression in the presence of cues related to drug and alcohol consumption. Unfortunately, craving often outweighs behavioral training, especially in real social and environmental contexts where peer pressure encourages substance use, such as parties and bars. In this perspective, virtual reality (VR) is gaining interest for the development of cue-reactivity paradigms and for the practice of new skills in treatment. VR enhances the ecological validity of traditional craving-induction measurement. In this review, we discuss results from (1) studies using VR and alternative virtual agents in the induction of craving and (2) studies combining cue-exposure therapy with VR in the promotion of abstinence from drugs and alcohol use. These studies used virtual environments displaying alcohol and drugs to SUD patients. Moreover, some environments included avatars; hence, some studies have focused on the social interactions that are associated with drug-seeking behaviors and peer pressure. Findings indicate that VR can successfully increase craving. Studies combining cue-exposure therapy with virtual environments, however, have reported mixed success so far.

**Keywords: substance use disorders, craving, virtual reality, avatars, cue exposure**

#### **INTRODUCTION**

Substance use disorders (SUDs), commonly referred to as drug addictions, are a public health issue of growing importance throughout the world. Indeed, the number of individuals diagnosed with SUDs exceeds 15 million worldwide (World Health Organization, 2014), leading to annual costs of more than \$600 billion in the United States alone (National Institute on Drug Abuse, 2014). Several legal and illegal substances have psychoactive properties that provide pleasure but may interfere with the normal functioning of individuals, and chronic intake of these substances can induce SUDs. Although the majority of these active compounds are well known from pharmacological and toxicological viewpoints, prevention of relapses and treatment of SUDs remain difficult, as psycho- and pharmacotherapies offer relatively poor success rates. Combination of medication and behavioral assessment is becoming increasingly important in addiction medicine, and more attention is dedicated to finding ways to improve such combinations. Virtual reality (VR) and novel computer technologies are well established in the fields of neurosciences and psychology and may prove to serve as supplementary tools in the assessment and treatment of SUDs. In this review, we summarize current knowledge on craving assessment and cue-exposure therapy in SUDs using VR technologies.

SUDs are clinically defined as chronically relapsing disorders. The DSM 5 criteria for SUDs include the escalation of substance intake, appearance of tolerance, difficulties in limiting the intake, emergence of serious withdrawal symptoms and negative affect when limiting the intake, drug seeking, and craving (DSM 5, 2013). Psychoactive substances, from nicotine and alcohol to crack cocaine and methamphetamine (METH), all have differential effects on the brain and induce different psychophysiological consequences. However, all these compounds directly or indirectly stimulate the dopamine neural pathways, commonly referred to as the reward pathways, providing pleasure through intoxication. Subjective pleasure associated with recreational substance use and physiological habituation to the substance may then motivate subsequent intake and trigger an addictive process, which requires more effort and time to obtain and use the substance. Chronic substance use is, therefore, characterized by escalation of intake and loss of control in limiting the intake, emergence of withdrawal symptoms and negative emotional states when limiting the intake, and compulsive drug-seeking behavior (Koob and Volkow, 2009). Importantly, drug-seeking behavior and the negative affect associated with withdrawal are critical in relapse.

#### **CRAVING IS A CRITICAL FACTOR IN SUDs**

Craving, defined in the DSM 5 as the intense preoccupation with or urge to use the desired substance, is a complex phenomenon encompassing neurobiological and psychological mechanisms. The DSM 5 added craving as a crucial diagnostic criterion of SUD. Although this remains disputed, craving is also considered a predictive factor of relapse (Paliwal et al., 2008; Galloway and Singleton, 2009).

From a neurocognitive perspective, addiction is seen as an imbalance between reflexive and reflective systems (Bechara, 2005). Reflective processes, on the one hand, encompass cognitive inhibitory control and delay discounting of available reward, that is, the ability to resist craving and make adaptive decisions. These processes depend on cerebral structures associated with executive functioning, such as the dorsolateral (DLPFC) and orbitofrontal (OFC) subdivisions of the pre-frontal cortex (PFC). On the other hand, reflexive processes are related to impulsivity and risky decision-making, the motivational and emotional responses to reward, which highly depend on activity in the basal ganglia and limbic areas. In fact, cognitive downregulation of craving is associated with increased activity of the PFC and decreased activity in the nucleus accumbens, ventral tegmental area, and amygdala (Brody et al., 2007; Kober and Mende-Siedlecki, 2010). In patients with SUDs, impairments in reflective and reflexive processes are believed to promote a disregard for adaptive decision-making and to facilitate compulsive drug seeking.

From a neurophysiological perspective, craving is initiated when the reward pathways, habituated with frequent and intense firing of dopamine neurons from the mesocorticolimbic projections to the nucleus accumbens and DLPFC, are kept relatively understimulated for a certain period of time, as when one remains abstinent. Several functional magnetic resonance imaging (fMRI) studies have demonstrated that abstinence-induced craving increases cerebral blood flow and neuronal activity in the DLPFC, nucleus accumbens, and limbic structures including the hippocampus and amygdala (Tomasi et al., 2007; Wang et al., 2007).

Thus, it is generally accepted that addiction is *initiated* by chronic and repetitive overstimulation of the dopamine pathways, but its *maintenance* is highly dependent on more complex cerebral mechanisms, such as emotional regulation and memory. These processes involve several cerebral structures other than the basal ganglia including the PFC, basolateral amygdala, and hippocampus (Steketee and Kalivas, 2011; Barak et al., 2013). Memories and learned behaviors toward an artificial reward are thus reinforced during periods of drug use in comparison to natural rewards, creating perturbations in inhibitory control and reflective processes.

#### **SOCIOENVIRONMENTAL FACTORS ARE IMPORTANT IN CRAVING**

Individuals with a diagnosis of SUD, including those in recovery and abstinent individuals, are particularly vulnerable to social and environmental cues related to substance abuse. Specific drug-related environments (e.g., pubs and nightclubs) and social interactions (e.g., peer pressure) can provoke craving and relapse in abstinent individuals (Ferguson and Shiffman, 2009). Indeed, this sensitivity to drug-related conditioned cues and anticipation of future drug reward is thought to provoke a potentiated neural response in the neural pathways of reward. fMRI studies have demonstrated that cue-induced craving, for different substances of abuse, increases activity in the amygdala, anterior cingulate cortex (ACC), and PFC (Schneider et al., 2001; Seo et al., 2013). Of utmost importance, limbic activation and inception of craving following the presentation of drug cues are relatively independent of a pharmacological abstinence state and may also be elicited after drug administration (Franklin et al., 2007). Moreover, abstinence and withdrawal states may potentiate cue-induced craving in the PFC and striatum (McClernon et al., 2008). Studies using positron emission tomography (PET) imaging, which allows the imaging of a radioactive compound acting as a competitive agonist to certain receptors, have shown that cocaine- and nicotine-related cues induce changes in metabolic activity in the striatum, ACC, and different regions of the PFC (Brody et al., 2002; Volkow et al., 2007). These results demonstrate that SUD is a pathology of neuroplasticity and that drug-associated socioenvironmental cues provoke behavioral (the sensation of craving) and neurobiological (activity patterns in sensitive brain areas) responses.

Hence, as craving is now considered a crucial factor in the process of SUDs, its assessment is essential for future identification, management, and treatment outcomes. Craving episodes may vary in intensity, latency, frequency, and salience, and are traditionally measured with craving questionnaires and visual analog scales (VAS), in which individuals report subjective states of craving ranging from "not at all" to "very much" (cf., Bordnick et al., 2005; Culbertson et al., 2010). The assessment of craving along the course of therapeutic modalities is thought to represent a good marker of rehabilitation and treatment success, although this idea is still disputed (Perkins, 2012), and the majority of current studies in SUDs report craving scores as primary or secondary outcomes. Current ways of inducing craving rely mostly on visual cues, such as pictures or videos presenting cigarettes and drugs, or the combination of such visual cues with other sensory modalities (e.g., the smell of a cigarette). Although these cues are relevant to drug-seeking behavior and successfully induce craving, they do not trigger responses associated with social or environmental cues *per se* (i.e., a specific locale or social interaction). VR paradigms may represent an improvement in this regard, as they enable the reconstruction of specific environments and characters (e.g., a bar and a bartender) and allow a certain social immersion and interaction within this context (Kuntze et al., 2001; Lee et al., 2003).

#### **CUE-EXPOSURE THERAPY HAS THERAPEUTIC POTENTIAL IN CRAVING REDUCTION**

Studies now support the idea that intervention on sensitivity and reactivity to environmental context and cues is necessary to support pharmacotherapy in the treatment of SUDs (Perry et al., 2011). Cue-exposure therapy (CET) as an intervention has been used extensively in the past with patients suffering from social phobias and agoraphobia (Vögele et al., 2010), fear of flying (Maltby et al., 2002; da Costa et al., 2008), post-traumatic stress disorder (Reger and Gahm, 2008), and generalized anxiety disorders (Hofmann et al., 2006).

Cue-exposure therapy is primarily based on Pavlov's concept of classical conditioning, a main theory of learning in behavioral science. In classical conditioning, an unconditioned cue or stimulus (UCS) evoking an unconditioned response (UCR) is repeatedly paired with a neutral cue until the neutral stimulus itself elicits a response similar to the UCR. This response is referred to as a conditioned response (CR), which is elicited by the no longer neutral, but conditioned stimulus (CS). Classical conditioning has been demonstrated to be a key mechanism of a variety of psychological phenomena, including persistent substance use and abuse (Drummond et al., 1990). The drug serves as the UCS, which elicits the UCR, usually an automatic physiological response (e.g., craving). Pairing the UCS with a neutral cue (e.g., an object such as a syringe or an environment such as a restaurant) results in a CR that is evoked by the mere presence of the CS and is independent of the substance itself. Hence, constant substance abuse leads to increases in cue reactivity by establishing an association between the cue and the physiological response. CET aims at diminishing the conditioned relation between a substance-related cue and the physiological response by systematically presenting the cue in a treatment setting. The repeated presentation of a CS in the absence of the actual drug reduces the physiological reactivity to the substance-related cue, since no drug is actually administered. This eventually results in an extinction of the cue-response association and, therefore, decreases reactivity to substance-related cues, which might be held responsible for maintained levels of craving.
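The acquisition and extinction dynamics described above can be illustrated with a small simulation. The sketch below uses the Rescorla-Wagner learning rule, a standard textbook model of classical conditioning that the studies reviewed here do not themselves invoke; the learning rate `alpha`, the number of trials, and the function name are hypothetical choices made only for illustration.

```python
def rescorla_wagner(trials, alpha=0.3, v0=0.0):
    """Return the cue-drug associative strength V after each trial.

    trials: sequence of 1 (cue followed by the drug) or 0 (cue alone).
    alpha:  hypothetical learning rate, 0 < alpha <= 1.
    """
    v = v0
    history = []
    for outcome in trials:
        # Prediction-error update: V moves toward the observed outcome.
        v += alpha * (outcome - v)
        history.append(v)
    return history


acquisition = [1] * 20  # cue repeatedly paired with the drug (UCS)
extinction = [0] * 20   # cue presented without the drug, as in CET
curve = rescorla_wagner(acquisition + extinction)

print(f"strength after acquisition: {curve[19]:.3f}")
print(f"strength after extinction:  {curve[-1]:.3f}")
```

Under these assumptions, the associative strength rises toward its maximum during the pairing phase and decays toward zero once the cue is presented alone, which mirrors the reduction in cue reactivity that CET aims for. The model is deliberately simplistic and says nothing about relapse or context effects.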

Based on the currently available results, the success of CET in treating SUDs is mixed. A meta-analysis conducted in 2002 (Conklin and Tiffany, 2002) concluded that CET is not significantly more effective than any other behavioral cognitive therapy, although the authors state that the limited amount of results did not allow a clear interpretation of the phenomenon. Of great interest, Martin et al. (2010) have recently reviewed several studies with different methodologies (some of them combining VR paradigms, pharmacological augmentation, and fMRI with CET) in different SUDs (nicotine, alcohol, cocaine, and opiates). Although only four of the reviewed studies were randomized clinical trials, the authors conclude that CET demonstrates interesting therapeutic potential, although it is not significantly superior to other available treatments (Martin et al., 2010). Importantly, they also mention that several methodological issues in CET studies are to be addressed in the future to warrant significant benefits.

In the following sections, we will discuss results on (i) the use of VR in assessing cravings in SUDs and (ii) the combination of VR and CET to tune down craving intensity and remediate substance use in SUDs. Furthermore, this review separates findings of studies using VR to recreate computer-generated environments from those of studies proposing social interactions with avatars in addition to virtual environments, both in the context of SUDs.

#### **DATA REVIEW**

#### **VIRTUAL REALITY AS A SUPPLEMENTARY TOOL FOR CRAVING ASSESSMENT**

Virtual reality enables the simulation of drug-related cues and environments to induce craving, and thus might serve as an ecologically valid addition to traditional craving assessments. While virtual environments by themselves – as well as possible interactions with avatars in another virtual environment – may induce craving, their specific nature of induction differs within and across VR paradigms. Therefore, we have separated studies proposing only the exposure to VR environments and studies using both VR environments and the possibility of social interaction with avatars. **Tables 1** and **2** provide summaries of each study's design and results.

#### **VR environments inducing drug cravings**

Kuntze et al. (2001) developed the first pilot study combining a classical cue-exposure paradigm with VR in heroin dependent males. Their methodological design specifically compared craving ratings following counterbalanced cue exposure to an immersive virtual environment, cue exposure to pictures, and to neutral stimuli. Cravings were measured with a VAS and the Yale–Brown Obsessive Compulsive Scale (YBOCS) on VAS before, during, and after exposure to the different sets of cues. Several physiological measurements were also obtained, including electroencephalography (EEG), skin conductance, and cardiac rhythm. Although the authors did not expound on the actual results of the study, their methodological development paved the way for other works (Kuntze et al., 2001).

The following set of studies investigating the effect of VR environments in eliciting cravings largely focused on nicotine intake and smoking. Tobacco use disorder (TUD) (DSM-5) is still a leading cause of preventable deaths and illness throughout the world and represents a complex condition (Rose, 2005). Although several pharmacotherapies are available, success rates remain relatively disputable and total abstinence is hard to achieve. Lee et al. (2003, 2005) pioneered the works on VR in TUD with two landmark studies. In their pilot study, they sought to determine if VR environments evoked nicotine cravings more effectively than presenting normal two-dimensional cues, such as photographs (Lee et al., 2003). They designed a VR bar, an environment that questionnaire results had previously identified as eliciting stronger cravings in smokers. The 22 subjects participating in the study were randomly assigned to a group exposed to either VR or photographs and exposed to this condition for 5 min. Craving ratings were collected with a single VAS item asking about the current urge to smoke. The VR bar induced significantly greater nicotine cravings than the photograph condition. This supports the idea initiated in Kuntze's pilot study that exposure to VR smoking-related cues results in elevated craving ratings when compared to classical cue-exposure designs. Following these results, Lee et al. (2005) presented the same task during fMRI acquisition to determine neural correlates of VR cue reactivity (Lee et al., 2004). The eight subjects participating in the study were required to complete a craving VAS before and after the scan, during which they experienced immersion in a VR bar and exposure to two-dimensional and three-dimensional smoking-related and neutral cues through MR-compatible goggles. fMRI scans revealed a significant increase in brain activity in the PFC, ACC, inferior temporal cortex, and supplementary motor area following the presentation of two-dimensional smoking versus neutral cues, consistent with neuroimaging data on craving. Interestingly, following the immersion in the VR bar,


**Table 1 | Summary of studies assessing craving induction by substance-related cues in VR settings using virtual environments**.

VR, virtual reality.

<sup>a</sup>Number of participants assigned to experimental group of substance users.

<sup>b</sup>Clinical or diagnostic characteristics of substance users according to the study's authors, ND, nicotine dependence [<sup>1</sup>Meeting criteria for nicotine dependence and severity of dependence based on the Fagerstrom Test of Nicotine Dependence (FTND), <sup>2</sup>Minimum smoking rate of 10 cigarettes per day, <sup>3</sup>Meeting DSM-IV-TR criteria for nicotine dependence].

<sup>c</sup>HMD, head mounted display; screen, computer screen placed in front of the participant.

<sup>d</sup>Tool of assessment for subjectively reported craving levels before, during and after cue exposure, Likert, Likert Scale; VAS: Visual Analog Scale; QSU, Questionnaire of Smoking Urges; ACVAS, Attention to Cues Visual Analog Scale.

<sup>e</sup>fMRI, functional magnetic resonance imaging.

<sup>f</sup>LF/HF, low frequency/high frequency; HR, heart rate; PFC, pre-frontal cortex; STG, superior temporal gyrus.


#### **Table 2 | Summary of studies assessing craving induction by substance-related cues in VR settings using virtual social interactions**.



VR, virtual reality.

<sup>a</sup>Number of participants assigned to experimental group of substance users.

<sup>b</sup>Clinical or diagnostic characteristics of substance users according to the study's authors, AUD, alcohol use disorder; AD, alcohol dependence; CCD, crack cocaine dependence; ND, nicotine dependence [<sup>1</sup>Meeting DSM-IV-TR criteria for alcohol abuse or dependence, <sup>2</sup>Based on the Alcohol Dependence Scale (ADS), <sup>3</sup>Abstinence for at least 3 weeks, <sup>4</sup>Meeting DSM-IV-TR criteria for cannabis abuse or dependence, <sup>5</sup>Meeting DSM-IV criteria for crack cocaine dependence, <sup>6</sup>Meeting DSM-IV-TR criteria for nicotine dependence, <sup>7</sup>Minimum smoking rate of 10 cigarettes per day].

<sup>c</sup>HMD, head mounted display; screen, computer screen placed in front of the participant.

<sup>d</sup>Tool of assessment for subjectively reported craving levels before, during, and after cue exposure, AAS, Alcohol Attention Scale; VAS, visual analog scale; CCVAS, Cannabis Craving Visual Analog Scale; MDS, Multidimensional Scaling.

<sup>e</sup>EEG, electroencephalogram; HR, heart rate; SC, skin conductance; T, body temperature.

significant brain activation was only recorded in the PFC and cerebellum. These results demonstrate that 3D stimuli in a cue-provoked paradigm may be as efficient as 2D stimuli, since they induce a response in the PFC, which has been associated with craving. Activation in the cerebellum may be related to the control of posture and balance, required in the 3D environment. The authors also identified a limitation in their VR bar paradigm: the smoking cues were scattered around the environment and may have required additional attentional resources compared to the photographs. Works from Baumann and Sayette (2006), Traylor et al. (2009), and Pericot-Valverde et al. (2011) have also explored the efficacy of VR in nicotine craving induction. In these studies, smokers were exposed to different virtual environments (neutral and smoking-related) and required to rate their craving levels. In Baumann and Sayette's (2006) study, smoking-deprived subjects rated their cravings significantly higher following exposure to the smoking-related environment (Baumann and Sayette, 2006). Interestingly, Traylor et al.'s (2009) study added olfactory stimuli to VR environments and tested this paradigm in non-abstinent smokers (Traylor et al., 2009). They found that smoking-related VR cues significantly increased cravings in smokers compared to neutral cues, even though the subjects had been allowed to smoke before the experiment. However, the olfactory stimuli condition did not affect craving levels. The last two studies have clearly demonstrated how VR smoking-related cues may successfully induce craving in smokers, independent of the state of abstinence. Pericot-Valverde et al. (2011) demonstrated the different evolution patterns of subjective craving for cigarettes in VR environments, with smoking-related environments provoking a more rapid increase in subjective craving intensity (Pericot-Valverde et al., 2011). Moreover, Gamito et al. (2011) examined the difference between three different VR environments in a study comparing smokers and non-smokers. Results were similar to previously discussed studies, as high arousal virtual settings provoked the greatest significant increases in subjective craving for cigarettes in smoking individuals. Smokers and non-smokers reported similar scores for sense of presence and cybersickness (Gamito et al., 2011). The aforementioned studies illustrate well the efficacy of VR environments in inducing cravings, but also the importance of specificity and context in VR cues to facilitate a reaction. The role of context in craving has been directly assessed in a study by Paris et al. (2011), in which multiple smokers were exposed to four different VR environments including two neutral nature scenes and two identical convenience stores: one comprising the usual available tobacco and cigarettes and one stripped of all smoking-related products (Paris et al., 2011). Subjects spent 3 min in each environment in a counterbalanced randomized order and indicated their craving level on a 10-point scale. Cravings were significantly increased following exposure to the convenience store with smoking cues compared to neutral environments. More importantly, results also showed an increase in subjective cravings between the first nature scene and the convenience store *without* smoking cues. This provides additional evidence that a smoking-related scenery without explicit smoking cues may provoke craving (Conklin, 2006; Conklin et al., 2008). This finding may possibly be attributed to the subject's own expectations (McBride et al., 2006), and reinforces the idea that VR is an adequate medium to study cravings for nicotine. Finally, Acker and MacKillop (2013) obtained similar results when comparing the effect of exposure to a neutral or smoking-related virtual environment on cravings in smokers, but also investigated cigarette craving from a behavioral economic perspective. They found that as craving levels increased following the exposure to a smoking-related virtual environment, the demand index (i.e., representing the economic cost subjects were willing to pay for a cigarette) also increased (Acker and MacKillop, 2013). This is in line with the literature on impaired decision-making, increased compulsive behavior and risk-taking in addicted individuals (Goldstein and Volkow, 2002; Lejuez et al., 2003; Garavan and Stout, 2005; Jentsch and Pennington, 2013).
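Several of the designs above, such as the four-environment paradigm of Paris et al. (2011), present the VR environments in a counterbalanced order. The exact scheme used in those studies is not described here, so the following is only a minimal sketch of one common approach (full counterbalancing over all permutations); the environment labels, the function name, and the sample size are hypothetical.

```python
import itertools
import random

# Hypothetical labels for four VR environments.
ENVIRONMENTS = ["nature_1", "nature_2", "store_with_cues", "store_without_cues"]


def counterbalanced_orders(n_participants, seed=0):
    """Assign one presentation order to each participant.

    All 24 permutations are shuffled once and then dealt out in turn,
    so each order is used about equally often across participants.
    """
    rng = random.Random(seed)
    orders = list(itertools.permutations(ENVIRONMENTS))
    rng.shuffle(orders)
    return [orders[i % len(orders)] for i in range(n_participants)]


schedule = counterbalanced_orders(48)
for participant, order in enumerate(schedule[:3], start=1):
    print(participant, " -> ".join(order))
```

With 48 participants, each of the 24 possible orders is used exactly twice, so that order effects (for example, craving carried over from a cue-rich scene into a neutral one) are balanced across conditions rather than confounded with them.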

Virtual reality has also been utilized to induce craving in alcohol use disorder (AUD), a complex relapsing disorder with current moderate therapeutic success and in need of novel intervention techniques (Addolorato et al., 2012). With alcohol being a widely available substance, differentiation of social drinkers and pathological users is critical (Myrick et al., 2004), as is the identification of concomitant substance use. Traylor et al. (2009)sought to determine if cravings for nicotine and alcohol differed in three VR environments (i.e., a neutral scene, a party scene, an office scene) in alcohol-dependent and non-alcohol-dependent smokers/daily drinkers. Results showed that alcohol-dependent subjects had significantly increased alcohol cravings in all three VR environments, whereas non-alcohol-dependent subjects only showed a significant increase in craving when exposed to the party scene. Interestingly, non-alcohol-dependent smokers were significantly more sensitive to smoking-related VR cues than the alcohol dependent, raising interesting questions on cross-sensitization and cue differentiation. The authors propose that alcohol cues may have provoked a ceiling effect in alcohol-dependent subjects since they were required to remain abstinent 24 h before the experiment. Although this may represent a limitation, the results demonstrate that VR paradigms can successfully induce cravings for alcohol. Ryan et al. (2010) also successfully induced alcohol cravings with VR cues in binge drinkers. Simultaneous to the immersion of four VR environments, subjects were exposed to olfactory (i.e., scents of whiskey and cigarettes) and auditory cues (Ryan et al., 2010). Binge drinkers compared to non-binge drinkers showed a significant increase in alcohol cravings in two out of four contexts, the party and the kitchen scenes, whereas both groups showed an increase in alcohol craving within the bar scene. 
Binge drinkers also reported significantly higher levels of thinking about drinking. Although it can be argued that binge drinkers do not necessarily fit the diagnostic criteria for AUD and that the control sample size was smaller, these results show that the desire to drink can be successfully induced with VR in regular drinkers, making it possible to monitor early signs of SUD.

Apart from the work on TUD and AUD described above, one study has used VR environments to induce cravings for METH. METH is a powerful psychostimulant with important addictive potential and dangerous neurological and physiological side effects (Marshall and O'Dell, 2012). In a VR study, non-treatment-seeking METH users were exposed to a neutral VR room and a METH-related room, in which drug transactions occurred, avatars used the drug and loud music played, and reported subjective cravings and emotional states on a VAS (Culbertson et al., 2010). The authors measured four aspects of craving, asking the subjects whether they crave, desire, want, and would use METH right now. All aspects were measured on a VAS. The cue condition significantly increased ratings for all four craving items and one anxiety item, when compared to the neutral VR room. Moreover, the VR cues induced greater cravings than METH-related videos, although this effect was not statistically significant. Again, this work helps demonstrate the viability and validity of VR in addiction research. Of technical interest, the study was conducted with a free online VR platform, which speaks for the accessibility and relative logistical simplicity of VR in studying cue-provoked paradigms.

In summary, the studies discussed above unanimously show that immersion in VR environments can successfully induce cravings for a wide range of substances (see **Table 1**). Although works have primarily focused on TUD and AUD, several lines of evidence demonstrate that cue-induced craving with traditional stimuli can be induced with other substances (Carter and Tiffany, 2001) such as marijuana (Filbey et al., 2009), cocaine, and synthetic psychostimulants (Garavan et al., 2000; Bonson et al., 2002; Kühn and Gallinat, 2011). It can thus be suggested that future studies will support the efficacy of VR in inducing cravings in other SUDs. Although it can be disputed that self-reported measures may not always be as accurate as neurobiological correlates in assessing craving levels, it has to be acknowledged that most of the studies discussed above included important sample sizes and control conditions with either a comparison with healthy subjects or with a neutral VR condition or both. Future works will surely rely on simultaneous brain imaging to provide solid results.

#### **VR and social interactions with avatars**

Several studies assessing the use of VR to induce craving expose subjects to environments depicting people who perform substance-related behavior (e.g., talking about or administering a specific drug). These representations of human individuals in a digitalized form are usually referred to as *avatars*. The term avatar itself is rather ill-defined and, thus, often used inconsistently across experimental studies. In its general and broader use in most studies, an avatar includes all digital representations of a person in a virtual environment (Lim and Reeves, 2009). Experimental settings with this kind of avatar usually provide a VR scenario, such as a bar or a party, in which the subject is generally limited to the role of an observer. This approach allows a high degree of standardization, leaving the experimenter in charge of controlling the subjects' field of view and gaze, and guiding them through a scenario in a specific order while exposing them to stimuli for a predefined period of time (Stoermer et al., 2000). A more specialized view of an avatar extends the simple image of a human being on a computer screen by enabling subjects to actually interact with their virtual environment, or even control it, and manipulate certain stimuli. However, only a few studies have actually implemented this approach in their experimental design since it lacks the necessary degree of standardization across subjects' exposure to the VR environment [e.g., (Culbertson et al., 2010; Ferrer-García et al., 2010)]. Rather than a discrete distinction between avatars that allow for interactional behavior and those that do not, there are gradations in how an avatar is defined in individual studies, ranging from mere observation of virtual characters to actual manipulation of the environment using a virtual representation of the individual.
Finding a common definition of an avatar in VR paradigms poses a challenge for recent psychiatric and neuroscientific research. For one, if standardized, it will set the framework for a deeper understanding of the socioenvironmental mechanisms of craving. Also, it eventually might lead to future advances in the treatment of SUDs. Hence, in this section, we refer to an avatar as a virtual representation of a person that *interacts* with the subject. This incorporates people in a virtual environment who actively approach a subject (e.g., in order to offer them a cigarette) instead of performing passively observed substance-related behavior. In some studies, an avatar approached the user proposing to share an alcoholic beverage or cigarettes, but since the user did not have the possibility to answer and to change the unfolding course of events, we considered this as a non-interaction and the avatar as being a singular part of the environment.

The most frequent implementation of avatars in VR is as part of an environment closely related to specific substance use behavior. Due to the standardized timing of stimulus exposure, researchers are able to disentangle the effect of the interaction in which an avatar engages subjects from other surrounding, potentially craving-inducing cues. Studies using this kind of avatar consistently report that social interactions, as much as substance-related objects, increase craving compared to neutral cues and baseline craving before exposure (Bordnick et al., 2004, 2005; Saladin et al., 2006; Carter et al., 2008; Cho et al., 2008; Lee et al., 2008; Ferrer-García et al., 2010; Garcìa-Rodrìguez et al., 2012, 2013). Although most of these study designs provide the opportunity to assess the effect of social interaction in a substance-related environment, the exact nature of the interaction is seldom specified (Bordnick et al., 2005) and no explicit statistical comparisons are made between changes in subjective craving induced by the environment itself and those induced by social interaction. Thus, how far social interaction benefited the induction of craving by increasing the ecological validity of the virtual environment in these studies remains unclear. Studies and results described below are summarized in **Table 2**.

The work of Bordnick et al. (2004, 2005, 2008, 2009) has to be highlighted as one of the first to assess craving induction with VR implementing artificial social interactions. Taken together, their results show that complex virtual cues can induce cravings for a broad range of substances among chronic substance users. Moreover, the results seem to demonstrate that social interactions directed toward drug use are as effective as paraphernalia cues in inducing craving. Their pilot study in smokers demonstrated that immersion in a smoking-related virtual environment, including interaction with avatars offering cigarettes to the subject, significantly induced cravings for tobacco. Moreover, smoking-related social interactions with avatars provoked the highest craving score compared to smoking-related inanimate materials (Bordnick et al., 2004). A subsequent investigation of physiological reactivity with the same VR paradigm in non-deprived nicotine-dependent smokers showed that smoking-related virtual cues, including interaction with smoking avatars, significantly induced cravings for cigarettes and increased skin conductance response (Bordnick et al., 2005). Of interest, the smoking-related inanimate cue and social interaction conditions induced the same level of craving, with no significant increase of craving directly ascribable to interaction with avatars. Other works have similarly featured the use of VR and interactions with avatars in TUD subjects. Carter et al. (2008) found that cigarette-related environments and interaction with smoking avatars elicited greater cravings than a neutral scene (Carter et al., 2008). Using eight different cigarette-related environments with avatars, Ferrer-García et al. (2010) showed that craving levels were significantly correlated with the sense of presence within the VR context. This result is particularly interesting from the perspective of creating flexible virtual environments (Ferrer-García et al., 2010).
Moreover, this emphasizes the importance of assessing the sense of presence in future studies, as the sensation of craving inducing the sense of presence may also be a possible phenomenon. In a similar fashion, Garcìa-Rodrìguez et al. (2012) have assessed the validity of seven different virtual environments, in which subjects could interact with avatars and objects, to induce nicotine cravings. This study showed that levels of cue-induced craving may be induced differentially, depending on particular situations successfully recreated in VR. However, the last three studies discussed (Carter et al., 2008; Ferrer-García et al., 2010; Garcìa-Rodrìguez et al., 2012) did not isolate the impact of social interaction on levels of induced craving. Interestingly, Garcìa-Rodrìguez et al. (2013) also investigated the effect of smoking a virtual cigarette in a virtual pub allowing interaction with avatars. The action of manipulating and "smoking" the virtual cigarette induced a significant increase in subjective cravings and heart rate. The authors propose that simulating the action of smoking, in a virtual environment or in real life, may provoke cravings and relapse, acting as a CS.

A few studies have also assessed virtual interactions and craving in AUD. Bordnick et al. applied the paradigm they had used in TUD (Bordnick et al., 2004, 2005) to AUD subjects (Bordnick et al., 2008). Subjects were exposed to neutral and alcohol consumption-related virtual environments, including a bar, interacted with a virtual bartender and could order their favorite drink. Moreover, olfactory cues mimicked the scent of the specific drink, adding to the realism. Again, the alcohol-related cues and virtual interactions significantly induced cravings compared to neutral scenes. In other studies on AUD, social interaction with avatars has been used for a more tangible purpose that is quite distinct from casual interaction in everyday scenarios, such as a bar or a party. These studies make use of experimental settings in which an avatar confronts subjects with social pressure, that is, uses persuasive methods to engage the counterpart in substance use behavior. Since only a limited number of studies have assessed the feasibility of virtual social pressure situations to induce craving, findings so far are somewhat inconsistent. Although studies agree that the presence of an avatar urging subjects to drink alcohol can successfully increase their subjective craving levels, different views have been proposed regarding the context in which social pressure is deployed. Cho et al. (2008) found that an avatar providing social pressure seems to increase craving independent of any substance-related cues. Subjective levels of craving were higher in the presence of an avatar, irrespective of whether the scenario was neutral or associated with alcohol use behavior (Cho et al., 2008). However, results by Lee et al. (2008) indicated that settings with substance-related cues might have differential effects on craving, which appear to be additionally driven by diagnosis of SUD (Lee et al., 2008).
While social pressure significantly induced craving in healthy individuals in both a neutral and an alcohol-related environment, craving levels of individuals diagnosed with SUD were already increased by the mere presence of substance-related cues. An additional avatar providing social pressure did not have any effect on changes in subjective craving levels.

Research on social aspects of craving induction has been conducted in other SUDs as well; however, available results are less numerous. Bordnick et al. (2009) investigated VR cue reactivity in cannabis smokers. Using the same paradigm as in their study on tobacco, subjects were immersed in four virtual environments (including two neutral rooms) with one cue room and one social interaction "party" room and subsequently completed VAS ratings of marijuana cravings (Bordnick et al., 2009). In the party room, subjects could interact with avatars and even "use" marijuana with them, hear music playing, and smell the scent of marijuana. Results show that both the inanimate cue room and social interaction room significantly induced cravings compared to the two neutral rooms; however, there was no additional significant effect when comparing the inanimate cue room to the social interaction one. With respect to the idea of complex drug-related social interactions in craving induction, it has to be mentioned that olfactory and auditory stimuli were also present in the inanimate cue room, thus creating a more realistic drug-related scene. Finally, Saladin et al. (2006) have also demonstrated that VR cues and social interactions with avatars could significantly increase cravings for crack cocaine in dependent individuals. The virtual contexts in which cravings were highest involved social interactions with avatars, such as scenes where the subject interacted with a crack cocaine dealer or where other avatars were using crack cocaine (Saladin et al., 2006). Of further interest, scenes depicting aversive stimuli related to crack cocaine use (i.e., a police raid in the crack den) seemed to induce lower levels of craving. Moreover, although crack cocaine-related scenes induced craving, subjects' mood (measured with a self-reported happiness scale) decreased and remained low for the remainder of the experiment.
The authors propose that this decrease in mood might imply anticipated anxiety in the face of craving and withdrawal symptoms.

In summary, virtual interactions with avatars in different contexts can significantly induce cravings for multiple substances, with several results available on cigarette smoking and alcohol administration. Although avatars allow the presentation of a more complex and ecologically valid virtual environment, it is not clear whether they induced cravings for substances in a manner different from other virtual settings. Real-life environments that are thought to induce cravings are usually populated with other people consuming substances or engaged in activities more or less related to drug use (e.g., dancing, listening to music, drinking). However, as already noted in the previous section on the use of virtual environments in cue-induced craving, presence in the specific environment alone is sufficient to induce a significant increase in subjective cravings. Additional studies will, therefore, be needed to specifically investigate the difference between environmental and social cues in craving induction. The use of neurobiological correlates may prove to be essential in the hypothetical differentiation of such cues, as social interactions are more complex and engage different brain pathways than simple sensory attention to the cues present.

#### **VIRTUAL REALITY AS A BENEFICIAL ADDITION TO CUE-EXPOSURE THERAPY IN SUDs**

Traditional CET designs in the treatment of SUD suffer from mixed findings regarding their efficacy in clinical populations (Conklin and Tiffany, 2002; Kaplan et al., 2011). While there is experimental and meta-analytic evidence that exposure to VR cues might serve as an effective addition to CET approaches in anxiety disorders (Meyerbröker and Emmelkamp, 2010; Gonçalves et al., 2012), its beneficial effect with respect to SUDs still needs to be validated. VR CET in the context of SUD is assumed to induce states of craving by exposing individuals to environments closely related to everyday life and typical drug administration scenarios (Lee et al., 2004). This approach goes beyond traditional CET designs, which mainly draw on substance-related cues detached from socioenvironmental or *in vivo* scenarios (Drummond and Glautier, 1994). However, since only a few studies have systematically examined the efficacy of VR in CET settings, overall findings to date are rather preliminary (**Table 3**). Studies within this line of research usually assess the feasibility of VR CET in a threefold fashion. First, changes in subjective craving levels are determined after each session of exposure to either substance-related or neutral VR cues. Second, physiological responses, such as heart rate, skin conductance, or changes in cerebral blood flow measured by fMRI, provide objective tools of craving assessment, which are independent of the subjective ratings given by the subjects. Third, an overall treatment effect is derived from comparing changes in subjective craving ratings and physiological responses before and after completion of the CET.

The majority of studies addressing the efficacy of VR CET in a therapeutic context assessed populations with a diagnosis of nicotine dependence. One of the first studies to investigate the use of VR CET in cigarette smokers was performed by Lee et al. (2004). Encouraged by their previous findings that VR environments induced higher levels of subjective craving ratings compared to classical pictorial cues (Lee et al., 2003), the group implemented a VR paradigm into a CET setting with current male smokers (Lee et al., 2004). Subjects were exposed to VR environments and social situations and assessed with respect to subjective craving, as well as number of daily smoked cigarettes. Although results indicated a trend toward decreasing numbers of smoked cigarettes from pre- to post-treatment, neither this outcome nor subjective craving ratings reached statistical significance during or at the end of the VR CET. Given that Lee et al. (2004) conducted a preliminary study, a few issues might explain the lack of treatment effects on smoking urges and actual number of smoked cigarettes. Of note, the sample in this study could retrospectively be classified as non-clinical since no differentiation was made between frequent smokers and those qualifying for diagnostic criteria of nicotine dependence. Moreover, there was no methodological assessment of differences between the effects of VR in inducing craving compared to traditional pictorial or neutral cues. This comparison is necessary to argue toward a statistically and clinically meaningful treatment effect, which can be specifically attributed to exposure to VR cues.

Indeed, subsequent systematic research by Choi et al. (2011) provided evidence that exposure to VR settings successfully induced cigarette craving in TUD subjects (Choi et al., 2011). Increases in craving levels were observed by means of both higher ratings of subjective craving and elevated physiological responses [i.e., increased levels of skin conductance and electromyography (EMG)] to substance-related and social cues when compared to neutral control scenarios. These findings are in line with previous studies (Bordnick et al., 2004, 2005; Baumann and Sayette, 2006; Garcìa-Rodrìguez et al., 2012, 2013; Acker and MacKillop, 2013) showing that substance-related objects and social scenarios induce craving in individuals diagnosed with SUD. Moreover, subjects showed increased physiological responses to social rather than substance-related cues in virtual environments, although no differences were found in terms of subjective craving. Even more important, subjects' cue reactivity to smoking-related stimuli decreased from the first to the last session of CET, both subjectively and physiologically, hinting toward a general effect of VR CET in the reduction of craving levels.

<sup>a</sup>CT, clinical trial; CR, case report.

<sup>b</sup>N, no abstinence at the beginning of the treatment; Y, abstinence at the beginning of the treatment.

<sup>c</sup>CET, cue-exposure therapy; CBT, cognitive-behavioral therapy.

<sup>d</sup>Number of total treatment sessions.

<sup>e</sup>Treatment sessions took place either once (W) or twice (nW) a week.

<sup>f</sup>E, virtual environments; S, virtual social interaction.

<sup>g</sup>VR CET, virtual reality cue-exposure therapy.

<sup>h</sup>EEG, electroencephalogram; SC, skin conductance reactivity; EMG, electromyogram; PFC, pre-frontal cortex; ACC, anterior cingulate cortex; CO, breath carbon monoxide.

However, Choi et al. (2011) did not include a group of healthy subjects exposed to the same experimental conditions, or a group of TUD subjects receiving conventional CET, in their study design. It remains, therefore, questionable how far the presented results can be viewed in favor of VR as a therapeutic aid in CET. This notion is supported by findings on the effect of VR CET in a previously conducted study (Moon and Lee, 2009). Here, men with TUD were exposed to substance-related and neutral VR scenarios in an fMRI scanner. Smoking-related VR cues (i.e., a virtual bar with paraphernalia or smoking avatars at a party) increased activity in the PFC and ACC compared to neutral cues, providing insight into functionally higher cognitive cortical areas, which are involved in cognitive control and, thus, potential drug-seeking behavior (Feil et al., 2010). Additionally, VR CET over a total of six sessions led to a decrease in PFC activity from the first to the last session. However, Moon and Lee (2009) did not find any effect of VR CET on subjective craving levels after six sessions of treatment. The non-clinical population that was used in this particular study might again explain the lack of statistical significance in subjects' differential behavioral responses to substance-related and neutral cues. Hence, the results of this study again need to be interpreted with care.

Although the results of studies assessing the efficacy of VR CET are highly ambiguous, they might provide a framework for individualized treatment approaches. A recent case report by Pericot-Valverde et al. (2012) gives insight into how an elaborate treatment plan might implement VR scenarios into CET in a personalized fashion. In this study, a woman successfully quit smoking by following an extensive individualized treatment plan that included both relapse training and VR CET over a course of 6 weeks (Pericot-Valverde et al., 2012). Prior to any treatment, the subject was asked to identify VR cues that might specifically induce an urge to smoke. These cues were then implemented into a VR environment. Subjective ratings of craving, as well as breath carbon monoxide levels and number of smoked cigarettes, decreased over time during treatment. Since this example is a case report and there was no follow-up assessment to determine potential relapse after completion of the treatment program, it should be considered only a first step in optimizing treatment designs that make use of VR CET.

Studies reviewed here provide insight into the possible use of VR CET as a treatment aid. However, little effort has been made so far to compare VR CET to other treatment approaches. A recently published study by Park et al. (2014) compared VR CET to conventional cognitive behavioral therapy (CBT) in treatment-seeking male smokers. Subjects underwent either a treatment plan, which involved VR CET – including exposure to substance-related environments and social interactions – or CBT (Park et al., 2014). Both VR CET and CBT significantly reduced numbers of smoked cigarettes and levels of breath carbon monoxide from pre- to post-treatment and even weeks after completion of the program. However, neither individuals assigned to VR CET nor those in the CBT group experienced decreases in subjective craving levels. Additionally, since there were no apparent differences between VR CET and CBT on any treatment outcome, assignment to VR CET appeared to produce a treatment effect that could be considered at least similar to that of traditional CBT.

While studies on individuals with a TUD diagnosis are growing in number, there is only limited literature regarding the use of VR in clinical studies assessing behavioral treatment approaches in alcohol-dependent populations. Lee et al. (2009) examined the use of a VR therapy approach in male individuals with an AUD diagnosis and healthy controls while recording changes in EEG alpha frequency band power of fronto-parietal cortical areas (Lee et al., 2009). The paradigm of this study differs from CET in that it does not follow the approach of Pavlovian CS-CR extinction, but adds an aversive stimulus to the model of classical conditioning, usually referred to as *aversion therapy*. Individuals undergo a variety of scenarios, which set them in certain affective states including relaxation, as well as positive and negative arousal. The combination of positive and negative affective states (i.e., craving a specific drug and immediate subsequent multisensory aversive stimulation, such as showing a person vomiting and simulating the smell and taste of vomit) leads to a restructuring of the CS-response relation in addiction and should eventually result in avoidance of the administered drug. Despite different approaches in the actual treatment of addiction in CET and aversion therapy, the implementation of VR scenarios serves the same purpose in both cases, that is, exposing individuals to substance-related cues that result in an increase in craving levels. Hence, in the study by Lee et al. (2009), exposure to virtual substance-related objects and environments led to a larger increase in craving levels for alcohol-dependent subjects compared to healthy controls. Additionally, a decrease in subjective craving levels as well as increases in fronto-parietal EEG alpha-power after 10 sessions of therapy was more prominent for VR therapy when compared to aversion therapy, indicating that the addition of VR scenarios to CBT-related paradigms might result in improved treatment success.

Accumulated results of studies reviewed here point toward a decrease in the amount of substance intake per day during the course of VR CET (Lee et al., 2004; Choi et al., 2011; Pericot-Valverde et al., 2012; Park et al., 2014). With respect to its therapeutic goal, that is, reducing or stopping the administration of drugs, VR CET holds promising results at least in the short term. To validate its feasibility as a treatment for prospective abstinence in SUDs, follow-up assessments need to take place, which were only performed in one of the studies reviewed here (Park et al., 2014). This is crucial given that craving for substances in particular can occur after longer periods of abstinence, leading to potential relapse (Paliwal et al., 2008; Galloway and Singleton, 2009). Additionally, little inference can be made as to how far a treatment effect can be attributed to the use of VR environments and social interactions during exposure to substance-related cues. Contrary to findings on treatment effects regarding substance intake, VR CET did not affect levels of subjective craving in most studies reviewed here. This might be due to heterogeneity in study designs, which makes it difficult to draw an overall picture of the putative superior effect of substance-related VR cues over neutral VR or traditional substance-related cues in inducing craving for drugs. Only a few studies distinguished between craving induction during cue exposure and subsequent decrease in subjective craving, amount of substance intake, and physiological correlates of craving (Choi et al., 2011). Moreover, most studies do not report the former by comparing substance-related to neutral VR or traditional cues, such as pictures, and mostly rely on previous work, which is substance-specific and still fairly limited in number (see **Tables 1** and **2**).
The delineation of successful craving induction and its reduction during the course of a treatment is of utmost importance to attribute a benefit of treatment aid to the use of VR cues. This will eventually ensure that reduction in substance intake is not mainly a result of treatment irrespective of the type of exposure, which in the long run might again lead to potential craving and relapse.

Findings by Lee et al. (2009) suggest that a decrease in subjective craving levels is achieved by means of combined VR cue exposure and aversion therapy. Hence, the concept of classical conditioning as the fundamental basis of persistent drug use and abuse might not be a sufficient theoretical framework with respect to the treatment of SUDs. Although extinction of acquired reactivity toward substance-related cues appears quite straightforward, its application in the context of SUD treatment might still be fragmentary. This is in line with previous findings in both animal models (Conklin and Tiffany, 2002) and human studies (Kaplan et al., 2011) challenging the efficacy of current CET as a stand-alone treatment approach in SUD. An emotional component inherent to most substance-related cues goes beyond the idea of a simple classical Pavlovian learning paradigm and reflects affective states conceivably refractory to re-conditioning approaches (Havermans and Jansen, 2003).

Besides theoretical considerations, treatment success often strongly depends on factors such as sex, age, or duration of treatment (Zilberman et al., 2003; Kaplan et al., 2011; Koechl et al., 2012). Interestingly, one VR CET case report revealed a decrease in subjective craving levels for nicotine (Pericot-Valverde et al., 2012), whereas group-level studies mostly did not yield corresponding results. This highlights the potential of individualized CET plans for successful treatment of SUD in both stopping current substance abuse and avoiding future relapse. In light of a continuously growing, yet methodologically inconsistent, number of findings regarding VR CET in SUD on a group level, personalized treatment can serve as a key in overcoming practical and statistical issues that might circumvent therapeutic efficacy in individuals. Individualized treatment in SUD has previously been suggested as a promising approach to manage individual variability in AUD patients (Mann and Hermann, 2010). Additionally, cue reactivity, and thus craving, is associated with lower levels of substance dependency (Watson et al., 2010) and rates of relapse (Conklin et al., 2012). In the context of traditional CET, cue reactivity is reflected by changes in activation of the mesolimbic circuit (Vollstädt-Klein et al., 2011). Vollstädt-Klein et al. (2011) observed a decrease of activation within areas including the ACC, PFC, and ventral striatum after CET. With respect to findings that linked increases in ACC and PFC activity to higher levels of craving (Moon and Lee, 2009), one could argue that VR CET might have the properties to downregulate hyper-activation of dopaminergic mesolimbic areas. Hence, increasing ecological cue validity in virtual settings might subsequently result in elevated cue reactivity and respective changes in mesolimbic activity, facilitating VR CET efficacy on at least the individual level.

#### **GENERAL DISCUSSION**

In sum, immersing subjects in substance-related virtual environments can successfully induce craving for different substances. Substance-related VR cues compared to neutral cues induced significant craving in several studies. Moreover, studies implementing social interactions with avatars in cue-provoked paradigms also reported increases in craving compared to neutral VR immersion. It thus seems that VR offers a good technical alternative to videos and photographs in standard cue-provoked paradigms and that VR can successfully integrate a simulation of social interaction in such paradigms. We believe this is of great importance as social interactions are often reported as craving agents and incentives to consume. However, the difference in craving induction between immersion in a virtual environment and interaction with avatars remains to be tested.

Virtual reality thus shows an interesting potential for improving the current assessment of craving in SUDs. VR is a flexible and controllable tool in the sense that cues can easily be modified to a given patient's requirements or condition. This specificity of environments and avatars thus holds great promise for the development of individualized care and treatment in SUDs. Thus, VR may help reach a significantly higher level of ecological validity. Although traditional craving paradigms use validated sets of photographs and videos, VR directly involves the subject in simulated situations and this may significantly improve craving induction and, further down the line, sensitization behavioral therapy. Moreover, since SUDs are complex conditions evolving over several weeks and months and comprising various stages, VR may be used in these different stages, i.e., before the appearance of craving. Although it offers limited support in the study of actual substance consumption, VR may help understand other stages in the development cycle of substance use. One of the key concepts in this line of discussion is *sense of presence*, which describes individual experiences of immersion, naturalism, and realism, or simply the "being there" in a virtual environment (Lombard et al., 2009). Hence, when assessed systematically, sense of presence can serve as an indicator of how far VR settings resemble real-life situations. Rigorously assessing sense of presence and immersion will thus be crucial in future works on SUDs with VR. While increased sense of presence can be linked to elevated individual engagement in VR, it also might result in negative effects, including sensations of vertigo or nausea during or after VR exposure (De Leo et al., 2014). These adverse physiological reactions are generally referred to as *cybersickness*. Cybersickness is mainly attributed to conflicts in sensory processing in artificial virtual environments and is, thus, closely related to the concept of sense of presence (So et al., 2001; Kiryu and So, 2007). Although a standardized tool for the assessment of cybersickness has long been available [Simulator Sickness Questionnaire (SSQ); (Kennedy et al., 1993)], only a minority of studies reviewed here reported its use. Scores of cybersickness questionnaires were generally low and indicate that it did not affect behavioral or physiological outcome measures (Lee et al., 2004, 2005; Choi et al., 2011). However, one study (Baumann and Sayette, 2006) reported a single case of severe cybersickness that led to incomplete testing and subsequent dropout from statistical analysis. Hence, the issue of cybersickness might affect both individual health and the validity of study outcomes. Finally, although programming and setting up VR paradigms requires a considerable level of expertise, VR is cost-effective, easily combinable with pharmaco- and psychotherapies, and presents few side effects. A recent review (Bush, 2008) points out the fact that VR may be more desirable in several clinical settings, as it is more private and less embarrassing for patients than *in vivo* behavioral therapy.

The interesting potential of VR in behavioral research and medicine may also be grasped in different contexts. For example, compulsive overeating (Hone-Blanchet and Fecteau, 2014) and gambling (Leeman and Potenza, 2011) are often compared to SUDs because of their compulsive component and appearance of craving. A few studies have investigated the use of VR to provoke food cravings (Ferrer-García et al., 2013; Ledoux et al., 2013). Moreover, researchers propose that VR may help in the assessment and treatment of obesity (Bordnick et al., 2011). Thus, VR holds promising potential for developing applications in neurosciences and may help promote individualized care in future clinical psychology.

#### **ACKNOWLEDGMENTS**

This work was supported by a scholarship from Centre Interdisciplinaire de Recherche en Réadaptation et Intégration Sociale to Antoine Hone-Blanchet, National Science and Engineering Research Council and Canada Research Chair to Shirley Fecteau.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 02 July 2014; accepted: 01 October 2014; published online: 17 October 2014. Citation: Hone-Blanchet A, Wensing T and Fecteau S (2014) The use of virtual reality in craving assessment and cue-exposure therapy in substance use disorders. Front. Hum. Neurosci. 8:844. doi: 10.3389/fnhum.2014.00844*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Hone-Blanchet, Wensing and Fecteau. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# HUMAN NEUROSCIENCE

REVIEW ARTICLE published: 15 October 2014 doi: 10.3389/fnhum.2014.00807

### The use of virtual characters to assess and train non-verbal communication in high-functioning autism

#### **Alexandra Livia Georgescu<sup>1</sup>\*, Bojana Kuzmanovic<sup>1,2</sup>, Daniel Roth<sup>3</sup>, Gary Bente<sup>3</sup> and Kai Vogeley<sup>1,4</sup>**

<sup>1</sup> Department of Psychiatry and Psychotherapy, University Hospital Cologne, Cologne, Germany

<sup>2</sup> Ethics in the Neurosciences (INM-8), Institute of Neuroscience and Medicine, Research Center Juelich, Juelich, Germany

<sup>4</sup> Cognitive Neuroscience (INM-3), Institute of Neuroscience and Medicine, Research Center Juelich, Juelich, Germany

#### **Edited by:**

Ouriel Grynszpan, Université Pierre et Marie Curie, France

#### **Reviewed by:**

Antonia Hamilton, The University of Nottingham, UK

Michal Hochhauser, University of Haifa, Israel

#### **\*Correspondence:**

Alexandra Livia Georgescu, Department of Psychiatry and Psychotherapy, University Hospital Cologne, Kerpener Strasse 62, Cologne 50924, Germany e-mail: alexandra.georgescu@uk-koeln.de

High-functioning autism (HFA) is a neurodevelopmental disorder, which is characterized by life-long socio-communicative impairments on the one hand and preserved verbal and general learning and memory abilities on the other. One of the areas where particular difficulties are observable is the understanding of non-verbal communication cues. Thus, investigating the underlying psychological processes and neural mechanisms of non-verbal communication in HFA allows a better understanding of this disorder, and potentially enables the development of more efficient forms of psychotherapy and trainings. However, the research on non-verbal information processing in HFA faces several methodological challenges. The use of virtual characters (VCs) helps to overcome such challenges by enabling an ecologically valid experience of social presence, and by providing an experimental platform that can be systematically and fully controlled. To make this field of research accessible to a broader audience, we elaborate in the first part of the review the validity of using VCs in non-verbal behavior research on HFA, and we review current relevant paradigms and findings from social-cognitive neuroscience. In the second part, we argue for the use of VCs as either agents or avatars in the context of "transformed social interactions." This allows for the implementation of real-time social interaction in virtual experimental settings, which represents a more sensitive measure of socio-communicative impairments in HFA. Finally, we argue that VCs and environments are a valuable assistive, educational and therapeutic tool for HFA.

**Keywords: high-functioning autism, non-verbal behavior, social interaction, virtual reality, virtual characters, social gaze**

#### **NON-VERBAL COMMUNICATION AND HIGH-FUNCTIONING AUTISM**

#### **NON-VERBAL BEHAVIOR AND SOCIAL COGNITION**

Non-verbal communication constitutes an essential aspect of social cognition. Indeed, non-verbal cues are known to influence person perception and construal processes early during social encounters (Willis and Todorov, 2006) and a large proportion of social meaning is substantially informed by non-verbal cues (Argyle, 1988; Burgoon, 1994). Thus, the investigation of the behavioral and neural correlates of non-verbal behavior processing can deliver valuable insights into social cognition and human communication.

Behavioral research has long been investigating the perception and evaluation of facial or bodily cues (i.e., decoding of emotions and intentions). More recently, the field of social and affective neuroscience has also started investigating the underlying neural mechanisms associated with the social processing of facial and bodily non-verbal cues (in terms of mental state attribution, Gallagher and Frith, 2003; De Gelder and Hortensius, 2014), and also those involved in perceiving meaningful intransitive actions (i.e., non-object directed), be they mimed, expressive, or symbolic (e.g., Gallagher and Frith, 2004; Grèzes et al., 2007; Villarreal et al., 2008). Neurally, this processing is traceable to two main networks in the human brain: the action observation network (AON), associated with human movement perception, and the social neural network (SNN), involved in social-cognitive processing (e.g., Van Overwalle and Baetens, 2009).

#### **HIGH-FUNCTIONING AUTISM AND NON-VERBAL BEHAVIOR**

Autism spectrum disorders (ASD) are characterized by impairments in communication and reciprocal interaction (World Health Organization, 1993). Socio-communicative deficits in high-functioning autism (HFA) manifest themselves in problems with spontaneously producing, interpreting, and responding to non-verbal cues. More specifically, on the perception side, the intrinsic value and salience of non-verbal cues are reduced in individuals with HFA. They do not spontaneously attend to social information, and are thus less able to intuitively interact in social contexts (Klin et al., 2003). When confronted with non-verbal signals, such as eye gaze, facial expressions, or gestures, individuals with HFA have shown atypical detection (Senju et al., 2005, 2008; Dratsch et al., 2013) and interpretation of such cues (Baron-Cohen, 1995; Baron-Cohen et al., 1997; Uljarevic and Hamilton, 2013) and have difficulties in integrating them for the purpose of an adequate impression formation of others (Kuzmanovic et al., 2011). Generally, they seem to be less affected by them when processing a task, as compared with typically developed control persons (Schwartz et al., 2010; Schilbach et al., 2012), and/or they seem to use atypical strategies for social processing (e.g., Kuzmanovic et al., 2014; Walsh et al., 2014). Furthermore, neuroimaging studies have shown that the SNN, which is involved in conscious mental inference and evaluation of social stimuli (Gallagher and Frith, 2003; Frith, 2007; Van Overwalle and Baetens, 2009), shows a diminished response to the processing of non-verbal social information in HFA (Baron-Cohen et al., 1999; Critchley et al., 2000; Piggot et al., 2004; Pelphrey et al., 2005; Ashwin et al., 2007; Pitskel et al., 2011; Redcay et al., 2012; von dem Hagen et al., 2013; Georgescu et al., 2013; Kuzmanovic et al., 2014).

<sup>3</sup> Department of Psychology, University of Cologne, Cologne, Germany

**Abbreviations:** AON, action observation network; ASD, autism spectrum disorder; CVE, collaborative virtual environment (aka shared virtual environment); DVE, desktop virtual environment; fMRI, functional magnetic resonance imaging; HFA, high-functioning autism; IVE, immersive virtual environment; SNN, social neural network; SVE, single-user virtual environment; TSI, transformed social interaction; VC, virtual characters; VE, virtual environment; VR, virtual reality.

Thus, non-verbal information may influence social perception, as well as affective and inferential processing, all of which have been demonstrated to be impaired in HFA. In addition, non-verbal behavior exhibits specifically high levels of complexity that are closely related to intuitive cognitive and affective processing. For this reason, investigating non-verbal behavior processing in HFA not only helps to understand (1) social cognition and its underlying neural mechanisms but also (2) the specific cognitive style characteristic of HFA. These insights, in turn, are most valuable for improving supportive therapy and training options, which may improve the lives of affected individuals and their families.

Nevertheless, the investigation of non-verbal behavior faces several basic methodological challenges, which will be elaborated in the next section. Virtual characters (VCs) are introduced as a means to overcome such methodological issues and various experimental implementations are discussed. Finally, we will consider how these implementations have been or could be used in the future for research and training with individuals with HFA.

#### **VIRTUAL CHARACTERS AS A TOOL FOR NON-VERBAL BEHAVIOR RESEARCH**

#### **BASIC METHODOLOGICAL PROBLEMS IN NON-VERBAL BEHAVIOR RESEARCH**

In contrast to verbal communication, non-verbal cues cannot be readily translated into distinct meanings (Krämer, 2008). In fact, non-verbal behavior is characterized by (a) high dimensional complexity and (b) high processual complexity (Krämer, 2008; Vogeley and Bente, 2010). Dimensional complexity relates to the fact that non-verbal signals are highly context dependent and comprise a simultaneous multichannel activity (Poyatos, 1983). The interpretation of a single non-verbal cue depends on which other verbal, non-verbal, and situational cues precede, co-occur with, or follow it (e.g., Grammer, 1990; Chovil, 1991). Moreover, processual complexity implies that meaningful information is conveyed by dynamic aspects of facial expressions and movements of the head or body, and that subtle spatiotemporal characteristics of perceived behavior can affect the way that non-verbal information is processed (e.g., Birdwhistell, 1970; Grammer et al., 1988, 1999; Krumhuber and Kappas, 2005; Krumhuber et al., 2007; Provost et al., 2008). Therefore, Burgoon et al. (1989) (p. 23) argue for the use of dynamic non-verbal stimuli and suggest that "we need to understand non-verbal communication as an ongoing, dynamic process rather than just a static snapshot of cues or final outcomes at one moment of time."

Some researchers have used dynamic non-verbal stimuli either by using so-called "thin slices" of people's behavioral streams (i.e., brief excerpts of behavior, less than 5 min in length; Ambady and Rosenthal, 1992), or by instructing confederates or actors to produce or vary particular aspects of their non-verbal behavior. This approach, however, has its disadvantages due to the fact that implicit movement qualities are both produced and perceived automatically and outside awareness, hence making them difficult to capture and/or control experimentally (Choi et al., 2005). Therefore, a large amount of research to date has investigated non-verbal cues using only static photographs or pictures of, for instance, specific gestures or emotional faces and bodies.

In addition, when using neuroimaging techniques to investigate non-verbal behavior processing, a set of unique challenges with respect to ecological validity emerges. David (2012) states that participants are restricted in the movements they can make in order to prevent artifacts in the recording of neural data. Therefore, neuroimaging paradigms are rather limited in terms of how much they allow participants to engage with a social stimulus. Furthermore, increasing ecological validity involves increasing complexity of the stimulus material and/or task demands, which raises the question of what a "neural correlate" actually reflects (David, 2012). Therefore, social scripts and stimuli often have to be reduced and presented repeatedly in experiments in order to increase statistical power, yet they then lack ecological validity and may lead to habituation and/or expectancy effects.

To sum up, the investigation of the processing of non-verbal behavior meets several basic methodological challenges, some inherent in the nature of the stimulus (i.e., experimental control) and others caused by technical restrictions (i.e., ecological validity).

#### **VIRTUAL CHARACTERS OFFER A GOOD COMPROMISE BETWEEN EXPERIMENTAL CONTROL AND ECOLOGICAL VALIDITY**

The abovementioned challenges can be overcome by using anthropomorphic VCs. These are artificial characters, which have realistic human features, can be either static or dynamic and can either be animated by using key framing or motion-capturing techniques. Moreover, VCs are a medium through which virtual interaction partners can be expressed. In this line, research distinguishes between two possible virtual representations of human beings in an interacting context, which differ in terms of their level of agency (Bailenson and Blascovich, 2004; Bailenson et al., 2006; von der Pütten et al., 2010b): (1) agents (i.e., a digital model of a person, which is driven by a computer algorithm) and (2) avatars (i.e., a digital model of a person, which is controlled by a real human in real time). VCs have the advantage of realistic behavior capabilities on the one hand, and systematic manipulability on the other, hence allowing the simultaneous increase of both experimental control and ecological validity (Vogeley and Bente, 2010; Bohil et al., 2011). Moreover, they provide the option to control and investigate body motion independently from body shape, a methodological advantage termed as "plasticity" by Bente and Krämer (2011). This possibility of independently masking or transforming aspects of both appearance and behavior is essential in order to disentangle top-down effects of appearance from bottom-up effects of behavior (Bente et al., 2008).

The most important prerequisite for using VCs for non-verbal behavior research is that they are veridical and convincing and that they are able to evoke impressions, attributions, and reactions in an observer that are comparable to those evoked by real human beings (Krämer, 2008; Vogeley and Bente, 2010). Indeed, the validity of VCs in non-verbal behavior research has been amply demonstrated in both behavioral and neuroimaging studies, and there is consistent evidence that VCs are perceived comparably to real human beings. For example, a series of studies have shown that person perception ratings based on the non-verbal behavior of videotaped human beings do not differ significantly from those based on the identical movements performed by VCs (Bente et al., 2001). Moreover, virtual emotional facial and bodily expressions are recognized as accurately as natural ones (Dyck et al., 2008; McDonnell et al., 2008), and recent functional neuroimaging research demonstrated that facial animations of emotional virtual faces also evoke brain responses comparable to those evoked by real human faces, specifically in the amygdala (Moser et al., 2007), a region robustly associated with social processing (Adolphs et al., 1998).

#### **METHODOLOGICAL IMPLEMENTATIONS AND SETUPS FOR THE USE OF VIRTUAL CHARACTERS**

Virtual characters can be used within a variety of virtual reality (VR) systems, which can differ in terms of their immersive potential. Immersion refers to the degree of sensory stimulation through the system on the one hand and the sensitivity of the system to motor inputs, on the other (Biocca et al., 2003). Thus, the level of immersion of a VR system is determined by the number of sensory and motor channels connected to the virtual environment (VE) (Biocca et al., 2003). For instance, desktop virtual environments (DVEs) typically involve a user viewing a VE through a computer screen (Bente and Krämer, 2011). While a participant can interact with the environment, using common input devices, such as keyboard, mouse, joystick, or touchscreen, the interaction does not include a high degree of immersion. Some examples for the use of DVEs would be the interaction within online communities like "Second Life" or various desktop-based training programs. Nevertheless, for research investigating not just the perception but also the production side of non-verbal behavior, there is an important consideration to be made: classic DVE setups often require a conscious decision by the user to launch the non-verbal cue via discrete input options, such as clicking a button or hitting a key (Bente and Krämer, 2011). Thus, the sender of a non-verbal message would be more self-aware, as they would have to consciously choose what to display and when to do it. Furthermore, the number of non-verbal signals produced is restricted as cognitive resources of the sender are limited (Bente and Krämer, 2011). Nevertheless, even though they lack peripheral vision, DVEs can increase their immersive potential, by making use of stereoscopic monitors and/or head tracking (e.g., Fish Tank VR, Ware et al., 1993). 
Another example would be the virtual communication environment [VCE, by Bente and Krämer (2011), see also the desktop platform illustrated in **Figure 1**], which is a DVE paradigm that conveys in real time a wide range of non-verbal cues via eye and motion tracking.

The so-called immersive virtual environments (IVEs) typically have a higher immersion potential, compared to classic DVEs, which can be achieved for instance by including continuous real-time tracking of a user's movements with high degrees of freedom and/or by engaging peripheral in addition to central vision (Bente and Krämer, 2011). Such systems are better at capturing and transmitting a broader range of behavior and allowing for a spontaneous and subconscious usage of non-verbal cues (Bente and Krämer, 2011). IVEs may make use of different display and tracking solutions. For instance, curved screen projections, such as Powerwalls or Tiled Walls (e.g., HEyeWall; Santos et al., 2007) use a combination of multiple projectors or LCD panels to increase the overall display size and resolution and display monoscopic or stereoscopic content. Some systems are equipped with a head mounted display (HMD, a visual display worn as a type of helmet) and directly track movements of the user's head and/or body to duplicate them within the VE (Bohil et al., 2011). Another type of IVE refers to open display systems, where the user is inside a room or sphere, the surface of which is a seamless display system such as CAVEs (Cruz-Neira et al., 1993) or fulldomes (Bohil et al., 2011). To increase immersiveness, all these setups may use devices to track locomotion (location trackers), hand movements (data gloves or 3D mice), and body motion (motion capture). It has to be noted that, due to end-to-end time lags (between the user's actions and the corresponding display changes), such immersive technologies can cause "virtual reality induced symptoms and effects" (VRISE, Sharples et al., 2008, previously also referred to as "cyber-sickness," LaViola, 2000). This issue has to be considered when designing experiments in IVEs.

Whether DVEs or IVEs are used, and which level of immersion is adopted for research purposes, usually depends on the research question, as well as on the budget and accessibility of the technology.

#### **RECENT DEVELOPMENTS IN BEHAVIORAL PARADIGMS USING VIRTUAL CHARACTERS AS REAL-TIME INTERACTION PARTNERS**

Most behavioral and neuroimaging studies on non-verbal behavior processing using VCs have used DVEs and observational paradigms and have focused mostly on the perception side of social cognition. In such paradigms, a participant merely observes and evaluates non-verbal cues performed by a virtual other on a screen, without being involved in an interaction with them (e.g., Schilbach et al., 2006; Kuzmanovic et al., 2009, 2012). De Gelder and Hortensius (2014) (p. 160) explain that in observational paradigms, which are also called "offline" paradigms (Pfeiffer et al., 2013a; Schilbach et al., 2013; Schilbach, 2014), the "person observed is not influenced by the way his/her actions are perceived by others. On the other hand, the observer does not get any feedback or insight from his/her correct perception; neither does he/she suffer the consequences of misperception." In the same line, Patterson (1994) highlights that social interaction consists of both person perception and behavior production simultaneously. Thus, there is growing consensus that observational paradigms alone are insufficient for a comprehensive understanding of the neural mechanisms of social cognition, and researchers have recently been arguing for a paradigm shift toward "online," interactive experimental designs [Hari and Kujala, 2009; Dumas, 2011; Konvalinka and Roepstorff, 2012; Pfeiffer et al., 2013a; Schilbach et al., 2013; Schilbach, 2014; but see also Przyrembel et al. (2012) for a more cautious review highlighting current limitations from a philosophical, psychological, and neuroscientific perspective].

Confirmation for the validity of VCs for "online," interactive paradigms comes from VR research, where agents have been observed to evoke social effects and behaviors comparable to those observed during the interaction with a real human (e.g., Sproull et al., 1996; Nass and Moon, 2000; Hoyt et al., 2003; Park and Catrambone, 2007). Moreover, VR research has repeatedly confirmed that social interactions in VEs are governed by the same social norms as social interactions in the real world, and that social norms relating to gender, interpersonal distance, approach behavior, and eye gaze can be transferred to VEs (Bailenson et al., 2001; Garau et al., 2005; Yee et al., 2007). However, one of the shortcomings of "online" social interaction paradigms is the fact that, once an experimental variation has been introduced, it most likely develops its own dynamics (Bente and Krämer, 2011).

One of the most promising approaches to study non-verbal communication in "online" social interactions is the "transformed social interaction" (TSI) approach (Bailenson et al., 2004; Krämer, 2008), which builds upon the previously mentioned "plasticity" advantage of VCs (Bente et al., 2008). In this approach, motion is captured and rendered on an avatar. Not only the appearance of the VC (containing information for instance on sex, identity, ethnicity, or attractiveness), but also their non-verbal behavior can be manipulated, by blending particular channels (via static filters), or by modifying specific non-verbal cues (via dynamic filters, e.g., head movement activity can be altered using specific algorithms) (Bente and Krämer, 2011). By doing this in a systematic manner, it can be determined which aspects of non-verbal behavior are necessary and/or most efficient with regard to various social contexts. This makes it possible to analyze how manipulations of appearance and/or behavior of one agent or a dyad affect the experience and the course of social interactions. Blascovich et al. (2002) (p. 121) summarize the benefits of this approach by stating "investigators can take apart the very fabric of social interaction using immersive virtual environment technology (IVET), disabling or altering the operation of its components, and thereby reverse engineering social interaction. With this approach, social psychologists could systematically determine the critical aspects of successful and unsuccessful social interactions, at least within specified domains and interaction tasks."
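To make the idea of a dynamic filter more concrete, the following minimal Python sketch scales the head movement activity of a motion-captured sequence before it is rendered on an avatar. It is an illustration only and not the algorithm used in the cited studies: the `HeadPose` type, the `scale_head_activity` function, and the `gain` parameter are hypothetical names, and the assumption that head movement is available as per-frame yaw, pitch, and roll angles is ours.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeadPose:
    """Hypothetical per-frame head orientation in degrees."""
    yaw: float
    pitch: float
    roll: float


def scale_head_activity(frames: List[HeadPose], gain: float) -> List[HeadPose]:
    """Scale frame-to-frame head rotation by `gain` (assumed filter design).

    gain < 1 dampens head movement, gain > 1 exaggerates it, and gain == 0
    freezes the head at its initial pose. The first frame is left unchanged
    so that the filtered sequence starts from the captured pose.
    """
    if not frames:
        return []
    filtered = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        last = filtered[-1]
        # Re-integrate the scaled frame-to-frame change onto the filtered pose.
        filtered.append(
            HeadPose(
                yaw=last.yaw + gain * (cur.yaw - prev.yaw),
                pitch=last.pitch + gain * (cur.pitch - prev.pitch),
                roll=last.roll + gain * (cur.roll - prev.roll),
            )
        )
    return filtered


captured = [HeadPose(0.0, 0.0, 0.0), HeadPose(10.0, 4.0, 0.0), HeadPose(20.0, 2.0, 0.0)]
dampened = scale_head_activity(captured, gain=0.5)
```

In such a sketch, all other channels (gestures, posture) would be passed to the avatar unchanged, so that only the manipulated cue differs between experimental conditions.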

The TSI approach has been used to study the effects of experimentally manipulated gaze behavior in ongoing interactions. Bente et al. (2007a,b) used eye tracking and motion capture to control two avatars representing two interactants during an open conversation. While gestures and movements were conveyed in real time, the display of gaze direction was manipulated. The authors showed that longer periods of directed gaze fostered the positive evaluation of the partner. The study demonstrated how experimental control of non-verbal cues can be implemented within a rich and fluent social interaction. Other studies using gaze-contingent eye-tracking paradigms have been developed to investigate how social gaze is used to coordinate attention between a participant and an agent (Schilbach et al., 2010; Wilms et al., 2010; Pfeiffer et al., 2011). Finally, paradigms investigating the social effects of mimicry during social interactions with VCs have been developed as well (Bailenson and Yee, 2005, 2007).

To conclude, the "online" social interaction approach in non-verbal behavior research asks for the analysis of behavioral as well as physiological and neural patterns emerging across agents during social interactions. The specific advantages of using VCs in this type of research have been demonstrated for rather simple and restricted non-verbal cue systems, such as social gaze. Nevertheless, this approach can be easily extended to higher complexity levels in non-verbal behavior (see also "shared virtual environment" in **Figure 1**).

#### **RECENT DEVELOPMENTS IN NEUROIMAGING PARADIGMS USING VIRTUAL CHARACTERS AS REAL-TIME INTERACTION PARTNERS**

Using VR in neuroimaging paradigms [for instance, functional magnetic resonance imaging (fMRI) paradigms] increases the potential of standard fMRI paradigms, where the volunteer usually has the passive role of watching a simple stimulus without any interaction. In this line, De Gelder and Hortensius (2014) (p. 160) argue that VR provides the field of social-cognitive neuroscience with a powerful tool to study affective loops created by "online" interactions "in settings where real-life manipulation is not possible, too expensive or unethical."

In neuroimaging studies, however, despite the technical advancement of the VR and motion capture technologies, the possibilities of studying non-verbal communication in social interaction are limited. First, social scripts would have to be systematically manipulated, reduced, and presented repeatedly, in order to increase statistical power. Moreover, not only need the experimenters ensure that the motion tracking systems are compatible with the available neuroimaging techniques (e.g., MR-compatible for fMRI experiments), but participants are also restricted in their movements to prevent causing artifacts during neural data acquisition. Indeed, research is being done on developing VR platforms that are compatible with magnetic resonance imaging systems (e.g., Baumann et al., 2003; Mraz et al., 2003; see also **Figure 1**).

There are, however, several ways to overcome this problem. One way would be to create VEs, where the visual embodiment of the participants and hence their means of interaction with the virtual world and virtual other is controlled via some limited input information in the scanner environment. Although far from ideal (since it involves awareness and explicit production of non-verbal behavior), this approach presupposes that a virtual avatar can be controlled by using button presses or a joystick. In this line, Baumann et al. (2003) have developed a VR system of integrated software and hardware for neurobehavioral and clinical fMRI studies. The authors propose a VR system, which includes a joystick for navigation, a touchpad, and an optional data glove with an attached motion tracker. Furthermore, the setup enables the measurement of physiological data (respiration, heart rate, blood volume pulsatility, and skin conductance response), and the system provides synchronization of the VR simulation with the physiological recordings and the functional MR images (see also "brain imaging setups" in **Figure 1**).
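As a rough illustration of what synchronizing a VR simulation with functional MR images can involve, the following Python sketch maps the timestamp of a VR event onto the index of the image volume that was being acquired at that moment. It is a sketch under stated assumptions, not a description of the system by Baumann et al. (2003): the function name `event_to_volume` and its parameters are hypothetical, and we assume that VR events and scanner volumes are timestamped on a shared clock and that volumes are acquired at a constant repetition time (TR).

```python
import math
from typing import Optional


def event_to_volume(
    event_time_s: float,
    first_volume_time_s: float,
    tr_s: float,
    n_volumes: int,
) -> Optional[int]:
    """Return the 0-based index of the volume acquired at `event_time_s`.

    Assumes a shared clock and a constant repetition time `tr_s` (seconds).
    Returns None if the event falls outside the acquisition window.
    """
    if tr_s <= 0:
        raise ValueError("tr_s must be positive")
    offset = event_time_s - first_volume_time_s
    if offset < 0:
        return None  # event occurred before scanning started
    index = math.floor(offset / tr_s)
    return index if index < n_volumes else None


# Example: scanning starts at t = 10 s with TR = 2 s; a button press at
# t = 15 s falls into the third volume (index 2).
volume_index = event_to_volume(15.0, first_volume_time_s=10.0, tr_s=2.0, n_volumes=100)
```

The same mapping could be applied to physiological samples, provided that they carry timestamps from the same clock.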

Another possibility of using VR in the fMRI context would be to investigate a form of minimal social interaction, which would, by definition, only require minimal non-verbal input but would enable the study of social interactions based on gaze behavior in real time. In this line, social gaze paradigms offer a good solution (Schilbach et al., 2010; Wilms et al., 2010; Pfeiffer et al., 2011, 2013b). Recent methodological advances have used VCs [for a review see Barisic et al. (2013)] in (1) gaze-contingent eye-tracking paradigms (Schilbach et al., 2010; Wilms et al., 2010; Pfeiffer et al., 2011; Grynszpan et al., 2012), (2) live interactions via video feeds, as in bi-directional real-time video streams (Redcay et al., 2010; Saito et al., 2010; Tanabe et al., 2012), and (3) dual eye tracking in two (real or virtual) person setups (Barisic et al., 2013). In the first approach, the gaze is used to control contingent behavior of VCs, who are agents with preprogrammed reactions contingent upon an individual's behavior, hence creating merely the illusion of an "online" real-time interaction (Barisic et al., 2013). The second approach does not make use of virtual technology; therefore, the experimenter is unable to interfere with an interaction, except for substituting or delaying the real-time video stream (Barisic et al., 2013). Consequently, only the third approach is a real interactive one, able to make full use of avatar technology. In this line, Barisic et al. (2013) present the implementation of a dual eye-tracking setup enabling true reciprocity and coordination in a social interaction of two individuals represented by avatars. In this setup, the eye gaze can be either an active part of the task or it can be a dependent measure that can be correlated with other behaviors of interest (Barisic et al., 2013). 
Furthermore, in line with the TSI approach, this paradigm allows both the VE and the VCs to be fully and systematically controlled in terms of their outer appearance and behavior.
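The first approach, in which a VC reacts with preprogrammed behavior contingent upon the participant's gaze, can be illustrated with a minimal Python sketch. It is not the implementation used in the cited studies. The names `GazeSample`, `Region`, `first_fixated_region`, and `agent_response`, the 200 ms dwell criterion, and the object labels are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GazeSample:
    t: float  # sample time in seconds
    x: float  # horizontal gaze position in screen pixels
    y: float  # vertical gaze position in screen pixels


@dataclass
class Region:
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def first_fixated_region(samples, regions, min_dwell=0.2):
    """Return (region name, onset time) of the first region that is looked at
    continuously for at least min_dwell seconds, or None if there is none.
    The dwell criterion is a hypothetical choice, not taken from the source."""
    current, onset = None, None
    for s in samples:
        hit = next((r.name for r in regions if r.contains(s.x, s.y)), None)
        if hit != current:
            # Gaze entered a new region (or left all regions): restart the dwell timer.
            current, onset = hit, s.t
        if current is not None and s.t - onset >= min_dwell:
            return current, onset
    return None


def agent_response(fixated_name, follow=True,
                   objects=("left_object", "right_object")):
    """Preprogrammed reaction of the virtual character: look at the object the
    participant fixated (follow=True) or at the other object (follow=False)."""
    if fixated_name not in objects:
        return "look_at_participant"
    if follow:
        return fixated_name
    return objects[1] if fixated_name == objects[0] else objects[0]
```

In such a sketch the experimenter controls the contingency through the single `follow` flag, which mirrors the point made above: the character's behavior is scripted, so the interaction is only apparently reciprocal.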

A further promising approach in terms of neuroimaging possibilities, which opens up another level of analysis in this line of paradigms, is hyperscanning. It allows the simultaneous measurement of brain activity in two interacting individuals situated in different neuroimaging environments (electroencephalography: Astolfi et al., 2010; Dumas et al., 2010; Kourtis et al., 2010; Lachat et al., 2012; near-infrared spectroscopy: Cui et al., 2012; magnetoencephalography: Baess et al., 2012; Hirata et al., 2014; fMRI: Montague et al., 2002; King-Casas et al., 2005; Saito et al., 2010; Tanabe et al., 2012). In particular, the development of fMRI hyperscanning allows the synchronization of functional image acquisition across multiple subjects and scanners and the performance of cross-brain correlation analyses, and thus permits the measurement of inter-brain activity coherence during the act of interacting. Combined with using VCs as stimuli, hyperscanning would allow researchers to measure the reactions of multiple participants to a shared social situation in a VR environment (see also "Virtual characters as real-time interaction partners" and "brain imaging setups" in **Figure 1**).
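The idea behind a cross-brain correlation analysis can be illustrated with a minimal Python sketch. It assumes that one region-of-interest time series per participant has already been extracted, preprocessed, and aligned to a common acquisition clock. The function names `pearson_r` and `lagged_cross_brain_correlation` are hypothetical, and the analyses in the cited hyperscanning studies are considerably more elaborate.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sum(a * b for a, b in zip(dx, dy)) / denominator


def lagged_cross_brain_correlation(series_a, series_b, max_lag):
    """Correlate the time series of participant A with that of participant B
    at lags from -max_lag to +max_lag (in volumes). A positive lag means that
    B's signal follows A's signal by that many volumes."""
    if len(series_a) != len(series_b):
        raise ValueError("both series must cover the same volumes")
    n = len(series_a)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = series_a[:n - lag], series_b[lag:]
        else:
            a, b = series_a[-lag:], series_b[:n + lag]
        result[lag] = pearson_r(a, b)
    return result
```

Scanning over lags is one simple way to ask whether one participant's signal tends to lead or follow the other's. Whether a given correlation exceeds chance would still have to be established, for example against pairs of participants who did not interact.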

#### **VIRTUAL CHARACTERS AND SOCIAL PRESENCE**

The acceptance of VCs as intentional and engaging social entities has been termed "copresence" (also referred to as "social presence"), which denotes a communicator's sense of awareness of the presence of an interaction partner [for a review, see Biocca et al. (2003)]. While numerous studies by different research groups show that people can perceive both forms of representations (agents or avatars) comparably to real human beings, it is important to note that findings are not entirely conclusive (cf. Perani et al., 2001; Mar et al., 2007; Moser et al., 2007). The emergence of copresence, which can be measured both at the behavioral and neural level, is mediated by several factors, and the "immersion" potential of the technology is only one of them. Such factors need to be taken into account when designing social interaction paradigms using VCs. In the following, these factors are described.

#### **Anthropomorphism (i.e., human form realism)**

The consensus in the literature is that the more anthropomorphic or humanlike a character looks, the more likely it is to be accepted by an observer (Garau, 2003). On a neural level, the activity of neural regions of the SNN, which are consistently associated with social-cognitive processing, is correlated with the increasing degree of realism of a character (Mar et al., 2007), or anthropomorphism of an interaction partner while performing the prisoner's dilemma game (Krach et al., 2008). In a similar vein, it has been suggested that the AON is tuned to realistic representations of conspecifics (Perani et al., 2001; Shimada, 2010).

#### **Behavioral realism (i.e., humanlike movements or behavior patterns)**

In observational paradigms, the AON has been found to be preferentially activated when processing biological motion, i.e., movements with kinematics characterized by a smooth velocity profile [Dayan et al., 2007; Casile et al., 2010; but see also Cross et al. (2012) and Georgescu et al. (2014)]. In interactive paradigms, aspects of behavioral realism related to the responsiveness of or feedback from the VC are crucial for the emergence of social presence (Garau, 2003). Indeed, even subtle manipulations increasing an avatar's responsiveness (e.g., maintaining eye contact, realistic blinking rates) can influence participants' social responses to VCs, suggesting that on some levels people can respond to virtual humans as social entities even in the absence of complex interactions (Bailenson et al., 2001, 2002; Garau et al., 2005).

#### **The interaction between anthropomorphism and behavioral realism and the "uncanny valley" effect**

Generally, it has been suggested that the sensitivity to biological motion is independent of how detailed the character's body is (Chaminade et al., 2007; McDonnell et al., 2008). However, according to the "uncanny valley" theory, the more a VC looks like a real human, the more likely subtle imperfections are to be perceived as awkward and thus to divert attention away from the targeted social-cognitive processes (Mori, 1970; Garau, 2003). Hence, even subtle flaws in rendering or expression may cause irritation when extremely detailed anthropomorphic, fully rendered 3D characters are used. In this regard, VCs that are highly realistic might also set up high expectations with respect to behavioral realism. Hence, a mismatch between form and behavioral realism can lead to a perception of inconsistency (Garau, 2003; Nowak and Biocca, 2003; Saygin et al., 2011). Indeed, even the most advanced motion capture technologies may find it impossible to match the level of accuracy in terms of degrees of freedom of natural human movement.

#### **Agency (the belief or knowledge about the nature of a VC)**

A top-down influence of belief about the nature of the VC (agent or avatar) may also modulate its perception [Stanley et al., 2007, 2010; Liepelt and Brass, 2010; Klapper et al., 2014; but see Press et al. (2006) and von der Pütten et al. (2010b)]. A direct comparison between the influence of belief about agency and behavioral realism on social presence revealed that believing that one is interacting with an avatar or with an agent barely influenced the evaluation of the VC or the behavioral reactions toward it, whereas variations in behavioral realism affected both [von der Pütten et al., 2010b; see also Nass and Moon (2000)].

#### **Observer characteristics**

It is important to note that, while VCs can elicit social presence, they tend to do so to varying degrees in different observers. Certain characteristics of the observers such as age and sex, their perceptual, cognitive and motor abilities, or prior experience with mediated environments can influence the amount of experienced social presence. For instance, participants' subjective feeling after an interaction with an embodied conversational agent, as well as their evaluation of the VC and their actual behavior, was dependent upon their personality traits (von der Pütten et al., 2010a). Furthermore, an important factor to control for is the computer proficiency of the observers and their exposure to VCs. Some people have a higher affinity to computers and games that use avatars and may thus have different expectations concerning both form and behavioral realism. There is evidence that proficiency in using VEs facilitates immersion and/or copresence, and training with artificial human stimuli can increase their credibility (Garau et al., 2005; Press et al., 2007). Similarly, Dyck et al. (2008) found that emotion recognition rates decreased for virtual but not for real faces only in participants over the age of 40, indicating that media exposure may indeed have an influence on the recognition of non-verbal signals displayed by VCs.

#### **VIRTUAL CHARACTERS IN NON-VERBAL BEHAVIOR RESEARCH IN HFA**

#### **VIRTUAL CHARACTERS AS STIMULI IN NEUROIMAGING STUDIES**

A critical prerequisite for a reasonable use of VCs in research investigating non-verbal behavior processing in HFA is that autistic individuals are engaged by VCs to the same extent as they would be by real human beings and that they do not show any differential positive or negative psychological responses to the former. Hernandez et al. (2009) performed an eye-tracking study to quantify gaze behavior in both adults with ASD and typically developing individuals while exploring static real and virtual faces with direct gaze. In concordance with the literature, participants with HFA spent less time on the eye region compared to typically developing individuals (e.g., Pelphrey et al., 2002; Rutherford and Towns, 2008; Riby and Doherty, 2009; Nakano et al., 2010; Falkmer et al., 2011). Critically, no differences were identified with regard to the exploration of the faces depending on whether they were real or virtual. With respect to the experience in IVEs, Wallace et al. (2010) have for instance shown in a usability study on HFA and typically developing children that experience of IVEs was similar across groups, and no negative sensory experiences were reported in children with HFA. We can conclude that VCs and IVEs are experienced in a similar manner by individuals with HFA and typically developing individuals, and that they can reliably be used to simulate authentic social situations in experimental settings. To our knowledge only five neuroimaging studies on non-verbal cue processing have been performed on individuals with HFA. In the following, we will review this literature (see also **Table 1** for an overview and details of the paradigms).

Neuroimaging research investigating non-verbal behavior in HFA using VCs has focused mainly on face processing and more specifically on the processing of emotional expressions and eye gaze. Schulte-Rüther et al. (2011) asked participants to empathize with static virtual emotional faces and either judge the emotional state of the face ("other" condition) or report the emotions elicited in themselves by the emotional face ("self" condition). With respect to the behavioral performance, the authors found no significant differences in reaction times between HFA participants and control participants for any of the two experimental conditions. In addition, neural results showed that key areas of the SNN were activated in both controls and HFA participants. However, the authors found evidence for a functional segregation in the medial prefrontal cortex, a region which has previously been associated with mentalizing (Amodio and Frith, 2006). Direct comparisons showed that the self- and other-referential tasks, relative to the control task, engaged the dorsal medial prefrontal cortex for individuals with HFA and the ventral portion of the same region for control participants. According to the established functional characterizations of these neural regions, empathizing with other persons is likely to be triggered by emotional self-referential cognition in controls, as affective "theory of mind" components are known to recruit ventral areas of the medial prefrontal cortex region (Amodio and Frith, 2006). Conversely, HFA participants seem to engage cognitive components of "theory of mind," which are associated with the dorsal portion of the medial prefrontal cortex (Amodio and Frith, 2006).

Given that eye gaze provides a foundation for communication and social interaction (Senju and Johnson, 2009), one area of research that has particularly benefited from the use of VR techniques has been concerned with investigating the neural correlates of social gaze processing in HFA. One of the first neuroimaging studies on this subject used fMRI to show that, in HFA, brain regions involved in gaze processing are not sensitive to intentions conveyed by observed gaze shifts (Pelphrey et al., 2005). The paradigm was based on short videos of a VC shifting their gaze either "congruently" at a checkerboard that appeared in their visual field, or "incongruently" away from the checkerboard. To process this task, autistic participants engaged the same temporo-parietal network as controls, centered around the superior temporal sulcus. However, in contrast to the control group, their activation was not modulated by congruency. The authors suggest that an absence of contextual influence on the superior temporal sulcus region indicates a reduced understanding of different intentions of others' gaze behavior and may represent a possible mechanism underlying gaze-processing deficits reported in ASD. More recently, Pitskel et al. (2011) addressed the differential neural processing of direct and averted gaze. Participants viewed videos depicting an approaching male VC either maintaining direct gaze with the observer or averting their eyes from them. The SNN, which was more responsive to the direct relative to averted gaze in typically developing participants, was not preferentially active to direct gaze in HFA participants, indicating again a reduced understanding of different meanings of non-verbal social cues. Similarly, von dem Hagen et al. (2013) showed participants dynamic virtual faces with neutral expressions displaying either averted or direct gaze events. 
Their results showed that regions of the SNN were more involved in processing direct compared to averted gaze in control participants but in the opposite contrast for HFA participants, potentially indicating an increased salience of averted but not direct gaze in HFA. Finally, a study from our own research group (Georgescu et al., 2013) employed a parametric design in order to investigate the neural correlates of the influence of gaze direction and duration on person perception. We used dynamically animated faces of VCs, displaying averted gaze or direct gaze of varying durations (1, 2.5, or 4 s). Results showed that direct gaze as such and increasing direct gaze duration modulated the engagement of the SNN in control participants, indicating the processing of social salience and a perceived communicative intent. In HFA participants, however, regions of the SNN were more engaged by averted and decreasing amounts of gaze, while the neural response for processing increasing direct gaze in HFA was not suggestive of any social information processing.

To conclude, research using VCs as stimuli attests that they are a useful tool for the investigation of non-verbal behavior processing in HFA. In particular, the inclusion and manipulation of dynamic aspects of movement is facilitated by using VCs and is therefore able to offer unique insight into non-verbal behavior processing in HFA. As a general result, these research findings show atypical social-cognitive processing in HFA both on the behavioral and neural level, highlighting the fact that non-verbal information is less salient to individuals with HFA compared to typically developed individuals. But while the mere perception of non-verbal cues may, under certain circumstances, be comparable to that of typically developed individuals, it seems that in individuals with HFA the evaluation of such cues may rely on different cognitive strategies.



**Frontiers in Human Neuroscience www.frontiersin.org** October 2014 | Volume 8 | Article 807 |



#### **VIRTUAL CHARACTERS AS REAL-TIME INTERACTION PARTNERS**

Despite the strong evidence for social processing deficits in HFA individuals, it has been documented that persons with HFA may learn to compensate for these deficits during social situations in structured experimental settings (Kylliäinen and Hietanen, 2004; Congiu et al., 2010). Explicit instructions to focus on specific social contents of stimuli may cancel out the atypical performance effects and even diminish the typical hypoactivation of SNN areas (Wang et al., 2007; Schulte-Rüther et al., 2011). In a similar vein, anecdotal reports inform us that individuals with autism have particular problems during "online," real-time interactions, which require the integration of signals from a variety of channels, while the complex and unpredictable input is rapidly changing (Redcay et al., 2012; Wang and Hamilton, 2012). This is in line with the idea that ASD is a disorder of complex information processing (Minshew and Goldstein, 1998). Dynamic interactions critically impede the application of rule-based strategies to compensate for HFA-characteristic deficits in intuitive communication (Klin et al., 2003; Redcay et al., 2010). This suggests that "offline" social cognition paradigms such as the ones described above (in "Virtual characters as stimuli in neuroimaging studies") may fail to capture important aspects of social processing deficits in HFA, and that "online" paradigms might be more appropriate for this purpose. Thus, neuroimaging paradigms using VEs and VCs that include the complexity of dynamic social interactions may provide a more sensitive measure of the neural basis of social and communicative impairments in HFA. Consequently, the combined use of VR and neuroimaging techniques offers great potential to investigate non-verbal communication in social interactions as well. Therefore, the possibility of engaging individuals with HFA in interactions in the scanner may be useful in understanding social cognition in ASD (Redcay et al., 2010).

Social interactions are characterized by a high degree of automatic interpersonal coordination (Cappella, 1981, 1996; Burgoon et al., 1993) or an above-chance probabilistic relationship between the actions of two interactants (Moran et al., 1992). A number of studies have performed kinematic analyses to investigate motor patterns expressed in social interactions. They were able to show that the kinematics of an action performed by an agent acting in isolation differ from those of the very same action performed within a social communicative context. Moreover, kinematics not only carry information as to whether a social communicative action is performed in a cooperative or competitive context, but also cause flexible online adjustments to take place in response to a partner's actions [Georgiou et al., 2007; Becchio et al., 2008a,b; Sartori et al., 2009; for a review, see Becchio et al. (2010)]. The ability to detect an interaction partner's responses as being related to one's own is termed social contingency sensitivity (Bigelow and Rochat, 2006). Several studies using observational paradigms have already attested that individuals with ASD perceive contingency in dyadic interactions abnormally in terms of animacy perception and mental state attribution [Abell et al., 2000; Klin, 2000; Castelli et al., 2002; Klin and Jones, 2006; see also Centelles et al. (2013)]. Therefore, it has been hypothesized that this inability to detect contingency in social interactions (as either observers or participants) may be a core impairment in autism (Gergely, 2001). The inefficient contingency processing could be related to one particular non-diagnostic secondary symptom of HFA, namely atypical temporal processing. For instance, interval timing [i.e., processing of stimulus duration; for a review, see Falter and Noreika (2011); see also Georgescu et al. (2013) for duration processing for social cognition] and temporal event structure coding (Falter et al., 2012, 2013) have been found to be atypical in HFA. There is evidence for the association between temporal processing and social cognition (e.g., Moran et al., 1992; Trevarthen and Daniel, 2005; Bigelow and Rochat, 2006). Thus, atypical temporal processing might play an important yet under-investigated role in ASD by interacting with and modulating primary symptoms, like deficits in non-verbal communication and social coordination and interaction (Falter and Noreika, 2011).

One such aspect that depends on temporal processing in the social domain is motor mimicry. Elementary motor mimicry (i.e., when an observer's overt motor response is appropriate to the situation of the observed other) has been understood as a communicative act (Bavelas et al., 1986). Bailenson and Yee (2005) performed the first study to show social influence effects with a non-human, non-verbal mimicker (i.e., an imitator of the behavior of another; Chartrand and Bargh, 1999). The authors found that when an embodied virtual agent mimicked participants' head movements 4 s after they occurred during a social interaction, the mimicking agent was more persuasive and was rated more positively with respect to certain traits compared to non-mimickers.
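The delayed mimicry described above can be illustrated with a minimal Python sketch. It assumes that a head tracker delivers timestamped (pitch, yaw, roll) samples; the class name `DelayedMimic`, the pose format, and the neutral starting pose are hypothetical and are not taken from Bailenson and Yee (2005).

```python
from collections import deque


class DelayedMimic:
    """Replays a participant's head pose on a virtual agent after a fixed delay."""

    def __init__(self, delay_s=4.0):
        self.delay_s = delay_s                 # mimicry delay in seconds
        self._buffer = deque()                 # (time, pose) samples not yet replayed
        self._agent_pose = (0.0, 0.0, 0.0)     # assumed neutral pose before any replay

    def update(self, t, participant_pose):
        """Store the participant's pose at time t (seconds) and return the pose
        the agent should display at time t."""
        self._buffer.append((t, participant_pose))
        # Release every buffered sample that is at least delay_s old;
        # the most recent of them becomes the agent's current pose.
        while self._buffer and self._buffer[0][0] <= t - self.delay_s:
            _, self._agent_pose = self._buffer.popleft()
        return self._agent_pose
```

Calling `update` once per tracker frame keeps the agent exactly `delay_s` behind the participant. Setting the delay to other values would allow the temporal contingency of the mimicry to be manipulated as an experimental factor.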

The STORM (i.e., "social top-down response modulation") model of mimicry claims that mimicry is socially top-down modulated and subtly controlled by social goals (Wang and Hamilton, 2012). For example, in typically developed individuals, social context as communicated through eye contact has been found to control mimicry by modulating the connection strength from the medial prefrontal cortex, a key region of the SNN, to regions of the AON (Wang et al., 2011). Future studies are currently being planned that will examine how different types of social information and social goals are used in the control of mimicry and whether mimicry production and processing are abnormal in HFA (Wang and Hamilton, 2012). One particularly promising approach in this respect would be the TSI approach, similar to the one used by Bailenson and Yee (2005). This could involve creating VCs that copy a participant by using automatic mimicking (i.e., a computer algorithm applied to all movements) to test how people with HFA respond to and detect different social cues from the VC.

Hyperscanning (described in "Virtual characters and social presence") may be combined with experimental paradigms to characterize the neural dynamics contributing to atypical social processing in HFA (see **Figure 1** for a possible setup involving eye tracking and manual response options). For instance, in a recent fMRI hyperscanning study, dyads comprising either a participant with ASD and a control participant or two control participants engaged in a gaze- vs. target-cued joint attention task (Tanabe et al., 2012). Among other findings, the authors report a reduction of inter-individual coherence of intrinsic activity fluctuations in ASD-control as compared to control-control pairs in the right inferior frontal gyrus. The authors speculate that this finding might be related to decreased motor resonance for gaze behavior in these dyads.

#### **VIRTUAL CHARACTERS AS A SUPPORTIVE, EDUCATIONAL, AND THERAPEUTIC TOOL FOR HFA**

While the majority of adults with HFA consider access to assistive therapy options an important issue (Gawronski et al., 2011), many individuals have great difficulty accessing traditional training and intervention approaches due to intervention costs and a lack of available specialized therapists (Bekele et al., 2013). In this line, the use of VEs and VCs offers an alternative, which could increase intervention accessibility and reduce the cost of treatment (Goodwin and Goodwin, 2008).

By enabling the simulation of a social environment, VR provides opportunities to practice dynamic and real-life social interactions in a safe environment (Krämer, 2008). Furthermore, VR possesses several advantages in terms of potential application for individuals with HFA. These are listed in the following, and while they represent independent features of VR technology, it is their combined value that offers unique potential for individuals with HFA (Parsons and Cobb, 2011).

#### **CONTROL**

The level and number of various features of the environment can be directly controlled and manipulated. This enables frequent practice and/or exposure in a variety of repeatable and adjustable situations that mimic the real world (Krämer, 2008).

#### **FLEXIBILITY**

Increased control over the scenarios and environments allows for interfaces to be modified for individual user needs, for intervention approaches and reinforcement strategies, as well as scenarios to be customized (Strickland, 1997; Rizzo and Kim, 2005; Krämer, 2008).

#### **ERROR-FREE LEARNING**

Increased control also allows for competing or distracting stimuli to be removed from the training setting and the level of exposure to be carefully controlled (Strickland, 1997; Parsons and Mitchell, 2002; Rizzo and Kim, 2005; Krämer, 2008). Users' performance can be recorded and used for subsequent discussion. Thus, users can practice without fear of mistakes or rejection (Rizzo and Kim, 2005; Krämer, 2008).

#### **INDEPENDENT PRACTICE**

Self-guided exploration and independent practice in a safe test/training environment are enabled (Rizzo and Kim, 2005), where the user has active control over their participation (Parsons and Mitchell, 2002).

#### **ECOLOGICAL VALIDITY**

This offers greater potential for naturalistic performance measures with real-time performance feedback, hence increasing the potential for generalization (Parsons and Mitchell, 2002; Rizzo and Kim, 2005; Bellani et al., 2011; Wang and Reid, 2011).

#### **AFFINITY WITH COMPUTERS**

Finally, this approach can be particularly useful for persons with ASD, as they have been found to have a natural interest in and affinity with computers due to the predictable, consistent, and repeatable nature of technology (Parsons and Mitchell, 2002; Parsons et al., 2006; Putnam and Chong, 2008). This, in turn, could heighten their compliance and investment in the treatment (Krämer, 2008) and may be even made use of, for example, by including gaming factors to enhance user motivation to complete tasks (Rizzo and Kim, 2005).

Indeed, usability research has attested participants' explicit acknowledgment of the value of the virtual training for them (e.g., Parsons and Mitchell, 2002; Parsons et al., 2006). Research has also shown that individuals with ASD successfully acquire new information from VEs. In particular, they learn how to use the equipment quickly and show significant improvements in performance after training [for reviews, see Strickland (1997), Bellani et al. (2011), Parsons and Cobb (2011, 2013), and Wang and Reid (2011)]. Some authors have investigated the usefulness of VEs for training behaviors such as crossing the road (Josman et al., 2008) or reacting to a tornado warning (Self et al., 2007) and to aid learning of pretend play (Herrera et al., 2008). However, in the following considerations on virtual training and assistive therapies, we will focus on the advantages of VEs for training social and non-verbal skills.

Virtual characters have been used for individuals with HFA to provide training to teach social conventions, facilitate acquisition and exploration of social skills, and reduce stress in social situations. For instance, some social skill training scenarios involve finding a place to sit in a crowded canteen, cafe or bus, a job interview or shopping situation (Rutten et al., 2003; Parsons et al., 2006; Mitchell et al., 2007), or training collaboration skills in the context of the production of a joint narrative (Gal et al., 2009). Jarrold et al. (2013) have developed a public speaking task using IVE technology to study social attention in HFA. They used a HMD to display a virtual classroom and assess the ability of children with HFA to answer questions and simultaneously attend to nine avatar peers seated at a table. They found that children with HFA, compared to controls, looked less frequently at avatar peers in the classroom while talking. Consequently, in order to train social attention, virtual training programs have been developed (Grynszpan et al., 2009; Lahiri et al., 2011a,b). For example, Lahiri et al. (2011a,b) developed a novel paradigm, able to automatically structure and adapt interactions in real-time. The platform is called the "Virtual Interactive system with Gaze-sensitive Adaptive Response Technology" (VIGART) and is capable of monitoring a user's gaze in real-time and delivering individualized feedback based on the user's dynamic gaze patterns during their interaction with a virtual other. The experimental setup involved a DVE that presented participants with social communication tasks. While the participant viewed the avatar narrating a personal story, the participant's viewing patterns were measured in real-time by acquiring gaze data, and some behavioral viewing indices were subsequently computed. These episodes were followed by a short quiz on the content of the virtual other's personal story. 
After the participant's reply, an audio-visual feedback, which was computed based on the real-time gaze data to determine the actual time the participant spent looking at the face of the avatar during the presentation, was provided to the participant. The idea was to give indirect feedback to the participants about their viewing patterns and thereby study how that would affect the participants as the task proceeded. Preliminary data for six adolescents with ASD indicate improvement in behavioral viewing and changes in relevant eye physiological indexes of participants. Another approach was introduced by Porayska-Pomsta et al. (2012) who developed the ECHOES project. It aims to allow children with social difficulties to understand and explore social communication and interaction skills. In this learning platform, children interact with embodied virtual agents in socially realistic situations. The interaction between the child and the agents is facilitated by a combination of learning activities, designed around specific learning goals that relate to different forms of joint attention and turn-taking as well as free exploration of the environment.
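The gaze-based feedback principle described for VIGART can be illustrated with a minimal Python sketch. It is not the VIGART implementation. The functions `face_looking_time` and `feedback_level`, the rectangular face region, and the threshold values are assumptions made here purely for illustration.

```python
def face_looking_time(samples, face_box):
    """Return (seconds spent looking at the face, total recorded seconds).

    samples: list of (t, x, y) gaze samples sorted by time t in seconds.
    face_box: (left, top, right, bottom) of the avatar's face in screen pixels.
    Each interval between two samples is attributed to the earlier sample."""
    left, top, right, bottom = face_box
    on_face = 0.0
    for (t0, x, y), (t1, _, _) in zip(samples, samples[1:]):
        if left <= x <= right and top <= y <= bottom:
            on_face += t1 - t0
    total = samples[-1][0] - samples[0][0] if len(samples) > 1 else 0.0
    return on_face, total


def feedback_level(on_face, total, thresholds=(0.3, 0.6)):
    """Map the share of face-looking time to a coarse feedback category.
    The thresholds are hypothetical placeholders, not values from the source."""
    if total <= 0:
        raise ValueError("no gaze data recorded")
    share = on_face / total
    low, high = thresholds
    if share < low:
        return "low"
    if share < high:
        return "medium"
    return "high"
```

A category of this kind could then select the audio-visual feedback shown after the quiz, so that the participant is informed about their viewing pattern only indirectly.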

The environments used in such approaches have been either single-user virtual environments (SVEs) or collaborative virtual environments (CVEs). In an SVE, a single user explores the VE and responses from the environment or a virtual agent must be preprogrammed. In a CVE, more than one user may inhabit the VE at the same time (for example the patient and the therapist or trainer) and can interact with each other in real-time via avatars. Users control their avatars independently and can communicate directly with each other, even when physically located in different places, through speech, movement, and gesture in the virtual space (Schroeder, 2002; Rutten et al., 2003; Moore et al., 2005; Bellani et al., 2011; Millen et al., 2011; Parsons and Cobb, 2013). For example, the COSPATIAL project developed and evaluated collaborative technologies for engaging children with autism in social communication, involving perspective-taking, conversation, and collaboration games (Millen et al., 2011). Rutten et al. (2003) performed a usability study to investigate the potential of CVEs for individuals with ASD. They used role-play situations in either a meeting room or a social cafe. Although the emphasis lay on verbal communication between avatars, users were able to activate some basic non-verbal signals like a handshake and a smiling, neutral or frowning facial expression. The authors conclude that CVEs provide less inherent structure than SVEs and more scaffolding of learning is usually required to keep interactions flowing. Nevertheless, they offer increased flexibility for training in social skills, which do not rely on a fixed protocol, hence providing opportunities for social skill practice in a less structured, yet more naturalistic and ecologically valid manner (Rutten et al., 2003). 
The authors suggest that the most productive setup in terms of communication outcome for a CVE would be when a teacher or trainer supports the users, and a confederate plays the role of another avatar. In a similar line, Parsons et al. (2006) argue that the so-called "facilitators" in a training or intervention are an essential part of the learning process, helping the user interpret what is happening in the scene, take another's perspective and make appropriate responses accordingly. The role of the facilitator should always be adequately planned and provided for as an integral design feature of VEs for teaching of social skills (Parsons et al., 2006). Given that ASD has been associated with executive dysfunction (Hill, 2004) and that complexity in terms of task demands and sensory input information may be challenging for autistic individuals (Minshew and Goldstein, 1998; Redcay et al., 2012), it is essential to invest in optimal design research for platforms and software targeted at HFA individuals (Grynszpan et al., 2005, 2008; Wallace et al., 2010; Menzies, 2011). In this line, transformed virtual interactions might be a promising approach (Bailenson et al., 2004). Tracking non-verbal signals and rendering them via avatars allows for a strategic decoupling of communication (Bailenson et al., 2004), which would make it possible to alter the information exchanged between sender and receiver and to gradually increase or decrease the level of complexity of the social situation, while still facilitating error-free learning.

Collaborative virtual environments also bear advantages in terms of non-verbal decoding and encoding skills. In the nonverbal domain, CVEs have been used to investigate the ability to recognize emotions (Moore et al., 2005; Fabri et al., 2007) and to teach students how to express their emotions and understand those of other people (Cheng and Ye, 2010). These studies found good performance in identifying emotions and an improvement in social performance after the intervention. Krämer (2008) suggests three important requirements for non-verbal skills training: (1) a realistic setting that requires both decoding and encoding of non-verbal cues; (2) immediate non-verbal feedback from the interaction partner; and (3) feedback that is given not only with regard to demonstrative cues of the user, but also with regard to subtle aspects of their behavior, like the movement speed or quality. Indeed, VEs and virtual training partners seem to allow the development of training paradigms that fulfill these criteria.

While most training approaches using VCs and VEs have been developed for children and adolescents, Kandalaft et al. (2013) have developed the first Virtual Reality Social Cognition Training (VR-SCT), targeted at the adult HFA population. The intervention uses a CVE paradigm and a DVE setup and focuses on enhancing social skills, social cognition and social functioning. Its feasibility was tested on a group of eight HFA adults, who completed a total of 10 sessions across 5 weeks. Results showed significant increases on social-cognitive measures of theory of mind and emotion recognition, as well as in real-life social and occupational functioning.

Indeed, the literature is increasingly recognizing the potential benefits of VR in supporting the learning process, particularly related to social situations, mostly in children and adolescents with autism [for reviews, see Strickland (1997), Bellani et al. (2011), and Parsons and Cobb (2011)] and also in adults (Kandalaft et al., 2013). Nevertheless, some challenges need to be mentioned as well. Current approaches only involve small samples, and more randomized controlled trials with treatment manipulations and matched control groups need to be performed in order to show the effectiveness of a given training and whether the improvements can transfer to real-life situations. Moreover, while virtual technologies are rapidly advancing, developing training or therapeutic tools using VCs and VEs involves a great amount of time, effort, and resources, as well as a multidisciplinary dialog.

#### **CONCLUSION**

In conclusion, we have argued that the use of VCs can be of great value for experimental paradigms of social cognition, in particular for such paradigms concerned with non-verbal behavior production and perception. There are several points to note with respect to challenges inherent in the use of VCs and VEs. First, the compromise or tradeoff between ecological validity and experimental control constitutes both an advantage and a limitation of the approach. Second, individuals may have varying degrees of exposure to or experience with VCs, which may influence their expectations during observation of or interaction with them. Third, different age groups may also react differently to the stimuli and settings and may require different tasks and social situations to be implemented. Furthermore, limitations may also arise from the time and effort that need to be invested in developing virtual and neuroimaging technologies. In a similar line, these developments need to take place in the context of multidisciplinary research endeavors, which brings an interesting set of challenges of its own (i.e., fruitful collaboration and communication of experts across disciplines). On a positive note, De Gelder and Hortensius (2014) summarize that the use of VR will give the field of affective and social neuroscience valuable and important tools to grasp the full extent of the social world in a well-controlled manner. We have argued that artificial humans are a useful and valid tool to overcome common methodological problems in nonverbal behavior research and may offer an efficient solution for the development of real-time "online" social interaction studies using the TSI approach. This would potentially make it possible to "reverse engineer" social cognition (Blascovich et al., 2002), by enabling a detailed and systematic examination of the contribution of various real-time factors in human social interaction. Moreover, not only can VCs inform us about human social cognition, both typical and atypical, but they can also contribute to the development of design and methodology for creating interactive agents (Vogeley and Bente, 2010). We have also argued that VCs and environments are a valuable tool for supportive therapies and the training of social skills and non-verbal decoding in HFA, as they provide a safe, repeatable and diversifiable learning environment. In addition, the growing trend toward CVE setups becomes evident for therapeutic technologies as well. The methodology for designing interactive multimodal technology for autistic persons requires extensive research and multidisciplinary expertise including developmental psychology, visual arts, human–computer interaction, artificial intelligence, and education (Porayska-Pomsta et al., 2012). Future research can additionally investigate how skills newly acquired through such training programs are transferred to the real world and describe their impact on a neural level (Bellani et al., 2011).

#### **ACKNOWLEDGMENTS**

We would like to thank Ralf Tepest for his helpful feedback on an earlier version of this manuscript and Marius Jonas and Susanne Holocher for help with the literature search.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 June 2014; accepted: 22 September 2014; published online: 15 October 2014.*

*Citation: Georgescu AL, Kuzmanovic B, Roth D, Bente G and Vogeley K (2014) The use of virtual characters to assess and train non-verbal communication in high-functioning autism. Front. Hum. Neurosci. 8:807. doi: 10.3389/fnhum.2014.00807*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Georgescu, Kuzmanovic, Roth, Bente and Vogeley. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## An eye-tracking method to reveal the link between gazing patterns and pragmatic abilities in high functioning autism spectrum disorders

#### **Ouriel Grynszpan<sup>1</sup>\* and Jacqueline Nadel<sup>2</sup>**

<sup>1</sup> Institut des Systèmes Intelligents et de Robotique (ISIR), Université Pierre et Marie Curie, Centre National de la Recherche Scientifique, Paris, France

<sup>2</sup> Centre Emotion, Hôpital de La Salpêtrière, Paris, France

#### **Edited by:**

John J. Foxe, Albert Einstein College of Medicine, USA

#### **Reviewed by:**

Hans-Peter Frey, Albert Einstein College of Medicine, USA

Julia Irwin, Haskins Laboratories, USA

Karri Gillespie-Smith, University of West of Scotland, UK

#### **\*Correspondence:**

Ouriel Grynszpan, Institut des Systèmes Intelligents et de Robotique (ISIR), Université Pierre et Marie Curie, Centre National de la Recherche Scientifique, Pyramide – T55/65, CC 173 – 4 place Jussieu, 75005 Paris, France e-mail: ouriel.grynszpan@upmc.fr

The present study illustrates the potential advantages of an eye-tracking method for exploring the association between visual scanning of faces and inferences of mental states. Participants watched short videos involving social interactions and had to explain what they had seen. The number of cognition verbs (e.g., think, believe, know) in their answers was counted. Given the possible use of peripheral vision that could confound eye-tracking measures, we added a condition using a gaze-contingent viewing window: the entire visual display is blurred, except for an area that moves with the participant's gaze. Eleven typical adults and eleven high functioning adults with Autism Spectrum Disorders (ASD) were recruited. The condition employing the viewing window yielded strong correlations between the average duration of fixations, the ratio of cognition verbs and standard measures of social disabilities.

**Keywords: eye-tracking, gaze-contingent display, facial expressions, theory of mind, cognition verbs**

#### **INTRODUCTION**

The rise of affective computing during the last two decades has offered a wide range of possibilities for designing new tools that foster the study of social and emotional impairments. Affective computing is a stream of research that strives to empower computers with abilities to detect, process and respond to social and emotional signals emitted by human users (Picard, 2000). These advances appear to be especially relevant to Autism Spectrum Disorders (ASD) where social interactions represent the core deficit of the syndrome. The last decade has witnessed a steep increase in the number of projects devoted to innovative technologies for evaluating, assisting or training individuals with ASD (Grynszpan et al., 2014). Besides, a promising trend of research relies on the use of eye-tracking to gain a deeper understanding of the cognitive processes that characterize ASD (Boraston and Blakemore, 2007). The exploratory study reported here seeks to present an eye-tracking method for examining social processing in ASD, which takes advantage of real-time eye-based computer interaction.

Social misunderstanding in ASD has been linked to atypical visual scanning patterns (Klin et al., 2002a). Earlier eye-tracking studies used static images to examine the dysfunctional visual exploration of faces attributed to ASD (Pelphrey et al., 2002; van der Geest et al., 2002; Dalton et al., 2005; Corden et al., 2008; Fletcher-Watson et al., 2009). The development of eye-tracking technology has offered new opportunities to investigate gaze through the use of dynamic social scenes, which are closer to real life settings (Klin et al., 2002a). This stream of research has raised hopes that eye-tracking measures would help identify behavioral markers and enable refining the autistic symptomatology. Yet, studies reported inconsistent results: Klin et al.'s (2002b) initial report that adolescents and adults tended to focus more on the mouth and less on the eyes than typical controls has only been partly confirmed (Speer et al., 2007; Norbury et al., 2009). Jones et al. (2008) showed that fixation times on the eyes were reduced in toddlers with ASD, but more recent studies failed to reproduce this finding (Nakano et al., 2010; Chawarska et al., 2013). The most consistent discriminating measure between ASD and typical participants appears to be the fixation times on faces (Riby and Hancock, 2009; von Hofsten et al., 2009; Grynszpan et al., 2012a; Rice et al., 2012; Chawarska et al., 2013; Magrelli et al., 2013). Another promising line of research has been to examine the association between gaze data and standard evaluation instruments of ASD. Klin et al. (2002b) reported that the time spent looking at the mouth region correlated with social abilities in high functioning adolescents and adults. Later research suggested that this correlation depended on the characteristics of the video material, such as the number of characters displayed (Speer et al., 2007) or the space taken on the screen by human faces (Rice et al., 2012). Norbury et al. (2009) failed to reproduce this outcome, but found a negative correlation between the fixation times on the eyes and communicative competencies. The observation that verbally able adults with ASD showed a high association between fixations on facial features and social or communicative abilities has led to the hypothesis that their visual scanning strategies were connected to their linguistic proficiency (Klin et al., 2002b; Norbury et al., 2009).

Studies on linguistic competencies in ASD reveal a primary deficit in pragmatics (Tager-Flusberg, 2000). Grynszpan et al. (2008) suggest that children with ASD experience difficulties in using facial expressions as cues to resolve pragmatic ambiguities in dialogs. Their poor performance in pragmatics has been linked to theory of mind deficiencies (Happé, 1993). Individuals with ASD are reported to produce fewer mental state terms in their narratives (Baron-Cohen et al., 1986). In particular, Tager-Flusberg (2000) underlines their difficulties in using cognition verbs (e.g., think, know, guess) that specifically require theory of mind abilities.

The present article reports an exploratory study that seeks to examine the connections between gaze patterns and communicative competencies. It is part of a larger research program that aims at engineering innovative technologies for investigating gazing patterns in ASD. The method presented in this paper could be applied to various features of social communication, such as language fluency, mental state attribution or emotional expression. Here, we illustrate the potential of this method for revealing associations between gaze patterns and pragmatic competencies. More precisely, the method was used to examine correlations between visual fixations on faces and the production of cognition verbs.

A major pitfall of eye-tracking is that it merely measures focal vision and therefore cannot account for attention that is allotted to the periphery of the visual field. This distinction appears to be especially crucial in ASD, given the clinical reports of individuals with ASD having a greater tendency to rely on peripheral vision (Mottron et al., 2007; Noris et al., 2012). To address this possible confound, when using static stimuli, Spezio et al. (2007) employed the bubble paradigm (Gosselin and Schyns, 2001). In this paradigm, participants are shown series of masked images of a face where only randomly selected portions (bubbles) are visible. The areas of the face that contribute the most to emotion recognition are then computed based on performance obtained with the different masks. This procedure can hardly be directly transposed to dynamic stimuli, as the number of masks would increase exponentially with the number of video frames. We propose an alternative method where the mask is contingent on the gaze orientation: The visible area moves in real time with the focal position of the participant on the screen. This creates a gaze-contingent viewing window that reduces possible reliance on peripheral vision and should hence enhance the congruence between visual attention and focal vision as measured with eye-tracking. In a previous study using this gaze-contingent window, participants with ASD viewed realistic animations of expressive virtual humans (Grynszpan et al., 2012a). In the present study, we sought to test a context closer to real life settings, by showing videos of real life social interactions that participants had to subsequently describe. We computed correlations with cognition verbs production, first, in the normal vision condition and, second, using the gaze-contingent viewing window. Those two conditions were compared in two pilot experiments, one with typical individuals and the other with individuals having ASD. These experiments were meant to provide preliminary observations regarding the potential of the proposed method.
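The principle of a gaze-contingent viewing window can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study: the function names `window_bounds` and `composite`, and the representation of a frame as nested lists of pixel values, are assumptions made for the example. Given a sharp frame and a pre-blurred copy of it, the sketch keeps the sharp pixels inside a rectangle centered on the current gaze sample and shows the blurred pixels everywhere else.

```python
def window_bounds(gaze_x, gaze_y, win_w, win_h, frame_w, frame_h):
    """Rectangle (left, top, right, bottom) centered on the gaze point and
    clipped to the frame; right and bottom are exclusive."""
    left = min(max(0, gaze_x - win_w // 2), frame_w)
    right = min(max(left, gaze_x + win_w // 2), frame_w)
    top = min(max(0, gaze_y - win_h // 2), frame_h)
    bottom = min(max(top, gaze_y + win_h // 2), frame_h)
    return left, top, right, bottom


def composite(sharp, blurred, bounds):
    """Displayed frame: blurred everywhere, sharp inside the viewing window."""
    left, top, right, bottom = bounds
    out = [row[:] for row in blurred]          # start from the blurred copy
    for y in range(top, bottom):
        out[y][left:right] = sharp[y][left:right]
    return out


# Toy 8 x 6 frame: 1 marks a sharp pixel, 0 a blurred one (hypothetical data).
sharp = [[1] * 8 for _ in range(6)]
blurred = [[0] * 8 for _ in range(6)]
bounds = window_bounds(gaze_x=4, gaze_y=3, win_w=4, win_h=2, frame_w=8, frame_h=6)
frame = composite(sharp, blurred, bounds)
```

In a real-time setting the two functions would be called once per displayed frame with the latest gaze sample, so that the visible area follows the eyes.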

## **METHOD**

#### **PARTICIPANTS**

In experiment 1, we tested the feasibility of the method with a normative group of 11 typical adults (3 females, 8 males), ranging from 24 to 40 years with a mean age of 31.82 [*SD* = 5.65]. In this pilot study, our goal was to examine the visual strategies employed by the participants who used the gaze-contingent viewing window and not to compare typical controls with individuals having ASD; therefore, in these preliminary experiments, we did not seek to match the normative group of experiment 1 with the ASD group of experiment 2.

Experiment 2 included 11 participants (2 females, 9 males) diagnosed with autism by psychiatrists using the DSM-IV R diagnostic criteria. The Autism Diagnostic Interview-Revised (ADI-R; Lord et al., 1994) was used to confirm the diagnosis. Participants' mean score on the Raven's Progressive Matrices (Raven and Court, 1986) was 47.03 [*SD* = 10.01]. Their mean Verbal Intelligence Quotient assessed with the Wechsler Adult Intelligence Scale, 3rd edition (Wechsler, 1997), was 88.91 [*SD* = 15.33]. The group was thus considered high functioning. Their ages ranged from 17 to 31 with a mean of 21.36 [*SD* = 4.41]. This research was prospectively reviewed and approved by the regional ethics committee of Tours, France. Informed consent was obtained from each participant. In addition, parental consent was obtained for minor participants.

#### **PROCEDURE**

The same design was used in experiments 1 and 2. It was composed of an initial normal vision condition followed by a condition using the gaze-contingent viewing window. In the two conditions, participants watched a 2 min video on a 19-inch computer screen that was positioned above a remote eye-tracker (model EYE-TRAC 6 Desktop from Applied Science Laboratories). In the gaze-contingent viewing window condition, the entire graphic display was blurred (using Gaussian smoothing), except for a window centered on the focal point of the participant that moved in real time with her/his gaze (**Figures 1**, **2**). This viewing window was a rectangle with rounded corners measuring 200 × 80 pixels, which amounted to visual angles of 6° × 2°22′, therefore covering the foveal visual region in the horizontal direction. The size of this window was determined so that it could at least encompass the two eyes of any face shown in the video. Further technical details are available in Grynszpan et al. (2012a). Two videos were used, one in each condition. They were randomly counterbalanced across participants. Both were movie extracts displaying a social interaction where two protagonists were acting hypocritically toward a third one. Their behavior contradicted their speech, thus yielding a comical effect. Visual attention to facial expressions was essential for understanding these two movie extracts. In one video<sup>1</sup>, a woman and a man were greeting a neighbor and praising the dish that he prepared, yet their faces and attitudes clearly showed that they were disgusted. In the other video<sup>2</sup>, two men were lauding the dance performance of a woman, although their non-verbal behaviors showed contempt.
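The relation between the pixel size of the viewing window and its visual angle can be checked with the usual formula for an extent viewed head-on, angle = 2·atan(size / (2·distance)). The sketch below is illustrative: the viewing distance and pixel pitch are not reported here, so the distance is inferred (in pixel units) from the stated correspondence between the 200-pixel width and a 6° angle, and the helper names are hypothetical.

```python
import math


def visual_angle_deg(size, distance):
    """Visual angle in degrees of an extent `size` centered on the line of
    sight and viewed from `distance` (both in the same units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))


def distance_for_angle(size, angle_deg):
    """Viewing distance at which `size` subtends `angle_deg` degrees."""
    return size / (2 * math.tan(math.radians(angle_deg) / 2))


# Assumption: the 200-pixel window width corresponds to 6 degrees.
distance_px = distance_for_angle(200, 6.0)       # distance in pixel units
height_deg = visual_angle_deg(80, distance_px)   # angle of the 80-pixel height
print(round(height_deg, 2))
```

Under this assumption the 80-pixel height comes out at roughly 2.4°, which is of the same order as the vertical extent given for the window.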

After each movie extract, participants were asked to describe what had happened. Their answers were recorded and analyzed by two independent judges who computed the ratio of cognition verbs (e.g., think, believe, know). To calculate this ratio, the number of cognition verbs in each participant's answer was divided by the total number of verbs employed. The concordance correlation coefficient (Lin, 1989) between judges was *ρ<sub>c</sub>* = 0.91. To analyze eye-tracking data, rectangular Areas Of Interest (AOI) were defined around the faces of the protagonists. We used a software prototype developed for a previous study (Grynszpan et al., 2012a) to adjust the position of the AOI on video frames so that they would remain centered on the faces throughout the movie extracts. A proprietary algorithm of Applied Science Laboratories was used to compute fixations on the basis of clusters of Points-Of-Gaze (POG) remaining for at least 100 ms in 1° of visual angle. Two gaze variables are considered here: the total fixation time on faces, that is, the sum of the fixation durations; and the average fixation duration on faces. In experiment 2, the ADI-R Reciprocal Social Interaction sub-scores and the Childhood Autism Rating Scale (CARS; Schopler et al., 1980) were used as measures of social disability. Although the CARS was originally meant for children, it is also recommended for use in adults (Ozonoff et al., 2005). The ADI-R and the CARS are instruments designed to assist clinicians in diagnosing ASD. The ADI-R is a semi-structured interview of parents or caregivers that is used to retrieve information on the current behavior and developmental history of the individual. The CARS is a rating scale based on the observation of the individual's behavior.
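The two scoring steps described above, the ratio of cognition verbs and the agreement between judges, can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the judges' actual procedure: the verb list `COGNITION_VERBS` is a small illustrative set, answers are assumed to be already reduced to lists of verb lemmas, and the function names are hypothetical. The agreement measure follows the standard definition of Lin's concordance correlation coefficient, using population (1/n) variances and covariance.

```python
from statistics import fmean

# Illustrative lemma list only; the judges' full list is not given in the text.
COGNITION_VERBS = {"think", "believe", "know", "guess"}


def cognition_verb_ratio(verb_lemmas):
    """Number of cognition verbs divided by the total number of verbs."""
    if not verb_lemmas:
        return 0.0
    hits = sum(1 for verb in verb_lemmas if verb in COGNITION_VERBS)
    return hits / len(verb_lemmas)


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of ratings.
    Undefined (division by zero) if both raters give one identical constant."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical answer reduced to its verbs: 2 cognition verbs out of 4.
ratio = cognition_verb_ratio(["think", "go", "see", "know"])
# Hypothetical ratings from two judges who differ by a constant offset.
agreement = concordance_ccc([1, 2, 3], [2, 3, 4])
```

Unlike Pearson's correlation, the concordance coefficient penalizes a systematic offset between judges: the two hypothetical raters above are perfectly correlated, yet their concordance is well below 1.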

#### **RESULTS**

The data analyses were carried out with Statistica software.<sup>3</sup> For the two experiments, we first verified whether the introduction of the gaze-contingent viewing window altered visual exploration and narrative performance by calculating Student's *t*-tests that compared the two viewing conditions. We then computed Pearson's correlation coefficients between the ratio of cognition verbs and the gaze variables.
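For readers who wish to reproduce this kind of analysis without a statistics package, the two statistics can be written directly from their definitions. The sketch below is illustrative and is not the Statistica procedure: it assumes a paired (within-participant) *t*-test, which is what the reported 10 degrees of freedom for 11 participants suggest, and it stops at the test statistics, since *p*-values require the *t* distribution. Function names and data are hypothetical.

```python
import math
from statistics import fmean, stdev


def paired_t(a, b):
    """Paired-samples t statistic and its degrees of freedom (n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-participant values in two viewing conditions.
window_condition = [2.0, 4.0, 6.0]
normal_condition = [1.0, 2.0, 3.0]
t_value, df = paired_t(window_condition, normal_condition)
r_value = pearson_r(window_condition, normal_condition)
```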

In experiment 1, the data of the normative group showed differences in visual patterns between the two viewing conditions. The total fixation time on faces was higher in the gaze-contingent viewing window condition than in the normal vision condition [*t*(10) = 2.33, *p* = 0.042]. The average duration of fixations on faces was also higher in the gaze-contingent viewing window condition compared with the normal vision condition [*t*(10) = 5.70, *p* < 0.001]. The ratio of cognition verbs did not differ significantly between the two viewing conditions. We did not find any correlations between any of these three variables. Indeed, the ratio of cognition verbs did not correlate with the total fixation time on faces [normal vision condition: *r* = 0.13, *p* = 0.71; gaze-contingent viewing window condition: *r* = 0.40, *p* = 0.23], nor did it with the average fixation duration [normal vision condition: *r* = 0.01, *p* = 0.98; gaze-contingent viewing window condition: *r* = 0.27, *p* = 0.42].

In experiment 2, no significant differences were found between the two viewing conditions on any of the measures, that is, the ratio of cognition verbs [*t*(10) = 0.41, *p* = 0.69], the total fixation time on faces [*t*(10) = 0.76, *p* = 0.46] and the average fixation duration on faces [*t*(10) = 0.48, *p* = 0.64]. Contrasting with the normative group in experiment 1, the total fixation time on faces correlated with the ratio of cognition verbs in the gaze-contingent viewing window condition [*r* = 0.70, *p* = 0.017]. The average fixation duration on faces correlated with the ratio of cognition verbs in the normal vision condition [*r* = 0.60, *p* = 0.05] and in the gaze-contingent viewing window condition [*r* = 0.87, *p* < 0.001]. In the latter condition, the ratio of cognition verbs correlated negatively with the CARS scores [*r* = −0.66, *p* = 0.03] and the average duration of fixations on faces correlated negatively with CARS scores [*r* = −0.76, *p* = 0.006] and with the ADI-R subscores in the Reciprocal Social Interaction domain [*r* = −0.65, *p* = 0.031].
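As a consistency check on these values, a Pearson coefficient obtained from *n* participants can be converted into a *t* statistic with *n* − 2 degrees of freedom. This is the standard test of a zero correlation; that the reported *p*-values come from it is an assumption, as the text does not say so. For the total fixation time on faces in the gaze-contingent viewing window condition (*r* = 0.70, *n* = 11):

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.70\,\sqrt{9}}{\sqrt{1-0.49}}
  \approx \frac{2.10}{0.714}
  \approx 2.94, \qquad df = 9 .
```

With 9 degrees of freedom, this value lies between the two-tailed critical values for 0.02 and 0.01 (about 2.82 and 3.25), which agrees with the reported *p* = 0.017.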

#### **DISCUSSION**

Studies using eye-tracking techniques to analyze gaze exploration in ASD have mostly been concerned with the description of facial regions of interest and their link with social impairments. The goal of the present study was to pinpoint more precise components of social impairments, by exploring the potential relationship between gaze exploration of faces and inference of mental states in others. Indeed, inferring mental states is often facilitated by deciphering facial expressions. Cognition verbs were used here as indices of mental states processing. Results of experiment 2 show a positive correlation between gaze variables and cognition verbs in participants with ASD. In other words, the more they were attentive to the dynamics of the facial expressions, the more they would use cognition verbs as if the latter were directly derived from the former. This suggests that their mentalistic insight depends predominantly on a face reading strategy whereby mental states are mapped onto behavioral changes perceived on the face. Interestingly, the severity of autism, as assessed by the CARS and the ADI-R sub-scores in the Reciprocal Social Interaction domain, appeared to be related to poor gaze exploration of faces and poor production of mental state terms.

<sup>1</sup>Taken from the film "Le père Noël est une ordure", directed by Jean-Marie Poiré in 1982.

<sup>2</sup>Taken from the movie "Podium", directed by Yann Moix in 2004.

<sup>3</sup>www.statsoft.com

In experiment 1, the normative group of typical participants modulated their visual behavior to adapt to the gaze-contingent viewing window by increasing their fixation durations on faces. This is consistent with our previous findings (Grynszpan et al., 2012b). The reason why fixation durations on faces did not correlate with the ratio of cognition verbs for typical participants could conceivably be explained by the fact that their social insight is not solely dependent on facial expressions. In experiment 2, we did not observe an adaptive change in the visual exploration strategies of the ASD group. Although we refrained from comparing groups here, such an alteration in ASD of the typical modulation effect on gaze patterns induced by the viewing window has already been addressed in a previous publication (Grynszpan et al., 2012a). The lack of change in fixation times on faces for ASD participants suggests that they relied on visual strategies when using the gaze-contingent viewing window that were similar to those used in the normal viewing condition. The outcomes derived from the gaze-contingent viewing window should therefore be indicative of their visual exploration in more natural settings.

Recent studies have attempted to tackle discrepancies found in eye-tracking experiments, by devising sophisticated analyses based on the distance between the gaze position of ASD participants and typical gaze patterns (Nakano et al., 2010) or pre-defined targets in the video material (Falck-Ytter et al., 2013). To our knowledge, none of these approaches took into account the possible bias of peripheral vision. Despite anecdotal accounts that individuals with ASD tend to rely more frequently on their peripheral vision than typical peers, especially during social interactions (Bogdashina, 2003; Williams, 2007), there has been relatively little research exploring this issue (Mottron et al., 2007; Grubb et al., 2013; Milne et al., 2013). Most eye-tracking studies focus on central vision, based on the assumption that it matches visual attention. However, detecting that the focal point of gaze is on the mouth does not prevent attention from being directed to the eyes and vice-versa. Such discrepancies could be even more critical in ASD, given the accounts of peculiar use of peripheral vision. The gaze-contingent viewing window method proposed here is meant to reduce this possible confound and indeed, our results showed strong significant correlations with measures of social disability for fixations on the face without having to distinguish between the eyes and mouth regions.

Most eye-tracking studies used the total fixation time on a given AOI as the main outcome measure (Boraston and Blakemore, 2007). Although the assumption that fixation times on a given detail can be summed makes sense for static stimuli, it is less obvious for dynamic stimuli where the information conveyed by the face changes from one fixation to another. The average fixation duration seems more relevant in this regard as it yields an indication of how long a continuous expressive facial motion is attended to by the participant. In experiment 2, this latter measure was strongly associated with the production of cognition verbs in the normal vision condition and when the viewing window was used.
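The difference between the two gaze variables can be made concrete with a small Python sketch. It is illustrative only and is not the analysis software used in the study: fixations are assumed to be available as (onset, duration, x, y) tuples, the moving face AOI is represented by a hypothetical function `face_rects_at` that returns the face rectangles for a given time, and all names and data are invented for the example.

```python
def in_rect(x, y, rect):
    """True if point (x, y) lies inside rect = (left, top, right, bottom)."""
    left, top, right, bottom = rect
    return left <= x < right and top <= y < bottom


def face_fixation_metrics(fixations, face_rects_at):
    """Total fixation time and average fixation duration on faces.

    fixations: iterable of (onset_ms, duration_ms, x, y) tuples.
    face_rects_at: function mapping a time in ms to the list of face AOI
    rectangles at that time (the AOI follow the faces across frames).
    """
    on_face = [duration for onset, duration, x, y in fixations
               if any(in_rect(x, y, rect) for rect in face_rects_at(onset))]
    total_ms = sum(on_face)
    mean_ms = total_ms / len(on_face) if on_face else 0.0
    return total_ms, mean_ms


# Hypothetical data: one static face AOI and three fixations, two on the face.
fixations = [(0, 200, 10, 10), (300, 400, 50, 50), (800, 300, 12, 11)]
total_ms, mean_ms = face_fixation_metrics(fixations, lambda t: [(0, 0, 20, 20)])
```

Two participants can reach the same total with very different averages, for instance through many brief glances or a few sustained fixations, which is why the average duration carries information that the total does not.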

It should be noted that even with the limited sample of participants with ASD used in experiment 2, the method we propose revealed strong significant associations between mental state attribution abilities, social competence and visual scanning of faces. The participants in our sample were all in the higher range of IQ scores, so the present findings cannot be generalized to the entire spectrum. The gaze-contingent viewing window method could reveal alternative strategies when applied to different sub-groups across the spectrum, shedding light on their distinctive cognitive functioning.

The gaze-contingent system that we presented here seems useful to examine visual exploration of social scenes. However, the role of gaze in social communication extends beyond purely perceptual functions. It can also assume an active role in face-to-face interactions as for instance in joint attention situations, where it can be used to orient a partner's attention towards an object of interest (Emery, 2000). The developmental course of ASD is considered to be strongly associated with impairments in joint attention (Charman, 2003). Gaze-contingent displays could be employed to study joint attention. We designed a platform that displays a virtual human character whose gaze orientation is controlled by an eye-tracking device (Courgeon et al., 2014). It can thus be used to simulate joint attention in systematic and controlled conditions that approach naturalistic situations. Our future work will seek to evaluate the potential of this platform for the study of joint attention in the typical population and for individuals with ASD.

#### **ACKNOWLEDGMENTS**

This work was supported by grants from La Fondation de France with La Fondation Adrienne et Pierre Sommer (Project #2007 005874) and La Fondation Orange (Project #71/2012). Our thoughts go especially to Noëlle Carbonell, one of the project's initiators, who passed away before the project was completed. We are very thankful to Dr. Jacques Constant, Florence Le Barillier and all the staff and students of "La Maison pour les personnes autistes" in Chartres. We thank Jérôme Simonin, Pauline Bailleul and Daniel Gepner for their assistance in data collection and technical development.

#### **REFERENCES**


Picard, R. W. (2000). *Affective Computing.* Cambridge: MIT Press.


**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 June 2014; accepted: 22 December 2014; published online: 14 January 2015*.

*Citation: Grynszpan O and Nadel J (2015) An eye-tracking method to reveal the link between gazing patterns and pragmatic abilities in high functioning autism spectrum disorders. Front. Hum. Neurosci. 8:1067. doi: 10.3389/fnhum.2014.01067*

*This article was submitted to the journal Frontiers in Human Neuroscience*. *Copyright © 2015 Grynszpan and Nadel. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms*.

## How and why affective and reactive virtual agents will bring new insights on social cognitive disorders in schizophrenia? An illustration with a virtual card game paradigm

#### **Ali Oker<sup>1</sup>\*, Elise Prigent<sup>2</sup>, Matthieu Courgeon<sup>3</sup>, Victoria Eyharabide<sup>4</sup>, Mathieu Urbach<sup>1,5</sup>, Nadine Bazin<sup>1,5</sup>, Michel-Ange Amorim<sup>2</sup>, Christine Passerieux<sup>1,5</sup>, Jean-Claude Martin<sup>6</sup> and Eric Brunet-Gouet<sup>1,5</sup>**

<sup>1</sup> HANDIReSP EA4047, Université de Versailles Saint-Quentin, Versailles, France

<sup>2</sup> CIAMS EA4532, UFR STAPS, Université Paris-Sud, Orsay, France

<sup>3</sup> UMR6285, LabSTICC, Université Bretagne-Sud, Lorient, France

<sup>4</sup> STIH EA4509, Université Paris-Sorbonne, Paris, France

<sup>5</sup> Pôle de Psychiatrie, Centre Hospitalier de Versailles, Versailles, France

<sup>6</sup> LIMSI UPR3251, Université Paris-Sud, Orsay, France

#### **Edited by:**

Leonhard Schilbach, University Hospital Cologne, Germany

#### **Reviewed by:**

Bert Timmermans, University of Aberdeen, UK

Mel Slater, ICREA-University of Barcelona, Spain

**\*Correspondence:**

Ali Oker, Versailles, France e-mail: ali.oker@uvsq.fr

In recent decades, many studies have shown that schizophrenia is associated with severe social cognitive impairments affecting key components, such as the recognition of emotions, theory of mind, attributional style, and metacognition. Most studies investigated each construct separately, precluding analysis of the interactive and immersive nature of real-life situations. Specialized batteries of tests are under investigation to assess social cognition, which is now thought of as a link between neurocognitive disorders and impaired functioning. However, this link accounts for a limited part of the variance of real-life functioning. To fill this gap, advances in virtual reality and affective computing have made it possible to carry out experimental investigations of naturalistic social cognition, in controlled conditions, with good reproducibility. This approach is illustrated with the description of a new paradigm based on an original virtual card game in which subjects interpret emotional displays from a female virtual agent, and decipher her helping intentions. Independent variables concerning emotional expression in terms of valence and intensity were manipulated. We show how several useful dependent variables, ranging from classic experimental psychology data to metacognition or subjective experience records, may be extracted from a single experiment. Methodological issues about the immersion into a simulated intersubjective situation are considered. The example of this new flexible experimental setting, with regard to the many constructs recognized in social neurosciences, constitutes a rationale for focusing on this potential intermediate link between standardized tests and real-life functioning, and also for using it as an innovative medium for cognitive remediation.

**Keywords: virtual agents, schizophrenia, social cognition, theory of mind, facial expression**

#### **INTRODUCTION**

There is a long history of reductionism in cognitive science, with the progressive dissection of models of mental processes to explain human behavior. The models are tested with experimental paradigms that exclude confounding factors and focus only on the constructs of interest. This approach has demonstrated the impairment of several cognitive functions in patients suffering from schizophrenia: cognitive disorders affecting attention, memory, and executive functions are among the most commonly reported in patients with this pathological condition (Evans et al., 1997; Cirillo and Seidman, 2003). There have also been clinically relevant attempts to present schizophrenia from the standpoint of a social cognitive disorder (Brüne, 2005; Penn et al., 2008). This point of view generated extensive experimental research and, importantly, influenced clinical thinking in several psychiatric centers and shaped new means of psychological treatment. Nevertheless, many elements are still missing to support a social cognitive theory of psychiatric disorders, and, as we will discuss in the following, methodological advances are hoped for to meet this challenge.

In the first part of this article, we will give a quick overview of the impaired processes that have been identified, from the most elementary, such as social perception, emotional perception, and motor resonance, to the most complex, such as mentalizing and empathizing. One practical consequence of this view of schizophrenia has been the continual development of batteries of neuropsychological tests targeting non-social and social aspects of cognition, for use by mental healthcare professionals [for a review see Pinkham et al. (2013)]. These assessments, based on the separate measurements of neuropsychological dimensions, have proved relevant for the prediction of functional outcome (Fett et al., 2011), although they leave a significant part of the variance unexplained.

In the second part of this work, we advocate the added value of considering "naturalistic" social cognition as a domain obscured by the classical approach. We will describe how virtual reality settings may prove useful for investigating social interaction, especially through the use of virtual agents. We will discuss the relationships between social cognitive constructs and those arising from research on immersion phenomena.

In the third part, we will illustrate these ideas with the proposal of a new paradigm based on a virtual affective agent and show how it may be used with patients. We will argue that such experimental settings will prove useful for assessing the intermediate link between cognition and real-life functioning and give some directions for improving the subjective experience and motivation.

#### **PART I**

#### **EXAMINATION OF IMPAIRMENTS OF SOCIAL COGNITION IN SCHIZOPHRENIA BY A REDUCTIONIST APPROACH**

Social cognition can be considered as an entangled set of information processes relating to the perception of others, the representation of oneself, and interpersonal knowledge in a given social context (Beer and Ochsner, 2006), which allows an individual to adapt his/her behavior to the social environment. Disorders of social behavior are prominent in schizophrenia as noted by Brüne: "the most outstanding characteristic of schizophrenia is the inapt, often bizarre behavior of affected individuals (. . .) It is almost always the deviant social behavior in schizophrenia that renders patients 'abnormal'" (2005, p. 135). As social cognitive disorders may account for these deviances, Penn and coworkers (2008) suggested that several primary domains [theory of mind (ToM), perception of emotion in facial expressions, attributional style] or features (i.e., metacognition) should be considered when assessing social cognition in schizophrenia. Each domain relates to some specific constructs, which have been the object of thorough theorization and experiments.

First, the ability to take the mental states (intentions, beliefs, etc.) of other people into account has been extensively studied by developmental psychologists since the 1990s (Astington and Gopnik, 1991). ToM has been defined as the ability to represent human mental states and to make inferences about other people's intentions, beliefs and desires, by acknowledging that others may have a mental state different from our own. ToM has been explored with verbal tasks based on stories (Corcoran and Frith, 2003; Janssen et al., 2003; Van der Cruyssen et al., 2009) and with non-verbal tasks involving sequences of pictures (Sarfati et al., 1999; Brunet et al., 2003; Langdon et al., 2006; Vistoli et al., 2011) or a mixture of stories and pictures [for a review, see Bazin et al. (2007) and Brüne (2005)]. Various factors may modulate the ToM performances of schizophrenia patients and one important debate for designing cognitive remediation techniques concerns the automatic or controlled nature of these cognitive skills (Cohen and German, 2010). For instance, verbal instructions may contribute to the controlled component of ToM (Back and Apperly, 2010), an important finding given the widespread use of verbal instructions to remediate social skill deficits.

Second, it is also widely accepted that patients with schizophrenia and/or social developmental disorders (autism, Asperger's syndrome) display poor discrimination of face identity, emotion, and age (Chambon and Baudouin, 2009). The impairment of emotion recognition in these patients is well documented and has been the object of a number of reviews (Edwards et al., 2002; Kohler et al., 2010). One of the principal difficulties observed relates to fear (Edwards et al., 2001) and disgust (Kohler et al., 2003). According to Kohler et al. (2003), neutral or non-emotional expressions may be wrongly interpreted as emotional cues with a negative bias. An intriguing result was reported by Davis and Gibson (2000), who found that schizophrenic patients had emotion recognition deficits with posed emotional pictures although this pattern of deficit was not replicated with genuine emotion pictures. Besides the complexity of the emotion recognition deficit, evidence of impoverished perceptual strategies is now being investigated in schizophrenia. There is a growing body of evidence to suggest that schizophrenic patients have shortened or abnormal oculomotor scan paths when confronted with faces expressing emotions (Loughland et al., 2002; Butler et al., 2009), a finding that may be extended to non-social scenes and reliably discriminate patients from healthy subjects (Benson et al., 2012, p. 722). These findings highlight the importance of intact perceptual strategies for an understanding of the affective states of other people. Other studies have examined the importance of contextual or situational cues for the perception of emotional and mental states in others. For instance, it has been shown that contextual information (e.g., background scenes or body language) can affect the recognition of facial emotions (Aviezer et al., 2009). It has been suggested that the inefficient integration of social contextual information may limit the ability to infer mental and emotional states from facial expressions (Green et al., 2007).

Third, schizophrenic patients have also been reported to have another type of abnormality, relating to attributional style. According to Penn et al. (2008), attributional style concerns the explanations people find for positive and negative events (p. 409) – the way they attribute responsibility, error and merit to others or to themselves. As an argument for the discriminant validity of the attributional style construct, a recent work from Mancuso et al. (2011) showed with a population of 85 schizophrenic outpatients that attributional style measures loaded on a different factor than social perception and mentalizing, and that this factor correlated selectively with positive symptoms, depression/anxiety, and agitation.

Finally, metacognition may be conceptualized in its broadest sense as an individual's knowledge/representations about his or her own cognitive processes. As explained by Passerieux et al. (2012), "When we answer a question or resolve a daily-life situation, the response is a cognitive product. The estimation of the quality of this response (confidence in the answer, tendency to use this answer) is of metacognitive nature." It is now acknowledged that schizophrenic patients present impairments of this skill, to various degrees, which can be assessed by recording ratings of the degree of confidence immediately after the performance of a cognitive task. According to Koren et al. (2006), the extent of the patients' knowledge of the disease and their involvement in their own treatment are positively correlated with metacognitive skills level. Interestingly, they emphasized a subdivision of the construct into a monitoring component and a control component, respectively, referring to the self-assessment of the performances of a cognitive process, and to the effective use of the results of that process to guide the behavior (for instance, when uncertain, one might not answer in a free-response paradigm).

This short overview of the constructs of major interest for schizophrenia research brings to light the complex, multifaceted nature of these processes, and consequently of related disorders, undermining the notion of a monodimensional deficit.

#### **KALEIDOSCOPIC APPROACH OF SOCIAL COGNITION WITH BATTERIES OF TESTS**

Following the enduring success of well-known neuropsychological assessment batteries, several researchers advocated the use of selected sets of social cognitive tests in order to draw a panorama of a patient's skills. It would obviously be beyond the scope of the present article to discuss the many attempts to validate a specific set of tests. Of interest is the social cognition psychometric evaluation (SCOPE) initiative from the NIMH, a group of experts who advocated the use of a social cognitive battery based on separate constructs such as attributional style, emotion processing, ToM, and, eventually, trustworthiness. Nevertheless, they acknowledged that there was no consensus about the constructs to be selected, and a great overlap among both the measures and the domains (Pinkham et al., 2013). It could be argued that this overlap constitutes a key feature of social cognitive processes, which are, by definition, interconnected and mutually informative (see below for discussion of this point). As noted by Keysers and Gazzola (2007), "much of the debate in social cognition might result from choosing tasks that isolate the processes of just one route in the laboratory. However, it is essential to start designing tasks that reflect the complexity of social life to test how the social brain forms an integrated whole."

There is considerable experimental evidence to support the content validity of the aforementioned constructs, but their additional contribution to the schizophrenic phenotype remains unclear. First, the relationship of social cognitive disorders with the patients' complaints is not straightforward. While some authors found extremely limited correlations between ToM deficits and quality of life items (Urbach et al., 2013), other authors found that this potential link was in statistical interaction with symptom severity (Maat et al., 2012). Nevertheless, a much stronger line of evidence has emerged in recent years focusing specifically on real-life functioning, outcome, or psychological handicap. Fett et al. (2011) meta-analyzed 42 schizophrenia studies and were able to demonstrate a significant correlation of social cognitive variables with community functioning, which explained around 16% of the latter variable's variance. Interestingly, these authors concurred with other experts like those of the VALERO project (Leifker et al., 2009), and stressed the limited consensus about functional outcome constructs and measures. In addition, they noted that neurocognition and social cognition leave a large part of the variance in outcome unexplained.
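As a reading aid, the proportion of variance quoted above can be related to a correlation coefficient. The relation below is a simplification introduced here for illustration: it assumes a single bivariate linear association between a social cognitive score and community functioning, which is not necessarily how the pooled meta-analytic estimates were obtained.

$$
R^2 = r^2 \approx 0.16 \quad\Longrightarrow\quad |r| \approx \sqrt{0.16} = 0.40, \qquad 1 - R^2 \approx 0.84 .
$$

Under that assumption, a correlation of about 0.4 would still leave roughly 84% of the variance in community functioning unaccounted for, which is the gap that the intermediate constructs discussed in this article are meant to address.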

While causal pathways leading to reduced functioning are multifactorial, we argue that some intermediate psychological constructs merit consideration to explain the variance of schizophrenic patients' outcome. To solve the puzzle, one has to hypothesize that between the impaired processes that are under scrutiny and negative consequences in real-life situations, some pieces are still missing. First, we suggest that measurements of social cognition could be obtained by testing the domains of impairments cited above in an integrative and ecological manner, rather than separately. However, it is necessary to keep in mind the technical limitations of real-life ecological assessments, as pointed out by Rus-Calafel et al.: "observing and practicing the patient's social skills in natural social interaction/environment could prove useful [. . .] however [. . .] this can be time-costly for the clinician and likely highly intimidating for the patient" (p. 82, 2013). We may add that poor replicability and complex standardization of these procedures will remain a fundamental limitation in obtaining valid dependent variables. Second, recent theoretical accounts have brought to light the specificity of interactive situations in terms of cognitive processes. In the following, we refer to some of these accounts and argue in favor of their relevance to provide ecologically valid insights on some of the putative intermediate links.

#### **PART II**

#### **ECOLOGICAL POINTS OF VIEW ON SOCIAL COGNITION**

In the late 1980s, Ickes and coworkers suggested the use of the term *naturalistic social cognition* for studies in which the experimental conditions were socially complex situations (Ickes et al., 1986; Ickes and Tooke, 1988). Zaki and Ochsner (2009) recently argued for this approach and claimed that, in neuroimaging studies of mental state attribution, researchers ask subjects to make judgments about targets presented as simple stimuli (pictures, cartoons, or texts), but these judgments are generally too easy, because the targets appear to be fictional. They suggested that neuroimaging studies and experimental psychology studies should generate a greater variance in terms of social performance, to prevent floor effects, and should be more ecological and dynamic, more closely resembling real life.

Besides the importance of naturalistic stimuli for the investigation of social cognition, the need for overcoming the spectatorial gap and for adopting a second-person approach has already been proposed by Schilbach et al. (2013). According to these authors, the second-person approach is based on interaction and emotional engagement between people, far beyond mere observation (p. 395). Thus, the second-person approach implies tasks that require social interaction and emotional engagement. Reciprocity also seems to be a key component of social interaction, and a growing literature highlights the neurobiological correlates of reciprocity in social interaction. Social interactions are, according to Schilbach (2014), characterized by "reciprocal relations with the perception of socially relevant information prompting (re-) actions, which are themselves reacted to" (p. 2). The second-person approach makes it possible to investigate not only the way a person gathers information about another person but also how one's knowledge of the other may reside in the interaction dynamics between the agents (Froese et al., 2014).

A successful example of this approach has been realized by Wilms et al. (2010). The authors suggest that the use of gaze-contingent stimuli can create a truly interactive paradigm for social cognitive and affective neuroscience. In this work, participants' own gaze data were used to animate a virtual character with an interactive eye-tracking set-up coupled with an MR scanner. Participants experienced the effects of their own gaze on the virtual agent (for instance, in a joint attention condition, the avatar may look at the same object that the subject is looking at). This set-up can also be used to investigate the neural correlates of being the "initiator" or the "responder" in a social interaction. According to the authors, the fact that the agent becomes responsive to the participant's gaze allows the whole set-up to engage and maintain an "online" social interaction.
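The gaze-contingent logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the set-up of Wilms et al. (2010): the `Region` class, the rectangular screen regions, the pixel coordinates and the function names are all hypothetical, and a real system would work on fixations detected from a stream of eye-tracker samples rather than on a single gaze point.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    """Axis-aligned screen region around an object (hypothetical layout, in pixels)."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, point: Tuple[float, float]) -> bool:
        px, py = point
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def fixated_object(gaze: Tuple[float, float],
                   regions: Sequence[Region]) -> Optional[str]:
    """Return the name of the first region containing the gaze point, or None."""
    for region in regions:
        if region.contains(gaze):
            return region.name
    return None


def agent_gaze_target(gaze: Tuple[float, float],
                      regions: Sequence[Region],
                      joint_attention: bool) -> Optional[str]:
    """Choose where the virtual agent looks in response to one gaze point.

    joint_attention=True: the agent follows the participant's gaze.
    joint_attention=False: the agent looks at another object instead.
    Returns None when the participant is not looking at any object.
    """
    looked_at = fixated_object(gaze, regions)
    if looked_at is None:
        return None
    if joint_attention:
        return looked_at
    others = [r.name for r in regions if r.name != looked_at]
    return others[0] if others else None


# Two hypothetical objects displayed on either side of the agent.
regions = [Region("left_box", 100, 300, 200, 200),
           Region("right_box", 700, 300, 200, 200)]
print(agent_gaze_target((150, 350), regions, joint_attention=True))
print(agent_gaze_target((150, 350), regions, joint_attention=False))
```

Keeping the condition as a single boolean parameter reflects the experimental logic: the participant's behavior is identical across conditions, and only the agent's contingent response is manipulated.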

Given that emotion recognition impairments and gaze abnormalities in social interaction are intertwined in psychiatric disorders such as schizophrenia and autism, it should be acknowledged that a second-person approach with naturalistic stimuli in social cognition settings provides specific information (Timmermans and Schilbach, 2014) at the junction of real-life and picture-based emotion recognition assessments. For instance, the experiment proposed by Wilms et al. (2010) is highly relevant to the case of children with autism. As a matter of fact, children with autism show less pronounced impairments in their ability to follow someone's gaze shifts than in their drive to make someone look at something (Mundy and Newell, 2007).

This is why, according to Zaki and Ochsner, the use of naturalistic social cognition may be critical to understand illnesses involving social cognition deficits. Having examined the particular case of autism spectrum disorders, they concluded, "by moving toward paradigms that capture the complexity of the real social world, and assessing perceivers' abilities to make accurate inferences about targets, neuroimaging of social cognition can approach more ecologically valid theories about how minds understand each other" (p. 9, 2009). To build upon these theoretical accounts, we identify, in the following, several characteristics of a naturalistic approach.

*First- and second-person perspectives are to be allowed and even privileged.* The mere distant observation of a person as in classical ToM paradigms (pictures, comic-strips, stories, etc.) might disengage shared representations (because they are useless to understand the situation), or neuro-functionally speaking the mirror neuron system, as well as all the processes that allow the immersion into a specific social situation.

*The subject acts upon the stimuli interactively and is not limited to a passive stance*. Methodologically, it is relevant to draw a distinction between a true interaction and a fake interaction. In the former, the reacting subject significantly modifies subsequent stimuli, as in real life. Adaptive models would be required, for instance based on personality models (Faur et al., 2013) or complex cognitive architectures (Hemion, 2013). In the latter approach, the subject is presented with controlled stimuli that let him believe and act as if he could modify the course of the social events. Graph-structured scenarios with only marginal adaptive capabilities may be used. While advances in technologies allow both settings, constraints on experimental designs and subsequent statistical analyses often encourage the second solution.

*Interaction with complex social agents leads to a certain degree of unpredictability*. It is conceivable that the social brain is shaped to manage unpredictability while contributing to making sense of multiple/successive social cues. The experiment from Yoshida et al. (2010) brings intriguing evidence of paracingulate cortex processing uncertainty about another agent's intentional strategy, i.e., the order of the ToM model used by this agent. Here again, the reality of the interaction is not crucial to elicit the illusion of being immersed in an unpredictable situation, and one might hypothesize that experimental situations with a hidden level of control on the stimuli are sufficient to elicit these neural processes.

*Multimodal stimuli are present most of the time* with a convergence of several channels such as visual perception of social cues (postures, gestures, emotional displays, etc.), auditory perception (verbal utterances, prosody), and eventually haptic components. Notably, the interpretation of the information coming from each channel requires contextualization, integration, and inferences and might involve mental state attribution [this is discussed in Brunet-Gouet et al. (2010)]. As a result, no modality dominates the others and new information has the potential to revise dramatically the current representation. For instance, an incongruous facial emotion at the end of a lengthy discourse may lead to a complete revision of our understanding of the verbal content although the stimulus is quantitatively minimal and qualitatively poorly informative.

*All the inferential processes without exclusion are involved during a naturalistic interaction*, i.e., emotional states inference or empathy, mental state attribution (beliefs, intentions, knowledge, etc.), moral judgment. In other words, naturalistic paradigms should not be structured in such a way that disengagement of one process appears relevant to achieve the task.

*Contextualization concerns not only multimodal information but also the succession of events* that make sense when associated within a sequence. While many social cognition paradigms depict an agent performing actions following a specific schema (intentional comic-strips, unexpected transfer tasks, video-based mental state inferences, etc.), many of them do not capitalize sufficiently on a single character in order to promote a real interest in this agent. To enter into a true empathetic relationship, it might be relevant to allow the subject to activate a full representation of the other person, including personality traits.
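To make the notion of a graph-structured scenario with marginal adaptive capabilities more concrete, the "fake interaction" setting mentioned above can be sketched as a small directed graph. This is a hypothetical sketch and not the paradigm described in this article: the node identifiers, stimuli and response labels are invented for illustration. The first node leads to the same continuation whatever the participant answers (the stimulus sequence stays under experimental control), while the second node branches on the response.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Node:
    """One step of the scenario: a stimulus and its outgoing edges."""
    stimulus: str
    # Maps a participant response to the identifier of the next node.
    transitions: Dict[str, str] = field(default_factory=dict)


# Hypothetical scenario: "greet" ignores the response, "deal" adapts to it.
SCENARIO: Dict[str, Node] = {
    "greet": Node("agent smiles", {"accept": "deal", "refuse": "deal"}),
    "deal": Node("agent shows a card",
                 {"accept": "end_positive", "refuse": "end_negative"}),
    "end_positive": Node("agent expresses joy"),
    "end_negative": Node("agent expresses disappointment"),
}


def run_scenario(scenario: Dict[str, Node], start: str,
                 responses: Iterable[str]) -> List[str]:
    """Walk the graph and return the stimuli presented, in order.

    Stops at a terminal node (no transitions) or when responses run out.
    """
    shown: List[str] = []
    current = start
    pending = iter(responses)
    while True:
        node = scenario[current]
        shown.append(node.stimulus)
        if not node.transitions:
            return shown
        response = next(pending, None)
        if response is None:
            return shown
        if response not in node.transitions:
            raise ValueError(f"unexpected response {response!r} at node {current!r}")
        current = node.transitions[response]


print(run_scenario(SCENARIO, "greet", ["refuse", "accept"]))
print(run_scenario(SCENARIO, "greet", ["accept", "refuse"]))
```

Because every path through such a graph can be enumerated in advance, each participant sees one of a small, known set of stimulus sequences, which is what makes this option easier to reconcile with factorial designs and standard statistical analyses.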

#### **NEW METHODS ARE PROVIDED BY RESEARCH ON VIRTUAL REALITY**

Investigating social cognition in a more ecological manner will require multimodal, interactive experimental stimuli that respect the criteria cited above as closely as possible. Advances in computer technology over the last 20 years have made it possible to immerse people in an artificial world and to allow them to interact with the simulated environment. *Virtual reality* is a term commonly used in reference to these technologies that often require realistic, multimodal feedback. Technically, virtual reality requires real-time 3D computer graphics, sound, and other sensory input, and physical models of various degrees of sophistication, to generate an environment with which the user interacts (Gregg and Tarrier, 2007). In some cases, haptic, or even olfactory stimuli are proposed [for a review of virtual environments and reality and their application in psychopathology, see Baus and Bouchard (2014), presented in the current research topic]. Recent technological advances offer exciting opportunities to carry out studies of naturalistic cognition, through the use of multimodal stimuli and the extension of the variables measured to physiological and behavioral data. Moreover, the fields of virtual agent simulation and affective computing have matured and now provide credible solutions for interactivity. For these reasons, Kim et al. (2008) argued that virtual reality techniques can overcome the limitations of traditional assessments by composing diverse social situations with various backgrounds and by providing interactive dynamic simulation in social and emotional situations, as opposed to passive observational setups.

Despite the absence of consensus on objectives and methods, psychopathology has also contributed to this progress, through the development of new uses of computer graphics, simulation, interface technologies, and social agent modeling, for purposes such as clinical assessment, intervention, and training. First, new treatments for psychopathological conditions based on virtual reality may prove beneficial (Riva, 2005). According to Rizzo et al. (2012), clinical virtual reality has significant potential to enhance clinical practice and research in three key areas: exposure therapy, neuropsychological assessment, and clinical training with virtual patient agents. Virtual reality settings have been used effectively for the treatment of post-traumatic stress disorder (Rothbaum et al., 2001; Difede and Hoffmann, 2002) and social anxiety (Herbelin et al., 2002; Klinger et al., 2005). These therapies make use of the immersive nature of virtual environments and their ability to trigger emotions corresponding to real-life situations. However, unlike real exposure, the patients are protected against harmful consequences, helping them to accept being confronted with anxiogenic stimulation.

Second, cognitive rehabilitation programs using virtual reality have proved effective for the treatment of patients with unilateral spatial neglect (Kim et al., 2011) and brain injuries (Christiansen et al., 1998). Such programs may include cognitive remediation therapies, during which patients are trained on selected cognitive tasks of increasing processing demand, in order to increase cognitive efficiency and behavioral performances [for a review, see Cicerone et al. (2011)]. Based on simple games, computerized cognitive remediation programs are now frequently offered to patients with psychosis and have been shown to improve cognitive and psychosocial functioning (McGurk et al., 2007b; Wykes et al., 2011).

However, the use of virtual reality in the treatment of schizophrenia has been reported only rarely (Ku et al., 2003; Costa and Carvalho, 2004). Interestingly, the patients participating in the study by Costa and Carvalho agreed to work with computers coupled with immersive glasses and demonstrated a high level of interest in the proposed cognitive tasks. The existence of a specific motivational effect of virtual reality environments may be of clinical importance as motivation is a key factor for a successful remediation therapy (Medalia and Richardson, 2005). Among successful contributions, virtual environments have been used in attempts to improve the social skills (Rus-Calafel et al., 2013) or conversational skills and assertiveness (Park et al., 2011) of schizophrenia patients.

Of importance for the present article, virtual reality can also be used for the evaluation of cognitive impairments in schizophrenic patients. For instance, Ku et al. (2003) used a virtual reality system for the multimodal assessment of cognitive ability in schizophrenia patients. In this work, the task consisted of an adaptation of the Wisconsin card sorting test (WCST) to virtual reality. The program, displayed through a head-mounted display, presented a pyramid with doors presented as in the WCST, from which schizophrenia patients had to figure out how to get out. The main idea here was to use all the benefits of a head-mounted display for visual and auditory stimuli as well as haptic feedback (vibration signal for wrong answer). The results showed a correlation between the virtual task and the classical WCST; however, it provided modest information on how patients behaved according to the specific modalities of stimuli. In the same vein, the results obtained by Freeman et al. (2008) suggest that virtual reality environments could be used to assess paranoid thinking in the general population. In this investigation with 200 healthy subjects, the task consisted of a 4 min journey in a virtual London underground displayed in a head-mounted display. Assessments of paranoid thinking and of other measures through questionnaires showed that a large minority of participants had paranoid thoughts and that this finding was correlated with anxiety, worry, perceptual anomalies, and cognitive inflexibility.

Methodologically speaking, computerized evaluations allow precise measures of behavior with variables such as performance rate and reaction times, which are classically considered in experimental psychology. A recent study with two virtual agents engaged in a conversation made it possible to investigate the gaze patterns of schizophrenic subjects and demonstrated an increased attention to the between-agent space when the agents spoke, putatively a form of avoidance of the other's gaze (Han et al., 2014). While these authors recorded the patients' feelings principally in terms of valence during the task, no specific attention was paid to more complex feelings or interpretations. In the following, we will point out that naturalistic experiments based on virtual reality offer diversified means to assess the patients' strategies to process the task, and to investigate their subjective experiences. Subjective judgments may be recorded both during and after the experiment with specifically designed questionnaires and provide an insight into metacognition and interpretation of controlled social stimuli. Considering the replicability requirements of experimental sciences, we advocate here for the use of such measures to extend our understanding of the patients' particularities within interpersonal interaction.

Lastly, the choice of realistic simulation techniques to generate social situations and interpersonal interaction raises the issue of the quality of the empathetic relationship that is established with the virtual agent. To a certain extent, the occurrence of empathy may be considered as an extension to the social domain of the broadly speaking "immersion phenomenon" that is the ultimate ambition of virtual environments. *Immersion* is classically defined as the technological power of the system to "deliver an illusion of reality to the senses of a human participant" (Slater and Wilbur, 1997). Immersion could be considered as a precursor leading to a psychological state or even a state of consciousness characterized by perceiving oneself to be enveloped by, included in, and interacting with an environment that provides a continuous stream of stimuli and experiences. As discussed by Witmer and Singer (1998), the conjunction of immersion with a higher *involvement* of the subject (i.e., the "energy" and the attentional resources allocated to the virtual environment) results in the subjective experience of *presence* when the subject privileges the virtual environment over reality. Interestingly, the latter construct was investigated with several anxiety-eliciting virtual situations (TAVE Software) and it was shown that an individual's anxiety state as well as his/her personality characteristics like introversion influenced positively the sense of presence (Alsina-Jurnet and Gutiérrez-Maldonado, 2010). Moreover, factor analyses of presence self-reports demonstrated the multi-dimensional nature of this construct, distinguishing spatial presence, involvement, realness, and predictability and interaction (Schubert et al., 2001).

Interestingly, the concept of *social presence* was coined to account for the salience of the relationship with an agent, for instance, when using communication tools such as videoconference or avatar-based virtual worlds (e.g., Linden Lab's Second Life) allowing interpersonal interactions, and even extended to the concept of *co-presence*, i.e., the sense of being together (Tugba Bulua, 2012). These different constructs appear to contribute to the subjects' satisfactory experience, which in turn could improve their motivation. Here, we hypothesize that the building of an empathetic relationship with a virtual agent would constitute a crucial correlate of the experience of his/her presence. As a consequence for clinical research focusing on the use of virtual agents as a particularly motivating technique, studies should combine assessment of empathy toward the virtual agents, of the subjects' trait empathetic skills and, more generally, of their tendency to immerse. Last, we advocate a better integration of the constructs arising from knowledge on the psychological dispositions related to virtual reality usages, and of those from social neurosciences.

#### **PART III**

#### **EXPERIMENTAL ILLUSTRATION OF INVESTIGATING SCHIZOPHRENIA WITH A VIRTUAL AGENT: THE VIRTUAL CARD GAME PARADIGM**

From the many and often heteroclite accounts described above emerges a relevant new empirical approach for exploring the interpersonal disturbances of schizophrenia patients in a more ecological and naturalistic way. This approach focuses on real-time social interaction, requiring the simultaneous integration of all social cognition processes: the attribution of intentions to others, first- and second-order understanding of mental states, emotion recognition, empathy, metacognition, attributional style, and the contextualization of a given social situation. It would not be reasonable to claim that this approach can test all aspects of social cognition simultaneously. However, this approach makes it possible to study some of these aspects in interaction, and represents a more ecological experimental setting than the batteries of social cognitive tests currently used to assess the impairment of social cognition in schizophrenic patients. In the following, we will first show how precise control of stimuli through computer code specification allows a simple paradigm to investigate one or several subcomponents thanks to very simple manipulations of the parameters (here, non-verbal communication through emotional displays with varied valence and intensity). Second, we will illustrate the subjective aspects of an immersion into an empathetic relationship through the examination of patients' and healthy subjects' post-experiment reports.

#### **DESCRIPTION OF THE VIRTUAL CARD GAME**

The experimental situation was derived from a trust game (i.e., games during which one evaluates another's trustworthiness and intentions in order to produce monetary arbitrages; Berg et al., 1995) from which psycho-economic judgments were removed. The participants were presented with a game in which they met a female virtual agent and had to infer from her facial expression displays which card to choose in order to match the color of another card (see **Figure 1**; Supplementary Material). Of methodological importance, the task is self-explanatory: the agent provides the instructions to be followed, and, at the middle of the game, additional information is provided to help the subject focus on emotional displays. However, the participants are not informed before the experiment of the virtual agent's communicative intentions. Consequently, the attribution of a cooperative intention is left entirely to the subjects' appreciation.

Technically speaking, the task requires only a multimedia personal computer running the multimodal affective and reactive characters (MARC) framework (Courgeon et al., 2008; Courgeon and Martin, 2009) to animate a realistic 3D character named Mary. MARC is a multicharacter animation platform including real-time body and facial animations based on the computational modeling of emotions (action unit models) inspired by the different approaches to emotion in psychology. This platform provides several male and female interactive virtual characters, all of which can speak and simultaneously display subtle facial expressions. In this experiment, we used the female model called Mary, because this model was well liked by participants in preliminary tests of the set-up and had a set of validated emotional expressions (see Supplementary Material).

A state machine is used to manage the interaction with subjects following a predetermined scenario (**Figure 2**). A scenario may be more or less complex and might be seen as an explicit hardcoding of a behavior to simulate a specific social situation. The state machine was responsible for determining the agent's behavior, i.e., facial expressions to be displayed and verbal statements to be given (if any), as a function of the input provided by the subjects. In the present example, all the state transitions were predefined with the exception of the simulation of small postural movements, blinks, etc. Should the scenario require it, the state transitions could be freed to enforce true interaction as defined above. However, it is worth noting that the subject was kept blind to the underlying structure of the sequences of events, so that the system generated a fake interaction, which constitutes an intermediate between a true interaction and a classical stimulus/response task.
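The scenario logic described above can be pictured as a table-driven state machine. The following Python sketch is purely illustrative: it is not the MARC implementation, and the state names, subject inputs, expression labels, and utterance are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAction:
    """What the agent does on a transition (labels are hypothetical)."""
    expression: str      # facial expression to display
    utterance: str = ""  # verbal statement, empty if none


# Hypothetical predefined scenario:
# (current state, subject input) -> (next state, agent action)
SCENARIO = {
    ("greeting", "start"): ("show_cards", AgentAction("neutral", "Please choose a card.")),
    ("show_cards", "left"): ("feedback", AgentAction("joy")),
    ("show_cards", "right"): ("feedback", AgentAction("anger")),
    ("feedback", "continue"): ("show_cards", AgentAction("neutral")),
}


class ScenarioStateMachine:
    """Selects the agent's behavior as a function of the subject's input."""

    def __init__(self, scenario, initial_state):
        self.scenario = scenario
        self.state = initial_state

    def step(self, subject_input):
        key = (self.state, subject_input)
        if key not in self.scenario:
            raise ValueError(f"No transition from {self.state!r} on {subject_input!r}")
        self.state, action = self.scenario[key]
        return action


machine = ScenarioStateMachine(SCENARIO, "greeting")
for subject_input in ["start", "right", "continue", "left"]:
    action = machine.step(subject_input)
    print(machine.state, action.expression, action.utterance)
```

Because every transition is listed in a single table, the subject experiences an apparent interaction while the experimenter retains full control over the sequence of events; freeing some transitions would amount to replacing table entries with rules computed at run time.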

#### **HOW DIFFERENT ASPECTS OF SOCIAL COGNITION ARE MANIPULATED WITHIN THE VIRTUAL CARD GAME?**

As advocated in this article, virtual reality settings based on affective agents constitute a heuristic approach to the needs of naturalistic social cognitive research. We have described above a paradigm that benefits from real-time 3D rendering of emotions, with possibilities to add verbal communication and more or fewer degrees of freedom in the scenario. In the following, we exemplify how several constructs that are considered of importance in social cognition research are at least represented and, more importantly, can be directly manipulated within the present paradigm.

#### **Emotion recognition**

Understanding the game and answering correctly requires decoding facial affects and, more precisely, interpreting them as a means of communication in the absence of verbal advice from the agent to choose one card or the other. Although it is quite unusual in other paradigms based on affect recognition, the subjects have to interpret negative emotional displays as helpful cues. Direct manipulation of the emotional displays is easily made as the software platform allows parametric control of action units (valence, intensity, and arousal may be defined). In addition, dependent variables concerning emotion are obtained during specific trials in which sympathy ratings on the agent are recorded.

#### **Theory of mind**

Understanding the helping intention of the avatar is the most fundamental aspect of ToM that is elicited during the game, as in any paradigm based on forms of trust judgments. This judgment of cooperation might be easily manipulated by changing the strategy of the agent, and also by changing it adaptively during the game. This would be of major interest to assess patients' disposition to manage changing relationships and not to perseverate in their own strategy. More subtle is the fact that the current version of the paradigm makes the agent express the correctness of the subject's answers, making it possible to interpret this feedback as incorrect or misleading (e.g., I believe that Mary is lying even when she tells me that my choice is correct). Within a Bayesian framework such as that of Yoshida et al. (2010), such a simple task could bring some insights into the order of the ToM model that is used by the subject to perform the task.

#### **Metacognition**

Measuring metacognitive skills remains technically disputable and somewhat artificial. Following Koren et al. (2004), it is quite straightforward to implement measures of both monitoring and control. Self-monitoring could, as in our illustration, be assessed with self-reports on a trial-by-trial basis and with their correlations with response times, correct answer rates and, possibly, behavioral correlates (eye scanpaths, gait changes, skin conductance, etc.). Metacognitive control could easily and quite ecologically be implemented by allowing free-choice responses, i.e., the option of not responding.
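As a rough illustration of how trial-by-trial self-reports and free-choice responses could be turned into numbers, the Python sketch below computes two simple indices. Both indices, the field names, and the trial values are assumptions made for the example; they are not the measures defined by Koren et al. (2004) nor those used in the present paradigm.

```python
from statistics import mean

# Hypothetical trial records: "confidence" is a self-report in [0, 1];
# "correct" is None when the subject chose not to respond.
trials = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.8, "correct": True},
    {"confidence": 0.3, "correct": False},
    {"confidence": 0.2, "correct": None},
    {"confidence": 0.6, "correct": False},
    {"confidence": 0.1, "correct": None},
]


def monitoring_index(trials):
    """Mean confidence on correct answers minus mean confidence on errors."""
    answered = [t for t in trials if t["correct"] is not None]
    right = [t["confidence"] for t in answered if t["correct"]]
    wrong = [t["confidence"] for t in answered if not t["correct"]]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return mean(right) - mean(wrong)


def control_index(trials):
    """Mean confidence on answered trials minus mean confidence on withheld ones."""
    answered = [t["confidence"] for t in trials if t["correct"] is not None]
    withheld = [t["confidence"] for t in trials if t["correct"] is None]
    if not answered or not withheld:
        raise ValueError("need at least one answered and one withheld trial")
    return mean(answered) - mean(withheld)


print(round(monitoring_index(trials), 2))
print(round(control_index(trials), 2))
```

A positive monitoring index indicates that confidence tracks accuracy, and a positive control index indicates that the subject tends to withhold responses when confidence is low.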

To conclude this non-exhaustive line of argument, it appears that simple interactive designs contain enough degrees of freedom to permit simultaneous implementation of several constructs. To go beyond the present attempt to justify construct validity, psychometric studies would be necessary to ascertain the concurrent validity of the numerous dependent variables with regard to more consensual social cognitive tests.

#### **INSIGHTS ON EMPATHETIC RELATIONSHIP WITH A VIRTUAL AGENT FROM SUBJECTIVE REPORTS**

First of all, we describe the behavioral patterns found while using the experimental paradigm. While healthy subjects were successful at interpreting the emotions displayed by the agent as a form of communication in order to guide their choices (see behavioral data in Supplementary Material), schizophrenic patients were profoundly impaired in this task and exhibited very low performance rates and increased response times. In our view, grounded in the literature, this result is fully consistent with the evidence showing that schizophrenic patients have profound impairments in emotion recognition and ToM, together with slower cognitive processing.

Although the behavioral performance of the patients was low, it was interesting to explore their subjective reports about their encounter with the virtual agent. To do so, participants were asked to complete a short 11-item questionnaire (**Table 1**). Considering that the evaluation of the sense of presence would not be sufficiently informative on intersubjective aspects, we designed questions focusing specifically on relational empathy and on perception and understanding of the virtual agent's behavior. Thus, several important aspects are covered by the questions:


Participants were asked to indicate the extent to which they agreed with statements concerning their perception of the agent on a four-point Likert scale as follows: completely agree (100%), agree (66.6%), do not really agree (33.3%), and do not agree at all (0%).
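The conversion from the four response labels to agreement percentages can be sketched in a few lines of Python. The mapping follows the values given above; the function name and the example responses are hypothetical and do not come from the study data.

```python
from statistics import mean

# Agreement percentages associated with the four-point scale described in the text.
LIKERT_TO_PERCENT = {
    "completely agree": 100.0,
    "agree": 66.6,
    "do not really agree": 33.3,
    "do not agree at all": 0.0,
}


def mean_agreement(responses):
    """Mean agreement (in %) for one questionnaire item across participants."""
    return mean(LIKERT_TO_PERCENT[r] for r in responses)


# Hypothetical responses of four participants to a single item.
example = ["completely agree", "agree", "agree", "do not really agree"]
print(round(mean_agreement(example), 1))
```

Averaging these percentages per item and per group gives mean agreement ratings of the kind reported in **Table 1**.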

The results of the questionnaires indicated that the patients welcomed the use of the affective agent and that they were well motivated to perform the task. Their answers were quite comparable to those of healthy participants. Interestingly, participants agreed with the opinion that the virtual card game was realistic and played accordingly. According to Kim et al. (2008), studies with virtual avatars have shown that patients behave as if the virtual avatar was a real person standing in front of them and talking to them. They conclude that an avatar in a virtual environment can be used in interaction studies with schizophrenia patients [see also Ku et al. (2006)].

However, the results presented here draw a more complex picture and may give some orientations for future developments. Responses to the questions about the mental states of the agent were mixed, with a tendency to disagree with the statements that the agent took the participants' reactions into account and that she was helpful. Both the patients and the healthy subjects observed that the agent could change her attitude, a fact that is partially true as Mary provides new instructions at mid-game. Last, but of importance, the lowest agreement rates concerned Mary's ability to have an opinion on oneself. This suggests that even if the participants reacted to the agent's behavior (question 3), they did not forget that Mary was a computer-animated character. Apparently, neither healthy participants nor patients found cues suggesting that the agent had a representation of the participant's attitude, although they inferred a quite negative opinion about them from the agent's attitude. A first means to address this point and to increase the degree of empathy toward the virtual character consists in making the agent express opinions about the performance of the participant, or ask questions about the subjects' feelings. A second strategy is the introduction of low-level and short-term interaction phenomena. For instance, Timmermans and Schilbach (2014) emphasized the relevance of gaze contingencies and related abnormalities in psychopathological conditions such as schizophrenia, autism, and personality disorders (refer to the article in the present topic). Such technical improvements are compatible with the present platform and could, supposedly, improve the sense of presence elicited by the virtual agent.

**Table 1 | Relational empathy questionnaire: mean ratings of agreement expressed as percentages provided by the two populations after they had played with the virtual agent**.

#### **CONCLUSION AND PERSPECTIVES**

Advances in virtual reality and affective computing not only bring exciting perspectives on experimental methods but also profoundly challenge theories of normal and abnormal social functioning. By allowing replicable experimental designs combining multimodality, contextualized stimuli, interactivity with controlled degrees of unpredictability, as well as intersubjectivity based on first- and second-person perspectives, these techniques call for conceptual innovation. In the present article, we have tried to draw a panorama of some of the emerging concepts that would have to be considered in the future. We illustrated this approach with the presentation of a new paradigm and showed how subjective recording of the patients' and healthy subjects' experience could help improve the intersubjective phenomena.

We suggest that virtual reality paradigms with affective agents, as presented here, constitute a useful and innovative way of assessing social interactions in patients with schizophrenia. These new techniques could partially overcome the difficulties in predicting impaired real-life social functioning, thereby contributing to improving the ecological validity of tests. The use of affective and reactive avatars can circumvent the limitations of the test-battery approach and provide broader information about a patient's social cognitive skills, extending dependent variables to metacognition and subjective experience measures. Indeed, the paradigm presented here makes it possible to investigate globally or separately emotion recognition, mental state attribution, and metacognitive self-evaluation by varying each of these parameters.

As suggested in the experimental illustration, self-reports obtained before, during, and after the experiment might provide additional information about the subjective experience. It would be useful to design and to validate self-report scales to assess the empathetic relationship with a virtual agent. Knowledge on human empathy stressing the importance of cognitive and affective dimensions could refine the way to measure the components of an empathetic experience beyond the measures of immersion and presence. Because the quality and the vividness of the virtual experience could differentially affect the patient's motivation depending on his/her own psychopathology and personality, experiments should be conducted to inform designers of virtual reality remediation therapies on the best means to improve these aspects.

#### **ACKNOWLEDGMENTS**

This research was funded by a grant from the Agence Nationale de la Recherche: ANR-11-EMCO-0007. We are very thankful to Dr. Graziella Zanatta, Dr. Paul Roux, Dr. Erica Martins, Elisabeth Massé, and all the members of the Adult Psychiatric Service at Versailles Hospital for their support and their help with participant recruitment throughout this work.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fnhum.2015.00133/abstract

#### **REFERENCES**


pilot metacognitive study. *Schizophr. Res.* 70, 195–202. doi:10.1016/j.schres.2004.02.004


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### *Received: 25 March 2014; accepted: 26 February 2015; published online: 30 March 2015.*

*Citation: Oker A, Prigent E, Courgeon M, Eyharabide V, Urbach M, Bazin N, Amorim M-A, Passerieux C, Martin J-C and Brunet-Gouet E (2015) How and why affective and reactive virtual agents will bring new insights on social cognitive disorders in schizophrenia? An illustration with a virtual card game paradigm. Front. Hum. Neurosci. 9:133. doi: 10.3389/fnhum.2015.00133*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2015 Oker, Prigent, Courgeon, Eyharabide, Urbach, Bazin, Amorim, Passerieux, Martin and Brunet-Gouet. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# HUMAN NEUROSCIENCE

## Effect of perceived intimacy on social decision-making in patients with schizophrenia

#### **Sunyoung Park<sup>1</sup>, Jung Eun Shin<sup>2</sup>, Kiwan Han<sup>3</sup>, Yu-Bin Shin<sup>2</sup> and Jae-Jin Kim<sup>1,2,3</sup>\***

<sup>1</sup> Department of Psychiatry, Yonsei University College of Medicine, Seoul, South Korea

<sup>2</sup> Institute of Behavioral Science in Medicine, Yonsei University College of Medicine, Seoul, South Korea

<sup>3</sup> Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul, South Korea

#### **Edited by:**

Eric Brunet-Gouet, Centre Hospitalier de Versailles, France

#### **Reviewed by:**

Giancarlo Dimaggio, Centro di Terapia Metacognitiva Interpersonale, Italy

Philip L. Jackson, Université Laval, Canada

#### **\*Correspondence:**

Jae-Jin Kim, Department of Psychiatry, Yonsei University Gangnam Severance Hospital, 211 Eonju-ro, Gangnam-gu, Seoul 135-720, South Korea e-mail: jaejkim@yonsei.ac.kr

Social dysfunctions, including impairments in emotional perception and social decision-making, are common in patients with schizophrenia. The aim of this study was to determine the level of intimacy formation and the effect of intimacy on social decisions in patients with schizophrenia using virtual reality tasks, which simulate complicated social situations. Twenty-seven patients with schizophrenia and 30 healthy controls performed two virtual social tasks: the intimacy task and the social decision task. The first one was to estimate repeatedly how intimate participants felt with each avatar after listening to what avatars said. The second one was to decide whether or not participants accepted the requests of easy, medium, or hard difficulty made by the intimate or distant avatars. During the intimacy task, the intimacy rating scores for intimate avatars were not significantly different between groups, but those for distant avatars were significantly higher in patients than in controls. During the social decision task, the difference in the acceptance rate between intimate and distant avatars was significantly smaller in patients than in controls. In detail, a significant group difference in the acceptance rate was found only for the hard requests, but not for the easy and medium difficulty requests. These results suggest that patients with schizophrenia have a deficit in emotional perception and social decision-making. Various factors such as a peculiarity of emotional deficits, motivational deficits, concreteness, and paranoid tendency may contribute to these abnormalities.

**Keywords: schizophrenia, intimacy, social decision-making, virtual reality**

#### **INTRODUCTION**

Schizophrenia involves a wide range of cognitive, emotional, and behavioral dysfunctions, and no single symptom is pathognomonic of the disorder (American Psychiatric Association, 2000). Among others, social dysfunction has been emphasized as a defining feature for the course of schizophrenia (Olfson et al., 2011). Social dysfunction of patients with schizophrenia mostly results from impaired social cognition (Corrigan and Penn, 2001; Mancuso et al., 2011), which involves deficits in social relationships. In particular, because recognizing and understanding others' thinking and intentions may be an important element of building social relationships, impairments in theory of mind or metacognition may be a main contributor to social dysfunction in patients with schizophrenia (Dimaggio et al., 2008). Evidence for this contribution may be found in the fact that remediation of impaired social cognition and theory of mind helps improve social functioning in patients with schizophrenia (Combs et al., 2007; Lysaker and Dimaggio, 2014). In addition, impairments in emotional processing (Edwards et al., 2002; Schneider et al., 2006; Butler et al., 2009) and interaction with impaired emotional recognition and theory of mind (Brüne, 2005) need to be considered.

Intimacy may be one of the important factors in social relationships. Intimacy refers to the feeling of being in a close personal association and belonging together. It is a familiar and very close affective connection with others as a result of a bond that is formed through knowledge and experience of them (Laurenceau et al., 1998). Patients with schizophrenia have difficulty in forming social bonds due to an impaired capacity to understand the emotions of others and to express their own emotions (Kulhara et al., 1989; Green et al., 2005). Furthermore, their inability to be attuned to the context of social interactions may lead to social withdrawal and social disability (Salvatore et al., 2007). It should be noted that the ability to form intact intimacy and a desire for intimacy are also necessary for an appropriate intimate relationship with others (Baumeister and Leary, 1995). It has been reported that motivation for intimacy is decreased in patients with schizophrenia due to their symptoms (Hien et al., 1998). It is unclear, however, how much of an impairment they have in intimacy formation toward strangers. Therefore, a focus of the present study was the level of intimacy formation of patients with schizophrenia in experimental situations.

Another focus was a difference in the effect of intimacy on social decision-making between patients with schizophrenia and normal controls. Most decisions related to social situations are dependent on the concomitant choices of others (Sanfey, 2007). Decision-making consists of a complex set of processes, which include reward processing, coordination, and strategic reasoning. Some previous studies have addressed the interactive effect between emotion and decision-making (Hooker and Park, 2002; Tranel et al., 2002; Bechara and Damasio, 2005). In terms of social decision-making, various factors such as competition, social reward, theory of mind, and affection have been proposed to affect this function (Park et al., 1991; Sanfey et al., 2003; Paulus, 2007; Sanfey, 2007). Intimacy may be another example of these factors.

In general, we tend to consort with people toward whom we have positive emotions such as intimate feelings. Conversely, a response to negative emotions like anger reduces cooperation and increases conflict. This pattern of attitude suggests that social decisions are heavily influenced by emotion, including intimacy (Van Kleef et al., 2010). It is unclear, however, whether a similar pattern is also found in patients with schizophrenia, who have a deficit in motivation for intimacy. This may be important in providing useful information to psychosocial rehabilitation programs for patients with schizophrenia.

Some factors should be considered when investigating the effect of intimacy on social decision-making. Because intimacy is an emotion that arises in real social relationships, its evaluation also needs to reflect the complex social situations of real life. This need may be particularly important in patients with schizophrenia who have social dysfunction due to various psychotic symptoms and motivational deficits (Kim et al., 2007). In addition, social anxiety and self-esteem should be taken into account when investigating intimacy levels in patients with schizophrenia (Lysaker et al., 2010). Given that people with social anxiety tend to avoid risk-taking decisions (Maner et al., 2007) and those with low self-esteem tend to make decisions depending on group norms rather than their own will (Crocker and Major, 1989; Anthony et al., 2007), social anxiety and self-esteem may be considered to be other factors influencing the effect of intimacy on social decision-making in patients with schizophrenia. In fact, there is evidence that patients with schizophrenia have a high level of social anxiety and low self-esteem, which are closely linked to each other (Karatzias et al., 2007; Lysaker et al., 2008).

In this study, we produced virtual reality tasks, which are suitable for designing controlled, complex social situations to evaluate the behavior of patients with schizophrenia (Park et al., 2011; Han et al., 2012). Virtual reality is a useful technique to simulate various social situations by providing an immersive environment with three-dimensional rendering and a safe experimental environment without the limitation of time and space (Han et al., 2009, 2012). Because virtual reality provides emotional and social stimuli in a natural manner and automatically records objective behavioral parameters such as interpersonal distance, reaction time, and types of responses, several studies have used it to estimate human behaviors in established social situations (Park et al., 2009; Han et al., 2012; Kane et al., 2012).

In the present study, virtual reality was used to construct avatars with whom participants interacted and built intimacy in complex, dynamic social situations. The purpose of this study was to determine the level of intimacy formation and the effect of intimacy on social decisions in patients with schizophrenia. For this purpose, participants' tasks were to build intimacy with avatars under experimental conditions and to decide whether or not to accept the avatars' requests, and the results were compared between patients with schizophrenia and healthy controls. The hypotheses were that (1) patients would have difficulty in the formation of intimacy with avatars and (2) requests from less intimate avatars would be rejected in similar proportions by patients and controls, but requests from more intimate avatars would be accepted less often by patients because they might feel less intimate with avatars compared to controls.

#### **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Twenty-seven patients with DSM-IV-TR (Diagnostic and Statistical Manual of Mental Disorders-IV-Text Revision) (American Psychiatric Association, 2000) schizophrenia were recruited from an outpatient clinic. All patients were medicated with one or two atypical antipsychotics and were clinically stable. Mean illness duration was 9.7 (SD = 4.6) years, and mean Positive and Negative Syndrome Scale (PANSS) (Kay et al., 1987) score for measuring symptom severity was 64.9 (SD = 13.5) (**Table 1**). Thirty non-psychiatric healthy controls were age- and gender-matched to patients, and were confirmed to have no history of any psychiatric or neurologic illness as diagnosed by a psychiatrist. All participants were aged 28–39 years. Years of education [patients: 14.6 (SD = 2.3), controls: 15.7 (SD = 1.4), *t* = 2.14, *p* = 0.04] were significantly different between the two groups. Considering that social anxiety is associated with avoidant decision-making (Maner et al., 2007), the Liebowitz Social Anxiety Scale (Heimberg et al., 1999) was additionally administered, but did not show a significant group difference. The Rosenberg Self-esteem Scale (Robins et al., 2001) was administered to assess self-esteem effects on social decision-making, and the scores were significantly lower in patients than in controls (*t* = 2.72, df = 45.60, *p* = 0.02). This study was approved by the local institutional review board, and written informed consent was obtained from all participants.

#### **DESIGN AND PROCEDURE**

All participants were tested with two experimental behavioral tasks, which included several typical everyday environments and four avatar types using virtual reality. The virtual environment was introduced as depicting ordinary situations in the participants' community, and avatars were presented as acquaintances. The experimental tasks were constructed using Game Studio A6 Engine (Conitec Datasystems, oP Group).

#### **Table 1 | Participant characteristics**.

LSAS, Liebowitz Social Anxiety Scale; RSES, Rosenberg Self-esteem Scale; SPM, Standard Progressive Matrices; PANSS, Positive and Negative Syndrome Scale. Data are presented as mean (SD).

#### **Intimacy task**

This task aimed to build intimacy with virtual avatars before the next task was administered, and to estimate the level of intimacy that participants felt with each of the four avatars. Two avatars were constructed to be intimate (referred to as "intimate avatars"), and the others were constructed to be distant (referred to as "distant avatars"). Avatars spoke five times to participants in the manner of a familiar and informal relationship (e.g., "The weather was really good last Sunday! Did you go outside with your family?") or of an unfamiliar and formal relationship (e.g., "We are not going to make progress any more. Let's take a rest."), respectively. Appearance, voice tone, and politeness in manner of speech were carefully constructed to reflect their assigned intimacy level. After listening to what avatars said, participants used a mouse to estimate how intimate they felt with each avatar using a Likert scale, which ranged from 0 to 100 and was marked at every 25 points (**Figure 1A**). Participants were instructed to rate above 50 if they felt intimate and below 50 if they felt distant. The intimacy score for each avatar type was defined as the mean score over the 10 trials for the two intimate or the two distant avatars. There was no time limit for this task. In the preliminary validity test in 20 normal volunteers who did not participate in the main experiment, intimate avatars were rated above 55 and distant avatars were rated below 45.

#### **Social decision task**

This task was constructed to assess participants' social decision-making in complicated social situations. Avatars, which participants had become familiar with during the intimacy task, requested them to do something. The requests were made at three levels of difficulty: easy (e.g., "I am searching for a hat to buy online. Can you select one for me?"), medium (e.g., "I have left my cell phone at home. Can I use your cell phone for an hour?"), and hard (e.g., "I need to go on a trip with my friends next week. May I use your car next week?"). The dialogues in the tasks are reported in the Supplementary Material. They were selected from the requests classified into one of the three categories in the preliminary validity test. Categories were assigned according to the responses of the 20 normal volunteers who also participated in the preliminary validity test for the intimacy task: "easy" if accepted by more than 60%, "medium" if accepted by between 40 and 60%, and "hard" if accepted by fewer than 40%. Each of the four avatars made 9 easy, 9 medium, and 9 hard requests, and thus 108 trials were included in the task. Participants were asked to decide whether or not they would accept the request by clicking a corresponding mouse button (**Figure 1B**). Participants were asked to make the decision as quickly as possible before the next trial began. The asking period was 5 s for all requests, and the following responding period lasted 4 s. Scores were 1 for acceptance and 0 for refusal, and average scores for each avatar were considered to be the acceptance rate. In addition, reaction time was automatically recorded as the time between the start of the responding period and the clicking response. Reaction times of acceptance and refusal responses were merged and analyzed together.
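The request classification, trial structure, and scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the avatar labels, the function names, the treatment of exactly 40% and 60% as "medium", and the exclusion of missed responses from the acceptance rate are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical labels; the paper only states two intimate and two distant avatars.
AVATARS = ["intimate_1", "intimate_2", "distant_1", "distant_2"]
DIFFICULTIES = ["easy", "medium", "hard"]
REQUESTS_PER_LEVEL = 9  # 9 easy, 9 medium, and 9 hard requests per avatar


def classify_request(pilot_acceptance_pct: float) -> str:
    """Map the pilot acceptance percentage of a request to a difficulty level.

    Assumption: boundary values (exactly 40 or 60) fall into "medium".
    """
    if pilot_acceptance_pct > 60:
        return "easy"
    if pilot_acceptance_pct >= 40:
        return "medium"
    return "hard"


def build_trials():
    """Enumerate every (avatar, difficulty, request index) combination."""
    return list(product(AVATARS, DIFFICULTIES, range(REQUESTS_PER_LEVEL)))


def acceptance_rate(responses):
    """Mean of 1 (accept) / 0 (refuse) scores.

    Assumption: None marks a missed response and is excluded from the mean.
    Returns None when no response was given at all.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


if __name__ == "__main__":
    print(len(build_trials()))                 # number of trials in the task
    print(classify_request(75))                # difficulty of a request accepted by 75% in the pilot
    print(acceptance_rate([1, 0, 1, None]))    # rate over the answered trials only
```

Enumerating the design this way makes the trial count explicit: 4 avatars × 3 difficulty levels × 9 requests gives the 108 trials reported for the task.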

#### **STATISTICAL ANALYSIS**

Demographic characteristics were compared between groups using Student's *t*-test and the Chi-square test. Considering that our data included several missing responses, especially in patients, the behavioral performances such as the intimacy score, acceptance rate, and reaction time were analyzed using a mixed linear model.

The variables for the main and interaction effects were *avatar type* (intimate avatars and distant avatars) and *group* (patients and controls) for the intimacy score, and were *avatar type*, *request difficulty* (easy, medium, and hard), and *group* for the acceptance rate and reaction time. Considering the missing data, LSMEANS was used to report behavioral results. Pearson correlations of the behavioral performances with scores on the Liebowitz Social Anxiety Scale and Rosenberg Self-esteem Scale in each group and with PANSS scores in patients were calculated. When analyzing behavioral data, years of education was used as a covariate.
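The text does not spell out the exact model specification. As a sketch only, one plausible form of the mixed linear model for the intimacy score is given below. The random intercept per participant and the coding of the predictors are assumptions, not details reported by the authors.

$$
y_{ij} = \beta_0 + \beta_1 A_j + \beta_2 G_i + \beta_3 (A_j \times G_i) + \beta_4 E_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $y_{ij}$ is the intimacy score of participant $i$ for avatar type $j$, $A_j$ codes avatar type, $G_i$ codes group, $E_i$ is years of education (the covariate), $u_i$ is the assumed participant-level random intercept, and $\varepsilon_{ij}$ is the residual. For the acceptance rate and reaction time, the same sketch would be extended with request difficulty and its interactions with avatar type and group.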

#### **RESULTS**

#### **INTIMACY RATING**

The intimacy rating scores showed significant main effects of avatar type (Num DF = 1, Den DF = 169, *F* = 1,057.04, *p* < 0.01) and group (Num DF = 1, Den DF = 169, *F* = 4.49, *p* = 0.04). They were significantly higher for intimate avatars (LSMEANS: 72.4 ± 1.2) than for distant avatars (LSMEANS: 28.5 ± 1.2), and were significantly higher in patients (LSMEANS: 52.5 ± 1.4) than in controls (LSMEANS: 48.4 ± 1.3). The intimacy scores also showed a significant interaction effect between avatar type and group (Num DF = 1, Den DF = 169, *F* = 25.71, *p* < 0.01). As shown in **Figure 2**, the intimacy scores for intimate avatars were not significantly different between groups (LSMEANS: patients, 71.0 ± 1.7; controls, 73.7 ± 1.6; *p* > 0.05), but those for distant avatars were significantly higher in patients than in controls (LSMEANS: patients, 34.0 ± 1.7; controls, 23.0 ± 1.6; *p* < 0.01).
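The reported main-effect LSMEANS can be related to the four cell LSMEANS with simple arithmetic. The sketch below assumes that each marginal least-squares mean is the unweighted average of the corresponding cell means (the usual definition for a factorial model with an interaction term); the variable and function names are illustrative.

```python
# Cell LSMEANS as reported in the text: (group, avatar type) -> intimacy score.
cell_means = {
    ("patients", "intimate"): 71.0,
    ("controls", "intimate"): 73.7,
    ("patients", "distant"): 34.0,
    ("controls", "distant"): 23.0,
}


def marginal_mean(factor_index: int, level: str) -> float:
    """Unweighted average of the cell means that share one factor level.

    factor_index 0 selects by group, 1 selects by avatar type.
    """
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)


intimate = marginal_mean(1, "intimate")   # about 72.35, reported as 72.4
distant = marginal_mean(1, "distant")     # 28.5, reported as 28.5
patients = marginal_mean(0, "patients")   # 52.5, reported as 52.5
controls = marginal_mean(0, "controls")   # about 48.35, reported as 48.4

# Interaction contrast: intimate-minus-distant gap in patients minus the gap in controls.
gap_patients = cell_means[("patients", "intimate")] - cell_means[("patients", "distant")]
gap_controls = cell_means[("controls", "intimate")] - cell_means[("controls", "distant")]
interaction_contrast = gap_patients - gap_controls  # negative: smaller gap in patients
```

The negative contrast reflects the pattern described above: patients separated intimate from distant avatars by a smaller margin than controls, driven by their higher ratings of distant avatars.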

#### **ACCEPTANCE RATE**

As shown in **Table 2**, a significant main effect of avatar type was found; the acceptance rates were higher for intimate avatars than for distant avatars (Num DF = 1, Den DF = 617, *F* = 71.42, *p* < 0.01). A significant main effect of request difficulty was also revealed (Num DF = 2, Den DF = 617, *F* = 275.13, *p* < 0.01); *post hoc* analysis demonstrated that the acceptance rates differed significantly between the request difficulty levels (easy > medium, *p* < 0.01; medium > hard, *p* < 0.01).

There was no main effect of group, but an interaction effect was found between avatar type and group (Num DF = 1, Den DF = 617, *F* = 9.40, *p* < 0.01). As shown in **Figure 3**, the increase in the acceptance rate for intimate avatars relative to distant avatars was significantly smaller in patients than in controls (*p* < 0.05). A significant interaction effect was also found between request difficulty and group (Num DF = 2, Den DF = 617, *F* = 6.66, *p* < 0.01). In both avatar types, a significant group difference was found only for the hard requests (*p* < 0.05), but not for the easy and medium difficulty requests. There was no significant interaction between avatar type, request difficulty, and group.

#### **REACTION TIME**

As shown in **Table 2**, a significant main effect of group was found in reaction time, which was longer in patients than in controls (Num DF = 1, Den DF = 617, *F* = 7.12, *p* < 0.01). A significant main effect of avatar type was also found; reaction time was longer for intimate avatars than for distant avatars (Num DF = 1, Den DF = 617, *F* = 19.64, *p* < 0.01). The main effect of request difficulty was also significant (Num DF = 2, Den DF = 617, *F* = 10.21, *p* < 0.01); reaction time was significantly shorter for the easy requests than for the medium requests (*p* < 0.01), but there was no difference between the easy and hard requests or between the medium and hard requests.

No significant interaction effect was found between avatar type and group, but one was found between request difficulty and group (Num DF = 2, Den DF = 617, *F* = 4.59, *p* = 0.01); in both avatar types, reaction time for the easy requests was significantly longer in patients than in controls (*p* < 0.05), whereas no group difference was found for the medium or hard requests. No interaction effect was found between avatar type, request difficulty, and group.

#### **CORRELATIONS**

The Rosenberg Self-esteem Scale scores were significantly correlated only with the acceptance rates for the hard requests in patients (*r* = −0.50, *p* = 0.007), and not in controls (**Figure 4**). The Liebowitz Social Anxiety Scale scores were not significantly correlated with the acceptance rates or reaction times for any request difficulty level in either group. In patients, the PANSS total scores were significantly correlated with reaction time (*r* = 0.57, *p* = 0.002). In detail, the correlation was significant with the negative symptom scores (*r* = 0.54, *p* = 0.004), but not with the positive symptom scores (*r* = 0.37, *p* = 0.06).
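The significance of a Pearson correlation is commonly assessed by converting *r* to a *t* statistic with *n* − 2 degrees of freedom. The sketch below applies this standard conversion to the reported coefficients. It assumes that all 27 patients contributed to each correlation, which the text does not confirm given the missing responses, and it uses 2.060 as the two-tailed 5% critical value for 25 degrees of freedom.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """Convert a Pearson correlation to its t statistic (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N_PATIENTS = 27      # assumption: no patient dropped from the correlations
T_CRITICAL = 2.060   # two-tailed, alpha = 0.05, df = 25

reported = {
    "self-esteem vs. hard-request acceptance": -0.50,
    "PANSS total vs. reaction time": 0.57,
    "PANSS negative vs. reaction time": 0.54,
    "PANSS positive vs. reaction time": 0.37,
}

for label, r in reported.items():
    t = t_from_r(r, N_PATIENTS)
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{label}: r = {r:+.2f}, t = {t:+.2f}, {verdict} at the 5% level")
```

Under these assumptions, the first three coefficients exceed the critical value and the positive-symptom coefficient falls just short of it, which matches the pattern of *p* values reported above.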

#### **DISCUSSION**

In the present study, we examined the difference in the level of intimacy formation and the effect of intimacy on social decision between patients with schizophrenia and normal controls. Both patients and controls tended to feel closer to intimate avatars than distant avatars and to accept requests less often as requests grew more difficult. These results suggest that virtual reality tasks simulating complex social situations of real life can be effectively applied to both groups. As expected, however, compared with controls, patients showed a different pattern in the formation of intimacy with avatars and refused the distant avatars' requests more often.

#### **INTIMACY RATING**

The intimacy rating scores for intimate avatars during the intimacy task were not different between patients and controls, but those for distant avatars were higher in patients than in controls, suggesting that there is a difference in social cue-based emotional cognition between the two groups. This is in line with previous findings that patients with schizophrenia have impairment in social cognition, including emotion perception, theory of mind, and attributional style (Penn et al., 2008). In particular, inappropriate intimacy rating for distant avatars in patients may reflect deficits in emotion perception, which were reported to be prominent in schizophrenia (Pinkham et al., 2007).

In general, expressing intimacy is positive and affiliative, whereas expressing distance is awkward and estranged (Feeney, 1999). Attitudes of distant avatars could be perceived as rude or brazen by participants, and these could induce negative emotion. Therefore, our results may reflect a bias of patients with schizophrenia toward negative emotional stimuli, as revealed in several behavioral studies investigating emotional perception (Kinderman, 1994; Loughland et al., 2002; Choi et al., 2010). A previous study using a role-play with various situations demonstrated that patients with schizophrenia consistently underestimated the intensity of negative emotion, but not the intensity of positive emotion (Bellack et al., 1992). This pattern of behaviors might be an expression of denial of the aversive condition or a coping response to control the overwhelming and aversive input. Taken together, the feature of feeling more intimate with distant avatars, rather than feeling more distant from intimate avatars, reflects a peculiarity of emotional deficits in schizophrenia.

Alternatively, abnormal theory of mind might contribute to the findings of impaired intimacy formation. In other words, it could have been difficult for patients with impaired theory of mind to understand intentions of avatars in various situations, which were included in the intimacy task. This possibility is consistent with a previous report that accurate mind-reading was important to building an intimate relationship (Thomas and Fletcher, 2003).

#### **ACCEPTANCE RATE**

Patients showed no difference compared with controls in the acceptance rate for either intimate or distant avatars during the social decision task, suggesting that patients with schizophrenia are able to take intimacy into account, as control subjects do, when accepting or rejecting requests. However, further analyses showed that the difference in the acceptance rate between intimate and distant avatars was smaller in patients than in controls, suggesting that the effect of intimacy on social decision is relatively small in schizophrenia.

Furthermore, during the social decision task, patients showed similar acceptance rate for the easy and medium difficulty requests when compared with controls, but lower acceptance rate for the hard requests than controls, suggesting motivation deficits in schizophrenia. Although patients with schizophrenia are able to predict and value social factors such as reciprocity and equity, the ability to feel anticipatory pleasure for social reward may be insufficient to motivate them to incur greater cost (Choi et al., 2013). Therefore, our findings during the social decision task can be interpreted as an aspect of negative symptoms in a broad sense (Gorissen et al., 2005; Choi et al., 2013).

Alternatively, the concreteness of patients' thinking may have an effect on the results. When a represented situation becomes complex, individuals may need abstract thinking or flexibility to consider various factors, including familiarity, reciprocity, equity, and hierarchy. Given that patients with schizophrenia are known to be more impaired when making more abstract social judgments (Penn et al., 1997), the difference in the hard requests can be explained by the patients' concrete judgment in the face of rising situational complexity. Another possible explanation for the lower acceptance rate for the hard requests in patients is their paranoid tendency. In the current study, 66.7% of patients had the paranoid subtype. As a request gets harder, paranoid patients may interpret it as threatening or exploitative.

#### **EFFECTS OF SELF-ESTEEM AND SOCIAL ANXIETY**

Meanwhile, patients with stronger self-esteem showed a tendency to refuse the hard requests more often. Our study was conducted in an Asian country in which collectivistic value orientation is dominant (Jackson et al., 2006; Dierdorff et al., 2011). In this cultural background, it is typical for people to take account of the personal relationship, and thus healthy controls might have accepted the hard requests regardless of their self-esteem. However, patients with schizophrenia who have deficits in social and emotional functioning might have been more influenced by self-esteem. More independent and less interdependent features have been shown to predict higher self-esteem levels (Singelis et al., 1999).

We expected a similar influence of social anxiety, but no such effect was found. Given that a strong correlation between self-esteem and social anxiety has been reported in patients with schizophrenia (Lysaker et al., 2008), this negative result is somewhat surprising. This result may reflect a characteristic of our participants, who showed a significant group difference in self-esteem, but not in social anxiety. If more patients with severe social anxiety had been recruited, the result might have been different.

#### **REACTION TIME AND SYMPTOM SEVERITY**

Overall reaction times were very short, probably because the request presentation period was long enough for participants to have made their decision before the separate responding period began. Nonetheless, our results showed that patients reacted more slowly, and as their symptom severity increased, reaction time also increased. This feature may correspond to previous studies, which revealed significant relationships between symptoms, cognitive impairment, and reaction time (Smyrnis et al., 2009; Neill and Rossell, 2013). In particular, delayed reaction time was significantly correlated with the negative symptom scores, but not with the positive symptom scores. A previous study reported that negative symptoms were correlated with performance on simple reaction time tasks in patients with persistent illness rather than patients with fluctuating illness, suggesting that persistent illness, negative symptoms, and impaired initiation may reflect enduring brain structural abnormalities (Ngan and Liddle, 2000). Therefore, it needs to be considered that delayed reaction time in patients may be related to the severity of negative symptoms and poor outcomes.

#### **CLINICAL IMPLICATION AND LIMITATIONS**

Our findings provide additional information for clinical practitioners. Various therapies are available, such as social skills training to help patients with schizophrenia cope with the affiliative relationships and roles required for independent living (Hogarty et al., 1986; Kopelowicz et al., 2006). To build up a patient's social repertoire to a proficient level, therapists must train a wide skill spectrum, which includes social perception, social information processing, affiliative skills, and so on (Kopelowicz et al., 2006). In this study, patients showed difficulty in intimacy formation and inappropriate acceptance of requests. During therapy, if understanding of various social cues and social reward expectations can be enhanced, clinicians can lead patients to be more affiliative. In particular, a training program using virtual reality can be especially beneficial in improving motivation to participate in the therapy and enhancing conversational and assertiveness skills (Park et al., 2011).

There were some limitations in the present study. The level of education was significantly lower in patients than in controls, and thus it was included as a covariate. Comprehensive cognitive measures were not applied, and thus we did not know if our findings of social deficits were related to cognitive dysfunctions. A small sample size was another limitation. A larger sample would likely capture a wider range of emotional influences on social behaviors. In addition, there was some task weakness with respect to enhancing participants' intimacy with avatars. In a previous study, interactive situations were shown to be necessary for participants to feel intimate with avatars and situations (Kane et al., 2012). In the present study, however, participants did not have a chance to interact with avatars, and this could have prevented participants from feeling that they were experiencing a real social relationship. It was also a limitation that the feeling of reality during the tasks was not evaluated using an appropriate scale.

#### **CONCLUSION**

The present study, which used virtual reality tasks to determine the level of intimacy formation and the effect of intimacy on social decision, revealed that, compared with controls, patients with schizophrenia showed significantly higher intimacy rating scores for distant avatars during the intimacy task and significantly smaller increases in the acceptance rate for intimate avatars relative to distant avatars during the social decision task. In addition, patients tended to refuse the hard requests more often than controls. These results suggest that patients with schizophrenia have a deficit in emotional perception and social decision-making. Various factors such as a peculiarity of emotional deficits, motivational deficits, concreteness, and paranoid tendency may contribute to these abnormalities. Our results provide additional information on impairments in social cognition and social interaction in patients with schizophrenia.

#### **ACKNOWLEDGMENTS**

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. NRF-2013R1A2A2A03068342).

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fnhum.2014.00945/abstract


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 04 July 2014; accepted: 06 November 2014; published online: 24 November 2014.*

*Citation: Park S, Shin JE, Han K, Shin Y-B and Kim J-J (2014) Effect of perceived intimacy on social decision-making in patients with schizophrenia. Front. Hum. Neurosci. 8:945. doi: 10.3389/fnhum.2014.00945*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Park, Shin, Han, Shin and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## RC2S: a cognitive remediation program to improve social cognition in schizophrenia and related disorders

#### **Elodie Peyroux 1,2,3\* and Nicolas Franck 1,2,3**

<sup>1</sup> Rehabilitation Department (CL3R), Le Vinatier Hospital, Lyon, France

<sup>2</sup> UMR 5229, Center of Cognitive Neurosciences, CNRS, Bron, France

<sup>3</sup> University of Lyon, Lyon, France

#### **Edited by:**

Ali Oker, Université de Versailles Saint-Quentin-en-Yvelines, France

#### **Reviewed by:**

Eric Brunet-Gouet, Université de Versailles Saint-Quentin-en-Yvelines, France; Mathieu Urbach, Hôpital André Mignot, France; Ali Oker, Université de Versailles Saint-Quentin-en-Yvelines, France

#### **\*Correspondence:**

Elodie Peyroux, Rehabilitation Department (CL3R), Le Vinatier Hospital, 98 rue Boileau, Lyon 69006, France

e-mail: elodie.peyroux@gmail.com

In people with psychiatric disorders, particularly those suffering from schizophrenia and related illnesses, pronounced difficulties in social interactions are a key manifestation. These difficulties can be partly explained by impairments in social cognition, defined as the ability to understand oneself and others in the social world, which includes abilities such as emotion recognition, theory of mind (ToM), attributional style, and social perception and knowledge. The impact of several kinds of interventions on social cognition has been studied recently. The best outcomes in the area of social cognition in schizophrenia are those obtained by way of cognitive remediation programs. New strategies and programs in this line are currently being developed, such as RC2S (cognitive remediation of social cognition) in Lyon, France. Considering that the social cognitive deficits experienced by patients with schizophrenia are very diverse, and that the main objective of social cognitive remediation programs is to improve patients' functioning in their daily social life, RC2S was developed as an individualized and flexible program that allows patients to practice social interaction in a realistic environment through the use of virtual reality techniques. In the RC2S program, the patient's goal is to assist a character named Tom in various social situations. The underlying idea for the patient is to acquire cognitive strategies for analyzing social context and emotional information in order to understand other characters' mental states and to help Tom manage his social interactions. In this paper, we begin by presenting some data regarding the social cognitive impairments found in schizophrenia and related disorders, and we describe how these deficits are targeted by social cognitive remediation. Then we present the RC2S program and discuss the advantages of computer-based simulation to improve social cognition and social functioning in people with psychiatric disorders.

**Keywords: social cognition, schizophrenia and related disorders, cognitive remediation, simulation techniques, social functioning**

#### **INTRODUCTION**

Adequate social functioning is the consequence of a person's ability to interact appropriately and effectively in the social world (Hooley, 2010). In people with psychiatric disorders, particularly those suffering from schizophrenia and related illnesses, pronounced difficulties in social interactions are a key manifestation (Bellack et al., 1990). Deterioration of social relations was recognized a century ago in the earliest clinical descriptions of schizophrenia, and still remains a hallmark of the disorder at all of its stages. Individuals with psychiatric disorders are often socially isolated, unemployed, unable to manage money, and generally lack the skills to live independently. Moreover, these impairments and disabilities are often not improved by psychotropic medication. Poor social skills may promote relapses and have a negative impact on interpersonal support, social affiliation, and quality of life (Kopelowicz et al., 2006). Interpersonal and relational problems exacerbate symptoms, contribute to psychological suffering, and slow down the rehabilitation process (Prouteau, 2011). Impaired social functioning cannot be accounted for solely by the symptoms of the disorder or the effects of medication and hospitalization (Hooley, 2010). It may in fact be an early marker of schizophrenia.

Interpersonal difficulties can be partly explained by impaired social cognition, defined as the ability to understand oneself and others in the social world (Penn et al., 2008) or to construct mental representations about others and oneself, and about one's relationships to others (Brothers, 1990). Currently, a consistent and significant scientific body of literature attests to dysfunctional social cognition in schizophrenia (Penn et al., 1997; Green et al., 2005, 2008). According to the American Psychiatric Association (2000), impaired social functioning is one of the hallmark characteristics of schizophrenia and also one of the most important unmet treatment needs for people suffering from this mental disease (Kern et al., 2009).

#### **COGNITIVE IMPAIRMENTS IN SCHIZOPHRENIA**

Cognitive impairments are a core feature of schizophrenia that is strongly associated with functioning in areas such as work, social relationships, and independent living (McGurk et al., 2007). The cognitive deficits of people with schizophrenia and related disorders are found in the domains of neurocognition and social cognition. Their impaired neurocognitive functions are now well documented, and the most pronounced are in the areas of attention, verbal memory, and executive functioning (Medalia and Choi, 2009). To address these critical problems facing people with schizophrenia, a range of cognitive remediation programs have been developed and evaluated over the past 40 years. The principle underlying this therapeutic approach is the enhancement of patients' cognitive resources with a view to improving their cognitive functions, and indirectly, the functional disabilities that impinge upon their daily lives (Demily and Franck, 2008). In the past few years, many authors have studied the links between cognition and functional outcome in schizophrenia. Although the association between neurocognition and functional impairment is significant, correlations with composite measures of neurocognition are only moderate and explain a modest part of the variance in functional outcomes.

These results have prompted the search for other factors likely to enhance our understanding of the relationships between cognition and functional deficits (Schmidt et al., 2011). The most promising mediator uncovered so far lies in the area of social cognition. Hence, the first question raised by clinicians and researchers was: are neurocognition and social cognition separate factors in schizophrenia? According to the majority of studies, social cognition and neurocognition are related but distinct constructs in schizophrenia (Allen et al., 2007; Sergi et al., 2007a; Hoe et al., 2012). Recent models investigating the role of social cognition have proposed that this cognitive domain may serve as a mediating link between neurocognition and community functioning (Vauth et al., 2004; Schmidt et al., 2011). This implies that neurocognitive impairments may have a negative impact on social cognition and thereby exert a negative influence on the patient's functional state (Schmidt et al., 2011). Social cognition is actually strongly associated with social functioning and plays a key role in social and community integration, work life, and interpersonal relationships (Couture et al., 2006). A recent meta-analysis (Fett et al., 2011) showed that social cognition explains more of the variance in community outcome (16%) than neurocognition does (6%). Moreover, a major part of the functional prognosis of people with schizophrenia has been found to depend on social cognition disorders (Penn et al., 1996; Kee et al., 2003).

Social cognition is not a one-dimensional construct, however (van Hooren et al., 2008). In schizophrenia, three to five of the components of this cognitive domain are usually altered and have specific detrimental effects on functioning in everyday life (Green et al., 2004, 2008; Penn et al., 2008). The first component is emotional processing, which is the ability to identify and recognize emotions through facial expressions, gestures, and tone of voice. Emotional processing has been widely studied in schizophrenia, and deficits in facial and vocal emotion recognition are now well established (Edwards et al., 2002). Moreover, impairments of this component may be one of the most important factors contributing to the social isolation of people with schizophrenia (Hoekert et al., 2007). The second component is theory of mind (ToM), which is defined as "the ability to attribute mental states (beliefs, intents, desires, ...) to oneself and others, and to understand that others have beliefs, intentions, and desires that are different from one's own" (Premack and Woodruff, 1978). Impaired ToM in schizophrenia is now well documented (Sprong et al., 2007; Bora et al., 2009) and this component seems to be fundamental to proper social functioning (Couture et al., 2011). The third component, attributional style, refers to how people explain the causes of positive and negative events. Generally, the causes of an event can be attributed to oneself (human internal attribution), to others (human external attribution), or to the environment (situational external attribution). In schizophrenia, people tend to blame others for their negative life events rather than sharing the responsibilities between different sources (Favrod et al., 2013). These attributional biases seem to have a direct influence on social interaction abilities. The tendency to make stable attributions for life events has been shown to predict more frequent social contacts, a higher quality of social interaction, and better community participation (Lysaker et al., 2004). The last two components, social perception and knowledge, can be defined as decoding and interpreting social cues from others, taking the social context into account, and knowing social rules, roles, and goals. Social perception and social knowledge are closely linked to community functioning and are a necessary prerequisite for well-adapted social skills (Green et al., 2005).

Each component's impairment has specific detrimental effects on everyday life. For example, strong correlations have been identified between emotional processing and occupational and social status, and between ToM and behavioral adaptation. Moreover, social cognition impairments seem to have an impact on psychotic symptomatology, although the association between them remains unexplored. According to Demily and Franck (2008), social cognition deficits contribute to the onset and maintenance of symptoms. Concerning positive symptoms, social cognition impairments may contribute to the sense of insecurity in schizophrenia which leads to delusions, and notably to persecutory delusions. These social cognition disorders may also favor negative symptoms through a lack of interest in others and reduced social relationships. Social cognition therefore constitutes a central target for improving the ability of people with schizophrenia to interact socially in an appropriate way.

#### **COGNITIVE REMEDIATION OF SOCIAL COGNITION**

The impact on social cognition of several kinds of interventions has been studied recently. Concerning antipsychotic medication, no improvement in the area of social cognition has been highlighted, even with second-generation treatments (Sergi et al., 2007b). Other techniques like the intranasal administration of oxytocin have also been investigated, with some studies showing encouraging results such as a decrease in positive symptoms (Churchland and Winkielman, 2012) and an improvement in ToM and social perception (Pedersen et al., 2011). However, the most satisfactory results for improving social cognition in schizophrenia are those obtained by cognitive remediation (Kurtz and Richardson, 2011). Several new cognitive remediation strategies and programs are currently being developed.

In the cognitive remediation of social cognition, three kinds of interventions, based on three different theoretical foundations, are available today (see **Figure 1**, and for a French review of cognitive remediation programs targeting social cognition in schizophrenia, see Peyroux et al., 2013). The earliest ones are called "wide interventions" and are based on the idea that the acquisition of basic cognitive skills strengthens relational competency. In this domain of intervention, two of the most well-known therapies are CET (Hogarty and Flesher, 1999; Hogarty et al., 2004) and IPT (Roder et al., 1988, 2006), a Swiss program that combines neurocognitive and social cognitive interventions with training in social-skills. Other programs called "targeted interventions" are more specific. Each one targets a specific component of social cognition such as ToM, e.g., ToMRemed, a French program developed in Versailles (Bazin et al., 2010), or facial emotion recognition, e.g., TAR (Frommann et al., 2003) and Gaïa (Gaudelus and Franck, 2012) developed in Lyon and currently undergoing an efficacy study. More recently, "global interventions" have been developed. This kind of program tries to take into account all components of social cognition that are impaired in schizophrenia. In the area of global interventions, the American social cognition and interaction training (SCIT) program (Combs et al., 2007; Penn et al., 2007; Roberts and Penn, 2009) is a group-based intervention, delivered weekly over an 18-week period. SCIT is composed of three phases: emotion training (defining emotions, emotion mimicry training, and understanding paranoia), figuring out situations (distinguishing facts from guesses, jumping to conclusions, and understanding bad events), and integration (checking out guesses in real life). Its efficacy in improving social cognition components is now well established.

A recent meta-analytic study devoted to cognitive remediation of social cognition (Kurtz and Richardson, 2011), which took into account 19 studies, revealed that social cognitive training procedures have a moderate to large effect on emotion recognition, a small to moderate effect on ToM, but no effect on social perception or attributional bias. One of the most important findings in this study is that these kinds of techniques have a moderate to large effect on community functioning, which provides strong evidence for the transfer of training effects to daily social life. In another meta-analysis of social cognitive treatment for psychosis, Fiszdon and Reddy (2012) reviewed nearly 50 studies. They focused on proximal social cognitive effects, the durability of those effects, and their generalization to functional measures, and their conclusions are consistent with earlier findings by Kurtz and Richardson. They suggested that simple cognitive processes like emotion recognition can be definitively improved by social cognitive remediation, but that these techniques have limited success in remediating more complex, higher-order social cognitive functions. They proposed two potential reasons for these mixed results: a lack of consideration of and/or compensation for cognitive impairments, and limited opportunity to practice skills. We propose an alternative explanation, close to their second proposal: the lack of a real-world environment where patients are confronted with complex social interactions that take all components of social cognition into account. Social cognitive remediation programs today often focus on relatively simple associations and deductive reasoning about one of the components of social interaction. They use a hierarchical, step-by-step model in which each process is taught independently, and which therefore lacks the characteristics of real social situations. We assume that an integrated, natural remediation program close to the real social world could be helpful in improving higher-order social cognitive functions. Thanks to technological developments this kind of program has become feasible.

#### **ADVANTAGES OF COMPUTER TECHNOLOGIES IN COGNITIVE REMEDIATION OF SOCIAL COGNITION**

In the field of neurocognition, the use of specific computer programs for cognitive remediation has become very popular within the past few years (Lindenmayer et al., 2008; Franck et al., 2013; Vianin, 2013). These programs have many advantages, such as their ability to adapt the difficulty level to the specific skills of each patient, to give immediate feedback concerning performance, and to readjust the reinforcement methods used (Tomas et al., 2010). Moreover, according to a meta-analytic study on the efficacy and specificity of computer-assisted cognitive remediation in schizophrenia, prolonged multimedia stimulation is a factor that promotes neural plasticity (Grynszpan et al., 2011), a fundamental concept in cognitive remediation. Cognitive remediation appears to have an impact on the amount of gray matter over a 2-year period, in specific areas significantly related to improved cognition (Eack et al., 2010), and also seems to be related to greater efficiency of the interhemispheric transfer of information between the right and left prefrontal cortices (Penadés et al., 2013).

Very few computer programs have been used in social cognition therapy, probably because it is assumed that interacting with a machine does not involve social-skills. It seems, however, that social cognition can be learned directly via computer programs because computerized tasks often provide the opportunity to decompose and control the different processes at play in social interactions, and thus to offer a progressive training program regarding difficulty.

Other technologies, such as virtual reality or simulation techniques, are even better suited to improving social cognition, and to addressing one of the greatest challenges for cognitive remediation: the transfer of acquired abilities to everyday life following treatment and the generalization of treatment benefits to other social processes. Virtual reality can be defined as a set of techniques founded on real-time interaction with a virtual world that helps the learner to perform specific tasks defined by computer programs (Arnaldi et al., 2006). It offers the possibility of building realistic environments in 3D. Virtual reality or simulation techniques are already being used in several fields, not only in gaming and the military, but also in medical domains.

These techniques expose patients to complex, dynamic, interactive stimuli and serve to assess or remedy cognitive, behavioral, and functional impairments in tasks very close to those found in daily life. Psychiatry and neuropsychology are taking advantage of these technologies for treating specific phobias, post-traumatic stress disorder, attention deficit disorder in children, and test anxiety (for a review, see Gregg and Tarrier, 2007). For anxiety disorders, therapies using virtual reality are based on the principle of exposure, like traditional approaches. They have proven effective for treating several phobias and post-traumatic stress disorder (Riva, 2005; Gregg and Tarrier, 2007; Powers and Emmelkamp, 2008).

Although virtual reality is typically used as an exposure technique for treating the difficulties mentioned above, it has been recently applied to the study and treatment of schizophrenia. In a recent meta-analysis, Freeman (2008) counted seven applications of virtual social environments to schizophrenia: assessing symptoms, identifying symptom markers, determining predictive factors, testing putative causal factors, investigating different predictions of symptoms, searching for toxic elements in the environment, and developing treatments. In the field of assessment and treatment, virtual reality systems can provide a viable environment for individuals to interact with social avatars. According to Kim and Kim (2011), virtual reality may be one of the most promising tools for assessing social-skills because it minimizes bias resulting from traditional assessment methods.

People placed in a virtual environment are inclined to treat virtual individuals as real humans, which involves responding and interacting with them in a natural way in accordance with both the context and the affect it entails (Bailenson et al., 2003). These "humanization effects" were also reported for people with schizophrenia by Ku et al. (2006). In their study, each participant held a brief conversation with a virtual avatar while the experimenters measured the interpersonal distance between the patient and the avatar, and the patient's verbal response time. The results supported the assumption that the avatar was perceived as a real person by the patients with schizophrenia, suggesting that communication skills could be assessed and improved by this kind of procedure.

Virtual reality and simulation techniques can also be used to generate safe and harmless environments where patients can learn social-skills without negative repercussions on their real life, such as emotional frustration or a feeling of failure. Hence, in consideration of the stigma associated with mental illness, which can be highly detrimental to rehabilitation in people with psychiatric pathologies, utilizing a virtual environment without having to fear the negative consequences of the real world may be a favorable method (Kim and Kim, 2011). For all of these reasons, a number of clinical studies have been exploring the use of virtual reality to improve conversational and communication skills.

Since 2005, an American team has launched four projects aimed at taking advantage of simulation techniques to develop programs for improving social-skills among people with autism spectrum disorders. For these patients, impaired relational commitment is one of the most debilitating symptoms (Trepagnier et al., 2005). Preliminary data obtained with this kind of technique have highlighted the benefits gained by patients in the field of social cognition. In general, participants in studies using virtual reality or computerized techniques obtain higher scores on tests of ToM skills, make more appropriate judgments about pragmatically appropriate behavior, and show better recognition of facial expressions of emotion (cited in Trepagnier et al., 2011). Moreover, it seems that subjects' real-world interactions can also be improved by these techniques. In this vein, Kandalaft et al. (2013), who studied eight adults diagnosed with high-functioning autism, found a significant increase in social cognitive measures, as well as gains in social functioning (for example, maintaining a conversation in real life, understanding other people's point of view, and establishing relationships) after 10 sessions spread over 5 weeks of intervention.

Concerning schizophrenia, some research teams have begun to use these techniques as a therapeutic and social-skill training tool (Kim and Kim, 2011). Ku et al. (2007) designed an experiment to evaluate opinions of patients with schizophrenia about a virtual conversational skills training program. They acquired objective measures such as silence-breaking times (the duration from the beginning of a silence during a conversation to talk-button pressing) and subjective measures by using questionnaires of usability, opinions, and presence. Their results indicated that patients underwent the virtual conversation program without problem and that conversations with virtual avatars could be effectively presented to patients with schizophrenia. Currently, several clinical research teams are studying virtual reality applications in the field of social-skill training. In their clinical study, Park et al. (2011) compared social-skills training based on traditional role-playing to another method using virtual reality role-playing. They found that the virtual reality system was particularly good at improving conversational skills and assertiveness, and concluded that these techniques may be a useful supplement to traditional social-skill training. Moreover, virtual reality applications seem to be advantageous in terms of enhancing motivation for therapy. This result is particularly interesting because recent models of cognition suggest that motivation is a core component of the relationship between social functioning and cognition in schizophrenia (Barch, 2005).
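The silence-breaking time reported by Ku et al. (2007) is defined above as the delay between the start of a silence and the next talk-button press. The following is only an illustrative sketch of that definition; the function name, the timestamp format (seconds from the start of the conversation), and the rule of pairing each silence with the first later press are assumptions, not details taken from the original study.

```python
def silence_breaking_times(silence_starts, button_presses):
    """Return one delay per silence: first talk-button press at or after
    the silence onset, minus the onset. Timestamps are in seconds.

    Assumption: each button press can break at most one silence.
    """
    presses = sorted(button_presses)
    delays = []
    j = 0
    for start in sorted(silence_starts):
        # Skip presses that happened before this silence began.
        while j < len(presses) and presses[j] < start:
            j += 1
        if j == len(presses):
            break  # no press left to break this or any later silence
        delays.append(presses[j] - start)
        j += 1
    return delays


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    print(silence_breaking_times([2.0, 10.0], [3.5, 12.0]))
```

Shorter delays would indicate that the participant resumes the conversation more quickly; how such values were aggregated or interpreted in the original experiment is not specified here.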

A Spanish research team has developed and integrated a virtual reality program into social-skill training (Rus-Calafell et al., 2013). Their intervention is based on a brief cognitive-behavioral training in social-skills developed by the same team, in conjunction with the Soskitrain program, which consists of seven activities targeting seven behaviors: social perception, social information processing, responding and sending skills, affiliative skills, assertive communication, instrumental role skills, and conversational skills. The virtual reality program allows patients to practice social interactions with virtual avatars and promotes the gradual learning of the repertoire of social-skills – from more basic skills like facial emotion recognition to more complex ones such as holding a conversation. A case study that used this therapy showed a positive change in facial emotion recognition, social anxiety, and conversation time, along with improved interpersonal communication and assertiveness, and fewer negative symptoms (Rus-Calafell et al., 2012). The authors recently replicated their results in a pilot study of 12 persons with schizophrenia (Rus-Calafell et al., 2014).

However, programs that use virtual reality applications in schizophrenia often focus on social-skills and do not take specific processes of social cognition into account (such as emotion recognition, understanding of others' intentions, or analysis of the social context), which are in fact essential to appropriate social functioning. We developed a program we call RC2S (cognitive remediation of social cognition in schizophrenia) to take this difficulty into account. RC2S is aimed at training social cognition processes in a realistic environment, in line with both the social cognition profile of the individual and his/her specific functional outcomes.

#### **RC2S: A COMPUTER-BASED COGNITIVE REMEDIATION PROGRAM TO IMPROVE SOCIAL COGNITION**

RC2S was developed in France through collaboration between the Rehabilitation Department of Le Vinatier Hospital in Lyon (the authors of the present article) and SBT Company (headed by F. Tarpin-Bernard). This global remediation program was developed with two main goals in mind. The first was to stay close to the difficulties encountered by patients in their daily social life. To achieve this, we developed an individualized, flexible program that addresses both specific impairments in social cognitive processes and the objectives of each patient. The second goal, which follows from the fact that transferring skills or benefits acquired during cognitive remediation to daily life is hard, was to design a program based on data from clinical, psychological, and nursing interviews, and to make use of simulation technologies through collaboration with ITycom, a company specializing in virtual reality.

Treatment with RC2S runs for 14 weeks, at a pace of one 1.5 to 2-h session per week (see **Figure 2**).
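As a rough sketch of how the 14-week course breaks down, the phase lengths below are taken from the section descriptions that follow (two preparation sessions, 10 remediation sessions, two transfer sessions). The class and function names are hypothetical and serve only to check that the phases add up to the stated duration and session time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    sessions: int


# Phase lengths as described in the text; one session per week.
RC2S_PHASES = (
    Phase("preparation", 2),
    Phase("cognitive remediation", 10),
    Phase("transfer", 2),
)


def weekly_schedule(phases=RC2S_PHASES):
    """Return a list of (week_number, phase_name) pairs."""
    schedule = []
    week = 1
    for phase in phases:
        for _ in range(phase.sessions):
            schedule.append((week, phase.name))
            week += 1
    return schedule


def total_hours(n_sessions, min_hours=1.5, max_hours=2.0):
    """Range of total contact time for sessions lasting 1.5 to 2 h."""
    return n_sessions * min_hours, n_sessions * max_hours


if __name__ == "__main__":
    schedule = weekly_schedule()
    print(len(schedule), "weeks")
    print(total_hours(len(schedule)), "hours (min, max)")
```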

#### **PREPARATION SESSIONS**

The first two sessions are preparation sessions. In the first part of these sessions, the therapist and the patient look together at the social cognition assessment previously administered to the patient. The social cognition assessment battery used in RC2S (named Cla-CoS) was developed by a French psychiatry research team made up of psychiatrists and neuropsychologists (GDR 3557). There is currently no consensus about what measures provide the best indicators of a given social cognitive domain, and the majority of existing social cognition measures have poor psychometric properties (Pinkham et al., 2013). To address these problems, the GDR team started from the social cognition psychometric evaluation (SCOPE) work (Pinkham et al., 2013) done in the USA, and set out to identify the best existing French measures of social cognition and to develop others that did not exist, in order to obtain a battery suitable for assessing the domains of impaired social cognition in schizophrenia. Some of the existing measures were developed by the Versailles team, such as the LIS, which assesses the understanding of implicit intentions by using an ecological video-based task (Bazin et al., 2009). Examining a patient's social cognition assessment allows him/her to grasp his/her own profile, and to understand which of his/her own social cognitive components are impaired and which are preserved.

After this, the patient and his/her therapist investigate the functional outcomes of the impairments in the patient's daily life. This investigation is made possible by a specific tool, the Social Cognition-Functional Outcomes Scale (ERF-CS, developed by E. Peyroux and B. Gaudelus)<sup>1</sup>. The ERF-CS is composed of 14 items that depict different social situations in which each component of social cognition is likely to have an impact. Items are ordered according to the main process involved. The scale provides an overall score of functional outcomes ranging from 0 to 154, and 4 subscores, one for each social cognition process.
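The scoring structure just described (14 items, a total from 0 to 154, and four subscores) can be sketched as follows. This is only an illustration under explicit assumptions: the per-item maximum of 11 is inferred from 154 / 14 and assumes equally weighted items, and the allocation of items to the four processes shown here is hypothetical, since the actual ERF-CS item grouping is not given in the text.

```python
N_ITEMS = 14
MAX_TOTAL = 154
# Assumption: all items share the same range, so 154 / 14 = 11 per item.
MAX_PER_ITEM = MAX_TOTAL // N_ITEMS

# HYPOTHETICAL grouping of 0-based item indices by process. The text only
# says items are ordered by main process; the real allocation is unknown.
EXAMPLE_GROUPS = {
    "emotional processing": range(0, 4),
    "theory of mind": range(4, 8),
    "attributional style": range(8, 11),
    "social perception and knowledge": range(11, 14),
}


def score_erf_cs(item_scores, groups=EXAMPLE_GROUPS):
    """Return the overall score and one subscore per process."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores")
    for score in item_scores:
        if not 0 <= score <= MAX_PER_ITEM:
            raise ValueError(f"item score {score} outside 0..{MAX_PER_ITEM}")
    subscores = {
        name: sum(item_scores[i] for i in indices)
        for name, indices in groups.items()
    }
    return {"total": sum(item_scores), "subscores": subscores}


if __name__ == "__main__":
    print(score_erf_cs([11] * N_ITEMS)["total"])
```

Passing the grouping as a parameter keeps the sketch usable with the real item allocation once it is known.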

Together, the investigation of the patient's objective difficulties (through the analysis of the social cognition assessment) and the investigation of the repercussions of those difficulties on daily life allow the patient to set two or three concrete objectives that should contribute to improving his/her social functioning. The last session of the preparation phase provides specific information about social cognition, impairments in schizophrenia or related disorders, and functional outcomes of these deficits, via a psychoeducational document (developed by E. Peyroux and B. Gaudelus, see text footnote 1). The aim of the psychoeducational session is to allow the patient to understand the specific terms used in the field of social cognition and to increase motivation.

<sup>1</sup>https://wiki-afrc.org/afrc:documents

#### **SESSIONS OF COGNITIVE REMEDIATION**

Then, 10 sessions of cognitive remediation are proposed, each composed of four parts: (1) an analysis of the social interaction situation proposed, (2) a virtual reality or simulation scene, (3) a review of the situation, and (4) a proposal of a home-based task. Each of the 10 sessions deals with a social interaction situation that is described in a text and simulated in the computerized program. The situations are ranked in increasing order of difficulty, based on both the emotional or affective nature of the situation and the complexity of the characters' interactions. This allows us to abide by one of the fundamental principles of cognitive remediation: errorless learning, which requires adapting the difficulty of the exercises to the patient's abilities in order to prevent implicit learning of errors.

#### **Analysis of a social interaction situation**

The first part of the cognitive remediation session consists of analyzing the proposed social interaction between Tom, the character played by the patient during the therapy, and another character. No specific description of Tom is provided to the patient, so that he/she can imagine Tom as being as close as possible to himself/herself. The interaction is described in a short text and the patient's first task is to build a coherent mental representation of the situation by breaking it down into a microstructure. The patient has to be sensitive to the context and to take into account the links between his/her character, Tom, and the other character. In order to help him/her to decompose the situation, a "question wheel" is proposed. This part also allows the participant and the therapist to work on specific components of social cognition, such as ToM, by thinking about Tom's or the other character's mental states, intentions, or desires, and to find arguments supporting his/her choices. It also offers the opportunity to work on attributional bias by providing two possible outcomes for the situation: a positive one and a negative one. The patient has to explain the causes of the two outcomes based on external and internal attributions. Finally, some of the situations work on basic emotions (anger, fear, sadness, joy, etc.) by using photos and encouraging the patient to be more empathic with the different characters. The technologies used to build the program do not allow precise facial expressions to be rendered for each emotion. The focus in RC2S is more on body movements and emotional prosody. Therefore, RC2S may be combined with other material, such as pictures or movies, when facial emotions have to be considered.

#### **Simulation scene**

After this first analysis step, the patient sits down in front of a computer to work on the social interaction in real-time. We chose to use a standard screen rather than a dedicated immersion device for two reasons. First, as reported by Ku et al. (2006), immersion devices might be uncomfortable or difficult to control for people with schizophrenia, so a less immersive virtual environment is preferable. Second, immersion devices like CAVE (automatic virtual environment) and head mounted display (HMD) are relatively expensive, even if this may change in the future. Insofar as our aim is for RC2S to be used as widely as possible, like other cognitive remediation techniques, we chose standard equipment that is already available in all hospital wards.

During the simulation scene, the patient's goal is to assist Tom in the social situation and to guide him during the interaction by choosing a pattern of behavior (among those proposed) after each exchange between Tom and the characters (see **Figure 3**). To teach the ability to attribute intentions and mental states (ToM), and to analyze context (social perception), attitudes, prosody, and emotional dispositions (emotional processes), the patient is asked to select a direction for Tom's behavior, but without any clear explanations of Tom's speech. Moreover, during certain sequences, Tom's behavior is predefined by the program. The patient thus needs to adapt to the evolving situation.

The social interaction scene follows a predefined but flexible decision tree where the patient's choice influences Tom's progress and upcoming interactions. Each of the 10 scenes lasts between 10 and 20 min. For each interaction, three types of behaviors are suggested based on models from several social-skill training or self-affirmation programs:


In the RC2S program, the patient can browse through these three types of behavior in order to analyze the situation and also the interactions that follow.
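The predefined but flexible decision tree described above can be pictured as a tree of exchanges in which each choice of behavior leads to a different follow-up exchange. The sketch below is not the RC2S implementation: the class, the function, the behavior labels (`behaviour_1` to `behaviour_3`), and the example scene are all hypothetical, and the log it keeps is only one possible way to support the later review of the scene.

```python
from dataclasses import dataclass, field


@dataclass
class Exchange:
    prompt: str
    # Maps a behaviour label to the next exchange; empty means end of scene.
    choices: dict = field(default_factory=dict)


def run_scene(root, pick):
    """Walk the tree from `root`. `pick(exchange)` returns a behaviour label.

    Returns the log of (prompt, chosen label) pairs and the final prompt.
    """
    log = []
    node = root
    while node.choices:
        label = pick(node)
        if label not in node.choices:
            raise KeyError(f"unknown behaviour: {label}")
        log.append((node.prompt, label))
        node = node.choices[label]
    return log, node.prompt


# Hypothetical two-step scene with three behaviour options per exchange.
end_a = Exchange("Scene ends: outcome A")
end_b = Exchange("Scene ends: outcome B")
second = Exchange(
    "Second exchange",
    {"behaviour_1": end_a, "behaviour_2": end_b, "behaviour_3": end_b},
)
root = Exchange(
    "First exchange",
    {"behaviour_1": second, "behaviour_2": second, "behaviour_3": end_b},
)

if __name__ == "__main__":
    log, outcome = run_scene(root, lambda exchange: "behaviour_1")
    print(log, outcome)
```

Because every choice is logged with the exchange it answered, the same structure lets a scene be replayed with a different behavior at any step.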

In this step of the cognitive remediation session, the therapist does not intervene. The patient must verbalize the strategies employed to analyze the situation. The therapist has a caring attitude and should look for cues used by the patient to resolve the social interaction.

#### **Review of the situation**

The patient's performance during the virtual reality scene is recorded. This makes it possible to decompose the patient's behavior, interaction after interaction, which will be used in the third part of the cognitive remediation session: a review of the situation. The scene can be viewed as many times as desired and stopped at key times to allow the patient to take all of the social components (contextual information, tone of voice, gestures, and facial expressions) into account. This part helps the patient focus on specific components of social cognition. Depending on the impairments of each patient, the therapist can suggest that the patient analyze the characters' gestures or facial expressions, or listen to tones of voice in order to recognize emotion-specific inflections. The therapist can also ask the patient to do the scene over in order to generate specific mental states in the characters Tom is facing (for example, anger, envy, or discomfort), to find key components to recognize in a real-time interaction, and to think about the characters' possible reactions or intentions. This kind of adaptation to patients' deficits can only take place through simulation technology, which enables one to come close to real interactions, test several kinds of behavior, and examine one's interlocutor's reactions without negative repercussions on daily life.

#### **Home-based task**

The cognitive remediation session finishes with a proposal of a home exercise. Home tasks are chosen by the patient in collaboration with the therapist in order to promote motivation. Exercises take into account the participant's level of performance (tasks are thus of increasing difficulty, to allow the patient to experience consecutive successes) and are related to the concrete objectives defined at the beginning of the therapy. Tasks are adapted to the needs of the patient and take into account his/her daily reality. For example, at the beginning of the therapy, the therapist can suggest that the patient detect the emotions and intentions of people in movies, and then in a family context. Later, specific social interactions can be proposed to the patient and analyzed with the therapist. Finally, tasks such as organizing activities with friends or family can be suggested. Home exercises are useful to promote the transfer of strategies to daily life and their subsequent automation. This is consistent with learning theories: when cognitive remediation is provided in a rehabilitative context and reinforced in real-world settings, the learning process is facilitated, and generalization and transfer are promoted (Medalia and Saperstein, 2013). The home exercise is gone over with the therapist at the beginning of the next remediation session.

#### **TRANSFER SESSIONS**

Finally, two sessions are proposed at the end of the social cognitive remediation program. These sessions are transfer sessions. In these sessions, the therapist and the patient review the work done since the beginning of the therapy and look together at the achievement of the concrete objectives. These sessions also allow the patient to transfer skills acquired during the remediation program to his/her daily life by working close to his/her difficulties. Thanks to these transfer sessions, the patient can adapt the strategies to other social interactions. To promote the reinforcement of strategies, the patient sets two or three new concrete objectives at the end of the therapy.

#### **VALIDATION METHODOLOGY**

The cognitive deficits experienced by patients with schizophrenia or related disorders are very diverse. Yet, most studies on the effectiveness of cognitive remediation treatment are randomized controlled trials. This approach has the advantage of minimizing allocation bias and balancing both known and unknown prognostic factors in the assignment of treatments, which reduces the risk of spurious causal inferences. For example, in a very interesting randomized controlled trial by Wykes et al. (2007), symptoms of participants were rated by a psychiatrist who was unaware of group allocation and worked in a different building, and the study was carried out with blind randomization. This methodology also makes it possible to control for the attention given to patients and for increased social contact by using a control group similar to the therapy group in terms of focus, social contact, number, and duration of sessions. Moreover, randomized controlled trials can be combined in systematic reviews and are considered to be the most reliable form of scientific evidence. Nevertheless, this design also has some disadvantages because, depending on how heterogeneous the cognitive impairments are, group means cannot reflect the behavior of each person with schizophrenia (Shallice et al., 1991). Moreover, group studies have limited value when it comes to assessing the effectiveness of an intervention program for a given individual (Wilson, 2006). This is why single-case studies seem to be a better way to get as close as possible to the complexity of human social situations. Unfortunately, case study research is often misunderstood (for a summary of conventional wisdom regarding single-case studies see Flyvbjerg, 2006).

According to Tate et al. (2008), a single-case experimental design corresponds to "the intensive and prospective study of the individual, using an *a priori* methodology, which includes systematic observation, manipulation of variables, repeated measurements and data analysis." In neurology, the single-case methodology is widely used to study the impact of cognitive remediation programs, but this is not the case in psychiatry. Some authors, however, have suggested using a single-case experimental design to assess individualized interventions involving functional issues (Kurtz et al., 2001). In line with the aims of the RC2S program, we have launched a validation study using multiple single-case experimental designs. We are currently testing the program with four patients: two people with schizophrenia, one patient with schizoid personality disorder, and one patient with 22q11 deletion syndrome. A thorough assessment, including a complete evaluation of components of social cognition and of social functioning, as well as clinical and neuropsychological assessments, has been proposed for each patient enrolled in the study. According to the patient's profile and his/her objectives, we have collected three kinds of baselines before the beginning of the cognitive remediation intervention: baselines specific to the targeted component, non-specific baselines (such as measures of neurocognitive processes that should not be affected by the intervention), and intermediary baselines, that is, measures of social cognitive functions linked with the targeted processes but not directly addressed by the cognitive remediation program. These measures and the complete assessment will be repeated at the end of the intervention to highlight the impact of RC2S on social cognitive impairments, and 6 months later to investigate whether the benefits are maintained. We do not yet have objective results about the therapy; nevertheless, subjective reports from patients undergoing therapy are very positive. So far, all the participants have attended every session and are engaged in the therapy. They seem to develop social relationships by participating in activities outside their homes (such as sport, and activities with friends or relatives).

#### **CONCLUSION**

As described in the first part of this paper, social cognitive treatments seem to be more effective on basic processes of social cognition like emotion recognition than on more complex or higher-order social cognitive functions, such as attributional style or social perception. According to Fiszdon and Reddy (2012), in order to impact higher-order processes, patients need to have ample opportunity to practice skills until they become fully integrated and at least somewhat automatic. We postulate that the RC2S program, by using an environment close to the real social world that allows the patient to practice skills in specific social interactions, could have a greater impact on these complex social cognitive functions. To conclude, virtual reality and simulation techniques constitute promising tools to study social cognition and treat patients with impairments in these areas, and thus to favor rehabilitation in people with psychiatric disorders.

#### **ACKNOWLEDGMENTS**

The authors would like to thank Vivian Waltz for editing the manuscript. Funding: the program has been funded by the Research Scientific Council of Le Vinatier Hospital (CSR C07) and by SBT Company.

#### **REFERENCES**


in schizophrenia using an ecological video-based task: a comparison with manic and depressed patients. *Psychiatry Res.* 167, 28–35. doi:10.1016/j.psychres.2007.12.010


assessment, and research opportunities. *Schizophr. Bull.* 34, 1211–1220. doi:10.1093/schbul/sbm145


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 02 April 2014; paper pending published: 22 April 2014; accepted: 18 May 2014; published online: 13 June 2014.*

*Citation: Peyroux E and Franck N (2014) RC2S: a cognitive remediation program to improve social cognition in schizophrenia and related disorders. Front. Hum. Neurosci. 8:400. doi: 10.3389/fnhum.2014.00400*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Peyroux and Franck. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
