# USING NEUROPHYSIOLOGICAL SIGNALS THAT REFLECT COGNITIVE OR AFFECTIVE STATE

EDITED BY: Anne-Marie Brouwer, Thorsten O. Zander and Jan B. F. van Erp PUBLISHED IN: Frontiers in Neuroscience

ISSN 1664-8714 ISBN 978-2-88919-613-5 DOI 10.3389/978-2-88919-613-5

# Topic Editors:

**Anne-Marie Brouwer,** TNO Soesterberg, Netherlands

**Thorsten O. Zander,** Technical University of Berlin, Germany

**Jan B. F. van Erp,** TNO Soesterberg and University of Twente, Netherlands

Images by Laurens R. Krol, Technical University of Berlin, Germany.

What can we learn from spontaneously occurring brain and other physiological signals about an individual's cognitive and affective state and how can we make use of this information?

One line of research that is actively involved with this question is Passive Brain-Computer-Interfaces (BCI). To date most BCIs are aimed at assisting patients for whom brain signals could form an alternative output channel as opposed to more common human output channels, like speech and moving the hands. However, brain signals (possibly in combination with other physiological signals) also form an output channel above and beyond the more usual ones: they can potentially provide continuous, online information about an individual's cognitive and affective state without the need of conscious or effortful communication. The provided information could be used in a number of ways. Examples include monitoring cognitive workload through EEG and skin conductance for adaptive automation or using ERPs in response to errors to correct for a behavioral response. While Passive BCIs make use of online (neuro)physiological responses and close the interaction cycle between a user and a computer system, (neuro)physiological responses can also be used in an offline fashion. Examples of this include detecting amygdala responses for neuromarketing, and measuring EEG and pupil dilation as indicators of mental effort for optimizing information systems.

The described field of applied (neuro)physiology can strongly benefit from high quality scientific studies that control for confounding factors and use proper comparison conditions. Another area of relevance is ethics, ranging from dubious product claims, acceptance of the technology by the general public, privacy of users, to possible effects that these kinds of applications may have on society as a whole.

In this Research Topic we aimed to publish studies of the highest scientific quality that are directed towards applications that utilize spontaneously, effortlessly generated neurophysiological signals (brain and/or other physiological signals) reflecting cognitive or affective state. We especially welcomed studies that describe specific real world applications demonstrating a significant benefit compared to standard applications. We also invited original, new kinds of (proposed) applications in this area as well as comprehensive review articles that point out what is and what is not possible (according to scientific standards) in this field. Finally, we welcomed manuscripts on the ethical issues that are involved.

Connected to the Research Topic was a workshop (held on June 6, during the Fifth International Brain-Computer Interface Meeting, June 3-7, 2013, Asilomar, California) that brought together a diverse group of people who were working in this field. We discussed the state of the art and formulated major challenges, as reflected in the first paper of the Research Topic.

**Citation:** Brouwer, A.-M., Zander, T. O., van Erp, J. B. F., eds. (2015). Using Neurophysiological Signals that Reflect Cognitive or Affective State. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-613-5

# Table of Contents

*Editorial: Using neurophysiological signals that reflect cognitive or affective state*

Jan B. F. van Erp, Anne-Marie Brouwer and Thorsten O. Zander

Peter Gerjets, Carina Walter, Wolfgang Rosenstiel, Martin Bogdan and Thorsten O. Zander

Ricardo Chavarriaga, Aleksander Sobolewski and José del R. Millán

*Inference of human affective states from psychophysiological measurements extracted under ecologically valid conditions*

Alberto Betella, Riccardo Zucca, Ryszard Cetnarski, Alberto Greco, Antonio Lanatà, Daniele Mazzei, Alessandro Tognetti, Xerxes D. Arsiwalla, Pedro Omedas, Danilo De Rossi and Paul F. M. J. Verschure

*Developing an EEG-based on-line closed-loop lapse detection and mitigation system*

Yu-Te Wang, Kuan-Chih Huang, Chun-Shu Wei, Teng-Yi Huang, Li-Wei Ko, Chin-Teng Lin, Chung-Kuan Cheng and Tzyy-Ping Jung

*Knowledge-based identification of sleep stages based on two forehead electroencephalogram channels*

Chih-Sheng Huang, Chun-Ling Lin, Li-Wei Ko, Shen-Yi Liu, Tung-Ping Su and Chin-Teng Lin

*Brain fingerprinting classification concealed information test detects US Navy military medical information with P300*

Lawrence A. Farwell, Drew C. Richardson, Graham M. Richardson and John J. Furedy

*A brain-computer interface for potential non-verbal facial communication based on EEG signals related to specific emotions*

Koji Kashihara

Yuan-Pin Lin, Yi-Hsuan Yang and Tzyy-Ping Jung

*Hybrid fNIRS-EEG based classification of auditory and visual perception processes*

Felix Putze, Sebastian Hesslinger, Chun-Yu Tse, YunYing Huang, Christian Herff, Cuntai Guan and Tanja Schultz

Chris Dijksterhuis, Dick de Waard, Karel A. Brookhuis, Ben L. J. M. Mulder and Ritske de Jong

*Combining and comparing EEG, peripheral physiology and eye-related measures for the assessment of mental workload*

Maarten A. Hogervorst, Anne-Marie Brouwer and Jan B. F. van Erp


# Editorial: Using neurophysiological signals that reflect cognitive or affective state

Jan B. F. van Erp<sup>1,2</sup>\*, Anne-Marie Brouwer<sup>1</sup> and Thorsten O. Zander<sup>3</sup>

*<sup>1</sup> TNO Human Factors, Soesterberg, Netherlands, <sup>2</sup> Human Media Interaction, University of Twente, Enschede, Netherlands, <sup>3</sup> Team PhyPA, Technical University Berlin, Berlin, Germany*

Keywords: neurophysiology, cognitive state, mental state, affective state, EEG, Brain-Computer Interface, psychophysiology, physiological computing

Edited and reviewed by:

*Cuntai Guan, Institute for Infocomm Research, Singapore*

\*Correspondence: *Jan B. F. van Erp, jan.vanerp@tno.nl*

#### Specialty section:

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience*

Received: *08 April 2015* Accepted: *16 May 2015* Published: *29 May 2015*

#### Citation:

*van Erp JBF, Brouwer A-M and Zander TO (2015) Editorial: Using neurophysiological signals that reflect cognitive or affective state. Front. Neurosci. 9:193. doi: 10.3389/fnins.2015.00193*

# Introduction

The central question of this Frontiers Research Topic is: What can we learn from brain and other physiological signals about an individual's cognitive and affective state and how can we use this information? This question reflects three important issues which are addressed by the 22 articles in this volume: (1) the combination of central and peripheral neurophysiological measures; (2) the diversity of cognitive and affective processes reflected by these measures; and (3) how to apply these measures in real world applications.

# Neurophysiological Measures and Real World Applications

Let us first look at the last issue as it is an important driver of the current research and dictates choices related to the first two, for instance when a specific application requires sensor technology to be portable and easy to set up with little calibration time (Gerjets et al., 2014; Huang et al., 2014; Estepp and Christensen, 2015). The authors of this Research Topic describe a wide variety of applications. Many studies follow a so-called passive Brain-Computer Interface approach (Zander and Kothe, 2011): assessing typically covert aspects of the user state without interfering with the task the user is doing. Applications addressed in this volume include improved learning and training environments (Gerjets et al., 2014; Stikic et al., 2014), adaptive vehicle interfaces (Dijksterhuis et al., 2013; Touryan et al., 2014; Wang et al., 2014), classifying concealed information (Farwell et al., 2014), identifying sleep stages (Huang et al., 2014), and non-verbal communication for patients (Kashihara, 2014). Making these applications viable is not trivial. It requires experiments outside a well-controlled lab environment as initiated in several studies here (Dijksterhuis et al., 2013; Betella et al., 2014; Huang et al., 2014; Lin et al., 2014; Stuiver and Mulder, 2014) and knowledge and technological advancements in the area of classification algorithms that require little training data, are robust against environmental and physiological artifacts in the recorded signal, are able to work in real time and on single-trial data, and generalize over tasks.

Unlike a lab experiment where participants are usually instructed to sit still and to perform one single task, users in real world situations will usually be in motion, subject to changing cognitive load or emotional state, and/or be multitasking (Van Erp et al., 2012). Several authors focus on developing methods able to cope with these added complexities in life-like situations. For example, Betella et al. (2014) assess the reliability of psychophysiological signals reflecting emotional state when people are walking and gesticulating. Mühl et al. (2014) look into the effect of emotional state on workload measures, and show that cross-context classifier training is able to compensate for accuracy decline caused by changes in emotional state (especially so for indices in the frequency domain). Touryan et al. (2014) are able to model the time-on-task decrements in performance by analyzing neural activity. Chavarriaga et al. (2014), Vijgh et al. (2014) and Wang et al. (2014) give examples of real time classification. Vijgh et al. (2014) aim to assess stress state and adjust stressor intensity in a gaming environment in real time; Wang et al. (2014) are able to detect the effectiveness of a drowsiness warning system and continuous signatures of fatigue; Chavarriaga et al. (2014) develop single-trial error-related potential recognition in order to improve human-computer interaction; and Estepp and Christensen (2015) present initial proof that electrodes can be removed and replaced between sessions without dramatic loss in performance. Gerjets et al. (2014), Stikic et al. (2014), Stuiver and Mulder (2014) and Touryan et al. (2014) tackle the question of generalizability over tasks. This is an important issue because if the models and algorithms are highly task-specific, it would mean that each application has to start from scratch in choosing the optimal parameters, gathering data and building classification algorithms. Generic classifiers can possibly shorten the calibration time, increasing the usability and applicability. Stuiver and Mulder (2014) investigate and try to grasp differences in the physiological patterns of an ambulance dispatcher task and driving. Stikic et al. (2014) and Touryan et al. (2014) focus on the validity of their algorithms for different tasks: marksmanship and golf (Stikic et al., 2014), and driving and perceptual discrimination (Touryan et al., 2014). It is not surprising that many authors look at ways to further improve signal processing and classification techniques including neural networks (Casson, 2014; Stikic et al., 2014), feature fusion (Putze et al., 2014), and elastic nets (Hogervorst et al., 2014).

# Constructs of Interest

The wide range of foreseen applications also illustrates the broadness of the constructs of interest. Workload, stress, and emotion are classic examples, but there is also an increasing interest in the ability to determine what information the brain is processing (Pineda et al., 2013; Lin et al., 2014) and, for instance, whether this is visual or auditory (Putze et al., 2014), which in turn links to the multitasking challenge of real world applications. Additional contributions to this area come from studies into new paradigms to evoke specific states as discussed by Gerjets et al. (2014). For instance, Vijgh et al. (2014) describe an automated stress induction and control application and Brouwer and Hogervorst (2014) describe an elegant, simple and ethically acceptable way of inducing mental stress that results in physiological responses comparable to those induced by the Trier Social Stress Test.

# Combining Central and Peripheral Neurophysiological Measures

The complexity of the constructs to be measured, and the fact that (neuro)physiological indices only provide indirect access to them, forces this field to explore and combine measures from different sources. Quite novel are particular combinations of central and peripheral measures and combinations of different central sensors (more specifically EEG and NIRS). Strait and Scheutz (2014) start with an interesting analysis of what we can and cannot (yet) do with fNIRS, and Putze et al. (2014) provide an example that fusing EEG and fNIRS features can increase the accuracy of classification. Stikic et al. (2014) combine ECG and EEG derived measures to correlate physiological changes with training progress, and Hogervorst et al. (2014) determine the workload classification accuracy of EEG, ECG, skin conductance and eye-based measures, alone and in combination. This study contributes to the long-lasting dispute about the best workload measures in favor of EEG-derived indices. More particularly, information derived from a single electrode location (Pz) turns out to be an adequate workload predictor. Papers also demonstrate the use of a wide variety of indices from a single source; for instance, EEG is analyzed both in the time domain (P3 event related potentials, Farwell et al., 2014 and error-related potentials, Chavarriaga et al., 2014) and the frequency domain (as workload index, Mühl et al., 2014 or in both, Hogervorst et al., 2014). Finally, Gerjets et al. (2014) and Zander and Jatzev (2012) argue that contextual information can be a useful addition to neurophysiological measures.

# Concluding Remarks

The papers in this special issue show that the field of applied (neuro)physiology is progressing in many aspects, especially in EEG-based passive Brain-Computer Interfaces. This is not to say that there are no remaining issues, but the field seems to be well aware of the challenges it is facing (Gerjets et al., 2014; Strait and Scheutz, 2014; Brouwer et al., 2015) and the knowledge and technological breakthroughs it requires on the way to real-world applications. Also in order here is a word of caution: there are many potential pitfalls in using neurophysiological measures, as listed by Gerjets et al. (2014) and Brouwer et al. (2015). However, and as also indicated, there are ways around them. The steps currently taken in applied neurophysiology also touch on other issues, such as ethics, acceptance of the technology by the general public, privacy of users, and the possible effects that these kinds of applications may have on society as a whole. These are not covered in this issue, but should also be borne in mind (Van Erp et al., 2012).

# References

Betella, A., Zucca, R., Cetnarski, R., Greco, A., Lanatà, A., Mazzei, D., et al. (2014). Inference of human affective states from psychophysiological measurements extracted under ecologically valid conditions. Front. Neurosci. 8:286. doi: 10.3389/fnins.2014.00286

Brouwer, A.-M., Zander, T. O., van Erp, J. B. F., Korteling, J. E., and Bronkhorst, A. W. (2015). Using neurophysiological signals that reflect cognitive or affective state: six recommendations to avoid common pitfalls. Front. Neurosci. 9:136. doi: 10.3389/fnins.2015.00136


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 van Erp, Brouwer and Zander. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Using neurophysiological signals that reflect cognitive or affective state: six recommendations to avoid common pitfalls

Anne-Marie Brouwer<sup>1</sup>\*, Thorsten O. Zander<sup>2</sup>, Jan B. F. van Erp<sup>1,3</sup>, Johannes E. Korteling<sup>4</sup> and Adelbert W. Bronkhorst<sup>1,5</sup>

*<sup>1</sup> Perceptual and Cognitive Systems, Netherlands Organisation for Applied Scientific Research TNO, Soesterberg, Netherlands, <sup>2</sup> Team PhyPA, Biological Psychology and Neuroergonomics, Technical University, Berlin, Germany, <sup>3</sup> Human Media Interaction, Twente University, Enschede, Netherlands, <sup>4</sup> Training Performance Innovations, Netherlands Organisation for Applied Scientific Research TNO, Soesterberg, Netherlands, <sup>5</sup> Cognitive Psychology, VU University, Amsterdam, Netherlands*

# Edited by:

*José Del R. Millán, Ecole Polytechnique Fédérale de Lausanne, Switzerland*

#### Reviewed by:

*Fabio Babiloni, Sapienza University of Rome, Italy*

*Stephen Fairclough, Liverpool John Moores University, UK*

*Javier Minguez, University of Zaragoza, Spain*

#### \*Correspondence:

*Anne-Marie Brouwer, Perceptual and Cognitive Systems, Netherlands Organisation for Applied Scientific Research TNO, PO Box 23, 3769 ZG Soesterberg, Netherlands anne-marie.brouwer@tno.nl*

#### Specialty section:

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience*

Received: *24 December 2014* Accepted: *02 April 2015* Published: *30 April 2015*

#### Citation:

*Brouwer A-M, Zander TO, van Erp JBF, Korteling JE and Bronkhorst AW (2015) Using neurophysiological signals that reflect cognitive or affective state: six recommendations to avoid common pitfalls. Front. Neurosci. 9:136. doi: 10.3389/fnins.2015.00136*

Estimating cognitive or affective state from neurophysiological signals and designing applications that make use of this information requires expertise in many disciplines such as neurophysiology, machine learning, experimental psychology, and human factors. This makes it difficult to perform research that is strong in all its aspects as well as to judge a study or application on its merits. On the occasion of the special topic "Using neurophysiological signals that reflect cognitive or affective state" we here summarize often occurring pitfalls and recommendations on how to avoid them, both for authors (researchers) and readers. They relate to defining the state of interest, the neurophysiological processes that are expected to be involved in the state of interest, confounding factors, inadvertently "cheating" with classification analyses, insight on what underlies successful state estimation, and finally, the added value of neurophysiological measures in the context of an application. We hope that this paper will support the community in producing high quality studies and well-validated, useful applications.

Keywords: passive BCI, physiological computing, mental state estimation, affective computing, neuroergonomics, EEG, applied neuroscience

# Introduction

A Brain–Computer Interface (BCI) has commonly been defined as a communication system in which messages or commands that an individual sends are encoded from brain activity as for example recorded through EEG (Wolpaw et al., 2002). In this view, a BCI is seen as an alternative communication channel that can supplement or replace common channels such as speech, typing, or gestures. While using such a BCI requires active involvement of the individual, brain signals can also be recorded without the need of conscious or effortful communication (the user remains passive in that respect). This is referred to as passive BCI (Zander et al., 2008; Zander and Kothe, 2011). A passive BCI represents an output channel above and beyond the more usual ones that can be used, possibly in combination with other physiological signals, to provide continuous information in real time about an individual's cognitive or affective state (or "mental state"). The question of how to detect and use information about mental state is also being approached from other fields of research. Fields that are related to passive BCI are physiological computing (Fairclough, 2009; Fairclough and Gilleade, 2014), affective computing (Picard, 1997), augmented cognition (Schmorrow et al., 2009), and neuroergonomics (Parasuraman and Rizzo, 2007). The focus of this paper is on the general area of research as investigated in all of these interrelated fields.

Several mental states have been shown, or suggested, to be reflected in neurophysiological signals, and several types of use have been proposed. Examples include monitoring cognitive workload through cardiovascular measures for adaptive automation (Stuiver and Mulder, 2014) or using Event-Related Potentials (ERPs) in response to errors to correct for an erroneous action (Chavarriaga et al., 2014). While passive BCIs make use of online neurophysiological responses, neurophysiological responses may also be used in an offline fashion. Examples of this include using measures of workload to evaluate different systems (e.g., interface designs) or using measures of stress to evaluate interventions taken to reduce mental stress (Brouwer et al., 2013b).

Our impression is that generally, there is a strong belief that mental states can be well-inferred from neurophysiological signals, and easily harnessed in applications. This belief seems to be partly based on conclusions that are not always warranted or on potentially problematic generalizations. While similar problems play a role in other research fields, there are two specific aspects of research on neurophysiological signals for mental state estimation that aggravate these problems.

Firstly, the field is highly interdisciplinary by nature which makes it very difficult to design, conduct, and evaluate studies correctly with respect to all their elements. To realize a successful, validated demonstrator of a system using mental state as estimated from neurophysiological signals, we need expertise in sensor technology (targeted at easily wearable systems), signal processing, mathematical modeling, experimental design, psychophysiology, systems design, engineering, and knowledge of the targeted user group or field. It is an enormous challenge for scientists to oversee the state of the art in all these areas of expertise, and to integrate it successfully in their research, or accurately judge the quality of research performed by others.

Secondly, as has been observed in related fields, non-experts unjustifiably tend to regard neurophysiological data as conveying an objective truth. Recordings from the brain in particular possess an air of truth or objectivity that is often unfounded (Canli and Amin, 2002; Farah, 2002), as for instance in the suggestion that brain activity patterns reveal true emotional involvement, even if subjective reports indicate otherwise. Weisberg et al. (2008) asked a group of neuroscience experts and a group of non-experts to judge explanations of psychological phenomena. Explanations could be good or bad, and could contain no or irrelevant neuroscience information. They found that non-experts judged explanations with logically irrelevant neuroscientific information as more satisfying than explanations without. In particular, bad explanations profited from the addition of irrelevant neuroscientific information, indicating that for non-experts, neuroscience information could mask problems in explanations of psychological phenomena. With respect to neuroscience and education, Howard-Jones (2009) notes that brain-based educational ideas can be very popular in spite of the fact that their claims are not backed by scientific evidence. He argues for scientists to not only communicate skeptically amongst themselves, but also with the educational community. Researchers in the general field of applied neurophysiology should take this to heart and take care that they make a clear distinction between what can be concluded from their results now and what may eventually be possible. This is important not only when addressing a lay audience but also when addressing peers who may not be experts in all of the underlying areas of expertise in the field.

Quantification of mental state is, thus, a popular but difficult art, requiring integration of knowledge across different scientific fields. Major overarching disciplines are neurophysiology, which provides knowledge on the functioning of the nervous system and how this can be measured, and experimental psychology, which provides methods to discriminate and assess mental states. Furthermore, one needs advanced classification algorithms developed within the field of machine learning, and, last but not least, human factors expertise to devise, develop, and test real-world applications. From our own stumbling over problems not in our direct areas of expertise, and from discussions with laymen as well as peers, we gathered that it would be useful to highlight a number of pitfalls that not only occur relatively often but that also may lead to unfounded conclusions and claims. An overview of earlier research, including our own work, resulted in six such pitfalls as well as six recommendations on how to avoid them. They are discussed below and may serve to improve the design and execution of a study, as well as provide a checklist to keep in mind when reading and evaluating studies. A summary is provided in **Table 1**. An interesting, but not wholly surprising, finding is that most of the pitfalls occur in interdisciplinary regions linking the four scientific fields mentioned above. This is illustrated in **Figure 1**, which represents five of the six recommendations associated with the pitfalls and how they are linked with the underlying fields.

We hope that our list of recommendations will be useful for researchers working on mental state estimation based on neurophysiological signals, especially for those entering the field, and will help to maintain high standards in both fundamental and applied studies. This paper also highlights the papers published in the current Frontiers Research Topic "Using neurophysiological signals that reflect cognitive or affective state."

# Recommendations

# Define Your State of Interest and Ground Truth

To prevent confusion it is important to be clear from the outset about the mental state that you address. Mental states such as stress and workload usually have a long history in the (psychological) literature. Not all authors refer to the same concept when using the same word, which is partly dependent on the disciplines that researchers are coming from. It should be discussed how the addressed state has been defined in previous studies and which definition is adhered to in the study at hand. This may be guided by the specific application that the authors have in mind. It could be necessary to narrow down a specific concept (e.g., is the type of stress under investigation acute or chronic; Dickerson and Kemeny, 2004) or to refer to states that may underlie an overarching concept as is for example done for "engagement" in Fairclough et al. (2013). It is important to connect the mental state of interest to its operationalization in the study at hand, since this reflects what is being used or considered as the ground truth.

#### TABLE 1 | Summary of the six recommendations.

*Each recommendation and associated key points is more elaborately described in the corresponding subsections of Section Recommendations.*

Obtaining ground truth can be extremely complicated in studies on mental states. Cognitive and affective states are not always easily detectable from behavior. This difficulty in observing them is precisely the reason why neurophysiological variables are of interest and of potentially added value. However, it also poses a problem when one wants to relate these "invisible" states to neurophysiology. One straightforward solution is to ask individuals to subjectively report their state, but this approach seems to conflict with the notion that neurophysiological correlates of mental state are valuable precisely because they are insensitive to (un)intended subjective distortion. This notion is reflected in the description of neurophysiological correlates of mental state as "objective measures." However, it is not necessarily true that whenever a mismatch occurs between mental state as experienced subjectively and as indicated by neurophysiological measures, the latter is more reliable. Things are particularly fuzzy when it comes to mental states that are not experienced very clearly by individuals themselves. Even if you did not feel stressed, when your neurophysiological stress monitor says you are, it must be so (and you might even start to realize that actually you do feel kind of stressed)! Indeed, subjective reports cannot always be taken at face value, especially when individuals can be expected to be under pressure to provide a certain kind of answer (e.g., social desirability) or when subjective reports refer to experiences sometime in the past (Nisbett and Wilson, 1977). On the other hand, one could argue that, since affective states are effectively about the feelings of an individual, subjective reports are closer to the ground truth than neurophysiological correlates. For applications that claim to estimate mental state, the fuzziness of ground truth makes it difficult to judge their validity at face value.
Discrepancies between estimated mental state and subjectively experienced mental state could be due to errors of the application or due to the unreliability of introspection.

While determining ground truth of mental state is a difficult and perhaps unsolvable problem, validation of neurophysiological estimates can be performed to some extent by relating them to several different measures of mental state (Mauss and Robinson, 2009). Three broad categories can be distinguished. Firstly, behavioral measures such as button press accuracy (Mühl et al., 2014). Secondly, subjective measures such as responses on known scales for arousal and workload (Mühl et al., 2014) or for different types of emotion (Kashihara, 2014). Thirdly, knowledge of the condition that individuals are currently in, for example whether participants are performing a difficult or easy task (Mühl et al., 2014), whether a certain face stimulus has been paired with an aversive stimulus before or not (Kashihara, 2014), or whether individuals will undergo eye laser surgery in a few minutes or not in a study about stress (Hogervorst et al., 2013). Expert judgments can also be taken as, or contribute to, ground truth, where expert judgments can relate to all or some of the three categories. Expert judgments are also used in relation to neurophysiological measures (e.g., visual interpretation of neurophysiological measures is used to assess sleep stage: Huang et al., 2014).

# Connect Your State of Interest to Involved Neurophysiology

When trying to estimate cognitive or affective state based on neurophysiological signals, one aims to connect a certain psychological state to certain physiological signals. However, it is not a priori clear how, or to what extent, states defined from a psychological point of view map onto neurophysiology. Cacioppo and Tassinary (1990) describe four possible types of relation between psychological elements and physiological responses: one-to-one (i.e., one psychological element such as "attention" only maps to one physiological response, such as heart rate deceleration), one-to-many (attention also maps to skin conductance response), many-to-one (relaxation and attention map to heart rate deceleration) and many-to-many (attention and relaxation map to heart rate deceleration and skin conductance response). In her review on autonomic nervous system activity in emotion, Kreibig (2010) describes different views on how and to what extent physiological signals map onto emotions. While the experience of emotion goes together with physiological processes, one of the lessons to be taken from this review is that it is not only the emotion but also, or perhaps even mostly, the associated (future) action that defines the physiological response. Physiological processes have not evolved in order to provide signals that inform us about mental state, but in order to adapt and prepare the body for certain activity. This may be at the root of a counterintuitive finding that while (high arousal) fear will generally result in an increase of heart rate, presenting negatively valenced perceptual stimuli usually results in a decreasing heart rate (Bradley and Lang, 2000, see also discussion and results Brouwer et al., 2013c). 
This has been explained by the fact that while fearful pictures do not form a physical threat which would be followed by fight or flight, they do elicit heightened attentional processing which is associated with (a static posture and) a decreased heart rate (Lacey and Lacey, 1970; Graham, 1992; Codispoti et al., 2001). In a similar vein, sadness can be associated with autonomous signals associated with overt activity or withdrawn behavior resulting in a need to distinguish between "crying and non-crying sadness" (Kreibig, 2010). Reconceptualization of psychological and physiological processes may be required to design better models of the relation between mental states (psychological elements) and physiology (Cacioppo and Tassinary, 1990). Findings as described in the literature can be used to formulate hypotheses as to which neurophysiological measures are expected to vary in what way with the mental state of interest. This will help identify useful variables or features for training a mental state estimation classification model (see Section Adhere to Good Classification Practice). In addition, this knowledge can and should be used to examine whether the mental state estimation model is functioning as expected (see Section Provide Insight into the Cause of Classification Success).

# Eliminate Confounding Factors (or at Least, Do Not Ignore Them)

When designing a study or evaluating an application, one should check whether the examined mental states of interest co-vary with other factors that a priori are not informative of the state of interest. If these confounding factors affect neurophysiological variables on their own, their effect can easily be mistaken for an effect of the state of interest.

A factor that can often explain observed differences between mental states is a difference in body movement. These body movements can, for instance, consist of button presses, turning a steering wheel, eye movements, or movements related to speech. Body movements can affect neurophysiological recordings through artifacts caused by subtle movements of the sensors or wires (Strait and Scheutz, 2014), by "real" effects on the neurophysiological signals such as electrical potential differences caused by rotating the eyes being reflected in EEG signals, or by "real" effects on the neurophysiology itself such as an increased heart rate or increased activity of the motor cortex when planning and executing movements.

Another type of confound is related to other mental states or processes co-varying with the state of interest. For example, when one aims to increase workload by increasing the number of visual stimuli, observed differences in EEG may not (only) be caused by a difference in workload but by the fact that the brain is processing more visual information in the one case than in the other (Brookings et al., 1996). Another often occurring example in this category is that studies aiming to find neurophysiological markers of emotions do not keep the arousal levels of the examined emotions equal, so that reported neurophysiological correlates of types of emotions may actually reflect levels of arousal (Oliveira et al., 2011).

Finally, it is important to realize that factors varying with time can have a huge impact on neurophysiological variables (e.g., Brouwer et al., 2014; Touryan et al., 2014). This can be a problem when experimental conditions co-vary with time relative to the onset of the study. If a difficult condition is presented before an easy condition, the higher heart rate in the former may not be related to task difficulty but to not yet being used to the experimental setting, previous physical activity, or other unknown time-related effects. Providing participants with practice or habituation time before the actual recordings start may alleviate time-related effects.

From the perspective of fundamental science, the best way to deal with the effects of confounds is by the design of the study. Different conditions ideally only differ with respect to the examined state of interest. Different levels of mental state ideally should be present repeatedly, at different moments in time over the course of the recording session. However, complete control is not always possible or desirable. In ambulatory psychophysiological studies the aim is to investigate neurophysiological signals in the context of potential application or in contexts that are more ecologically valid than studies in the laboratory (Turpin, 1990; Picard, 1997). This usually conflicts with the attempt to control for confounds. When confounds cannot be avoided, examining the data can clarify whether observed effects of mental state on neurophysiological variables are likely due to the varying mental state of interest or likely due to confounds. For instance, if the amount and type of body movements (speech, eye-, and hand movements) as indicated by suitable systems and measures are the same between conditions, they are an unlikely explanation for differences between conditions in neurophysiology (Betella et al., 2014). If there are differences, in some cases one could post hoc select samples of data to fulfill this condition. For instance, when examining the differences between ERPs as elicited by looking at a target object vs. a non-target object in a search task where people were free to move their eyes, Kamienkowski et al. (2012) selected eye fixations in such a way that the preceding eye movement was the same in length and direction for both targets and non-targets. In this way they ensured that differences between target- and non-target ERPs could not be explained by effects of differences in eye movements.
Perhaps most importantly, and as touched upon before, neurophysiological differences between the examined states should be examined to check whether they fit the previously defined expected effects, or whether they hint at the effect of confounds. For example, when the goal is to distinguish between processing of stimuli in different auditory and visual modalities, data should at least show differential brain activity over the visual and auditory cortex (Putze et al., 2014). As another example, while one would expect differences in workload to be reflected by power changes in the alpha and theta band (Klimesch, 1999; Fink et al., 2005; Brouwer et al., 2012), differences in workload reflected by power changes in high EEG frequencies at frontal electrodes are likely caused by a difference in muscle activation (Whitham et al., 2007). This could mean that classification can be based on differences in frowning between low and high workload, rather than on differences in brain signals. If desirable, it is possible to correct to some extent for the likely contribution of muscle activation to classification results by excluding EEG high frequency features from the classification analysis (Dijksterhuis et al., 2013).

# Adhere to Good Classification Practice

Classification analysis is an indispensable tool for estimating mental state, especially when high dimensional signals such as EEG signals are concerned. Traditional applied neurophysiology research typically uses group analyses to study neurophysiological correlates of mental state: neurophysiological variables and signals are averaged over multiple time intervals and individuals, and using statistical tools such as t-tests or ANOVAs it is determined whether varying mental state significantly affects neurophysiology. While this research and these methods are suitable to study the relation between mental state and neurophysiology in general, they would not suffice for a range of applied settings. If EEG frontal alpha power is significantly lower in high compared to low workload conditions for a group of experimental participants, we do not know whether this effect would be strong and consistent enough for estimating workload over a short time interval for a single individual. However, this is exactly what would be required if the information is to be used in adaptive automation. BCIs, that require short samples of brain signals of single individuals to be reliably translated into an intended action of a computer, welcomed classification techniques into the realm of neuroscience. These techniques had been successfully applied already in fields such as image and speech recognition. Lotte et al. (2007), van Gerven et al. (2009), Domingos (2012), and Lemm et al. (2011) provide easily readable reviews on classification. In short, (supervised) classification models are trained using samples of neurophysiological data that are labeled according to the states of interest (e.g., "low workload" and "high workload"). Subsequently, these trained models are used to label new, unseen neurophysiological data. 
If the label of this unseen data is known, the label as estimated by the classification algorithm can be compared to the actual label, and performance of the classifier can be determined. Subsequently, proper statistical analyses should be performed to interpret and evaluate classification performance (see e.g., Mueller-Putz et al., 2008).
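As a minimal sketch of one such statistical check, the following computes an exact one-sided binomial p-value for an observed classification accuracy against the chance level. This is a generic test chosen for illustration and is not necessarily the procedure of the cited work; the function names and the example numbers (60 correct out of 100 trials, two balanced classes) are hypothetical.

```python
from math import comb


def binomial_p_value(n_correct, n_trials, chance=0.5):
    """One-sided exact binomial p-value: P(X >= n_correct) under pure guessing."""
    return sum(
        comb(n_trials, k) * chance**k * (1.0 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def min_correct_for_significance(n_trials, chance=0.5, alpha=0.05):
    """Smallest number of correct trials whose p-value falls below alpha."""
    for n_correct in range(n_trials + 1):
        if binomial_p_value(n_correct, n_trials, chance) < alpha:
            return n_correct
    return None  # too few trials to ever reach significance


# Hypothetical example: 100 test trials, two balanced classes.
p = binomial_p_value(60, 100)
needed = min_correct_for_significance(100)
print(f"p-value for 60/100 correct: {p:.4f}")
print(f"correct trials needed for p < 0.05: {needed}")
```

The point of the sketch is that with a limited number of test trials, an accuracy must lie noticeably above the nominal chance level before it can be distinguished from guessing.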

While the classification procedure as described may seem simple enough, there is a multitude of options to be chosen and potential mistakes to be made that may lead to failure of successful classification or to overly optimistic results. Lemm et al. (2011) give a helpful summary of how machine learning is, or can be, unintentionally abused in brain imaging.

An important potential reason for overly optimistic results is that data used to train the model is not independent of data used to test the trained model. This links to the previously discussed effect of time-related factors (often referred to as nonstationarities). As an example, consider an experiment in which the level of workload is alternated in four blocks of 5 min. If a model is trained with 50% of 1-s randomly drawn samples, and tested using the remaining 50%, the classification success will be inflated since training and testing data from the same class are often close in time. Simply because of this fact, they will be similar rather than being similar because they originate from the same workload condition.
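The example above can be sketched as a small simulation. It assumes a single hypothetical feature that drifts slowly over the session and contains no workload effect at all, four 5-min blocks of 1-s samples with alternating workload, and a 1-nearest-neighbour classifier; all of these choices are illustrative assumptions, not a description of any cited study.

```python
import random


def simulate_feature(n_blocks=4, block_seconds=300, noise_sd=0.1, seed=0):
    """Return (features, labels, block_ids) for 1-s samples.

    The feature is a slow linear drift plus noise. It contains no workload
    effect at all, so any above-chance accuracy is an artifact.
    """
    rng = random.Random(seed)
    n = n_blocks * block_seconds
    features, labels, blocks = [], [], []
    for t in range(n):
        block = t // block_seconds
        features.append(n_blocks * t / n + rng.gauss(0.0, noise_sd))
        labels.append(block % 2)  # workload alternates: low, high, low, high
        blocks.append(block)
    return features, labels, blocks


def nearest_neighbour_accuracy(features, labels, train_idx, test_idx):
    """Accuracy of a 1-nearest-neighbour classifier on a single feature."""
    correct = 0
    for i in test_idx:
        nearest = min(train_idx, key=lambda j: abs(features[j] - features[i]))
        correct += labels[nearest] == labels[i]
    return correct / len(test_idx)


features, labels, blocks = simulate_feature()
indices = list(range(len(features)))

# Split 1: random 50/50 split of 1-s samples (train and test interleaved in time).
shuffled = indices[:]
random.Random(1).shuffle(shuffled)
half = len(shuffled) // 2
random_acc = nearest_neighbour_accuracy(features, labels, shuffled[:half], shuffled[half:])

# Split 2: block-wise split (train on blocks 0-1, test on blocks 2-3).
train_idx = [i for i in indices if blocks[i] < 2]
test_idx = [i for i in indices if blocks[i] >= 2]
blocked_acc = nearest_neighbour_accuracy(features, labels, train_idx, test_idx)

print(f"random split accuracy:     {random_acc:.2f}")
print(f"block-wise split accuracy: {blocked_acc:.2f}")
```

Under these assumptions, the random split should score well above chance although the feature carries no information about workload, whereas the block-wise split, in which test samples are never adjacent in time to training samples, should stay near chance.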

The number of free parameters that need to be set when performing classification analysis is very large, varying from the type of classification algorithm and feature selection procedure, to criteria for outlier rejection and settings of hyperparameters specific to the exact pre-processing and classification procedures. This links to another important cause of inflated classification accuracy, which is reporting only the pre-processing and parameter settings that happened to produce the best classification performance. This good performance will partly be due to chance and not be reproducible. Whether or not this occurred for a specific study can be difficult to judge from information that is usually reported in a paper. It is important to choose such parameters separately from the test set that is used in the end to estimate classification performance (nested cross-validation; Lemm et al., 2011).
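The selection effect described here can be illustrated with a toy simulation on pure-noise data, so that the true accuracy of every setting is chance. In this sketch, each of 200 hypothetical "settings" simply yields a different random feature, and a nearest-class-mean classifier stands in for the classification pipeline; these are assumptions made for illustration only.

```python
import random


def make_noise_data(n_samples=80, n_settings=200, seed=0):
    """Pure-noise data: each 'setting' yields one feature unrelated to the labels."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    rng.shuffle(labels)
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_settings)]
    return features, labels


def accuracy(feature, labels, train_idx, test_idx):
    """Nearest-class-mean classifier on one feature, trained on train_idx only."""
    means = {}
    for c in (0, 1):
        values = [feature[i] for i in train_idx if labels[i] == c]
        means[c] = sum(values) / len(values)
    hits = 0
    for i in test_idx:
        predicted = min((0, 1), key=lambda c: abs(feature[i] - means[c]))
        hits += predicted == labels[i]
    return hits / len(test_idx)


def folds(indices, k):
    """Split indices into k interleaved folds."""
    return [indices[f::k] for f in range(k)]


def cv_accuracy(feature, labels, indices, k=4):
    """Mean k-fold cross-validated accuracy within the given indices."""
    scores = []
    for test_idx in folds(indices, k):
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        scores.append(accuracy(feature, labels, train_idx, test_idx))
    return sum(scores) / len(scores)


features, labels = make_noise_data()
all_idx = list(range(len(labels)))

# Biased: pick the setting with the best cross-validated accuracy and report it.
biased_acc = max(cv_accuracy(f, labels, all_idx) for f in features)

# Nested: choose the setting inside each outer training set, then score it
# once on the outer test fold that played no role in the choice.
outer_scores = []
for test_idx in folds(all_idx, 5):
    held_out = set(test_idx)
    train_idx = [i for i in all_idx if i not in held_out]
    best = max(features, key=lambda f: cv_accuracy(f, labels, train_idx))
    outer_scores.append(accuracy(best, labels, train_idx, test_idx))
nested_acc = sum(outer_scores) / len(outer_scores)

print(f"best-of-many accuracy (biased): {biased_acc:.2f}")
print(f"nested cross-validation:        {nested_acc:.2f}")
```

Because the data contain no signal, any accuracy clearly above 50% in the first number reflects selection among many settings rather than a real effect; the nested estimate is expected to remain close to chance.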

The problems indicated above do not exist for studies that apply trained classification models in real time, as is increasingly becoming the standard in BCI research, since in that case making use of time-related effects is not possible and classification performance is unmistakably associated with previously determined parameter settings.

# Provide Insight into the Cause of Classification Success

While classification analyses are very useful, they are essentially black box analyses. Data are used to train a model which subsequently turns out to be successful or not in properly classifying new data, but if it works it is largely unknown what is at the basis of success. It is therefore important to not only present information about classification results but also about the way that neurophysiological processes underlying the different categories differ. This provides a check as to whether the data are as expected or whether a confound could be responsible for classification success.

Note that results of traditional and classification analyses do not need to overlap exactly. For example, when a certain neurophysiological variable varies both with the mental state of interest and with time, a paired t-test can indicate a strong effect of mental state while most classification models based on the same information are expected to perform badly because of the time effects. This is because a traditional statistical procedure averages out general time effects while this is not straightforward for classification analyses. This difference between the two types of analyses is a possible explanation for the counterintuitive finding in an experiment that varied workload using a task that changed difficulty level every 2 min, for 24 times. While heart rate was a poor feature in workload classification analysis (Hogervorst et al., 2014), an ANOVA indicated that heart rate very reliably increased with mental workload (Brouwer et al., 2014). For these data, a strong decrease in heart rate over the time course of the experiment was found (Brouwer et al., 2014). This effectively means that a sample of high heart rate data could originate from the start of the experiment or from a high workload condition.
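The contrast between the two types of analysis can be made concrete with a small simulation. All numbers below are invented for illustration (24 blocks, a 2 bpm workload effect, a downward drift of 0.8 bpm per block) and are not the data of the cited studies; the classifier is a simple hypothetical threshold on heart rate.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_heart_rate(n_blocks=24, effect=2.0, drift_per_block=0.8, noise_sd=0.5, seed=0):
    """Per-block mean heart rate: small workload effect on top of a strong downward drift."""
    rng = random.Random(seed)
    heart_rate, high_workload = [], []
    for block in range(n_blocks):
        pair = block // 2
        # Counterbalance order within each pair of blocks: low-high, high-low, ...
        is_high = (block % 2 == 1) if pair % 2 == 0 else (block % 2 == 0)
        value = 80.0 - drift_per_block * block + effect * is_high + rng.gauss(0.0, noise_sd)
        heart_rate.append(value)
        high_workload.append(is_high)
    return heart_rate, high_workload


def paired_t(heart_rate, high_workload):
    """Paired t statistic on (high - low) differences of neighbouring blocks."""
    diffs = []
    for first in range(0, len(heart_rate), 2):
        a, b = heart_rate[first], heart_rate[first + 1]
        diffs.append(b - a if high_workload[first + 1] else a - b)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def threshold_accuracy(heart_rate, high_workload):
    """Label a block 'high' if heart rate exceeds the midpoint of the two class means.

    The threshold is fitted on the same data, so this is an optimistic estimate.
    """
    high = [hr for hr, h in zip(heart_rate, high_workload) if h]
    low = [hr for hr, h in zip(heart_rate, high_workload) if not h]
    threshold = (mean(high) + mean(low)) / 2.0
    hits = sum((hr > threshold) == h for hr, h in zip(heart_rate, high_workload))
    return hits / len(heart_rate)


heart_rate, high_workload = simulate_heart_rate()
print(f"paired t statistic:        {paired_t(heart_rate, high_workload):.1f}")
print(f"single-block accuracy:     {threshold_accuracy(heart_rate, high_workload):.2f}")
```

In this sketch the paired comparison cancels most of the drift because each difference is taken between neighbouring blocks, whereas the threshold classifier sees only the absolute heart rate of one block, which is dominated by the time at which it was recorded.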

Conversely, classification analysis could indicate that certain features are very informative for estimating mental state, while this does not seem to be the case when examining results of traditional analyses. This can be the case when the way that certain neurophysiological features are associated with mental state is very different for one individual compared to another, while the association is very consistent within individuals. This has been suggested to be the case for EEG signals associated with workload (Grimes et al., 2008). It may also hold when examining a range of different neurophysiological responses to complex situations that allow for different strategies or coping styles. The announcement of a camera crew arriving in 2 min to take an interview about your research may lead to some kind of stress for virtually all people, but while one individual will quickly and intensely start thinking about the messages she wants to convey, the same event will elicit a pure fright response in another.

Combining the two types of analyses gives us insight into the neurophysiological processes underlying cognitive and affective states. A strong association of a neurophysiological variable with mental state according to a traditional approach indicates that there is a reliable association in the same direction across individuals (possibly on top of other factors that play a role but that are equally strongly present in the different mental states, e.g., time-varying factors). A strong association of a neurophysiological variable with mental state according to an individually tailored classification analysis indicates that there is a strong association in the same direction within that individual, and that the variable is not strongly affected by other factors or occurring events. Offline classification analyses based on different (combinations of) features are arguably the most straightforward approach for finding out exactly which features contribute most (Hogervorst et al., 2014). Other methods like independent component analysis (ICA; Makeig et al., 1996) or transforming the classification backward model into a forward model (Haufe et al., 2014) can lead to a spatial interpretation of the signal.
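To illustrate the backward-to-forward transformation for the simplest case of a single linear filter, the sketch below computes the activation pattern as the data covariance times the filter weights, divided by the variance of the filter output. The two-channel toy data, the variable names, and the choice of filter are assumptions made for illustration.

```python
import random


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def activation_pattern(channels, weights):
    """Forward-model pattern a = Cov(X) w / Var(w^T X) for a single linear filter w."""
    n_channels = len(channels)
    n_samples = len(channels[0])
    # Output of the backward model (the extracted component) at each sample.
    component = [
        sum(weights[c] * channels[c][t] for c in range(n_channels))
        for t in range(n_samples)
    ]
    var_component = covariance(component, component)
    pattern = []
    for c in range(n_channels):
        cov_xw = sum(covariance(channels[c], channels[k]) * weights[k] for k in range(n_channels))
        pattern.append(cov_xw / var_component)
    return pattern


rng = random.Random(0)
n = 2000
signal = [rng.gauss(0.0, 1.0) for _ in range(n)]      # state-related source
distractor = [rng.gauss(0.0, 2.0) for _ in range(n)]  # unrelated source, e.g., an artifact
channel_1 = [s + d for s, d in zip(signal, distractor)]  # picks up both sources
channel_2 = distractor[:]                                 # picks up the distractor only

weights = [1.0, -1.0]  # filter that cancels the distractor and recovers the signal
pattern = activation_pattern([channel_1, channel_2], weights)
print("filter weights:    ", weights)
print("activation pattern:", [round(a, 2) for a in pattern])
```

In this toy case the filter assigns a large weight to the second channel even though that channel contains no state-related signal; it is used only to cancel the distractor. The activation pattern, by contrast, should be close to one for the first channel and close to zero for the second, which is the quantity that lends itself to a spatial interpretation.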

# Provide Insight into the Added Value of Using Neurophysiology

Carefully controlled experiments allow us to verify which neurophysiological measures are connected to mental state and in what way. However, in an applied research field where we want to use these signals, this is not enough. It should also be pointed out under which circumstances and in what way these signals are envisioned to be helpful. This is not always easy since one should realize that usually, there are alternative ways to retrieve the desired mental state information. Why would one use neurophysiological signals to estimate how well people like a product if they can also be asked—is there empirical evidence that this will help to better predict which product they will buy? Why would one use neurophysiological signals to estimate driver's workload if distance to the lines on the road is informative as well? While the idea of passive BCI is that we have a channel of information available "for free" since users do not need to spend attention or conscious effort to convey information about their mental state, there are costs involved with respect to buying and wearing the sensor equipment, calibration procedures, etc. Relative to other sources of information, these costs can (as of yet) be quite high, especially for brain signals. It should thus be explained that, and how, they potentially add value. Note that this could also be in the context of a combination of different types of information. See for example Huang et al. (2007) who combined ERP and button press responses for detecting targets within series of rapidly presented images, and Lin et al. (2014) who combined characteristics of music with listeners' EEG to estimate emotion.

### Confounds

In real life applications, the discussion on confounds touches the discussion on alternative sources of information about cognitive and affective state. The confounds that we carefully want to exclude in experimental situations in order to verify that and investigate how neurophysiology is connected to mental state, are abundant in real life and can actually be used if these confounds are reliably present in both the model training and the application data. For instance, if a high workload situation in air traffic control reliably co-varies with more arm movements because of button presses, more detailed information presented on the screen, and more verbal communication compared to a low workload situation, EEG is expected to co-vary with workload because of movement artifacts as well as activity in the visual cortex, and breathing variables are expected to co-vary with workload too. Thus, EEG and breathing can be used to estimate the workload situation that the controller is in. This is so because even though in principle, movement artifacts, visual information processing and speech are unrelated to workload, they are related for the specific situation at hand (air traffic control). In such cases, one should take care to define conclusions or claims properly (i.e., such as not to claim that differences in EEG are caused by different brain processes associated with workload where actually, it is a difference in movement that is responsible for the effect). Secondly, it should be realized that the trained model will not generalize to workload situations with other sensory input and motor output than the examined situation. Finally, an important question to answer when confounds are responsible for the effect is whether it is conceivable that neurophysiological data will improve mental state estimation over and above estimation based on measuring the confounding information in an easier way. 
For instance, the number of button presses, the information presented on the screen and a microphone indicating whether the air traffic controller is talking could in this case be just as or more effective to estimate workload than EEG and easier to collect.

# Characteristics of Applications That Likely Benefit from Neurophysiological Measures for Mental State Estimation

In general, applications using information about mental state as estimated on the basis of neurophysiology are likely to be of added value if firstly, alternative measures of mental state are not available, unreliable or difficult to obtain, and secondly, mental state as estimated by neurophysiological signals is relatively reliable. Relatively reliable information from neurophysiological signals is expected when there is little noise due to body movements and when estimates can be based on large amounts of data. Also, relatively reliable information is expected when the relation between neurophysiology and the mental state of interest is clear and well-established. Fundamental neurophysiological studies, even if only based on group effects, indicate which neurophysiological variables reflect which mental state. Emotional valence and arousal can be indicated by peripheral measures such as heart rate and skin conductance (Mauss and Robinson, 2009; Brouwer and Hogervorst, 2014; van der Vijgh et al., 2014). Workload has been shown to be measurable through EEG (even though for many studies, effects may be completely or partly caused by confounds—see Gerjets et al., 2014 for a discussion on this). Fatigue and behavioral lapses have been shown to relate to EEG signals (Wang et al., 2014) and variables related to eye blinks (Schleicher et al., 2008). Errors can be detected by analyzing event-related potentials occurring after an error on a single-trial basis (Chavarriaga et al., 2014). In addition, active BCI studies for which (consciously modifiable) signals with a high signal to noise ratio are of paramount interest, indicate which mental states can be measured at the level of a single person at one moment in time. 
For instance, motor imagery based active BCIs (Kalcher et al., 1996; Wolpaw et al., 2002) show that it is relatively easy to distinguish EEG signals that co-occur with imagining right hand movement (power decrease or desynchronization in the 8–13 Hz band over the left sensorimotor cortex) from those that co-occur with left hand movement imagination (power decrease in the 8–13 Hz band over the right sensorimotor cortex). Pineda et al. (2013) show that a similar principle can be used to distinguish sounds that are related to hand-based action from sounds that are related to mouth-based action. In general, brain activity in certain brain areas is roughly indexed by power in the 8–13 Hz frequency band, where high power indicates functional cortical inhibition (reviewed by Klimesch, 2012; Horschig et al., 2014). This inhibition has been proposed to block task-irrelevant processes, thereby enhancing task-relevant brain processes in other brain areas. When measured over the visual cortex, this even allows the spatial direction of visual attention to be estimated, which may be used in active BCI (Bahramisharif et al., 2010). Another important class of BCIs is based on the P300 ERP (Farwell and Donchin, 1988). The P300 is also related to attention: it occurs after an event that is relevant for and attended by an individual. The distinction between attended and non-attended stimuli can be made on a single ERP basis. This is also true for ERPs starting at fixation onset rather than stimulus onset in the context of a visual search task, where distinguishing between targets and non-targets on the basis of fixation-locked ERPs tended to outperform distinguishing between targets and non-targets on the basis of fixation duration (Brouwer et al., 2013a).
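The motor imagery principle can be sketched as a comparison of 8–13 Hz power between the two hemispheres. The code below is a simplified illustration on synthetic signals: the sampling rate, amplitudes, and function names are hypothetical, a naive discrete Fourier transform is used to keep the example self-contained (a real pipeline would use an FFT library and spatial filtering), and practical BCIs typically compare against a baseline and use trained classifiers rather than a single comparison.

```python
import cmath
import math
import random


def band_power(samples, sampling_rate, low_hz=8.0, high_hz=13.0):
    """Mean power of DFT bins between low_hz and high_hz (naive DFT, fine for short signals)."""
    n = len(samples)
    mean_value = sum(samples) / n
    centered = [x - mean_value for x in samples]
    powers = []
    for k in range(1, n // 2):
        freq = k * sampling_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered))
            powers.append(abs(coeff) ** 2 / n**2)
    return sum(powers) / len(powers)


def imagined_hand(left_hemisphere, right_hemisphere, sampling_rate):
    """Guess the imagined hand from the hemisphere with the lower 8-13 Hz power."""
    left_power = band_power(left_hemisphere, sampling_rate)
    right_power = band_power(right_hemisphere, sampling_rate)
    # Desynchronization is expected over the hemisphere contralateral to the imagined hand.
    return "right" if left_power < right_power else "left"


rng = random.Random(0)
sampling_rate = 128
n_samples = 256  # two seconds of synthetic data


def synthetic_channel(rhythm_amplitude):
    """10 Hz rhythm of the given amplitude plus broadband noise."""
    return [
        rhythm_amplitude * math.sin(2 * math.pi * 10.0 * t / sampling_rate) + rng.gauss(0.0, 0.2)
        for t in range(n_samples)
    ]


left_hemisphere = synthetic_channel(0.3)   # suppressed rhythm over the left sensorimotor cortex
right_hemisphere = synthetic_channel(1.0)  # unsuppressed rhythm over the right sensorimotor cortex
print(imagined_hand(left_hemisphere, right_hemisphere, sampling_rate))
```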

# Examples of Existing and Promising Applications

A concrete example of an application that fulfills the criteria of no or unreliable alternative measures of the mental state of interest and reliable neurophysiological signals, is "concealed information detection" (Farwell et al., 2014). In this work ERPs are used to determine whether or not it is likely that a person under "criminal" investigation possesses certain information. This is done by examining whether ERP responses to this concealed information are more similar to responses to known and relevant information, or more similar to responses to irrelevant information. In this case, verbal information about not knowing the concealed information cannot be taken at face value, i.e., there is no reliable alternative information present. In addition, data can be collected such that little noise is present because of body movements. Large amounts of data can be collected and analyzed offline, and the procedure is based on the established differential effect of relevant and irrelevant stimuli on ERPs. Another promising example from the current special issue is studied by Wang et al. (2014). They showed that EEG indicates whether a warning, that individuals respond to behaviorally, actually alerted them or not as indicated by a quick vs. late behavioral response later on. In contrast to EEG, the behavioral response to the warning did not reveal an individual's alertness level, i.e., neurophysiology is likely to add value. Furthermore, fatigue or alertness has been well demonstrated to be associated with neurophysiology, and Wang et al. propose an application in driving, where little body movement occurs. They also show that similar results were obtained in an experiment using lightweight, portable, and low density EEG equipment. Another application area that at least fulfills some of the requirements is evaluating working with different interfaces and displays with respect to workload and attention.
For this, performance and subjective measures could be used as well, but neurophysiological measures could provide more continuous information, such that potential difficulties could be pinpointed more exactly. Analysis can be performed offline. In certain real-life evaluation scenarios, neurophysiology may add value because social demand characteristics could play a role, e.g., when individuals are reluctant to report that an ad within a display caught their attention. Real time prediction of errors based on neurophysiological correlates of workload or drowsiness could be helpful, especially if people are unable or reluctant to signal high drowsiness or workload by themselves and if other behavioral performance measures are not available (e.g., for monitoring images from a surveillance camera where relevant events seldom occur but a single miss could have serious consequences). A potential application that builds upon well-established motor imagery BCI as discussed in the previous section, is detecting and feeding back information on motor imagery to support motor rehabilitation (Mokienko et al., 2014).

There are applications using or claiming to use neurophysiological signals for estimating mental state that are successful up to the point of commercial success, while they do not fulfill the proposed conditions as to the unavailability of reliable alternative measures of mental state or to the reliability of mental state as estimated by neurophysiological signals. In these applications, the (apparent) use of neurophysiological signals is perceived by the user as fun or helpful in itself. An example is the moveable "Necomimi cats ears" that users can wear on their head. It is claimed that the device reflects the emotions of the wearer based on EEG as measured by a single dry electrode on the forehead. Also, there is a range of biofeedback and neurofeedback games commercially available that are said to provide users with information about their neurophysiological signals so that they can learn to adapt these, which in turn should lead to improved health or well-being. While good research is conducted in the neuro- and biofeedback area (e.g., van Boxtel et al., 2012; see also Frontiers Research Topic "Learned brain self-regulation for emotional processing and attentional modulation: from theory to clinical applications" e.g., Enriquez-Geppert et al., 2014), double-blind controlled research is scarce (Vollebregt et al., 2014). For the commercially available self-help applications it is unclear whether these benefit the user over and above a placebo effect. Even though work in the area of commercial consumer applications is not up to high scientific standards yet, it is valuable and important for advancing knowledge in wearable, low energy equipment, user acceptance, and usability. However, care must be taken not to mislead users on what the equipment exactly does and achieves.

# Links between Recommendations

While we grouped and presented pitfalls and recommendations as six separate entities, they are closely interconnected. Clearest in this respect is the recommendation concerning confounds (3), which runs through all other recommendations. Confounds should be recognized when defining ground truth (see recommendation 1). One should consider that classification results may not reflect differences in mental state but rather result from confounding factors (4 and 5). Recommendation 2 on connecting the state of interest to the neurophysiology involved could help determine whether an effect is likely based on a confound or not. Finally, one should consider actually making use of confounds, i.e., regarding and using them as sources of information (see recommendation 6). Another example of interconnection is that experimental skills as described under recommendations 1 and 3 are important for deriving valid training data for the classification algorithm (4 and 5). Finally, we would like to mention that connecting the state of interest to the neurophysiology involved (2) is important for choosing sensible features to train the classifier (4), whereas good data-driven classification practice may in turn lead to improved understanding of the mapping between mental state and neurophysiology (2) (Lemm et al., 2011).

# Concluding Remarks

A large body of previous research shows that in principle, neurophysiological variables contain information about mental state. Continuous knowledge of mental state could potentially be helpful in a range of application areas such as gaming, security, health, and mobility (van Erp et al., 2012). Two areas of research can be defined that are crucial for future success of applications making use of mental state estimation based on neurophysiology. These are sensor technology and generalization of mental state estimation across time, tasks, and people.

Advances in wearable, even fashionable sensor technology have boosted the field and are expected to contribute further, even though there are still major challenges to overcome. Currently, there are a number of dry electrode EEG systems available or under development. These systems do not require the application of gel and sometimes come with a fancy headset. While at least some of these systems approach the performance of conventional wet systems (Zander et al., 2011; Chi et al., 2012), dry electrodes need pressure to compensate for the lack of gel, which can be uncomfortable. Debener and colleagues follow a different route by focusing on tiny lightweight electrodes that provide signals that are resistant to body movements such as those caused by walking (De Vos et al., 2014). For physiological signals besides EEG, wearable sensor development is rapid. Breathing and heart rate can be monitored without attaching sensors, using a camera (Wu et al., 2012; Brouwer and Hogervorst, 2014) or radar (Lazaro et al., 2010), and wristbands that record heart rate or skin conductance are commercially available. Validation of these new types of equipment by independent parties is required.

For practical applicability, it is important that estimates of mental state based on neurophysiological signals can be generalized across tasks, time and individuals. The recommendations as discussed (e.g., the recommendation in Section Provide Insight into the Cause of Classification Success) are partly connected to improving generalization and to being able to predict whether, and under which circumstances, generalization is possible. For some types of signals and tasks such as the P300 in a P300-BCI, generalization (in this case, across days or even months) has been demonstrated not to pose a large problem (McCane et al., 2014). Wang et al. (2014) discuss that for EEG signals associated with fatigue, or (upcoming) lapses in performance, findings are similar across tasks. However, for many passive BCI-like applications it is difficult to create a training situation that can be used to train a classification model and that is sufficiently similar to the application situation. A workload classifier trained using known labels of a working memory task that varies in difficulty may not be able to estimate workload in a driving task where properties of the task and environment are very different. The present issue features a number of studies that worked on generalization across tasks. Stikic et al. (2014) show similarities in (unsupervised) neural network results trained and tested on neurophysiological data from combat marksmanship and golf putting tasks. Gerjets et al. (2014) propose a strategy to deal with cross-task generalization (see also Walter et al., 2013). The present issue also includes work on generalization across specific electrode montages and days (Estepp and Christensen, 2015). Touryan et al. (2014) modeled time-related changes in EEG. Algorithms that can adapt the classification model on the fly could prevent problems due to generalization across time (e.g., Millán, 2004; Kindermans et al., 2012). Casson (2014) shows that adding artificial noise to EEG data helps to make classification performance more robust across time. Reuderink (2011) discusses generalization issues with respect to variability within and between users, and potential ways to make classification algorithms more robust, which may help to reduce other generalization problems as well.
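To make the idea of on-the-fly adaptation concrete, the following is a minimal sketch, not the method of any of the cited studies. It assumes a one-dimensional toy feature whose baseline drifts slowly over time, and it compares a fixed linear decision rule with the same rule applied after re-centering each incoming sample on an exponentially weighted running mean. The class name `RunningRecenter`, the adaptation rate, and the simulated drift are all hypothetical choices made for illustration.

```python
class RunningRecenter:
    """Exponentially weighted running mean used to re-center incoming features."""

    def __init__(self, n_features, rate=0.05):
        self.mean = [0.0] * n_features
        self.rate = rate  # hypothetical adaptation rate; would need tuning

    def update(self, x):
        # Move the running mean a small step toward the new sample,
        # then return the sample with that mean removed.
        self.mean = [m + self.rate * (xi - m) for m, xi in zip(self.mean, x)]
        return [xi - m for xi, m in zip(x, self.mean)]


def classify(x, weights):
    """Fixed linear decision rule: class 1 if the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def simulate(n_trials=500, drift_per_trial=0.02):
    """Compare a static and a re-centered classifier on drifting toy data."""
    weights = [1.0]
    recenter = RunningRecenter(n_features=1)
    static_hits = adaptive_hits = 0
    for t in range(n_trials):
        label = t % 2                        # alternating toy labels
        clean = 1.0 if label == 1 else -1.0  # class-dependent feature value
        x = [clean + drift_per_trial * t]    # slow additive drift over time
        static_hits += classify(x, weights) == label
        adaptive_hits += classify(recenter.update(x), weights) == label
    return static_hits / n_trials, adaptive_hits / n_trials


static_acc, adaptive_acc = simulate()
print(f"static: {static_acc:.2f}, re-centered: {adaptive_acc:.2f}")
```

In this toy setting the static rule degrades once the drift exceeds the class separation, whereas the re-centered rule keeps tracking it. The sketch relies on the two classes being roughly balanced within the averaging window; with unbalanced classes the running mean itself would shift toward the more frequent class.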

We would like to end this paper by stressing the importance of real-life studies. While laboratory studies designed to reveal the exact connection between mental state and neurophysiology are important, in an essentially applied field of research, we also need to design applications and test whether they have added value. This should also be done under real-life circumstances rather than (only) in a laboratory. Individuals are expected to function differently in controlled lab environments and under ecological, everyday circumstances. For instance, compared to having images imposed on static eyes, visual information processing seems to be quicker when actively sampling the environment through eye movements, in which case the brain "knows" when information processing of a new image will start (Kamienkowski et al., 2012). However, because of confounds in real-life studies, it will be hard to connect neurophysiological results directly to cognitive and affective state. Therefore, special care is needed with statements about cause and effect. In real-life studies, too, it is possible and desirable to investigate the likely cause of classification success. This will improve our understanding of the connection between mental state and neurophysiology and provide clues for alternative (potentially more easily measurable) informative variables. Ultimately, what needs to be shown is that applications based on mental state estimation through neurophysiological signals support users and improve performance or well-being over and above the use of a sensible comparison application. This will likely refer to a certain context and a defined range of function (Fairclough, 2009). The field of mental state estimation through neurophysiological signals as a science will benefit from scientists being careful in their statements about what is and is not possible and potentially helpful, as well as from high-quality studies that avoid the most common pitfalls.

# Acknowledgments

Thanks to the participants of the workshop "Using neurophysiological signals that reflect cognitive or affective state" held during the fifth International BCI Meeting in Asilomar, 2013 for their active discussion that fueled the idea of this paper. Thanks to the participants of the first Passive BCI Community Meeting in Delmenhorst, 2014 for helpful comments on a presentation based on an initial version of this paper. Thanks to Maarten Hogervorst for numerous discussions on the topics described in this paper.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Brouwer, Zander, van Erp, Korteling and Bronkhorst. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Cognitive state monitoring and the design of adaptive instruction in digital environments: lessons learned from cognitive workload assessment using a passive brain-computer interface approach

#### *Peter Gerjets<sup>1</sup>\*, Carina Walter<sup>2</sup>, Wolfgang Rosenstiel<sup>2</sup>, Martin Bogdan<sup>2,3</sup> and Thorsten O. Zander<sup>4</sup>*

*<sup>1</sup> Hypermedia Lab, Knowledge Media Research Center, Tübingen, Germany*

*<sup>2</sup> Department of Computer Engineering, University of Tübingen, Tübingen, Germany*

*<sup>3</sup> Department of Computer Engineering, University of Leipzig, Leipzig, Germany*

*<sup>4</sup> Team PhyPA, Biological Psychology and Neuroergonomics, Technical University Berlin, Berlin, Germany*

#### *Edited by:*

*Anne-Marie Brouwer, TNO - Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *Reviewed by:*

*Justin Estepp, Air Force Research Laboratory, USA*

*Stephen Fairclough, Liverpool John Moores University, UK*

#### *\*Correspondence:*

*Peter Gerjets, Knowledge Media Research Center, Schleichstrasse 6, Tübingen 72076, Germany e-mail: p.gerjets@iwm-kmrc.de*

According to Cognitive Load Theory (CLT), one of the crucial factors for successful learning is the type and amount of working-memory load (WML) learners experience while studying instructional materials. Optimal learning conditions are characterized by providing challenges for learners without inducing cognitive over- or underload. Thus, presenting instruction in a way that WML is constantly held within an optimal range with regard to learners' working-memory capacity might be a good method to provide these optimal conditions. The current paper elaborates how digital learning environments that achieve this goal can be developed by combining approaches from Cognitive Psychology, Neuroscience, and Computer Science. One of the biggest obstacles that needs to be overcome is the lack of an unobtrusive method of continuously assessing learners' WML in real time. We propose to solve this problem by applying passive Brain-Computer Interface (BCI) approaches to realistic learning scenarios in digital environments. In this paper we discuss the methodological and theoretical prospects and pitfalls of this approach based on results from the literature and from our own research. We present a strategy on how several inherent challenges of applying BCIs to WML and learning can be met by refining the psychological constructs behind WML, by exploring their neural signatures, by using these insights for sophisticated task designs, and by optimizing algorithms for analyzing electroencephalography (EEG) data. Based on this strategy we applied machine-learning algorithms for cross-task classifications of different levels of WML to tasks that involve studying realistic instructional materials. We obtained very promising results that yield several recommendations for future work.

**Keywords: passive brain-computer interface, EEG, cross-task classification, working-memory load, adaptive learning environments, cognitive load theory**

# **INTRODUCTION: PASSIVE BRAIN-COMPUTER INTERFACES FOR ADAPTIVE DIGITAL ENVIRONMENTS**

A Brain-Computer Interface (BCI) is a direct link between a human brain and a technical system. It detects patterns in brain activity (by means of so-called classifiers) and translates them into input commands for the machine. Usually, brain activity is recorded noninvasively through electroencephalography (EEG) and is interpreted by a conventional personal computer using machine learning and signal processing techniques (Blankertz et al., 2002). Machine learning algorithms are used to classify different patterns in the EEG that have to be actively produced by patients, e.g., by imagining movements (Wolpaw et al., 1991; Pfurtscheller et al., 1993; Schalk et al., 2004), or that are elicited in reaction to an attended stimulus. The initial, principal goal of BCI-based applications has been to provide direct communication and control channels for patients who have lost their ability to communicate naturally (Wolpaw et al., 2002). As the reliability and usability of BCI systems have improved over the past decade, their applicability and their appeal for purposes other than patient support have also grown. Today, BCI approaches are developed for applications beyond assistive technologies, addressing general problems of human-computer interaction (Zander et al., 2010). According to the passive BCI approach (Zander and Kothe, 2011), spontaneously generated brain signals related to current changes in cognitive and affective user states can be deployed to support a given interaction. With passive BCIs, new methods for real-time cognitive state assessment become available that might improve human-computer interaction significantly (for exemplary studies in this emerging field see e.g., Lin et al., 2005; Ramsey et al., 2006; Dyson et al., 2010).

Different forms of passive BCIs have been proposed in recent years, for example for detecting the perception of errors (Zander and Jatzev, 2009; Zander et al., 2010), for detecting cheating (Reissland and Zander, 2010), for detecting specific intentions when fixating an object (Zander et al., 2011a; Protzak et al., 2013), or for detecting the perceived loss of control when using a system (Jatzev et al., 2008; Zander and Jatzev, 2012). Passive BCIs can be considered as an implicit, secondary communication channel enriching the ongoing primary human-computer interaction by providing information about ongoing user states. The opportunity to adapt specialized digital environments to current mental states such as inattention, cognitive or perceptual workload, mental fatigue or aversive emotions might be very helpful for users such as system operators, people interacting in augmented environments, or even surgeons and astronauts (Tonin et al., 2010; De Negueruela et al., 2011).

In this paper we will mainly concentrate on issues related to the detection of users' cognitive workload and the automatic adaptation to it. Workload adaptation is a topic that is relevant not only for specialized digital environments as mentioned before but also for much more ubiquitous and generic settings of everyday life such as adaptive learning environments for educational purposes. As will be outlined in the next section, adapting the complexity of instructional materials to learners' cognitive workload has been the main rationale of many instructional design approaches. In the remainder of this paper we will elaborate on the methodological and theoretical prospects and pitfalls of a passive BCI approach to cognitive workload assessment in instructional contexts by discussing results from the literature and from our own research. Based on these discussions, we will present a strategy on how several inherent challenges of applying BCIs to working-memory load (WML) and learning can be met by refining the psychological constructs behind WML, by exploring their neural signatures, by using these insights for sophisticated task designs, and by optimizing algorithms for analyzing EEG data.

# **COGNITIVE WORKLOAD AND ADAPTIVE INSTRUCTION: A CHALLENGE FOR INSTRUCTIONAL DESIGN**

According to important cognitive theories of instructional design (e.g., Cognitive Load Theory, CLT, Sweller et al., 1998; Cognitive Theory of Multimedia Learning, CTML, Mayer, 2009) the type and amount of cognitive load learners experience while studying instructional materials is one of the crucial factors for successful learning. Optimal learning conditions are characterized by providing challenges for learners without inducing cognitive overload. This general rationale is also prevalent in many classic instructional design theories. For instance, Vygotsky's (1978) well-known "zone of proximal development" describes an ideal instructional situation as intermediate between situations that are "too difficult" and situations that are "too easy." Salomon's (1984) AIME-approach focuses on optimizing learners' "amount of invested mental effort" by manipulating the perceived task demands of learning materials such that learners are stimulated to engage in study activities imposing a high level of cognitive workload. Reigeluth's seminal Elaboration Theory (Reigeluth and Stein, 1983) is based on the idea of presenting the same instructional contents in a sequence of increasingly complex versions to align the cognitive workload imposed onto learners with their growing level of understanding. Thus, presenting instructional materials in a way that learners' cognitive workload is constantly held within an optimal range is not a new idea but rather seems to be unequivocally advocated by many instructional theories as a guideline to provide optimal learning conditions. The different conceptions of what constitutes the nature of cognitive workload, however, seem to be less unequivocal.

Workload theories in the human factors area like Wickens' (1984) widely adopted Multiple Resource Model assume that there are numerous information-processing resources (like perception, processing and action applied to items of different modalities and representational codes), each of which could potentially be overloaded in a current task performance, thereby generating a bottleneck. On the contrary, cognitive theories of instructional design like CLT or CTML are much more specific with regard to the information-processing resources they consider as pivotal cognitive bottlenecks for processes of learning and comprehension. According to these theories, the workload imposed onto a structure called working memory is the crucial type of cognitive load. Working memory describes the small amount of information that can be held and manipulated in mind simultaneously for the execution of a current cognitive task (Cowan, 2014). A careful management of cognitive load imposed onto working memory is the main instructional rationale of theories like CLT or CTML. This is in line with the main recommendation derived by working memory researchers (e.g., Cowan, 2014) from their work with regard to improving learning and education: It is of fundamental importance that instructional materials are adjusted to the working-memory capabilities of learners.

The concept of WML in CLT and CTML refers back to the multicomponent working memory model by Baddeley (1986, 2000, 2012), which distinguishes verbal and visual temporary storage components (phonological loop and visuospatial sketchpad) from an attentional control system (the central executive, borrowed from the supervisory attentional system postulated by Norman and Shallice, 1986). CLT and CTML focus almost entirely on the capacity limitations of the storage components of working memory as the bottlenecks for learning, assuming that these components constrain the amount of new information that can be processed simultaneously in order to be integrated into long-term memory. According to this view, if the amount of information a learner has to process at some point in time exceeds the capacity of these storage components, the entire learning process is hindered. Consequently, the scientific literature on CLT and CTML focuses on identifying instructional manipulations that allow influencing the type and amount of WML during learning in order not to overtax students' working-memory capacity.

WML during learning is commonly assumed to be the result of an interaction between the external task requirements of the instructional design presented to learners and the complexities of the contents to be learned in relation to learners' knowledge prerequisites (cf. Kalyuga et al., 2003). Identical contents can result in lower WML for more (as compared to less) knowledgeable learners due to the chunking of information enabled by the availability of more complex concepts for representing contents. Beyond the cognitive load imposed by the representational holding of contents (also described as intrinsic or essential cognitive load) there are additional components of cognitive load imposed by task requirements of the instructional design itself (e.g., requirements such as searching and handling information, comparing instructional materials, or drawing elaborative inferences). Depending on whether these additional processes required by the instructional design are helpful or hindering for deeper learning they are described as germane or extraneous processes (Sweller et al., 1998; Gerjets and Scheiter, 2003; Gerjets and Hesse, 2004; Cierniak et al., 2009b). This distinction, however, between "positive" (germane) and "negative" (extraneous) components of cognitive load has been subject to severe critique in recent years (cf. Gerjets et al., 2009; De Jong, 2010).

Frameworks like CLT and CTML have elaborated ample advice on how to adapt contents and instructional designs to students' limited working-memory resources. One important caveat in this field, however, is that it is not possible to provide general advice for all types of learners independent of their level of prior knowledge. Due to the interaction between content complexities and learners' knowledge prerequisites, one and the same instructional material can impose different levels of workload onto different learners. There are even "expertise reversal effects" showing that one instructional material *A* as compared to another instructional material *B* imposes higher WML for novice learners, whereas the reverse is true for more advanced learners (Kalyuga et al., 2003). For instance, a so-called integrated graphic that presents pictorial elements in close spatial proximity to corresponding verbal explanations (e.g., labels) imposes a lower level of WML onto novice learners than a non-integrated graphic that provides verbal explanations apart from their pictorial counterparts (e.g., as a figure caption). It is assumed that integrated graphics support novices in searching for verbal explanations of pictorial elements. Interestingly, the inverse pattern has been shown for more advanced learners (e.g., Cierniak et al., 2009a). Advanced learners suffer from higher WML when learning with integrated graphics as compared to non-integrated graphics. This is due to the fact that at some point during learning verbal explanations of pictorial elements become redundant for learners as they are acquainted with these elements. Subsequently, learners might even tend to suffer from integrated graphics with regard to WML because they are forced to process verbal information in working memory that is unnecessary for them to understand the graphics. Consequently, from the viewpoint of instructional-design theories WML is a highly volatile learner characteristic, potentially consisting of different cognitive load components that change during learning not only due to learners' increasing level of knowledge but also due to the changing instructional materials and task requirements presented to learners at each point in time.

Accordingly, an ultimate instructional goal would probably be a moment-to-moment assessment of WML leading to an immediate online adaptation of instructional materials in case of learners getting overwhelmed (or underchallenged) with regard to their currently available working-memory capacity. A natural technological solution to achieve this goal would consist in constructing an adaptive learning environment that, unlike traditional intelligent tutoring systems (ITS) based on model-tracing (e.g., Corbett, 2001) or on natural language analysis (e.g., Graesser and McNamara, 2010), does not aim to diagnose learners' developing knowledge structures directly, but more generically concentrates on the continuous detection of different levels of WML and the online adaptation to it. Probably the biggest obstacle to achieving such a technological goal is the lack of appropriate measurement procedures allowing for a continuous and non-obtrusive online tracking of WML. Current measurement methods mostly rely either on subjective workload ratings (Hart and Staveland, 1988; Paas et al., 2003) or on dual-task procedures (Brünken et al., 2003; DeLeeuw and Mayer, 2008; Cierniak et al., 2009b), both of which are likely to disrupt and annoy learners while studying, without allowing for a fine-grained temporal modeling of WML. Despite these drawbacks, first provisional attempts to construct workload-adaptive learning environments based on subjective rating scales yielded promising results, even though adaptation was not instantaneous, as WML could not be assessed continuously (Mihalca et al., 2011). Additionally, the learning process of course had to be interrupted frequently in this study to obtain workload ratings from learners.
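The adaptation logic envisaged here can be illustrated with a small sketch. It assumes that some upstream component delivers a continuous WML estimate scaled between 0 and 1, and that the instructional material exists in a handful of discrete difficulty levels; both are assumptions made for illustration. The class name `WorkloadAdapter`, the thresholds 0.3 and 0.7, and the smoothing window are hypothetical and are not taken from the studies cited above.

```python
from collections import deque


class WorkloadAdapter:
    """Keep a smoothed workload estimate inside a target band by
    stepping the difficulty level of the material up or down."""

    def __init__(self, low=0.3, high=0.7, window=5,
                 min_level=1, max_level=5, start_level=3):
        self.low, self.high = low, high      # hypothetical "optimal range"
        self.recent = deque(maxlen=window)   # most recent WML estimates
        self.min_level, self.max_level = min_level, max_level
        self.level = start_level

    def step(self, wml_estimate):
        """Take one new WML estimate (0..1) and return the level to present."""
        self.recent.append(wml_estimate)
        smoothed = sum(self.recent) / len(self.recent)
        if smoothed > self.high:             # overload: simplify the material
            self.level = max(self.min_level, self.level - 1)
            self.recent.clear()              # old estimates refer to the old level
        elif smoothed < self.low:            # underload: increase the challenge
            self.level = min(self.max_level, self.level + 1)
            self.recent.clear()
        return self.level


adapter = WorkloadAdapter()
for estimate in [0.9, 0.6, 0.5, 0.2, 0.1, 0.5]:
    print(estimate, "->", adapter.step(estimate))
```

Smoothing over a short window is one simple way to avoid reacting to a single noisy estimate; the price is a delay in adaptation, so the window length trades responsiveness against stability.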

Less obtrusive measures of WML that are better integrated into the learning process, such as performance measures obtained on assessment items embedded into the instructional materials, are unfortunately not sufficiently specific for cognitive load monitoring, as they do not reveal whether the successful or unsuccessful solution of an assessment item was achieved with a high or low level of working-memory investment. Or as Cowan (2014) puts it: "One potential pitfall to watch for is that, while some students will want to press slightly beyond their zone of comfort and will learn well, others will want an easy time, and may choose to learn less than they would be capable of learning." Thus, the investment of working-memory resources during learning and the success or failure with regard to assessments of learning outcomes are clearly dissociable. Beyond this disadvantage, even repeated assessments during learning will not lead to a continuous measurement and might even frustrate learners in case of high failure rates.

In sum, measures like task performance, rating scales, or dual tasks do not allow for a continuous and non-obtrusive measurement of WML, and this fact poses a fundamental problem for workload-adaptive instructional environments. One option that has been advocated to overcome this problem is to identify suitable physiological indicators of WML. For instance, it has been proposed to use physiological measures like pupil dilation or skin conductivity for a continuous WML assessment. However, up to now these measures have mostly turned out not to be very reliable and specific indicators of WML (Brünken et al., 2003; Paas et al., 2003). A novel methodological avenue for solving this problem in instructional research might be to obtain more direct physiological measures of neural activity to derive sufficiently specific indicators of WML during learning. This approach can capitalize on a tradition of using EEG parameters as workload indicators in human factors research as well as in neuroscience research (e.g., Gevins and Smith, 2003). In instructional scenarios this approach has been applied very rarely and in a tentative fashion (e.g., Gerlic and Jausovec, 1999; Antonenko and Niederhauser, 2010; Antonenko et al., 2010). In this paper we will follow this line of research by exploring the potential of a passive BCI approach for designing workload-adaptive learning technologies based on EEG signals. In the near future, such BCIs may be applied in combination with low-cost headset-like EEG sensors using dry electrodes (as they are currently beginning to reach the market, cf., Zander et al., 2011b) to help learners in instructional environments to keep their cognitive workload constantly within an optimal range. Beyond preventing cognitive overload, such BCIs may also help to immediately detect and fix a lack of challenge when learning materials are too simple in relation to a specific learner's prerequisites in terms of prior knowledge or working-memory capacity. This approach resembles ITS that have already been developed for other types of learner states (e.g., affective states like boredom or engagement), following the same rationale of using sensor-based measures to control adaptive learning environments (for an overview see Calvo and D'Mello, 2010).

# **PROMISES AND DRAWBACKS OF EEG-BASED MEASUREMENT OF WML DURING LEARNING: THE ISSUE OF PERCEPTUAL-MOTOR CONFOUNDS**

The currently best-studied EEG correlates of WML that would be suitable for a continuous online assessment of learner states are variations in the oscillatory power of the theta and alpha band activity (for a review see Klimesch, 1999). An increase in WML has repeatedly been shown to lead to an increase in theta band activity over frontal-midline electrodes (event-related synchronization, theta ERS; e.g., Gevins et al., 1997; Jensen and Tesche, 2002; Missonnier et al., 2006; Krause et al., 2010; Sauseng et al., 2010) and a decrease in alpha band power over parietal-occipital electrodes (event-related desynchronization, alpha ERD; e.g., Gevins et al., 1997; Stipacek et al., 2003; Krause et al., 2010).
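As a rough illustration of what such band-power measures involve, the sketch below computes the power of a single-channel signal within a frequency band from a plain discrete Fourier transform, and expresses a task value as a percentage change relative to a baseline. The band limits (theta 4 to 8 Hz, alpha 8 to 13 Hz) are common conventions assumed here rather than values given in the text, the sign convention of the percentage change is likewise an assumption, and the input is a synthetic sinusoid mixture, not EEG. Practical analyses would typically use windowed spectral estimates from a signal-processing library.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Sum of squared DFT magnitudes (scaled by n^2) for f_lo <= f < f_hi."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]           # remove the DC offset
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n                    # frequency of DFT bin k in Hz
        if f_lo <= freq < f_hi:
            coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
            power += abs(coeff) ** 2 / n ** 2
    return power


def percent_change(task_power, baseline_power):
    """Relative power change: positive = increase, negative = decrease."""
    return 100.0 * (task_power - baseline_power) / baseline_power


# Synthetic 2-s signal at 128 Hz: a 6-Hz component (amplitude 2)
# plus a 10-Hz component (amplitude 1).
fs, n = 128, 256
signal = [2.0 * math.sin(2 * math.pi * 6 * t / fs)
          + 1.0 * math.sin(2 * math.pi * 10 * t / fs) for t in range(n)]

theta = band_power(signal, fs, 4.0, 8.0)     # assumed theta band
alpha = band_power(signal, fs, 8.0, 13.0)    # assumed alpha band
print(f"theta power: {theta:.3f}, alpha power: {alpha:.3f}")
```

With these definitions, an increase in frontal theta under load would show up as a positive percentage change relative to baseline, and a parietal alpha decrease as a negative one.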

In line with these findings, it has recently been proposed by Antonenko et al. (2010) to use oscillatory power for the measurement of WML in instructional research. To substantiate this idea, the authors discuss two studies that obtained continuous EEG measures to indicate WML during learning with hypermedia (Antonenko and Niederhauser, 2010) and multimedia (Gerlic and Jausovec, 1999). However, in our view both of these studies are subject to one of the main pitfalls that have to be resolved before neural workload measures can find a useful application in the context of a passive BCI to improve learning: In both studies it remains very unclear due to the complex learning materials used for experimentation whether the observed EEG differences between more and less demanding learning materials really go back to differences in WML alone or whether they might be mostly results of some behavioral correlates of workload or of some other perceptual or motor differences between the experimental conditions.

For instance, Gerlic and Jausovec (1999) found that learning about planets from spoken text combined with music and pictorial information (high WML) as compared to learning from written text alone (low WML) yielded alpha ERDs in temporal and occipital electrodes whereas in the text-only condition alpha ERDs in frontal and central electrodes occurred. Here, potential differences in WML between the experimental conditions are massively confounded with perceptual differences of the experimental materials. The same criticism applies to the second study cited by Antonenko et al. (2010). This study (Antonenko and Niederhauser, 2010) revealed differences in alpha and theta frequencies when reading hypertext with vs. without link previews, hypothetically indicating a lower level of WML in conditions with link previews. However, conditions with link previews also differed perceptually (pop-up windows showed up) and with regard to motor activity (mouseover was needed for activating previews) from the conditions without previews. Again, these perceptual-motor confounds seem to prevent a clear attribution of the EEG differences found between experimental conditions to levels of WML. In our view, these problems of perceptual-motor confounds seem inevitable when using standard EEG power analyses for comparing different realistic learning materials (instead of comparing more controlled experimental tasks without perceptual-motor confounds).

What implications do these studies have with regard to the prospects of developing workload-adaptive instructional environments based on a passive-BCI approach? First, it has to be noted that the standard EEG power analyses used in these experimental studies require statistical averaging of data across several subjects and trials to identify significant EEG differences between instructional conditions. When it comes to adaptive learning environments, however, single-subject single-trial analyses are necessary to derive a continuous classification of learners' level of WML from EEG data, allowing instantaneous reactions to different levels of WML. This requires very different approaches to online data analysis. Second, even if it is possible to classify realistic instructional conditions (inducing different levels of WML) online based on suitable methods to analyze EEG power differences, the question needs to be addressed whether this classification is really based on differences in WML or on some of the perceptual-motor confounds of the different instructional conditions. In the latter case, it can unfortunately not be expected that the classification method used for the current task will transfer to novel instructional situations unless these situations have similar perceptual-motor confounds of different levels of WML. Beyond this problem that a classification based on perceptual-motor confounds is quite uninteresting from a practical perspective, it is also quite uninformative with regard to the theoretical issue of whether we can adapt instructional environments to cognitive workload, because the classification method might not be specific enough to address workload only and hence might not really keep track of learners' WML. We addressed both caveats (demonstration of single-subject single-trial analyses; avoiding perceptual-motor confounds in classifier training) in two of our own studies that will be reviewed in this paper.

# **STUDY 1: REALISTIC INSTRUCTIONAL MATERIALS IMPOSING DIFFERENT LEVELS OF WML: DO SINGLE-SUBJECT SINGLE-TRIAL EEG DATA ALLOW FOR A CLASSIFICATION OF THESE LEVELS?**

The first caveat that we addressed in a study by Walter et al. (2011) is whether EEG power differences between learners studying two types of realistic instructional materials (inducing different levels of WML) are sufficiently strong and reliable to enable a good classification result when analyzed with BCI methods. The problem of analyzing single-subject single-trial data online is pivotal for the passive BCI approach and requires completely different methods for analyzing oscillatory EEG data than those used by Antonenko and Niederhauser (2010). These methods are based on machine learning algorithms as they have been developed by research on traditional BCIs for patients. In our study, we used realistic materials similar to Gerlic and Jausovec (1999) or Antonenko and Niederhauser (2010) to investigate classification accuracies, being aware, of course, that using realistic instructional materials will inevitably result in the abovementioned confound issues.

The study by Walter et al. (2011) used a within-subject design to manipulate WML. Ten learners (12–14 years) were asked to study two different types of instructional materials involving processes of learning and comprehension at different levels of WML. In a high WML study window, learners had to study graphical representations and explanations of mathematical angle theorems in order to understand the theorems. In a low WML study window, learners had to study a different kind of graphical representation, namely comic strips, in order to understand the stories depicted. Both materials involved complex graphical displays; however, the comic strips were quite easy for learners to understand, whereas the angle theorems were quite hard for them to grasp. The two types of materials were presented to learners in an alternating sequence to avoid confounding WML with presentation time. This type of sequencing was chosen to improve experimental control and internal validity. It has to be noted, however, that it is not very representative of realistic instructional situations. Under realistic conditions, there is usually an increase in the objective complexity of learning materials over time, which is, however, not always associated with an increase of learners' level of WML due to learners' improved knowledge prerequisites over time (see Section Cognitive workload and adaptive instruction: A challenge for instructional design for details).

The experimental procedure comprised three parts. First, subjects had to solve a pre-test to ensure they had no prior knowledge of angular geometry before participating in the study. A second part consisted of three learning episodes, 11 min each. In each episode subjects were asked to study five angle theorems as well as five comic strips (in alternating sequence). Each theorem and each comic strip was presented for a 45 s time interval that we define as *study window*. Subsequent to each study window, subjects were requested to answer a question with regard to the interpretation of the theorem or comic strip (four multiple-choice options presented for 10 s). Finally, subjects had to rate their subjective level of cognitive load during the study window on a 7-point Likert scale presented for another 10 s. Overall, subjects were presented with 15 study windows on angle theorems and 15 study windows on comic strips in the second part of the experiment, always alternating between these two types of materials. In a third part of the experiment, learning outcome measures were first obtained by asking students to solve angle problems. We used a German version of the Carnegie Learning geometry tutor for this task (Schwonke et al., 2007). Finally, a post-test comprising the same items as the pre-test was administered in order to allow for direct pre-post comparisons.

During the two types of study windows (angle theorems vs. comic strips), EEG data were collected and features with regard to different characteristics in the EEG frequency bands were extracted. Due to the low number of 30 study windows (15 for each type of instructional material), all artifact-free study windows were segmented into smaller epochs of 15 s for each subject and each study window. Accordingly, 45 epochs per subject resulted for each type of instructional material (3 epochs per study window × 5 study windows per learning episode × 3 learning episodes for each type of instructional material), which were used for classification. This segmentation was possible because no significant variation was detected in the signal over a full study window. Concerning the feature selection, we focused on frontal and parietal electrodes with regard to the spectral power within the alpha (8–13 Hz) and theta (4–7 Hz) frequency bands. For spectral analysis, an autoregressive model was calculated using the Burg algorithm. As there were no theta differences between learning materials, only alpha power values were used as features to classify comic epochs (studying easy materials inducing low levels of WML) vs. theorem epochs (studying difficult materials inducing high levels of WML). As students were not required to take any overt motor action during learning, prima facie no obvious motor confounds of the two types of study windows are to be expected. The results show that during epochs with high levels of WML, a desynchronization of alpha band activity could be observed in parietal (and occipital) brain areas as compared to epochs with low levels of WML. These differences could be successfully classified on a single-subject single-trial basis by using a support vector machine (SVM) with a radial basis function (RBF) kernel (Lotte et al., 2007). A 10-fold cross-validation was conducted to verify the accuracy of the trained SVM. 
The mean classification accuracy for all 10 learners was 76%. For seven out of the ten learners, an accuracy of 80% or higher could be achieved for detecting epochs with high vs. low levels of WML.
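The epoch bookkeeping and the 10-fold split described above can be sketched as follows. This is a minimal illustration rather than the analysis code of the study: the helper name `k_fold_indices`, the shuffling seed, and the assumption that no study window was rejected as an artifact are ours.

```python
# Sketch only: epoch counts and a 10-fold split for one subject.
# Durations and counts come from the text; helper names are hypothetical.
import random

STUDY_WINDOW_S = 45        # length of one study window
EPOCH_S = 15               # length of one epoch after segmentation
WINDOWS_PER_EPISODE = 5    # per type of instructional material
EPISODES = 3

epochs_per_window = STUDY_WINDOW_S // EPOCH_S
epochs_per_material = epochs_per_window * WINDOWS_PER_EPISODE * EPISODES


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Two classes (comic vs. theorem epochs).
# Assumption: every study window was artifact free, so no epoch is missing.
n_epochs = 2 * epochs_per_material
folds = k_fold_indices(n_epochs, k=10)

for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    # A classifier would be fitted on `train` and scored on `test_fold` here.
    assert len(train) + len(test_fold) == n_epochs
```

Under these assumptions each subject contributes 90 epochs, so every fold holds out 9 epochs for testing while the remaining 81 are used for training.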

Thus, our results show that it is indeed possible to classify whether learners study realistic instructional materials inducing low vs. high levels of WML with a substantial accuracy based on single-subject single-trial EEG data. However, although our results seem to be practically relevant and methodologically interesting, we would consider them to be subject to the same conceptual critique that we raised when discussing the studies by Gerlic and Jausovec (1999) or Antonenko and Niederhauser (2010) with regard to perceptual-motor confounds. For instance, the angle theorems and the comic strips might differ with regard to certain perceptual characteristics, potentially leading to differences in processing beyond imposing different levels of workload onto working memory (e.g., different eye-movements or different types of semantic processing). Thus, even if there have been no obvious motor confounds of task difficulty in this study (because students were not required to take any overt motor action during learning), there nevertheless might have been perceptual and as a consequence subtle motor confounds because more difficult tasks might have resulted in different eye-movements that could have been picked up by a classifier. Accordingly, our second caveat remains: Even if an EEG-based classification is possible for realistic tasks imposing different levels of WML, this classification might still not be very helpful theoretically because it remains unclear whether a workload classifier trained on realistic tasks really represents a measure of WML or of something else.

# **ARE THERE SUCCESSFUL CLASSIFICATIONS OF LEVELS OF WML BASED ON OSCILLATORY EEG DATA? A CRITICAL REFLECTION ON RECENT STUDIES AND SOME SUGGESTIONS FOR IMPROVEMENT**

A closer look at the tasks used for the classification of WML from EEG data in studies outside the field of learning and instruction reveals that problems of perceptual-motor confounds seem to be quite common, casting doubt on the usefulness of these results as examples of the successful classification of WML. Interestingly, these problems even occur in most studies that use simple and low-level working-memory tasks for classification.

For instance, Heger et al. (2010) compared a resting state situation to a situation where subjects were conducting workload-imposing tasks like flanker tasks (where they have to press keys in reaction to the orientation of the middle arrow of a display of five arrows like: *>><>>*) or switching tasks (where they have to press keys to decide whether a number is greater vs. lower than five or whether a number is odd vs. even depending on the dashed or solid framing of the number). Heger et al. (2010) classified the neural signature of the resting state situation vs. the workload-imposing situations with an accuracy of over 90% [using Artificial Neural Networks (ANNs)]. Additionally, they applied the classifier trained on these manipulations to realistic computer-based tasks of low vs. high WML (reading vs. typing) by using a so-called cross-task classification approach (for details with regard to this approach see below). The authors claim that the classifier application to the computer-based tasks was quite successful, although no quantitative results of this cross-task classification are reported. These results seem to be quite impressive at first sight; however, there are severe perceptual-motor confounds in this study that render a clear theoretical interpretation of the findings impossible: Obviously, a striking difference between the two classes used for training and for cross-task classification was whether motor activity was involved or not (key pressing vs. resting and typing vs. reading). Thus, motor activity was strongly confounded with the manipulation of WML. As the representation of motor activity in the EEG itself has a strong alpha component and there are, in addition, strong oscillatory electromyographic (EMG) effects on the beta and gamma band activity (which were actually used for classification by the authors), this study may mainly demonstrate a classification of motor vs. no-motor activity rather than of high vs. low levels of WML.

A study by Berka et al. (2004), who claim to classify different levels of workload in motor tasks as well as in cognitive tasks, might be subject to a more sophisticated motor confound. They used EEG data to distinguish four classes of "vigilance" and report that increasing workload leads to a classification into the high vigilance class for all types of tasks. However, as their classifier mainly relies on alpha band power (which is also subject to ERD during motor activity) in combination with behavioral measures (fast eye blinks), it remains quite unclear whether the classifier is sensitive to increased workload, to increased motor activity and eye movements, or to both.

Similar arguments apply to the work of Chaouachi et al. (2011) who used digit span tasks (requiring subjects to retain sequences of digits of different length) and a logic task (requiring subjects to induce rules of different difficulties describing sequences of numbers like 2 - 4 - 6) for training and classification (by means of Gaussian Process Regression). The NASA-TLX rating scale (Hart and Staveland, 1988) was used to obtain subjective data on experienced cognitive workload. Their results yielded over 90% accuracy with regard to the prediction which tasks were subjectively rated as simple, intermediate or difficult. Though initially impressive, again the difficulty levels of the tasks used were systematically confounded with the amount of motor activities required. Thus, the results cannot unambiguously be interpreted as a classification of WML. For instance in the easiest condition of the digit span tasks subjects were prompted 20 times to enter the last three digits presented resulting in a pattern of 20 bursts of three key presses in the experimental block. In the most difficult condition of the digit span task they were prompted four times to enter the last eight digits presented resulting in a pattern of four bursts of eight key presses in the experimental block. Thus, not only does the easy block contain 60 key presses and the difficult block 32 key presses but also the temporal distribution was very different. Moreover, a baseline measurement that was used as a standard for the EEG power data from the different experimental blocks was obtained only once at the beginning of the experiment. In combination with the fact that all tasks were presented in an easy to difficult order, the study seems also to confound task difficulty and time delay since baseline measurement, which might be exploited by the classifier. Therefore, systematic drifts of EEG features over time could also be responsible for good classification results.
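The key-press confound in the digit span blocks can be made explicit with a small calculation. The function name and dictionary keys below are hypothetical; only the prompt counts and digit lengths are taken from the description above.

```python
# Sketch only: key-press pattern per digit span block, using the counts
# given in the text. Names are hypothetical.
def key_press_pattern(prompts, digits_per_prompt):
    """Summarize how many key presses a block requires and how they cluster."""
    return {
        "bursts": prompts,
        "presses_per_burst": digits_per_prompt,
        "total_presses": prompts * digits_per_prompt,
    }


easy = key_press_pattern(prompts=20, digits_per_prompt=3)
difficult = key_press_pattern(prompts=4, digits_per_prompt=8)
```

Here `easy["total_presses"]` is 60 and `difficult["total_presses"]` is 32, and the bursts differ in both number and length, so a classifier could in principle separate the blocks from motor-related signal alone.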

Another study that reported good within-task classification results for three types of working-memory tasks (two levels of WML each) is the study by Baldwin and Penaranda (2012). They used a reading span task, a visuospatial n-back task, and a Sternberg task for inducing different levels of WML. The reading span task requires simultaneous processing (deciding whether a sentence is correct or not) and storage (memorizing a letter presented at the end of each sentence) for a variable number of sentences (three or four in the low WML condition and six or seven in the high WML condition). The visuospatial n-back task required remembering and comparing the previous locations of a moving square with its current location. In the low WML condition the current location has to be compared with the location one trial before (1-back). In the high WML condition it has to be compared with the location three trials before (3-back). Thus, the task requires updating a list of one vs. three locations in working memory for each trial. Nine different locations were used for presentation and the probability that the current location was identical to the criterion location presented *n* trials before (i.e., positive match) was set to 50%. Subjects received feedback (answer correct or not) after each trial. The Sternberg task requires deciding whether a number was or was not present in a list of numbers presented before. In the low WML condition the list length was one, two, or three. In the high WML condition the list length was four, five, or six. The six blocks of tasks (two difficulty levels of each of the three tasks) were presented in a randomized order. Each experimental block was presented for 5 min to subjects. For the subsequent EEG analysis each block was segmented into 60 non-overlapping windows of 5 s each (motor responses required to complete the tasks were not excluded from the analysis). 
The classifiers used to distinguish the levels of WML relied on 50 features, namely the EEG power of five frequency bands (delta, theta, alpha, beta, and gamma) obtained in 10 EEG channels (three frontal, three central, three parietal, and one occipital). ANNs were applied to classify the two levels of WML for all three tasks. The within-task classification, using 50% of the data from the to-be-classified task as training data and 50% for classifier application, resulted in high classification accuracies (approximately 80% on average). As in the studies reported before, however, these results might easily stem from perceptual-motor confounds, as the different experimental conditions (levels of WML) differed strongly with regard to the amount of motor activity contained in the EEG signal used for classification. For instance, in the reading span task the memorized letters had to be typed into an on-screen keyboard every three or four letters in the low WML condition and every seven or eight letters in the high WML condition. Thus, in the easy condition the number of 5-s windows that contain motor signals from typing should be approximately twice as large as in the difficult condition. The same is true for the Sternberg task. In the easy condition, a motor reaction is required after a sequence of one to three numbers, whereas in the difficult condition, a motor reaction is only required after a sequence of four to six numbers. Again, windows containing motor signals should be twice as frequent in the easy condition as in the difficult condition. The effects of these confounds might be severe, as some of the EEG frequency bands used for classification (e.g., alpha or gamma) react strongly to motor responses. 
The n-back task used by the authors seems to have no obvious motor confounds with regard to the number of reactions required; however, the task might suffer from a subtle perceptual confound: The 1-back is a quite easy task with an error rate of approximately 10%, whereas the 3-back is a rather difficult task with an error rate of approximately 30% (chance level was set to 50%). As subjects received feedback (response correct or not) after each trial, the number of 5-s windows containing the cognitive processing of error feedback was three times higher in the difficult condition than in the easy condition. Assuming that error feedback leads to strong neural responses in the EEG (Falkenstein et al., 1991; Spüler et al., 2012a), it cannot be ruled out that this perceptual confound due to feedback presentation has contributed to the strong classification results. Moreover, as the visuospatial n-back requires subjects to memorize one vs. three locations on the screen, it cannot be ruled out that subjects in the difficult condition displayed many more eye movements during rehearsal than subjects in the easy condition. As the authors do not report having removed artifacts due to eye movements from the EEG data, this difference might also have contributed to good classification results.
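The window and feature counts reported for this study follow directly from the design. The sketch below reproduces them; the channel labels are placeholders, since the description above gives only the number of channels per region.

```python
# Sketch only: window and feature counts from the block design in the text.
# Channel labels are placeholders, not the electrode names used in the study.
from itertools import product

BLOCK_S = 5 * 60           # each block lasted 5 min
WINDOW_S = 5               # non-overlapping analysis windows
BANDS = ["delta", "theta", "alpha", "beta", "gamma"]
CHANNELS = (
    [f"frontal_{i}" for i in range(1, 4)]
    + [f"central_{i}" for i in range(1, 4)]
    + [f"parietal_{i}" for i in range(1, 4)]
    + ["occipital_1"]
)

n_windows = BLOCK_S // WINDOW_S
feature_names = [f"{band}@{channel}" for band, channel in product(BANDS, CHANNELS)]

# Ratio of error-feedback rates between 3-back and 1-back, in percent.
error_feedback_ratio = 30 / 10
```

This gives 60 windows per block and 50 band-by-channel features per window, and shows that error feedback is expected about three times as often in the 3-back as in the 1-back condition.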

Based on the abovementioned studies, it has to be noted that even if there are no obvious motoric confounds of task difficulty in workload classification studies—as it is, for instance, the case in the n-back task used by Baldwin and Penaranda (2012) or in our own study reported above (Walter et al., 2011)—it is nevertheless possible that perceptual confounds are present as more difficult tasks might result in different visual displays (e.g., more error feedback or more complex graphics) leading to differences in processing that are unrelated to the concept of WML (e.g., eye movements or error detection). Thus, in order to reliably classify WML, which is a precondition for our goal of developing workload-adaptive instructional environments, it is important to train EEG classifiers on tasks that are highly controlled for perceptual and motoric confounds, which probably cannot be fulfilled by the realistic instructional tasks that we eventually intend to classify in instructional environments.

With regard to the issue of avoiding perceptual-motor confounds of WML, the most convincing classification study we found in the literature was conducted by Brouwer et al. (2012), who used a sophisticated approach to workload classification based on a letter n-back task. They implemented three levels of WML (0-back, 1-back, and 2-back), whereby in the easiest condition (0-back) no updating was required as participants only had to compare whether a presented letter was identical to a previously defined target letter. Tasks were presented in blocks consisting of 48 trials of the same level of difficulty (2.5 s per trial, 2 min per block). Overall, 24 blocks were presented in a pseudorandom order (eight blocks of each difficulty level). The first six blocks of each difficulty level were used for classifier training, the remaining two blocks for classifier testing. Similar to the n-back task used by Baldwin and Penaranda (2012), there were no obvious perceptual-motor confounds. Additionally, compared to the visuospatial n-back task used by Baldwin and Penaranda (2012), there were probably no differences between difficulty levels with regard to eye movements for the letter n-back task. Moreover, Brouwer et al. (2012) controlled for potential effects of the differential distributions of error feedback received by participants for the three difficulty levels. In order to run the EEG analysis, each block was segmented into non-overlapping windows of different length depending on the analysis (ranging from 2.5 to 120 s per window, which allows testing the relation between temporal window size and classification accuracy). Motor responses required to complete the tasks and other artifacts like eye blinks were not excluded from the analysis. Different types of classifiers were tested with regard to their ability to detect levels of WML. 
The classifier that was based on oscillatory power relied on three frequency bands (theta, alpha, and beta) obtained in seven EEG channels (frontal, central, and parietal). An SVM was applied to these features to classify different levels of WML. The results showed that the classification accuracy strongly depended on the temporal window size used for the classification (ranging from 2.5 to 120 s) as well as on the difficulty levels that have to be distinguished. The best classification results (approximately 85% correct on average) were obtained for distinguishing complete blocks of 0-back and 2-back tasks (i.e., decision after 48 trials or 120 s). However, as the 0-back task involves no updating process at all but relies on a comparison of a current stimulus with a predefined target, it might not really be considered a working-memory task (as compared to the 1-back or 2-back task). Distinguishing the latter two (thereby involving a serious detection of the level of WML) yielded classification results of approximately 75% on average (again based on a window size of 48 trials or 120 s). However, the classification accuracy rapidly diminished when the time interval after which the decision had to be made decreased. For instance, for the distinction between 0-back and 2-back tasks the accuracy level was reduced from approximately 85% on average for a window size of 120 s (i.e., one block) to approximately 65% for a window size of 2.5 s (i.e., one trial). Unfortunately, the authors do not report on the accuracy of the more interesting distinction between 1-back and 2-back tasks for small window sizes, but based on the abovementioned results it can be expected to be rather low. Thus, the overall classification results in this study seem to be sound and encouraging, but they are not overwhelming with regard to the prospects of yielding a fast real-time classification of WML as it is needed for developing workload-adaptive instructional environments. 
We assume that one reason for the moderate classification accuracy for short windows of analysis might be the fact that the motor responses required to complete the tasks as well as other artifacts like eye blinks were not excluded from the analysis in the study of Brouwer et al. (2012). As these artifacts are known to cause strong EEG signals, they can be expected to increase the overall noise level of the data, thereby potentially drowning out the weaker signals resulting from the different levels of WML implemented in the study. Accordingly, it might be a better option to define windows of analysis in such a way that a time interval from at least approximately 125 ms before any keypress to approximately 125 ms after any keypress is excluded from the analysis to filter out the strongest motor signals for the sake of data quality. This is probably most important during classifier training. Using such a procedure during classifier application would in any case be quite difficult to implement in a realistic online scenario.
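The proposed exclusion of a time interval around each keypress can be sketched as a sample mask. This is a minimal illustration under our own assumptions: the function name `keep_mask`, the 256 Hz sampling rate, and the keypress time in the usage lines are hypothetical, and only the 125 ms margin and the 2.5 s trial length come from the text.

```python
# Sketch only: mark EEG samples to keep after excluding a margin around
# each keypress. Sampling rate and keypress time below are assumptions.
import math


def keep_mask(n_samples, sampling_rate_hz, keypress_times_s, margin_s=0.125):
    """Return a list of booleans; False marks samples within the margin
    before or after any keypress."""
    keep = [True] * n_samples
    for t in keypress_times_s:
        first = max(0, math.floor((t - margin_s) * sampling_rate_hz))
        last = min(n_samples - 1, math.ceil((t + margin_s) * sampling_rate_hz))
        for i in range(first, last + 1):
            keep[i] = False
    return keep


# One 2.5 s trial at an assumed 256 Hz with a single keypress at 1.0 s.
mask = keep_mask(n_samples=640, sampling_rate_hz=256, keypress_times_s=[1.0])
n_kept = sum(mask)
```

Spectral features would then be computed only from the samples where the mask is `True`. Because the mask depends on keypress times that are known only after the response, this kind of filtering is easier to apply offline during classifier training than during online application.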

To conclude, several studies have recently been conducted to classify different levels of WML based on oscillatory EEG data. Although many authors report results that are quite impressive at first sight, a closer look usually reveals severe issues with regard to perceptual-motor confounds of the workload manipulations chosen. These confounds render it rather difficult to unequivocally attribute classification accuracies to levels of WML alone. Moreover, as the study of Brouwer et al. (2012) demonstrated, even if motor signals are not confounded with levels of WML, they might nevertheless create high levels of noise in the EEG data that can complicate the effective detection of the more subtle signals resulting from cognitive states. This is particularly the case when time intervals that contain strong motor signals are not excluded from the EEG analysis, which is true for all studies reviewed in this section. However, this type of temporal filtering will probably not suffice to counteract the problem of perceptual-motor artifacts in general. A better option with regard to this issue would probably be the development of more sophisticated methods for the analysis of oscillatory EEG data based on Independent Component Analysis (ICA, Makeig et al., 1996). ICA aims at reconstructing independent sources of neural activity inside the brain from the EEG data observed at surface electrodes. Once ICA-based methods become sufficiently sophisticated to rule out that classifiers strongly rely on the contribution of perceptual and motor sources, confounds like the ones observed in the abovementioned studies might become less severe (see Zander et al., submitted, for an example of what this type of methodological approach might look like). However, until these more advanced ICA-based methodologies are available for practical application, we would strongly recommend avoiding perceptual-motor confounds when training classifiers for the detection of different levels of WML.

Thus, the rationale that we would advocate with regard to our instructional goal in this paper comprises three parts: First, we would suggest avoiding realistic instructional tasks of different WML for classifier training because these tasks will necessarily be confounded with regard to their perceptual and motor requirements. Second, we would encourage using very well designed tasks from working memory research without perceptual-motor confounds for sound classifier training. And third, in order to render these working-memory tasks relevant with regard to our practical goal, methods of cross-task classification are indicated for the later application of the trained classifier to instructional target tasks. However, as the studies reviewed in this section might have sufficiently demonstrated, avoiding perceptual-motor confounds that can be erroneously picked up by classifiers remains a challenge even when using low-level working-memory tasks for classifier training. In the next section we will argue that overcoming this challenge is an important prerequisite for the successful application of cross-task classification methods.

# **HOW TO ALLOW FOR CLASSIFIER TRANSFER? THE CASE FOR CROSS-TASK CLASSIFICATION**

The method of cross-task classification for detecting WML is based on the use of EEG data recorded while subjects solve simple and short, but theoretically well-defined working-memory tasks that induce differences in WML without inducing other substantial differences (particularly with regard to perceptual or motor processing). These tasks are used for classifier training. After the levels of WML induced by these tasks have been calibrated, the classifiers can then be applied to target tasks, for instance, more complex instructional tasks. For cross-task classification (unlike within-task classification), all trials of the training data can be used to calibrate the classifier. Subsequently, each trial of the test data serves as new and independent input data to test the separability of the target classes according to the pretrained classifier model. However, it requires quite sophisticated passive BCI methods to train classifiers on working-memory tasks and to apply these classifiers to more complex learning materials for the detection of different levels of WML by means of cross-task classification. The main challenges of this approach are (1) identifying or designing appropriate training tasks without confounds for different levels of WML, (2) identifying EEG features related to WML, and (3) developing machine learning algorithms enabling cross-task classification of these features (Schölkopf and Smola, 2002; Lal et al., 2004; Besserve et al., 2007; Brugger et al., 2008). Achieving high classification accuracies requires a good generalizability of the classifier as well as suitable working-memory tasks for classifier training that induce the same type and level of WML as the instructional target tasks.
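The train-on-one-task, test-on-another logic described above can be sketched in a few lines. Note the assumptions: a nearest-centroid rule stands in for the SVMs and ANNs used in the studies discussed here, all function names are hypothetical, and the feature values are synthetic numbers invented purely to make the example run. They say nothing about real EEG data.

```python
# Sketch only: cross-task classification with a nearest-centroid stand-in.
# All feature values are synthetic and chosen for illustration.
def centroid(rows):
    """Column-wise mean of a list of equally long feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(features, labels):
    """Calibrate one centroid per class from the training-task trials."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_label.items()}


def predict(model, x):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))


def accuracy(model, features, labels):
    hits = sum(predict(model, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Training task (e.g. a working-memory task): ALL trials calibrate the model.
train_x = [[1.0, 0.2], [0.9, 0.3], [0.4, 0.8], [0.3, 0.9]]
train_y = ["low", "low", "high", "high"]

# Target task (e.g. an instructional task): every trial is unseen test data.
target_x = [[0.8, 0.4], [0.5, 0.7]]
target_y = ["low", "high"]

model = fit(train_x, train_y)
cross_task_accuracy = accuracy(model, target_x, target_y)
```

The point of the sketch is the data flow: no trial of the target task enters `fit`, so the resulting accuracy reflects only what transfers from the training task. Whether that transfer reflects WML or a shared confound depends entirely on how the training task was designed.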

With regard to our instructional goal, defining a successful classifier for cross-task classification would not only solve the core problem of perceptual-motor confounds when training classifiers with realistic tasks but would also address three other important challenges of detecting levels of WML during complex learning. First, collecting data for classifier training during complex learning tasks is very time consuming for learners, as a sufficiently high number of training trials are needed to calibrate a reliable classifier. Second, realistic learning tasks are not as reproducible as performance tasks because they inherently induce learning effects. Thus, using a set of similar learning tasks of identical complexity does not imply that these tasks will induce the same amount of WML onto learners if administered in a sequence. Rather, the first learning tasks in the sequence might yield much higher levels of workload than the last learning tasks in the sequence due to learners' knowledge gains (cf. the discussion of expertise-reversal effects in Section Cognitive workload and adaptive instruction: A challenge for instructional design). Third, a classifier trained on complex tasks usually remains quite opaque with regard to what characteristics of learners' mental state the classifier actually has picked up. Simple and theoretically well-defined working-memory tasks without perceptual-motor confounds, on the contrary, will be conceptually much more revealing. Thus, an appropriate cross-task classification approach might not only serve to overcome important methodological and practical problems of classifier transfer but might also solve theoretical problems. Up to now, there are only a few studies on applying cross-task classification procedures to workload detection and as can be seen from these studies, defining training tasks for cross-task classification is in no way trivial if perceptual-motor confounds are to be avoided. 
In the previous section, we already discussed the study by Heger et al. (2010), who used ANNs for cross-task classification but had strong motor confounds in the training tasks (resting state vs. flanker tasks or switching tasks) as well as in the target tasks (reading vs. typing). Accordingly, the strong cross-task classification results in this study presumably demonstrated the cross-task classification of motor activity vs. no motor activity but not a cross-task classification of levels of WML.

Gevins et al. (1998) also used ANNs trained on EEG data obtained during solving two types of simple working-memory tasks (verbal and spatial n-back). They were able to successfully discriminate between three levels of WML for each of these tasks. Additionally, they used cross-task classification to distinguish between low and high difficulty levels across the two tasks and reported classification performances above 85%. Although these results are promising at first sight, it has to be noted that subjects basically had to solve the same task (n-back) in a verbal and a spatial version. Using two versions of the same task for cross-task classification is prima facie much simpler than our goal of using rather diverse tasks like a working-memory task and a complex learning task. In our current context, we are obviously more interested in cross-task classification across diverse tasks than in cross-classifying different versions of the same task.

Another study that uses cross-task classification to distinguish different levels of WML across diverse working-memory tasks is the study of Baldwin and Penaranda (2012) that has already been discussed in the previous section. The authors achieved good within-task classification results for three different working-memory tasks. These results might, however, easily be due to perceptual-motor confounds as described above. For cross-task classification, ANNs were trained on one (or a set of two) of these tasks and tested on a different task, which was not included in the training set. In this analysis, only very poor classification accuracies could be achieved. In fact, according to the authors, the accuracies for each individual subject were near chance level for all possible combinations of tasks used for cross-task classification. From our perspective, these disappointing cross-task classification results could be due to the same perceptual-motor confounds we pointed out earlier as potential explanations of the good within-task classification results: If classifiers can pick up strong perceptual-motor signals that incidentally distinguish between different versions of individual tasks (high vs. low WML), a good accuracy for within-task classification can be expected although this classification does not rely on levels of WML. Unfortunately, if different tasks have different patterns of those distinct perceptual-motor signals, training classifiers on one of these tasks will not yield transfer to the other tasks. Alternatively, the authors themselves assume that the three different tasks they used in their study might induce highly dissimilar features in the EEG signal because they rely on separate types of working memory processes related to different neural structures.
In our view, this explanation addresses an important additional issue and strongly points to an urgent need for a better theoretical underpinning of passive BCI approaches to WML with regard to relevant working memory processes, their taxonomy, and their neural basis. A better theoretical understanding of different working memory processes is important not only for designing suitable sets of training tasks for cross-task classification but also for choosing appropriate strategies for data analysis, as will be pointed out in the next section. Thus, to achieve our instructional goal of designing passive BCIs for cognitive load estimation during learning, we tried to establish a strong link to state-of-the-art research on working-memory processes.

# **HOW IS WORKING MEMORY WORKING? THE NEED FOR THEORY ON WORKING MEMORY PROCESSES**

When referring back to the multicomponent working memory model by Baddeley (1986, 2000, 2012), which is the basis of most instructional approaches to cognitive load, sets of tasks like those chosen by Baldwin and Penaranda (2012) indeed seem to be quite disparate with regard to the process components involved. Baddeley's model distinguishes different temporal storage components for handling diverse representational codes as well as a central attentional control system for implementing executive functions. The tasks employed by Baldwin and Penaranda (2012) differed on the one hand in their requirements with regard to processing diverse representational codes like letters, numbers, sentences, and spatial locations. On the other hand, they also differed with regard to the attentional control processes involved like updating and matching a set of items (n-back), interleaving sentence processing with maintaining a list of items (reading span) or maintaining a list of items in the presence of distracting stimuli (Sternberg task). Thus, when considering the different working-memory components postulated by Baddeley, these tasks seem to be quite disparate. Accordingly, a strong cross-task classification performance between them might not be expected to occur. In other words, if three tasks are used for cross-task classification, but each task requires different working-memory processes with regard to storage as well as with regard to attentional control, then no strong classifier transfer between these tasks should be hypothesized unless the different processes involved in the tasks are characterized by an identical or very similar neural basis. Taking this line of reasoning very seriously, we propose that a first step for defining a theoretically sound set of training tasks for cross-task classification should be an exact definition of the cognitive processes that need to be picked up by a classifier.
From our perspective, this step is crucial to allow for an appropriate classifier transfer with regard to a later target context. Thus, for our current purposes of connecting working-memory tasks and learning tasks by means of cross-task classification it will be necessary to decide first, which of the different working-memory processes we want to track. Accordingly, we have to clarify which working-memory processes are pivotal when it comes to the role of WML for learning.

Instructional theories like CLT and CTML focus almost entirely on processing requirements with regard to the storing of specific representational codes in working memory. Thus, they would consider the capacity limitations of the storage components of working memory to be the main bottlenecks for learning, assuming that these components constrain the amount of new information that can be processed simultaneously in order to be integrated into long-term memory. According to these theories, the entire learning process is hindered when the amount of information a learner has to process exceeds the capacity of these storage components. Using this theoretical account as a basis for an approach to cross-task classification would probably imply defining different classifiers for different storage components in order to measure the WML with regard to one specific storage component. Consequently, each classifier would be trained on a set of working-memory tasks using only one representational code and therefore imposing WML only with regard to one specific storage component. Interestingly, however, recent research on the relation between working memory and instruction suggests a quite different theoretical perspective. Many studies in instructional contexts have shown that the individual capacity of the storage components of working memory (as measured, for instance, by simple span tasks) predicts comprehension and learning outcomes neither in the short run (e.g., in studies on multimedia learning, see Schüler et al., 2011 for an overview) nor in the long run (e.g., in studies on school achievement, cf. Hoard et al., 2008; Alloway and Alloway, 2010; Kornmann et al., submitted). Instead, the central executive functions of working memory (that is, the attentional control processes that are usually required to accomplish more comprehensive working-memory tasks; cf.
Daneman and Carpenter, 1980; Engle, 2002) seem to be highly predictive of successful learning and comprehension. These findings lend little support to the basic assumption of current instructional theories that capacity limitations of the storage components of working memory are pivotal for learning and comprehension. Rather, they provide evidence for an alternative idea, namely that the requirements imposed on attentional control processes in working memory might be the crucial constraints for learners that need to be handled by providing workload-adaptive instructional environments.

Working out this idea in further detail first raises the issue of defining more precisely the nature of attentional control processes in working memory. Baddeley's working-memory theory might be a good conceptual starting point in this respect because it already postulates a central attentional control system for implementing different executive functions. Baddeley has modeled this central executive after the supervisory attentional system postulated by Norman and Shallice (1986). According to Norman and Shallice, many cognitive and overt responses are elicited quite automatically and based on activations of well-learned schemata in long-term memory. However, in order to cope with situations that cannot be handled by schema-based processes alone without resulting in errors, they postulate a supervisory attentional system that actively steps in to inhibit automatic responses and to select more appropriate ones. According to this view, the main role of attention control or executive functions in working memory might be to replace automatically activated information by more suitable information in order to avoid inappropriate processing (for a similar recent account of working memory see Oberauer, 2009). This general idea of attentional control is well in line with a latent variable analysis of executive functions conducted by Miyake et al. (2000). In this study the authors measured individual task performances in a set of simple tasks—each loading on one single executive function—as well as in a set of complex executive tasks. This procedure allows analyzing the contributions of each executive function to the individual performances in the complex task. Confirmatory factor analysis indicated that there are three target executive functions that are clearly separable although they are also correlated with one another. 
According to this analysis, attention control can be decomposed into the three basic executive functions inhibition, shifting, and updating, all of which aim at replacing currently active working-memory contents. The authors summarize their results by claiming that it is important to recognize both the unity and diversity of executive functions, implying that there is one overarching common factor representing the overall control of attention in addition to three specific factors representing the three executive functions.

Coming back to our instructional context, the ideas of Norman and Shallice (1986) and Miyake et al. (2000) might suggest a distinction between (1) those instructional situations that allow for a schema-based processing and therefore require little attentional control and (2) those situations that highly depend on executive processes (e.g., for selecting relevant information and *inhibiting* irrelevant information, for *updating* and organizing memory contents or for *shifting* between different task demands). Interestingly, and in line with this reasoning, most instructional design principles elaborated by CLT and CTML indeed address issues of helping learners to focus on relevant and to ignore irrelevant information, to update and organize memory contents and to shift attention between different task demands (cf. Mayer, 2009). Therefore, these design principles fit nicely into the view that the amount of controlled attention required by a learning task is pivotal to define relevant WML in instructional contexts. Moreover, a focus on requirements with regard to controlled attention (instead of storage requirements) would also be well aligned to core assumptions of many recent working memory theories (e.g., Barrouillet et al., 2004; Engle and Kane, 2004; Cowan, 2005, 2014; Unsworth and Engle, 2007; Oberauer, 2009). These theories agree that attentional control demands rather than temporary storage demands constitute the core of WML. According to this perspective, the limited ability to control attention constitutes the essence of human working memory limitations and also explains individual differences in working memory capacity (Engle and Kane, 2004; Unsworth and Engle, 2007). Using this theoretical account as a basis for cross-task classification would first of all imply a very different approach to the selection of training tasks compared to the approach of focusing on the limitations of different storage components of working memory. 
In particular, in order to define a theoretically sound set of training tasks for cross-task classification, we would either need a training set that allows a classifier to pick up the overarching common factor representing the general control of attention or we would have to define three training sets that more specifically address the three executive functions identified by Miyake et al. (2000).

An important consequence of these theoretical insights from the cognitive psychology of working-memory in the context of designing passive BCIs for cognitive load estimation would probably be that we need to better specify the neural basis of the general control of attention in working memory as well as the neural signatures of the three target executive functions that constitute controlled attention. This is important mainly because it may help to inform feature selection for analyzing sets of training tasks for different aspects of WML. Based on functional magnetic resonance imaging (fMRI) techniques, it has been shown that working-memory tasks typically involve an interaction of the dorsolateral prefrontal cortex and the intraparietal sulcus, which are also described as anterior and posterior attentional systems in controlled attention tasks (Curtis and D'Esposito, 2003; Klingberg, 2009). This is well in line with known relations between WML and frontal and parietal changes in oscillatory power (Gevins et al., 1997; Jensen and Tesche, 2002; Stipacek et al., 2003; Missonnier et al., 2006; Krause et al., 2010; Sauseng et al., 2010). It has to be noted, however, that due to volume conduction EEG signals obtained in frontal and parietal electrodes do not necessarily indicate that the sources of these signals also originate from frontal or parietal areas of the cortex. Therefore, fMRI data and EEG data resulting from the same cortical processes do not necessarily need to map with regard to the localization of activated voxels in the cortex and the localization of electrodes on the surface of the skull that are responsive to these processes. Nevertheless, fMRI data are highly useful in our context in order to characterize the homogeneity or heterogeneity of brain areas involved in the different working-memory functions we are interested in. Neuroimaging studies (cf. 
Nee et al., 2012 for a recent review of 37 studies, Smith and Jonides, 1999; Sylvester et al., 2003) as well as some recent EEG studies (Chapman et al., 2007; Kiss et al., 2007; Hanslmayr et al., 2008; Sörqvist and Sætrevik, 2010; Nigbur et al., 2011) suggest, for example, that beyond the general fronto-parietal pattern of activation characteristic of WML, different frontal areas are involved in different working-memory functions. Detailed fMRI studies show, for instance, that the dorsolateral prefrontal cortex (Brodmann areas 9 and 46) is typically activated when holding spatial information, monitoring and manipulating information in working memory, using strategies to facilitate memory, or verifying representations that have been retrieved from long-term memory (Goldman-Rakic, 1994; Owen, 1997; Dobbins et al., 2002; Bor et al., 2004; for an overview see Owen et al., 2005). Closely related areas in the mid-ventrolateral prefrontal cortex (Brodmann areas 45 and 47) are activated during the selection, comparison and evaluation of stimuli held in short-term and long-term memory as well as during the holding of non-spatial information in working memory and the elaborated encoding of information into episodic memory (Goldman-Rakic, 1994; Petrides, 1994; Henson et al., 1999). Compared to the frontal cortex, the posterior areas of the brain that are crucial for working memory (including the parietal cortex) are mainly responsible for the maintenance of information in working memory. Thus, these areas have also been described as a "buffer for perceptual attributes" (Callicott et al., 1999). In line with this characterization, the involvement of different representational codes in working-memory tasks might lead to subtle differences with regard to the activation of these parietal areas. For instance, Knops et al.
(2006) demonstrated in an fMRI study that two types of n-back tasks (identity match: "stimulus is the same/not the same as *n* trials before" and comparison: "stimulus is smaller/larger than *n* trials before") with numerical vs. verbal materials activated slightly dissimilar areas in the intraparietal sulcus. Besides prefrontal and parietal areas, the anterior cingulate cortex (ACC) also shows increased activation during working-memory tasks. Activity in this brain region is often related to increased effort, error detection, and attention (Callicott et al., 1999). Moreover, the brain regions crucial for working memory do not operate in isolation from each other but communicate. For instance, a study by Honey et al. (2002) has shown that the connectivity of fronto-parietal brain networks increases when working memory is involved.

What are the implications of these findings from the neuroscience literature on working memory and executive functions with regard to potential approaches to the cross-task classification of WML? First, in line with the functional evidence from cognitive psychology (e.g., Miyake et al., 2000) the neural basis of working memory also seems to be characterized by unity (e.g., fronto-parietal network activation) as well as diversity (e.g., differential involvement of specific brain areas). There seem to be networks that are involved in all working-memory tasks whereas other networks are only involved in specific working-memory tasks. Second, we take this pattern as evidence that it might make sense to look out for a quite broad neural signature indicating the load onto a generic common factor representing the overall control of attention (cf. Klingberg, 2009). Third, although there is evidence that particular areas are involved when it comes to specific working-memory functions (e.g., certain executive functions or functions related to particular representational codes), it might be difficult to distinguish these different functions by means of their neural signatures in the EEG signal. Differences in the areas involved are quite subtle so that detecting these specific signatures might require a further development of more sophisticated methods for the analysis of oscillatory EEG data based on ICAs (Makeig et al., 1996), also including the activity of networks (Mullen et al., 2010). These future methods might allow for better defining (source-based) features, which might enable classifiers to pick up these rather subtle differences with regard to the activated brain areas. In line with this evidence, in our own attempts to classify EEG data from working-memory tasks, we will first try to identify an overarching factor of controlled attention involved in WML. This factor is assumed to represent the unity of executive functions according to the analysis of Miyake et al. 
(2000) and is considered to rest on the same neural basis (i.e., fronto-parietal networks, Klingberg, 2009) for different learning materials, independent, for instance, of the representational codes involved (e.g., numerical or verbal) or of the specific executive functions required (e.g., inhibition, shifting, updating). Having specified more precisely the nature of the working-memory processes we want our classifier to pick up, based on the cognitive as well as on the neuroscientific working-memory literature, we can now continue to develop a theoretically sound set of training tasks for these processes.

Defining a classifier (based on fronto-parietal patterns of activation in EEG data) that allows extracting the amount of controlled attention required by a task (independent of the specific representational codes or executive functions involved) presupposes a set of at least two different working-memory tasks for classifier training, each without perceptual-motor confounds, and both differing substantially with regard to the representational codes and executive functions they involve (to ensure that the classifier will not extract these specific features). An ideal set of working-memory tasks for classifier training in our instructional setting would consist of two working-memory tasks that are both known to be highly correlated with achievements in learning tasks (potentially indicating that the executive functions involved in these tasks are relevant for learning). At the same time they should not be highly correlated with each other (potentially indicating that these tasks do not involve the same combination of executive functions). In fact, a combination of an n-back task with a complex span task, as used in the study of Baldwin and Penaranda (2012), fulfills these constraints. Both tasks are standard paradigms in working-memory research that predict learning outcomes very well (Daneman and Carpenter, 1980; Engle, 2002; Kornmann et al., submitted) but correlate only weakly with each other (Kane et al., 2007; Redick and Lindsey, 2013), indicating that different executive processes might be involved in task performance. For instance, a reading span task requires *shifting* between a rather complex semantic processing task and a rather simple additive *updating* of a set of items. In contrast, the n-back task requires a complex *updating* of a set of items involving a replacement (*inhibition*) of previous set members and a reordering of the sequence of set members as well as a *shifting* between these updating task demands and a simple identity-matching task.
In the light of this analysis it is quite astonishing that the promising combination of an n-back task and a complex span task in the study of Baldwin and Penaranda (2012) yielded no positive cross-task classification result with regard to the Sternberg task. One reason for this finding might, of course, be that there were differential perceptual-motor confounds of the three tasks in that study (as already mentioned above). Another reason, however, might be that the Sternberg task used in the study does not really represent a working-memory task in the sense that it involves executive demands like *updating*, *shifting*, or *inhibition* (in addition to maintaining a set of items): Subjects had to maintain a sequentially presented list of items for 2 s and after this phase they had to decide whether a certain probe was present in the set or not. Subsequently, a new list was presented. Thus, this task does not involve maintaining a memory set and working on this set or on other items in an interleaved fashion. Therefore, the Sternberg task resembles a short-term memory task much more than a working-memory task. In line with this reasoning, Corbin and Marquer (2013) argue that the Sternberg task can be considered a working-memory task only under very specific conditions, for instance, when additional experimental constraints are imposed on subjects that increase the processing load. Thus, in terms of neural signatures, the Sternberg task may mainly rely on neural networks not involved in executive control whereas the n-back task and the complex span task may rely on neural networks required for executive control, but may differ with regard to the specific mixture of executive functions required for task performance, which would explain their rather weak correlation on a behavioral level.
In essence, as our overview of the working-memory literature on the behavioral and the neural level reveals, it will not be sufficient to just select a random working-memory task or a random combination of working-memory tasks for successful classifier training when cross-task classification is intended. Rather, a strong link to state-of-the-art research in the cognitive psychology and cognitive neuroscience of working-memory processes is necessary to select appropriate sets of working-memory tasks that are characterized by overlapping task demands with regard to the target construct (controlled attention) and differential mixtures of specific executive functions so that the resulting classifier is not constrained to a specific mixture of these functions.

# **WHAT HAVE WE LEARNED SO FAR? A SUMMARY IN FIVE LESSONS**

As a summary on how the different issues raised in this paper might be overcome in order to achieve the instructional goal of designing passive BCIs for cognitive load estimation during learning we compiled the following list of five lessons learned. These lessons do not only result from research on cognitive load in instructional design (Section Cognitive workload and adaptive instruction: A challenge for instructional design) and from studies on EEG measures of workload during instruction (Section Promises and drawbacks of EEG-based measurement of WML during learning: The issue of perceptual-motor confounds), but also from our own initial studies (Section Study 1: Realistic instructional materials imposing different levels of WML: Do single-subject single-trial EEG data allow for a classification? And what is being classified?), from studies on workload classification (Section Are there successful classifications of levels of WML based on oscillatory EEG data? A critical reflection on recent studies and some suggestions for improvement) and on cross-task classification (Section How to allow for classifier transfer? The case for cross-task classification) as well as from the behavioral and neurocognitive working-memory literature (Section How is working memory working? The need for theory on working memory processes). It should become clear from this list that many details of how to choose tasks for classifier training and how to analyze the EEG data resulting from these tasks might be of crucial importance, even though these details might appear quite oversophisticated at first sight. In order to demonstrate the importance of these lessons we applied them meticulously in designing a follow-up study (Walter et al., submitted) that resulted in a successful cross-task-classification of WML. This study will be summarized in the next section.

#### **LESSON 1 – THE ROLE OF TASK ORDER IN THE CONTEXT OF LEARNING**

A randomized task order is usually a good choice for classifier calibration and testing to avoid any confounds of the classes to be detected with the time of presentation. However, when it comes to learning tasks, randomization is hard to implement. WML is a highly volatile learner characteristic that changes during learning not only due to ubiquitous increases in the complexity of instructional materials and task requirements over time (objective complexity) but also due to learners' increasing levels of knowledge (degree of expertise, cf. Section Cognitive workload and adaptive instruction: A challenge for instructional design). Based on these changes, task order plays a different role for learning tasks than for performance tasks where a randomized task order can typically be implemented without hesitation. For learning tasks, randomization is usually not appropriate. In most cases, certain materials will be too complex to be understood at an early phase of instruction while they might be quite easy at a later point in time. This is due to the fact that an increase in expertise over time (i.e., learning) allows handling more complex contents than before without the need to activate a larger number of knowledge structures in working memory (due to the chunking of information into more complex concepts, cf. Section Cognitive workload and adaptive instruction: A challenge for instructional design). Accordingly, two materials with identical objective complexity might impose very different levels of WML at different points in time during learning due to learners' knowledge gains. As a consequence, it is not advisable to present too complex learning tasks at the beginning of an instructional sequence and too simple learning tasks at the end of an instructional sequence. Otherwise, no learning might occur at all because the learning tasks are either much too simple or much too complex in relation to learners' current knowledge prerequisites. 
This inherent dynamic of WML in the context of learning tasks due to the ongoing acquisition of knowledge implies that it makes no sense to present learning tasks in a randomized task order, as one would typically do with performance tasks: Realistic learning tasks differ from performance tasks precisely in that they cannot be presented in an arbitrary order without jeopardizing their character as learning tasks. This specific characteristic, however, has serious implications for classifier testing and training when designing passive BCIs for cognitive load estimation during learning: Most importantly, when instructional materials need to be presented in a fixed simple-to-complex sequence in any realistic learning context, it is inevitable that testing a classifier on such a fixed sequence of learning materials is subject to a confound of objective levels of task complexity and presentation time. From our perspective, it is necessary to take the potentially negative effects of such a confound due to fixed task order explicitly into account (e.g., the effect that a classifier might pick up slow drifts in the EEG signal over time as useful information about task complexity due to the correlation of both factors). We advocate two measures to counteract these potentially negative effects. First, with regard to feature selection for classifier definition, we suggest using %ERD/ERS ratios (Pfurtscheller and Lopes da Silva, 2005) instead of simple power values to explicitly cancel out slow changes in the EEG signal over time. The %ERD/ERS ratios are calculated by using a baseline signal for comparison that immediately follows or precedes the target window of analysis. Second, for parity reasons we recommend using the same simple-to-complex sequence not only for classifier testing on learning tasks but also for classifier training on working-memory tasks (by presenting the working-memory tasks needed to train the classifier in a similar fixed order).
When applying %ERD/ERS ratios to both task sequences (classifier training and testing), the baseline as well as the window of analysis will both be subject to potential drifts in the EEG signal over time for all sequences. Accordingly, these drifts cannot be erroneously picked up by a classifier.
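
Why a baseline-relative measure is robust against slow drifts can be illustrated with a small numerical sketch. It assumes the common definition in which the band power in the analysis window (A) is expressed relative to the band power in the adjacent baseline window (R) as (A − R) / R × 100; sign conventions differ across publications, and the function name and all power values below are invented for illustration. The sketch further assumes a drift that scales both windows by the same factor; an additive offset would not cancel completely.

```python
def erd_ers_percent(activity_power, baseline_power):
    """Band power in the analysis window relative to the adjacent baseline.

    Negative values indicate a power decrease (ERD), positive values a
    power increase (ERS), under the sign convention assumed here.
    """
    if baseline_power <= 0:
        raise ValueError("baseline power must be positive")
    return 100.0 * (activity_power - baseline_power) / baseline_power


# Hypothetical alpha-band power early in the session (arbitrary units).
baseline_early, activity_early = 10.0, 8.0

# Later in the session a slow drift has scaled the whole signal by 1.5.
drift = 1.5
baseline_late, activity_late = baseline_early * drift, activity_early * drift

# Raw power differs between the two points in time ...
print(activity_early, activity_late)
# ... whereas the baseline-relative value is identical for both.
print(erd_ers_percent(activity_early, baseline_early))
print(erd_ers_percent(activity_late, baseline_late))
```

A classifier fed with raw power could use the drift as a proxy for presentation time, and thereby for task complexity in a fixed sequence; with the ratio as a feature, this information is no longer available under the stated assumption.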

#### **LESSON 2 – CLASSES BASED ON SUBJECTIVE RATINGS OF WML**

The inherent dynamic of WML due to learning explained in Lesson 1 also has a second implication beyond task order, namely that it might not be useful to define classes of learning tasks for classifier training and testing based on their objective task complexity. Rather, the definition of these classes needs to take into account that the same learning tasks could elicit different levels of WML depending on learners' current degree of expertise as well as on their individual working-memory capacity (cf. Section Cognitive workload and adaptive instruction: A challenge for instructional design). For instance, in a block of 40 learning tasks of similar objective complexity, the first 20 tasks might impose rather high levels of WML on learners due to their novelty, whereas the last 20 tasks might impose lower levels of WML on learners due to schema formation (chunking). Accordingly, in line with instructional theories (cf. Section Cognitive workload and adaptive instruction: A challenge for instructional design) we propose that classes based on subjective ratings of WML may be more appropriate for classifier training and classifier testing than classes based on objective task complexity, at least when the target construct for classification is the WML actually experienced by learners (which is the case in the context of adaptive instructional environments). For illustration, consider the example of learning how to solve two-digit addition tasks in the octal numeral system (base-8 number system, e.g., 23 + 77 = 122). This task will be quite demanding with regard to working-memory resources when encountered for the first time. However, with increasing practice, it will become rather easy, comparable to two-digit addition tasks in the decimal numeral system that we are all acquainted with (e.g., 23 + 77 = 100).
Thus, the WML actually experienced by learners when solving two-digit addition tasks in the octal numeral system will change to a large extent over time, whereas the objective task complexity remains constant. This example should make clear that the classes that we need to detect in instructional contexts are subjective levels of WML and not objective levels of task complexity.
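The arithmetic of the octal example can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `add_octal` is hypothetical and not part of the study materials.

```python
def add_octal(a: str, b: str) -> str:
    """Add two numbers written in base 8 and return the sum written in base 8."""
    total = int(a, 8) + int(b, 8)   # parse the digit strings as base-8 values
    return format(total, "o")       # write the sum back as base-8 digits


print(add_octal("23", "77"))  # 122  (19 + 63 = 82 in decimal)
print(23 + 77)                # 100  (the same digits read as decimal numbers)
```

The same digit strings thus yield different results depending on the numeral system, which is what makes the octal task demanding for novices although its objective complexity matches that of decimal addition.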

#### **LESSON 3 – AVOIDING PERCEPTUAL-MOTOR CONFOUNDS**

The windows used for analyzing EEG data during classifier training should not contain perceptual-motor confounds that could be erroneously picked up by the classifier. This implies that classifier training cannot be conducted on realistic learning tasks but needs to be restricted to more controlled experimental working-memory tasks. Additionally, methods like ICA and connectivity analyses need to be further developed to better rule out motor artifacts in the future (cf. Section Promises and drawbacks of EEG-based measurement of WML during learning and instruction: The issue of perceptual-motor confounds, Study 1: Realistic instructional materials imposing different levels of WML: Do single-subject single-trial EEG data allow for a classification of these levels?, Are there successful classifications of levels of WML based on oscillatory EEG data? A critical reflection on recent studies and some suggestions for improvement, How to allow for classifier transfer? The case for cross-task classification). These methods might eventually allow excluding from the set of features for classifier training those components or networks that clearly reflect perceptual or motor processing.

#### **LESSON 4 – COMBINING WORKING-MEMORY TASKS**

The working-memory tasks used for classifier training should not be confined to a single type of task. Instead, combinations of at least two working-memory tasks should be used (preventing the classifier from picking up task-specific features). All working-memory tasks used should be predictive for achievements in learning tasks (potentially indicating that the executive functions involved in these tasks are relevant for learning). At the same time, the tasks should differ with regard to the specific executive functions they require (updating, shifting, inhibition) and with regard to the representational codes (numbers, letters, words etc.) they involve in order to train a generic classifier sensitive to general changes in the requirements on controlled attention irrespective of specific executive functions or representational codes. This can be ensured by selecting working-memory tasks that only have a weak correlation with each other (potentially indicating that these tasks do not involve the same combination of executive functions). Additionally, it might be wise to avoid visual-spatial working-memory tasks for classifier training because these tasks might afford rehearsal strategies that produce different patterns of eye movements for different levels of WML (cf. Section How is working memory working? The need for theory on working memory processes).

#### **LESSON 5 – DIFFICULTY LEVEL OF WORKING-MEMORY TASKS**

One should be prepared that the level of WML induced by controlled experimental working-memory tasks used for classifier training could differ from the level of WML induced by realistic learning tasks used for classifier testing. This is a natural implication of the fact that different tasks are used for classifier training than for classifier testing (cf. Section How to allow for classifier transfer? The case for cross-task classification). Thus, a calibration procedure for cross-task classification might be needed that allows for a scaling to adjust a trained classifier to new testing materials.

# **STUDY 2: CROSS-TASK CLASSIFICATION OF BASIC WORKING-MEMORY TASKS AND REALISTIC INSTRUCTIONAL TASKS**

Based on the five lessons outlined in the previous section, we extended the design of our Study 1 (outlined in Section Study 1: Realistic instructional materials imposing different levels of WML: Do single-subject single-trial EEG data allow for a classification of these levels?, for details see Walter et al., 2011) in several ways (Walter et al., 2013a; Walter et al., submitted). The goal of Study 2 was to develop efficient classification methods to differentiate levels of WML based on cross-task classification. We trained an SVM on two well-controlled working-memory tasks (reading span, numerical n-back), assuming theoretically that they would induce similar types of WML (and accordingly similar types of neural processing) as the two realistic learning tasks we used subsequently for classifier testing (algebra problems). By taking into account the lessons summarized in the previous section, we were able to achieve promising results for cross-task classification. Twenty-one subjects participated in the study; however, due to technical problems during data collection, five subjects had to be discarded from the analysis. For the majority of the remaining 16 subjects the cross-task classification reached a significant classifier performance (*p* < 0.05, permutation test), with classifier accuracies up to 95% for the best subjects. The classifier was able to distinguish the subjectively easier vs. harder set of algebra problems with a mean classification accuracy of 73%. In the following we will outline the methodological strategies that we used to overcome the inherent challenges of using cross-task classification for cognitive workload assessment in the context of learning.

# **TASK DESIGN**

#### *Training tasks*

According to the considerations in Section How is working memory working? The need for theory on working memory processes, it will not be sufficient to just select a random working-memory task or a random combination of working-memory tasks to train an efficient classifier for cross-task classification. Rather, it will be necessary to select appropriate sets of working-memory tasks overlapping in task demands with regard to our target construct (controlled attention demands due to WML) and differing with regard to specific executive functions (see Lesson 4). As described above, the n-back task in combination with the reading-span task fulfills these requirements. Thus, we decided to use this combination of tasks in Study 2. Both tasks predict achievements in learning tasks very well but correlate only weakly with each other, which might be due to their different profiles with regard to executive functions. A reading-span task requires *shifting* between a semantic processing task and a simple additive *updating* of a set of items. The n-back task, on the contrary, requires an *updating* of a set of items together with a replacement (*inhibition*) of previous set members interleaved with an identity-matching task. In addition to these differences, we designed the tasks to cover various representational codes (see Lesson 4). In the n-back task, single-digit numbers had to be memorized (except the number seven, which is the only two-syllable number), whereas the reading-span task was based on memorizing letters (from the set B, F, H, J, L, M, Q, R, X) and on verifying sentences. By these design decisions we tried to prevent the classifier from picking up task-specific features. We implemented three levels of task difficulty for each task in a within-subject block design (i.e., 1-back, 2-back, 3-back and reading spans with 2, 4, 6 letters/sentences).
We ensured that subjects received identical visual displays in all levels of task difficulty to avoid perceptual confounds (see Lesson 3).

#### *Testing tasks*

For classifier testing, two different types of algebra word problems were designed (subtraction problems and fraction problems). For both types of tasks, again three levels of task difficulty were implemented that strongly differed with regard to the level of WML they would induce. To avoid perceptual confounds of task difficulty (see Lesson 3), all word problems contained exactly four numerical pieces of information at each level of task difficulty (either numbers or fractions). Additionally, the word problems were matched for number of words. Thus, we ruled out that more difficult word problems resulted in more numerical or text information or more complex visual displays leading to differences in processing, which may show up in the EEG (see Section Are there successful classifications of levels of WML based on oscillatory EEG data? A critical reflection on recent studies and some suggestions for improvement).

In order to solve the subtraction problems, a variable *x* had to be calculated by selecting and integrating appropriate numbers. For the first level of task difficulty, subjects merely had to select one out of the four numbers (*x* = *a*). The second level required them to subtract two relevant numbers (*x* = *a* − *b*). The third level, finally, asked for the difference between two differences, thus involving all four given numbers (*x* = (*a* − *b*) − (*c* − *d*)). This manipulation of task difficulty was based on the taxonomy by Schnotz et al. (2010) and can be assumed to induce strong differences in WML.

The fraction problems required subjects to select and instantiate algebraic expressions containing fractions and multiplications in order to determine the value of a variable *x*. For the first level of task difficulty, the appropriate expression contained one fraction (*x* = *c* · *a/b*). For the second difficulty level the expression contained the same fraction two times (*x* = *c* · *a/b* + *d* · *a/b*). The third level contained two different fractions (*x* = *c* · *a/b* + *d* · *e/f*). This manipulation of task difficulty was based on the taxonomy by Scheiter et al. (2010) and can be assumed to induce strong differences in WML.

In all four tasks, subjects had to react with key presses to provide their answers. They could react either with "yes" or "no" in the working-memory tasks (identity matching in the n-back task and sentence verification in the reading-span task) or with one out of four multiple-choice options in the learning tasks used for classifier testing. The motor reaction was exactly the same for both working-memory tasks and both learning tasks. Furthermore, there were no differences in the motor reaction between different levels of task difficulty. No feedback was given to subjects in order to avoid confounding task difficulty and ratio of negative feedback (cf. Section Are there successful classifications of levels of WML based on oscillatory EEG data? A critical reflection on recent studies and some suggestions for improvement).

For modeling a realistic learning scenario (cf. Lesson 1), the mathematical word problems were presented in a simple to complex sequence for each learner (within-subject block design). Accordingly, a randomized task order for classifier testing could not be implemented. For parity reasons, we used the same simple to complex sequence for each working-memory task to generate the data used for classifier training. To rule out potential negative effects of such a fixed task order (e.g., slow drifts in the EEG signal that can inform the classifier) we calculated %ERD/ERS ratios using a baseline for comparison that immediately follows or precedes the window of analysis, thus counteracting the potential confound of time and level of WML (for details see below).

#### **DETAILS OF EEG MEASUREMENT**

With regard to the windows used for analyzing EEG data, we ensured that they did not contain any motor events or perceptual confounds that could be picked up by a classifier. The windows of analysis always ended at least 125 ms before any keypress to exclude EEG signals based on motor planning (Grabner and De Smedt, 2011). Furthermore, a time interval of 125 ms after any keypress was excluded from the analysis to avoid potential motor artifacts (see Lesson 3). Although these decisions with regard to the EEG analysis as well as with regard to the task designs already yielded highly controlled tasks without perceptual-motor confounds, we additionally reduced eye-movement artifacts by optimizing the data pre-processing step (Walter et al., submitted). We applied the artifact-reduction method described by Schlögl et al. (2007).

To counteract the potential confound of time and level of task difficulty, we calculated %ERD/ERS ratios based on power values using two time intervals for each task, one interval that strongly imposes the task-specific WML (activation interval, *Ia*) and one interval that imposes only a low level of WML (resting interval, *Ir*). Defining these intervals requires a theoretically based understanding of the working-memory processes involved in the tasks (cf. Section How is working memory working? The need for theory on working memory processes). We ensured that each trial of a task included both intervals to avoid systematic effects of drifts in the EEG signal (see Lesson 1).
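To make the drift argument concrete, the following sketch computes a percentage ratio from the band power in *Ia* and *Ir*. The exact formula is not stated in this section; the sketch assumes the commonly used definition (activation power minus resting power, divided by resting power, times 100), and the function name `erd_ers_percent` is hypothetical.

```python
def erd_ers_percent(power_activation: float, power_rest: float) -> float:
    """Relative band-power change of Ia against Ir, in percent.

    Assumed convention: negative values indicate a power decrease in Ia
    relative to Ir, positive values a power increase.
    """
    if power_rest <= 0:
        raise ValueError("resting power must be positive")
    return (power_activation - power_rest) / power_rest * 100.0


# Early trial: power 6.0 in Ia and 8.0 in Ir (arbitrary units).
print(erd_ers_percent(6.0, 8.0))    # -25.0

# Later trial: a slow drift has scaled both intervals by 1.5.
print(erd_ers_percent(9.0, 12.0))   # -25.0
```

Because both intervals come from the same trial, a slow multiplicative drift affects numerator and denominator alike and cancels out, whereas raw power values would differ between the early and the late trial.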

In the n-back task, every 2000 ms a new digit was presented for identity matching. Subjects could react by pressing "yes" or "no" anytime. For *Ia* we used the time interval from stimulus onset until 125 ms before keypress (in this interval identity matching and updating/replacement is required). For *Ir* we used the time interval from 125 ms after keypress until the next stimulus appeared (in this interval mainly storage is required).
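The interval definition for a single n-back trial can be sketched as follows. Times are given in milliseconds relative to the recording; the names `nback_intervals` and `GUARD_MS` are hypothetical, and rejecting trials with an empty interval is an assumption of this sketch, not a documented step of the study.

```python
from typing import NamedTuple, Tuple

GUARD_MS = 125  # margin excluded before and after each keypress


class TrialIntervals(NamedTuple):
    activation: Tuple[int, int]  # Ia: (start, end) in ms
    rest: Tuple[int, int]        # Ir: (start, end) in ms


def nback_intervals(stim_onset: int, keypress: int, next_onset: int,
                    guard: int = GUARD_MS) -> TrialIntervals:
    """Derive Ia and Ir for one n-back trial from its event times."""
    ia = (stim_onset, keypress - guard)   # stimulus onset until 125 ms before keypress
    ir = (keypress + guard, next_onset)   # 125 ms after keypress until next stimulus
    if ia[1] <= ia[0] or ir[1] <= ir[0]:
        raise ValueError("keypress too close to a trial boundary")
    return TrialIntervals(activation=ia, rest=ir)


# Digit shown at 0 ms, response at 800 ms, next digit at 2000 ms.
print(nback_intervals(0, 800, 2000))
# TrialIntervals(activation=(0, 675), rest=(925, 2000))
```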

In the reading-span task, a list of sentences was presented for verification (e.g., oranges are blue). After each sentence, subjects could react by pressing "yes" or "no" anytime. After keypress a fixation cross was presented for 500 ms before a letter (that had to be remembered for later recollection) was presented for 1000 ms. For *Ia* we used the time interval in which the letter was presented (in this interval a shifting between the semantic processing task and an updating of the set of letters to be remembered is required). For *Ir* we used the time interval from 125 ms after keypress (sentence verification) until the next letter to remember appears (in this interval mainly storage is required).

In both learning tasks, a series of word problems was presented to subjects. First, subjects read a page with facts as long as they wanted. After keypress a problem statement appeared that had to be solved. Subjects could react by pressing a next-button anytime and could then select one out of four multiple-choice options to provide their solution. Subsequently, a fixation cross was presented for 500 ms. For *Ia* we used the time interval from the onset of the problem statement until 125 ms before keypress for selecting the problem solution (in this interval the necessary facts had to be remembered and inferences had to be drawn for problem solution). For *Ir* we used the time interval from 125 ms after keypress (i.e., after providing the problem solution) until the next page of facts appeared (in this interval no cognitive processing was required).

#### **DATA ANALYSIS AND RESULTS**

Concerning the feature selection, we focused on frontal and parietal electrodes with regard to the spectral power within the theta (4–7 Hz) and alpha (8–13 Hz) frequency bands (see Section Promises and drawbacks of EEG-based measurement of WML during learning: The issue of perceptual-motor confounds and Study 1: Realistic instructional materials imposing different levels of WML: Do single-subject single-trial EEG data allow for a classification of these levels?). For spectral analysis, an autoregressive model was calculated with the Burg algorithm. The %ERD/ERS values for the 10 postulated frequencies (4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 Hz) were calculated with regard to 10 electrodes (F3, Fz, F4, FC1, FC2, CP1, CP2, P3, Pz, P4), resulting in 10 features for each of the 10 electrodes (i.e., 100 features in total). For the sake of consistency, these features were used across all subjects and tasks.
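For readers unfamiliar with autoregressive spectral estimation, the following minimal sketch shows Burg's method and the evaluation of the resulting spectrum at the ten frequencies from 4 to 13 Hz for one channel. The model order, the sampling rate, and the synthetic test signal are assumptions chosen for illustration; they are not the settings of the study.

```python
import cmath
import math
import random


def burg_ar(x, order):
    """Burg's method: return AR coefficients [1, a1, ..., ap] and residual power."""
    a = [1.0]
    err = sum(v * v for v in x) / len(x)
    ef = list(x[1:])    # forward prediction errors
    eb = list(x[:-1])   # backward prediction errors
    for _ in range(order):
        num = -2.0 * sum(f * b for f, b in zip(ef, eb))
        den = sum(f * f for f in ef) + sum(b * b for b in eb)
        k = num / den                       # reflection coefficient
        a_ext = a + [0.0]
        last = len(a_ext) - 1
        a = [a_ext[i] + k * a_ext[last - i] for i in range(len(a_ext))]
        err *= 1.0 - k * k
        # Update both error sequences from the old values, then shift them.
        ef, eb = ([f + k * b for f, b in zip(ef, eb)][1:],
                  [b + k * f for f, b in zip(ef, eb)][:-1])
    return a, err


def ar_power(a, err, freq_hz, fs_hz):
    """Power spectral density of the AR model at a single frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    denom = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
    return err / (fs_hz * abs(denom) ** 2)


fs = 128.0                 # assumed sampling rate in Hz
rng = random.Random(0)     # fixed seed for a reproducible toy signal
signal = [math.sin(2.0 * math.pi * 10.0 * t / fs) + 0.01 * rng.gauss(0.0, 1.0)
          for t in range(256)]              # 2 s of a noisy 10 Hz oscillation

coeffs, residual = burg_ar(signal, order=2)  # order 2 is enough for this toy signal
spectrum = {f: ar_power(coeffs, residual, f, fs) for f in range(4, 14)}
peak = max(spectrum, key=spectrum.get)
print(peak)  # frequency (Hz) with the largest estimated power
```

Applying such an estimate to *Ia* and *Ir* of each trial, at each of the 10 frequencies and 10 electrodes, yields the power values from which the 100 %ERD/ERS features are formed.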

Cognitive-load ratings were obtained after each block of working-memory tasks and after each learning task by means of a 7-point Likert scale. These ratings were used for defining task classes (according to subjective levels of WML, see Lesson 2) and for adjusting the complexity levels of the different tasks (see Lesson 5). Since the tasks (n-back task, reading-span task, and subsequent learning tasks used for classifier testing) differed substantially with regard to the subjective level of WML they induced for each of their three difficulty levels, a calibration procedure was applied (see Lesson 5). We selected two out of the three difficulty levels for each task based on subjects' error rates and cognitive-load ratings. For instance, we removed the first level of task difficulty of the subtraction problems from the analysis because it turned out that these tasks were much simpler than the other tasks. The procedure of selecting two out of three difficulty levels for analysis allows us to calibrate difficulty levels across different tasks. For the remaining differences in the difficulty levels across the four tasks, the scaling method of the EEG power values was improved by scaling the data over trials and by adjusting the range of the diverse datasets: First, our training data were z-score normalized, resulting in a centered, scaled version of the input data. Subsequently, we calculated the means and standard deviations of the power values of the training set. Finally, the testing data were normalized with regard to these means and standard deviations calculated from the training data.
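The normalization step described above can be sketched as follows: the scaling parameters are estimated on the training data only and then reused for the testing data. The function names are hypothetical, and using the population standard deviation (rather than the sample standard deviation) is an assumption of this sketch.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Estimate per-feature mean and standard deviation from training trials."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return means, stds


def apply_scaler(rows, means, stds):
    """Z-score trials with previously estimated training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 30.0]]   # two training trials, two features
test = [[5.0, 20.0]]                 # one testing trial

means, stds = fit_scaler(train)
print(apply_scaler(train, means, stds))  # [[-1.0, -1.0], [1.0, 1.0]]
print(apply_scaler(test, means, stds))   # [[3.0, 0.0]]
```

Reusing the training statistics keeps the testing data on the scale the classifier was trained on, which is the point of the calibration across tasks.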

In order to define classes for the training and testing of the classifier, we analyzed the relation between the objective complexity levels of the four tasks and the subjective cognitive-load ratings. In line with our expectations (see Lesson 1) and with instructional theory (see Section Cognitive workload and adaptive instruction: A challenge for instructional design), it turned out that due to learning the levels of subjectively experienced cognitive load decreased over time within each level of objective task difficulty. As a consequence, we used subjective cognitive-load ratings for defining classes. For this purpose, we calculated the mean value of the subjectively experienced WML for all working-memory tasks solved by a subject as a personal cut-off point to define two classes (low vs. high level of WML according to cognitive-load ratings) for all four tasks (i.e., we defined the two classes for both types of word problems according to the same subjective cut-off point). Using this procedure requires adjusting the objective complexity levels of the tasks for classifier training and testing according to the calibration procedure described in the last paragraph to ensure that the number of trials in both classes is sufficiently balanced for each individual task (i.e., it is necessary to select appropriate complexity levels of all tasks so that all tasks have a similar range of WML as measured by the subjective cognitive-load ratings).
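The class definition by a personal cut-off can be sketched in a few lines. The function names and the example ratings are hypothetical, and assigning ratings that equal the cut-off to the low class is an assumption, since the treatment of ties is not specified here.

```python
from statistics import mean


def personal_cutoff(wm_task_ratings):
    """Mean cognitive-load rating over a subject's working-memory tasks."""
    return mean(wm_task_ratings)


def label_trials(ratings, cutoff):
    """Assign each rated task to the low or high WML class."""
    return ["high" if r > cutoff else "low" for r in ratings]


wm_ratings = [2, 3, 5, 6]        # 7-point ratings from the working-memory blocks
cutoff = personal_cutoff(wm_ratings)
print(cutoff)                                # 4

# The same cut-off is applied to the ratings of the learning tasks.
print(label_trials([3, 5, 4, 6], cutoff))    # ['low', 'high', 'low', 'high']
```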

For cross-task classification, we used an SVM with an RBF kernel that was trained on the combination of the two working-memory tasks and applied to the two realistic learning tasks. As a result, the two levels of WML of the learning tasks could be successfully classified on a single-subject single-trial basis. On average a classification accuracy of 73% could be achieved for the 16 subjects. The classification results were significant for 22 out of the 32 classifications conducted (for each of the 16 subjects two classifications were conducted for the two learning tasks). It has to be noted, though, that these results are based on classification decisions after each trial of the learning tasks. Taking the results of Brouwer et al. (2012) into account, much higher accuracies could be expected if classification decisions were based on more than one trial of a task.
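How the significance of a single classification might be assessed with a permutation test can be sketched as follows. This is a simplified variant that shuffles the true labels against fixed classifier predictions instead of retraining the classifier on permuted labels; it is not necessarily the procedure used in the study, and all names and example data are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of trials whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """Estimate how often shuffled labels reach the observed accuracy."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# 40 trials with balanced classes; 30 of them are classified correctly.
y_true = ["low"] * 20 + ["high"] * 20
y_pred = ["low"] * 15 + ["high"] * 5 + ["high"] * 15 + ["low"] * 5

obs, p = permutation_p_value(y_true, y_pred)
print(obs)        # 0.75
print(p < 0.05)   # True
```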

In conclusion, the methodological strategies that we used to implement the lessons learned so far yielded quite promising results. Our decisions with regard to how to design and analyze Study 2 allowed for a successful cross-task classification of WML when solving complex learning tasks. Thus, the most important precondition for the goal of developing workload-adaptive instructional environments based on passive BCIs, namely the availability of a continuous and non-intrusive assessment of WML during solving realistic learning tasks, seems to be satisfied.

### **DISCUSSION**

In this paper, we discussed and applied several lessons learned on how to implement passive BCI methods to assess WML during learning. A study based on these lessons (Study 2) yielded very promising results for a cross-task classification of WML when solving complex learning tasks (see Section Study 2: Cross-task classification of basic working-memory tasks and realistic instructional tasks). In the remainder of this paper we will discuss the prospects for further improving cross-task classification results and for applying our methods outside the lab in realistic environments, for instance by expanding the feature space used for workload assessment in real-world studies.

# **FUTURE PROSPECTS FOR IMPROVING CROSS-TASK CLASSIFICATION RESULTS**

To improve cross-task classification accuracies even further, a couple of additional methods could be applied to the tasks we used in our studies. One issue that might be addressed by these methods is the fact that, due to learners' knowledge gains over time, the learning tasks needed to be presented in a fixed simple to complex order, which results in a confound of presentation time and objective levels of task complexity. Due to non-stationarities in EEG signals over time, a classifier might pick up slow drifts in the feature space as useful information about task complexity due to the correlation of both factors. Although we tried to diminish this effect by analyzing %ERD/ERS ratios, in future studies, additional methods for covariate shift adaptation might help to further alleviate these drifts and to improve classification accuracies (Satti et al., 2010; Spüler et al., 2012b). Another approach would be to present all tasks in a randomized order of difficulty levels to evaluate the influence of non-stationarities on the cross-task classification results. It should be clear from Lesson 1, however, that this manipulation is not very plausible as a model of a realistic learning situation (which is our target scenario) but should only be used to clarify the role of non-stationarities for methodological reasons.

Another interesting option that arises from the task design of Study 2 is to explicitly include very high levels of WML in the analysis that intentionally result in a "cognitive overload." Chanel et al. (2008) found that an excessive increase in WML might lead to disengagement that is detectable by a reversed EEG pattern (i.e., theta-desynchronization and alpha-synchronization), which corresponds to the neural signature of low WML. In our own study we found that absolute difficulty levels varied substantially across different types of tasks. We used the calibration procedure described above to remove difficulty levels that turned out to be too easy (e.g., arithmetic level 1) or too difficult (e.g., algebra level 3) for learners in terms of error rates and subjective ratings of cognitive load. Interestingly, however, exploratory analyses of these excluded conditions indeed revealed, in line with Chanel et al. (2008), the similarity of the neural signatures of conditions with very low levels of WML and very high levels of WML (in the sense of cognitive overload), thereby potentially indicating task disengagement (Walter et al., 2013b). This pattern might be exploited in future studies to better define a zone of optimal workload during learning (high but not too high) and to assess this zone by applying passive BCI methods. Cross-task classification might benefit from this approach because tasks that induce cognitive overload could then be excluded from the high WML class.

# **ASSESSING WORKLOAD OUTSIDE THE LAB IN REAL-WORLD ENVIRONMENTS**

Although we applied our approach of assessing different levels of WML to realistic instructional materials (algebra word problems), the laboratory tasks used for data collection were nevertheless highly controlled with regard to temporal parameters and perceptual-motor confounds. However, if we consider our target scenario of applying passive BCIs in realistic learning scenarios, the question remains whether the methods we developed can be applied successfully outside the lab in real-world environments, where several parameters are expected to differ from the lab. Of course, it is highly interesting in the first step to evoke WML by using well-controlled laboratory tasks, as laboratory tasks and designs allow for high control over most of the relevant factors and provide a clear structure that can be aligned to cognitive theories of WML. On the other hand, using well-controlled laboratory tasks moves the whole study further away from realistic settings. It is likely that the brain works differently in real-world scenarios, where tasks appear to be more relevant for the person involved. McDowell et al. (2013) discuss this hypothesis, postulating that the activity of the brain increases with the relevance of the tasks it is involved in. Hence, it is of great importance that theoretical and methodological approaches developed in laboratory studies are validated in real-world environments. Fortunately, first steps in this direction (e.g., Zander et al., submitted) provide evidence that today's state-of-the-art theories explain at least partially what brain activity is related to realistic workload in real-world environments.

One of the most important issues with regard to real-world studies will be the occurrence of perceptual-motor confounds that have been extensively discussed in this paper. In realistic environments we would clearly expect workload-related behavioral activity, such as specific changes in the pattern of behavior. Due to volume conduction such behavioral activity will be added to the EEG signal through muscularly, ocularly, or translatorly induced effects. Generally, these changes in the potential recorded at EEG electrodes will be bigger in amplitude than those resulting from cortical activity. Hence, it is likely that BCI classifiers might strongly rely on this behavioral information. Thus, in uncontrolled realistic application scenarios it might be quite unclear whether a single-trial detection system used for classification is based on brain activity or on perceptual-motor confounds. This mirrors the methodological problems of many of the studies discussed in this paper.

As a consequence, it is important to better investigate which signals a BCI classifier picks up when moving out of the controlled laboratory environments. As the EEG is a complex mixture of different signals, complex methodologies have to be applied to identify sources contributing to it. Recent research revealed that methods like ICA (Makeig et al., 1996) are powerful tools for solving this problem. Independent sources of signals can be identified by their temporal, spectral, and spatial properties. From this information the type of process might be inferred that underlies a specific aspect of the overall signal. For instance, data resulting from artifacts might be better discriminated from cortical activity, and cortical activity might be divided into different processes that can be related to specific cognitive and affective states. ICA can even be used to reveal the connectivity of different cortical structures. Such network activity likely carries relevant information about workload, which is not only based on the activation of specific brain areas but also on the information transfer between different areas (e.g., the fronto-parietal networks implementing controlled attention). Different statistical approaches might be utilized in the future to investigate network activity in the EEG. A description of an open-source toolbox including the most commonly used methods can be found in Mullen et al. (2010).

To conclude, we consider passive BCIs to be powerful tools for improving adaptive learning environments in the future, provided that they are properly validated, first, by combining controlled experimental studies and complex real-world studies and, second, by investigating independent components, network activities and classification properties. Moreover, our passive BCI approach might also be transferred to many other aspects of Human-Computer Interaction, using cognitive state monitoring for introducing fundamentally new types of input, like implicit interaction (see Zander et al., 2014).

#### **ACKNOWLEDGMENTS**

This research was funded by the Leibniz ScienceCampus Tübingen "Informational Environments." Carina Walter was a doctoral student of the LEAD Graduate School [GSC1028], funded by the Excellence Initiative of the German federal and state governments.

# **REFERENCES**


Petrides, M. (1994). Frontal lobes and behaviour. *Curr. Opin. Neurobiol.* 4, 207–211. doi: 10.1016/0959-4388(94)90074-4


and K. Gilleade (London: Springer), 67–90. doi: 10.1007/978-1-4471-6392-3\_4


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 14 February 2014; accepted: 10 November 2014; published online: 09 December 2014.*

*Citation: Gerjets P, Walter C, Rosenstiel W, Bogdan M and Zander TO (2014) Cognitive state monitoring and the design of adaptive instruction in digital environments: lessons learned from cognitive workload assessment using a passive brain-computer interface approach. Front. Neurosci. 8:385. doi: 10.3389/fnins.2014.00385*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Gerjets, Walter, Rosenstiel, Bogdan and Zander. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# What we can and cannot (yet) do with functional near infrared spectroscopy

# *Megan Strait\* and Matthias Scheutz*

*Human-Robot Interaction Laboratory, Department of Computer Science, Tufts University, Medford, MA, USA*

#### *Edited by:*

*Jan B. F. Van Erp, TNO, Netherlands*

*Reviewed by:*

*Koen Koenraadt, Amphia Hospital, Netherlands*

*Hein Daanen, TNO, Netherlands*

*\*Correspondence:*

*Megan Strait, Human-Robot Interaction Laboratory, Department of Computer Science, Tufts University, 200 Boston Avenue, Medford, MA, 02155, USA e-mail: megan.strait@tufts.edu*

Functional near infrared spectroscopy (NIRS) is a relatively new technique complementary to EEG for the development of brain-computer interfaces (BCIs). NIRS-based systems for detecting various cognitive and affective states such as mental and emotional stress have already been demonstrated in a range of adaptive human–computer interaction (HCI) applications. However, before NIRS-BCIs can be used reliably in realistic HCI settings, substantial challenges concerning signal processing and modeling must be addressed. Although many of those challenges have been identified previously, the solutions to overcome them remain scant. In this paper, we first review what can be currently done with NIRS, specifically, NIRS-based approaches to measuring cognitive and affective user states as well as demonstrations of passive NIRS-BCIs. We then discuss some of the primary challenges these systems would face if deployed in more realistic settings, including detection latencies and motion artifacts. Lastly, we investigate the effects of some of these challenges on signal reliability via a quantitative comparison of three NIRS models. The hope is that this paper will actively engage researchers to facilitate the advancement of NIRS as a more robust and useful tool to the BCI community.

**Keywords: functional near infrared spectroscopy, brain–computer interfaces, human–computer interaction, reliability, signal processing**

# **1. INTRODUCTION**

The primary aim of human–computer interaction (HCI) research is to develop methods and tools to facilitate effective interaction between people and computer systems. While current modes of interaction mainly rely on tactile communication, there is a growing body of research on using brain-based sensors as an additional information channel (e.g., Tan and Nijholt, 2010; Zander and Kothe, 2011; Strait et al., 2014a). Socially-aware systems that can *capture and respond to* changes in anxiety, attention, arousal, and other user states have been found to be more effective in engaging people (e.g., Szafir and Mutlu, 2012). Hence, research on neurophysiological signals has been gaining the attention of researchers in human–computer interaction in recent years (e.g., Bainbridge et al., 2012; Frey et al., 2014; Strait and Scheutz, 2014).

Amongst this work, electroencephalography (EEG) is the most widely used technology in HCI, as it provides high temporal resolution and has general success in measuring a wide array of user states such as workload, attention, fatigue, and affect (Frey et al., 2014). However, EEG has limited spatial resolution, thus constraining its applicability for measuring region-specific brain activity. Conversely, high spatial resolution can be achieved using fMRI, but at a cost to both participant mobility and temporal resolution (e.g., Canning and Scheutz, 2013; Frey et al., 2014). Hence, functional near infrared spectroscopy (NIRS; also referred to as fNIRS or fNIR) is a promising alternative, achieving some middle ground in spatial and temporal resolution as well as mobility between the EEG and fMRI technologies (e.g., Villringer et al., 1993; Hoshi, 2011).

Within the human–computer interaction community, NIRS has been primarily used in two ways: (1) for *evaluating* human–machine interactions (e.g., Hirshfield et al., 2009a, 2011a), and more recently, (2) as additional input *to adapt* user interfaces and computer systems based on the user's cognitive state (e.g., Solovey et al., 2012), which is generally referred to as a *passive* brain–computer interface (Zander and Kothe, 2011).

While there are a growing number of EEG-based brain–computer interfaces (BCIs) (e.g., George and Lecuyer, 2010), the development of NIRS-based BCIs has generally lagged behind (e.g., see **Table 1** vs. Frey et al., 2014). Moreover, as a consequence of the NIRS literature being dispersed across many publication outlets in HCI, neuroimaging, and brain–computer interface communities (and furthermore, of inconsistencies in results within and between these fields), the efficacy of NIRS-BCIs in realistic human–robot interactions (Canning and Scheutz, 2013) and HCI settings (Strait et al., 2013b) is relatively unknown and unexplored.

To date, NIRS has been shown to be quite successful in measuring a number of cognitive and affective states (e.g., Cutini et al., 2012) in highly controlled laboratory settings. Yet, substantial challenges persist concerning signal processing for more realistic settings, many of which have already been identified (e.g., Hoshi, 2003, 2007; Plichta et al., 2007; Cutini et al., 2011; Hoshi, 2011; Krusienski et al., 2011; Kirilina et al., 2012; Canning and Scheutz, 2013; Hu et al., 2013; Strait et al., 2013b, 2014b). And while these challenges are not necessarily unique to NIRS (e.g., see the limitations of using functional magnetic resonance imaging, Cacioppo et al., 2003; Logothetis, 2008, and EEG, Lotte, 2011; Ohara et al., 2011; Brouwer et al., 2013; Frey et al., 2014), we are still lacking adequate solutions to overcome them.

#### **Table 1 | Continued**

| Publication | Topic |
| --- | --- |
| Girouard et al., 2010 | Review of NIRS for human–computer interaction |
| Herff et al., 2013a | Single-trial quantification of workload |
| Hirshfield et al., 2009b | Assessment of syntactic workload |
| Hu et al., 2013 | Reduction of inter-trial variability using resting-state connectivity |
| Izzetoglu et al., 2010 | Motion artifact cancelation using Kalman filtering |
| Kirilina et al., 2012 | Method for separation of superficial and cortical signals |
| Lloyd-Fox et al., 2010 | Review of the utility and limitations of NIRS for use with infants |
| Lu et al., 2010 | Assessment of resting-state connectivity |

*In red: useful reviews of NIRS instrumentation and applications. In blue: NIRS-based investigations of neural signals that reflect affective states in particular.*

Hence, the goals of this paper are the following: to provide (1) a review of what can be currently done with NIRS-BCIs for measuring cognitive and affective user states relevant to HCI, (2) a discussion of the effects of naturalistic and unconstrained interaction settings of HCI on signal reliability, and (3) a quantitative comparison of the performance of three modeling approaches in these more realistic settings. We start with a review of the technology, including an overview of current NIRS-based systems and their limitations. We then identify and evaluate some of the challenges for model reliability, and conclude with a discussion of directions for future research to overcome those challenges.

# **2. FUNCTIONAL NEAR INFRARED SPECTROSCOPY**

Functional near infrared spectroscopy is a neuroimaging technique (similar to fMRI) for measuring changes in blood oxygenation (Hoshi, 2011). Due to the differences in absorptivity between oxygenated and deoxygenated hemoglobin and the transparency of biological tissue to light in the 700–1000 nm range, NIRS is able to capture the hemodynamic changes via the coupling of infrared light emission and detection (Hoshi, 2011). Change in hemoglobin concentration following a precipitating stimulus is referred to as the hemodynamic response (HDR) and can be used to make inferences about functional areas of the brain. Unlike EEG, however, most NIRS-based studies find that the onset of the response lags behind the triggering event by at least 1–2 s (e.g., Cui et al., 2011). The response then peaks 4–8 s after the stimulus onset and dips back down over the course of several more seconds as homeostasis is reestablished (e.g., Matthews and Pearlmutter, 2008; Hoshi, 2011). For detailed reviews of hemodynamics and NIRS instrumentation, see for example: Lloyd-Fox et al. (2010); Hoshi (2011); Ferrari and Quaresima (2012); Scholkmann et al. (2014).
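To make the timing of the hemodynamic response concrete, the following is a minimal sketch of a toy response curve. It is not a model taken from the studies cited here: the gamma-like shape, the function name `toy_hdr`, and the parameter values (a 1.5 s delay and a peak at 5.5 s) are assumptions chosen only so that the curve falls within the 1–2 s lag and 4–8 s peak described above.

```python
import math

def toy_hdr(t, delay=1.5, shape=4.0, scale=1.0):
    """Gamma-shaped toy response (illustrative assumption, not a fitted model).

    The curve is zero until `delay` seconds and peaks, with value 1.0,
    at delay + shape * scale seconds.
    """
    if t <= delay:
        return 0.0
    u = (t - delay) / scale
    peak = (shape ** shape) * math.exp(-shape)  # u**shape * exp(-u) at u = shape
    return (u ** shape) * math.exp(-u) / peak

SAMPLING_HZ = 6.25  # hypothetical sampling rate, used only to build a time axis
times = [i / SAMPLING_HZ for i in range(int(30 * SAMPLING_HZ))]
response = [toy_hdr(t) for t in times]

# Time of the largest sampled value; it should fall inside the 4-8 s window.
peak_time = times[max(range(len(response)), key=response.__getitem__)]
```

With these assumed parameters the sampled curve stays at zero for the first 1.5 s and reaches its maximum between 5 and 6 s, which illustrates why any detector driven by such a signal cannot respond instantaneously.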

#### **2.1. USING NIRS TO MEASURE COGNITIVE STATES**

Within the field of HCI, discrimination of workload-based states is the predominant application of NIRS (e.g., Nozawa, 2010; Hirshfield et al., 2011a; Ayaz et al., 2012; Coffey et al., 2012; Herff et al., 2013a,b; Schudlo and Chau, 2014). There are also a growing number of affect-related studies using NIRS, with the primary focus on the detection of negatively-valenced and high-arousal states (e.g., Tupak et al., 2014). **Table 1** shows a number of relevant NIRS-related publications and a summary of their topics. Additionally, there are several comprehensive reviews of the utility and limitations of NIRS in general (Hoshi, 2011; Cutini et al., 2012; Brigadoi et al., 2014; Tak and Ye, 2014) and for human–robot interaction (Canning and Scheutz, 2013) in particular.

Although this set of measurable states (i.e., workload, negative affect) is a subset of that which is achieved using EEG (i.e., workload, attention, vigilance, fatigue, error recognition, affect, engagement, flow, and immersion; see Frey et al., 2014), NIRS may serve as a complementary or alternative modality. Specifically, while some comparisons of EEG versus NIRS for workload detection found that NIRS is less effective across a population (i.e., better-than-chance classifications were observed for only 50% of participants using NIRS versus 80% of participants using EEG) (Coffey et al., 2012), NIRS has also been found to achieve better overall discrimination of two levels of workload compared to EEG (Hirshfield et al., 2009a). Hence, a combination of the two (both NIRS and EEG) may be more appropriate for general deployment in workload-related activities.

Moreover, as the prefrontal cortex shows functional coupling in response to emotionally-charged tasks (e.g., Strait et al., 2013a), NIRS may be of greater utility (than EEG) for the detection of such localized affect-related brain activity. For instance, recent EEG-based studies have shown recognition rates only in the mid-50% range for two-way classification (Frey et al., 2014), which is substantially less than what has been achieved in similar paradigms using NIRS, which show recognition rates in the mid-to-high 60% range (Heger et al., 2013). Although recent EEG-based research shows successful recognition rates of 85–90% for arousal and valenced states (Liu et al., 2011), artifacts arising from the electrical activity of facial muscles were not controlled for in this work. Given that such artifacts are both inherent to emotion induction paradigms and have been shown to have significant effects on frontal EEG channels (e.g., Heger et al., 2011), it is unlikely that the above results reliably reflect brain activity (versus EMG activity of facial muscles). Hence, NIRS may be a useful alternative for measuring affect-related activity. In particular, for NIRS-based affect-related studies (e.g., Aoki et al., 2011, 2013; Hirshfield et al., 2011a; Strait et al., 2013a; Strait and Scheutz, 2014; Tupak et al., 2014), the results are highly consistent across the various efforts and, moreover, across a diverse set of contexts (i.e., threat, working memory tasks, moral decision-making, human-robot interactions) in which detection rates significantly better than chance have been achieved. However, as this body of work, similar to Liu et al. (2011), relies on frontally-situated probes that are proximal to primary facial muscles, the measurements might still reflect some degree of EMG artifacts rather than brain activity alone.

Furthermore, as the majority of these studies have been conducted in offline settings, affect detection may still be premature for passive NIRS-BCIs. There exist but a few attempts (moreover, with mixed results) at single-trial and online decoding of affective states (specifically Luu and Chau, 2009; Heger et al., 2013; Peck et al., 2013). Regarding the detection of user preferences (e.g., affinity versus aversion), Luu and Chau originally showed an average classification accuracy of 80% in decoding users' preferences between two possible drinks in a single-trial NIRS paradigm (Luu and Chau, 2009). However, after an issue with the original methodology was identified (Dominguez, 2009), reanalysis yielded an average classification accuracy of 54% (Chau and Damouras, 2009) which was not significantly better than chance. Similarly, in an online classification paradigm, Peck and colleagues investigated preference decoding as a means of providing implicit ratings of movies (Peck et al., 2013). However, comparison of the NIRS-based recommendations (recommendations based on classification of the users' NIRS data) versus random movie recommendations did not show any significant difference. Despite the unsuccessful approaches to decoding of preference states, the work of Heger and colleagues suggests that offline experimentation on the detection of certain affective states may indeed extend to more realistic settings. In Heger et al. (2013), they showed three affect classes (high valence, high arousal, and high valence/arousal) could be reliably (63–69% average classification accuracies) discriminated from neutral for an eight-subject sample in an asynchronous classification paradigm. However, their recognition of high-valenced versus high-arousal states did not perform significantly better than chance (average accuracy of 53%), thus suggesting the granularity of passive NIRS-BCIs for affect recognition is limited.

#### **2.2. EXEMPLARS OF NIRS-BCIs**

While investigation into NIRS-based detection of affect is growing, on the forefront of state-of-the-art NIRS-BCIs is the development of NIRS as a passive input modality (referred to here as "NIRS-pBCI") based on workload-related user states. **Table 2** shows a detailed summary of known demonstrations of NIRS-pBCIs. Aside from the couple of aforementioned attempts at online affect detection (Heger et al., 2013; Peck et al., 2013), these systems are primarily based on the decoding of workload-related states (i.e., Matsuyama et al., 2009; Solovey, 2012; Solovey et al., 2012; Girouard et al., 2013; Afergan et al., 2014; Schudlo and Chau, 2014). Here we discuss three such systems in detail regarding their approaches to the online decoding of cognitive states as well as their current limitations.

#### *2.2.1. Reference channel/thresholding*

Matsuyama and colleagues created a simple, proof-of-concept NIRS-pBCI based on the detection of workload-related hemodynamic changes (Matsuyama et al., 2009). Their study was a preliminary attempt at using passive monitoring of users' cognitive state to adapt a robot's behavior. Using a 35-channel NIRS instrument, they measured participants' prefrontal cortex while they solved arithmetic problems. As a proof-of-concept of NIRS-based robot adaptivity, they developed their NIRS-pBCI to send a primitive motion command to a robot when it detected changes in hemoglobin associated with the arithmetic problem solving (i.e., when an increase in oxygenated hemoglobin was observed corresponding to the participant actively working on an arithmetic problem). They used a simple combination of thresholding and reference channel for noise subtraction to detect task-evoked changes in oxy-hemoglobin. Specifically, to avoid noise from widespread brain activity, they computed the difference between two regions: a target region and a reference region (F7-F4, coordinates according to the International 10–20 placement system). Then, using a single threshold (max F7-F4 difference in oxy-hemoglobin), their NIRS-pBCI would cause the robot to move whenever this threshold was surpassed. While there exist many sound BCIs for the direct control of robotic systems (e.g., Canning and Scheutz, 2013), their NIRS-BCI system was not intended to use workload-related activity to directly control a robot. Rather, it served as an effective demonstration that a NIRS-based BCI can passively monitor a person's cognitive workload to initiate behavioral changes in a robot. However, this work also exposed a particular shortcoming of NIRS that is an obstacle for its effectiveness in more realistic scenarios, namely that of onset detection latency (Canning and Scheutz, 2013). Specifically, using their approach to workload monitoring, the time between a participant beginning the arithmetic problem and the transmission of the motor control signal ranged from just a few seconds to over 15 s (Matsuyama et al., 2009). As task-related hemodynamic changes in oxygenated hemoglobin occur over several seconds (Coyle et al., 2007), this delay was (and is) somewhat unavoidable due to the inherent hemodynamics; however, recent work has demonstrated vast reductions in temporal delays to onset detection (Cui et al., 2010b), which suggests improvement may be possible.

*Model refers to the type of state information of interest, latency is the delay imposed by the signal processing on onset detection, and N indicates the population sample size.*
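The reference-channel and thresholding scheme attributed to Matsuyama et al. (2009) can be sketched roughly as follows. This is an illustrative reconstruction rather than their implementation: the function name `detect_onset`, the way the threshold is supplied, and the sample values are all assumptions, and the signals are sampled at 1 Hz only to keep the example readable.

```python
def detect_onset(target_hbo, reference_hbo, threshold, sampling_hz):
    """Return the time (in seconds from the start of the recording) of the first
    sample whose target-minus-reference oxy-hemoglobin difference exceeds
    `threshold`, or None if the threshold is never surpassed."""
    for i, (target, reference) in enumerate(zip(target_hbo, reference_hbo)):
        # Subtracting the reference region removes activity common to both regions.
        if target - reference > threshold:
            return i / sampling_hz
    return None

# Made-up oxy-hemoglobin values for a target and a reference region.
target = [0.0, 0.1, 0.1, 0.3, 0.6, 0.9, 1.2]
reference = [0.0, 0.1, 0.0, 0.1, 0.1, 0.2, 0.2]

# If the task started at t = 0, this value is the onset detection latency.
latency = detect_onset(target, reference, threshold=0.4, sampling_hz=1.0)
```

In this toy recording the difference first exceeds the threshold at the fifth sample, so the detector fires 4 s after task onset. The latency therefore depends both on how quickly the hemodynamic difference builds up and on how conservatively the threshold is set.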

#### *2.2.2. Temporal dynamics*

Similar to Matsuyama et al. (2009), we previously participated in the development of a passive NIRS-BCI aimed at adapting a robot's behavior based on a person's detected multitasking state (Solovey et al., 2012). A two-probe NIRS instrument (with four sources per probe) was used to image participants' prefrontal cortex, while they worked with two simulated robots on a human-robot team task. Here we designed a naive SVM (support vector machine) classification model based on gross temporal dynamics, built by the Sequential Minimal Optimization (SMO) algorithm available in the Weka (Waikato Environment for Knowledge Analysis<sup>1</sup>) library (Hall et al., 2009) and trained using data collected while participants performed a variant of the n-back task. Specifically, the SVM was trained on feature vectors containing every measure of amplitude of both oxy- and deoxy-hemoglobin over the course of a 40 s period of n-back performance. That is, for a device with a sampling rate of 6.25 Hz and a task period of 40 s, a single training example was a vector of 40 s × 6.25 samples/second × 2 signals (oxy and deoxy) × 2 probes × 4 sources/probe, or 4000 features. This naive approach was a first attempt at capturing temporal patterns over the full time course of a person performing the n-back task. The n-back task, rather than the human–robot team task, was used for training in order to avoid potential variations implicit in the team task, but we expected participants to show similar patterns in their NIRS data across both tasks as both induced similar levels of subjectively reported mental stress.
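The size of such a feature vector follows directly from the recording parameters given above. The sketch below reproduces the arithmetic and shows one way a trial could be flattened into a single vector; the constant names, the helper `flatten_trial`, and the channel-major ordering are assumptions made for illustration, not details of the original system.

```python
SAMPLING_HZ = 6.25        # samples per second
TASK_SECONDS = 40         # duration of one n-back period
SIGNALS = 2               # oxy- and deoxy-hemoglobin
PROBES = 2
SOURCES_PER_PROBE = 4

samples_per_channel = int(TASK_SECONDS * SAMPLING_HZ)   # 250 samples
n_channels = SIGNALS * PROBES * SOURCES_PER_PROBE       # 16 time series
n_features = samples_per_channel * n_channels           # 4000 features

def flatten_trial(trial):
    """Concatenate per-channel time series (a list of lists) into one vector.

    The channel-major ordering used here is an assumption.
    """
    return [value for channel in trial for value in channel]

# Placeholder data standing in for one 40 s trial.
trial = [[0.0] * samples_per_channel for _ in range(n_channels)]
feature_vector = flatten_trial(trial)
```

Because every sample of every channel becomes its own feature, the dimensionality (4000) is far larger than the number of training trials typically available per participant, which is one reason such a model may generalize poorly.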

In the human–robot team task, we hypothesized that adapting the level of a robot's autonomy would lead to better task performance and better perceptions of teamwork. Thus, while participants performed the team task, classifications of their mental workload dynamically adapted the autonomy of one of the robots according to the participant's multitasking state. An initial evaluation (Solovey, 2012; Solovey et al., 2012) showed successful task completion was significantly moderated by adaptivity: the dynamic adaptivity of the robot's autonomy improved task performance (82% of participants successfully completed the team task versus a baseline performance rate of 45%). This system was thus a substantial extension of Matsuyama et al. (2009), as it was the first NIRS-BCI to demonstrate effective improvements on a *realistic* task. However, in a recent series of reinvestigations (Strait et al., 2014b) of this system's classification performance, the average classification accuracy on an alternative dataset (of mental arithmetic) was only 54.5% (*SD* = 14.3%), suggesting limited generalizability of the system's signal processing. Additionally, this NIRS-pBCI was found effective (statistically better than chance) for only 10 of 40 participants in this alternative dataset (Strait et al., 2014b), which suggested limited utility for a more realistic population sample (i.e., when *N* = 40 versus *N* = 3 in the initial evaluation). This finding was consistent with one recent investigation (Coffey et al., 2012), which showed better-than-chance NIRS-based classifications for only 5 out of 10 participants on a workload task, but not with another recent investigation (Hirshfield et al., 2009a), which showed the reverse. Hence it remains unclear to date whether one modality or the other (EEG versus NIRS) is better for measuring workload-related signals, if either, or if it is largely a function of the signal processing methods employed.

<sup>1</sup>The Weka Java library contains a collection of common tools for data processing, classification, visualization, and other common analyses for data mining. For more information, see Hall et al. (2009).

#### *2.2.3. Combination temporal/spatiotemporal dynamics*

Schudlo and Chau (2014) also developed an online NIRS-BCI which was driven by mental arithmetic; however, unlike previous NIRS-pBCIs, their system also accommodated an unconstrained rest state. That is, while previous examples of NIRS-pBCIs have been demonstrated to function in online settings (e.g., Matsuyama et al., 2009; Solovey et al., 2012; Girouard et al., 2013), they all employ a synchronous training paradigm, which does not clearly allow the user to remain in an unconstrained resting state for an unfixed length of time. Given this gap in the NIRS-pBCI literature, Schudlo and Chau investigated whether prefrontal activity corresponding to mental arithmetic and unconstrained rest could be differentiated online at a practical accuracy for more realistic BCI use. Here the prefrontal cortex was sampled (using a nine-channel spectrometer) while participants selected letters from an on-screen scanning keyboard via intentionally controlled brain activity (mental arithmetic). To classify the hemodynamic activity, a combination of temporal features (extracted from the NIRS signals) and spatiotemporal features (extracted from dynamic NIRS topograms) was used in a majority vote combination of multiple linear classifiers. The online classification results showed an average accuracy of 77.4% (*SD* = 10.5%), with 8 of the 10 participants showing accuracies significantly above chance. Considering previous results showing significant detection accuracies in half or fewer of participants (Coffey et al., 2012; Strait et al., 2014b), the findings of Schudlo and Chau's work are particularly promising, and suggest that mental workload, when decoded with a more complex classification approach, may indeed be effective at driving a passive NIRS-BCI.
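The idea of combining several linear classifiers by majority vote can be illustrated with a small sketch. The weights, the two-element feature vector, and the helper names below are hypothetical and are not taken from Schudlo and Chau (2014); the sketch shows only the voting step, not their feature extraction.

```python
from collections import Counter

def linear_classifier(weights, bias):
    """Build a two-class linear classifier that labels a feature vector."""
    def predict(features):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return "task" if score > 0 else "rest"
    return predict

def majority_vote(labels):
    """Return the label chosen by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical classifiers, each weighting the features differently.
classifiers = [
    linear_classifier([1.0, 0.0], -0.5),
    linear_classifier([0.0, 1.0], -0.5),
    linear_classifier([0.5, 0.5], -0.5),
]

features = [0.9, 0.2]  # made-up feature values for one window of data
votes = [classify(features) for classify in classifiers]
decision = majority_vote(votes)
```

Here two of the three classifiers vote "task", so the combined decision is "task" even though one classifier disagrees. Using an odd number of classifiers avoids ties in the two-class case.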

#### **2.3. CONSIDERATIONS**

The previous section detailed three examples of state-of-the-art passive NIRS-BCIs, which are intended to serve both as proof-of-concept demonstrations of NIRS being successfully utilized as a passive input to a computer system, as well as illustrations of the challenges to achieving more robust NIRS-pBCIs. While there are numerous factors that contribute to the reliability and robustness of a NIRS-based system (e.g., Oriheula-Espina et al., 2010), we highlight some of the more pressing of these considerations, as well as the differences in signal processing that may contribute to decrements to signal reliability in moving from offline NIRS-based systems to online, passive BCIs.

In the standard, *offline* approaches to signal processing of NIRS data, the signals are short (3–60 s) and heavily filtered *post hoc*, roughly with the following measures: *detrending* (removal of low frequency signal artifacts and drift), *smoothing* (removal of systemic artifacts such as cardiac pulsations, respiration, and Mayer waves), *motion correction* (reduction of motion artifacts), and *data reduction* (removal of noisy or corrupt trials; averaging over repetitions of a task and/or truncation of the signal to reduce temporal variation; using summary statistics, e.g., area-under-the-curve, percent signal change to represent the overall hemodynamic response) (see Cui et al., 2010a; Oriheula-Espina et al., 2010; Hoshi, 2011; Brigadoi et al., 2014; Scholkmann et al., 2014; Tak and Ye, 2014). Such processing can result in dramatic reductions of signal noise; however, in online, passive settings, signal processing faces substantial challenges (Canning and Scheutz, 2013; Schudlo and Chau, 2014), three of which we detail here.

#### *2.3.1. Onset latency*

In moving from offline to fully online, unconstrained, real-time analysis, NIRS-pBCIs suffer a loss in signal processing as well as task information, which may result in increased signal noise and, hence, increased unreliability. Specifically, while offline paradigms have known onsets and offsets of the task stimulus, such an oracle is lost in an online, asynchronous scenario. That is, the difficulty in offline processing is primarily to identify whether a trial contains a significant change in hemodynamic activity in response to a particular stimulus, whereas in passive (online) systems, not only must we identify whether the signal contains a significant hemodynamic response, but also where such a response begins and terminates. While these fundamental differences in offline versus online protocols are not a new consideration for the signal processing or EEG communities (e.g., Lotte, 2011), they underscore a necessary consideration when transitioning from proof-of-concept (offline) systems to robust online, passive systems that has yet to receive much discussion regarding NIRS-based BCIs. For instance, while both Girouard and colleagues (Solovey, 2012; Girouard et al., 2013) as well as Schudlo and Chau (Schudlo and Chau, 2014) achieved accuracies that were relatively high for online classification of NIRS data with their NIRS-pBCIs, their systems implicitly required delays in the detection of task-related onsets of 20–40 s. Such delays limit the execution of passive NIRS-based adaptivity to only after a significant amount of time has elapsed.

#### *2.3.2. Participant mobility*

In addition to the loss of onset/offset oracles, signal noise is also problematic for passive BCI systems. In particular, unrestricted participant mobility can cause motion artifacts which degrade the NIRS signals (e.g., Canning and Scheutz, 2013). These artifacts can be caused by movement of the sensors on the skin, facial expressions, and head orientation (Matthews and Pearlmutter, 2008; Robertson et al., 2010). As techniques for online, asynchronous filtering are limited (e.g., Ayaz et al., 2010; Cui et al., 2010a), other attempts at combating motion artifacts include restricting participant mobility (e.g., using chin rests and mechanical supports, Coyle et al., 2007). Such restrictions are not particularly suited for *realistic* HCI settings and, furthermore, significantly reduce the value gained in using NIRS over fMRI. There are, however, a growing number of proposals for real-time motion artifact correction in natural environments, such as the adjustment of the signal based on statistical associations between oxy- and deoxy-hemoglobin values (Cui et al., 2010a), the use of linear quadratic estimation (Izzetoglu et al., 2010), and the use of complementary physiological measures (Falk et al., 2011).
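As an illustration of the first of these proposals, the correlation-based adjustment of Cui et al. (2010a) is, to our understanding, commonly written as x0 = (x − αy)/2 and y0 = −x0/α, where x and y are the oxy- and deoxy-hemoglobin signals and α is the ratio of their standard deviations. The sketch below implements that formula under the assumption that both signals are first mean-centered; the function name `cbsi` and the sample data are hypothetical.

```python
import statistics

def cbsi(hbo, hb):
    """Correlation-based adjustment of oxy- (hbo) and deoxy-hemoglobin (hb).

    Both signals are mean-centered first (an assumption of this sketch).
    The deoxy signal must not be constant, or alpha is undefined.
    """
    mean_hbo = statistics.fmean(hbo)
    mean_hb = statistics.fmean(hb)
    x = [v - mean_hbo for v in hbo]
    y = [v - mean_hb for v in hb]
    alpha = statistics.pstdev(x) / statistics.pstdev(y)
    # Components shared by both signals (e.g., motion) cancel in x - alpha * y.
    x0 = [(xi - alpha * yi) / 2 for xi, yi in zip(x, y)]
    y0 = [-xi / alpha for xi in x0]
    return x0, y0

# Made-up, perfectly anticorrelated signals: the adjustment leaves them unchanged.
hbo = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
hb = [-0.5 * v for v in hbo]
corrected_hbo, corrected_hb = cbsi(hbo, hb)
```

The example uses noise-free signals that already satisfy the method's assumption of negative correlation, so the output equals the input. An artifact that shifts both signals in the same direction would instead be attenuated, because it contributes with opposite signs to the two terms of x − αy.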

#### *2.3.3. Task-unrelated activity*

Lastly, task-unrelated activity such as resting-state fluctuations (Hoshi, 2011; Hu et al., 2013) or whole brain activity (Matsuyama et al., 2009) can degrade the signal quality. That is, separating task-related from unrelated cortical activity and signal noise can be difficult in some cases (e.g., Kirilina et al., 2012). For example, to separate task-related activity from unrelated whole brain activity, a reference channel outside the cortical region of interest has been used as a method to subtract out the task-unrelated activity (Matsuyama et al., 2009; Lu et al., 2010; Scarpa et al., 2013). This method, however, is impractical when multiple channels are not available (e.g., as was the case in Solovey et al., 2012). Moreover, assuming the reference is neutral (that the activity at the reference region is unrelated to the task-evoked activity), it relies on the quality of the channel placements, which is in itself a challenge for NIRS (Plichta et al., 2007). However, there are a couple of recent proposals for improving the identification of the sampling region using probabilistic registration methods of probe placement based on a reference-MRI database (Tsuzuki and Dan, 2014), as well as for separating superficial from cortical signals (Kirilina et al., 2012) and for using resting-state connectivity for reducing inter-trial variability (Hu et al., 2013).

# **3. INVESTIGATION**

To empirically investigate some of the aforementioned challenges to signal reliability, we collated a large NIRS dataset which we used in the construction of three basic models. The dataset contains (1) 18 training samples of resting versus workload-induced states, during which participant mobility was restricted; (2) 18 training samples (rest versus workload) where mobility was *unrestricted*; and (3) one testing sample of a more realistic task paradigm (i.e., prolonged rest and task periods similar to the human–robot team task in Solovey et al., 2012). Here, we first compare the performance of three basic NIRS models (using nine-fold cross-validation) when trained on data with and without participant movement. We then look at the relative model performances when applied to the more realistic testing sample.

#### **3.1. DATASET**

To compare the relative performance of three modeling approaches, as well as the effects of unrestricted participant mobility on model performance, we obtained the dataset from Strait et al. (2013b) for further analysis. The dataset contains recordings from 40 Tufts University students and staff (18 male; ages 18–45, *M* = 23.4, *SD* = 5.8) of prefrontal hemodynamic activity (recorded bilaterally using a two-channel ISS OxiplexTS, with a sampling rate of 6.25 Hz) while participants performed a workload-inducing arithmetic task. All participants were healthy, right-handed, with normal or corrected-to-normal vision, and reported no known history of neurological or psychiatric disorder. To secure the NIRS probes to the participant's forehead, we used a fitted black cap. To minimize signal noise due to ambient light, the room lights were turned off during the recording periods and all stimuli were presented via white text on a black background. Each participant performed two blocks of the workload task (each block comprising nine trials of arithmetic and nine trials of rest): one block with their motion restricted (using a zero-gravity chair and verbal instructions to remain motionless) and one with their motion unrestricted (using a simple office chair and verbal instructions to sit naturally). Although the trials were each separated by a 30 s fixation cross, here a trial refers only to the sampling period during which the participant was performing the task or resting. That is, each trial contains only measurements sampled while the participant was either actively performing the task or resting.

#### *3.1.1. Signal processing*

Prior to analysis, the dataset was first converted using the modified Beer-Lambert Law (MBLL), which yielded a measure of Hb (deoxygenated) and HbO (oxygenated hemoglobin) at each time point for each of two sensors positioned over the left and right prefrontal cortex (PFC), respectively, for a total of four time-series signals (left Hb, HbO; right Hb, HbO). We then detrended the signals by subtracting out the signal obtained from a low-pass filter (first-degree Savitzky-Golay with a cutoff frequency of 0.01 Hz) and smoothed the resulting signals using another low-pass FIR filter (first-degree Savitzky-Golay with a cutoff frequency of 0.15 Hz) to reduce the effects of systemic physiological artifacts (namely, cardiac pulsations and respiration). Lastly, we applied a correlation-based signal correction (Cui et al., 2010a) to reduce the effects of motion artifacts. Although all signal processing was applied *post hoc* and offline, online implementations of similar filters have been suggested to be equally effective (Cui et al., 2010a,b).
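The detrending and smoothing steps can be sketched as follows. Away from the edges of a signal, a first-degree Savitzky-Golay smoother over a symmetric window reduces to a centered moving average, so the sketch uses a plain moving average as a stand-in. The window lengths are hypothetical (the text specifies cutoff frequencies, not window sizes), the edge handling is a simplification, and the motion-correction step is omitted.

```python
def moving_average(signal, half_window):
    """Centered moving average; the window shrinks near the signal edges."""
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half_window)
        hi = min(len(signal), i + half_window + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def detrend_and_smooth(signal, trend_half_window, smooth_half_window):
    """Subtract a slow trend (long window), then smooth what remains (short window)."""
    trend = moving_average(signal, trend_half_window)
    detrended = [s - t for s, t in zip(signal, trend)]
    return moving_average(detrended, smooth_half_window)

# Hypothetical window sizes, in samples, applied to a made-up HbO signal.
raw_hbo = [0.01 * i + (0.5 if 40 <= i < 60 else 0.0) for i in range(100)]
cleaned_hbo = detrend_and_smooth(raw_hbo, trend_half_window=30, smooth_half_window=2)
```

The long window estimates slow drift, which is subtracted, while the short window suppresses faster fluctuations. In practice the window lengths would need to be chosen to match the intended cutoff frequencies at the device's sampling rate.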

#### *3.1.2. Modeling*

We constructed our models using the nine arithmetic and nine rest training trials (measured under restricted mobility conditions) based on three relatively successful approaches to classifying NIRS data: (1) the reference channel/thresholding approach described in Matsuyama et al. (2009), and the slightly more complex SVM-based approaches of (2) Cui et al. (2010b) and (3) Solovey et al. (2012). Here we implemented the reference channel/thresholding approach put forth by Matsuyama et al. (2009), such that we calculated the difference in oxy-hemoglobin between the two sensors placed bilaterally on the PFC (left PFC − right PFC). This roughly corresponds to the probe placement used in Matsuyama et al. (2009), with the probe measuring the left PFC placed more anterior and medial to the F7 region of interest. To classify the rest versus workload states, this model compares each time point in the left-right oxy-hemoglobin difference against a single baseline value (the average of the maximum differences observed during the resting trials). If the difference at the current time point exceeds the baseline value, the system classifies it as task-evoked activation. To compare more sophisticated approaches, we implemented a simple SVM model based on Cui et al. (2010b), which uses four features (the amplitudes of left and right oxy/deoxy) and again performs a classification of each time point. While this approach is still relatively simple, it capitalizes on the correlations between oxy/deoxy hemodynamics, as well as possible left/right synchronies. Lastly, we compared both approaches with the results of the model described in Solovey (2012) and Solovey et al. (2012), which uses the entire time course of a training sample (see Strait et al., 2014b for details).
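The thresholding model described above can be illustrated with a minimal sketch. The data layout (each trial as a list of left/right oxy-hemoglobin pairs), the function names, and the sample values are assumptions for illustration and do not reproduce the code used in the study.

```python
def calibrate_baseline(rest_trials):
    """Average, over resting trials, of each trial's maximum
    left-minus-right oxy-hemoglobin difference."""
    maxima = [max(left - right for left, right in trial) for trial in rest_trials]
    return sum(maxima) / len(maxima)

def classify_timepoints(trial, baseline):
    """Label each time point 'task' if its left-minus-right difference
    exceeds the baseline, and 'rest' otherwise."""
    return ["task" if left - right > baseline else "rest" for left, right in trial]

# Made-up (left HbO, right HbO) pairs for two resting trials and one task trial.
rest_trials = [
    [(0.25, 0.0), (0.5, 0.25)],   # maximum difference 0.25
    [(0.75, 0.0), (0.5, 0.5)],    # maximum difference 0.75
]
baseline = calibrate_baseline(rest_trials)

task_trial = [(0.5, 0.25), (1.0, 0.25), (1.5, 0.5)]
labels = classify_timepoints(task_trial, baseline)
```

In this toy example the baseline is 0.5, so the first time point of the task trial (difference 0.25) is labeled "rest" and the remaining two are labeled "task". Because the baseline is built from resting maxima, early samples of a genuine task period are likely to be missed until the hemodynamic difference has grown past it.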

#### **3.2. RESULTS**

The results of the cross-validation are shown in **Table 3**, where accuracy refers to the overall recognition rate of both classes (rest and task). The results of the Matsuyama thresholding model (Matsuyama et al., 2009) are depicted in the first column section (average time to onset detection, *Mon*, and average classification accuracy, *Macc*(1)). The middle column section depicts the results of the simple SVM model (Cui et al., 2010b), and the rightmost column section depicts the results of the more complex SVM model<sup>2</sup> (Solovey et al., 2012). Using the thresholding approach, we found an average task detection latency of 12.6 (±7.6) s across participants (*N* = 40), with individual averages

<sup>2</sup>The model based on Solovey et al. (2012) and results of its cross-validation are also described in Strait et al. (2014b). However, all additional analyses and discussion presented here of its performance are novel.


**Table 3 | Relative model performances in nine-fold cross-validation.**

*The Matsuyama approach is shown in the left column section (with both onset latency and classification accuracy shown). Middle shows the model based on Cui et al. and far right, the Solovey et al. model. In red: rates that are significantly above chance (right-tailed t-test, tcrit(8)* = *1.8595).*

ranging from 3.1 to 27.5 s (see **Table 3**, left). However, the recognition rate of this model was no better than chance (*M* = 46.6%, *SD* = 17.2%). In contrast, both SVM models performed significantly above chance (simple SVM: *M* = 60.5%, *SD* = 15.8%, *p* < 0.0001; complex SVM: *M* = 54.5%, *SD* = 14.3%, *p* = 0.0037). Between these two SVMs, the simpler approach (Cui et al., 2010b) performed significantly better than the more complex approach, both in terms of classification accuracy (*p* = 0.0035) and across the subject population (20/40 participants showing significant recognition rates versus 10/40).
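The per-participant significance criterion given in the caption of **Table 3** (right-tailed t-test over nine folds, critical value 1.8595 at 8 degrees of freedom) can be illustrated with a short sketch. The fold accuracies below are invented for illustration, and the assumption that the test compares fold accuracies against a 50% chance level is ours.

```python
from math import sqrt
from statistics import mean, stdev

# Critical value for a right-tailed test with 8 degrees of freedom,
# as stated in the Table 3 caption.
T_CRIT_DF8 = 1.8595


def t_against_chance(fold_accuracies, chance=0.5):
    """One-sample t statistic of fold accuracies against the chance level."""
    n = len(fold_accuracies)
    standard_error = stdev(fold_accuracies) / sqrt(n)
    return (mean(fold_accuracies) - chance) / standard_error


# Hypothetical accuracies from nine cross-validation folds.
folds = [0.55, 0.60, 0.65, 0.60, 0.70, 0.55, 0.60, 0.65, 0.50]
t_obs = t_against_chance(folds)
above_chance = t_obs > T_CRIT_DF8
```

With only nine folds the critical value is fairly high, so a participant's mean accuracy must exceed chance by a clear margin relative to the fold-to-fold variability to count as significant.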

To examine the effects of (semi) unrestricted participant mobility, we next re-constructed each of the three models using the motion-*unrestricted* set of training samples (again nine each of rest and arithmetic trials). Using nine-fold cross-validation of these samples, we found that neither the thresholding nor the simple SVM approach was significantly affected in terms of classification accuracy (*M* = 45.2%, *SD* = 18.2%, *p* = 0.5459; and *M* = 60.4%, *SD* = 15.4%, *p* = 0.8850, respectively), nor was the onset latency of the thresholding approach (*M* = 13.7 s, *SD* = 5.7 s, *p* = 0.1446). However, the performance of the more complex SVM model was significantly degraded, with an average classification accuracy of 25.3% (*SD* = 7.3%, *p* < 0.0001).

To investigate the relative performances of each of these three models in a more realistic task paradigm, we tested each of the classification approaches (using the models trained on the motion-*restricted* training samples) on the testing sample (3.5 min rest, 3.5 min arithmetic, 3.5 min post-arithmetic rest). Here we observed a significant reduction in classification accuracy for the simple SVM model (*M* = 54.6%, *SD* = 14.4%, *tobs* = 1.74), but not for the complex SVM (*M* = 48.5%, *SD* = 15.1%, *tobs* = 0.67). However, the simple SVM still performed significantly above chance (*tobs* = 2.02, *tcrit*(39) = 1.68). There was no significant change in accuracy for the thresholding model (*M* = 43.9%, *SD* = 10.5%, *tobs* = 0.84).

#### **3.3. DISCUSSION**

#### *3.3.1. Model performance*

In comparison to Matsuyama et al. (2009), the simple reference channel/thresholding approach applied to the dataset used here showed substantially slower onset latencies (*M* = 12.6 s, *SD* = 7.6 s) than theirs (*M* = 9.1 s, *SD* = 4.3 s). This increase in delay and variability may be due in part to a different and larger sample population, as well as to the placement of the probes (the positioning used here was inexact and slightly more anterior and medial in comparison to Matsuyama et al., 2009). Hence, the activity measured by the reference channel may not have been entirely distinct from that of the target region of interest. In any case, our results confirm a temporal limitation for workload-based state detection, at least when using a minimal (two-probe) NIRS instrument. That is, a non-trivial onset detection delay (9–13 s) will be encountered using this method (see **Figure 1**). More problematic for this method, however, is its classification accuracy, which was no better than chance overall. While this naive detection approach may be adequate for contexts in which the duration of the passive adaptivity is not important, for contexts in which it is (e.g., if a robot should only act autonomously while a person is multitasking or mentally stressed), it may not be the best model. Similarly, a very complex model may not be the best approach either. Specifically, the simpler SVM model significantly outperformed the more complex SVM, both in terms of overall accuracy (60.5% versus 54.5%) and within the population (effective for 20 participants versus only 10 using the complex SVM). As SVMs are known to perform poorly on high-dimensional data with few training samples (Cortes and Vapnik, 1995), this difference in performance between the two SVMs might be attributable to the availability of only 18 training samples in total, combined with the complex SVM's use of 4000 features in its model of workload-based activity (versus only four features in the simple SVM). For instance, Power and colleagues showed a nearly 15% improvement in classification accuracy when using 80 versus 10 training samples (Power et al., 2012). Thus, given more training samples, we might expect the complex SVM approach to show better recognition rates.

#### *3.3.2. Model performance subject to movement*

When we next re-trained our models using the training samples with semi-unrestricted participant mobility (participants were still tethered within range of the NIRS device), we found that neither the thresholding nor the simple SVM approach was significantly affected. However, the performance of the more complex SVM model was significantly degraded, with an average classification accuracy of 25.3% (*SD* = 7.3%, *p* < 0.0001). This difference in effects may be due to the difference in approach: the simpler approaches of Matsuyama et al. (2009) and Cui et al. (2010b) classify the NIRS signal at every time point, whereas the more complex model classifies a sizable window of the data. Hence, while a motion artifact may significantly degrade the overall measurement sample (thus resulting in lower accuracy of the complex SVM), an individual time point may not be so influenced. Potential influences on these models, however, may have been obscured in part by the filtering methods (namely, the correlation-based signal correction to attenuate movement artifacts). Hence, when developing a NIRS-BCI, it is worth considering what signal processing is necessary for the context in which it will be used (i.e., whether participants will be moving). Lastly, we looked at model performance given a more realistic task paradigm. Here we observed a significant reduction in overall classification accuracy for the simple SVM model, but not for the complex SVM or thresholding model. While the performance of the simple SVM was still statistically significantly above chance, passive adaptivity of a system based on this model would be unlikely to have any substantial effect (and any resulting enhancement of user behavior would thus be difficult to measure).

#### *3.3.3. Limitations*

In this section, we systematically investigated three recently proposed models of NIRS data and their performances when subject to certain factors of more realistic HCI settings (namely participant motion and semi-undefined task durations). While this evaluation serves to highlight the challenges these factors pose to achieving more robust NIRS-based systems, there are also a number of limitations to the interpretation of results. In particular, all three modeling approaches performed significantly worse than prior work, with the thresholding approach showing a substantial increase in onset latency and the two SVMs a substantial decrease in accuracy (roughly 15% and 13%, respectively) relative to the models on which they were based. It is likely that these differences are at least in part due to the sample size, as the sample population used in this study is considerably larger than in all prior work (*N* = 7 in Matsuyama et al. and *N* = 3 in both Cui et al. and Solovey et al.). It is also likely that they are attributable partially to differences in the task (e.g., numeric versus the alphameric n-back task used in Solovey et al.), region of measurement (prefrontal cortex versus motor cortex measured in Cui et al.), and placement of probes (the 10–20 system was used in Matsuyama et al., but no standardized coordinates were used in this investigation). Hence, it is not possible to determine whether the above effects would be observed in exact replications of prior work. However, these limitations in themselves raise an important consideration regarding NIRS-based research: specifically, whether underpowered studies generalize over larger populations and whether the methods for signal processing and modeling generalize across functional regions of the brain and over a variety of tasks.

# **4. CONCLUSIONS**

The aim of this paper was to provide (1) an overview of what we can do with NIRS-BCIs for measuring cognitive and affective user states, (2) a discussion of the effects of naturalistic and unconstrained interaction settings of HCI on signal reliability, and (3) a quantitative comparison of the performance of three recent modeling approaches in these more realistic settings. Specifically, we described two primary cognitive and affective states (mental workload and negative affect) measurable with NIRS, as well as two modes of use (evaluatory and passive). Additionally, we emphasized the distinction between offline and online (real-time) signal processing for NIRS-based BCIs. The prototypical application of NIRS as an evaluation tool is an offline *post hoc* analysis of a signal recorded during some stimulus. However, the usage of NIRS as a passive BCI (involving the online processing of hemodynamic data) has emerged, and with it, a number of challenges have followed.

We discussed some of those key challenges (participant mobility, more naturalistic interaction) and investigated their effects with a comparative analysis of three recently proposed modeling techniques. The results of our investigation highlight several considerations, including detection latencies (the temporal delay between a precipitating stimulus and the detection of the stimulus-evoked hemodynamic changes), performance of the model in more naturalistic contexts (i.e., when participant mobility is unrestricted), and the generalizability of current training paradigms (i.e., offline, time-restricted) to the asynchronous, online paradigms of more realistic settings (e.g., Brouwer et al., 2013). The results also underscore several additional considerations, namely the efficacy of a NIRS-BCI across a population (i.e., whether the signal processing and modeling approach is effective for the whole population or only a small proportion) and the task/region-specificity of a technique. While these challenges are not particularly new to the field, or to BCI in general, both the review of the literature and the empirical evaluation highlight the dependencies between performance, signal processing, and experimental context. Research efforts on all these fronts are mutually complementary and necessary to the advancement of NIRS as a tool for human–computer interaction.

NIRS-based systems have already been used in a range of applications, such as the quantification of mental workload and differentiation of aroused/valenced states; however, substantial challenges remain to be addressed before NIRS can become a practical and robust tool for passive BCIs. The challenges emphasized here concern detection latency, signal processing, and a better understanding of hemodynamic changes over undefined task durations. While numerous challenges have been raised previously (both in NIRS and EEG research), they remain unaddressed to date. It is thus our hope that this survey and dataset will encourage researchers to engage in NIRS-related research that will help overcome current challenges and make NIRS a more robust and useful tool for the BCI community.

#### **ACKNOWLEDGMENT**

This material is based upon work supported by the National Science Foundation under Grant No. ISS-1065154.

# **REFERENCES**


surprise, and workload," in *Foundations of Augmented Cognition* (Orlando, FL), 507–516.


craniocerebral correlations: reproducibility of activation? *Hum. Brain Mapp.* 28, 733–741. doi: 10.1002/hbm.20303


Zander, T., and Kothe, C. (2011). Towards passive brain-computer interfaces: applying brain-computer interface technology to human-machine systems in general. *J. Neural Eng.* 8, 025005. doi: 10.1088/1741-2560/8/2/025005

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 February 2014; accepted: 02 May 2014; published online: 23 May 2014. Citation: Strait M and Scheutz M (2014) What we can and cannot (yet) do with functional near infrared spectroscopy. Front. Neurosci. 8:117. doi: 10.3389/fnins.2014.00117*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Strait and Scheutz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Errare machinale est: the use of error-related potentials in brain-machine interfaces

# *Ricardo Chavarriaga\*, Aleksander Sobolewski and José del R. Millán*

*Defitech Chair in Non-Invasive Brain-Machine Interface, Center for Neuroprosthetics, School of Engineering, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Andrea Kübler, University of Würzburg, Germany Johanna Wagner, Graz University of Technology, Austria Laurens Ruben Krol, Technische Universität Berlin, Germany*

#### *\*Correspondence:*

*Ricardo Chavarriaga, Defitech Chair in Non-Invasive Brain-Machine Interface, Ecole Polytechnique Fédérale de Lausanne, EPFL-STI-CNBI, Station 11, Lausanne 1015, Switzerland e-mail: ricardo.chavarriaga@epfl.ch*

The ability to recognize errors is crucial for efficient behavior. Numerous studies have identified electrophysiological correlates of error recognition in the human brain (error-related potentials, ErrPs). Consequently, it has been proposed to use these signals to improve human-computer interaction (HCI) or brain-machine interfacing (BMI). Here, we present a review of over a decade of developments toward this goal. This body of work provides consistent evidence that ErrPs can be successfully detected on a single-trial basis, and that they can be effectively used in both HCI and BMI applications. We first describe the ErrP phenomenon and follow up with an analysis of different strategies to increase the robustness of a system by incorporating single-trial ErrP recognition, either by correcting the machine's actions or by providing means for its error-based adaptation. These approaches can be applied both when the user employs traditional HCI input devices or in combination with another BMI channel. Finally, we discuss the current challenges that have to be overcome in order to fully integrate ErrPs into practical applications. This includes, in particular, the characterization of such signals during real(istic) applications, as well as the possibility of extracting richer information from them, going beyond the time-locked decoding that dominates current approaches.

**Keywords: brain-machine interface, error-related potentials, reinforcement learning, EEG, neuroprosthesis, hybrid BCI**

# **1. INTRODUCTION**

*Errare humanum est, perseverare autem diabolicum –Seneca the younger*

The ability of human and non-human animals to learn and adapt their behavior is largely based on their capacity to identify erroneous actions (Rabbitt, 1966). Several studies have reported that such events elicit distinct neural responses, which can be observed using different neuroimaging techniques including fMRI, scalp and intracranial electroencephalography (EEG), and magnetoencephalography (MEG). In particular, it has been demonstrated that the electrophysiological signatures of this error processing (i.e., error-related potentials, ErrPs) can be reliably decoded on a single-trial basis, thus allowing their use in brain-machine interface (BMI) systems as a means to improve the machine's performance, much as animals use error signals to adapt their behavior. For instance, BMIs typically aim at decoding the user's intentions from neural activity (e.g., as recorded by EEG). Misclassification of these intentions results in an erroneous command. The user's subsequent perception of such an error can elicit an ErrP, and the successful decoding of this response would allow the system to take corrective actions, e.g., by preventing the erroneous command from being fully executed or by reverting its outcome (Schalk et al., 2000; Ferrez and Millán, 2008a; Dal Seno et al., 2010). Alternatively, ErrPs can be used to reduce the possibility of the error reappearing in the future through re-calibration of the system, allowing it to "learn from its mistakes" (Artusi et al., 2011; Llera et al., 2011). These approaches are illustrated in **Figure 1**. They combine the decoding of one brain signal (e.g., correlates of motor imagery or stimulus recognition) for controlling the device with the ErrP as a corrective mechanism, thus corresponding to hybrid BMI systems (Pfurtscheller et al., 2010).
Notably, the same principles can also be applied to human-computer interaction (HCI) systems when input devices other than BMI are employed (Parra et al., 2003; Chavarriaga and Millán, 2010; Wang et al., 2011; Zander and Kothe, 2011; Zander and Jatzev, 2012). Interestingly, these ErrPs are naturally elicited during human interaction with the machine. This means that information about the user's cognitive assessment of such interaction can be obtained implicitly, without a need for training or asking the users to actively generate them. Systems that decode this information are sometimes referred to as *passive* BMIs; as opposed to the so-called *active* BMIs where the brain signals are consciously modulated by the user to control a given device or application (Zander and Kothe, 2011). However, caution should be taken not to interpret this as if the user played an entirely passive role during the interaction. In fact, ErrPs have been shown to be modulated by the user's level of engagement in the task (Hajcak et al., 2005).

In the last decade, researchers have provided ample evidence of the feasibility of such approaches. Here we review this work, starting with a short description of different error-related electrophysiological patterns (section 2). For a more detailed account of the neural basis of these signals, readers can refer to reviews by Taylor et al. (2007), Hoffmann and Falkenstein (2012), Wessel (2012), and Ullsperger et al. (2014). Here we focus on signals that have been mostly exploited for brain-machine interfacing and primarily discuss electroencephalographic signals found using non-invasive recording techniques (section 3). We go on to present different strategies that can be applied to increase the robustness of the BMI system by incorporating single-trial ErrP recognition in both able-bodied subjects and users with motor disabilities (sections 4 and 5). We also present recent efforts to integrate these signals into real-world applications (section 6). Finally, we review the techniques used for decoding these potentials (section 7) and discuss current challenges in the study and exploitation of these signals (section 8).

# **2. ERROR-RELATED BRAIN ACTIVITY**

Early reports of error-related brain activity date back to the early 1990's (Falkenstein et al., 1991; Gehring et al., 1993). These studies showed a characteristic EEG event-related potential (ERP) elicited after subjects committed errors in a speed response choice task. This pattern is characterized by a negative potential deflection, termed the *error-related negativity* (ERN), appearing over fronto-central scalp areas at about 50–100 ms after a subject's erroneous response (Falkenstein et al., 2000). This negative component is followed by a centro-parietal positive deflection (Pe). Modulations of this latter component have been linked to the subject's awareness of the error. Interestingly, correlations have been found between the ERNs and behavioral adjustments following these errors, e.g., post-error response slowing (Debener et al., 2005; Frank et al., 2005; Themanson et al., 2012); supporting the idea that the signal indeed reflects an action monitoring process (Holroyd and Coles, 2002). This is further corroborated by the fact that the ERN amplitude seems to be modulated by the importance of errors in the given task (Frank et al., 2005; Taylor et al., 2007), as well as the subjective awareness of the error (Falkenstein et al., 2000; Wessel, 2012; Navarro-Cebrian et al., 2013). Regardless of such functional modulations, these signals are also influenced by individual differences and certain pathological conditions (Olvet and Hajcak, 2008). Importantly, however, these signals have been shown to be quite reliable over time (Olvet and Hajcak, 2009) and across different tasks (Riesel et al., 2013).

A similar medial-frontal EEG pattern has been reported to appear after presentation of "feedback," i.e., the delayed result of a choice or action. This *feedback-related negativity* (FRN), appearing between 200 and 300 ms after feedback onset, is modulated by choices leading to losing situations in strategic gambling tasks (Cohen et al., 2007), as well as subject-specific sensitivity to reinforcement signals (Frank et al., 2005). Interestingly, similar signals are also elicited in the absence of motor response or while observing errors committed by a different person or agent (van Schie et al., 2004; Yeung et al., 2005; Zander et al., 2008; Chavarriaga and Millán, 2010; Zander and Jatzev, 2012). Mounting evidence provides further support of the link between these signals and reward or utility prediction errors, suggesting that ErrPs are generated when the actual outcome does not correspond to the expected one (Holroyd and Coles, 2002; Holroyd et al., 2003; Nieuwenhuis et al., 2004; Yeung et al., 2005). Such information can be used for learning by adjusting the behavior to minimize errors, as proposed by the reinforcement learning theory (Sutton and Barto, 1998).

It is worth noting that, although these signals are typically referred to as "negativities," the EEG correlates of performance monitoring comprise a uniform sequence of ERP components irrespective of the error source (Ullsperger et al., 2014). These include the fronto-central negative deflections described above, followed by a fronto-central positive deflection and then a later parietal positivity. This pattern is found after self-generated errors (i.e., the ERN/Pe complex), stimulus presentation (i.e., N2/P3 complex), and feedback errors (i.e., FRN/P3 complex). It is not entirely clear to what extent these signals share common underlying processes. Several studies using fMRI, EEG-based source localization, and intra-cranial recordings suggest that the fronto-central ERP modulations commonly involve the medial-frontal cortex, specifically the anterior cingulate cortex (ACC) (Ullsperger and von Cramon, 2001; Brázdil et al., 2002; van Veen and Carter, 2002; Herrmann et al., 2004; Taylor et al., 2007).

Lastly, ErrPs seen as distinct patterns in the temporal domain of electrophysiological signals are not an exhaustive description of the observable EEG phenomena. Accumulating invasive and non-invasive studies are also demonstrating frequency modulations, specifically with erroneous responses eliciting an increase of theta activity followed by a decrease of beta rhythm amplitude (Trujillo and Allen, 2007; Cohen et al., 2008; Koelewijn et al., 2008; Cavanagh et al., 2009, 2012). Moreover, connectivity studies reveal patterns of cross-regional synchronizations, pointing to influences from ACC to prefrontal areas (Cavanagh et al., 2009).

As already mentioned, several studies have reported an evoked response to errors in the user intention decoding when using BMI systems (c.f., **Figure 2**). This response exhibits the same pattern of modulations as described above. The difference waveform (error minus correct) over fronto-central areas is characterized by an initial positive peak at about 200 ms after feedback presentation, followed by a larger negative deflection at about 250 ms and a third larger positive peak at about 320 ms. Furthermore, estimation of the intracranial activity using sLoreta (Pascual-Marqui, 2002) indicated that the signals elicited during brain-machine interaction were generated in the ACC, consistent with other error-related EEG correlates (Ferrez and Millán, 2008a; Lopez-Larraz et al., 2010; Iturrate et al., 2013a). Notably, the term *error-related potential* (ErrP) has since become quite widespread within the BMI community, covering electrophysiological responses elicited in a number of paradigms. Resolving its relationship to the ERP components, and their functional modulations, typically identified in basic cognitive neuroscience is beyond the scope of this BMI-focused review (although we refer to one relevant confound below). It can be considered a useful umbrella term for application-driven research, albeit to a certain extent at the cost of correspondence with fundamental investigations. This is, however, partly justified by different research settings: closed-loop usability and the practicalities of single-trial decoding are rarely the chief concern of basic neuroscience, while the latter's typically abstract, distilled experimental paradigms are not employed by engineering-oriented researchers.
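The difference waveform mentioned above (error minus correct) is simply the sample-wise difference between the two condition averages. A minimal sketch, assuming single-channel epochs stored as equal-length lists of amplitudes time-locked to feedback onset (the function names and the toy values are hypothetical):

```python
def average_epochs(epochs):
    """Sample-wise mean over a list of equal-length, time-locked epochs."""
    count = len(epochs)
    return [sum(samples) / count for samples in zip(*epochs)]


def difference_waveform(error_epochs, correct_epochs):
    """ERP for error trials minus ERP for correct trials."""
    error_erp = average_epochs(error_epochs)
    correct_erp = average_epochs(correct_epochs)
    return [e - c for e, c in zip(error_erp, correct_erp)]


# Toy amplitudes (three samples per epoch), for illustration only.
errors = [[0, 2, -4], [0, 4, -6]]
corrects = [[0, 1, -1], [0, 1, -1]]
difference = difference_waveform(errors, corrects)
```

Subtracting the correct-trial average removes activity common to both conditions (e.g., the response to the stimulus itself), leaving the error-specific modulation.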

# **3. ERROR-RELATED POTENTIALS FOR BMI**

Following the basic neurophysiological findings described in the preceding section, several studies aimed at assessing whether similar signals were also to be found when the errors were produced by a machine as a result of a misclassification of the user's intention while operating an actual or simulated BMI. In a first report Schalk et al. (2000) showed in four healthy subjects that ErrPs are elicited at the end of erroneous trials when they controlled a 1-D cursor using a non-invasive BMI based on modulation of mu and beta EEG rhythms. This approach was further developed by Ferrez and Millán (2008a) in a study on five subjects using a 2-class motor-imagery (MI) based BMI controlling a cursor moving in discrete steps (c.f., **Figure 3A**). They showed that the ERPs elicited after each command could be decoded as corresponding to the error or correct condition with an accuracy of about 80%. Simultaneously, other studies tested the feasibility of decoding the error-related activity elicited after manual responses (Blankertz et al., 2003; Parra et al., 2003).

Further studies demonstrated other encouraging features of ErrPs for their use in BMI applications. Firstly, as with the ERN, they have been shown to be quite stable over time. ErrP classifiers maintained the same performance when tested several months after their calibration (Ferrez and Millán, 2008a; Chavarriaga and Millán, 2010). Furthermore, these signals seem to be mainly related to a general error-monitoring process, instead of specificities of the particular task that was performed. Iturrate et al. (2014) compared ErrPs elicited in three tasks in which subjects (*N* = 6) monitored the operation of devices of different degree of complexity: a 1-D cursor movement (**Figure 3A**), and a simulated (**Figure 3B**) and real robots (**Figure 3C**) moving in a 2-D space. Their results show that ErrPs across these tasks significantly differ in the latency of the peak modulations, but not in amplitude or overall waveform, thus suggesting the possibility of identifying task-independent markers of erroneous brain-machine interaction. Along the same lines, similar waveforms have been reported in tasks using different feedback modalities (Lehne et al., 2009; Perrin et al., 2010; López-Larraz et al., 2011; Chavarriaga et al., 2012).

An account of ErrP usage in BMI applications must indicate several specific confounds these signals may be susceptible to. Since, due to the nature of BMI, such applications often involve moving stimuli, the observed signals may be contaminated with electrooculographic (EOG) artifacts due to eye movements. This can bias the decoding, particularly in application designs where the direction of feedback movement is related to the correctness of the action. To counter this confound, researchers (especially in proof-of-concept studies) may take care to ensure that the location or movement of target stimuli are balanced. Fortunately, it has consistently been found that ocular artifacts have little influence on the signals used for the decoding (Ferrez and Millán, 2008a; Chavarriaga and Millán, 2010; Iturrate et al., 2010; Artusi et al., 2011; Spüler et al., 2012). Nevertheless, EOG artifacts remain an ever-present concern in EEG studies using moving stimuli and their potential impact should be systematically assessed.

MI-based BMI (Ferrez and Millán, 2008a). Right column, Monitoring ErrP: The cursor moves automatically and the user is asked to evaluate whether it moves toward the target location (Chavarriaga and Millán, 2010). **(A,B)** Event-related spectral perturbation. **(C,D)** Grand-average ERP conditions. *t* = 0 corresponds to the stimulus presentation onset (i.e., cursor movement). **(E,F)** Topographical representation of the group average difference ERP for both the interaction (*N* = 4) and monitoring paradigms (*N* = 6). Activity is color coded from blue to red corresponding to the range [−5, 5] μV.

**brain-machine interaction. (A)** 1-D cursor control (Ferrez and Millán, 2008a; Chavarriaga and Millán, 2010; Tsoneva et al., 2010; Goel et al., 2011; Zhang

discrete steps toward a target location (green square). **(B,C)** 2-D control of a simulated and real robotic arm (Omedes et al., 2013; Iturrate et al., 2014).

A different possible confound is that the observed potentials are more related to the rarity of the erroneous events than to their valence. To evaluate this, Ferrez and Millán (2008a) and Chavarriaga and Millán (2010) performed experiments with error rates of 50% and 40%, respectively. In both cases they report ErrPs similar to those obtained with lower error rates, although with lower amplitudes. Similarly, the N200/P300, as well as the FRN signals, have also been reported to be modulated by the target/error likelihood (Polich, 1990; Polich and Margala, 1997; Jessup et al., 2010; Hauser et al., 2014). In conclusion, although ErrPs are modulated by the frequency of the stimulus, they cannot be explained by this factor alone and seem more closely related to its meaning.

Another factor that can modulate the amplitude of the ErrP concerns the attention level of the subject and her/his engagement in the task (Hajcak et al., 2005). Subjects tend to have smaller ErrP amplitude when simply monitoring the device than when they are controlling it, c.f. **Figures 2C,D** (Ferrez and Millán, 2008a; Chavarriaga and Millán, 2010). This factor may influence the performance and, as with other BMI approaches, calls for efficient calibration methods, which could be used before online operation.

Overall, initial ErrP studies supported the idea that it was possible to identify erroneous responses–either manual or decoded through a BMI–and use them to correct these errors to improve the overall performance (c.f., **Figure 1** Left). They were based on offline analysis and did not assess the effect of these approaches during online operation, but fostered continued efforts to reliably decode such error-related brain activity and to integrate it in the framework of human–machine interaction.

# **4. ERROR-RELATED POTENTIALS AS A CORRECTIVE SIGNAL**

#### **4.1. ERROR CORRECTION IN MOTOR-RELATED BMI**

Subsequent attempts at integration of ErrP-based correction into online BMI setups yielded generally positive results. Extending their previous protocol, Ferrez and Millán (2008b) used a two-class MI-based BMI to control one-dimensional step-wise movements of a cursor. The potential evoked by the cursor movement was decoded to indicate an erroneous or correct movement. In the former case, the cursor was returned to the previous position. Simultaneous real-time decoding of both ErrPs and MI-related activity in two subjects resulted in a three-fold increase in the information transfer rate. They report 80% accuracy in the ErrP recognition, which led to a reduction of the MI decoding error from about 30% to less than 9%. Kreilinger et al. (2009) also reported performance improvement in a similar experimental protocol involving 13 healthy subjects. MI classification accuracy increased from about 70 to 80% using the online ErrP-based correction.

Other studies have provided further support to the feasibility of using such hybrid approaches, combining the use of one BMI signal to decode the action commands and the ErrP decoding to correct erroneous actions. For instance, Artusi et al. (2011) using offline analysis showed improvement in the decoding of movement-related potentials (i.e., preparatory EEG activity before actual movement performance) by introducing ErrP classification. In their approach, the outcome of the movement decoder was shown to the user and if the elicited EEG response was decoded as corresponding to the error condition, the trial was discarded and the task had to be repeated. Their experiment, involving six healthy subjects, yielded an average ErrP recognition of 80%. Simulation of this corrective mechanism showed a reduction of the global error rate in discriminating between imagination of slow and fast arm flexions from 26% to 14%. In this case, 20% of the trials were discarded based on the ErrP decoding. They estimated an improvement in the average information transfer rate of 76%. Altogether these results are indeed encouraging; however, often such studies used simulated initial BMI commands in order to keep a constant performance; e.g., Ferrez and Millán (2008a); Millán et al. (2009); Artusi et al. (2011). The purpose of such manipulation is to decouple the estimation of the benefits of ErrP-based detection from within-session variations of the command decoder. In consequence, further online tests are required to fully assess the actual performance of motor-related BMIs combined with ErrP-triggered corrective actions.
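The discard-and-repeat mechanism described above can be illustrated with a short calculation. The following is a minimal sketch, assuming the ErrP decoder is fully described by a sensitivity and a specificity that do not vary across trials; the function name and the example values are hypothetical and are not taken from any of the cited studies, so the output is not expected to reproduce the reported figures.

```python
def corrected_error_rate(base_error, sensitivity, specificity):
    """Return (residual_error, discarded_fraction) when flagged trials are discarded.

    base_error:  probability that the command decoder is wrong.
    sensitivity: P(ErrP detected | command wrong).
    specificity: P(no ErrP detected | command right).
    """
    wrong_kept = base_error * (1.0 - sensitivity)   # errors the ErrP decoder missed
    right_kept = (1.0 - base_error) * specificity   # correct commands left untouched
    kept = wrong_kept + right_kept
    discarded = 1.0 - kept
    residual = wrong_kept / kept if kept > 0 else 0.0
    return residual, discarded


# Hypothetical operating point: 30% command errors, 80% sensitivity and specificity.
residual, discarded = corrected_error_rate(0.30, 0.80, 0.80)
print(f"residual error {residual:.3f}, discarded trials {discarded:.3f}")
```

Under these assumptions the residual error falls, but a sizeable share of trials is discarded, which is why the gain has to be weighed against the cost of repeating trials.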

#### **4.2. ERROR CORRECTION IN P300-BASED BMI**

ErrP-based correction mechanisms have also been applied widely to P300-based spellers (Dal Seno et al., 2010; Takahashi et al., 2010; Combaz et al., 2012; Spüler et al., 2012; Schmidt et al., 2012). These systems exploit an event-related potential elicited by a rare, relevant stimulus: the so-called P300 ERP component (Farwell and Donchin, 1988). In this application, the interface can cancel a character selected with the P300-based speller upon subsequent ErrP detection or, alternatively, correct it by choosing the second most probable character according to the P300 decoding. Although an early study showed little or no improvement by ErrP-based online correction for two subjects using a pseudorandom matrix speller (Dal Seno et al., 2010), later works showed an advantage of integrating the ErrP detection into the BMI. Schmidt et al. (2012) reported an average increase of 40% in the writing speed of twelve healthy subjects using a speller interface designed to reduce the performance sensitivity to gaze shifts (Treder et al., 2011). More recently, Spüler et al. (2012) reported experiments with six subjects with motor disabilities (5 diagnosed with amyotrophic lateral sclerosis, ALS, and one with Duchenne muscular dystrophy) in which a performance increase was observed (0.37 bits per trial). For comparison, an age-matched group of eight able-bodied subjects showed an increase of 0.73 bits per trial, while a group of nine younger subjects had an increase of 0.44 bits per trial. Notably, patients with ALS exhibited similar ErrP patterns to those of healthy subjects, further supporting the potential use of error processing signals in such BMI applications, primarily meant for users with severe disabilities. Both studies found an inverse correlation between the performance improvement and the accuracy of the P300 decoder. Conversely, a different study involving 16 healthy subjects reported larger improvement for users with higher spelling accuracy (Perrin et al., 2012). In this study, such subjects also showed a slightly higher specificity in the decoding of the ErrP signal. A potential issue to be taken into account in this approach is that both P300 and ErrPs are modulated by attentional processes (Yeung et al., 2005; Kleih et al., 2010). Therefore, factors that affect the level of engagement or motivation of the user (e.g., low BMI accuracy, high mental workload) may be reflected in the elicited ERPs and, depending on the sensitivity of the decoder to these variations, be detrimental to the overall performance after integration of the ErrP-based correction in both able-bodied users and users with disabilities.
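The two correction options mentioned above (cancelling the selection, or replacing it with the second most probable character) can be sketched as follows. This is an illustrative sketch only: the function name, the `policy` argument, and the score dictionary are assumptions introduced here, and the cited spellers may implement the decision differently.

```python
def spell_with_correction(p300_scores, errp_detected, policy="second_best"):
    """Pick a character from P300 scores, then apply an ErrP-based correction.

    p300_scores:   dict mapping each candidate character to its P300 score
                   (higher means more probable); hypothetical input format.
    errp_detected: True if an ErrP was decoded after showing the top character.
    policy:        "cancel" deletes the selection, "second_best" replaces it.
    Returns the final character, or None if the selection was cancelled.
    """
    ranked = sorted(p300_scores, key=p300_scores.get, reverse=True)
    if not errp_detected:
        return ranked[0]
    if policy == "cancel":
        return None
    if policy == "second_best":
        return ranked[1] if len(ranked) > 1 else None
    raise ValueError(f"unknown policy: {policy!r}")


scores = {"A": 0.5, "B": 0.3, "C": 0.2}
print(spell_with_correction(scores, errp_detected=False))
print(spell_with_correction(scores, errp_detected=True))
print(spell_with_correction(scores, errp_detected=True, policy="cancel"))
```

Which policy is preferable depends on how often the second-ranked character is the intended one, and on how costly a false ErrP detection is for the user.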

#### **4.3. CORRECTION OF MANUAL RESPONSES**

Besides using ErrPs to correct commands generated by BMI systems, these signals can also be used in HCI applications requiring manual responses from the user. The first attempts to decode the ERN/Pe components date back to the early 2000s. Blankertz et al. (2003) reported decoding of error-related signals in eight able-bodied subjects using a modified d-2 attention task (Bates and Lemay, 2004). Then, Parra et al. (2003) analyzed ErrPs in a forced choice visual discrimination task, i.e., the Eriksen Flankers task. They reported single-trial classification accuracy of 91% averaged over seven healthy subjects. Furthermore, online correction of the manual responses using the decoded error-related EEG correlates reduced the discrimination error rate in 5 of the subjects (average error reduction was 21.4 ± 21.7%). While in the above study users responded by pressing a key, Ventouras et al. (2011) also tested the decoding of the ERN/Pe in the Eriksen task using a joystick as input device. Their experiment included 16 healthy subjects, and classification performance was assessed using the leave-one-out procedure. They reported sensitivity and specificity values over 87.5%.

These signals have also been tested as a means to correct errors in typewriting tasks. Wang et al. (2011) decoded ErrPs elicited during a hear-and-type task where nine subjects had to type numbers dictated by a computer. They reported sensitivity and specificity values of 68.72 and 51.68% for classifiers trained and tested on the same subject. The performance for cross-subject classifiers was 68.72 and 49.45%, respectively. A limitation of this study, and a potential reason for the low decoding performance, is the small number of keystroke errors made by the subjects, ranging from 0.42 to 3.58%. As discussed below, this limits the possibility of building proper models of the signals corresponding to errors.

Another interesting study evaluated both feedback and self-generated errors in a task involving visuo-tactile stimuli (Lehne et al., 2009). Eleven participants took part in the study where an array of vibrotactile stimulators provided information about a tactile cursor that should be directed toward a target location on the torso. A visual stimulus presented an intended direction of movement and upon its appearance, the user pressed a button to confirm or reject the proposed movement direction. Given the task difficulty, users made erroneous responses in 27.8% of the trials on average. Furthermore, in other trials machine errors were also introduced (i.e., the machine misinterpreted the button responses). Classification of both types of errors yielded accuracies of about 70%, with higher detection rates for the correct than the error trials (i.e., about 70 and 50%, respectively).

Overall, these works show the feasibility of decoding error-related information after overt user responses. In general the classification performance was higher for the correct condition, in particular when the complexity of the task increases.

# **5. ERROR-DRIVEN LEARNING**

The studies presented above used ErrP detection to immediately correct erroneous decisions made by the BMI. An alternative use of these signals is error-driven learning. This approach, illustrated in **Figure 1** Right, has been applied to endow BMI systems with adaptive capabilities in two different manners. One possibility is to update the BMI classifier (Blumberg et al., 2007; Llera et al., 2011, 2012; Roset et al., 2013). For instance, Llera et al. (2011) used the decoding of error-related MEG activity to identify misclassification in a two-class covert visual attention paradigm (*N* = 8). The lateralization of alpha-band power in posterior channels was classified using logistic regression to infer which direction (i.e., left or right) the subject was covertly attending to. ErrP decoding was used to identify misclassifications and provide new labels for the incoming data in a semi-supervised manner. The labeled sample was then used to update the classifier parameters. Offline analysis showed that this approach can significantly increase the performance of the BMI classifier. Importantly, given that this is a binary task the intended target class can be easily inferred for the misclassified samples allowing the use of supervised learning techniques for the classifier adaptation. A similar strategy was adopted by Artusi et al. (2011) in the task described in section 4.1, i.e., decoding of imagery of fast vs slow movements. In their case, those trials that were considered as correct after ErrP classification were incorporated into a learning set that was used to perform online retraining of the MI classifier.

However, as long as the ErrP decoding is considered to be in essence binary, the overall performance will be substantially affected by the false positive rate (i.e., correct BMI actions misclassified as errors). A way to palliate this effect is to use methods relying on probabilistic error signals. In that way the reliability of the ErrP decoder (estimated from the training set) can be taken into account. Bayesian filtering or Expectation-Maximization have been put forward as possible approaches (Perrin et al., 2010; Artusi et al., 2011; Llera et al., 2012). A similar method was also proposed in a hybrid system for human-computer interaction where an acceleration-based gesture recognition system was updated using the decoding of the ErrP signal (Chavarriaga et al., 2010). However, to our knowledge these methods have only been tested in offline experiments.
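One way to read the idea of a probabilistic error signal is as a single application of Bayes' rule, in which the reliability of the ErrP decoder weights the evidence it provides. The sketch below assumes that the decoder reliability is summarized by a sensitivity and a specificity estimated on the training set, and that the task is binary; the function names are hypothetical, and the cited works use more elaborate schemes (Bayesian filtering, Expectation-Maximization).

```python
def posterior_error(prior_error, errp_detected, sensitivity, specificity):
    """P(BMI output was wrong | ErrP decoder output), via Bayes' rule."""
    if errp_detected:
        like_wrong, like_right = sensitivity, 1.0 - specificity
    else:
        like_wrong, like_right = 1.0 - sensitivity, specificity
    evidence_wrong = like_wrong * prior_error
    evidence_right = like_right * (1.0 - prior_error)
    return evidence_wrong / (evidence_wrong + evidence_right)


def soft_label(predicted, other, p_wrong):
    """Soft target for a binary task: probability mass on each class."""
    return {predicted: 1.0 - p_wrong, other: p_wrong}


# Hypothetical values: 30% prior error rate, 80% sensitivity and specificity.
p = posterior_error(0.30, errp_detected=True, sensitivity=0.80, specificity=0.80)
print(soft_label("left", "right", p))
```

With these illustrative numbers a detected ErrP raises the probability of an error well above the prior but far from certainty, so an adaptive classifier using the soft label would be updated less aggressively than with a hard relabeling.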

Besides adapting the BMI classifier, ErrPs can be used to improve the behavior of a semi-autonomous system. This approach is anchored in the concept of shared control, where intelligent devices can take care of low-level decisions while the user only provides high-level commands, using a BMI or another input modality (Perrin et al., 2010). In this case, the user monitors the performance of the intelligent device and whenever an ErrP is detected, suggesting the action was perceived as erroneous, it is used to adapt the device controller to reduce the likelihood of committing the same error later on (Chavarriaga and Millán, 2010). In terms of the reinforcement learning algorithm, the detection of an ErrP corresponding to the error condition will be translated into a negative reward value, effectively punishing the performed action when updating the control policy. This approach was first tested in a 1-D control task with six healthy subjects (c.f. **Figure 3A**). The offline analysis showed that it was possible for the device to learn optimal control policies even though the accuracy of the ErrP decoder was not perfect. An online evaluation of this approach on two subjects monitoring a simulated robot was demonstrated by Iturrate et al. (2010). In this work the subject had to choose the intended target location of the robot and then monitor its movements. The decoded ErrPs were used in a reinforcement learning paradigm to update the robot control policy. They reported that the learned policy converged toward the optimal one (i.e., taking the robot to the user's intended location) in 92% and 75% of the cases for each subject, respectively. Recent integration with shared control techniques suggests that further improvements in performance can be achieved (Iturrate et al., 2013b). In this case 4 subjects monitored a moving cursor in a 2D reaching task, and the ErrPs were used to select one control policy from a pre-defined repertoire; i.e., selecting the most suitable policy to reach the inferred target location.
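The use of a decoded ErrP as a negative reward can be illustrated with a toy simulation of a 1-D cursor task. This is a deliberately simplified sketch and not the algorithm of the cited studies: it assumes a tabular learner with sample-average updates, a simulated ErrP decoder with a fixed accuracy, and hypothetical parameter values chosen only for illustration.

```python
import random

ACTIONS = (-1, +1)  # move one step left, move one step right


def learn_policy(n_states=7, target=5, errp_accuracy=0.75, n_trials=2000, seed=0):
    """Learn which direction to move in each state from noisy ErrP-based rewards."""
    rng = random.Random(seed)
    value = {(s, a): 0.0 for s in range(n_states) for a in ACTIONS}
    count = {key: 0 for key in value}
    non_target = [s for s in range(n_states) if s != target]

    for _ in range(n_trials):
        state = rng.choice(non_target)
        action = rng.choice(ACTIONS)
        # An action is truly erroneous if it moves away from the target.
        truly_wrong = (target - state) * action < 0
        # Simulated ErrP decoder: right with probability errp_accuracy.
        decoded_correctly = rng.random() < errp_accuracy
        errp_detected = truly_wrong if decoded_correctly else not truly_wrong
        # A detected ErrP is translated into a negative reward.
        reward = -1.0 if errp_detected else 1.0
        count[(state, action)] += 1
        value[(state, action)] += (reward - value[(state, action)]) / count[(state, action)]

    # Greedy policy: in each state, pick the action with the highest estimated value.
    return {s: max(ACTIONS, key=lambda a: value[(s, a)]) for s in non_target}


policy = learn_policy()
print(policy)
```

Because the simulated decoder is right more often than wrong, the average reward of a good action stays above that of a bad one, so the greedy policy can still point toward the target despite individual misclassifications.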

# **6. ERROR-RELATED POTENTIALS IN REALISTIC APPLICATIONS**

As summarized in section 2, a wealth of neuroscience literature has reported error-related neural correlates. These studies are typically performed in well-controlled laboratory conditions using abstract tasks and stimuli. This allows characterization of such correlates in recording conditions that yield higher signal-to-noise ratio and avoid confounds that may appear when allowing more behavioral freedom to the subject (user), or relaxing constraints of the operational setting.

Notably, several studies presented in previous sections corroborated the existence of similar correlates during complex scenarios or realistic interaction with complex devices. Wang et al. (2011) evaluated ErrPs when users performed a typewriting task, while Spüler et al. (2012); Schmidt et al. (2012) and others have tested these potentials while subjects use a P300-based speller. Moreover, it has been possible to observe and decode error-related potentials while people monitor the performance of a robotic arm (Kreilinger et al., 2012; Iturrate et al., 2014) or a mobile robot (Perrin et al., 2010; Chavarriaga et al., 2012), both using simulated and real platforms. Similar correlates were also found during simulated driving of an intelligent car (Zhang et al., 2013). Another study, assessing potential BMI applications to cope with the situational disability experienced by astronauts, reported similar ErrP waveforms and decoding performance under different gravity conditions in parabolic flights (Millán et al., 2009).

These studies suggest that ErrPs can also be decoded in more complex tasks and scenarios. Nevertheless, it should be noted that the decoding performance is typically lower than in simpler, well-controlled experimental paradigms. The performance differences can be due to the decreased signal-to-noise ratio of the recorded signals, as well as the increased workload placed on the user in the complex tasks. As shown below, some works have attempted to identify and exploit common patterns between simple and complex tasks as a procedure to improve the training of the ErrP decoder in more challenging conditions (Kim and Kirchner, 2013; Iturrate et al., 2014).

# **7. CLASSIFICATION OF ERROR-RELATED POTENTIALS**

A key factor for exploiting ErrPs to improve BMI performance is the ability to decode this signal in single trials. As is the case for all BMI systems, this relies on the real-time processing of the neural signals and the use of machine learning techniques to relate the current activity pattern to a corresponding class (i.e., error or correct condition). This process involves the extraction of suitable features and the training of a classifier based on available labeled samples. Below we discuss the most common methods applied for decoding the error correlates. However, a comprehensive review of the machine learning methods applied in BMI is out of the scope of this paper. Interested readers can refer, among others, to introductory papers by Bashashati et al. (2007); Lotte et al. (2007) and Blankertz et al. (2011).

Studies presented in the previous sections show that it is possible to decode the ErrPs. Notably, they have often reported higher classification accuracy for correct trials than errors. This may be partly due to the protocols used to train the classifier. These typically involve a low error rate (e.g., 20%), thus yielding a larger number of examples for the correct class. The use of an imbalanced number of samples per class may result in asymmetric costs for misclassification of each class and leads to classifiers that are biased toward one of the classes. Moreover, it is difficult to properly estimate the classifier parameters if only a limited number of examples is available.
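The effect of class imbalance on reported performance can be made concrete with a small example. The sketch below assumes a training set with a 20% error rate, as in the example above, and a degenerate classifier that always answers "correct"; the inverse-frequency weighting shown is one common heuristic for compensating imbalance and is not claimed to be the method used in the cited studies.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def sensitivity_specificity(y_true, y_pred, positive="error"):
    """Sensitivity (error trials detected) and specificity (correct trials kept)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


labels = ["correct"] * 80 + ["error"] * 20
always_correct = ["correct"] * 100          # degenerate classifier
accuracy = sum(t == p for t, p in zip(labels, always_correct)) / len(labels)
sens, spec = sensitivity_specificity(labels, always_correct)
weights = balanced_class_weights(labels)
print(accuracy, sens, spec, weights)
```

The degenerate classifier reaches 80% accuracy while detecting no errors at all, which is why sensitivity and specificity (or class-weighted training) are more informative than raw accuracy for ErrP decoders.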

Regarding the processing and classification techniques used to decode the ErrPs it can be observed that a vast majority of the reported studies are based on temporal features (i.e., waveform shape) computed from a few pre-selected electrodes in the fronto-central areas (e.g., FCz, Cz). Typically, EEG signals were low-pass filtered below 10 or 20 Hz and time-samples from a pre-defined window (usually between 200 and 600 ms) were used for classification (Blankertz et al., 2003; Ferrez and Millán, 2008a; Kreilinger et al., 2009; Chavarriaga and Millán, 2010; Dal Seno et al., 2010; Takahashi et al., 2010; Artusi et al., 2011; Spüler et al., 2012). In some cases, authors used automatic selection mechanisms over larger feature spaces, quantifying discriminant power of features with some metric, e.g., t-statistic, Fisher score or *r*<sup>2</sup> (Dal Seno et al., 2010; Goel et al., 2011; Iturrate et al., 2013a). These studies reported features similar to those manually selected but aimed at better capturing subject-dependent variations in the elicited signals. Alternative approaches to compute features have recently been proposed including the usage of spatiotemporal filters (Perrin et al., 2012; Rousseau et al., 2012; Iturrate et al., 2013a, 2014), as well as singular value decomposition (Hamner et al., 2011; Phlypo et al., 2011).
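The temporal-feature pipeline described above (a few fronto-central channels, smoothing, and samples from a post-stimulus window) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the epoch is a hypothetical dictionary of per-channel sample lists starting at *t* = 0, and block averaging is used as a crude stand-in for the proper low-pass filter that the cited studies apply.

```python
def temporal_features(epoch, sfreq, channels=("FCz", "Cz"), tmin=0.2, tmax=0.6, step=8):
    """Concatenate block-averaged samples from a post-stimulus window.

    epoch:    dict mapping channel name to a list of samples, with t = 0 at index 0.
    sfreq:    sampling rate in Hz.
    step:     number of consecutive samples averaged into one feature
              (crude smoothing and downsampling; an assumption of this sketch).
    """
    start = int(round(tmin * sfreq))
    stop = int(round(tmax * sfreq))
    features = []
    for ch in channels:
        segment = epoch[ch][start:stop]
        # Average non-overlapping blocks; an incomplete trailing block is dropped.
        for i in range(0, len(segment) - step + 1, step):
            block = segment[i:i + step]
            features.append(sum(block) / step)
    return features


# Synthetic one-second epoch at 256 Hz with constant signals, for illustration only.
epoch = {"FCz": [1.0] * 256, "Cz": [2.0] * 256}
features = temporal_features(epoch, sfreq=256)
print(len(features))
```

The resulting vector would then be passed to a classifier; in practice the window, channel set, and smoothing are tuned per study and often per subject.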

A few studies have tested the feasibility of exploiting features computed in the frequency domain (Bollon et al., 2009; Omedes et al., 2013) with generally encouraging results. Interestingly, Omedes et al. (2013) tested the use of theta power as a feature for classification in the three experiments shown in **Figure 3**. They evaluated which type of features generalize better across tasks by measuring the classifier performance in a different task than the one it was trained for. Offline tests of data from six subjects showed smaller performance degradation across tasks for classifiers using frequency features compared to those using temporal features. A separate study showed that ErrP variation across these tasks is mainly due to latency (Iturrate et al., 2014). This suggests that frequency-based features may be less sensitive to temporal jitter across individual ErrP trials.
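A frequency-domain feature of the kind discussed above can be sketched as the power of a single-channel epoch in the theta band. The band limits (4 to 8 Hz), the function name, and the use of a plain discrete Fourier transform are assumptions of this sketch; the cited studies may use different bands and spectral estimators. Because the magnitude spectrum discards phase, such a feature does not change when the same oscillation is shifted in time within the epoch, which is consistent with the reduced sensitivity to latency jitter noted above.

```python
import cmath
import math


def band_power(signal, sfreq, fmin=4.0, fmax=8.0):
    """Sum of squared one-sided DFT magnitudes (normalized by n^2) within a band."""
    n = len(signal)
    mean = sum(signal) / n
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sfreq / n
        if fmin <= freq <= fmax:
            coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            power += abs(coeff) ** 2 / n ** 2
    return power


# Synthetic one-second signals at 128 Hz: a 6 Hz (theta) and a 20 Hz sine.
sfreq = 128
theta_wave = [math.sin(2 * math.pi * 6 * t / sfreq) for t in range(sfreq)]
beta_wave = [math.sin(2 * math.pi * 20 * t / sfreq) for t in range(sfreq)]
theta_power = band_power(theta_wave, sfreq)
beta_in_theta = band_power(beta_wave, sfreq)
print(theta_power, beta_in_theta)
```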

Goel et al. (2011) tested the use of features computed on the intracranial EEG sources, which have been estimated using inverse solution methods. The hypothesis is that projection into the source space can act as a spatial filtering technique that increases the signal-to-noise ratio of neurophysiologically relevant discriminant features. Offline analysis in the monitoring protocol depicted in **Figure 3A** showed improved performance, in terms of area under the curve, AUC (Fawcett, 2006), with respect to previously reported results using surface EEG classifiers in the six subjects analyzed. Further online experiments confirmed the validity of using these features for classification, although their potential advantages over standard methods are yet to be fully validated (Goel, 2013).

Taking into consideration the cross-regional interactions reported in neurophysiological studies (c.f., section 2), Zhang et al. (2012) evaluated the use of single-trial estimation of connectivity patterns for ErrP classification. They computed directional interaction across channels using a modified directed transfer function (DTF) method in different frequency bands (Kamiński and Blinowska, 1991). Offline analysis of data on 16 subjects using the same monitoring protocol as above showed discriminant fronto-central interactions in the theta rhythm that yielded single-trial decoding above chance level. Furthermore, the combined use of connectivity and time-based features gave significantly better performance than temporal features alone, suggesting that the two feature sets convey complementary information.

The most common classification techniques used for the decoding include linear discriminant analysis (LDA) or its variations such as Fisher LDA or regularized LDA (Blankertz et al., 2003; Parra et al., 2003; Lehne et al., 2009; Ventouras et al., 2011; Iturrate et al., 2014), as well as Gaussian classifiers (Ferrez and Millán, 2008a; Kreilinger et al., 2009; Chavarriaga and Millán, 2010; Perrin et al., 2012) and support-vector machines (SVM) (Artusi et al., 2011; Ventouras et al., 2011; Wang et al., 2011; Spüler et al., 2012). In their study, Spüler et al. (2012) performed an offline comparison of LDA, step-wise LDA, and SVMs with linear and radial basis function (RBF) kernels. For this analysis they used previously recorded data of six patients with ALS. Using a 10-fold cross-validation test they selected the RBF-kernel SVM as the most suitable for their application (P300-speller). Unfortunately, they did not report what specific criteria were used for this selection nor the performance of each method. Similarly, Ventouras et al. (2011) compared the performance of SVM and K-NN classifiers on the decoding of ErrPs elicited after manual responses. Their analysis using different feature selection mechanisms and leave-one-out cross-validation showed no significant differences between the two methods. Wang et al. (2011) also compared classifier performance when decoding those signals. Interestingly, they found little difference between the sensitivity obtained with LDA and SVM classifiers when training and testing on the same subject's data. In contrast, when the test was performed on a subject that was not part of the training set, the LDA yielded higher sensitivity. However, their results were close to chance level and may not be significant given that only a small number of trials were available for the error class.
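Since LDA is the most frequently used classifier in these studies, a compact sketch of a two-class Fisher LDA may help fix ideas. It is restricted to two features so that the covariance inverse can be written explicitly; the small ridge term, the midpoint threshold (which assumes equal class priors), and the toy data are assumptions of this sketch and do not correspond to any cited implementation.

```python
def fit_lda_2d(correct, error, ridge=1e-6):
    """Fisher LDA for two classes of 2-D feature points; returns (weights, bias)."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    m0, m1 = mean(correct), mean(error)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((correct, m0), (error, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    sxx += ridge  # small regularization keeps the matrix invertible
    syy += ridge
    det = sxx * syy - sxy * sxy
    dmx, dmy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dmx - sxy * dmy) / det, (-sxy * dmx + sxx * dmy) / det)
    # Threshold halfway between the projected class means (equal priors assumed).
    bias = -(w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1])) / 2.0
    return w, bias


def predict(w, bias, point):
    score = w[0] * point[0] + w[1] * point[1] + bias
    return "error" if score > 0 else "correct"


# Toy, well-separated feature points for illustration only.
correct_trials = [(0, 0), (1, 0), (0, 1), (1, 1)]
error_trials = [(3, 3), (4, 3), (3, 4), (4, 4)]
w, bias = fit_lda_2d(correct_trials, error_trials)
print(predict(w, bias, (0.5, 0.5)), predict(w, bias, (3.5, 3.5)))
```

With imbalanced training data, the midpoint threshold is one of the choices that would need revisiting, for instance by shifting the bias according to the desired sensitivity and specificity.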

A direct comparison of the performance obtained in these studies cannot be interpreted as a fair assessment of the advantages of a given decoding method. That is due to the differences in the pre-processing methods applied, the features selected for classification, and the reported performance metrics. Nevertheless, they often reported classification accuracies between 70 and 80%. As a tentative synthesis, one is under the impression that various classification methods reported in literature seem to obtain comparable results. Taking into account that most of these studies involve a rather small number of subjects, it is very likely that any performance differences are largely influenced by inter-subject variability.

In addition, efforts have been undertaken to design methods for fast training of the ErrP decoder. Some recent approaches rely on semi-supervised or unsupervised learning (Grizou et al., 2014; Zeyl and Chau, 2014). Another applied technique is the use of available data from other subjects to boost learning of a subject-dependent classifier (Iturrate et al., 2011; Putze et al., 2013). Alternatively, ErrPs have been shown to have common characteristics across tasks. In consequence, several methods have been proposed for online adaptation of classifiers trained in a previous protocol to the characteristics of the potentials elicited in the new task (Iturrate et al., 2013a; Kim and Kirchner, 2013; Iturrate et al., 2014). In this case, provided that an ErrP decoder has already been trained in a given task, the calibration time for a new task can be considerably reduced. Finally, as mentioned above, frequency features seemed less sensitive to task-dependent latency jitters in the neural response and were thus proposed as a potential means to implement task-independent classifiers (Omedes et al., 2013). These techniques have shown encouraging results, but have yet to be thoroughly tested to confirm their real advantages.

Lastly, besides direct single-trial classification problems, the ErrP detection process should take into account how this information will later be utilized for interaction. In particular, there may be application-dependent requirements in terms of sensitivity and specificity that need to be considered when choosing the classification technique and parameters (Parra et al., 2003; Dal Seno et al., 2010; Spüler et al., 2012).

# **8. DISCUSSION**

The works reviewed in this paper strongly support the feasibility of decoding error-related potentials and using the information they carry to improve the performance of BMI and HCI systems. There are, however, several challenges that need to be overcome until efficient and fully working applications can be implemented in the real world.

First of all, there is a clear need for further evaluation of the online exploitation of ErrPs. Although there is an increasing number of studies showing online decoding of error-related signals, in particular for P300 applications, they typically rely on a small number of subjects. These studies have already highlighted how individual differences may affect the overall performance of the error-correction mechanism (Perrin et al., 2012; Schmidt et al., 2012; Spüler et al., 2012). Therefore caution should be taken in the design of such studies. In particular, this includes the number of subjects involved in the study, the control conditions against which the performance will be evaluated, as well as the effects of subject learning and fatigue when testing over several sessions.

Further studies are also needed to evaluate the detection of these potentials in people with severe disabilities, which is the principal target user group of BMI today (one of the main potential applications of BMIs is the restoration or substitution of motor and communication capabilities). Spüler et al. (2012) reported encouraging results, showing that reliable ErrPs are elicited in patients with ALS and their decoding can improve performance of a P300-speller. Nonetheless, there is clearly a need for more studies characterizing these signals in different populations. Some studies have already pointed out age-related changes in the ERN (Davies et al., 2004; Wiersema et al., 2007), but it is yet to be assessed how these changes affect the discriminability between error and correct trials. Similarly, longitudinal studies may be necessary to identify how ErrPs change in the case of degenerative diseases. In addition, several works have pointed out that feedback modalities other than visual may be suitable for users in the locked-in state as they do not rely on volitional gaze control (Schreuder et al., 2010; Treder et al., 2011; Kaufmann et al., 2013). Preliminary evidence suggests that ErrPs can be elicited and, to some extent, decoded after tactile stimulation (Lehne et al., 2009; Chavarriaga et al., 2012) in healthy users. This possibility remains to be further tested, in particular in users with disabilities.

Another issue of interest concerns the evaluation of the performance of hybrid BMI systems exploiting ErrP-decoding. Typically, authors have reported changes using diverse metrics including accuracy, information transfer rate (Wolpaw et al., 2000), efficiency (Quitadamo et al., 2012) or utility value (Dal Seno et al., 2010). This denotes a lack of a formal framework for performance assessment (a problem common to the overall BMI field) and prevents the comparison of results across different studies (Thomas et al., 2013; Thompson et al., 2014). It is advisable that future works provide a comprehensive evaluation of performance reporting different metrics to enable such comparisons. Moreover, performance can be affected by protocol-specific parameters. For instance, in the case of ErrP-based correction, each command correction will have a cost (e.g., time required to undo the last action), and the overall benefit of the correction mechanism will depend on both this cost and the specificity of the ErrP decoder (Parra et al., 2003).
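Of the metrics listed above, the information transfer rate is the easiest to state compactly. The sketch below implements the widely used formula for bits per selection, B = log2(N) + P log2(P) + (1 − P) log2((1 − P)/(N − 1)), for N equally likely classes and accuracy P; the function names and the example values are hypothetical, and the formula's assumptions (equiprobable classes, errors spread uniformly over the remaining classes) rarely hold exactly in practice.

```python
import math


def bits_per_selection(n_classes, accuracy):
    """Information per selection for n_classes equiprobable targets."""
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def bits_per_minute(n_classes, accuracy, seconds_per_selection):
    """Scale bits per selection by the selection rate."""
    return bits_per_selection(n_classes, accuracy) * 60.0 / seconds_per_selection


# Hypothetical two-class BMI issuing one command every 3 s.
print(bits_per_minute(2, 0.70, 3.0), bits_per_minute(2, 0.90, 3.0))
```

When an ErrP-based correction is added, the time spent on undoing or repeating commands lengthens the effective selection time, so accuracy gains do not translate one-to-one into a higher transfer rate.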

This is intrinsically linked to the sensitivity and specificity of the decoding performance. The impact of the falsely decoded trials will be highly dependent on the application and the actions taken upon error detection. A clear difference is observed between the corrective and learning use of the ErrPs. In the first case, ErrP classification errors are explicitly perceived by the user. This can be counterproductive if the false detections appear to impair proper use of the interface (e.g., by rejecting or changing correct commands), even if improved performance is achieved in the long-term. As an example, Perrin et al. (2012) reported that some users, despite having good ErrP decoding performance, still preferred the implementation of the P300 speller without the correction, since they perceived no benefit with respect to use of the P300 alone.

In contrast, the learning approach where the classifier or the device controller is updated according to the outcome of the ErrP decoding may mask these false detections. Moreover, it has been shown that reinforcement learning algorithms can converge toward optimal policies even in the case of noisy estimation of the reward signals, provided that the estimation performance is above chance level (Sutton and Barto, 1998). One can expect, thus, that the use of ErrPs for learning has lower requirements in terms of the minimal acceptable performance than immediate command correction. This holds, of course, provided that the initial performance of the control interface is already acceptable for the user. Future work assessing these requirements from a human factors perspective is certainly needed for effectively designing interaction systems that exploit these error-related signals.

Following basic studies on the ERN and FRN, BMI efforts of decoding error-related signals have so far mostly focused on the time-locked response generated by a discrete feedback stimulus. In consequence, exploitation of the ErrPs has been largely restricted to discrete tasks such as the P300-based speller or stepwise movements (c.f. section 3). Besides limiting the range of applications and their naturalness, this approach also limits the throughput of the system since after each command a time interval of several hundred milliseconds is required for evaluating the presence of an ErrP. Consequently, further research is required toward decoding in more continuous setups.

Although not an easy task, several lines of progress seem to be open. One alternative is to increase the pace at which the stimuli are presented. Typically, experiments exploiting ErrPs have an inter-stimulus interval of 2 s or more. In contrast, experiments using rapid serial visual presentation (RSVP) have shown single-trial decoding of EEG correlates of visual recognition with stimulus presentation rates higher than 4 Hz (Gerson et al., 2006). It is still to be tested whether ErrPs can be detected in such fast-paced feedback presentations.

Another approach is to combine continuous feedback with additional discrete sensory events that are used to time-lock ErrPs (Kreilinger et al., 2011, 2012). The first study used a game application where the decoding of MI-related patterns controlled lane changes in a continuously moving animated car. It provided feedback for correct or wrong changes in the form of multiple predictable collisions with point tokens (positioned on the correct lane) or barriers (on the wrong lane). The second work applied an interesting approach to the BMI-control of a robot arm, combining the performance of a continuous mental task with discrete feedback to elicit ErrPs. Subjects had to perform MI during a given period; then the robot arm moved, and afterwards users had to assess whether the robot's movement lasted the same amount of time as the MI task. Visual cues provided information about the robot movement duration. Offline analysis showed ErrP decoding above random level; however, the performance in both studies was lower than that reported in purely discrete paradigms.

Besides the previous approaches employing, in essence, some strategy of circumvention of the continuous feedback problem, one can also try to directly tackle the possibility of decoding error-related signals in a purely asynchronous (non-time locked) manner. An example of such an attempt with invasive electrocorticographic recordings (ECoG), benefiting from a better signal-to-noise ratio than scalp EEG, demonstrated that error events can potentially be detected with good accuracy during a continuous task, given a temporal tolerance of several hundred milliseconds (Milekovic et al., 2013). For EEG, a potential avenue to explore is the use of the spectral content instead of temporal features. As shown in **Figure 2**, the event-related ErrPs are characterized by positive theta modulations. Interestingly, it has been shown that erroneous manual responses elicit increases in both phase- and non-phase-locked theta activity (Trujillo and Allen, 2007). Moreover, the power increase in non-phase locked activity was higher than for the phase-locked activity. Noticeably, ErrP-decoding performance based on theta-power features was shown to be less sensitive to task changes (Omedes et al., 2013). As already mentioned, the main ErrP changes across these tasks can be explained by latency shifts (Iturrate et al., 2014). Taken together, these findings support the notion that oscillatory activity can allow asynchronous detection of erroneous events in continuous tasks.

To summarize, different studies have repeatedly demonstrated the feasibility of decoding error-related EEG signals on a single-trial basis. This has been achieved both when the errors are committed by the user and when the errors are introduced by the interfacing device, in particular a BMI. The decoding accuracy of these signals is typically about 80%. These performance levels have been shown to usually be sufficient to improve the information transfer rate in different applications including motor-imagery based BMIs, P300 spelling and manual labeling of visual stimuli (c.f. section 4). Moreover, ErrPs can successfully be used as a learning signal to improve BMI decoders or the controller of an external device (c.f., section 5). All these results support the potential of error-related correlates to provide naturally elicited information about the user's cognitive state that can be used to adjust the machine's behavior.

Despite these successful studies, several aspects remain to be further explored, as detailed above. These include improvement in decoding of these signals in more complex applications, as well as their further characterization in subjects with disabilities. More importantly, large scale evaluations involving end-users have to be performed from a user-centered perspective to identify performance requirements and design criteria that allow for optimal exploitation of these correlates in practical applications.

# **FUNDING**

This work has been supported by the Swiss-funded SNSF NCCR Robotics. This paper only reflects the authors' view and funding agencies are not liable for any use that may be made of the information contained herein.

#### **ACKNOWLEDGMENT**

We thank H. Zhang, I. Iturrate, J. DiGiovanna, S. Rousseau, and J. Sanchez for useful discussions on this topic.

# **REFERENCES**


connectivity features," in *34th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'12)* (San Diego, CA). doi: 10.1109/ EMBC.2012.6347541

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 February 2014; accepted: 30 June 2014; published online: 22 July 2014. Citation: Chavarriaga R, Sobolewski A and Millán JdR (2014) Errare machinale est: the use of error-related potentials in brain-machine interfaces. Front. Neurosci. 8:208. doi: 10.3389/fnins.2014.00208*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Chavarriaga, Sobolewski and Millán. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Inference of human affective states from psychophysiological measurements extracted under ecologically valid conditions

*Alberto Betella1†, Riccardo Zucca1†, Ryszard Cetnarski 1, Alberto Greco2,3, Antonio Lanatà2,3, Daniele Mazzei 2, Alessandro Tognetti 2,3, Xerxes D. Arsiwalla1, Pedro Omedas 1, Danilo De Rossi 2,3 and Paul F. M. J. Verschure1,4\**

*<sup>1</sup> Synthetic, Perceptive, Emotive and Cognitive Systems group (SPECS), Universitat Pompeu Fabra, Barcelona, Spain*

*<sup>2</sup> Research Centre "E. Piaggio", University of Pisa, Pisa, Italy*

*<sup>3</sup> Information Engineering Department, University of Pisa, Pisa, Italy*

*<sup>4</sup> ICREA, Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain*

#### *Edited by:*

*Anne-Marie Brouwer, Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *Reviewed by:*

*Matthijs L. Noordzij, University of Twente, Netherlands Arjan Stuiver, University of Groningen, Netherlands*

#### *\*Correspondence:*

*Paul F. M. J. Verschure, Synthetic, Perceptive, Emotive and Cognitive Systems group (SPECS), Universitat Pompeu Fabra, C/Roc Boronat 138, Barcelona 08018, Spain e-mail: paul.verschure@upf.edu*

*†These authors have contributed equally to this work.*

Compared to standard laboratory protocols, the measurement of psychophysiological signals in real world experiments poses technical and methodological challenges due to external factors that cannot be directly controlled. To address this problem, we propose a hybrid approach based on an immersive and human accessible space called the eXperience Induction Machine (XIM), that incorporates the advantages of a laboratory within a life-like setting. The XIM integrates unobtrusive wearable sensors for the acquisition of psychophysiological signals suitable for ambulatory emotion research. In this paper, we present results from two different studies conducted to validate the XIM as a general-purpose sensing infrastructure for the study of human affective states under ecologically valid conditions. In the first investigation, we recorded and classified signals from subjects exposed to pictorial stimuli corresponding to a range of arousal levels, while they were free to walk and gesticulate. In the second study, we designed an experiment that follows the classical conditioning paradigm, a well-known procedure in the behavioral sciences, with the additional feature that participants were free to move in the physical space, as opposed to similar studies measuring physiological signals in constrained laboratory settings. Our results indicate that, by using our sensing infrastructure, it is indeed possible to infer human event-elicited affective states through measurements of psychophysiological signals under ecological conditions.

**Keywords: affect, conditioning, ecological validity, EDR, ECG, emotions, VR, wearable devices**

# **1. INTRODUCTION**

Decades of psychophysiological studies have demonstrated the role of the Autonomic Nervous System (ANS) in modulating human physiological responses (Andreassi, 2006; Cacioppo et al., 2007; Boucsein, 2012). This has facilitated an important channel for the inference of human affective states. In particular, two important physiological measurements of autonomic responses, namely, the electrodermal response (EDR) and electrocardiogram (ECG), have been widely used as indicators of psychological internal states and processes, due to their relative non-invasiveness, easy quantification and reliability (see Berntson et al., 2007; Dawson et al., 2007 for a review). ECG and EDR allow the extraction of heart-rate variability (HRV) and skin conductance (SC), respectively. HRV constitutes an objective index of sympathovagal balance (Stein et al., 1994; Acharya et al., 2006), while SC is a direct measure of eccrine sweat gland activity and reflects activity within the sympathetic axis of the ANS (Fowles et al., 1981; Valenza et al., 2010; Boucsein, 2012). EDR is acquired by measuring the electrical conductance of the skin on the palm of the hand (normally on the fingertips), where the concentration of eccrine glands is higher. The recorded signal consists of a superposition of two main components: a tonic level of skin conductance (SCL), representing the baseline of the signal, and a series of superimposed phasic increases in conductance. Each of these phasic elements represents a unitary skin conductance response (SCR), which reflects the response of eccrine sweat gland activity to an external stimulus (Boucsein, 2012). Heart rate is controlled by both the sympathetic and the parasympathetic branches of the ANS (Berntson et al., 2007).
Among the features that characterize cardiac response, one of the most reliable and robust measures with regard to ANS dynamics is HRV, which reflects the fluctuations of the cardiac beat-to-beat time interval (Acharya et al., 2006). Previous research has already shown that several HRV spectral measures are strictly related to sympatho-vagal balance variations (Stein et al., 1994; Valenza et al., 2013b). A broad consensus now exists among researchers that variations in electrical properties of the skin and cardiac activity are physiological markers of specific internal states associated with cognition and emotion (Ekman et al., 1983; Lang et al., 1993; Lang, 1995; Healey and Picard, 2005; Greco et al., 2012; Lanatà et al., 2012; Valenza et al., 2014).

Nonetheless, these physiological measures are prone to contamination by noise and artifacts that dramatically reduce their quality and reliability (Boucsein, 2012) and often occur in the same bandwidth as the signal, thus affecting its precision and informative utility. As a matter of fact, both ECG and EDR can be severely affected by such artifacts if not properly treated. External sources, such as the common power line noise (hum), can be detected from the spectrogram of the recorded signal and successfully reduced employing online or offline filters (or a combination of both). Other artifacts are due to physiological factors and are more difficult to treat. For instance, deep or irregular breathing, as well as speech, often induce an increase of non-specific EDR components (Hygge and Hugdahl, 1985; Schneider et al., 2003). In addition, physical activity itself can covary with the recorded physiological signals (Picard and Healey, 1997). The latter becomes a crucial issue in wearable systems, when acquisition is performed under real life conditions. For this reason, most research is conducted under strictly controlled laboratory settings or artificial clinical environments, where the subject is usually wired to fixed equipment and asked to avoid gross body movements in order to ensure optimal conditions in the acquisition of high quality signals. Strict laboratory settings, however, introduce other limitations. For instance, physiological responses, which are normally recorded for short segments of time (mainly due to the discomfort produced by long term use of the equipment), may not reflect the entire spectrum of emotional experiences that occur under everyday situations (Picard, 2000; Healey and Picard, 2005; Healey et al., 2010).
These shortcomings spurred an increased interest in the development and application of wearable technologies to different fields of research from clinical and rehabilitation to behavioral studies (Valenza et al., 2008; Lanatà et al., 2010a; Pantelopoulos and Bourbakis, 2010; Poh et al., 2010; Lanatà et al., 2011; Boucsein, 2012; Patel et al., 2012). In addition, a good amount of ambulatory physiological research has recently been directed toward the development of devices that are both accurate and robust to motion artifacts, while, at the same time, comfortable to wear, easy to use (Picard and Healey, 1997; Ebner-Priemer and Kubiak, 2007) and reliable for investigating physiological correlates of emotions in natural settings. Despite these advances, experiments involving psychophysiological measures of affective states in the natural world still present a number of technical drawbacks such as loss of connectivity during signal acquisition, fitting problems with the wearable devices, difficulty in accurately labeling data (which often relies exclusively on subjects' self-assessment) and other external factors, such as social interactions, that cannot be fully controlled by the experimenter.

In this context, laboratory settings and life-like conditions do not have to be seen as mutually exclusive paradigms, but can rather be viewed as complementary to each other (Fahrenberg et al., 2007). For instance, the use of Virtual Reality (VR) offers an excellent compromise between the laboratory and natural world. Through VR it is possible to create life-like environments where users can act through natural movements and gestures, while, at the same time, accounting for a systematic control of the stimuli and the variables that are studied. For this reason, we have built the eXperience Induction Machine (XIM), an immersive space we previously used to study human behavior in ecologically valid conditions (Eng et al., 2005; Bernardet et al., 2007, 2010). More recently, we used the XIM to investigate the salience of social stimuli by measuring participants' physiological responses induced by spatial interaction with humans and avatars using a commercial data acquisition system (g.MobiLab, g.Tec, Austria). We showed that the salience of a social stimulus is directly mapped onto the physiological correlates of arousal (Inderbitzin et al., 2013). With the aim of conducting ambulatory emotion research, we expanded the XIM's capabilities with the integration of new wearable sensors for the acquisition of psychophysiological signals (Omedas et al., 2014). These devices are capable of real-time measurements of body posture, arm orientation, hand position and finger movements, as well as psychophysiological signals such as ECG and EDR. These signals were specifically selected because of their reliability in inferring affective states and the possibility of integrating dedicated textile-based sensors on small-scale wearable devices (Paradiso and Caldani, 2010; Carbonaro et al., 2012).

In this paper we present results from two different studies conducted to validate the XIM as a general-purpose sensing infrastructure for investigating human affective states in lifelike conditions. In the first investigation, we aimed to validate the quality of the signals acquired through our new pool of wearable sensors. To do so, we exposed subjects to pictorial stimuli, covering the full range of arousal levels, while walking and performing hand gestures in the XIM environment. By interpreting the participants' ECG and EDR signals we were able to correctly classify and predict the presented stimuli in terms of arousal classes (a short version of the results was presented in Betella et al., 2014).

In the second study, we investigated whether we can correctly infer the features of conditioning from the signals recorded in a subject freely acting in a physical space while trained with a classical conditioning protocol. Classical conditioning has long been a well-grounded paradigm in cognitive and clinical neuroscience. In particular, conditioning has been widely investigated using EDR (see Boucsein, 2012 for a comprehensive review) and applied to different clinical treatments for anxiety, phobias and other behavioral disorders (Field, 2006). Our results not only demonstrate behavioral features of conditioning in an ecological setting, but also show how psychophysiological signatures of conditioning can be effectively extracted from the acquired signals.

In both our studies, the main goal was to use our sensing infrastructure (i.e., the XIM and the wearable sensors) to tackle the well-known challenge of measuring psychophysiological signals in the presence of movements and gestures, thus advancing beyond standard laboratory protocols. The obtained results confirm the validity of our approach for the inference of human event-elicited affective states in life-like conditions. We show that using our custom-made technology, it is indeed possible to infer participants' emotional states (i.e., arousal) from acquired psychophysiological signals.

# **2. MATERIALS AND METHODS**

# **2.1. THE SENSING INFRASTRUCTURE**

# *2.1.1. The eXperience Induction Machine (XIM)*

The XIM (see an illustration in **Figure 1**) covers an area of about 25 m² and is equipped with a number of effectors that include 8 projectors, 4 projection screens, a luminous interactive floor (Delbruck et al., 2007) and a sonification system (Le Groux et al., 2007). Along with the effectors, the XIM features a pool of sensors to measure users' explicit behavior and implicit states, including a marker-free multi-modal tracking system (Mathews et al., 2012), microphones and floor-based pressure sensors.

# *2.1.2. Wearable sensing systems*

The wearable devices used in this study were integrated into two main interfaces: a sensing glove for the simultaneous acquisition of hand gestures and EDR, and a sensing shirt for the acquisition of ECG and respiration. The choice of textile integrated sensors was primarily dictated by the advantages in terms of portability and usability for long-term monitoring and because they provide minimal constraints in terms of natural gestures and movements.

*2.1.2.1. Sensing glove.* The sensing glove (**Figure 2**) was specifically designed for the XIM space and it was conceived to measure both explicit (gestural information through forearm orientation and finger positions) and implicit (psychophysiological inference through EDR) signals. Two textile electrodes interwoven in the index and middle fingertips are used to measure EDR. In a previous work, electrical characteristics of interwoven textile electrodes were investigated in comparison with standard Ag/AgCl electrodes through simultaneous acquisition of the same EDR signals (Lanatà et al., 2012), where the impedance of textile electrodes in the EDR bandwidth was evaluated using a standard reference electrolytic cell and a high precision impedance meter. This study demonstrated that textile electrodes are equivalent to standard Ag/AgCl electrodes, thus allowing the acquisition of high quality signals. Grounded on previous results obtained with conductive elastomer sensors (Tognetti et al., 2006; Vanello et al., 2008), finger motion tracking is obtained through five textile deformation sensors integrated on the glove metacarpophalangeal hand joints. Sensors are made of knitted piezoresistive fabric (KPF) material, previously demonstrated to be a valid choice for biomechanical and cardiopulmonary data acquisition (Taccini et al., 2008; Pacelli et al., 2013). Finger movements produce local deformations in the fabric that modify the electrical resistance of the sensors, which is highly correlated with the single finger degree of flexion. A customized algorithm for hand gestures recognition was developed (Carbonaro et al., 2012) to deal with the slow baseline drift of the sensor signal which is due to intrinsic characteristics of the textile substrate. Both EDR and deformation signals are acquired in real-time through a dedicated wearable and wireless electronic unit. Moreover, forearm orientation is measured by an Inertial Measurement Unit (HMC6343, Honeywell, MN, USA) embedded in the glove's electronics and worn on the dorsal area of the forearm close to the wrist.

**FIGURE 1 | Schematic illustration of the eXperience Induction Machine (XIM).** XIM is a 5 × 5 × 4 m infrastructure equipped with a number of effectors (8 projectors, 4 projection screens, a luminous interactive floor and a sonification system) and sensors (marker-free multi-modal tracking system, microphones and floor-based pressure sensors). In addition, a custom-built sensing glove is used for the acquisition of hand gestures and electrodermal response (EDR), while a sensing shirt is used for the acquisition of electrocardiogram (ECG) signals and respiration (BR).

Compared with other similar devices (e.g., the Aladin sensor glove; Ritter, 2009), our sensing glove is not limited to the measurement of EDR, but integrates a broader pool of sensors while maintaining a high level of comfort and usability. Acquisition of finger position and forearm orientation through wearable sensing systems allows fine user actions to be tracked more precisely and enables multi-user scenarios (i.e., avoiding the common problem of camera occlusion). In immersive environments, such as XIM, the use of low cost video-based systems like Kinect and LeapMotion does not guarantee an optimal resolution to track finger angles and hand/forearm orientation, especially when the users are not correctly positioned with respect to cameras. In addition, professional infrared (IR) sensors and magnetic or radio based tracking systems are usually prone to interference in cave-like environments due to the simultaneous presence of light projection systems, metal scaffolding and wireless devices.

*2.1.2.2. Sensing shirt.* The sensing shirt (Smartex srl, Italy) (Paradiso et al., 2005) is used in XIM to acquire ECG, Breathing Rate (BR) and tri-axial accelerometer data. It is equipped with a tiny battery-powered electronic unit, contained in the front pocket, that streams the acquired data through a Bluetooth connection. ECG is acquired through two interwoven textile electrodes placed inside the shirt near the lower section of the pectoral muscles. BR is acquired through a KPF strain sensor interwoven in the shirt (Lanatà et al., 2010b; Paradiso and Caldani, 2010). This wearable system has been employed in previous studies on long-term monitoring of chronic patients, focusing on the early prevention of cardiovascular diseases (Scilingo et al., 2005). This sensing shirt has also been adopted in human-robot interaction studies aimed at investigating psychophysiological states of autistic children (Mazzei et al., 2012). The states were inferred by analysing signals acquired through the sensing shirt during the therapeutic protocol (Mazzei et al., 2010, 2011; Lazzeri et al., 2014) and the obtained results empirically validated the device's capabilities in the acquisition of high quality signals suitable for the analysis of psychophysiological measures.

#### *2.1.3. Data recording*

The data coming from the wearable devices are conveyed to the XIM sensing platform, which captures and processes raw sensor data in real time (Wagner et al., 2013a). This platform is implemented using the Social Signal Interpretation (SSI) framework (Wagner et al., 2013b), a set of tools for the recording, analysis and recognition of human behavior in real time. The data stream of each sensor is transmitted through dedicated separate channels and preprocessed. The sensing platform then synchronizes the incoming streams by establishing a stable connection with all the sensors and by buffering data streams. Single buffers are compared at regular time intervals according to an internal timestamp and synchronized, if necessary. Following the synchronization, each signal is processed separately to isolate noise and artifacts from relevant information.

### **2.2. ELECTROCARDIOGRAM**

From the data recorded via the sensing shirt we computed the HRV signal as the variability of the distance between consecutive R-peaks over time. R-wave peaks were taken as the reference point, since the R-peak is the point of the QRS complex that is most robust to noise and artifacts. Before extracting the R-peaks, the ECG signal was filtered using a Moving Average Filter (MAF) to extract and subtract the baseline, since it is commonly affected by low frequency disturbances (e.g., respiration activity). The Heart Rate (HR) was defined as *HR* = 60/*t*<sub>*R*−*R*</sub>, where *t*<sub>*R*−*R*</sub> is the time interval (in seconds) between two successive R-peaks. In order to detect QRS complexes, we treated the ECG signal using the automatic QRS detection algorithm proposed by Pan and Tompkins (1985), after which the R-peaks could be extracted. The obtained HR resulted in a time series of non-uniform RR intervals, hence it was interpolated and re-sampled in accordance with the recommendations of Berger et al. (2007). In this study, we solely focused on the intervals between normal (sinus) beats (NN-intervals). Given the RR time series, a set of features was extracted in both time and frequency domains as well as by using non-linear methods (Acharya et al., 2006; Valenza et al., 2012a), which are summarized in **Table 1**. In the time domain, we extracted statistical parameters and morphological indices. Time-domain features were computed within consecutive non-overlapping time windows of 30 s in which a series of RR intervals were present. It is worthwhile noting that, for short-term ECG acquisitions (i.e., lower than several hours), windows should not be less than 20 s (Camm et al., 1996).
More specifically, we derived simple MNN and SDNN, corresponding to the mean value and to the standard deviation of the NN intervals, and several statistical measures, such as standard error of the mean, root mean square, mean of squares, sum of squares, skewness, kurtosis' excess coefficient, mean absolute deviation, root mean square of successive differences of intervals (RMSSD), as well as several central moments. We also computed the number of successive interval differences greater than 50 ms (NN50) and the same feature normalized with respect to the total number of intervals (pNN50).

#### **Table 1 | Summary of the features extracted from HRV/HR and SC signals.**

In addition to the above statistical measures, a series of geometric measures were calculated from the RR intervals histogram. The HRV triangular index was obtained as the integral of the histogram (i.e., total number of RR intervals) divided by the height of the histogram, which was dependent on the selected bin width. Finally we extracted the TINN which corresponds to the baseline width of the RR histogram evaluated through triangular interpolation (see Camm et al., 1996 for details).
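As an illustration of the time-domain and geometric measures described above, the following minimal sketch computes a subset of them for a single window of NN intervals. It is not the authors' implementation: the function name, the dictionary keys and the default histogram bin width of 1/128 s are assumptions made for the example, and pNN50 is normalized by the total number of intervals, following the wording in the text.

```python
import math
from collections import Counter

def hrv_time_features(nn_s, bin_width_s=1.0 / 128.0):
    """Time-domain and geometric HRV measures for one window.

    nn_s: NN intervals in seconds. bin_width_s is an assumed histogram
    bin width; the triangular index depends on this choice.
    """
    n = len(nn_s)
    mnn = sum(nn_s) / n
    # sample standard deviation of the NN intervals
    sdnn = math.sqrt(sum((x - mnn) ** 2 for x in nn_s) / (n - 1))
    # successive differences between adjacent intervals
    diffs = [b - a for a, b in zip(nn_s, nn_s[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nn50 = sum(1 for d in diffs if abs(d) > 0.050)
    pnn50 = nn50 / n  # normalized by the total number of intervals
    # histogram of NN intervals; triangular index = N / height of modal bin
    hist = Counter(math.floor(x / bin_width_s) for x in nn_s)
    tri_index = n / max(hist.values())
    return {"MNN": mnn, "SDNN": sdnn, "RMSSD": rmssd,
            "NN50": nn50, "pNN50": pnn50, "HRV_tri": tri_index}

features = hrv_time_features([0.80, 0.86, 0.80, 0.86, 0.80])
```

In practice the function would be applied to each 30 s window in turn. The remaining statistical moments and the TINN, which requires a triangular fit to the histogram, are omitted here for brevity.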

In the HRV frequency domain analysis, three main spectral components were distinguished in a spectrum calculated from short term recordings: Very Low Frequency (VLF) [0.003–0.04 Hz], Low Frequency (LF) [0.04–0.15 Hz], and High Frequency (HF) [0.15–0.4 Hz]. Here, short term recordings refer to the duration of the HRV signal segments, which in this work matched the picture presentation time. It is well known from the literature that the distribution of the spectral power gives an indication of modulations in the Autonomic Nervous System (ANS). Current HRV research in the frequency domain suggests that, even though the frequency band division represents a unique non-invasive tool to achieve an assessment of autonomic function, the use of HF and LF components does not allow the state of sympathetic activation to be assessed precisely. Therefore, along with the estimation of the Power Spectral Density in the VLF, LF and HF bands, we also calculated the LF/HF PSD Ratio, which provides information about the Sympatho-Vagal balance (Camm et al., 1996).
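The band-power computation can be sketched as follows. The text does not specify the spectral estimator, so this example assumes linear interpolation of the RR series onto a uniform 4 Hz grid followed by a plain DFT periodogram. All function names, the resampling rate and the synthetic test signal are illustrative assumptions, not the authors' pipeline.

```python
import cmath
import math

def resample_rr(rr_s, fs=4.0):
    """Linearly interpolate an RR series (seconds) onto a uniform time grid."""
    t_beats, t = [], 0.0
    for rr in rr_s:
        t += rr
        t_beats.append(t)  # time of the beat that closes each interval
    n_out = int((t_beats[-1] - t_beats[0]) * fs) + 1
    out, j = [], 0
    for k in range(n_out):
        tk = t_beats[0] + k / fs
        while j < len(t_beats) - 2 and t_beats[j + 1] < tk:
            j += 1
        w = (tk - t_beats[j]) / (t_beats[j + 1] - t_beats[j])
        out.append(rr_s[j] + w * (rr_s[j + 1] - rr_s[j]))
    return out

def band_power(x, fs, f_lo, f_hi):
    """Summed periodogram power of x for f_lo <= f < f_hi (Hz)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]  # remove the DC component
    power = 0.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f_lo <= f < f_hi:
            coeff = sum(xc[i] * cmath.exp(-2j * math.pi * k * i / n)
                        for i in range(n))
            power += abs(coeff) ** 2 / n
    return power

def lf_hf_ratio(rr_s, fs=4.0):
    """LF (0.04-0.15 Hz) over HF (0.15-0.4 Hz) power of the RR series."""
    x = resample_rr(rr_s, fs)
    return band_power(x, fs, 0.04, 0.15) / band_power(x, fs, 0.15, 0.40)

def synthetic_rr(f_mod, n_beats=150):
    """Synthetic RR series modulated at f_mod Hz (test data only)."""
    rr, t = [], 0.0
    for _ in range(n_beats):
        r = 0.8 + 0.05 * math.sin(2 * math.pi * f_mod * t)
        rr.append(r)
        t += r
    return rr

ratio_lf_dominant = lf_hf_ratio(synthetic_rr(0.10))
ratio_hf_dominant = lf_hf_ratio(synthetic_rr(0.25))
```

A series modulated at 0.10 Hz should yield a ratio above 1, and one modulated at 0.25 Hz a ratio below 1. An averaged or windowed estimator would normally be preferred for real data, since the raw periodogram used here has high variance.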

With regard to non-linear analysis of HRV, it is reasonable to assume that non-linear mechanisms are involved in the genesis of HRV. In the last decade, several non-linear measures have been used to investigate HRV behavior that is not fully captured by standard measures in either the time domain or the frequency domain. Further details on HRV non-linear methods can be found in Fusheng et al. (2000); Zbilut and Webber (2006); Chua et al. (2008); Valenza et al. (2012a,b, 2013a). In this work, we used the Poincaré plot (a graphical representation of the correlation between successive RR intervals) and entropy. More specifically, from the Poincaré plot we used the short and long term variability (i.e., SD1 and SD2). Moreover, non-linear measures often suffer from the curse of dimensionality, i.e., they cannot reliably be estimated because of the lack of a sufficient number of points in the time series. For this reason, we estimated system complexity, which allows the dynamics to be characterized quantitatively even with short time series (Kurths et al., 1995; Wessel et al., 2000). Entropy was also employed since it has already been adopted in HRV analysis with encouraging results (Pincus and Goldberger, 1994).
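The Poincaré descriptors can be sketched as follows, using the common definition in which SD1 and SD2 are the standard deviations of the point cloud (RR<sub>n</sub>, RR<sub>n+1</sub>) perpendicular to and along the line of identity. The function name is an assumption, and the source does not state which estimator the authors used.

```python
import math

def poincare_sd(rr_s):
    """Return (SD1, SD2) of the Poincare plot of successive RR intervals."""
    pairs = list(zip(rr_s[:-1], rr_s[1:]))
    # rotate the (RR_n, RR_n+1) cloud by 45 degrees
    across = [(b - a) / math.sqrt(2) for a, b in pairs]  # perpendicular to identity line
    along = [(b + a) / math.sqrt(2) for a, b in pairs]   # along the identity line

    def sample_sd(values):
        m = sum(values) / len(values)
        return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

    return sample_sd(across), sample_sd(along)

# strictly alternating series: all variability is short-term
sd1, sd2 = poincare_sd([0.80, 0.86, 0.80, 0.86, 0.80])
```

An alternating series has a large SD1 and an SD2 of zero, whereas a steadily drifting series shows the opposite pattern. This is why SD1 and SD2 are read as short and long term variability, respectively.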

#### **2.3. ELECTRODERMAL RESPONSE**

EDR was obtained as the ratio between an imposed continuous voltage of 0.5 V applied to the index and middle fingers and the flowing current. In order to analytically split the EDR signal into its tonic (SCL) and phasic (SCR) components we adopted a convolution model (Benedek and Kaernbach, 2010a) where the Sudomotor Nerve Activity (SMNA) can be seen as the input of the model whose output is the EDR. The Impulse Response Function (IRF) is represented by a biexponential function (called the Bateman function) (Garrett, 1994), which is defined as follows:

$$IRF(t) = \left(e^{-\frac{t}{\tau_1}} - e^{-\frac{t}{\tau_2}}\right) \cdot u(t) \tag{1}$$

where *u*(*t*) is the unit step function and *τ*<sub>1</sub> and *τ*<sub>2</sub> are the two time constants of the biexponential. Consequently, the result of the deconvolution of the EDR signal by the IRF can be defined as the *driver function*, describing the SMNA behavior. This processing method allows for the identification of intrinsically overlapped inter-stimulus phasic responses, which are due to a sequence of stimuli in time. Further details on the EDR deconvolution method can be found in Benedek and Kaernbach (2010a).
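To make the convolution model concrete, the sketch below evaluates the Bateman function of Equation (1) and uses it in the forward direction, turning a hypothetical impulse-like driver into a phasic conductance trace. The time constants (2.0 s and 0.75 s), the sampling rate and the function names are illustrative assumptions, not values reported in the text. The deconvolution itself is performed by Ledalab and is not reproduced here.

```python
import math

def bateman(t, tau1=2.0, tau2=0.75):
    """Biexponential impulse response of Eq. (1); zero for t < 0 (unit step).

    tau1 > tau2 is assumed so that the response is non-negative.
    """
    if t < 0:
        return 0.0
    return math.exp(-t / tau1) - math.exp(-t / tau2)

def convolve_driver(driver, fs, tau1=2.0, tau2=0.75, irf_len_s=20.0):
    """Forward model: phasic conductance as the driver convolved with the IRF."""
    irf = [bateman(k / fs, tau1, tau2) for k in range(int(irf_len_s * fs))]
    out = [0.0] * len(driver)
    for i, d in enumerate(driver):
        if d == 0.0:
            continue  # skip samples without sudomotor activity
        for k, h in enumerate(irf):
            if i + k >= len(out):
                break
            out[i + k] += d * h
    return out

fs = 10.0                      # assumed sampling rate in Hz
driver = [0.0] * 100
driver[0] = 1.0                # a single sudomotor burst at t = 0
scr = convolve_driver(driver, fs)
peak_time_s = scr.index(max(scr)) / fs
```

With these assumed constants the response peaks at τ<sub>1</sub>τ<sub>2</sub> ln(τ<sub>1</sub>/τ<sub>2</sub>)/(τ<sub>1</sub> − τ<sub>2</sub>) ≈ 1.18 s after the burst. Because the model is linear, bursts that arrive before the previous response has decayed simply add up, which is the overlap that the deconvolution is meant to resolve.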

Decomposition of the skin conductance signal into its components was performed using the Ledalab software package in MATLAB (Benedek and Kaernbach, 2010b). The signal was filtered by means of a low pass zero-phase forward and reverse digital filter (Mitra and Kuo, 2006) with a cutoff frequency of 2 Hz. The phasic features were calculated within a time window (response window) of up to 5 s following the stimulus onset. We extracted the number of SCRs within the response window (nSCR), the latency of the first SCR (Lat), the Amplitude-Sum of SCRs reconvolved from phasic driver-peaks (AmpSum), the average phasic driver activity (Mean.SCR), computed as the time integral over the response window divided by the size of the response window, the variance of the phasic driver signal (Var.SCR), the phasic driver area under curve (AUC.SCR) and the maximum phasic driver amplitude (Max.SCR).
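The window-based driver features can be illustrated with a short sketch. It assumes that a phasic driver signal is already available as a list of samples and covers only Mean.SCR, Var.SCR, AUC.SCR and Max.SCR. The peak-based features (nSCR, Lat, AmpSum) depend on Ledalab's own peak detection and are not reproduced. The function name and the rectangle-rule integration are assumptions.

```python
def phasic_features(driver, fs, onset_s, window_s=5.0):
    """Driver statistics inside the response window after stimulus onset."""
    start = int(round(onset_s * fs))
    stop = min(len(driver), start + int(round(window_s * fs)))
    seg = driver[start:stop]
    n = len(seg)
    auc = sum(seg) / fs            # rectangle-rule time integral
    mean = auc / (n / fs)          # integral divided by the window length
    var = sum((v - mean) ** 2 for v in seg) / n
    return {"Mean.SCR": mean, "Var.SCR": var,
            "AUC.SCR": auc, "Max.SCR": max(seg)}

# toy driver sampled at 1 Hz with one response starting at t = 10 s
toy_driver = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 35
feats = phasic_features(toy_driver, fs=1.0, onset_s=10.0)
```

Dividing the integral by the window length makes Mean.SCR comparable across windows that are truncated at the end of a recording, which is one possible reading of the "up to 5 s" window in the text.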

From the tonic driver signal, we obtained the following features: the average level of the (decomposed) tonic component (Mean.Tonic), the variance of the tonic driver signal (Var.Tonic) and the number of non-specific responses (i.e., spontaneous skin conductance responses unrelated to a specific stimulus) (NSR). See **Table 1** for a summary of the features extracted.
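The window-based driver statistics described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not Ledalab's implementation: the driver is assumed to be a uniformly sampled list, the integral uses the rectangle rule, and the function name is hypothetical.

```python
import statistics

def phasic_window_features(driver, fs, onset_s, window_s=5.0):
    """Summary statistics of a phasic driver inside one response window."""
    start = int(round(onset_s * fs))
    stop = start + int(round(window_s * fs))
    seg = driver[start:stop]
    if len(seg) < 2:
        raise ValueError("response window falls outside the recording")
    dt = 1.0 / fs
    auc = sum(seg) * dt  # rectangle-rule time integral over the window
    return {
        "Mean.SCR": auc / (len(seg) * dt),  # integral divided by window length
        "Var.SCR": statistics.pvariance(seg),
        "AUC.SCR": auc,
        "Max.SCR": max(seg),
    }

# Made-up driver sampled at 10 Hz: flat activity of 1.0 between 1 s and 6 s.
driver = [0.0] * 10 + [1.0] * 50 + [0.0] * 40
features = phasic_window_features(driver, fs=10, onset_s=1.0)
```

The same function applied to the tonic driver over a longer window would give quantities analogous to Mean.Tonic and Var.Tonic; counting SCRs (nSCR, NSR) additionally requires a peak-detection criterion that is not reproduced here.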

#### **2.4. MOVEMENT DETECTION AND ARTIFACTS**

Along with the acquired ECG and EDR signals, acceleration, finger flexion and forearm orientation were also recorded using the tri-axial accelerometer, textile deformation sensors and the Inertial Measurement Unit embedded in the sensing shirt and glove (see Section 2.1.2 for more details). Moreover, in the first study all the subjects were video recorded for the entire duration of the experiment, and a precise annotation of the type of movements performed by the subjects was obtained. These sources of information were then used to isolate body movements (e.g., in our scenario the subject walks in XIM, grabs virtual objects and/or points with the hand toward different parts of the virtual scene), and subsequently identify those parts of the physiological signals that were potentially affected by motion artifacts.

In general, we expect portions of the acquired signals to be affected by artifacts, alternating with unaffected "clean" segments. The amount of affected signal increases with the intensity of the physical activity and the frequency of movement. In this context, we hypothesize that, at least in the case of mild physical activity related to interaction in virtual reality, the segments of clean signal that can be extracted are sufficient to allow for a reliable identification of event-elicited affective states. Motion artifacts mostly affected the EDR signal, because the grabbing gesture physically moves the EDR electrodes. The ECG signal, measured on the subject's chest, is instead normally not affected by hand or arm gestures. For this reason, the ECG recording was preprocessed as reported in Section 2.2 and the entire time window related to the stimulus presentation was used to further analyze the ECG signal.

An example of EDR signals acquired from the sensing glove in a representative subject, with and without the motion artifact induced by a series of grabbing gestures, is provided in **Figure 3A**. To obtain it, we simultaneously acquired EDR using two gloves, one on the left and one on the right hand. We instructed the subject to relax as much as possible (to avoid changes in affective states) and to use only the right hand to perform a series of grabbing gestures. Given the high impact of this gesture on the acquired EDR measure, the signal affected by this artifact cannot be used for the classification of affective states. These artifacts are due to skin stretch and/or compression and are particularly evident during the grabbing events, when the electrodes mounted on the glove fingertips physically touch the palm of the hand. **Figure 3B** illustrates the glove finger signals corresponding to the red EDR trace in **Figure 3A** and clearly shows that finger motion and the EDR signal affected by the gesture are highly correlated. This qualitative observation is further confirmed by a Fourier analysis performed on finger flexion and EDR during slow hand grabbing tasks (**Figure 3C**), which shows a strong correlation in the frequency content of the EDR and motion signals.

**Figure 3 |** [...] at the same time as the peaks observed in the EDR. Each trace in **(A,B)** is the average of three sweeps ± standard deviation. **(C)** Comparison of EDR and finger flexion spectral contents. The Y axis shows the normalized amplitude of the Fourier transform.

The portions of the signals that were strongly affected by artifacts (e.g., grabbing gestures) were excluded from the analysis, with the aim of demonstrating that the remaining physiological signals associated with the induced stimuli were sufficient to accurately classify users' affective states. **Figure 4** shows a 90-s segment of an annotated recording of one participant in our first study (Section 3), where the motion events captured through the sensors and the EDR signal are labeled according to the natural movements of the subject extracted from the analysis of the video recorded during the experimental trial. With the exception of the right-hand grab event occurring after 12 s of recording (and thus excluded from the analysis), the body movements did not produce strong artifacts in the remaining portion of the EDR signal, which allowed for the extraction of the features used to classify the subject's affective state. Motion events, such as hand grabbing, generate synchronous spikes on the EDR signal, as shown in **Figure 3**. The grabbing event can be easily extracted from the finger sensors through adaptive threshold-based algorithms such as the one described in Carbonaro et al. (2012), which were implemented directly in our sensing platform SSI (see Section 2.1.3). In the specific context of ambulatory research on event-elicited emotion using our sensing infrastructure (i.e., the work presented here), strong artifacts produced by movements accounted for only a small portion of the data collected (corresponding to the grabbing events) and were manually excluded from the analysis during post-processing.
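In this work the affected segments were excluded manually. Purely as an illustration of how such an exclusion could be automated once grab events have been detected, the following Python sketch masks a margin around each event and returns the remaining clean runs. The margins, the minimum segment length and the function names are hypothetical and are not part of the SSI platform.

```python
def artifact_mask(n_samples, fs, grab_times_s, pre_s=1.0, post_s=3.0):
    """True where a sample lies within [t - pre_s, t + post_s] of a grab event."""
    mask = [False] * n_samples
    for t in grab_times_s:
        lo = max(0, int((t - pre_s) * fs))
        hi = min(n_samples, int((t + post_s) * fs) + 1)
        for i in range(lo, hi):
            mask[i] = True
    return mask

def clean_segments(mask, fs, min_len_s=5.0):
    """Return (start, stop) sample indices of artifact-free runs of at least min_len_s."""
    segments, start = [], None
    for i, bad in enumerate(mask + [True]):  # sentinel closes the final run
        if not bad and start is None:
            start = i
        elif bad and start is not None:
            if (i - start) / fs >= min_len_s:
                segments.append((start, i))
            start = None
    return segments

# A 90 s recording at 10 Hz with one grab event at 12 s (illustrative values).
mask = artifact_mask(n_samples=900, fs=10, grab_times_s=[12])
print(clean_segments(mask, fs=10))
```

Features would then be extracted only from windows that fall entirely inside one of the returned segments.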

# **3. STUDY 1**

The aim of the first study was to correctly discriminate stimulus-elicited subjective arousal levels from the recorded ECG and EDR signals. We expected to observe a significant increase in SC peaks and HR when the subjects were exposed to highly arousing pictures. Moreover, we expected to observe significant changes in the values of the features extracted from the psychophysiological signals, in accordance with the arousal of the stimuli.

#### **3.1. SAMPLE AND PROTOCOL**

A total of 7 voluntary subjects (4 females and 3 males, mean age = 29.7, *SD* = ±3.9) recruited from the University campus participated in this empirical validation. Before participating in the experiment, all the subjects read and signed an informed consent form declaring that they clearly understood all the experimental procedures and the aim of the study. The protocol of the experiment was approved by the local Ethical Committee.

Twelve different pictorial stimuli were selected from the IAPS picture collection (Lang et al., 2008). Each stimulus represented a different arousal rating, thus covering the entire arousal scale from a minimum rating of 1.72 to a maximum of 7.34 (see **Table 2** for a summary of the stimuli used for this study). Before being exposed to the images, participants were helped by the experimenter to put on the sensing shirt and the glove. A short test phase to verify the correct positioning and functioning of the sensors followed. Participants were then instructed to enter the XIM and stand at the designated starting point in the center of the room. A schematic illustration of the experimental protocol is shown in **Figure 5**. A 5-min baseline recording phase followed, during which a black screen was displayed while participants were asked to maintain a natural, relaxed standing posture. After baseline acquisition, participants were told that they were free to walk and to assume a natural posture during the entire duration of the experiment. Subsequently, the first image was displayed on the frontal screen of the XIM. The order of presentation of the stimuli was randomized for each experimental session and subject. Each stimulus was displayed for 20 s and was followed by a "beep" sound to alert the user that the next trial could be started. To start the new trial, the participant was instructed to make a "grabbing" gesture with the hand wearing the sensing glove. This event was interpreted and recorded by the sensing platform to provide an accurate time annotation for each stimulus. A 20 s black screen was inserted between trials. All the subjects were video recorded using a video camera placed behind them on the XIM floor, and the recording was used as an additional source for movement annotation.

**Table 2 | Selection of the 12 stimuli from the International Affective Picture System (IAPS) database.**

*The arousal ratings are reported for each stimulus, along with the class assigned in the data subsets α, β, and γ.*

#### **3.2. DATA ANALYSIS**

A number of fixed time windows were used to segment the signals (EDR, HRV) in accordance with the experimental protocol. To compute each feature of the skin conductance's phasic component, the EDR signal was segmented into 5 s windows aligned to the onset of the visual stimulus. To compute the HRV features, we instead used longer windows of 20 s, corresponding to the entire duration of each visual stimulus.
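The two segmentation schemes can be sketched as follows, assuming a uniformly sampled EDR trace and a list of detected beat times for the ECG. The function names, sampling rate and onset values are hypothetical and serve only to illustrate the windowing.

```python
def extract_window(signal, fs, onset_s, length_s):
    """Samples of a uniformly sampled signal in [onset, onset + length)."""
    start = int(round(onset_s * fs))
    stop = start + int(round(length_s * fs))
    if start < 0 or stop > len(signal):
        raise ValueError("window exceeds the recording")
    return signal[start:stop]

def rr_in_window(beat_times_s, onset_s, length_s):
    """RR intervals between consecutive beats that fall inside the window."""
    inside = [t for t in beat_times_s if onset_s <= t < onset_s + length_s]
    return [b - a for a, b in zip(inside, inside[1:])]

# Illustrative data: 100 s of EDR at 10 Hz and one heartbeat per second.
edr = [0.0] * 1000
beats = list(range(100))
stimulus_onsets_s = [30, 70]  # made-up stimulus onsets

edr_windows = [extract_window(edr, 10, t, 5.0) for t in stimulus_onsets_s]
rr_windows = [rr_in_window(beats, t, 20.0) for t in stimulus_onsets_s]
```

The EDR features are then computed on each 5 s window and the HRV features on the RR intervals of each 20 s window.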

The extracted features (see **Table 1**) were divided into three subsets *α*, *β* and *γ* in accordance with the arousal ratings of the stimuli (see **Table 2**):


A statistical inference analysis was conducted by means of parametric or non-parametric tests, depending on the data distribution, to test the null hypothesis of no statistical difference between the classes for both the 2-class (datasets *α* and *β*) and the 3-class (dataset *γ*) problems. The significance level for all the tests was set to 0.05. A pattern recognition phase followed the statistical analysis to investigate whether the arousal content of the stimuli could be discriminated into 2 classes (*α* and *β*) and 3 classes (*γ*), considering the aforementioned subset of features.

#### **3.3. PATTERN RECOGNITION**

An inter-subject analysis was performed for all subjects and all extracted features. The subsets *α* and *β* represent a 2-class problem, while subset *γ* represents a 3-class problem. Taking into account the entire dataset of features, the dimensionality of the feature space was reduced through the application of Principal Component Analysis (PCA), retaining the number of principal components (PCs) sufficient to explain 90% of the total variance.
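The selection rule for the number of PCs can be illustrated with a short Python sketch. It assumes that the eigenvalues of the feature covariance matrix (i.e., the variances along each PC) have already been computed; the function name and the eigenvalues are made up for the example.

```python
def n_components_for_variance(eigenvalues, target=0.90):
    """Smallest number of leading PCs whose cumulative variance reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Made-up eigenvalue spectrum: cumulative ratios are 0.40, 0.70, 0.95, 1.00.
print(n_components_for_variance([4.0, 3.0, 2.5, 0.5]))  # prints 3
```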

We performed a classification phase to assign each sample of the dataset to one of the classes. Among the different classifiers tested, the Linear Discriminant Classifier (LDC) (Härdle and Simar, 2007) was the one that performed best in terms of accuracy and consistency in arousal discrimination. The performance of the classification process was examined through the confusion matrix, which expresses the capacity of the algorithm to recognize each sample as belonging to one of the predefined classes (a more diagonal confusion matrix corresponds to a higher classification accuracy). The validity of the classification model was evaluated through cross-validation. For each validation step, the classifier was trained on 80% of the feature samples, randomly extracted from the whole dataset (training set), and tested on the remaining 20% (test set). More specifically, we performed 40-fold cross-validation steps in order to obtain unbiased results. The final results were expressed as the mean and the standard deviation of the 40 computed confusion matrices.
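The validation scheme described above (repeated random 80/20 splits, with the confusion matrices averaged over the repetitions) can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in; it is not the LDC used in this work, and all names and data are hypothetical.

```python
import random
import statistics

def fit_centroids(X, y):
    """Stand-in classifier: one mean feature vector per class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {lab: [statistics.fmean(col) for col in zip(*rows)]
            for lab, rows in groups.items()}

def predict(centroids, row):
    """Label of the centroid closest to row (squared Euclidean distance)."""
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(row, centroids[lab])))

def repeated_holdout(X, y, labels, n_repeats=40, train_frac=0.8, seed=0):
    """Mean and SD of row-normalised confusion matrices (in %) over random splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    matrices = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        counts = {t: {p: 0 for p in labels} for t in labels}
        for i in test:
            counts[y[i]][predict(model, X[i])] += 1
        # A class absent from the test split yields a row of zeros.
        matrices.append({t: {p: 100.0 * counts[t][p] / max(1, sum(counts[t].values()))
                             for p in labels} for t in labels})
    mean = {t: {p: statistics.fmean([m[t][p] for m in matrices]) for p in labels}
            for t in labels}
    sd = {t: {p: statistics.stdev([m[t][p] for m in matrices]) for p in labels}
          for t in labels}
    return mean, sd
```

Each entry of `mean` is the average percentage of samples of the true class (row) assigned to the predicted class (column), which is the quantity reported in a confusion matrix of this kind. With small datasets, stratified splits would avoid test sets that miss a class entirely.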

#### **3.4. RESULTS**

We assessed the non-Gaussianity of the feature distributions using Kolmogorov–Smirnov tests with Lilliefors correction (*p <* 0.05). The *α* and the *β* datasets were submitted to a Mann–Whitney test to check for a difference between class *A*<sup>1</sup> (low arousal) and class *A*<sup>2</sup> (high arousal). The analysis of the *α* subset of features showed no statistically significant differences, whilst for the *β* subset LF was significantly higher in *A*<sup>2</sup> than in *A*<sup>1</sup> (*p <* 0.05; Mdn *A*<sup>1</sup> 1097; Mdn *A*<sup>2</sup> 1684).

We conducted a Kruskal–Wallis (KW) test among the three classes *A*<sup>1</sup> (low arousal), *A*<sup>2</sup> (medium arousal) and *A*<sup>3</sup> (high arousal) of the *γ* dataset. The obtained results showed a significant effect (*p <* 0.01). Mann–Whitney tests were used to follow up these findings. A Bonferroni correction was applied to compensate for multiple comparisons. The pairwise comparisons showed a significantly higher value of RMSSD for *A*<sup>3</sup> as opposed to *A*<sup>1</sup> (*p <* 0.05; Mdn *A*<sup>1</sup> 195.98; Mdn *A*<sup>3</sup> 218.67) and a significantly higher value of HF for *A*<sup>3</sup> as opposed to *A*<sup>1</sup> (*p <* 0.05; Mdn *A*<sup>1</sup> 528.22; Mdn *A*<sup>3</sup> 727.46).
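As a reminder of how the Bonferroni correction operates on follow-up tests, the sketch below compares each uncorrected *p*-value with the significance level divided by the number of comparisons. The *p*-values are made up and do not correspond to the results reported here; with three classes there are at most three pairwise comparisons.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """True for each test whose p-value is below alpha / number of tests."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Hypothetical uncorrected p-values for A1 vs A2, A1 vs A3 and A2 vs A3.
print(bonferroni_decisions([0.01, 0.02, 0.04]))  # prints [True, False, False]
```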

We verified the normality of the distribution of the mean HR values for each of the three datasets by means of a Kolmogorov–Smirnov test with Lilliefors correction (*p >* 0.05). We submitted the mean HR in the *α* and the *β* datasets to an Independent Samples *T*-test. The results showed no statistical differences (*p >* 0.05) between the two classes in either the *α* dataset (mean *A*<sup>1</sup> = 82.40, *SD* = ±16.5; mean *A*<sup>2</sup> = 80.31, *SD* = ±12.9) or the *β* dataset (mean *A*<sup>1</sup> = 82.17, *SD* = ±15.6; mean *A*<sup>2</sup> = 79.85, *SD* = ±13.4). To test for differences in mean HR between the 3 classes of the *γ* dataset, we conducted an ANOVA. No significant effect was found (mean *A*<sup>1</sup> = 82.88, *SD* = ±16.9; mean *A*<sup>2</sup> = 82.94, *SD* = ±15.4; mean *A*<sup>3</sup> = 79.93, *SD* = ±13.1). Using all of the physiological features extracted from HRV and SC, we then discriminated, by means of a pattern recognition stage (Section 3.3), the two levels of arousal for the *α* and *β* datasets and the three levels of arousal for the *γ* dataset. As a result of the pattern recognition phase, the LDC classifier achieved a high accuracy in both the 2-class and the 3-class problems (**Tables 3**, **4** respectively).

The multivariate analysis with LDC for the entire dataset achieved an accuracy between 73.3% and 88.9% in the 3-class problem (low, medium and high arousal), and exceeding 87% in the 2-class problem (low and high arousal).

#### **3.5. BODY MOVEMENTS AND SIGNALS ACQUISITION**

To exclude the possibility that the overall classification accuracy could be biased by body movements systematically associated with the arousal of the stimuli presented, we quantified the subjects' motor activity for each class of stimuli. To do so, we calculated the sum of the standard deviations over the three axes for both acceleration and orientation data within time windows equal to (or greater than) 20 s, which correspond to the presentation of the pictorial stimuli, thus obtaining two maximized activity indexes (Maximized Acceleration Index and Maximized Rotation Index). Using these indexes, we conducted an intra-subject analysis to compare the subjects' motor activity across the arousal classes identified in the 3 data subsets (see Section 3.3). We assessed the homogeneity of variance between the classes by conducting within-subject Levene tests. Our results showed no statistically significant differences across classes in the *α* and *β* subsets or in the *γ* subset (*p >* 0.05).

**Table 3 | Confusion matrix of the LDC classifier for the 2-class problem for** *α* **and** *β* **datasets.**

*The results were obtained after 40 cross-fold validations.*

**Table 4 | Confusion matrix of the LDC classifier for the 3-class problem for** *γ* **dataset.**

*The results were obtained after 40 cross-fold validations.*

Additionally, we conducted an inter-subject analysis. We computed the standard deviation of acceleration and orientation for all the subjects within the time window that corresponds to the stimulus exposure and compared the obtained values across the classes of the 3 data subsets. Using the Levene test for both activity indexes, we found no statistically significant differences across classes in the *α* and *β* subsets (*p >* 0.05) or in the *γ* subset (*p >* 0.05) (**Figure 6**).

The outcome of these two analyses indicates the homogeneity of variance in the activity indexes across the classes in the 3 data subsets, hence showing that the results obtained through the acquired psychophysiological signals were not due to artifacts produced by the subjects' motor activity.
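The basic quantity behind the activity indexes used in this section, the sum of the per-axis standard deviations within a stimulus window, can be sketched as follows. The function name and the samples are hypothetical, and any further step involved in obtaining the "maximized" indexes is not reproduced.

```python
import statistics

def activity_index(xyz_samples):
    """Sum of the standard deviations of the x, y and z axes within one window."""
    axes = list(zip(*xyz_samples))
    if len(axes) != 3:
        raise ValueError("expected a non-empty list of (x, y, z) samples")
    return sum(statistics.stdev(axis) for axis in axes)

# Made-up accelerometer samples: movement along x only.
window = [(0.0, 1.0, 5.0), (2.0, 1.0, 5.0), (4.0, 1.0, 5.0)]
print(activity_index(window))  # prints 2.0
```

One index value per stimulus window and per sensor (acceleration, orientation) would then be grouped by arousal class before testing for homogeneity of variance.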

### **4. STUDY 2**

The second study was designed to empirically validate the XIM infrastructure and its wearable sensors using a classical conditioning (CC) task. Classical conditioning has been extensively used to study autonomic responses in humans and other species due to its non-invasiveness and the relatively fast underlying learning processes (Fanselow and Poulos, 2005; Boucsein, 2012), allowing a direct comparison with results already present in the literature. In the CC paradigm, subjects learn to predict the occurrence of an aversive event (unconditioned stimulus or US, e.g., a mild electrical shock or a loud noise) from contextual cues (conditioned stimulus or CS, e.g., a tone or a light), which, after several presentations of the CS−US pairings, results in the expression of an anticipatory conditioned response (CR) (Pavlov, 1927; Rescorla, 1966; Dickinson and Mackintosh, 1978; Clark and Squire, 1998; Maren, 2001; Inderbitzin et al., 2010). In summary, our primary objective was to verify whether we can reliably extract signatures of EDR and ECG conditioning from the recordings of freely moving subjects in a VR scenario using virtual objects and IAPS pictures as conditioned and unconditioned stimuli, respectively. We expected to observe significantly stronger skin conductance responses to the CS events that were followed by a highly arousing US by the end of the acquisition protocol. We further expected to observe longer reaction latencies during the trials where the CSs were followed by an aversive US image. With regard to HRV, we expected to find a significant variation in the vagal control of the heart going from the acquisition to the extinction phase, along with a reduction of the global activity.

#### **4.1. SAMPLE**

A total of 11 voluntary subjects (7 females and 4 males, mean age = 27, *SD* = ±4.51) recruited from the University campus participated in the study. All participants completed and signed an informed consent form providing information about the motivation of the study, the procedures adopted and the storage policy for the data collected. Subjects were informed that they could leave the experiment at any moment if they did not feel comfortable with the experimental settings. The study was reviewed and approved by the local Ethical Committee.

#### **4.2. CONDITIONING PROCEDURE AND EXPERIMENTAL DESIGN**

Before starting the experimental session, participants were provided with task instructions and fitted with the sensors by the experimenter. Subjects were then left alone in the XIM for 5 min to relax and get acquainted with the sensors and the XIM environment. An interactive virtual scenario consisting of a realistic 3D model of an equipped living room designed with SketchUp (http://www.sketchup.com) and rendered using Unity3D (http://unity3d.com) was then projected on the frontal and the two lateral screens of the XIM (**Figure 7**).

The experimental protocol consisted of three different sessions in a within-subjects design: an acquisition and an extinction session that were followed by a self-assessment questionnaire to test for awareness of the contingencies between the stimuli presented. During each learning phase, participants were presented with a conditional visual stimulus (a photo camera, CS+, or a remote control, CS−, representing the reinforced and nonreinforced stimuli, respectively) that they had to collect from the virtual cabinet through a grabbing gesture. Once collected, a high or low arousal IAPS classified image (representing US+ and US− respectively), was displayed on the virtual TV screen. Based on the image segmentation obtained in the first study, we selected our stimuli from two subsets of images belonging to the negative (high arousal ratings) and neutral (low arousal ratings) categories of the IAPS database. In more detail, the acquisition phase consisted of 18 intermixed trials for each CS type presented in a random order (**Figure 8**). Each one of the 9 CS+ stimuli was followed by a high arousing image (US+, mean valence = 2.02, *SD* = ±1.44 and mean arousal = 6.95, *SD* = ±2.04), while the CS− stimuli were followed by a low arousing image (US−, mean valence = 5.06, *SD* = ±1.13 and mean arousal = 2.42, *SD* = ±1.65). Each trial followed a fixed guided sequence that drove the participant through the trial. USs were displayed for 15 s and during the acquisition phase they were followed by a black image for an interval of 17 ± 3 s. Following the acquisition phase, participants were left inside the room for 5 min to rest before starting the following session. The extinction phase consisted of 18 intermixed trials for each CS within the same context (**Figure 9**), always followed by a neutral image (US− trials only). 
At the end of the extinction phase, the participants were asked to fill in the self-assessment questionnaires; they were then assisted by the experimenter in removing the equipment and finally dismissed.

#### *4.2.1. Questionnaires*

To measure subjective affective reactions to the stimuli, we used both a computer interactive version of the Self-Assessment Manikin (SAM) (Bradley and Lang, 1994) and the Affective Slider. The latter is an alternative scale under development in our group that measures the same dimensions as the SAM questionnaire, but on a continuous scale (**Figure 10**). In addition, a second interactive questionnaire was delivered to each participant to assess the level of explicit awareness of the relation between the CSs and USs. Participants were shown a picture of the CS+ and asked to rate, on a Likert scale ranging from −5 (strong disagreement) to 5 (strong agreement), their level of awareness of any causal relationship between the CS and the following US.

[...] (remote control) were presented separately and in random order.

#### **4.3. PSYCHOPHYSIOLOGICAL MEASURES AND DATA ANALYSIS**

Electrodermal responses were acquired through the sensing glove, sampled at 100 Hz and stored for off-line analysis, whilst the ECG signals were acquired through the sensing shirt with a sampling frequency of 250 Hz. As already detailed in Section 2.2, ECG was used to extract HR and HRV related features.

The recorded EDR waveforms were visually inspected and the analysis of skin conductance components was performed using Ledalab (Benedek and Kaernbach, 2010b). Before being decomposed into its phasic and tonic components, the signal was low-pass filtered with a cutoff frequency of 2 Hz. For the event-related analysis, a number of fixed time windows were used to segment the signal according to the conditioning protocol and a set of phasic component features were extracted (see Section 2.3). A segment of the signal time-locked to each CS onset was used to derive a dependent measure of the cued response (separately for the CS+ and the CS−). In accordance with the EDR literature, for the analysis of cued responses we considered only those responses starting one second after the stimulus presentation.

The EDR and the grabbing latency data for each CS type were analyzed both in aggregated form, by dividing each session into early, middle and late blocks of three trials, and on a trial-by-trial basis. The Amplitude-Sum of SCRs in a time window of 3 s after the CS onset was considered (first interval response), and the values were normalized with respect to each subject's own maximum value for between-subjects comparison. Statistical analysis was performed with non-parametric tests (Wilcoxon rank-sum, Kruskal–Wallis and Friedman) given the non-Gaussianity of the distributions, as assessed by independent Kolmogorov–Smirnov tests with Lilliefors correction. When required, the statistics were corrected for multiple comparisons. The significance level *α* for all tests was set to 0.05.
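The normalization and the aggregation into blocks can be sketched as follows. The function names and the amplitude values are hypothetical; the sketch assumes nine trials per CS type, so that three blocks of three trials are obtained.

```python
import statistics

def normalise_by_subject_max(amp_sums):
    """Divide one subject's Amplitude-Sum values by that subject's maximum."""
    peak = max(amp_sums)
    if peak <= 0:
        return [0.0 for _ in amp_sums]  # no measurable response in any trial
    return [a / peak for a in amp_sums]

def block_means(values, block_size=3):
    """Average consecutive, non-overlapping blocks of block_size trials."""
    return [statistics.fmean(values[i:i + block_size])
            for i in range(0, len(values) - block_size + 1, block_size)]

# Made-up Amplitude-Sum values for the nine CS+ trials of one subject.
amp = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 2.0]
early, middle, late = block_means(normalise_by_subject_max(amp))
```

Normalizing within each subject places all values in the range 0 to 1, so that differences in overall skin conductance reactivity between subjects do not dominate the group comparison.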

HRV analysis was conducted in accordance with the recommendations of the Task Force on HRV (Camm et al., 1996). While for the event-related analysis of EDR we defined short time windows starting after the CS onset, to analyze HRV we extracted features using time windows of 30 s corresponding to the US and the black screen together. We refer to these time windows in the text as "CS+ trials" and "CS− trials," in accordance with the nature of the preceding CS. Four participants were excluded from the HRV analysis due to technical problems (i.e., signal degradation due to low battery charge) that occurred during the acquisition phase of the ECG signal (hence resulting in *N* = 7). Statistical analysis of HRV included non-parametric Wilcoxon and Friedman rank-based tests, due to the non-Gaussianity of the distributions, as assessed by Kolmogorov–Smirnov tests with Lilliefors correction. The significance level *α* for all the tests was set to 0.05.

#### **4.4. RESULTS**

#### *4.4.1. Time to grab*

We looked for any significant difference in the latencies to the grabbing of the virtual object. Reaction times to the CS+ (mean = 3.7238 s, *SD* = ±0.76) were significantly longer than the latencies for CS− (mean = 3.3998 s, *SD* = ±0.73) during acquisition (paired Wilcoxon ranksum test, *p <* 0.05), while no significant differences were found for the extinction phase (mean CS+ = 3.5640 s, *SD* = ±1.3; mean CS− = 3.5696 s, *SD* = ±1.1). A significant difference was also found for CS+ reaction times between acquisition and extinction sessions (paired Wilcoxon ranksum test, *p <* 0.05), with longer reaction times during the acquisition session.

#### *4.4.2. Self-assessment and awareness questionnaires*

The self-assessment ratings of arousal collected through the questionnaire were tested for normality with a Kolmogorov–Smirnov test with Lilliefors correction (*p >* 0.05), and a paired-samples *T*-test between the arousal ratings for CS+ and CS− was then conducted. We found a significant difference between the two stimuli: CS+ was rated as more arousing than the CS− (mean CS− = 3.23, *SD* = ±1.4; mean CS+ = 4.62, *SD* = ±2.7, *p <* 0.05) showing that the subjects differentiated between the two stimuli by the end of the experiment, thus demonstrating a trace of conditioned response being maintained after extinction. In addition, 95% of the participants explicitly reported that they strongly agreed with the statement "In the first level I noticed a relationship between the object in the cabinet and the image displayed on the TV" (mean rating = 4, *SD* = ±1.28), hence suggesting a causal contingency between CS+ and US+.

#### *4.4.3. EDR*

A statistically significant difference was found for the CS+ amplitudes between the second three (middle block) and the last three (late block) trials during acquisition, with larger amplitudes during the late block (KW, *p* = 0.047, corrected for multiple comparisons). A significant difference was also found for CS− amplitudes during extinction between the first three (early block) and the last three trials, with significantly smaller amplitudes in the last part of the extinction session (KW, *p* = 0.024, corrected for multiple comparisons). A close-to-significance difference (*p* = 0.06) was found for the CS+ between acquisition and extinction trials. The other comparisons did not differ significantly, but overall the EDR showed characteristic trend patterns in both the acquisition and extinction phases (**Figure 11**).

#### *4.4.4. HRV*

To analyze HRV, we performed a series of inter-subject comparisons for each one of the extracted features (see **Table 1**). We conducted a Friedman test between all the stimuli presentations in both acquisition and extinction phases, which consisted of 4 groups of data (i.e., all the CS+ trials and CS− trials in the 2 sessions). The results of the test indicate a significant difference between acquisition and extinction in Mean (*p <* 0.02), Median (*p <* 0.02) and RRmean (*p* = 0.01). These 3 features show a significantly higher mean value for CS+ trials than CS− trials during the acquisition phase and a significantly higher mean value of CS− trials as opposed to CS+ trials in the extinction phase (see **Table 5**).

An inter-subject comparison between CS+ trials and CS− trials in both the acquisition and extinction sessions was then conducted by means of Wilcoxon tests. No statistically significant results were found in the acquisition session, with the exception of a close-to-significance difference between CS+ trials and CS− trials for RR\_tri (*p* = 0.07). In the extinction phase, we found statistically significant differences in the first 4 consecutive CS+ trials between acquisition and extinction for RRmean, Mean and Median (*p <* 0.05).

**Table 5 | Mean values and SD of statistically significant features extracted from HRV (Mean, Median and RRmean) between CS+ trials and CS− trials.**

*For all these features, CS+ trials show a higher mean than CS− trials during acquisition, while this trend is inverted during the extinction phase.*

Finally, considering the first 4 consecutive CS+ trials in the extinction phase, a number of features showed significantly lower values in the last occurrence than in the first: RMSSD, HF and SD1 (*p <* 0.05). **Figure 12** shows that the rank values of RMSSD, HF and SD1 decreased from the first to the fourth CS+ trial. These features are indeed representative of the vagal control of the heart (Camm et al., 1996; Tulppo et al., 1996; Berntson et al., 2005); therefore, their concurrent decrease during the extinction phase represents a significant variation in parasympathetic heart activity.

# **5. DISCUSSION**

The role played by psychophysiological correlates of human affective states has been investigated for more than a century, dating back to James-Lange's theory of emotions (Cannon, 1927) in the late 19th century. Until recently, most of the studies that measured psychophysiological signals, such as EDR and ECG, were conducted under strictly controlled laboratory settings. However, due to the improvements in hardware portability, the last decade has witnessed an increasing interest in the real world domain.

The bottleneck of conventional laboratory settings lies in the artificial conditions in which experiments are conducted, because they may or may not induce genuine emotions. When studying anxiety disorders, for instance, an unfamiliar laboratory environment can generate apprehension and stress in the participants, thus interfering with the natural emotional phenomena investigated. Similarly, the study of stress in a social context, such as family or workplace, necessarily requires investigation under naturalistic conditions (Wilhelm and Grossman, 2010). However, the discrepancy between the laboratory and the real world can be considerably minimized when investigating other topics, such as stress or mental workload detection during driving. Healey and Picard (2005), for instance, measured physiological data during a real-world driving task and classified the driver's stress using a recognition algorithm, while Stuiver et al. (2014) used cardiovascular measures to detect drivers' mental workload. The act of driving a real car (while wearing sensors and being aware of taking part in an experiment) or performing the same task in a laboratory using virtual reality would probably lead to similar results, since the subjects' body motion is limited in both approaches.

Moreover, in some cases, the laboratory can present a number of advantages when compared to life-like conditions. The latter, for instance, often present the problem of loss of connectivity between the sensors and the recording devices, while in controlled laboratory settings (even though devices can still disconnect) the probability of data loss is greatly reduced. In addition, experiments that investigate affect recognition in life-like conditions are often aimed at measuring long-term components of psychophysiological signals, such as tonic activity in EDR (see, as an example, Healey et al., 2010), but can produce ambiguous results when investigating event-related states, since participants in the natural world perform other activities than merely experiencing emotions (e.g., reading, talking, etc.). Furthermore, in the laboratory it is possible to investigate reactivity to specific classes of emotion-eliciting stimuli, while in life-like conditions there can be avoidance of negative emotion-eliciting situations (Wilhelm and Grossman, 2010; Valenza et al., 2012c). Additionally, data labeling in life-like conditions relies almost exclusively on self-assessment. For an effective analysis, it is crucial to accurately annotate the data collected, since physical activity in ambulatory subjects can overwhelm the physiological effect of affective responses (Picard and Healey, 1997). A controlled laboratory environment, instead, provides all the means to easily annotate events along with the recorded signals with high accuracy and minimal delays and to determine a baseline for the acquired psychophysiological signals, and it does not necessarily require the participants' self-assessment.

While we acknowledge the importance of achieving ecologically valid conditions in order to get genuine insights in the field of emotion research, we do believe that the definition of ecological validity in the literature is often vague. Fahrenberg et al. (2007) support this viewpoint by presenting the laboratory and the field as alternatives that are not fundamentally opposed and by stressing the importance of removing this antithesis by developing new research strategies that can be validated in the laboratory while, at the same time, being close to daily life conditions (Fahrenberg et al., 2007). This is exactly why we built an immersive sensing infrastructure, the eXperience Induction Machine (XIM), which provides the unique opportunity to investigate human affective states in more ecologically valid environments than those offered by standard laboratory settings. By taking this hybrid approach, we are able to conduct ambulatory emotion research through the use of virtual reality and custom-made unobtrusive wearable technology suitable for the acquisition of psychophysiological signals without the constraints typical of standard laboratory settings, while, at the same time, ensuring that the subjects are exposed in a timely manner to systematically controlled stimuli according to the experimental design. This makes the XIM an ideal environment to conduct research on emotion-eliciting events and reactions, such as the two studies that we conducted to validate our infrastructure, which are discussed in this work.

In the first study, we exposed participants to a set of visual stimuli that covered the full range of arousal levels while they were free to walk around in the space and gesticulate, with the aim of observing different psychophysiological signatures related to the arousal of the stimuli. Using a classifier, we were successful in discriminating and predicting arousal classes of the stimuli presented by only measuring participants' ECG and EDR, with an accuracy between 73% and 88% in the 3-class problem, and exceeding 87% in the 2-class problem. Although we found clear classification results, the analysis of psychophysiological data did not show a full consistency between the signals and the classes of arousal. We found significant trends for some of the features extracted from HRV; however, these trends were not visible in all the data subsets. The *α* subset, for instance, did not present any trend. This result could be due to the fact that it comprised 5 stimuli per class (thus including almost the entire pool of stimuli), while classes in *β* and *γ* comprised 3 and 4 stimuli, respectively, hence covering just more extreme arousal levels. In the frequency domain, the *β* subset showed a significantly higher value for LF, while the *γ* subset resulted in significantly higher values for HF when associated with the presentation of highly arousing pictures as opposed to neutral images. In the time domain, we found significantly higher values of RMSSD in the *γ* subset for the high arousal class, whereas, contrary to our expectations, the analysis of the HR did not show any statistically significant result. Similar studies that address the effect of arousal on HR present heterogeneous results, in some cases even showing a decrease in HR following highly arousing stimuli. One explanation for these mixed outcomes can be the dominance of other cognitive processes (e.g., attention) during the experimental task (Brouwer et al., 2013).

Following the first study, we designed a second experiment using VR that goes beyond the standard laboratory setup used in the well-known classical conditioning paradigm. Previous researchers have, in fact, investigated conditioning in VR using psychophysiological measures (Grillon et al., 2006; Huff et al., 2011; Greville et al., 2013), albeit with one caveat: in all those studies participants were constrained to a chair and used devices such as joysticks to move in the virtual space, while keeping the non-dominant hand (where the EDR electrodes were mounted) still for the entire duration of the experimental session in order to minimize signal artifacts. This is precisely what we wanted to avoid in our experiment: by providing an ecological form of interaction and using our wearable technology, we tackled the challenge of acquiring meaningful psychophysiological recordings related to emotion-eliciting events in an ambulatory context. Additionally, we used the recorded motion events (e.g., hand grabbing) not only to isolate motion-related artifacts, but also to further support the results obtained through physiological measures (i.e., grabbing latencies and reaction time). Consistent with our hypothesis, we found an anticipation of the US manifested as longer response times to the CS+. These results are in line with other studies on classical and context conditioning in VR. Dawson et al. (1982) observed longer reaction times to a stimulus during the CS−US interval only for the CS+, which they interpreted as more attentional resources being allocated to the CS+. In other studies the presentation of CS+ led to behavioral avoidance of certain locations (Grillon et al., 2006) or resulted in a negative performance in an interactive task requiring precise motor control (Greville et al., 2013). The results of the HRV analysis suggest that learning took place and was detected through psychophysiological measures. 
The values in time of Mean, Median and RRmean for CS+ trials as opposed to CS− trials inverted their trend from the acquisition to the extinction phase. Additionally, our findings show statistical changes in the activity related to the vagal control of the heart for time-dependent features such as RMSSD, which is associated with short-term, rapid changes in heart rate, and is correlated with vagus-mediated components of HRV (Malik et al., 1996). From the analysis of the EDR, we observed significantly stronger skin conductance responses following the presentation of CS+. The significant difference in CS+ during the acquisition phase between the middle and the late block confirms the expected outcome and is consistent with previous research on conditioning (Grings and Dawson, 1973; Öhman and Bohlin, 1973; Prokasy and Kumpfer, 1973). During the extinction phase, we also found a significant reduction in the amplitudes of the responses related to CS− toward the end of the session. This result can be explained by the fact that the acquisition protocol followed in previous studies normally did not include the presentation of any US following the CS−, while in our experimental design neutral IAPS images were used as US− in order to ensure the ecological setting we designed (i.e., the virtual living room where the TV screen always displays an image after the subject collects the CS from the cabinet). The close-to-significant difference found for CS+ between acquisition and extinction suggests that a complete extinction probably did not occur. This interpretation is supported by the results of the self-assessment questionnaire administered at the end of the experiment, where the subjects reported higher arousal levels associated with CS+. To be more effective, experiments on conditioning conventionally adopt strong USs, such as electrical shocks, loud sounds and bright lights, and evolutionary fear-relevant images as CSs. 
In our data we did observe characteristic trial-by-trial trends in the amplitude of SCRs for both the acquisition and extinction phases; however, these trends were not statistically significant. A possible interpretation of this outcome lies in the nature of the stimuli used in our experimental design (i.e., IAPS pictures with negative arousal and common objects as conditional stimuli), which were chosen for the sake of ecological validity, whereas the adoption of fearful images would have produced stronger emotional responses than negative images (Courtney et al., 2010).
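As an illustration of the time-domain HRV features mentioned above, the following minimal Python sketch computes RRmean and RMSSD from a series of RR intervals. It assumes the standard definition of RMSSD (root mean square of successive RR differences); the function names and the sample RR values are hypothetical and are not taken from the analysis pipeline used in this study.

```python
import math


def rr_mean(rr_ms):
    """Mean RR interval in milliseconds."""
    return sum(rr_ms) / len(rr_ms)


def rmssd(rr_ms):
    """Root mean square of successive differences between adjacent RR intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("RMSSD needs at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals (ms), for illustration only.
rr_example = [800, 810, 790, 805, 795]
print(rr_mean(rr_example), rmssd(rr_example))
```

Because RMSSD depends only on beat-to-beat differences, it is insensitive to slow drifts in heart rate, which is why it is commonly read as an index of short-term, vagally mediated variability.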

The results reported here are in line with previous research. Nevertheless, some of these findings also reflect the complexity in obtaining a homogeneous interpretation of psychophysiological signals across studies. One example we observed, in contrast to our expectations, is the similarity of the mean values of HR obtained when the subjects were exposed to stimuli conveying different arousal content. As a matter of fact, the uniform inference of psychophysiological correlates of emotions in life-like settings still constitutes a challenge in the field. This is mainly due to the concurrence of multiple cognitive processes that modulate both sympathetic and parasympathetic activity, and that are difficult to isolate, especially under ecologically valid conditions (i.e., short time windows, artifacts, etc.). Along with future improvements in hardware technology, one further step to tackle this issue is the addition of more (direct and indirect) physiological measurements, such as an eye-tracker, that can be used to measure attention and estimate mental workload.

# **AUTHOR CONTRIBUTIONS**

Alberto Betella, Ryszard Cetnarski, Riccardo Zucca, Paul F. M. J. Verschure: experimental design. Alberto Betella, Ryszard Cetnarski, Riccardo Zucca: data collection, data pre-processing. Alberto Betella, Riccardo Zucca: data analysis (accelerometers, IMU, video, questionnaires). Riccardo Zucca, Alberto Greco, Antonio Lanata: EDR analysis. Antonio Lanata: HRV analysis. Daniele Mazzei: Sensors and devices control software development and integration. Alessandro Tognetti: Design and development of wearable sensors and devices; algorithms for hand gesture tracking. Xerxes D. Arsiwalla: critical revision of the work. Pedro Omedas: integration of the sensors in XIM. Danilo De Rossi, Paul F. M. J. Verschure: supervision and final approval of the work. Alberto Betella, and Riccardo Zucca, contributed equally to this work. All the authors contributed to the draft and following revisions of the manuscript.

# **FUNDING**

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7- ICT-2009-5) under grant agreement n. 258749 [CEEDS]. The Generalitat de Catalunya (CUR, DIUE) and the European Social Fund are supporting this research.

# **ACKNOWLEDGMENTS**

Thanks to Gaetano Valenza (Centro Piaggio, UNIPI) for helping with data pre-processing and to Ivan Herreros-Alonso (SPECS, UPF) for his critical comments on the manuscript.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 18 February 2014; accepted: 22 August 2014; published online: 24 September 2014.*

*Citation: Betella A, Zucca R, Cetnarski R, Greco A, Lanatà A, Mazzei D, Tognetti A, Arsiwalla XD, Omedas P, De Rossi D and Verschure PFMJ (2014) Inference of human affective states from psychophysiological measurements extracted under ecologically valid conditions. Front. Neurosci. 8:286. doi: 10.3389/fnins.2014.00286*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Betella, Zucca, Cetnarski, Greco, Lanatà, Mazzei, Tognetti, Arsiwalla, Omedas, De Rossi and Verschure. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Developing an EEG-based on-line closed-loop lapse detection and mitigation system

# *Yu-Te Wang<sup>1,2</sup>, Kuan-Chih Huang<sup>3</sup>, Chun-Shu Wei<sup>2,4</sup>, Teng-Yi Huang<sup>3</sup>, Li-Wei Ko<sup>5</sup>, Chin-Teng Lin<sup>3</sup>, Chung-Kuan Cheng<sup>1,6</sup> and Tzyy-Ping Jung<sup>2,4,6</sup>\**

*<sup>1</sup> Department of Computer Science and Engineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA*

*<sup>2</sup> Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla, CA, USA*

*<sup>3</sup> Department of Electrical Engineering, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>4</sup> Department of Bioengineering, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, USA*

*<sup>5</sup> Department of Biological Science and Technology, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>6</sup> Center for Advanced Neurological Engineering, Institute of Engineering in Medicine, University of California San Diego, La Jolla, CA, USA*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Klas Ihme, German Aerospace Center, Germany; Johanna Wagner, Graz University of Technology, Austria*

#### *\*Correspondence:*

*Tzyy-Ping Jung, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla, CA 92092, USA e-mail: jung@sccn.ucsd.edu*

In America, 60% of adults reported that they have driven a motor vehicle while feeling drowsy, and at least 15–20% of fatal car accidents are fatigue-related. This study translates previous laboratory-oriented neurophysiological research to design, develop, and test an On-line Closed-loop Lapse Detection and Mitigation (OCLDM) System featuring a mobile wireless dry-sensor EEG headgear and a cell-phone based real-time EEG processing platform. Eleven subjects participated in an event-related lane-keeping task, in which they were instructed to manipulate a randomly deviated, fixed-speed cruising car on a 4-lane highway. This was simulated in a 1st person view with an 8-screen and 8-projector immersive virtual-reality environment. When the subjects experienced lapses or failed to respond to events during the experiment, auditory warning was delivered to rectify the performance decrements. However, the arousing auditory signals were not always effective. The EEG spectra exhibited statistically significant differences between effective and ineffective arousing signals, suggesting that EEG spectra could be used as a countermeasure of the efficacy of arousing signals. In this on-line pilot study, the proposed OCLDM System was able to continuously detect EEG signatures of fatigue, deliver arousing warning to subjects suffering momentary cognitive lapses, and assess the efficacy of the warning in near real-time to rectify cognitive lapses. The on-line testing results of the OCLDM System validated the efficacy of the arousing signals in improving subjects' response times to the subsequent lane-departure events. This study may lead to a practical on-line lapse detection and mitigation system in real-world environments.

**Keywords: electroencephalogram (EEG), drowsiness, fatigue, driving, smartphone, cell-phone, brain computer interface (BCI)**

# **INTRODUCTION**

Fatigue-related performance decrements such as lapses in attention and slowed reaction time could lead to catastrophic incidents in occupations ranging from ship navigators to airplane pilots, railroad engineers, truck and auto drivers, and nuclear plant monitors. Fatigue (or drowsiness) "concerns the inability or disinclination to continue an activity, generally because the activity has been going on for too long," as defined by the European Transport Safety Council (Croo et al., 2001). Sixty percent of American adults reported that they have driven a motor vehicle while feeling drowsy (National Sleep Foundation, 2005). Furthermore, studies have concluded that at least 15–20% of fatal car accidents are fatigue-related (The Royal Society for the Prevention of Accidents, 2001; Connor et al., 2002; National Highway Traffic Safety Administration, 2003). Therefore, early detection of driving fatigue is crucial for preventing catastrophic incidents.

In order to detect driving fatigue, several approaches have been proposed in the scientific literature. (1) Computer vision-based systems (Bergasa et al., 2006; D'Orazio et al., 2007; Golz et al., 2010). Bergasa et al. (2006) used a real-time image-acquisition system to monitor drivers' visual behaviors that revealed a driver's alertness level. Six parameters (percentage of eye closure, eye closure duration, blink frequency, nodding frequency, face position, and fixed gaze) were included in a fuzzy classifier for identifying a driver's vigilance level. D'Orazio et al. (2007) proposed a neural classifier to recognize the eye activities from images without being constrained by head rotation or partially occluded eyes. (2) Driving behavior counter-measurements (Lin et al., 2009, 2010a, 2013). Lin et al. (2009) performed an event-related, lane-keeping driving task in an immersive virtual-reality environment. Subjects were asked to steer the simulated car back to the middle of the cruising lane once they perceived the randomized lane-departure events. The results showed that the reaction time (RT), defined as the time interval between the onset of the simulated car deviation and the user response, could be improved by providing arousing auditory warning to the subjects combating fatigue.

A Brain Computer Interface (BCI) translates neural activities into control signals to provide a direct communication pathway between the human brain and an external device (Wolpaw et al., 2002). Broadly speaking, BCIs can be grouped into three categories: active, passive and reactive BCIs (Zander and Kothe, 2011). Electroencephalogram (EEG)-based passive BCIs measure brain electrical activities from the scalp and enrich a human–machine interaction with implicit information on the actual user state without conscious effort from the user (Lehne et al., 2009; Zander et al., 2009, 2010; Zander and Jatzev, 2012). Given appropriate signal-processing algorithms in the passive BCIs, meaningful information can be directly extracted from the EEGs. For instance, time-domain analyses such as averaging across different channels, moving average with a specific window length, standard deviation, linear correlation and so on are useful approaches to extract information from EEGs (Dong et al., 2011). In a frequency-domain analysis, the short-time Fourier transform (STFT) is often applied to the EEG data to estimate the power spectral density in distinct frequency bands, including delta (1–3 Hz), theta (4–7 Hz), alpha (8–13 Hz), beta (14–30 Hz), and gamma (31–50 Hz). Many studies have shown that the brain dynamics linked to fatigue and behavioral lapses can be assessed by EEG power spectra (Kecklund and Akerstedt, 1993; Makeig and Inlow, 1993; Jung and Makeig, 1995; Makeig and Jung, 1996; Jung et al., 1997; Lal and Craig, 2002; Campagne et al., 2004; Horne and Baulk, 2004; Debener et al., 2005; Peiris et al., 2006; Davidson et al., 2007; Eichele et al., 2008, 2010; Golz et al., 2010; Lin et al., 2010a), combinations of EEG band power (Jap et al., 2011), alpha spindle parameters (Simon et al., 2011) and autoregressive features (Rosipal et al., 2007). These studies provided solid evidence for the neurophysiological correlates of fatigue and behavioral lapses. In short, while the physical- and behavioral symptom-based methods indirectly measure drivers' cognitive states, the neurophysiology-based methods offer a more direct path to assess the brain dynamics linked to fatigue and behavioral lapses with a high temporal resolution.
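The frequency-domain analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the processing used in the cited studies: it applies a Hann window and a naive discrete Fourier transform to successive overlapping segments (a short-time analysis) and sums the un-normalized spectral power inside each band listed in the text. The window length, step size, sampling rate, function names and the synthetic 10 Hz test signal are all assumptions made for the example.

```python
import cmath
import math

# Frequency bands (Hz) as listed in the text.
BANDS = {
    "delta": (1, 3),
    "theta": (4, 7),
    "alpha": (8, 13),
    "beta": (14, 30),
    "gamma": (31, 50),
}


def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def window_spectrum(samples, fs):
    """Return (freqs, power) for one Hann-windowed segment using a naive DFT."""
    n = len(samples)
    x = [s * w for s, w in zip(samples, hann(n))]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(acc) ** 2 / n)
    return freqs, power


def band_powers(samples, fs):
    """Sum the spectral power of one segment inside each frequency band."""
    freqs, power = window_spectrum(samples, fs)
    return {
        name: sum(p for f, p in zip(freqs, power) if lo <= f <= hi)
        for name, (lo, hi) in BANDS.items()
    }


def sliding_band_powers(signal, fs, win_s=1.0, step_s=0.5):
    """Short-time analysis: band powers for successive overlapping windows."""
    win, step = int(win_s * fs), int(step_s * fs)
    return [
        band_powers(signal[i:i + win], fs)
        for i in range(0, len(signal) - win + 1, step)
    ]


# Synthetic 2 s, 10 Hz sine sampled at an assumed 128 Hz (illustration only).
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
frames = sliding_band_powers(signal, fs)
print([max(frame, key=frame.get) for frame in frames])
```

A production pipeline would normally use an FFT routine and a properly scaled power spectral density estimate; the naive DFT is used here only to keep the sketch dependency-free.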

Efforts have also been made to assist individuals in combating fatigue and/or preventing lapses in concentration. For instance, Dingus et al. (1997) and Spence and Driver (1998) proposed using warning signals to maintain drivers' attention. The types of warning signals could be auditory (Spence and Driver, 1998; Lin et al., 2009), visual (Liu, 2001), tactile (Ho et al., 2005) or mixed (Liu, 2001). Empirical results showed that auditory warning could reduce the number of lapses in sustained-attention tasks (Spence and Driver, 1998), and could help subjects to maintain driving performance (Lin et al., 2009). More recent studies demonstrated that arousing auditory signals presented to individuals experiencing momentary behavioral lapses could not only agitate their behavioral responses but also change their EEG theta and alpha power in a sustained-attention driving task (Jung et al., 2010; Lin et al., 2010a, 2013). However, the studies also showed that sometimes subjects did not respond to the arousing signals, and more importantly the EEG activity of these non-responsive episodes showed little or no changes following the ineffective warning (Jung et al., 2010; Lin et al., 2010a, 2013). Lin et al. (2013) later demonstrated the feasibility of using the post-warning EEG power spectra to predict the (in)efficacy of the arousing warning. A caveat of their studies was that the arousing warning was delivered to subjects after they behaviorally failed to respond to lane-departure events. In reality, the delivery of the arousing warning could have been too late because the behavioral lapse might have led to catastrophic consequences. A truly EEG-based lapse monitoring system needs to continuously and non-invasively observe EEG dynamics to predict fatigue-related lapses, deliver an arousing signal to arouse the user, and assess the efficacy of the arousing signal to trigger a repeated or secondary warning signal if necessary. 
Furthermore, all of the aforementioned studies were conducted with traditional bulky and tethered EEG systems and were performed in well-controlled laboratories. However, it is argued that there might be fundamentally dynamic differences between laboratory-based and naturalistic human behavior in the brain (McDowell et al., 2013). It thus remains unclear how well the current laboratory-oriented knowledge of EEG correlates of cognitive-state changes can be translated into the highly dynamic real world.

This study aims to extend previous studies to design, develop and test a truly On-line Closed-loop Lapse Detection and Mitigation (OCLDM) System that can continuously monitor EEG dynamics, predict fatigue-related lapses based on EEG signals, arouse the fatigued users by delivering arousing signals, and assess the efficacy of the arousing signal based on EEG spectra. This study hypothesized that (1) EEG spectral values would differ under different arousal states; (2) it is feasible to predict lapses based on the spectral changes in the spontaneous EEG; (3) arousing warning delivered to cognitively challenged subjects would mitigate cognitive lapses; and (4) the rectified performance would be accompanied by changes in EEG power spectra. This study conducted an off-line experiment to explore the neurophysiological correlates of lapses, which tested the abovementioned hypotheses and guided the development of a truly OCLDM system. The system was then validated by an on-line driving experiment. Furthermore, to be practical for routine use in a car or workplace by freely moving individuals, the EEG-based lapse monitoring system must be non-invasive, non-intrusive, lightweight, battery-powered, and easy to put on and take off (Lin et al., 2008a,b). This study thus also investigates the feasibility of using a practical, low-density, lightweight dry EEG headgear and a smartphone-based EEG-processing platform (Wang et al., 2012) to build a truly mobile and wireless OCLDM System for real-life applications.

# **MATERIALS AND METHODS**

#### **SUBJECTS**

Eleven healthy and naive subjects (10 males and one female) with normal hearing, aged 20–28 years, participated in this study. All of them were free of neurological and psychological disorders. They were shown how to operate the simulated car and practiced for ∼10 min to become familiar with it before the experiment started. None of them had worked night shifts or traveled across multiple time zones in the previous 2 months. All participants were asked to read and sign the informed consent form before participating in the studies. After the experiments, subjects were asked to complete a questionnaire assessing their cognitive states during the experiments.

#### **EXPERIMENTAL EQUIPMENT**

Experiments of this study were conducted in an 8-screen and 8-projector immersive virtual-reality (VR) environment that simulates the 1st person view scene of highway driving. This study adapted an event-related lane-departure driving paradigm originally proposed by Huang et al. (2006, 2007a,b, 2009) that allowed objective and quantitative measures of momentary event-related brain dynamics following lane-departure events and driving-performance fluctuations over longer periods. The VR scenes simulated driving at a constant speed (at ∼100 km/h) on a highway with the simulated car randomly drifting away from the center of the cruising lane to simulate driving on non-ideal road surfaces or with poor alignment (Huang et al., 2007a,b, 2009; Lin et al., 2008b, 2009). The scene was updated according to the lane-departure events and the subject's manipulation. The vehicle trajectory, user's input, and lane-departure events could be accurately logged and time-synchronized to the EEG recordings (Huang et al., 2007a,b, 2009; Lin et al., 2008b). No traffic or distracting objects appeared in the VR scene other than the 4-lane road and a dark sky while the simulated car was cruising on the highway.

Thirty-two channel EEG data were collected from participants by the NuAmp system (32-channels Quick-Cap, Compumedics Ltd., VIC, Australia). The electrodes were placed according to a modified international 10–20 system with a unipolar reference at the right earlobe. The EEG activities were recorded with 500 Hz sampling rate and 16-bit quantization level.

#### **EXPERIMENTAL PARADIGM**

**Figure 1A** shows the experimental paradigm of this study. The simulated car started cruising at a fixed speed (∼100 km/h) in the 3rd lane and drifted to either the right or the left with equal probability within 8–10 s. Subjects were instructed to steer the simulated car back to the 3rd lane as soon as they noticed the lane drift. The simulated car kept cruising in the rightmost (or leftmost) lane if the subjects failed to respond to the lane drift. The baseline period of each lane-departure epoch is defined as the 3 s before the onset of a lane-drifting event. The empty circle in **Figure 1A** represents the unexpected lane-departure events marked as the "deviation onset." After the deviation onset, subjects were instructed to steer the simulated car back to the center of the cruising lane immediately (double circle), and the time when the subjects started steering was marked as the "response onset." The moment that the simulated car reached the center of the cruising lane (circle with cross) was marked as the "response offset." A subject's response time (RT) was defined as the time between the deviation onset and the response onset. During the first 5 min of the experiment, subjects were asked to be fully alert, as verified by the vehicle trajectory and the video from a surveillance camera, to obtain an averaged alert RT (aRT) for each subject (1.51∼2.54 s), which served as a threshold for the entire experiment. The entire experiment consisted of 5 min training and 85 min driving periods.

**FIGURE 1 |** The empty circle represents the deviation onset. The double circle represents the response onset. The circle with the cross represents the response offset. The baseline is defined as the 3 s period prior to deviation onset. The response time (RT) of a driver is the interval from the deviation (empty circle) to the response onset (the double circle). A trial starts at deviation onset and ends at [...] tasks. The height of an arrow represents the response time in a single trial. The warning was delivered to the subject when the RT in the trial exceeded three times the mean RT of trials in the first 5 min of the task, when the subject was presumably alert and fully attended to lane-departure events. This figure is adapted with permission from Figure 1 of Lin et al. (2010a).

**Figure 1B** shows the criterion for delivering auditory warnings in the experiment. When a subject failed to respond within three times the aRT, the system treated the trial as a behavioral lapse and triggered a 1750 Hz tone-burst to arouse the subject from the fatigue-related lapse in half (50%) of these drowsy trials (marked as the "current trial (CT)" in **Figure 1A**). The very next trial is defined as CT + 1, and so on. The lapse trials that were randomly selected to receive arousing warning were referred to as CT with warning, whereas the remaining half of trials that did not receive auditory warning were referred to as CT without warning. Note that our previous studies showed that in some trials subjects remained non-responsive following the arousing warning, which was analogous to sleeping through an alarm clock (Jung et al., 2010; Lin et al., 2013). If the RT of the following trial (CT + 1) was shorter than twice the aRT, the warning signal delivered in the CT trial was defined as an "effective warning." On the other hand, if the RT of the CT + 1 trial was longer than three times the aRT, the warning was defined as an "ineffective warning." This study did not include the trials with RTs between two and three times the aRT to define the alert vs. fatigue spectral thresholds because the cognitive states of the subjects during those trials were unclear. Note that subjects were not informed about the warning before the experiments.
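The lapse criterion and the effective/ineffective labeling described above can be written as a short Python sketch. The thresholds (three times the aRT for a lapse; below twice the aRT for an effective warning and above three times the aRT for an ineffective one, judged on trial CT + 1) follow the text; the function names and the example RT values are hypothetical and serve only as an illustration.

```python
from statistics import mean


def alert_rt(alert_rts):
    """Mean RT over the initial alert period (aRT), in seconds."""
    return mean(alert_rts)


def is_lapse(rt, art):
    """A trial counts as a behavioral lapse when its RT exceeds three times the aRT."""
    return rt > 3 * art


def warning_outcome(rt_next, art):
    """Label a warning by the RT of the following trial (CT + 1).

    Returns "effective" if RT < 2 * aRT, "ineffective" if RT > 3 * aRT,
    and None for the intermediate range, which the study excluded.
    """
    if rt_next < 2 * art:
        return "effective"
    if rt_next > 3 * art:
        return "ineffective"
    return None


# Hypothetical RTs (s) from the first 5 min of a session, for illustration only.
art = alert_rt([0.5, 0.75, 1.0])
print(art, is_lapse(2.5, art), warning_outcome(1.0, art))
```

Returning `None` for the intermediate range keeps the ambiguous trials out of both classes, mirroring the exclusion of trials with RTs between two and three times the aRT.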

#### **DATA ANALYSIS**

The 32-channel EEG data were first down-sampled to 250 Hz, and a low-pass filter of 50 Hz and a high-pass filter of 0.5 Hz were applied. Channels or trials with severe artifacts (such as body movements or muscle activities) were manually removed (less than three channels and 20% trials per subject in general). The remaining EEG data were segmented into 115 s trials, each of them consisting of 15 s before and 100 s after the lane-deviation onsets. Independent Component Analysis (ICA, Bell and Sejnowski, 1995; Makeig et al., 1997) implemented in EEGLAB (Delorme and Makeig, 2004) was then applied to decompose the ∼32-channel EEG into ∼32 independent components (ICs), based on the assumption that the collected EEG data from the scalp were a weighted linear mixture of electrical potentials projected instantaneously from temporally independent components accounting for distinct brain sources. The comparable ICs across subjects were grouped into component clusters based on their scalp maps, equivalent dipole locations and baseline power spectra of component activations (Jung et al., 2001; Delorme and Makeig, 2004). Across 11 subjects, there were 155 trials with warning (30 trials were ineffective and 125 trials were effective) and 192 trials without warning.
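The segmentation step can be illustrated with a small single-channel Python sketch. It assumes a 500 Hz recording reduced to 250 Hz by keeping every second sample, and cuts 115 s epochs spanning 15 s before to 100 s after each deviation onset. The function names, the onset times and the dummy recording are hypothetical; filtering, artifact rejection and ICA are not shown, and a real pipeline would apply an anti-aliasing filter before decimating.

```python
def downsample(samples, factor):
    """Keep every `factor`-th sample (plain decimation; anti-alias filtering not shown)."""
    return samples[::factor]


def epoch_bounds(onset_s, fs, pre_s=15.0, post_s=100.0):
    """Start and end sample indices of an epoch around an onset given in seconds."""
    onset = int(round(onset_s * fs))
    return onset - int(pre_s * fs), onset + int(post_s * fs)


def extract_epochs(samples, onsets_s, fs, pre_s=15.0, post_s=100.0):
    """Cut one epoch per onset; skip onsets whose epoch would run off the recording."""
    epochs = []
    for onset_s in onsets_s:
        start, end = epoch_bounds(onset_s, fs, pre_s, post_s)
        if start >= 0 and end <= len(samples):
            epochs.append(samples[start:end])
    return epochs


# Dummy 300 s single-channel recording at 500 Hz (sample index used as value).
raw = list(range(300 * 500))
data = downsample(raw, 2)  # 500 Hz -> 250 Hz
fs = 250
# Hypothetical deviation onsets (s); the first and last fall too close to the edges.
epochs = extract_epochs(data, [10.0, 50.0, 150.0, 250.0], fs)
print(len(epochs), len(epochs[0]))
```

Since lane departures occur every few seconds while each epoch spans 115 s, neighboring epochs overlap heavily; the sketch makes no attempt to avoid this.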

Since the RT and EEG power were not normally distributed, non-parametric statistical tests were performed for the data analysis (Delorme and Makeig, 2004). The Wilcoxon rank-sum test (MATLAB Statistics Toolbox, MathWorks) was used to assess the effects of warning on RTs. Bootstrapping (EEGLAB toolbox, University of California, San Diego) was used to test the statistical significance of EEG power changes at specific frequency bins from 2 to 30 Hz with a 0.25 Hz resolution. For group statistics, the intrinsic inter-subject RT differences were reduced by dividing RTs by the mean RT. The EEG spectra were normalized by dividing the spectral power by the standard deviation of the spectral distribution.
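The two normalization steps can be sketched as below. The text does not specify whether the mean RT is computed per subject or which standard-deviation estimator was used; this sketch assumes a per-subject mean and the sample standard deviation, and the function names are hypothetical.

```python
from statistics import mean, stdev


def normalize_rts(rts):
    """Divide one subject's RTs by that subject's mean RT (assumed per-subject)."""
    m = mean(rts)
    return [rt / m for rt in rts]


def normalize_spectrum(power):
    """Divide spectral power values by their sample standard deviation (assumed)."""
    sd = stdev(power)
    return [p / sd for p in power]
```

After normalization, each subject's RTs have a mean of 1 and each spectrum has unit standard deviation, so that trials from different subjects can be pooled on a common scale.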

# **RESULTS: NEUROPHYSIOLOGICAL CORRELATES OF BEHAVIORAL LAPSES**

#### **EFFICACY OF AROUSING AUDITORY SIGNALS FOR RECTIFYING LAPSES**

This study first explored the efficacy of the arousing auditory signals by measuring the change in subjects' reaction time. **Figure 2A** shows the boxplots of RTs of three trial groups: Alertness, CT, and CT + 1 (left to right). The averaged aRT of trials within the Alertness group across 11 subjects was ∼676 ms. The RTs of the CT group with arousing warning (red and light blue) were statistically significantly shorter than those of trials that did not receive an arousing warning (dark blue). The RTs of the CT + 1 group with effective vs. ineffective warning differed, while the RTs of the preceding group (CT) were comparable. Even though the subjects responded to the arousing warning by immediately steering the simulated car back to the cruising position, they could well be totally non-responsive to the very next lane-departure event (∼10 s later). In other words, the arousing signals reliably rectified human behavioral lapses, but did not guarantee that subjects were fully awake, alert, or attentive. This is analogous to snoozing after an alarm is turned off.

**FIGURE 2 | Experiment results. (A)** The boxplot for the RT distribution of trials with effective warning, ineffective warning, and without warning among CTs and CTs + 1. Note that the middle horizontal line is the median of the distribution, the top and bottom of the rectangle are the third and first quartiles, and the ends of the dashed lines are the maximum and minimum after outlier removal. **(B)** The component spectra of the alert CTs (black curve), with an effective warning (red curve), with an ineffective warning (light blue curve), and without warning (dark blue curve). The red, light blue, and blue horizontal lines mark the spectral differences between the alert trials and trials with an effective warning, with an ineffective warning, and without warning, respectively. All the spectral plots were calculated from the activity of the bilateral occipital components separated by ICA.

#### **EEG DYNAMICS PRECEDING BEHAVIORAL LAPSES**

**Figure 2B** shows the mean scalp map of the bilateral occipital cluster (upper-right corner) and its component baseline power of drowsy trials without auditory warning (dark blue), with either effective (red) or ineffective warning (light blue). First, among the resultant ICA clusters, bilateral occipital components exhibited statistically significant spectral differences between trials with and without auditory warning. Second, the component power spectra exhibited tonic increases in theta (4–7 Hz), alpha (8–12 Hz), and beta (13–30 Hz) bands in drowsy trials (red, dark blue, and light blue), compared to the alert trials (black). Horizontal lines mark the frequency bins under which the spectral differences between alert trials and drowsy trials with either (in)effective warning, or without warning were statistically significant (alpha = 0.05, Bonferroni adjusted *p*-value of 0.05/(112 frequency bins) = 0.0004 for multiple comparisons). Note that the spectra shown here were calculated from the component activities prior to the lane-deviation onset. The nearly identical pre-lapse spectra of these three groups of non-responsive trials demonstrate the robustness of the broadband spectral augmentation preceding the behavioral lapses, suggesting the feasibility of using theta and alpha power from the lateral occipital areas to predict behavioral lapses in this sustained-attention driving task.
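The Bonferroni adjustment quoted above can be reproduced with a few lines. This is only an illustrative sketch with hypothetical function names; the significance level (0.05) and the number of frequency bins (112) are taken from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def significant_bins(p_values, alpha=0.05):
    """Indices of the frequency bins whose p-value survives the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]
```

For 112 bins, the per-bin threshold is 0.05/112 ≈ 0.00045, which corresponds to the value reported as 0.0004 above.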

#### **EFFECTS OF AROUSING AUDITORY SIGNALS ON THE EEG**

Next, this study explored temporal spectral dynamics preceding, during, and following fatigue-related behavioral lapses and following arousing warning. **Figure 3** shows time courses of spectral changes in the bilateral occipital area following ineffective warning (light blue trace), effective warning (red trace), and without warning (dark blue trace), compared to those of the alert trials (black trace). **Figure 3** shows that both theta- and alpha-band power steadily increased prior to the lane-departure onset (at time 0 s). Again, the trends of steadily increasing theta- and alpha-band power leading to behavioral lapses in the three groups of drowsy trials were nearly identical, indicating the robustness of the theta and alpha augmentation preceding the behavioral lapses.

**Figure 3** also shows that after the lane-departure onset (at time 0 s), the alpha (top panel) and theta (bottom panel) power abruptly decreased by over 10 and 5 dB, respectively, to nearly the alert (black trace) baseline. More importantly, following the subjects' responses, the spectra of trials with ineffective warning (light blue trace) and without warning (dark blue trace) rapidly rose from the alert baseline to the drowsy level in 5–15 s. The theta and alpha power of trials with effective warning, however, remained low for ∼40 s. The green horizontal lines mark the time points when the difference between the spectra of trials with effective warning and without warning was statistically significant (*p* < 0.01). The spectral difference between the trials with effective warning and without warning was significant from 7 to 18 s in the alpha band and from 7 to 21 s in the theta band (*p* < 0.01). Furthermore, the spectral difference between the trials with effective and ineffective warning was significant from 7 to 16 s in both alpha and theta bands (brown horizontal lines).

**FIGURE 3 |** … ineffective feedback, without feedback, and in alertness, respectively. The green horizontal line indicates the statistically significant differences (*p* < 0.01) between trials with effective feedback and without feedback. The brown line indicates the statistically significant differences (*p* < 0.01) between trials with effective feedback and ineffective feedback.

In sum, these results provided valuable insights into the optimal electrode locations (lateral occipital region) and EEG features (theta- and alpha-band power) for a practical OCLDM system detailed below. The EEG and behavioral data collected from this experiment were used to assess the EEG correlates of fatigue-related lapses and to build a lapse prediction model for the second experiment.

### **DEVELOPING AN OCLDM SYSTEM**

Our previous study (Wang et al., 2012) proposed a cellphone-based drowsiness monitoring and management system to continuously and wirelessly monitor brain dynamics using a lightweight, portable, and low-density EEG acquisition headgear. The system was designed to assess brain activities over the forehead, detect drowsiness, deliver arousing warnings to users experiencing momentary cognitive lapses, and assess the efficacy of the warnings in near real-time. However, the system was neither fully implemented nor experimentally validated in humans. Furthermore, according to the neurophysiological results in section Results: Neurophysiological Correlates of Behavioral Lapses, EEG signals collected over the lateral occipital regions were more informative for lapse detection. This study extends the previous work to design, develop, and test an OCLDM System.

#### **SYSTEM ARCHITECTURE**

**Figure 4A** shows the system diagram of the proposed OCLDM System. The system consists of two major components: (1) a mobile platform featuring the OCLDM algorithm, and (2) a mobile and wireless 4-channel headgear measuring EEG signals over the hair-bearing occipital regions with dry EEG sensors (Liao et al., 2011). The OCLDM System was implemented as an App on an Android-based platform (e.g., Samsung Galaxy S3). The smartphone has a Bluetooth module, 16 GB RAM, an ARM Cortex-A9 processor, Android (Ice Cream Sandwich) OS, and other components. When the App is launched, it automatically searches for and connects to a nearby EEG headgear to receive data from it. Meanwhile, the App opens a USB port to receive the events from a four-lane highway scene so that the EEG data and scene events can be synchronized. The built-in speaker (or a plugged-in earset) of the smartphone delivers an auditory warning signal once the OCLDM System detects that the subject is experiencing a cognitive lapse. Both the EEG data and scene-generated events can be logged onto either the smartphone's built-in memory or an external microSD card for further analysis.

The mobile and wireless EEG acquisition headgear features a 4-channel lightweight portable bio-signal acquisition device powered by a 3.7 V Li-ion battery (Lin et al., 2010b). It consists of a TI MSP430 microprocessor, a pre-amplifier, a battery-charging circuit, a 24-bit ADC, a Bluetooth module, and dry spring-loaded EEG sensors (Liao et al., 2011). The spring-loaded probes of the sensor can penetrate the hair to provide good electrical conductivity with the scalp. The microprocessor controls all the components, including the amplifiers and digitizers, and transmits the digitized EEG data to the Bluetooth module. The 4-channel EEG data are then transmitted to the authorized receiver of the OCLDM System. Depending on the application, the system's sampling rate can be programmed at 128, 256, or 512 Hz. An experienced subject can easily put on this EEG acquisition device within 1–3 min without any help from a technician. **Figure 4B** shows a photo of a subject wearing a 4-channel EEG headgear and performing the simulated driving experiment.

**FIGURE 4 |** … The EEG headgear collected 4-channel brain activities from the lateral occipital area while a subject was performing the lane-drifting experiment. The mobile signal-processing platform received the acquired EEG raw data through Bluetooth, and the event markers generated from the lane-departure scene through a USB interface. Finally, the auditory feedback was delivered to the subject when the averaged EEG power across four channels was 3 dB over the alert baseline. **(B)** A photo of a subject performing the on-line driving experiment while wearing a 4-channel EEG headgear (the small white box attached to a flexible band) over the lateral occipital area.

### **SYSTEM SOFTWARE DESIGN**

**Figure 5** shows the state diagram of the program of the proposed OCLDM System. Three major states, namely Baseline Collecting (BC), Driving Performance Monitoring (DPM), and Warning Efficacy Assessment (WEA), were implemented in the program. When the program is launched, one can modify the parameters in the SETTING page, shown as a square box in the figure. For instance, the parameters can be the duration of baseline data collection, or the threshold for delivering auditory warnings in the other two states. Depending on the application, the lapse threshold in the DPM state can be calculated accordingly. For example, one can use a combination of the power in the alpha, beta, theta, and delta bands to detect cognitive lapses. The program enters the next (DPM) state after the baseline (calibration) data collection has completed. The DPM module continuously monitors the driver's neurophysiological data. The program stays in the DPM state until the lapse threshold is met, which depends on the neurophysiological results shown in section Results: Neurophysiological Correlates of Behavioral Lapses. For instance, when the subject's alpha-band power is 3 dB higher than the alert baseline collected in the BC state, the program delivers an auditory warning to arouse the subject and enters the WEA state. The current value is stored as the lapse reference in the WEA module. The system repeatedly delivers the auditory warning until the EEG power decreases to another threshold.
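The three-state logic described above can be illustrated with a minimal sketch. The class and attribute names are hypothetical, and two details are assumptions not stated in the text: the baseline is taken as the plain average of the dB values collected in the BC state, and the program is assumed to return to the DPM state once the power has dropped 3 dB below the stored lapse reference.

```python
from enum import Enum


class State(Enum):
    BC = "baseline collecting"
    DPM = "driving performance monitoring"
    WEA = "warning efficacy assessment"


class LapseMonitor:
    """Toy model of the BC -> DPM -> WEA loop, driven by alpha power in dB."""

    def __init__(self, baseline_len, lapse_db=3.0, recover_db=3.0):
        self.baseline_len = baseline_len  # number of estimates used as baseline
        self.lapse_db = lapse_db          # rise over baseline that triggers a warning
        self.recover_db = recover_db      # drop below lapse reference that ends it
        self.state = State.BC
        self.baseline_db = None
        self.lapse_ref_db = None
        self._baseline_values = []

    def step(self, power_db):
        """Feed one power estimate; return True if a warning should be played."""
        if self.state is State.BC:
            self._baseline_values.append(power_db)
            if len(self._baseline_values) >= self.baseline_len:
                # Assumption: baseline is the mean of the collected dB values.
                self.baseline_db = sum(self._baseline_values) / len(self._baseline_values)
                self.state = State.DPM
            return False
        if self.state is State.DPM:
            if power_db >= self.baseline_db + self.lapse_db:
                self.lapse_ref_db = power_db  # stored as the lapse reference
                self.state = State.WEA
                return True
            return False
        # WEA: keep warning until power falls recover_db below the lapse reference.
        if power_db <= self.lapse_ref_db - self.recover_db:
            self.state = State.DPM  # assumption: monitoring resumes after recovery
            return False
        return True
```

Feeding the monitor one power estimate per second would mimic the on-line behavior: the warning keeps sounding for as long as `step` returns `True`.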

### **ON-LINE EXPERIMENTAL PARADIGM**

Three new male subjects (who did not participate in the first experiment) with normal hearing and aged 25–30 years participated in the on-line closed-loop lane-departure driving experiments to evaluate the OCLDM System in a more naturalistic setting (in a regular office without any electromagnetic shielding). All of them were asked to read and sign the informed consent form before participating in the studies.

**FIGURE 5 |** … over the alert baseline. Lapse2 represents that the averaged EEG power across four channels has not yet dropped 3 dB from the lapse power.

The entire experiment consisted of a 1 min training period and a ∼60 min driving period. During the training session, subjects were asked to stay fully alert. The averaged alpha power collected in the BC session was used as an alert baseline to determine whether a subject was experiencing cognitive lapses in the driving task. The subject performed the lane-departure driving experiments following the protocol below:


The cognitive lapses were detected when the subject's alpha-band power (Jung et al., 2010; Lin et al., 2010a, 2013), calculated by a moving-averaged STFT with a 256-point sliding window advanced in 1 s steps running on the smartphone, was 3 dB over the alert baseline power (Lin et al., 2010a, 2013 and Results in section Results: Neurophysiological Correlates of Behavioral Lapses). This study used the alpha power fluctuations to monitor cognitive lapses because (1) a recent study showed that the alpha augmentation was sensitive to the transition from full alertness to moderate drowsiness, while the theta augmentation was more sensitive to the transition from moderate to deep drowsiness (Chuang et al., 2012); and (2) the empirical results of this study showed that the augmentation of alpha-band power was greater than that of the theta-band power (**Figure 2**). The system repeatedly delivered the auditory warning until the alpha-band power had dropped to 3 dB below the power level at which the cognitive lapse was identified.
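The feature extraction behind this criterion can be sketched as follows. This is an illustration under stated assumptions, not the code running on the smartphone: the sampling rate is assumed to be 256 Hz (one of the rates the headgear supports), the alpha band is taken as 8–12 Hz as defined earlier, no window function or moving average is applied, and all function names are hypothetical. A direct DFT over the in-band bins is used to keep the example dependency-free.

```python
import cmath
import math

FS = 256             # assumed sampling rate in Hz
WIN = 256            # 256-point window, as stated in the text
ALPHA = (8.0, 12.0)  # alpha band in Hz


def band_power_db(window, fs=FS, band=ALPHA):
    """Power (in dB) of one window inside `band`, from a direct DFT of in-band bins."""
    n = len(window)
    total = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if band[0] <= freq <= band[1]:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(window))
            total += abs(coeff) ** 2 / n
    return 10.0 * math.log10(max(total, 1e-12))  # floor avoids log10(0)


def alpha_series_db(signal, fs=FS, win=WIN):
    """Alpha power in dB for a window advanced in 1 s steps."""
    return [band_power_db(signal[s:s + win], fs)
            for s in range(0, len(signal) - win + 1, fs)]


def exceeds_lapse_threshold(power_db, baseline_db, threshold_db=3.0):
    """True when the power is at least threshold_db above the alert baseline."""
    return power_db - baseline_db >= threshold_db
```

Because the comparison is made in decibels, a 3 dB increase corresponds to roughly a doubling of alpha-band power relative to the alert baseline.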

# **RESULTS FROM THE OCLDM SYSTEM**

The numbers of detected cognitive lapses varied across subjects. **Table 1** lists the numbers of trials with effective, ineffective warning and without warning, respectively. Here, the way we defined the effective trials was based on the RT in response to the lane-departure event immediately following the arousing signal (CT + 1 whose RT was shorter than two times aRT); while the ineffective trials had RT longer than three times aRT.



**Figure 6** shows the boxplot of behavioral performance (RTs) of trials with effective warning (red), with ineffective warning (light blue), and without warning (dark blue), compared to the averaged aRT (black) during the on-line experiments. The effective trials had RTs comparable to the averaged aRT (less than 1 s) in both CT and CT + 1. Note that the RTs of CT + 1 with effective vs. ineffective warning differed largely because that was how the effective and ineffective trials were defined. However, the RTs of the CT trials (red and light blue) of these two groups were very comparable. That is, even though the subjects responded to the arousing warning by steering the simulated car immediately back to the cruising position, they could well be totally non-responsive to the very next lane-departure event. This finding is consistent with our off-line study reported in section Results: Neurophysiological Correlates of Behavioral Lapses, in which the arousing warning was delivered to subjects who had just had a behavioral lapse.

**Figure 7** shows the averaged alpha-band spectral time courses across subjects and trials with effective warning (red trace), with ineffective warning (light blue trace), and without warning (dark blue trace), compared to trials with aRT (black trace). All spectral time courses were aligned to the user response onset (thick vertical black line at time 0 s), and the auditory warnings for effective and ineffective trials were delivered ∼5 s before the user response. In the trials following effective auditory warning, the alpha power decreased steadily and reached the level of the alert trials in ∼7 s. The power spectra remained as low as that of the alert baseline from 7 to 20 s after response onset. In the trials with ineffective auditory warning, the spectral time series fluctuated strongly due to the small number of trials. In the trials without warning, the alpha power fluctuated before response onset and steadily decreased until ∼7 s. Thereafter, the alpha power increased again from ∼7 to 13 s, suggesting that the subjects might have been partially and temporarily aroused by the lane-departure event and their own behavioral response, but returned to the fatigue state rapidly thereafter.

**FIGURE 7 |** … (light blue), and without feedback (blue trace), compared to trials with aRT (black). The time course of power was estimated by a short-time Fourier transform with a 256-point time window and a 224-point overlap.

#### **DISCUSSIONS**

Many studies have shown that the brain dynamics correlated with behavioral lapses can be assessed from EEG data. Recent studies have also shown auditory signals can arouse drowsy subjects and affect EEG activities (Lin et al., 2010a, 2013). However, in these studies, the arousing warning was delivered to subjects after they displayed behavioral lapses, which in reality may be too late because the behavioral lapse might have already had catastrophic consequences. Therefore, a system that features real-time lapse detection and delivers warnings to the drowsy subjects is desirable in preventing catastrophic incidents while driving.

The first experiment of this study showed that EEG power changes in either the alpha or theta band can be used as an indicator for assessing the subjects' fatigue (cf. **Figure 3**), and that auditory warning temporarily reduces the alpha- and theta-band power and mitigates the behavioral lapses (cf. **Figure 2A**). In addition, EEG changes after delivery of an auditory warning are a good indicator of the efficacy of the arousing warning. More importantly and interestingly, empirical results of the first experiment showed that arousing auditory signals could reliably mitigate human behavioral lapses, but these immediate behavioral responses could not guarantee that the subjects were fully awake, alert, or attentive, similar to snoozing after an alarm is turned off. This finding may open a new research direction on how to accurately confirm a subject's cognitive level in sustained-attention tasks, such as those performed by an aircraft navigator or a long-haul truck driver. In other words, further studies exploring the brain changes in this sleep inertia period may provide valuable insights into brain dynamics during a transitional state of lowered arousal occurring immediately after awakening from sleep. Based on previous studies (Lin et al., 2010a, 2013) and the results of the first experiment, this study further developed a true OCLDM system to detect/predict cognitive lapses based on the EEG spectra, deliver an arousing warning on the occurrence of a cognitive lapse, and assess the efficacy of the arousing warning, again based on the EEG spectra. Most importantly, the EEG spectral changes within ∼10 s after delivering the arousing warning were closely monitored, such that false-awake situations could be reduced. This study then documented the design, development, and on-line evaluation of the proposed OCLDM System, which features a lightweight wireless EEG acquisition headgear and a smartphone-based signal-processing platform.
Experimental results showed that subjects' EEG power could almost remain at the alert state without bouncing back to the drowsy level (cf. **Figure 7**). These results suggest that the proposed system could prevent potential behavioral lapses based solely on the EEG signals, and this demonstration could lead to a real-life application of the dry and wireless EEG technology and smartphone-based signal-processing platform. An interesting question is if the neural correlates of fatigue could be generalized across different sustained-attention tasks and different recording conditions. In the past few years, we have conducted several sustained-attention tasks, including auditory target detection tasks (Makeig and Jung, 1995, 1996; Jung et al., 1997), visual compensatory tracking tasks (Huang et al., 2008), and simulated driving tasks (Lin et al., 2005, 2006, 2008b) and found that performance-related EEG dynamics were comparable across tasks (Huang et al., 2007b). Results of these studies also showed the fatigue-related brain dynamics were quite consistent across different recording environments (within a well-controlled EEG laboratory vs. a 6-degree-of-freedom motion platform) and responding methods (using a button press or a steering wheel). Therefore, it is reasonable to believe the methods developed under this study could be translated from laboratory settings to real-world environments.

In sum, this study demonstrated the feasibility of translating a laboratory-based passive BCI system to a neuroergonomic device that is capable of continuously monitoring and mitigating operator neurocognitive fatigue using a pervasive smartphone in real-world environments. The passive BCI technologies might also be applicable to other real-world cognitive-state monitoring, such as attention, distraction, comprehension, confusion, and emotion. We thus believe more real-world passive BCI implementations will emerge in the foreseeable future.

#### **ACKNOWLEDGMENT**

This work was supported in part by US Office of Naval Research, Army Research Office (W911NF-09-1-0510), Army Research Laboratory (W911NF-10-2-0022), and DARPA (DARPA/USDI D11PC20183). This work was also supported in part by the UST-UCSD International Center of Excellence in Advanced Bioengineering sponsored by the Taiwan National Science Council I-RiCE Program under Grant Number NSC-102-2911-I-009-101, and the Aiming for the Top University Plan of National Chiao Tung University sponsored by the Ministry of Education of Taiwan, under Grant Number 103W963. The authors also appreciate Melody Jung's editorial assistance.

### **REFERENCES**


Zander, T. O., Kothe, C., Welke, S., and Roetting, M. (2009). "Utilizing secondary input from passive brain-computer interfaces for enhancing human-machine interaction," in *Foundations of Augmented Cognition. Neuroergonomics and Operational Neuroscience*, eds D. Schomorrow, I. Estabrooke, and M. Grootjen (Berlin; Heidelberg: Springer), 759–771.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 February 2014; accepted: 24 September 2014; published online: 13 October 2014.*

*Citation: Wang Y-T, Huang K-C, Wei C-S, Huang T-Y, Ko L-W, Lin C-T, Cheng C-K and Jung T-P (2014) Developing an EEG-based on-line closed-loop lapse detection and mitigation system. Front. Neurosci. 8:321. doi: 10.3389/fnins.2014.00321*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Wang, Huang, Wei, Huang, Ko, Lin, Cheng and Jung. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Knowledge-based identification of sleep stages based on two forehead electroencephalogram channels

# *Chih-Sheng Huang1,2, Chun-Ling Lin1,3, Li-Wei Ko1,4,5\*, Shen-Yi Liu1,6, Tung-Ping Su1,6 and Chin-Teng Lin1,2\**

*<sup>1</sup> Brain Research Center, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>2</sup> Institute of Electrical Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>3</sup> Department of Electrical Engineering, Ming Chi University of Technology, New Taipei City, Taiwan*

*<sup>4</sup> Institute of Bioinformatics and Systems Biology, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>5</sup> Department of Biological Science and Technology, National Chiao-Tung University, Hsinchu, Taiwan*

*<sup>6</sup> Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Josef Faller, Graz University of Technology, Austria Sarah Christin Freytag, Berlin Institute of Technology, Germany*

#### *\*Correspondence:*

*Li-Wei Ko, Brain Research Center, Institute of Bioinformatics and Systems Biology, and Department of Biological Science and Technology, National Chiao-Tung University, 416 MIRC., 1001 University Rd., Hsinchu 300, Taiwan e-mail: lwko@mail.nctu.edu.tw; Chin-Teng Lin, Brain Research Center and Institute of Electrical Control Engineering, National Chiao-Tung University, 416 MIRC., 1001 University Rd., Hsinchu 300, Taiwan*

*e-mail: ctlin@mail.nctu.edu.tw*

Sleep quality is important, especially given the considerable number of sleep-related pathologies. The distribution of sleep stages is a highly effective and objective way of quantifying sleep quality. As a standard multi-channel recording used in the study of sleep, polysomnography (PSG) is a widely used diagnostic scheme in sleep medicine. However, the standard process of sleep clinical test, including PSG recording and manual scoring, is complex, uncomfortable, and time-consuming. This process is difficult to implement when taking the whole PSG measurements at home for general healthcare purposes. This work presents a novel sleep stage classification system, based on features from the two forehead EEG channels FP1 and FP2. By recording EEG from forehead, where there is no hair, the proposed system can monitor physiological changes during sleep in a more practical way than previous systems. Through a headband or self-adhesive technology, the necessary sensors can be applied easily by users at home. Analysis results demonstrate that classification performance of the proposed system overcomes the individual differences between different participants in terms of automatically classifying sleep stages. Additionally, the proposed sleep stage classification system can identify kernel sleep features extracted from forehead EEG, which are closely related with sleep clinician's expert knowledge. Moreover, forehead EEG features are classified into five sleep stages by using the relevance vector machine. In a leave-one-subject-out cross validation analysis, we found our system to correctly classify five sleep stages at an average accuracy of 76.7 ± 4.0 (*SD*) % [average kappa 0.68 ± 0.06 (*SD*)]. Importantly, the proposed sleep stage classification system using forehead EEG features is a viable alternative for measuring EEG signals at home easily and conveniently to evaluate sleep quality reliably, ultimately improving public healthcare.

**Keywords: sleep quality, sleep stages, polysomnography (PSG), electroencephalogram (EEG), sleep stage classification system**

# **INTRODUCTION**

Monitoring human physiology during sleep is essential for individual health. Sleep is increasingly viewed as playing an important role in restitution (Akerstedt et al., 2007). As an important aspect of well-being, sleep quality is closely related to overall quality of life, life satisfaction, secretion of the stress hormone cortisol, and inadequate immunity (Gallagher et al., 2010). Evaluating sleep quality is especially relevant, owing to the considerable number of pathologies linked to sleep. Sleep stages are also recorded for the clinical diagnosis and treatment of sleep disorders. Sleep quality is most closely related to the distribution of the depth of sleep; indeed, sufficient sleep quality requires adequate deep sleep. The depth of sleep is characterized by different cortical electrical activities. Several sleep stages can be defined by variations of cortical electrical activities and other physiological signals, i.e., muscle activity and eye movement. According to the Rechtschaffen and Kales rules (R&K rules), sleep can be segmented into wakefulness, movement time (MT), REM, and sleep stages S1, S2, S3, and S4 based on features of EEG, EOG, and EMG (Kales and Rechtschaffen, 1968). In addition to modifying the standard guidelines for sleep classification by R&K, the American Academy of Sleep Medicine (AASM) developed guidelines for terminology, recording method, and scoring rules for sleep-related phenomena (Iber et al., 2007). In the AASM guidelines, sleep stages S1 to S4 are referred to as NREM stage 1 (N1), NREM stage 2 (N2), and NREM stage 3 (N3). N3 reflects slow wave sleep (SWS, R&K stages S3 + S4).
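The correspondence between the two scoring schemes can be written as a simple lookup table. The sketch below uses hypothetical names; the stage abbreviations "W" and "R" follow the AASM convention, and mapping movement time to `None` is an assumption reflecting that the AASM scheme has no separate MT stage.

```python
# R&K stage label -> AASM stage label
RK_TO_AASM = {
    "W": "W",     # wakefulness
    "S1": "N1",
    "S2": "N2",
    "S3": "N3",   # S3 and S4 are merged into N3 (slow wave sleep)
    "S4": "N3",
    "REM": "R",
    "MT": None,   # assumption: movement time has no AASM counterpart
}


def convert_hypnogram(rk_stages):
    """Translate a sequence of R&K labels into AASM labels."""
    return [RK_TO_AASM[stage] for stage in rk_stages]
```

For instance, the R&K sequence W, S1, S2, S3, S4, REM becomes W, N1, N2, N3, N3, R, which reduces the number of distinct sleep-and-wake classes to the five used by the classifier in this work.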

As the reference standard clinical multi-parametric system, polysomnography (PSG) (Holland et al., 1974) is used in sleep studies to define the different physiological sleep stages and diagnose many sleep disorders, including narcolepsy, restless legs syndrome, rapid eye movement (REM) behavior disorder, parasomnias, and sleep apnea. The PSG system requires a minimum of 11 channels, including electroencephalogram (EEG), electromyogram (EMG), electrooculogram (EOG), oxygen saturation (SpO2), and electrocardiogram (ECG). However, assessing a complete PSG has several limitations. First, PSG is not a portable device and is typically located in a sleep center, which is an unfamiliar environment for patients. Second, PSG requires many physiological electrodes and wires placed on the scalp and body, possibly further affecting the patients' sleep. Third, a standard sleep diagnosis in clinical practice is time-consuming and expensive (Zoubek et al., 2007). These processes are monotonous, time-consuming, and unproductive. A simpler EEG acquisition and analysis system must be operable by patients for home use, as well as solve current PSG problems.

Recent studies have adopted bioelectrical signals (i.e., EEG, ECG, EMG, and EOG signals) that subjects can record at home in order to develop sleep stage scoring methods, while attempting to obtain results similar to those of experts involved in visual scoring (Park et al., 2000; Anderer et al., 2005; Tian and Liu, 2005; Berthomier et al., 2007; Doroshenkov et al., 2007; Virkkala et al., 2007; Wang et al., 2009; Güneş et al., 2010; Jo et al., 2010; Yılmaz et al., 2010; Eiseman et al., 2011). Most sleep stage classification approaches consist of feature extraction and classification schemes. The references differ from each other not only in the presented feature extractions and the corresponding classification schemes, but also in the bioelectrical signals used, such as EEG, EOG, or ECG. Feature extraction is a highly efficient means of achieving a satisfactory classification performance when developing a sleep stage classification approach. Certainly, if the extracted features achieve high separability between different classes, classifiers can perform satisfactorily. In recent studies (Berthomier et al., 2007; Doroshenkov et al., 2007; Güneş et al., 2010; Jo et al., 2010), the signal processing procedure treats an entire 30-s epoch as a processing unit to extract spectral and temporal information directly. The specific characteristics of sleep stages are easily smoothed out when an entire 30-s signal is processed at once. For instance, the k-complex and sleep spindle appear only briefly, lasting 0.5–1.5 s within a 30-s epoch. Therefore, short-term signal processing should be incorporated when developing a feature extraction approach. Compared with EEG measurement, despite overcoming the hair problem, EOG and ECG still have certain limitations. For instance, EOG and ECG require adhesive electrode pads, and the locations of the EOG (or ECG) electrodes and the corresponding amplifier are far apart from each other. 
A subject's sleep position may interfere with the wire, thereby degrading the EOG and ECG signal quality. Despite the persistent hair and conductive gel problems associated with EEG measurement, recent developments to resolve these problems include a headband and portable EEG recording device, as well as a dry polymer foam electrode for long-term EEG measurement (Lin et al., 2008, 2011). Additionally, physiological characteristics during sleep are more easily identified by EEG than by EOG and ECG, explaining why the former is preferred when classifying sleep stages.

Berthomier et al. (2007) assessed an automatic sleep scoring software (ASSEEGA). The system adopts a 3-step procedure for automatic sleep scoring, based on a single EEG channel. In classifying five sleep stages, the agreements between ASSEEGA and two expert manual scorings are 76.0% (kappa = 0.67) and 78.2% (kappa = 0.69). Although highly promising for diagnostic and automatic ambulant scoring, this system still requires 2 bipolar channels (Cz-Pz, international 10–20 standard system), which are located at hairy sites at the back of the skull, as well as conductive gel and a laboratory EEG recording device to achieve a high resolution. Contact between hair and conductive gel, together with the subject's sleep position, may also degrade the EEG quality, further lowering the estimation accuracy of the sleep stage.

This work develops a sleep stage classification system based on two forehead EEG channels, i.e., FP1 and FP2. FP1 and FP2 EEG measurements have the following advantages: the EEG is recorded at a non-hairy site; the two signals also contain eye movement information; and the system can be self-adhesive and easily self-applied by homecare users for long-term monitoring. The proposed classification system incorporates a novel feature extraction approach, capable of extracting spectral information while considering manual scoring rules. The proposed system further incorporates the relevance vector machine (RVM) as the basic classifier. Importantly, the proposed system provides preliminary results for diagnostic assistance and automatic ambulant scoring to determine whether a patient requires detailed testing with the PSG system in a sleep laboratory. Furthermore, the headband and portable EEG device as well as dry EEG electrodes (Lin et al., 2008, 2011) greatly facilitate the implementation of the proposed system in homecare settings for long-term monitoring of sleep quality, as well as for large-scale population studies.

# **MATERIALS AND METHODS**

#### **SUBJECTS AND DATA ACQUISITION**

Ten right-handed adults participated in our study (ten males; mean age 24 ± 6 years). None of the participants reported having a history of psychological disorders. Following a detailed explanation of the experimental procedure, all participants completed a consent form before the experiment. To avoid influences from other external factors, all subjects were instructed not to consume alcoholic or caffeinated drinks or sleeping pills beforehand. The experiments were performed at night (10:00 p.m.–08:00 a.m.). All experimental procedures received approval from the local ethics committee (Institutional Review Board of Taipei Veterans General Hospital, Taiwan).

The sleep PSG signals were recorded with a sampling rate of 128 Hz using Sandman Elite (Sandman Elite, Nellcor Puritan Bennett [Melville] Ltd., Kanata, Ontario, Canada) (**Figure 1D**). All subjects were required to sleep for a single night in the sleep laboratory of National Chiao-Tung University (**Figure 1A**) and wore all PSG electrodes during sleep. The complete PSG recording contains six EEG channels (F3, F4, C3, C4, O1, O2), two EOG channels, chin EMG, leg EMG, airflow signals, lead-II ECG, oximetry, nasal pressure, snoring sound, and body position (**Figure 1B**). Forehead EEG signals from FP1 and FP2 were also recorded simultaneously by the same PSG system (**Figure 1C**). All of the EEG signals were re-referenced to the opposite lateral mastoids (A1 and A2). The contact impedance between all of the electrodes and the scalp was kept below 5 kΩ. No adjustment or artifact removal techniques were applied to the data. Each data set contained 5–8 h of forehead EEG signals and complete PSG signals.

**FIGURE 1 | Experiment environment and the polysomnography recording device (Holland et al., 1974). (A)** The experiment environment in the sleep laboratory of National Chiao-Tung University. **(B)** The real situation with complete PSG recording. **(C)** The FP1 and FP2 EEG channels are also recorded by the same PSG system simultaneously. **(D)** The PSG amplifier (Sandman Elite, Nellcor Puritan Bennett [Melville] Ltd., Kanata, Ontario, Canada).

**Table 1 | Distribution of sleep stages for each subject.**

*The number in each lattice represents the number of corresponding sleep stages. SC refers to the subject code.*

#### **SLEEP STAGE MANUAL SCORING**

Sleep data for each subject were scored visually by an experienced sleep expert into five sleep stages based on the AASM manual scoring rules. The five sleep stages are W, N1, N2, N3, and REM, and each sequential 30-s epoch was assigned to one sleep stage. **Table 1** summarizes the distribution of sleep stages for each subject.

#### **PROPOSED AUTOMATIC SLEEP STAGES CLASSIFICATION SYSTEM**

Visual manual scoring often identifies the different sleep stages based on EEG activities. To incorporate the EEG-related advantages of the manual scoring rules, this work presents a novel sleep stage classification system that combines a feature extraction approach, which is inspired by the sleep clinician's expert knowledge in translating two forehead EEG signals into relevant features, with a relevance vector machine (RVM) in order to classify the sleep stages automatically. **Figure 2** displays the flowchart of the proposed sleep stage classification system. As per AASM recommendations, a sequential 30-s EEG recording serves as the unit to which a sleep stage is assigned. In the preprocessing step, all of the 30-s EEG signals are filtered by a band-pass filter of 0.5–50 Hz. The following sections describe the proposed feature extraction, normalization, and RVM procedures in detail. After RVM classification, each input 30-s EEG recording is assigned a sleep stage. When the recording procedure stops, the final sleep stage results for the whole recording can be estimated.
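The preprocessing step can be sketched as follows. This is only an illustration under stated assumptions: the text gives the pass band (0.5–50 Hz), the sampling rate (128 Hz), and the 30-s epoch length, but not the filter design, so the zero-phase fourth-order Butterworth filter, the choice to filter before epoching, and the helper name `bandpass_epochs` are all assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 128        # sampling rate in Hz, as stated for the PSG recording
EPOCH_SEC = 30  # AASM scoring unit in seconds


def bandpass_epochs(signal, fs=FS, low=0.5, high=50.0, order=4):
    """Band-pass one EEG channel and cut it into non-overlapping 30-s epochs."""
    # Assumed design: zero-phase Butterworth; only the pass band is given in the text.
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, np.asarray(signal, dtype=float))
    samples_per_epoch = fs * EPOCH_SEC
    n_epochs = filtered.size // samples_per_epoch
    # Drop the incomplete tail so that every row is exactly one epoch.
    return filtered[: n_epochs * samples_per_epoch].reshape(n_epochs, samples_per_epoch)


# Synthetic 90-s test signal: a 10-Hz rhythm riding on a large DC offset.
t = np.arange(90 * FS) / FS
demo = 100.0 + np.sin(2 * np.pi * 10 * t)
epochs = bandpass_epochs(demo)
```

Each row of `epochs` is one 30-s scoring unit (3840 samples at 128 Hz); the DC offset lies outside the pass band and is removed, while the 10-Hz component is retained.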

#### *Feature extraction*

Previous studies regarded the entire 30-s signal as a processing unit and extracted frequency-domain features directly by fast Fourier transform (FFT). Under this circumstance, the specific characteristics of the power spectral density of the 30-s signal are easily smoothed out, and the corresponding sleep spectral activities are lost when the characteristics of sleep appear only for a short period in the time signal. The entire 30-s signal contains a significant amount of information, and the spectral information obtained directly from FFT cannot accurately reflect the advantages of the manual scoring rules. To resolve this problem, this work presents a novel feature extraction approach that extracts spectral features by short-time Fourier transform and manual scoring knowledge, retaining the temporal properties of the manual scoring rules while representing the spectral response as power spectral density. According to the manual scoring rules of AASM (Iber et al., 2007), the EEG activities include alpha rhythm, theta rhythm, K complex, sleep spindle, and slow waves. For instance, an epoch is scored as wakefulness when more than 50% of the epoch has alpha (8–12 Hz) rhythm. An epoch is scored as N1 when alpha rhythm is attenuated and replaced by low-amplitude, predominantly theta (5–7 Hz) rhythm for more than 50% of the epoch. An epoch is scored as N2 when the K complex or sleep spindle (12–14 Hz) occurs with the theta background rhythm. An epoch is scored as N3 when more than 20% of the epoch has high-amplitude slow wave activity. If a single epoch contains two or more stages, the stage that occupies the greatest portion of the epoch is assigned.

**Figure 3** describes the proposed feature extraction for each 30-s EEG signal. Following the expert knowledge of manual scoring rules, the short-time Fourier transform (STFT) with a 1-s Hamming window and a 0.5-s overlap is used rather than an FFT over the entire 30-s signal (**Figure 3A**). Applying STFT to an entire 30-s EEG signal yields the power spectral densities (PSDs) of fifty-nine segments. Following STFT, the PSD of each segment is normalized to reduce individual differences: the PSD of each frequency bin of a segment is divided by the total PSD of that segment (**Figure 3B**). The proposed features are defined as follows.
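The segmentation and normalization described above can be sketched in a few lines. The window length, overlap, and sampling rate follow the text; the function names, the use of the squared FFT magnitude as the PSD estimate, and normalization over all frequency bins (rather than only 0.5–50 Hz) are assumptions made for illustration.

```python
import numpy as np


def normalized_segment_psd(epoch, fs=128, win_sec=1.0, hop_sec=0.5):
    """Relative PSD of each 1-s Hamming-windowed segment of one 30-s epoch."""
    epoch = np.asarray(epoch, dtype=float)
    win = int(win_sec * fs)   # 128 samples per segment
    hop = int(hop_sec * fs)   # 64-sample step, i.e., 0.5-s overlap
    n_seg = (epoch.size - win) // hop + 1  # 59 segments for a 30-s epoch
    window = np.hamming(win)
    segments = np.stack([epoch[i * hop: i * hop + win] for i in range(n_seg)]) * window
    power = np.abs(np.fft.rfft(segments, axis=1)) ** 2
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    total = power.sum(axis=1, keepdims=True)
    # Divide every bin by the segment total (assumed to span all bins);
    # an all-zero segment is left at zero instead of dividing by zero.
    rel = power / np.where(total > 0, total, 1.0)
    return freqs, rel


def band_power(freqs, rel, lo, hi):
    """Sum the relative PSD inside [lo, hi] Hz for every segment."""
    mask = (freqs >= lo) & (freqs <= hi)
    return rel[:, mask].sum(axis=1)


# Synthetic epoch: a pure 10-Hz (alpha-band) rhythm.
t = np.arange(30 * 128) / 128
epoch = np.sin(2 * np.pi * 10 * t)
freqs, rel = normalized_segment_psd(epoch)
alpha = band_power(freqs, rel, 8, 12)
```

With a 1-s window the frequency resolution is 1 Hz, so each named band corresponds to a small set of integer-frequency bins; `alpha` holds one relative band power per segment.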

Slow wave activity (0.5–2 Hz with an amplitude greater than 75 μV) is an important index for scoring N3, and slow wave activity has to occupy more than 20% of the epoch visually. Therefore, PSD in lower delta (0.5–2 Hz, denoted as ↓Delta, **Figure 3C**) is chosen to represent the feature of the slow wave activity. Moreover, the average PSDs of the lower delta of upper and lower 80% of 59 segments are viewed as the features of slow wave activity.

The K complex and sleep spindle are the most important characteristics in visually identifying N2; they occur spontaneously, roughly every two epochs. The K complex is a well-delineated negative sharp wave in EEG immediately followed by a positive component, with a total duration of at least 0.5 s; sleep spindles are oscillations in the sigma band (12–14 Hz) with a duration of 0.5–1.5 s. Hence, the maximum PSD of the sigma band among the 59 segments and the average PSD of the sigma band of the remaining 58 segments are used as features of the sleep spindle. Moreover, the maximum PSD of the delta band (1–4 Hz) among the 59 segments and the average PSD of the delta band of the remaining 58 segments are used as features of the K complex (**Figure 3C**).

Theta rhythm tends to appear during drowsiness, meditation, and sleep onset, and contributes to scoring an epoch as N1, N2, or REM. The average PSDs of theta of the upper and lower 50% of the 59 segments represent the features of light sleep (N1+N2) (**Figure 3C**).

Movement time, normal resting waking consciousness, and wakeful relaxation with eyes closed are accompanied by gamma rhythm (30–50 Hz), beta rhythm (15–30 Hz), and alpha rhythm, respectively. Experts score an epoch as wakefulness when beta and alpha rhythms appear in more than 50% of the epoch. Thus, the average PSDs of the beta and alpha bands of the upper and lower 50% of the 59 segments are used as the features of wakefulness. The movement time stage is mainly accompanied by muscle artifacts obscuring the EEG for more than half an epoch. Hence, the average PSD of gamma over the 59 segments is used as the feature of the movement time stage (**Figure 3C**).

As described above, sixteen features are extracted from each of the FP1 and FP2 EEG channels. Two features are extracted as ↓delta activity; two features are extracted as delta activity; two features are extracted as theta activity; two features are extracted as alpha activity; two features are extracted as sigma activity; two features are extracted as beta activity; and one feature is extracted as gamma activity.
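The two kinds of sorted-power features named above (a maximum plus the mean of the remaining segments, and means over an upper and a lower portion of the sorted segments) might be computed as in the following sketch. The helper names are hypothetical, and the handling of the odd segment count (the upper portion is rounded up) is an assumption, since the text does not specify how 59 segments are split.

```python
import math

import numpy as np


def max_and_rest_mean(band):
    """Largest per-segment band power and the mean of the remaining segments."""
    ordered = np.sort(np.asarray(band, dtype=float))
    return ordered[-1], ordered[:-1].mean()


def upper_lower_mean(band, upper_frac=0.5):
    """Mean of the top `upper_frac` of sorted segments and mean of the rest."""
    ordered = np.sort(np.asarray(band, dtype=float))[::-1]  # descending order
    # Assumed convention: round the size of the upper group up, keep one value below.
    k = min(math.ceil(upper_frac * ordered.size), ordered.size - 1)
    return ordered[:k].mean(), ordered[k:].mean()


# Hypothetical sigma band power: flat background with one spindle-like segment.
sigma = np.full(59, 0.02)
sigma[17] = 0.6
spindle_max, spindle_rest = max_and_rest_mean(sigma)

# Hypothetical theta band power taking the values 1..59 for easy checking.
theta_upper, theta_lower = upper_lower_mean(np.arange(1, 60), upper_frac=0.5)
```

A single brief spindle dominates `spindle_max` while barely moving `spindle_rest`, which is the contrast that an average over the whole 30-s epoch would wash out.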

To investigate the influence of the proposed feature extraction approach, it is compared with the conventional PSD feature extraction approach, in which the PSD is calculated by FFT directly over each entire 30-s EEG signal. Notably, this work does not further consider integrating the frequency bins into specific frequency bands such as delta and theta. The PSD activity ranging from 1 to 50 Hz is used here as the input features, so fifty features are extracted from each of the FP1 and FP2 EEG channels.

#### *Relevance vector machine*

Relevance vector machine (RVM) is a learning algorithm based on a Bayesian framework and the support vector machine (SVM) (Tipping, 2001). RVM has a form similar to that of SVM; they differ in how the separation between binary classes is measured. SVM learns the maximal distance of margins between binary classes, while RVM learns the maximal probability of margins between binary classes. Compared with RVM, SVM has the following disadvantages: the number of support vectors (SVs) grows with an increasing number of training patterns; overfitting may occur if SVM selects too many SVs; the decision value is derived from the hyperplane function of SVM in the feature space and therefore cannot be expressed as a probability; and the penalty parameter of SVM, which significantly influences the classification results, must be set, generally by cross-validation. Further details of RVM and SVM can be found in Tipping (2001).

#### **SYSTEM PERFORMANCE VALIDATION**

To illustrate the efficiency of the proposed feature extraction approach, this work evaluates the separability of different feature extraction approaches by using the Fisher criteria (Fukunaga, 1990). Two Fisher criteria are expressed as follows:

$$J_1 = \frac{\operatorname{tr}(\mathbf{S}_b)}{\operatorname{tr}(\mathbf{S}_w)}$$

$$J_2 = \operatorname{tr}\left(\mathbf{S}_w^{-1}\mathbf{S}_b\right)$$

where *S<sub>b</sub>* and *S<sub>w</sub>* denote the between-class and within-class scatter matrices, and *tr(A)* refers to the trace of a square matrix *A*. Larger *J*<sub>1</sub> and *J*<sub>2</sub> imply a larger separability of the presented features in feature space.
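As a minimal numerical sketch of the two criteria, the scatter matrices can be accumulated class by class. The count-weighted (unnormalized) form of the scatter matrices used here is one common convention and is an assumption; the function name and the synthetic data are illustrative only.

```python
import numpy as np


def fisher_criteria(X, y):
    """Return (J1, J2) for a feature matrix X (n_samples, n_features) and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        S_w += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
    J1 = np.trace(S_b) / np.trace(S_w)
    # Solve S_w Z = S_b rather than forming the inverse explicitly.
    J2 = np.trace(np.linalg.solve(S_w, S_b))
    return J1, J2


# Synthetic two-class data: identical within-class spread, different class offsets.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
labels = np.repeat([0, 1], 50)
J1_near, J2_near = fisher_criteria(base + 0.1 * labels[:, None], labels)
J1_far, J2_far = fisher_criteria(base + 5.0 * labels[:, None], labels)
```

Shifting one class further away leaves the within-class scatter unchanged and enlarges the between-class scatter, so both criteria increase.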

Using the extracted features, this work compares the classification performances of linear discriminant analysis (LDA) (Fukunaga, 1990), the k-nearest neighbor classifier (k-NN) (Fukunaga, 1990), SVM (Chang and Lin, 2011; Li et al., 2012), and RVM (Tipping, 2001). Several trials are performed for k-NN, in which the value of k is varied from 1 to 20, to determine the value that maximizes the accuracy. The selected k in k-NN is 13. For simplicity, this work adopts only the linear kernel for SVM and RVM when evaluating how the proposed feature extraction approach influences classification. An advantage of SVM and RVM is that the kernel function extends, and implicitly defines, the feature space. Hence, two popular kernel functions of SVM and RVM, i.e., the linear and radial basis function (RBF) kernels, are also examined more closely. For SVM, the penalty parameter C, which controls the trade-off between the margin and the size of the slack variables, is determined by a grid search within the set {0.1, 0.5, 1, 10, 50, 100, 500, 1000, 1500}. Here, the C selected from the grid search is 50. A grid search is also performed to derive the proper parameter of the RBF kernel within the set {0.1, 0.25, 0.5, 0.75, 1, 2, 5, 10, 50, 100, 1000}. Here, the selected RBF parameter for both SVM and RVM is 0.5. The multiclass strategy adopted for SVM and RVM in this work is the one-against-all strategy (Bottou et al., 1994; Li et al., 2012).

In this work, the sleep PSG data of ten subjects are collected. The classification performance of the proposed system is evaluated using leave-one-subject-out cross validation (LOSO). LOSO takes the data from one subject as the testing set and the data from the remaining subjects as the training set; the same procedure is repeated until every subject has served as the testing set. The training data and the testing data should be independent of each other; that is, testing information should not be used in the training step. The k-fold cross validation approach is the conventional means of evaluating classification performance. However, when applied within a single subject, it cannot ensure that the training data and testing data are independent. Because the training data and testing data in LOSO come from different subjects, they are independent. Hence, LOSO is more objective than the commonly adopted k-fold cross validation within a single subject.
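The LOSO procedure amounts to grouping epochs by subject and holding out one group at a time, as the following sketch shows. The generator name and the toy subject labels are hypothetical; any classifier could be trained and tested inside the loop.

```python
import numpy as np


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == subject)
        train_idx = np.flatnonzero(subject_ids != subject)
        yield subject, train_idx, test_idx


# Toy example: ten subjects with five epochs each (real recordings have far more).
subjects = np.repeat(np.arange(10), 5)
folds = list(loso_splits(subjects))
```

Each of the ten folds trains on the epochs of nine subjects and tests on the one held out, so no subject contributes to both sets within a fold.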

The system performance is evaluated using three indices, i.e., overall accuracy, sensitivity, and Cohen's kappa coefficient. Overall accuracy is the proportion of epochs for which the estimated sleep stage agrees with the expert's score. Sensitivity reflects the ability to identify positive results for each class. Cohen's kappa is a statistical measure of inter-rater agreement for qualitative (categorical) items (Cohen, 1960). It is generally considered more robust than a simple percent agreement calculation, since kappa takes into account the agreement occurring by chance.
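All three indices follow from the confusion matrix between expert and automatic scoring. The sketch below uses the standard definitions, with rows as expert labels and columns as predicted labels; the function names and the small two-class example are illustrative and do not reproduce any data from this study.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are expert labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def overall_accuracy(cm):
    return np.trace(cm) / cm.sum()


def sensitivity(cm):
    """Per-class recall: correctly detected epochs divided by expert-labeled epochs."""
    return np.diag(cm) / cm.sum(axis=1)


def cohen_kappa(cm):
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the marginal label frequencies of both raters.
    p_expected = (cm.sum(axis=0) @ cm.sum(axis=1)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical two-class example with 50 epochs.
y_true = [0] * 25 + [1] * 25
y_pred = [0] * 20 + [1] * 5 + [0] * 10 + [1] * 15
cm = confusion_matrix(y_true, y_pred, n_classes=2)
acc, sens, kappa = overall_accuracy(cm), sensitivity(cm), cohen_kappa(cm)
```

In this example the observed agreement is 0.70 and the chance agreement is 0.50, giving kappa = 0.40, which illustrates how kappa discounts agreement that would be expected by chance.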

# **EXPERIMENTAL RESULTS**

Performance of the proposed feature extraction approach is evaluated by using the Fisher criteria, i.e., *J*<sub>1</sub> and *J*<sub>2</sub>, to demonstrate the separability. **Table 2** lists the values of *J*<sub>1</sub> and *J*<sub>2</sub>. Both *J*<sub>1</sub> and *J*<sub>2</sub> of the proposed feature extraction approach are larger than those of the conventional PSD feature extraction approach, implying that the proposed feature extraction approach has better separability than the conventional PSD feature extraction approach.

**Table 2 | Separability measurements by using Fisher criterion for the conventional PSD feature extraction and the proposed feature extraction approach.**

For illustration, principal component analysis (PCA; Fukunaga, 1990) is performed to project the proposed features and the conventional frequency PSD features onto their first two principal components (PCs), respectively. **Figures 4A,B** show the scatter plots of the first two PCs from the conventional PSD feature extraction approach and the proposed feature extraction approach, respectively. These figures reveal that the proposed features are better separated in PC space than the conventional PSD features. In the scatter plot of the proposed feature extraction approach, the points from different groups, i.e., wakefulness, N2, N3, and REM, each occupy their own region. The conventional PSD feature extraction approach does not show this pattern; most of its data points are mixed in PC space. Additionally, regardless of the feature extraction approach, most of the N1 data points overlap with N2 and REM, because the EEG characteristics of N1 in the manual scoring rules closely resemble those of N2 and REM.
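A projection onto the first two principal components, as used for such scatter plots, can be obtained from the singular value decomposition of the centered feature matrix. The sketch below is generic; the function name and the random stand-in for a feature matrix are assumptions, and no scaling of features beyond centering is applied.

```python
import numpy as np


def first_two_pcs(X):
    """Project rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:2].T
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return scores, explained


# Random stand-in for an (epochs x features) matrix.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 10)) * np.linspace(3.0, 0.5, 10)
scores, explained = first_two_pcs(features)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by sleep stage, gives the kind of two-dimensional view described above; `explained` reports the fraction of variance carried by each of the two components.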

**Figure 5** and **Table 3** display the average classification performances of the two feature extraction approaches with four classifiers. In terms of both overall accuracy and kappa coefficient, the proposed feature extraction performs significantly better (*p* < 0.05, paired *t*-test) than the conventional frequency PSD features. The proposed feature extraction approach increases overall accuracy and kappa coefficient by approximately 20% over the conventional frequency PSD extraction approach. With RVM, the overall accuracy and kappa coefficient reach 76.7% and 0.682, respectively.

**Figure 4** shows scatter plots of the sleep patterns, while **Table 2** lists the separability values of the Fisher criteria, which compare the features of the conventional PSD extraction with those of the proposed approach. According to the above results and the Fisher criteria, the proposed feature extraction approach has better separability than the conventional PSD feature extraction approach, as also verified by the classification performance in **Table 3**. The rise in classification performance stems from the proposed feature extraction approach taking the manual scoring criteria into account. In particular, the kappa coefficient of the proposed feature extraction approach increases significantly, implying that the proposed method improves both the overall classification performance and the accuracy for each class with a balanced trade-off. For instance, the sensitivity for wakefulness with LDA (conventional PSD feature extraction approach) is 95.7%, i.e., the highest sensitivity of a single class; however, the sensitivity of the other classes is extremely low.

**FIGURE 5 | The classification performance comparison between conventional frequency PSD feature extraction and the proposed feature extraction with LDA, k-NN, SVM, and RVM.** Acc. and kappa represent overall accuracy and Cohen's kappa coefficient, respectively. Error bars indicate standard deviations. For better visualization, the kappa values have been scaled by a factor of 100. ∗*p* < 0.05, ∗∗*p* < 0.01, ∗∗∗*p* < 0.001.

**Table 3 | Comparison of feature extraction approaches in terms of classification performance.**

*Each lattice represents the mean value of the validation index from ten subjects, and the corresponding bracket is the standard deviation. Acc. and kappa represent overall accuracy and Cohen's kappa coefficient, respectively. Highlighted parts display the optimum performance of all comparison results.*

**Table 4** and **Figure 6** summarize the results of SVM and RVM with the RBF kernel and the linear kernel. The optimum result is RVM with the linear kernel, for which the accuracy and kappa reach 76.7% and 0.68, respectively. Comparing RVM with SVM reveals a significant increase for the linear kernel function (overall accuracy, *p* = 0.012, paired *t*-test; kappa coefficient, *p* = 0.024, paired *t*-test); for the RBF kernel function, the increase is significant for overall accuracy (*p* = 0.033, paired *t*-test) but not for the kappa coefficient (*p* = 0.069, paired *t*-test). The two kernel functions in SVM and RVM are also compared. Applying the RBF kernel in SVM yields a ∼1.5% improvement in overall accuracy and kappa coefficient. However, it does not reach a statistically significant level (overall accuracy, *p* = 0.071, paired *t*-test; kappa coefficient, *p* = 0.089, paired *t*-test). For RVM, applying the RBF kernel likewise does not produce a statistically significant difference (overall accuracy, *p* = 0.582, paired *t*-test; kappa coefficient, *p* = 0.658, paired *t*-test).
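The paired *t*-test used in these comparisons treats each subject's two scores as a matched pair. The following sketch shows the computation on hypothetical per-subject accuracies; the numbers are invented for illustration and are not the values obtained in this study.

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject accuracies for two classifiers (ten LOSO folds each).
acc_a = np.array([0.55, 0.60, 0.52, 0.58, 0.61, 0.57, 0.54, 0.59, 0.56, 0.60])
gain = np.array([0.18, 0.22, 0.20, 0.19, 0.21, 0.17, 0.23, 0.20, 0.18, 0.22])
acc_b = acc_a + gain

# Library result: two-sided paired t-test on the matched samples.
result = stats.ttest_rel(acc_b, acc_a)

# The same statistic by hand: mean difference over its standard error.
diff = acc_b - acc_a
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
```

Because the test works on within-subject differences, consistent per-subject gains can be significant even when the between-subject spread of accuracies is large.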

In terms of SVM, although the number of SVs is normally less than the number of training patterns, the number of SVs grows with an increasing number of training patterns, and overfitting occasionally occurs in SVM with a large number of SVs. The number of SVs in SVM is generally larger than the number of relevance vectors (RVs) in RVM, so the two are also compared here. RVM is characterized by having fewer RVs than SVM has SVs.

**Table 4 | Performance comparison of different kernel functions in**

N3 91.3% 89.8% 87.9% 86.7% REM 63.8% 76.0% 77.6% 77.6%

*Each lattice represents the mean value of the validation index from ten subjects, and the corresponding bracket is the standard deviation. Acc. and kappa represent the overall accuracy and Cohen's kappa coefficient, respectively.*

Fewer RVs can avert the overfitting problem. For each subject, the training patterns come from the remaining nine subjects, which explains the variation in the number of training patterns. This work adopts the one-against-all multiclass strategy, which is a "one class vs. all others" method, for SVM and RVM. Therefore, the training population is drawn from all classes. In the one-against-all multiclass strategy, SVM and RVM train an individual decision hyperplane (formed by the SVs or RVs and the corresponding coefficients) for each class. Across the ten subjects, the mean number of training patterns is 7425.9, with a standard deviation of 131.4. **Table 5** shows the mean numbers of SVs and RVs for the linear kernel and the RBF kernel. Regardless of whether the linear kernel or the RBF kernel is adopted, the number of RVs is significantly smaller than the number of SVs (all *p*-values are less than 0.0001 with Student's *t*-test). Whereas the number of trained SVs is in the thousands, the number of trained RVs is below several hundred, a significant difference between RVs and SVs.

To investigate how the class imbalance prior influences the classification performance, this work presents the confusion matrix between the proposed sleep classification approach, in which RVM is used with the linear kernel function, and expert manual scoring (**Table 6**). The overall accuracy and kappa coefficient, as computed from this confusion matrix, are 76.7% and 0.69, respectively. **Table 6** and the performance in **Table 4** demonstrate that the reported validation indices are not biased.

**Figure 7** shows the estimation results of the proposed classification approach and the manual scoring results for one subject (S07). The top plot is the distribution of the sleep stages estimated by the proposed classification approach, and the bottom plot is the distribution of the sleep stages scored by the sleep expert. For this subject, the agreement between the proposed classification approach and the sleep expert's scoring is 82.5% (kappa = 0.77). The proposed classification approach, using only forehead EEG, can reach a performance quite similar to that of the sleep expert.

**FIGURE 6 | The classification performance comparison between SVM and RVM with linear and RBF kernel functions.** Acc. and kappa represent overall accuracy and Cohen's kappa coefficient, respectively. Error bars indicate standard deviations. For better visualization, the kappa values have been scaled by a factor of 100. ∗*p* < 0.05.

**Table 5 | Number of support vectors in SVM and number of relevance vectors in RVM.**


*Each block is the mean from all ten subjects, and the corresponding bracket is the standard deviation.*



#### **DISCUSSION**

Based on the scatter plot and Fisher criteria, this work demonstrates that the proposed feature extraction approach can achieve a better separability than the conventional PSD feature extraction approach. The classification performance also demonstrates that the proposed feature extraction approach is more effective than the conventional PSD feature extraction approach in distinguishing between different sleep stages. The proposed feature extraction approach is designed around the knowledge used in manual sleep stage scoring. Additionally, time-frequency analysis is performed to extract the spectral activities in short segment windows rather than in the whole 30-s signal. The proposed feature extraction approach is characterized not only by the temporal information of manual scoring, but also by its spectral response. Sleep experts distinguish between the different sleep stages by visually assessing the rhythm and amplitude of the EEG, second by second, over the whole 30-s signal. The proposed feature extraction approach applies the short-time Fourier transform with a 1-s window and a 0.5-s overlap. Also, the PSD activity from STFT is grouped into several specific frequency bands, e.g., low delta (1–2 Hz), delta (1–4 Hz), theta (5–7 Hz), alpha (8–12 Hz), sigma (12–14 Hz), beta (15–30 Hz), and gamma (30–50 Hz). From the entire 30-s data, 59 segments are extracted. To reduce individual differences, the PSD of each segment is divided by the total sum of the PSD of that segment. Hence, the PSD activity in the proposed feature extraction approach is a proportional response of PSD. Based on the expert's manual scoring knowledge, several features are extracted from these specific frequency bands of the 59 segments. STFT can capture the spectral activity of a specific frequency response that may be more or less resonant in the spectral space. For instance, the sleep spindle is a sigma rhythm lasting from 0.5 to 1 s and occurring suddenly. If the fast Fourier transform is applied to the whole 30-s EEG signal of an N2 epoch containing one spindle, the specific sigma frequency response cannot be identified clearly. The principal spectral response might be a background spectral response, i.e., the theta spectral response, and the amplitude of PSD within the sigma band might be extremely low.
However, STFT separates the signal into several segments in order to calculate the PSD of each individually. If a spindle occurs, STFT enhances the corresponding spectral activity more than FFT over the whole signal does. Moreover, in addition to using STFT to transform the signals into PSD, this work also proposes sorted power activities to capture the properties of the manual rules for specific frequency spectral activity as the sleep features. The corresponding specific frequency spectral activities are extracted by following the manual scoring rules. For instance, the maximum of the sorted sigma power activities can represent the feature of the sleep spindle. If the current epoch is N2, this maximum is higher than the corresponding maximum from the other sleep stages.

The average PSD of alpha and beta over the upper 50% of the 59 segments is higher in the wake stage than in sleep. The average PSD of the slow wave over the upper 80% of the 59 segments is higher in N3 than in the other stages. Similarly, the other features from different frequency bands can represent the other sleep stages. Berthomier et al. (2007) characterized several contrast functions, which are feature extraction approaches, defined by the EEG power activity calculated directly from the whole epoch. Although Berthomier et al. (2007) considered the baseline resting EEG frequency of each individual to adjust the spectral criteria, the extracted features were still calculated from the whole 30-s EEG signals. The specific frequency activity is still diminished when it is calculated within the whole 30-s signal. Hence, the proposed feature extraction approach is a more effective means of extracting the sleep characteristics.

SVM has recently achieved higher empirical accuracy and better generalization capabilities than other standard supervised classifiers (Fatma Guler and Ubeyli, 2007; Lotte et al., 2007; Xu et al., 2009; Li et al., 2012). However, as mentioned earlier, SVM is limited by the number of SVs and the selection of the penalty parameter, which adjusts the generalization capability. RVM is an extension algorithm that eliminates these disadvantages of SVM. SVM learns the maximal distance of margins between binary classes while, in contrast, RVM learns the maximal probability of margins between binary classes by exploiting a probabilistic Bayesian learning framework. The penalty parameter in SVM is usually determined by cross-validation, and the chosen value depends on the candidate set. If the candidate set has a wide range, selecting the penalty parameter takes considerable training time. The chosen penalty parameter is a local optimum that depends on the candidate set, not the global optimum. The chosen penalty parameter, which affects the number of SVs, can also lead to overfitting in the training phase. RVM can estimate penalty parameters automatically. Additionally, RVM alleviates the problem of the number of SVs. According to our results, the number of RVs is significantly smaller than the number of SVs, and the classification performance is also improved by RVM. Hence, RVM is applied as the basic classifier in the proposed sleep stage classification system.

The detection of N1 is always the most problematic aspect of sleep stage classification (Virkkala et al., 2007). Identifying a significant EEG feature that could separate N1 from wakefulness, N2, and REM is rather difficult because N1 is a transition phase between wakefulness and the other sleep stages (Virkkala et al., 2007). The sleep EEG characteristics of N1 closely resemble those of N2, REM, and resting wakefulness. Our results demonstrate that the features extracted from PSDs in N1 resemble those of N2, REM, and resting wakefulness (**Figure 4**). Moreover, many epochs of N1 are misclassified as Wake, N2, and REM (**Table 6**), illustrating the difficulty of automatically identifying N1 by a computer. Efforts are underway in our laboratory to address this problem.

Recent efforts have attempted to develop a more reliable sleep system with few bioelectrical channels, i.e., one EEG, one ECG, or two EOGs, in order to simplify the complex PSG inspection (Berthomier et al., 2007; Virkkala et al., 2007; Yılmaz et al., 2010). Virkkala et al. (2007) devised an automatic sleep stage classification via two EOGs. With 5 sleep stages, their performance was 72.5% epoch-to-epoch agreement and a Cohen's kappa of 0.63, and the sensitivities for Wake, REM, N1, N2, and SWS were 74.10, 72.7, 39.2, 79.1, and 73%, respectively. Although this work also uses two signal channels, the FP1 and FP2 EEGs, which can also reflect eye movement, carry more information for classifying sleep stages. Hence, the characteristics of both sleep EEG and eye movement are captured. Additionally, the proposed classification system accurately estimates sleep stages. Yılmaz et al. (2010) presented a sleep stage and obstructive apneic epoch classification via single-lead ECG. With 6 sleep stages, their performance was 73.1% epoch-to-epoch agreement, and the sensitivities for Wake, REM, NREM 1, NREM 2, NREM 3, and NREM 4 were 95.6, 84.9, 98.5, 61.8, 94.3, and 87.4%, respectively.
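The epoch-to-epoch agreement, Cohen's kappa, and per-stage sensitivities quoted above are all derived from a confusion matrix of expert scores versus automatic scores. The following is a minimal sketch of those calculations; the function names and the three-stage matrix in the usage note are illustrative assumptions and are not data from any of the cited studies.

```python
def agreement_and_kappa(confusion):
    """Return (overall agreement, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of epochs scored as stage i by the expert
    and as stage j by the automatic classifier.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of epochs on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of the marginal proportions, summed over stages.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


def sensitivities(confusion):
    """Per-stage sensitivity: correctly classified epochs / expert-scored epochs.

    Assumes every stage has at least one expert-scored epoch.
    """
    return [row[i] / sum(row) for i, row in enumerate(confusion)]
```

For a hypothetical matrix `[[40, 5, 5], [4, 30, 6], [6, 5, 49]]`, `agreement_and_kappa` gives an agreement of 119/150 and a kappa of about 0.686, which shows how kappa discounts the agreement expected by chance.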

The classification performance reported by Yılmaz et al. (2010) is satisfactory, even for the sensitivity of NREM 1 (98.5%), which is the most difficult sleep stage to identify automatically (Berthomier et al., 2007; Virkkala et al., 2007). However, Yılmaz et al. (2010) applied 10-fold cross-validation within the data of a single subject, so that the same subject's data serve as both the training data and the testing data. For instance, for a subject with a total of 800 epochs, partitioning produces 10 subsets of 80 epochs each. The training set (720 epochs) and the testing set (80 epochs) therefore contain entirely separate epochs, but both originate from the same subject. Consequently, the properties of the training and testing data resemble each other, which corresponds to the over-fitting problem in the training phase. Notably, using a subject-dependent model built from a specific subject to classify another, independent dataset may yield worse results. LOSO cross-validation is a more objective evaluation approach for machine learning experiments involving human subjects because it allows for subject-to-subject variation: the testing data are subject-independent of the training data. Hence, performance evaluation by LOSO is more effective and reliable than k-fold cross-validation for developing a general model involving human subjects.

As mentioned earlier, the right and left EOG signals are recorded by placing two electrodes at the nasal and temporal canthal regions of the eye, with one electrode attached to the middle of the forehead as the ground electrode and another placed on the left mastoid (M1) as the reference electrode. ECG signals are acquired by two electrodes in a modified lead II configuration (Malmivuo and Plonsey, 1995). The positive and negative leads are placed on the fourth intercostal space and the left of the sternum. Also, both EOG and ECG still require adhesive electrode pads, and the EOG (or ECG) electrodes are located away from the corresponding amplifier. The sleep position may affect the recording quality of the physiological signals. Berthomier et al. (2007) presented an ASSEEGA based on an EEG channel. In terms of performance, the proposed sleep stage classification system is nearly equivalent to the ASSEEGA. However, ASSEEGA still requires 2 bipolar channels (Cz-Pz, international 10–20 standard system), making it infeasible for homecare applications. First, the electrode positions are not easy for a self-operating user to identify, and the electrodes are neither self-adhesive nor self-applicable. Second, conductive gel and a laboratory EEG recording device are still required to record the signals. Although several portable EEG devices [e.g., Mindo-4S (Mindo, Hsinchu, Taiwan), MindWave Mobile (NeuroSky, CA, USA), and Emotiv EPOC headset (Emotiv, Eveleigh NSW, Australia)] and dry sensors can overcome this problem, the comfort of dry electrodes at Cz and Pz is still a major challenge during sleep.
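The contrast drawn above between within-subject k-fold evaluation and LOSO evaluation can be made concrete with two small split generators. This is a sketch only: the function names are hypothetical, the k-fold variant uses contiguous folds for simplicity, and neither function is taken from the cited studies.

```python
def within_subject_kfold(n_epochs, k=10):
    """Yield (train_idx, test_idx) over the epochs of ONE subject.

    Training and testing epochs are disjoint, but all of them come from
    the same subject, so the test data are not subject-independent.
    """
    fold = n_epochs // k
    for f in range(k):
        start = f * fold
        # The last fold absorbs any remainder when n_epochs is not divisible by k.
        stop = start + fold if f < k - 1 else n_epochs
        test = list(range(start, stop))
        train = [i for i in range(n_epochs) if i < start or i >= stop]
        yield train, test


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one split per subject.

    subject_ids[i] names the subject that epoch i was recorded from.
    Every epoch of the held-out subject goes to the test set.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

With 800 epochs and `k=10`, each `within_subject_kfold` split has 720 training and 80 testing epochs, matching the example in the text, whereas `loso_splits` never lets a subject appear on both sides of a split.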

Anderer et al. (2005) developed and optimized an automatic classification system based on a central EEG channel, two EOG channels, and a chin EMG channel; in the final validation, the overall epoch-by-epoch agreement between their automatic classification system and human expert scoring was 80% (Cohen's kappa 0.72). Obviously, such data-rich recordings carry more information, e.g., brain electrical activity during sleep from EEG, muscle activity from EMG, and eye movement from EOGs. Such information-rich physiological data provide more important indices for distinguishing REM from light sleep, i.e., rapid eye movements and the lowest mandible muscle activity. Hence, data-rich recording can achieve excellent performance. Although the performance in this work is not equivalent to that of the classification system of Anderer et al. (2005), the proposed system attempts to reduce the full set of PSG signals to fewer physiological channels while still classifying the sleep stages effectively. Therefore, the proposed system uses forehead EEGs, i.e., FP1 and FP2. FP1 and FP2 have the following advantages: the physiological data include information from both brain electrical activity during sleep and eye movement; and forehead EEGs are feasible for a self-operating user to apply. With the portable EEG device (Lin et al., 2008) and dry sensors (Lin et al., 2011), home-based users can easily put on the EEG headband themselves to record forehead EEG signals during sleep. Furthermore, the collected data can be analyzed further, even leading to the development of on-line sleep stage classification software.

Despite its contributions, this work has certain limitations. The collected data are limited to young and healthy study participants. The sleep stages of older or younger individuals and of patients are expected to diverge from the norm and to be more heterogeneous than those of the participants studied here. The manual sleep stage scoring rules are based on counting the rhythms of different frequency activities. The proposed system attempts to comply with this criterion in order to extract the sleep features. The system also uses STFT in the manner of the temporal and visual scoring rules, i.e., it processes the signal within one-second windows and then extracts features from PSDs representing the rhythms of different frequency activities. Because the system was built from data of healthy subjects, there is no clear evidence that the proposed sleep stage classification is reliable in older individuals or patients. Therefore, efforts are underway in our laboratory to study the relation between patients and the proposed sleep stage classification system.
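The idea of processing the signal in one-second windows and summarizing each window by the power in different frequency rhythms can be sketched as follows. This is not the feature extraction method of the proposed system: the band limits in `BANDS` are conventional EEG bands assumed here for illustration, a rectangular window is used, and the transform is a naive DFT written out so that the example needs only the standard library.

```python
import cmath
import math

# Hypothetical band limits in Hz (half-open intervals); assumed, not from the paper.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}


def window_power_spectrum(samples):
    """Unnormalised power at each non-negative DFT bin of one window."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def band_powers(signal, fs, bands=BANDS):
    """Return one dict of band powers per non-overlapping one-second window.

    fs is the integer sampling rate in Hz. With a window of exactly fs
    samples the frequency resolution is 1 Hz, so DFT bin k is k Hz.
    """
    features = []
    for start in range(0, len(signal) - fs + 1, fs):
        spec = window_power_spectrum(signal[start:start + fs])
        features.append({name: sum(spec[lo:hi]) for name, (lo, hi) in bands.items()})
    return features
```

Applied to a pure 10 Hz sine sampled at 64 Hz, every window's largest band power falls in the assumed alpha band, which is the kind of rhythm-specific feature the paragraph describes.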

# **CONCLUSION**

This work presents a novel sleep stage classification system, consisting of a novel feature extraction method and an RVM classifier, based on only two forehead EEG channels. The classification performance is consistent with the sleep clinician's expert knowledge. Experimental results demonstrate the feasibility of using the output of the proposed system as preliminary screening results for a preclinical diagnosis, assisting clinicians in making a diagnosis (rather than requiring in-depth testing with a PSG system in a sleep laboratory) and reducing the time needed for the procedure. Moreover, the proposed system uses only two forehead EEG signals, allowing a wearable and wireless EEG recording device (Lin et al., 2008; Liao et al., 2012) [e.g., Mindo-4S (Mindo, Hsinchu, Taiwan) and MindWave Mobile (NeuroSky, CA, USA)] to record the patient's EEG signals at home. Importantly, the proposed system provides an easier way to conduct large population studies, long-term sleep monitoring, and home-based daily care. Efforts are underway in our laboratory to integrate the wearable and wireless EEG recording device with the proposed sleep stage classification system. Automatic artifact detection is one possible way to improve the efficacy of the proposed system and should be considered in future work.

### **ACKNOWLEDGMENTS**

This work was supported in part by the Aiming for the Top University Plan of National Chiao Tung University, the Ministry of Education, Taiwan, under Contract 102W9633, and in part by UST-UCSD International Center of Excellence in Advanced Bioengineering sponsored by the Ministry of Science and Technology I-RiCE Program under Grant Number: NSC-102-2911-I-009-101. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0022. The views and the conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

# **REFERENCES**


interface device with novel dry foam-based sensors. *J. Neuroeng. Rehabil.* 9:5. doi: 10.1186/1743-0003-9-5


making method. *IEEJ Trans. Electron. Inf. Syst.* 129, 614–619. doi: 10.1541/ieejeiss.129.614


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 17 February 2014; accepted: 05 August 2014; published online: 04 September 2014.*

*Citation: Huang C-S, Lin C-L, Ko L-W, Liu S-Y, Su T-P and Lin C-T (2014) Knowledge-based identification of sleep stages based on two forehead electroencephalogram channels. Front. Neurosci. 8:263. doi: 10.3389/fnins.2014.00263*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Huang, Lin, Ko, Liu, Su and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Brain fingerprinting classification concealed information test detects US Navy military medical information with P300

#### *Lawrence A. Farwell<sup>1</sup>\*, Drew C. Richardson<sup>2</sup>, Graham M. Richardson<sup>3</sup> and John J. Furedy<sup>4</sup>*

*<sup>1</sup> Brain Fingerprinting Laboratories, Inc./Brain Fingerprinting, LLC, Seattle, WA, USA*

*<sup>2</sup> Federal Bureau of Investigation, FBI Laboratory, Quantico, VA, USA (at the time of the research)*

*<sup>4</sup> Department of Psychology, University of Toronto, Toronto, ON, Canada*

#### *Edited by:*

*Anne-Marie Brouwer, Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *Reviewed by:*

*Maja Stikic, Advanced Brain Monitoring, Inc., USA*

*Maarten Andreas Hogervorst, Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *\*Correspondence:*

*Lawrence A. Farwell, Brain Fingerprinting Laboratories, Inc., 14220 37th Ave. NE, Seattle, WA 98125, USA e-mail: brainwave@larryfarwell.com*

A classification concealed information test (CIT) used the "brain fingerprinting" method of applying P300 event-related potential (ERP) in detecting information that is (1) acquired in real life and (2) unique to US Navy experts in military medicine. Military medicine experts and non-experts were asked to push buttons in response to three types of text stimuli. Targets contain known information relevant to military medicine, are identified to subjects as relevant, and require pushing one button. Subjects are told to push another button to all other stimuli. Probes contain concealed information relevant to military medicine, and are not identified to subjects. Irrelevants contain equally plausible, but incorrect/irrelevant information. Error rate was 0%. Median and mean statistical confidences for individual determinations were 99.9% with no indeterminates (results lacking sufficiently high statistical confidence to be classified). We compared error rate and statistical confidence for determinations of both information present and information absent produced by classification CIT (Is a probe ERP more similar to a target or to an irrelevant ERP?) vs. comparison CIT (Does a probe produce a larger ERP than an irrelevant?) using P300 plus the late negative component (LNP; together, P300-MERMER). Comparison CIT produced a significantly higher error rate (20%) and lower statistical confidences: mean 67%; information-absent mean was 28.9%, less than chance (50%). We compared analysis using P300 alone with the P300 + LNP. P300 alone produced the same 0% error rate but significantly lower statistical confidences. These findings add to the evidence that the brain fingerprinting methods as described here provide sufficient conditions to produce less than 1% error rate and greater than 95% median statistical confidence in a CIT on information obtained in the course of real life that is characteristic of individuals with specific training, expertise, or organizational affiliation.

**Keywords: P300, concealed information test, brain fingerprinting, P300-MERMER, ERP, LNP, event-related potential, detection of concealed information**

# **INTRODUCTION**

#### **THE CLASSIFICATION CIT**

The concealed information test (CIT) or guilty knowledge test (GKT) has been used to detect concealed information since Lykken (1959). Until the 1980s, the dependent measures were autonomic nervous system (ANS) responses. The ANS-based CIT is a *comparison* CIT (Lykken, 1959). The comparison CIT compares the responses to crime- or situation-relevant and irrelevant items. If the responses to the relevant items are larger, then the determination is made that the subject knows the relevant information. ("Larger" is variously defined.) Otherwise, the determination is made that the subject does not know the information.

Farwell and Donchin (1991) introduced three innovations in the CIT (Farwell, 2013). They (1) applied a classification CIT, rather than the conventional comparison CIT; (2) used event-related brain potentials (ERPs) as the dependent measure; and (3) computed a statistical confidence for each individual determination using the technique of bootstrapping. (Farwell and Donchin, 1991 was preceded by abstracts on the same studies, Farwell and Donchin, 1986, 1988a). Several researchers subsequently applied ERPs and bootstrapping in a comparison CIT (e.g., Johnson and Rosenfeld, 1992; Rosenfeld et al., 2004, 2008; Meixner and Rosenfeld, 2014). This is a fundamentally different paradigm (see Discussion and Appendix 2).

*<sup>3</sup> Department of Cell and Developmental Biology, Vanderbilt University, Nashville, TN, USA*

In the classification CIT, three types of stimuli are presented. (1) "Probes" are relevant to the investigated situation. Probes contain information that the subject has no way of knowing other than participation in the investigated situation (and, in field cases, that the subject denies knowing or recognizing as being crime-relevant). (2) "Targets" also may be relevant to the investigated situation. (In all our recent applications including the study reported here, they are.) Experimental protocols ensure that the subject knows the targets, for reasons other than participation in the investigated situation. The response to targets provides a template for the subject's response to known, situation-relevant information. (3) "Irrelevants" contain irrelevant information. The response to irrelevants provides a template for the subject's response to irrelevant information. If the ERP response to the probes is mathematically classified as being more similar to the ERP response to the known, relevant target information than to the irrelevants, the subject is determined to be "information present" with respect to the information contained in the probes. If the ERP response to the probes is more similar to the ERP response to the irrelevant information than to the targets, the subject is determined to be "information absent." If the probe ERP response cannot be classified with a high statistical confidence as being more similar to either the target or the irrelevant response, no determination is made; the outcome is "indeterminate."
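The three-way decision described above (information present, information absent, or indeterminate) can be illustrated with a bootstrap over single trials. The sketch below is an assumption-laden illustration rather than the authors' algorithm: it measures similarity with a plain Pearson correlation between averaged waveforms, and the function names, the number of bootstrap iterations, and the single `threshold` value are all hypothetical choices.

```python
import random
from statistics import mean


def average_erp(trials):
    """Average a list of equal-length single-trial waveforms sample by sample."""
    return [mean(samples) for samples in zip(*trials)]


def pearson(a, b):
    """Pearson correlation of two equal-length waveforms (neither may be constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def classify_probe(probes, targets, irrelevants, n_boot=200, threshold=0.9, seed=0):
    """Return (determination, statistical confidence) for one subject.

    Each argument is a list of single-trial waveforms. On every bootstrap
    iteration the trials of each type are resampled with replacement and
    averaged, and the probe average is compared with the other two averages.
    """
    rng = random.Random(seed)
    closer_to_target = 0
    for _ in range(n_boot):
        p = average_erp(rng.choices(probes, k=len(probes)))
        t = average_erp(rng.choices(targets, k=len(targets)))
        i = average_erp(rng.choices(irrelevants, k=len(irrelevants)))
        if pearson(p, t) > pearson(p, i):
            closer_to_target += 1
    present_conf = closer_to_target / n_boot
    if present_conf >= threshold:
        return "information present", present_conf
    if 1.0 - present_conf >= threshold:
        return "information absent", 1.0 - present_conf
    # Neither similarity is established with enough confidence.
    return "indeterminate", max(present_conf, 1.0 - present_conf)
```

The fraction of bootstrap iterations in which the probe average is closer to the target average plays the role of the statistical confidence for an information-present determination, and its complement plays that role for an information-absent determination.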

The classification CIT with ERPs can be applied in two different types of tests. Specific issue tests detect knowledge of a specific event such as a crime. Specific screening or focused screening tests detect knowledge relevant to specific training or expertise, or inside knowledge of a particular organization or group. This study is a specific screening test conducted in collaboration with the US Central Intelligence Agency (CIA) and the US Navy. The information detected is military medical knowledge in US Navy military medical experts.

We compared the classification CIT with the comparison CIT by analyzing our data with both methods. Farwell and Donchin (1991) used the classification CIT with the P300, an electrically positive event-related potential (ERP) maximal at the parietal midline area of the head that is elicited when a subject recognizes and takes note of a stimulus that is significant in the current context. In this research we compared the results of using the P300 alone vs. the P300 plus the late negative component (LNP; together, P300-MERMER, memory and encoding related multifaceted electroencephalographic response)<sup>1</sup>. We compared error rate and statistical confidences produced by the classification CIT with the results of analysis applying the comparison CIT on the same data. We investigated whether the classification CIT provides significantly lower error rate and higher statistical confidences than the comparison CIT.

#### **PREVIOUS RESEARCH ON SUFFICIENT AND NECESSARY CONDITIONS FOR A VIABLE CIT FOR REAL-WORLD FIELD USE**

In our view, in order to be considered reliable, an ERP-based CIT must reliably produce less than 1% error rate and median statistical confidences of greater than 95% for individual determinations, including median statistical confidences greater than 90% for both information-present and information-absent determinations, across different field and laboratory conditions. These same criteria, in our view, are the minimum criteria required to effectively and ethically apply a technique in criminal investigations or any application with non-trivial consequences.
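The criteria stated above can be written as a single check over a set of individual determinations. This is only a restatement of the thresholds in the preceding paragraph; the function name and the representation of confidences as fractions between 0 and 1 are assumptions made for the sketch.

```python
from statistics import median


def meets_reliability_criteria(n_errors, n_determinations, present_conf, absent_conf):
    """Check the error-rate and median-confidence criteria from the text.

    present_conf and absent_conf are lists of statistical confidences
    (fractions in [0, 1]) for information-present and information-absent
    determinations; both lists must be non-empty.
    """
    error_rate = n_errors / n_determinations
    overall_median = median(present_conf + absent_conf)
    return (
        error_rate < 0.01                 # less than 1% error rate
        and overall_median > 0.95         # median confidence above 95% overall
        and median(present_conf) > 0.90   # above 90% for information present
        and median(absent_conf) > 0.90    # above 90% for information absent
    )
```

A check of this kind applies to one study at a time; the requirement that the criteria hold across different field and laboratory conditions would mean running it on each study separately.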

Farwell and Donchin's (1991) method provided the sufficient conditions to meet these requirements. They only established sufficient conditions, and did not investigate which conditions were necessary. Since then, research has progressed substantially in two parallel, largely non-overlapping series of studies. One series of studies has investigated the sufficient conditions to meet these criteria under varying field and laboratory conditions. Another series of studies has investigated the necessary conditions.

Eight previous peer-reviewed studies conducted by six researchers in four laboratories have applied a specific set of methods in the ERP-based CIT [Farwell and Donchin, 1991 (two studies); Allen and Iacono, 1997; Farwell and Smith, 2001; Farwell et al., 2013 (four studies); see also Iacono, 2008]. These specific methods are the only methods that have reliably produced less than 1% error rate and median 95% statistical confidences for individual determinations, including over 90% for both information-present and information-absent determinations, in the laboratory and the field. These same error rates and statistical confidences have been achieved with countermeasures, without countermeasures, and in field conditions where it is unknown whether countermeasures are being used or not (Farwell et al., 2013). (Countermeasures are physical or mental procedures that a subject may practice in an attempt to influence the outcome of a test. They were not studied in this research).

The methods applied in these studies are the same as in the original Farwell and Donchin studies, with several improvements based on the more demanding requirements of field applications, as described below. Farwell (2012) documented these methods (or rules or recommendations) as 20 brain fingerprinting scientific standards (Appendix 1). Farwell (1994, 2012) defined "brain fingerprinting" as the classification-CIT technique incorporating the 20 standards. These methods are applied here<sup>2</sup>. We have focused our previous research primarily on establishing the sufficient conditions because this provides a technique we can use, and have successfully used, in the field. Previous studies by others investigated the necessary conditions to obtain low error rates and high statistical confidences (Farwell, 2012, 2014; Appendix 2).

#### **P300 AND LNP**

Farwell and Donchin (1991) used the classification CIT with the P300. In this research we compared the results of using the P300 alone vs. the P300 plus the LNP. Our rationale for this is as follows.

Early P300 research (e.g., Sutton et al., 1965) used very simple stimuli, such as auditory clicks and tones. As the sophistication of experimental designs progressed, more complex stimuli were used, including simple words and phrases presented visually. The latency of P300 was found to increase with stimulus complexity and the concomitant stimulus evaluation time (Magliero et al., 1984). With simple words and phrases, an inter-stimulus interval (ISI) of 1000–1500 ms or less was adequate for the subject to process the stimuli and to capture the entire ERP response (Farwell and Donchin, 1988b). Farwell and Donchin (1991), for example, used phrases consisting of two, one-syllable words, and an ISI of 1500 ms.

<sup>1</sup>To obtain statistical confidences for each individual determination, Farwell, Donchin, Wasserman, and Bockenholt collaborated to introduce the statistical technique of bootstrapping in the field of psychophysiology. This first was published in conference abstracts by Farwell and Donchin (1986, 1988a), and full papers by Wasserman and Bockenholt (1989) and Farwell and Donchin (1991), which reported the same studies as the abstracts. Wasserman and Bockenholt used Farwell and Donchin's application of bootstrapping in the classification CIT as an example of the correct use of the technique.

<sup>2</sup>The study reported here appeared first as a conference abstract, Farwell and Richardson (2006).

In conducting research at the FBI in 1993, however, Farwell et al. (2013; Farwell, 1994, 1995a,b) had the task of developing text stimuli that accurately represented knowledge unique to FBI agents. This required some stimuli to be several long words. To give subjects time to fully process the stimuli, we extended the ISI to 3000 ms. Under these conditions, we found that the positive P300 peak was followed by a negative peak with a peak latency of up to 1200 ms, which we termed the late negative potential (LNP).

The stimuli we used in this study and in previous research (Farwell and Smith, 2001; Farwell et al., 2013) were more personally significant than the stimuli presented in most previous P300 research. LNP may be driven at least in part by this personal significance. Compared to many previous P300 studies, our stimuli may also be more salient, be more related to previous memories, require more complex processing, and involve a task more important to the subject. They are also presented with a longer ISI than that applied in most previous P300 studies. Further research is necessary to identify the antecedent conditions and delineate the functional significance of the LNP.

We called the overall pattern of the P300 followed by the LNP in the time domain, along with concomitant changes in the frequency domain and other putative changes measurable by other mathematical methods, a P300-MERMER. Although the P300-MERMER (and, for that matter, the P300) may be comprised of additional features that are not visible in the time domain (Farwell, 2012; Farwell et al., 2013), the time-domain pattern is sufficient to define and to detect the response. This pattern consists of a positive peak followed by a negative peak (or a negative-positive-negative pattern if the N200, a well-known negative component that generally precedes the P300, is included).

We compared results obtained using the P300 alone with results obtained by including the P300 plus the LNP. Our computations consider only the conventional, time-domain characteristics of the signals. The difference between our two epoch-related analyses is the length of the epoch analyzed, and therefore the inclusion or exclusion of the LNP and its amplitude, morphology, and latency.
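Because the two analyses differ only in the length of the epoch analyzed, the distinction can be expressed as two time windows applied to the same averaged waveform. The sketch below assumes that the waveform starts at stimulus onset; the function name and, in particular, the window limits are hypothetical placeholders chosen for illustration and are not the values used in this study.

```python
# Hypothetical analysis windows in milliseconds after stimulus onset.
# These limits are placeholders, not values reported in this section.
P300_WINDOW_MS = (300, 900)        # short epoch: positive peak only
P300_LNP_WINDOW_MS = (300, 1800)   # long epoch: positive peak plus late negativity


def slice_epoch(samples, fs_hz, window_ms):
    """Return the samples of one waveform that fall inside window_ms.

    samples[0] is assumed to be the sample at stimulus onset and fs_hz is
    the sampling rate in Hz.
    """
    start_ms, stop_ms = window_ms
    start = int(round(start_ms * fs_hz / 1000))
    stop = int(round(stop_ms * fs_hz / 1000))
    return samples[start:stop]
```

Passing the output of `slice_epoch` for either window to the same classification routine keeps everything else in the analysis identical, so any difference in results is attributable to including or excluding the later part of the response.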

In the early 1990s, when Farwell et al. (2013) first encountered the LNP that follows the P300, we initially hypothesized that LNP was an artifact, perhaps generated by the analog filters and the return of the P300 to baseline. The data contradicted this hypothesis, however. If the LNP were an artifact produced by the filter's effect on the P300, then similar P300s with identical filters would produce similar LNPs. We found that the latency, amplitude, and morphology of the LNP varied independently of the P300. Also, the scalp distribution of the LNP was more frontal than that of the P300. Moreover, the negative peak persisted when we varied our filter settings (Farwell, 1994, 2012). Even recording without analog filters did not eliminate the LNP, or substantially change its characteristics. This definitively disproved the filter-generated-artifact hypothesis (Farwell et al., 2013).

The data we recorded with filters also contradict the hypothesis that the LNP is an artifact. We used the same recording equipment for all subjects and all scalp sites. If the LNP were an artifact produced by the equipment, the same equipment would produce identical effects in different scalp sites and different subjects. The features of the LNP would be a function of the features of the P300. This was not observed. For different scalp sites in one subject, and for different subjects, the relative amplitude, latency, and morphology of the LNP and the P300 were very different. Sometimes there was a difference of hundreds of milliseconds in the latency, and amplitude differences of a factor of two or more, in LNPs that followed virtually identical P300s recorded from different subjects (Farwell et al., 2013). In some cases the LNP was considerably larger than the P300 at one scalp site (Fz) and considerably smaller than the P300 at another (Pz) for the same subject. In short, the data contradict the hypothesis that the LNP (or the latter part of the P300-MERMER) is an artifact produced by some combination of the P300, the return to baseline after the P300, and the filters and other equipment.

In the current paradigm, a negative peak (the N200) precedes the P300 positive peak, and another negative peak (the LNP) follows the P300. Our first observation of this tri-phasic negative-positive-negative morphology in the ERP response was in the early 1990s (Farwell, 1994, 2012; Farwell and Smith, 2001; Farwell et al., 2013). Others applying intracranial recordings have observed this same negative-positive-negative pattern in a number of brain structures (Halgren et al., 1998; Linden, 2005). These include dorsolateral and orbital frontal cortices, anterior cingulate (Baudena et al., 1995), amygdala and hippocampus (Halgren et al., 1986; Stapleton and Halgren, 1987), superior temporal sulcus (Halgren et al., 1995), and inferior parietal lobe/supramarginal gyrus (Smith et al., 1990).

Others investigating the ERP-based CIT, including Meijer et al. (2007), have also reported the LNP. Brouwer et al. (2010) observed the LNP and investigated its utility in brain-computer interfaces. Several other studies (Matsuda et al., 2009; Gamer and Berti, 2010, 2012) reported a difference in the N200 in responses to relevant stimuli in ERP-based CITs. Virtually all researchers conducting research on ERP-based CITs now include in their data-analysis algorithms both the P300 and the LNP (for reviews, see Farwell, 2012, 2014), although some refer to the entire response including both positive and negative peaks as "P300" (e.g., Rosenfeld et al., 2008) and some refer to the positive peak as "P300" and the entire response as "P300-MERMER" or "P300 + LNP" (e.g., Sutton et al., 1965; Farwell, 2012, 2014; Farwell et al., 2013).

Changes in the frequency domain and other changes in the dimensionality and other characteristics of the signal may be included in the term "P300-MERMER." The positive and negative time-domain changes constituting the P300 and the LNP are sufficient to detect and characterize the response, and are all that are measured in this research, although they undoubtedly do not constitute a complete and comprehensive description of all the patterns of electrophysiological activity that manifest the underlying information-processing brain activity (Farwell, 1994, 2012; Farwell and Smith, 2001).

We compared the error rate and statistical confidences produced by data analysis including the P300 plus the LNP with the results of analysis using the P300 alone. We investigated whether the classification-CIT analysis with the P300 plus the LNP provides significantly lower error rate and/or higher statistical confidences than the analysis with the P300 alone.

#### **SUMMARY OF RESEARCH QUESTIONS**

Our primary research questions are as follows:


# **MATERIALS AND METHODS**

#### **SUBJECTS**

We tested 16 experts (information present) and 14 non-experts (information absent) in military medicine. Experts were students and faculty at Uniformed Services University of the Health Sciences (USUHS) possessing professional knowledge of military medicine. Non-experts lacked this specific expertise and training. Mean age of the 30 subjects was 26 years; standard deviation was 2.9. Mean ages of information-present and information-absent subjects were 27 and 25, respectively; standard deviations were 3.2 and 2.6, respectively. Fifteen subjects (8 information present) were female.

Experimental procedures were approved by the Brain Fingerprinting Laboratories, Inc., ethics committee and performed in accordance with the ethical standards of the 1964 Declaration of Helsinki, including written informed consent prior to participation.

#### **STIMULI**

Three types of stimuli consisting of words or phrases were presented on a computer screen: probes, targets, and irrelevants. Probes contain specific information relevant to the investigated situation. The test is designed to detect the subject's knowledge or lack of knowledge of the information contained in the probes as relevant in the context of the investigated situation.

In this specific screening study, the relevant information detected was known only to experts in military medicine. Information was obtained from interviews with USUHS military medical experts. Individuals interviewed were not tested. Probe stimuli contained the relevant information to be detected. We presented two additional types of stimuli. Responses to target stimuli provide a template for the subject's brain response to known information relevant to the investigated situation. Responses to irrelevant stimuli provide a template for the subject's brain response to irrelevant information. Target stimuli present information relevant to the investigated situation that is known to be known to the subject. There are significant, proven advantages to using targets that are relevant to the investigated situation rather than inherently irrelevant targets that are made relevant only by task instructions (Farwell, 2012; Farwell et al., 2013), although we and others have successfully used the latter (Farwell and Donchin, 1991). Target stimuli, unlike probes, were identified as such to the subject in experimental instructions. Subject instructions also conveyed the significance of each target in the context of the investigated situation, and required a different behavioral response to targets than to probes and irrelevants, as described in the next section.

For each probe (and each target) comparable irrelevants were structured that contained similar, plausible, but incorrect information about the investigated situation. For a subject lacking the relevant knowledge contained in the probes, the irrelevants and probes were equally plausible as correct, relevant details. Each probe and its comparable irrelevants were indistinguishable for a subject lacking the information that the test was structured to reveal. Each probe contained correct, relevant information fitting the description of that probe. The two irrelevants comparable to each probe contained incorrect information that would be plausible as fitting that same description for an individual lacking the information contained in the probes. For example, a probe stimulus could be the technical name of a military medical procedure in which experts are trained. Corresponding irrelevants could be technical terms that do not name any real procedure. For security reasons, the exact stimuli cannot be given. Subjects were provided with a description of each probe that specified the significance of the probe in the context of the investigated situation, but were not informed which was the correct, situation-relevant probe and which were the corresponding irrelevants.

Similarly, each target stimulus contained correct, situation-relevant information, and the two irrelevant stimuli comparable to each target contained comparable, incorrect but plausible information. Unlike probes, targets were identified as such in instructions to the subjects.

Stimuli were constructed in groups of six: one probe, one target, and four irrelevants. For each probe there were two comparable irrelevants. For each target there were two comparable irrelevants. We used a ratio of 1/6 targets, 1/6 probes, and 2/3 irrelevants so targets and probes were relatively rare, which is known to enhance P300 amplitude (Farwell and Donchin, 1991).

Our prediction was that targets would elicit a large P300 + LNP (or P300-MERMER) in all subjects, irrelevants would not elicit a large P300 + LNP, and probes would elicit a large P300 + LNP only in information-present subjects. Thus, for information-present subjects, ERP responses to probes would be similar to ERPs for targets. For information-absent subjects, ERP responses to probes would be similar to ERPs for irrelevants.

There were 32 unique probes, 32 unique targets, and 128 unique irrelevants, a total of 192 unique stimuli. These comprised 32 groups of stimuli, each consisting of one probe, one target, and four irrelevants. 20 probes were words or phrases embodying the relevant knowledge; 12 were acronyms. The same stimuli were presented to all subjects. Each unique stimulus was presented more than once, so the total number of stimulus presentations was greater than the total number of unique stimuli.

### **PROCEDURE**

Before the test, we made certain that the subject understood the significance of the probes. We described the significance of each probe to the subject. We then showed the subject the probe and the corresponding irrelevants, in the context of the description of the significance of the probe, without revealing which was the probe. Thus, subjects were informed of the significance of each probe stimulus, but were not told which stimulus was the probe and which were corresponding irrelevants. For example, subjects were told, "One of these three items is the term for a medical technique applied to burn victims in battlefield situations" followed by a list of one probe and two irrelevants (in alphabetical order). Although the descriptions of the probes were made known to subjects, the probe stimuli themselves were never identified as probes.

Targets were explicitly identified to the subjects. Experimental instructions ensured that the subject knew the targets and their significance in the context of the investigated situation. We described the significance of each target to the subject. We showed the subject each target and the corresponding irrelevants, in the context of the description of the significance of the target. We also showed subjects a list of the targets, and noted that subjects would be required to recognize the targets during the test. We instructed subjects to press a button with one thumb in response to targets, and another button with the other thumb in response to "all other stimuli." The subject's task was to read and comprehend each stimulus, and then to indicate by a button press whether the stimulus was a target stimulus or not.

For a subject possessing the knowledge embodied in the probes, "all other stimuli" consisted of two types of stimuli: probes containing the known situation-relevant information, and irrelevant stimuli. For a subject lacking the tested knowledge, "all other stimuli" appeared equally irrelevant. Probes were indistinguishable from irrelevants. For "all other stimuli" (that is, everything except targets), the subject was instructed to push the opposite button from the one pushed in response to targets. This instruction applied whether the subject perceived these as a single category (all equally irrelevant, if the subject was information absent) or as two categories (irrelevant, and relevant to the concealed information being tested, if the subject was information present).

The differential button-press task in response to every stimulus presentation ensured that the subject was required to read and comprehend every stimulus, including the probe stimuli, and to prove behaviorally that he had done so on every trial. This allowed us to avoid depending on detecting brain responses to assigned tasks that the subject could covertly avoid doing, while performing the necessary overt responses (see Appendix 2).

Testing was divided into separate blocks. In each block the computer display presented 72 stimulus presentations or trials. In blocks 1–3, four stimulus groups were presented in each block, that is, in each block there were four unique probes, four unique targets, and 16 unique irrelevants. Each stimulus was presented three times in a block to make the total of 72 stimulus presentations per block. Stimuli were presented in random order. In blocks 4–7, five stimulus groups (30 unique stimuli) were presented in each block in random order until 72 trials had been presented. (Since 72 trials cannot be divided evenly among 30 unique stimuli, some randomly selected stimuli were presented 2 times and some 3.) In blocks 1–3, stimuli were acronyms. In blocks 4–7, stimuli were words and phrases.

Immediately before each block, we repeated the description of the significance of each of the probes and targets that were to appear in each block (but not the actual stimuli). For example, "In this test you will see the term for a medical technique applied to burn victims in battlefield situations, a medical instrument applied in field wound treatments, a type of injury sustained from exposure to chemical weapons, and the name of the individual who developed the preferred treatment for exposure to sarin gas."

Stimuli were presented for 300 ms at an ISI of 3000 ms. A fixation point ("X") was presented for 1000 ms prior to each stimulus. For each trial, the sequence was a fixation point for 1000 ms, the stimulus (target, probe, or irrelevant) for 300 ms, a blank screen for 1700 ms, and then the next fixation point.

Trials contaminated by artifacts generated by eye movements or muscle-generated noise were rejected on-line, and additional trials were presented until 72 artifact-free trials were obtained. Trials with a range of greater than 97.7 microvolts in the EOG channel were rejected. Data for "rejected" trials were collected and recorded, but rejected trials did not contribute to the count of trials presented, so each rejection resulted in an additional stimulus presentation. In 7 blocks, a total of 84 probe, 84 target, and 336 irrelevant artifact-free trials were collected, for a grand total of 504 trials. (Previous research, e.g., Fabiani et al., 1987, has shown that repeating the stimuli does not substantially affect the relevant brain response.)
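The rejection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `(label, eog_samples)` trial format are hypothetical, and the only value taken from the text is the 97.7 microvolt range threshold and the 72-trial block length.

```python
EOG_RANGE_LIMIT_UV = 97.7  # rejection threshold stated in the text (microvolts)


def is_artifact(eog_uv):
    """True if the peak-to-peak EOG range within the trial exceeds the threshold."""
    return max(eog_uv) - min(eog_uv) > EOG_RANGE_LIMIT_UV


def collect_block(trial_source, n_required=72):
    """Draw trials until n_required artifact-free trials have been collected.

    trial_source: iterable of (label, eog_samples) tuples (hypothetical format).
    Returns (accepted, rejected). Rejected trials are kept on record but do not
    count toward n_required, so each rejection costs one extra presentation.
    """
    accepted, rejected = [], []
    for label, eog in trial_source:
        if is_artifact(eog):
            rejected.append((label, eog))
        else:
            accepted.append((label, eog))
        if len(accepted) == n_required:
            break
    return accepted, rejected
```

With seven blocks of 72 accepted trials each, this bookkeeping yields the 504 artifact-free trials reported above, regardless of how many trials were rejected along the way.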

Brain responses were recorded from the midline frontal, central, and parietal scalp locations (Fz, Cz, and Pz, International 10–20 system) referenced to linked mastoids, and from a location on the forehead to track eye movements. Med Associates silver-silver chloride disposable electrodes were held in place by a custom headband.

Data were digitized at 333 Hz, and resampled at 100 Hz off-line for analysis. Electroencephalograph (EEG) data were amplified at a gain of 50,000 using custom amplifiers. Electro-oculograph (EOG/eye movement) data were amplified at a gain of 10,000. Impedance did not exceed 10 kilohm. Analog filters passed signals between 0.1 and 30 Hz. Data were stored on disk for off-line analysis.

### **DATA ANALYSIS**

We analyzed ERP data from the Pz scalp site. Data were digitally filtered using a 49-point, equal-ripple, zero-phase-shift, optimal, finite impulse response, low-pass filter with a passband cutoff frequency of 6 Hz and a stopband cutoff frequency of 8 Hz (Farwell et al., 1993). Trials with a range of greater than 97.7 microvolts in the EOG channel were excluded from analysis. We decided on this threshold based on our previous experience (Farwell and Donchin, 1991; Farwell et al., 2013). In exploratory data analysis, we have varied this threshold considerably, and the results are robust even if we change this parameter within quite a wide range.

For each subject's data we conducted two separate classification-CIT analyses applying bootstrapping as described below. One analysis used the positive P300 peak followed by the LNP, a later negative peak (together also known as the P300-MERMER). A second analysis included only the positive P300. The P300 + LNP epoch was defined as 300–1800 ms after stimulus onset. The P300 epoch was 300–900 ms after stimulus onset. The two analyses were identical except for the epoch analyzed. A third analysis applied bootstrapping with the comparison CIT on the full P300 + LNP epoch, as in previous ERP studies with the comparison CIT.

The data analysis produced three sets of results for each subject: (1) a determination of information present or information absent along with a statistical confidence for the determination using the classification CIT and the full P300 + LNP epoch; (2) a comparable determination and statistical confidence using the P300 alone with the classification CIT; and (3) a comparable determination and statistical confidence using the comparison CIT on the full epoch. This allowed us to compare the error rate/accuracy and statistical confidence provided by (a) the P300 + LNP vs. the P300 alone in a classification CIT, and (b) the classification CIT vs. the comparison CIT.

#### **BOOTSTRAPPING**

#### *Classification-CIT bootstrapping method*

The primary data-analysis task was to determine whether the ERP responses to the probe stimuli contained a large P300 and LNP similar to that elicited by the targets, or whether the probe responses lacked a large P300 and LNP, like the irrelevants.

We used bootstrapping (Wasserman and Bockenholt, 1989; Farwell and Donchin, 1991; Farwell et al., 2013) to determine whether the probe responses were more similar to the target responses or to the irrelevant responses, and to compute a statistical confidence for this determination for each individual subject. The metric for similarity was double-centered correlation.

The bootstrapping procedure accomplished two goals: (1) to take into account the variability across single trials, while also maintaining the smooth and relatively noise-free shape provided by signal averaging; (2) to isolate the critical variable—knowledge of the information embodied in the probes—by classifying the responses to the probe stimuli as being either more similar to the target responses or to the irrelevant responses. We conducted two classification-CIT analyses, one using only the P300 and one using the P300 plus the LNP (together also known as the P300-MERMER).

Briefly, the bootstrapping procedure for the classification CIT is as follows. We repeat the following procedure 1000 times. Randomly sample, with replacement, P probes, T targets, and I irrelevants, with P, T, and I equal to the total number of probe, target, and irrelevant trials in the data set, respectively, and average the sampled trials of each type. In each iteration, compare the probe-target correlation with the probe-irrelevant correlation, both computed on these sampled averages. Count the number of times that the probe-target correlation is greater than the probe-irrelevant correlation, and convert this to a percentage. This is the probability that the probe response is more similar to the target response than to the irrelevant response, or the probability that information present is the correct determination. 100% minus this is the probability that the probe response is more similar to the irrelevant response, or the probability that information absent is the correct determination.
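The procedure just described can be illustrated with a minimal Python sketch. It is not the authors' software. All function names are hypothetical, trials are assumed to be equal-length lists of voltages, and the form of double centering used here (subtracting the grand average of all trials from each sampled average before computing a Pearson correlation) is an assumption, since the text names the metric without spelling it out.

```python
import random
from math import sqrt


def average(trials):
    """Point-by-point average of equal-length single-trial waveforms."""
    return [sum(samples) / len(trials) for samples in zip(*trials)]


def pearson(x, y):
    """Pearson correlation of two equal-length waveforms (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def bootstrap_classification(probes, targets, irrelevants, n_iter=1000, seed=0):
    """Percentage of iterations in which the resampled probe average correlates
    more strongly with the target average than with the irrelevant average."""
    rng = random.Random(seed)
    # ASSUMPTION: "double centering" is taken to mean subtracting the grand
    # average of all trials from each sampled average; the Pearson correlation
    # then also removes each waveform's own mean.
    grand = average(probes + targets + irrelevants)

    def centered_average(trials):
        sample = [rng.choice(trials) for _ in trials]  # sampling with replacement
        return [v - g for v, g in zip(average(sample), grand)]

    wins = 0
    for _ in range(n_iter):
        p = centered_average(probes)
        t = centered_average(targets)
        i = centered_average(irrelevants)
        if pearson(p, t) > pearson(p, i):
            wins += 1
    return 100.0 * wins / n_iter
```

The returned percentage plays the role of the bootstrap probability that information present is the correct determination; 100 minus that value is the corresponding probability for information absent.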

We set an *a priori* bootstrapping probability criterion of 90% for an information-present determination and 70% (in the opposite direction) for an information-absent determination. If the probability was greater than 90% that the probe response was more similar to the target response than to the irrelevant response, we classified the subject as information present. The bootstrap probability is the statistical confidence for this determination.

The probability that information absent is the correct determination is 100% minus the probability that information present is the correct determination. For example, if there is a 90% probability that the probe response is more similar to the target than to the irrelevant response (information present is correct), then there is a 10% probability that the probe response is more similar to the irrelevant response (information absent is correct). If the probability was greater than 70% that the probe response was more similar to the irrelevant response than to the target response (equivalent to a 30% probability that the probe response was more similar to the target response), we classified the subject as information absent. The bootstrap probability is the statistical confidence for this determination.

If the results did not meet either criterion, we did not classify the subject in either category. The outcome would then be indeterminate (although there were no indeterminates).
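The three-way decision rule described above (information present, information absent, or indeterminate) can be written compactly. The sketch below is illustrative only. The function name and return format are hypothetical, while the 90% and 70% criteria are those stated in the text.

```python
def classify(p_present, ip_criterion=90.0, ia_criterion=70.0):
    """Map a bootstrap probability (in percent) that the probe response resembles
    the target response to a determination and its statistical confidence."""
    p_absent = 100.0 - p_present
    if p_present > ip_criterion:
        return "information present", p_present
    if p_absent > ia_criterion:
        return "information absent", p_absent
    # Neither criterion met: no determination is made.
    return "indeterminate", None
```

For example, a bootstrap probability of 99.3% yields an information-present determination with 99.3% confidence, whereas any value from 30% through 90% inclusive meets neither criterion and is returned as indeterminate.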

For each subject, each data analysis method produced a determination and a statistical confidence, e.g., information present, 99.9% statistical confidence. The statistical confidence is the probability that the determination is correct, based on the within-subjects statistical computation taking into account the size of the effect and the variability in the data.

**Figure 1** illustrates example stimuli, ERP responses, bootstrapping probabilities, and determinations for a hypothetical classification CIT to determine if an individual has information regarding US political history.

Error rate is the percentage of incorrect information-present (false positive) and information-absent (false negative) determinations. Accuracy is 100% minus the error rate. In reporting error rates and/or accuracy, indeterminates must be reported as such. In reporting "accuracy," some authors have confounded indeterminates with false positives and/or false negatives, reporting "accuracy" as the percentage of tests that result in a correct determination, and hiding the number of indeterminates. This irretrievably hides the true error rate if there are indeterminates, and makes it impossible to make a meaningful comparison with studies that report the true error rate. In any meaningful reporting, indeterminates if any must be identified as such, and not confounded with false positive or false negative errors. (Some legitimate techniques such as Bayesian analysis do not allow indeterminates, in which case this must also be reported).
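The reporting convention argued for above, in which indeterminates are tallied separately rather than folded into errors, can be sketched as follows. The function and field names are hypothetical. The choice of denominator (determinations actually made, excluding indeterminates) is an assumption drawn from the phrase "percentage of incorrect ... determinations"; other readings are possible.

```python
from collections import Counter


def summarize(results):
    """results: iterable of (true_state, determination) pairs, where true_state
    is 'IP' or 'IA' and determination is 'IP', 'IA', or 'indeterminate'."""
    counts = Counter(correct=0, false_positive=0, false_negative=0, indeterminate=0)
    for truth, determination in results:
        if determination == "indeterminate":
            counts["indeterminate"] += 1
        elif determination == truth:
            counts["correct"] += 1
        elif determination == "IP":
            counts["false_positive"] += 1  # called IP, truly IA
        else:
            counts["false_negative"] += 1  # called IA, truly IP
    errors = counts["false_positive"] + counts["false_negative"]
    determined = counts["correct"] + errors
    # ASSUMPTION: the error rate is computed over determinations actually made;
    # indeterminates are reported in their own field, never counted as errors.
    error_rate = 100.0 * errors / determined if determined else None
    accuracy = 100.0 - error_rate if error_rate is not None else None
    return {**counts, "error_rate": error_rate, "accuracy": accuracy}
```

Reporting the `indeterminate` count alongside `error_rate` keeps the two quantities distinguishable, which is the point made in the paragraph above.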

We restricted our conclusions to a determination as to whether or not a subject knew the specific situation-relevant knowledge embodied in the probes at the time of the test. Our procedures recognize the fact that the ERP-based classification CIT detects only presence or absence of information—not guilt, innocence, honesty, lying, deception, or any past action or non-action.

#### *Comparison-CIT bootstrapping method*

The comparison CIT uses bootstrapping in an entirely different way. The comparison CIT ignores the target responses and applies bootstrapping to compute the probability that the amplitude of the probe ERP is larger than the amplitude of the irrelevant ERP. The amplitude of the ERP response is defined as the difference between the highest voltage in the P300 window (300–900 ms) and the lowest voltage in the LNP window (900–1800 ms). (This is essentially the sum of the peak amplitudes of the P300 and the LNP.) This is in accord with the metric used in previous applications of the comparison CIT (e.g., Rosenfeld et al., 2008).

**FIGURE 1 | Classification CIT: stimuli, responses, and determinations.** Example of a classification CIT to determine whether an individual has information about US political history. Subject instructions: "You will view names of former US presidents. Some of these are on this list: George Washington, John Adams, Ronald Reagan. Press button T [target] for any name on this list. Otherwise press button O [other]. George Washington was a US president on the list, so then you should press T." Since Bill Clinton was not on the list and was not identified to the subject as a US president, the subject will press button O for Bill Clinton, whether he recognizes him as a US president (information present) or not (information absent). IP, Information Present; IA, Information Absent.

Trials are randomly sampled with replacement and averaged as described above for the classification CIT, except that only probe and irrelevant trials are sampled and averaged. In each of 1000 iterations, the amplitude of the ERP in the sampled probe average is compared with the amplitude of the ERP in the sampled irrelevant average. The percentage of times that the sampled probe ERP is larger than the sampled irrelevant ERP provides an estimate of the probability that the probe ERP is larger than the irrelevant ERP. If the probability that the probe ERP is larger than the irrelevant ERP is greater than 90%, then the subject is determined to be information present. If the probability that the probe ERP is larger than the irrelevant ERP is less than 90%, then the subject is determined to be information absent. (The comparison CIT does not have an indeterminate category.) A probability of 90% that information present is correct (that is, probe ERP is larger than irrelevant ERP) is equivalent to a probability of 10% (that is, 100%–90%) that information absent is correct. Therefore, any subject with a probability of over 10% that information absent is correct is determined to be information absent. This results in subjects being determined to be information *absent* when the computed bootstrap probability is as high as 89.9% that information *present* would be the correct determination, that is, as low as a 10.1% statistically computed probability that the selected information-absent determination is correct. Information-absent statistical confidences range from 10.1% to 99.9% and average 50% (chance) (see Appendix 2).
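For comparison with the classification CIT, the comparison-CIT computation described in this section can be sketched as follows. This is a hedged illustration, not the code used in the cited studies. Function names are hypothetical, each waveform is assumed to be a list sampled at 100 Hz with index 0 at stimulus onset and covering at least 0–1800 ms, and a probability of exactly 90% is treated as information absent because the text does not specify that boundary case.

```python
import random


def average(trials):
    """Point-by-point average of equal-length single-trial waveforms."""
    return [sum(samples) / len(trials) for samples in zip(*trials)]


def window(waveform, start_ms, end_ms, sample_rate_hz=100):
    """Samples from start_ms to end_ms inclusive (index 0 = stimulus onset)."""
    first = start_ms * sample_rate_hz // 1000
    last = end_ms * sample_rate_hz // 1000
    return waveform[first:last + 1]


def erp_amplitude(waveform):
    """Highest voltage in the P300 window minus lowest voltage in the LNP window."""
    return max(window(waveform, 300, 900)) - min(window(waveform, 900, 1800))


def bootstrap_comparison(probes, irrelevants, n_iter=1000, seed=0):
    """Percentage of iterations in which the resampled probe average has a larger
    amplitude than the resampled irrelevant average. Targets are not used."""
    rng = random.Random(seed)
    larger = 0
    for _ in range(n_iter):
        p = average([rng.choice(probes) for _ in probes])
        i = average([rng.choice(irrelevants) for _ in irrelevants])
        if erp_amplitude(p) > erp_amplitude(i):
            larger += 1
    return 100.0 * larger / n_iter


def comparison_determination(p_probe_larger, criterion=90.0):
    """Two-way rule with no indeterminate category; returns (label, confidence)."""
    if p_probe_larger > criterion:
        return "information present", p_probe_larger
    # Confidence in "information absent" is the complement, so it can fall
    # far below 50% while the determination is still issued.
    return "information absent", 100.0 - p_probe_larger
```

Calling `comparison_determination(89.9)` returns an information-absent determination with a confidence of about 10.1%, which is the boundary case discussed in the text.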

Correct information-absent determinations are of two types, valid and invalid. Valid determinations are those that have a greater than 50% (chance) statistical confidence, i.e., a greater than chance computed probability of being correct. An invalid determination is a (correct) determination where the statistical confidence is less than chance (50%); that is, the computed probability that the determination is correct is less than 50%. Such a result is invalid because clearly one cannot validly report that "Our statistical procedure determined that the individual is information absent; the statistics computed a probability of [15%] that the determination is correct." (This also applies to any other percentage lower than 50%.) Such a statement is not statistically meaningful or logically tenable. To be valid, the computed statistical confidence for a result must at least be better than chance (see **Figure 2** and Appendix 2). To be scientifically meaningful and practically useful, it must be considerably better than chance.

There are serious scientific, mathematical, logical, and statistical flaws with the ERP-based comparison-CIT data-analysis procedure, as described in Farwell (2012, 2014), Farwell et al. (2013), and Appendix 2. These flaws cannot be corrected by simply changing the criterion for information present/absent determinations. We have implemented this procedure, however, because this is the way that the bootstrapping statistical confidence has been computed in all or virtually all of the comparison-CIT studies that have previously applied bootstrapping (e.g., Rosenfeld et al., 2008).

**Figure 2** illustrates example stimuli, ERP responses, bootstrapping probabilities, and determinations for a hypothetical comparison CIT to determine if an individual has information regarding US political history.

**FIGURE 2 | Comparison CIT: stimuli, responses, and determinations.** Example of a comparison CIT to determine whether an individual has information about US political history. Subject instructions and button presses are the same as in **Figure 1**. IP, Information Present; IA, Information Absent.

# **RESULTS**

The results are delineated in **Tables 1**–**4** and illustrated in **Figures 3** and **4**. **Table 1** presents the error rate/accuracy of the results of the classification CIT, for both P300 + LNP and P300 analysis methods. Both P300 + LNP and P300 analysis methods produced 0% error rate, 100% accuracy. Both also produced no indeterminates.

**Table 2** presents the error rate/accuracy of the comparison CIT. (Only correct determinations and errors are tabulated: the comparison CIT does not have an indeterminate category).

**Table 3** presents the individual determinations and the statistical confidences for each subject whose true state was information present. It compares the results obtained with the classification CIT with P300 + LNP with the other two methods: classification CIT with P300 and comparison CIT.

**Figure 3** presents the brain responses to probe, target, and irrelevant stimuli for each of the information-present subjects, averaged across all trials for each subject.

**Table 4** presents the individual determinations and the statistical confidences for each subject whose true state was information absent. It compares the results obtained with the classification CIT with P300 + LNP with the other two methods: classification CIT with P300 and comparison CIT.

**Figure 4** presents the brain responses to probe, target, and irrelevant stimuli for each of the information-absent subjects, averaged across all trials for each subject.

#### **RESULTS OF THE CLASSIFICATION CIT WITH P300 + LNP ANALYSIS**

All classification-CIT determinations with the P300 + LNP analysis were correct. Error rate was 0%: there were no false positives and no false negatives. Accuracy was 100%. Also, there were no indeterminates. Grier's A' (Grier, 1971) was 1.0.

**Table 1 | Classification CIT error rate/accuracy of determinations with P300 + LNP and P300.**

**Classification CIT: error rate/accuracy with P300 + LNP and P300**


All information-present statistical confidences were above the *a priori* criterion of 90%. All information-absent determinations were above the *a priori* criterion of 70% (in the opposite direction). Median statistical confidence was 99.9% with the P300 + LNP. Mean statistical confidence was 95.1% with the P300 + LNP.

All of the information-present determinations were made with a statistical confidence of at least 99%, and all but one were made with a statistical confidence of 99.9%. Median statistical confidence for information-present determinations was 99.9%, and mean statistical confidence was also 99.9%.

All information-present determinations exceeded the *a priori* criterion of 90% statistical confidence by at least 9 percentage points in the bootstrap probability computation. No information-present determinations were close to an indeterminate outcome. All information-present determinations were extremely far from a false negative. The lowest information-present determination was separated by a buffer of 69 percentage points in the bootstrap probability computation from the criterion for a false negative. (Exceeding the 70% probability for an information-absent determination would result in a false negative. This is equivalent to 100% – 70% = 30% probability for an information-present determination. The lowest information-present probability obtained was 99.3%, and 99.3% – 30% = 69.3 percentage points.)

**Table 2 | Comparison CIT error rate/accuracy.**

**Comparison CIT: error rate/accuracy**


All of the information-absent determinations also exceeded the corresponding *a priori* criterion of 70% statistical confidence for information-absent determinations. Median information-absent statistical confidence with the P300 + LNP was 91.8%. Mean information-absent statistical confidence with the P300 + LNP was 89.7%.

All information-absent determinations were far from a false positive. The least statistically confident information-absent determination was separated by a buffer of 64 percentage points in the bootstrap probability computation from the criterion for a false positive. (Exceeding the 90% probability for an information-present determination would result in a false positive. This is equivalent to 100% – 90% = 10% probability for an information-absent determination. The lowest information-absent probability obtained was 74.2%, and 74.2% – 10% = 64.2 percentage points.)
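The buffer arithmetic in the preceding paragraphs can be checked with a short sketch. The helper names are hypothetical. The inputs are the lowest statistical confidences and the criteria quoted in the text.

```python
def buffer_from_opposite_error(confidence, opposite_criterion):
    """Percentage-point distance between the obtained probability and the point
    at which the opposite (erroneous) determination would have been made."""
    return confidence - (100.0 - opposite_criterion)


def margin_over_own_criterion(confidence, own_criterion):
    """Percentage-point distance from an indeterminate outcome."""
    return confidence - own_criterion


# Lowest information-present confidence (99.3%) vs. the 70% information-absent criterion.
ip_buffer = round(buffer_from_opposite_error(99.3, 70.0), 1)
# Lowest information-absent confidence (74.2%) vs. the 90% information-present criterion.
ia_buffer = round(buffer_from_opposite_error(74.2, 90.0), 1)
# Distance of those same determinations from an indeterminate outcome.
ip_margin = round(margin_over_own_criterion(99.3, 90.0), 1)
ia_margin = round(margin_over_own_criterion(74.2, 70.0), 1)
```

These evaluate to buffers of 69.3 and 64.2 percentage points and to margins of 9.3 and 4.2 percentage points, consistent with the figures given for the information-present and information-absent determinations.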

Statistical confidences for information-absent determinations were lower than for information-present determinations, however, and some were close to an indeterminate outcome. Nine information-absent determinations had statistical confidences of less than 95%. Six had statistical confidences of less than 90%. Two statistical confidences were less than 75% and were within 5 percentage points of an indeterminate outcome. Possible reasons for this are discussed below.

**Table 3 | Determinations and statistical confidences for information-present subjects.**

**Determinations and statistical confidences, information-present subjects**


*Errors are underlined.*

**Table 4 | Determinations and statistical confidences for information-absent subjects.**


**Determinations and statistical confidences, information-absent subjects**

*Errors are underlined; invalid determinations, i.e., correct determinations made with less than 50% (chance) statistical confidence, are in italics.*

#### **RESULTS OF THE CLASSIFICATION CIT WITH P300 ANALYSIS**

As with the classification-CIT P300 + LNP-based analysis, all determinations with the classification-CIT P300-based analysis were correct. Error rate was 0%: there were no false positives and no false negatives. Accuracy was 100%. Also, there were no indeterminates. Grier's A' (Grier, 1971) was 1.0. All information-present statistical confidences were above the *a priori* criterion of 90%. All information-absent determinations were above the *a priori* criterion of 70% (in the opposite direction). Median statistical confidence was 97.2% with the P300 alone. Mean statistical confidence was 91.9% with the P300 alone. For information-present subjects, median statistical confidence was 99.6%, and mean statistical confidence was 98.1%. For information-absent subjects, median statistical confidence was 84.0%, and mean statistical confidence was 84.8%. With the P300 analysis, 12 subjects (2 information present and 10 information absent) had statistical confidences of less than 95%, and 7 (all information absent) had statistical confidences of less than 75% and were within 5 percentage points of an indeterminate outcome. All determinations were very far from a false negative or false positive error.

#### **COMPARING CLASSIFICATION-CIT P300 + LNP ANALYSIS vs. P300 ANALYSIS**

The classification-CIT P300 + LNP-based analysis produced significantly higher statistical confidences for individual determinations than the classification-CIT P300-based analysis (*p* < 0.0001, Wilcoxon matched-pairs signed rank test). The statistical confidence for the P300 + LNP-based analysis was on average 3.2 percentage points higher than the statistical confidence for the P300-based analysis. In every case where there was a difference, the statistical confidence produced by the P300 + LNP was higher than that produced by the P300 alone. The P300 yielded a greater number of determinations with relatively low statistical confidence, close to an indeterminate outcome, than the P300 + LNP.

#### **RESULTS OF THE COMPARISON CIT**

Error rate with the comparison CIT was 20% overall, 19% false negatives for information-present subjects and 21% false positives for information-absent subjects. (The comparison CIT does not have an indeterminate category.) Mean statistical confidence for correct determinations was 67.0%. The lowest statistical confidence for a correct determination was 11.6%. Median statistical confidence for correct determinations was 93.8%. For information-present subjects, statistical confidences for correct determinations were all over 90%, as required by the criterion of 90% probability for information-present determinations; median was 99.9%; mean was 98.9%. As predicted by the statistical model, statistical confidences for correct information-absent determinations were on average not better than chance (50%). Median was 28.9%; mean was 29.4%. Most of the correct information-absent determinations were invalid, i.e., made with less than a 50% (chance) statistical confidence.

#### **CLASSIFICATION CIT (P300 + LNP ANALYSIS) vs. COMPARISON CIT**

Even if we conservatively consider the 0% error rate of the classification CIT to be "less than 1%" for the sake of avoiding the anomalies of 0%, the comparison CIT produced more than an order of magnitude higher error rate than the classification CIT. This difference was significant (*p* < 0.05, sign test). Moreover, the comparison CIT produced significantly lower statistical confidences for correct determinations than the classification CIT (*p* < 0.0007, Wilcoxon matched-pairs signed rank test). On average, the comparison CIT produced statistical confidences 28.2 percentage points lower than those of the classification CIT in the bootstrap probability computation (for correct determinations). As predicted by the statistical model, this difference was particularly striking for information-absent subjects: comparison-CIT statistical confidences averaged 60.4 percentage points lower than classification-CIT statistical confidences for information-absent subjects. Correct statistical confidences for information-absent subjects with the comparison CIT averaged 29.4%, which is less than chance (50%).

# **DISCUSSION**

#### **CONCLUSIONS**

Our results suggest the following:


Our results suggest that the classification CIT, when practiced according to the methods and standards described here, is a reliable and valid method for detecting concealed information obtained in the course of real life that is characteristic of individuals with specific training, expertise, and/or affiliation with a particular agency or organization.

In our view, the minimum criteria for valid, reliable, and ethical field use for an ERP-based CIT are an error rate of less than 1% along with median statistical confidences of greater than 95%, including greater than 90% for both information-present and information-absent determinations. Our results, taken together with the results of our previous research and independent replications by others (see Farwell, 2012, 2014), suggest that the classification-CIT methods reported herein provide sufficient conditions to meet these criteria in the laboratory and the field. In our view, the methods applied in this research are sufficiently valid and reliable to be ethically applied in field use with substantial consequences to the outcome. These methods can be (and have been) reliably and effectively applied in field criminal cases.

Our results and those of all previous studies taken together (see Farwell, 2012, 2014) suggest that the comparison CIT with ERPs falls far short of these performance criteria for both error rate and statistical confidence. They also suggest that including the P300 + LNP in data analysis provides higher statistical confidences than P300 alone, but it is not a necessary condition for low error rate and high statistical confidences.

The most striking feature of the data reported to date, including the data of this study, is that there is a sharp bimodal distribution of error rates and statistical confidences, based on the following. One set of methods, as described here, applies the classification CIT and always has produced less than 1% error rate and greater than 95% median statistical confidences. Alternative methods, exemplified by Rosenfeld et al. (2008), Dietrich et al. (2014), and Meixner and Rosenfeld (2014), apply the comparison CIT and have produced an order of magnitude higher error rates, as well as statistical confidences averaging no better than 50% (chance) for information-absent determinations. Two reviews including all previous publications in English, Farwell (2012, 2014), documented that only the specific methods that substantially incorporate the 20 brain fingerprinting standards have so far reliably produced less than 1% error rate and greater than 95% median statistical confidences in the laboratory and the field. These are the methods applied in this research. The results of this research suggest that the differences in statistical methods between the classification CIT and the comparison CIT are responsible, at least in large measure, for the extremely large differences between the statistical confidences achieved empirically by the respective techniques.

Our experiment is a specific screening test where the information detected was relevant to expertise and experience in a particular field. Subjects obtained the tested information in the course of real life over a period of years, completely unconnected to any experimental procedures at the time the information was gained by the subjects. The results contribute to the accumulating evidence [e.g., the FBI and bomb-maker studies in Farwell et al. (2013)] that these methods provide a reliable and accurate technique for such applications.

In a few previous studies, real-life information has been detected for real-world crimes with life-changing consequences [the real crime study of Farwell et al. (2013)], and other real-life specific events [experiment 2 of Farwell and Donchin (1991), Farwell and Smith (2001), and people (Meijer et al., 2007)]. Almost all other ERP-based CIT studies have detected information obtained by the subjects in the course of a laboratory information-imparting procedure such as a mock crime (Farwell, 2012, 2014). Meixner and Rosenfeld (2014) conducted a comparison CIT in detecting information regarding unscripted activities that subjects had videotaped the previous day in conjunction with the experiment. Such activities are different from real-life activities; no one would commit an actual crime under such circumstances. Meixner and Rosenfeld failed to cite the previous peer-reviewed publications reporting field studies on real-world crimes and other real-life events, and falsely claimed to be the first study investigating information obtained in real life. Their results were similar to those of other comparison-CIT studies, including the results reported here (Appendix 2).

#### **FIELD APPLICATIONS IN REAL-WORLD CRIMES**

These results complement the results of previous studies (Farwell and Smith, 2001; Farwell et al., 2013) in which the classification CIT was applied to detect concealed information regarding specific events, including field applications involving real-world major crimes. Field applications with life-changing or life-threatening consequences to the outcome involve more demanding conditions, including high motivation and other emotional factors, complexities, logistical challenges, uncontrolled context, and other factors that are difficult to bring under experimental control. We have conducted classification-CIT tests in real-world situations in which all of these demanding conditions were present, for example, tests on both innocent and guilty individuals who were facing the death penalty for murder as well as individuals who had already been convicted of murder and were attempting to establish their innocence. In such situations, low error rate and high statistical confidence are obviously of paramount importance.

The low error rate produced by the classification-CIT methods applied was one of the key features considered when brain fingerprinting was ruled admissible in court in the Harrington murder case (Harrington v. State, 2001; Farwell and Makeig, 2005; Roberts, 2007) in which a falsely convicted man was ultimately exonerated and freed. Extremely low error rates and high statistical confidences were equally important for using the ERP-based classification CIT to bring perpetrators such as serial killer J. B. Grinder to justice (Farwell, 2012; Farwell et al., 2013).

#### **WHAT ARE THE PRIMARY METHODS THAT MAY HAVE CONTRIBUTED TO THE LOW ERROR RATE AND HIGH STATISTICAL CONFIDENCES REPORTED HEREIN?**

The following features of the methods practiced in this research may have contributed to the low error rate and high statistical confidences obtained here and in previous studies with these methods. The primary difference between this research and various studies that produced an order of magnitude higher error rates and average 50% (chance) statistical confidence for information-absent determinations is that we used the classification CIT, rather than the comparison CIT. The comparison CIT was used in virtually all of the studies that have reported high error rates and low statistical confidences (Farwell, 2012, 2014). We applied a classification statistical algorithm, rather than a comparison algorithm, in data analysis. We used each subject's response to situation-relevant target stimuli as a template for that subject's brain response to known, situation-relevant information. We used the subject's response to irrelevant stimuli as a template for that subject's brain response to unknown or irrelevant information. We then used bootstrapping to classify the subject's brain response to the probe stimuli as being more similar to that subject's response to known information relevant to the investigated situation (targets) or to that subject's response to unknown, irrelevant information (irrelevants). This allowed us to make both information-present and information-absent determinations with a high statistical confidence that the determination made is in fact correct in light of the effect size and variability in this subject's data, and that the opposite determination would be incorrect (see Appendix 2).

By contrast, the comparison CIT ignores the target responses and compares only the probe and irrelevant responses, resulting in lower accuracy and statistical confidences averaging 50% (chance) for information-absent determinations, as described in Appendix 2 and in Farwell (2012, 2014) and Farwell et al. (2013).

One previous error and resulting misrepresentation (we presume inadvertent) has caused considerable confusion in this regard (see Appendix 2). Rosenfeld et al. (2004) purported to be a replication of Farwell and Donchin (1991), but in fact did not use the two-tailed classification CIT of Farwell and Donchin, but rather a one-tailed method similar to the comparison CIT of Rosenfeld's other studies (see Appendix 2). The high error rates and low statistical confidences of Rosenfeld et al. (2004) have been mistakenly cited (Rosenfeld et al., 2008) as evidence that Farwell and Donchin's classification-CIT methods are inaccurate (and susceptible to countermeasures), whereas in fact those results only demonstrate that Rosenfeld et al.'s fundamentally different methods are inaccurate (and susceptible to countermeasures) (Farwell, 2011).

Our current results demonstrate once again that the comparison CIT produces higher error rates and lower statistical confidences than the classification CIT, even when the other brain fingerprinting scientific standards (Appendix 1) are substantially met.

#### **WHAT METHODS ARE NECESSARY TO PRODUCE HIGH STATISTICAL CONFIDENCES WITH BOOTSTRAPPING?**

To produce high statistical confidences with bootstrapping, first of all the methods applied must be effective in producing the predicted experimental effects in the brain responses. Given that, what else is necessary in the statistical methods?

The statistical model of the classification CIT predicts high statistical confidences for both information-present and information-absent determinations, and this is what has been consistently reported. The statistical model of the comparison CIT predicts average statistical confidences no better than chance (50%) for information-absent determinations, and this also is what has been reported in the studies to date (Farwell, 2012, 2014; Appendix 2).

The bootstrapping technique applied here, and in all studies implementing the 20 standards, uses a classification CIT. It computes the probability that the probe responses are more similar to the target responses than to the irrelevant responses. The complement (100% minus this probability) is the probability that the probe responses are more similar to the irrelevant responses. This allows for a result of a high statistical confidence for both information-present and information-absent determinations. The comparison CIT computes the probability that the probe responses are larger than the irrelevant responses. This probability is expected to be high for information-present subjects. For information-absent subjects, probe and irrelevant responses are expected to be identical, so the expected value of the probability that the probe response is larger is 50%. This is the expected bootstrap probability that information present is the correct determination, the expected information-present statistical confidence. This makes the expected information-absent probability or statistical confidence also 50% (i.e., 100% – 50% = 50%). Thus, the expected statistical confidence for an information-absent determination with the comparison CIT is 50% (chance), assuming that the methods and statistics work as predicted. This is described in detail in Appendix 2.
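The contrast between the two bootstrap questions can be illustrated with a minimal sketch. It runs on synthetic single trials, not on data from this research, and every function name in it (`bootstrap_probabilities`, `synthetic_trials`, and the helpers) is hypothetical. Two simplifications are assumptions rather than the authors' procedure: double centering is taken to mean subtracting the grand-average waveform before computing a Pearson correlation, and the comparison-CIT amplitude is approximated by the maximum minus the minimum of the averaged waveform.

```python
import math
import random


def mean_waveform(trials):
    """Point-by-point average of equal-length single-trial waveforms."""
    return [sum(column) / len(trials) for column in zip(*trials)]


def pearson(x, y):
    """Pearson correlation of two waveforms; 0.0 if either has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def resampled_average(trials, rng):
    """Average of one bootstrap sample (trials drawn with replacement)."""
    return mean_waveform([rng.choice(trials) for _ in range(len(trials))])


def subtract(waveform, baseline):
    """Subtract a baseline waveform point by point."""
    return [a - b for a, b in zip(waveform, baseline)]


def bootstrap_probabilities(probes, targets, irrelevants, iterations=500, seed=0):
    """Return (classification, comparison) bootstrap probabilities
    that 'information present' is the correct determination."""
    rng = random.Random(seed)
    classification_hits = 0
    comparison_hits = 0
    for _ in range(iterations):
        p = resampled_average(probes, rng)
        t = resampled_average(targets, rng)
        i = resampled_average(irrelevants, rng)
        # Assumed double centering: remove the grand average first.
        grand = mean_waveform([p, t, i])
        p_c, t_c, i_c = subtract(p, grand), subtract(t, grand), subtract(i, grand)
        # Classification question: is the probe more similar to the target
        # than to the irrelevant?
        if pearson(p_c, t_c) > pearson(p_c, i_c):
            classification_hits += 1
        # Comparison question: is the probe larger than the irrelevant?
        # Targets are ignored. Max minus min is a simplified amplitude.
        if max(p) - min(p) > max(i) - min(i):
            comparison_hits += 1
    return classification_hits / iterations, comparison_hits / iterations


def synthetic_trials(rng, n_trials, amplitude, n_points=60, noise_sd=2.0):
    """Hypothetical trials: a Gaussian bump of the given amplitude plus noise."""
    return [[amplitude * math.exp(-((k - 30) ** 2) / 50.0) + rng.gauss(0.0, noise_sd)
             for k in range(n_points)]
            for _ in range(n_trials)]


rng = random.Random(1)
targets = synthetic_trials(rng, 30, 10.0)      # known, relevant: large response
irrelevants = synthetic_trials(rng, 30, 0.0)   # irrelevant: noise only
present = bootstrap_probabilities(synthetic_trials(rng, 30, 10.0), targets, irrelevants)
absent = bootstrap_probabilities(synthetic_trials(rng, 30, 0.0), targets, irrelevants)
print("information-present subject (classification, comparison):", present)
print("information-absent subject (classification, comparison):", absent)
```

With this synthetic data the classification probability should lie near 100% for the information-present case and near 0% for the information-absent case, so both determinations carry high confidence. The comparison probability for the information-absent case depends on noise alone and would be expected to scatter around 50% across subjects.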

Statistical confidences for information-absent determinations reported for the comparison CIT to date have in every study averaged approximately 50% (or less). Approximately half of the information-absent statistical confidences reported have been invalid, that is, less than 50% (chance) (Farwell, 2012, 2014). In approximately half of the cases, authors reported less than a 50% probability that the chosen (information-absent/"innocent") determination was correct, according to the statistics used to arrive at the determination. For example, in Meixner et al. (2009, p. 215; Table 2; "innocent" subject 11) the subject was determined to be "innocent" (information absent) when the computed probability was 85% that "guilty" was the correct determination (i.e., that the probe P300 was larger than the irrelevant P300). Statistical confidence for this (correct) determination was 15%, far less than chance. 60% of subjects correctly determined to be "innocent" in this condition had statistical confidences of less than 50% (chance) that this determination was correct (i.e., had invalid results).

The comparison CIT in this research, as in previous comparison CIT studies (e.g., Rosenfeld et al., 2008; Dietrich et al., 2014; Meixner and Rosenfeld, 2014; see Appendix 2), produced markedly higher error rates and lower statistical confidences than those of the classification CIT. The results of this research, along with the results of all previous research (Farwell, 2012, 2014), suggest that applying the classification CIT rather than the comparison CIT is not only a sufficient condition, but is also a necessary condition for obtaining median 95% statistical confidences, and in particular for obtaining greater than 90% median statistical confidences for information-absent subjects—or even for obtaining greater than chance (50%) median statistical confidences for information-absent subjects (see Appendix 2).

#### **WHAT ADDITIONAL METHODS MAY HAVE CONTRIBUTED TO THE LOW ERROR RATE AND HIGH STATISTICAL CONFIDENCES REPORTED?**

We used double-centered correlation as a measure of the similarity of the probe response to the target or irrelevant response (see Appendix 2). This metric has the advantage of including the entire response, not just a single point (or average of a few points) such as the peak amplitude or the difference between the positive P300 peak and the negative LNP peak. It inherently takes into account not only the peak amplitude, but also the latency and morphology of the full ERP. With the correlation metric, latency differences between probe, target, and irrelevant responses, as well as individual differences in latency and morphology of the ERP, contribute to the characterization of the response and hence to the accuracy and statistical confidence of the result. The information contained in such differences is lost when the P300 is characterized by a single number such as peak amplitude, as applied in, for example, Rosenfeld et al. (2008) and Meixner and Rosenfeld (2014). Our more comprehensive characterization of the waveform may be one reason for the low error rate and high statistical confidence of this research and the previous studies that have used this method.
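A small illustration of why a whole-waveform metric retains information that a single amplitude value discards is sketched below. The waveforms are idealized Gaussian bumps, not ERP data, and the helper names `bump` and `pearson` are hypothetical. Plain Pearson correlation stands in for the double-centered correlation, which is an assumption made for brevity.

```python
import math


def bump(latency, amplitude=10.0, n_points=60, width=5.0):
    """Idealized response: a Gaussian peak centered at the given sample index."""
    return [amplitude * math.exp(-((k - latency) ** 2) / (2 * width ** 2))
            for k in range(n_points)]


def pearson(x, y):
    """Pearson correlation of two equal-length waveforms."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


early = bump(latency=20)       # template response
same = bump(latency=20)        # response with the same latency
late = bump(latency=40)        # response with the same peak but shifted latency

# Peak amplitude cannot tell the shifted response from the matching one.
print("peak amplitudes:", max(early), max(same), max(late))

# Correlation over the full waveform separates them.
print("correlation with same-latency response:", pearson(early, same))
print("correlation with shifted response:", pearson(early, late))
```

All three waveforms share one peak amplitude, yet only the same-latency pair correlates strongly, which is the kind of latency and morphology information a single-number characterization leaves out.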

The term "brain fingerprinting" arises from an analogy to fingerprints that has several facets. Fingerprinting matches prints from the crime scene with prints on the fingers. DNA "fingerprinting" matches biological samples from the crime scene with biological samples from the suspect. "Brain fingerprinting" matches information from the crime scene with information stored in the brain of the subject. Moreover, fingerprints calculate a match based on multiple characteristics. In the autonomic skin conductance response (SCR) as well as in comparison-CIT P300 measurements, the response is generally defined in terms of a single parameter. With SCR this may be the maximum conductance increase that occurs following stimulus onset. With the P300 this is usually peak-to-post-(negative)-peak amplitude, defined as a single number. Brain fingerprinting, like fingerprinting, uses multiple facets of the response to compute a match between known patterns and the pattern tested, taking into account not only the peak amplitude but also the morphology and time course of both the positive and negative peaks in the response.

We used situation-relevant targets. Target stimuli, like probes, were relevant to the information detected. This makes the targets more similar to the probes for the subjects who possess the relevant information, and thus may increase accuracy and statistical confidence (Farwell et al., 2013). The difference between the targets and the probes was that the targets were identified to the subject in subject instructions and required a special button press, and probes were not identified in instructions and required the same button press as irrelevants.

#### **WHAT ARE THE POSSIBLE SHORTCOMINGS OF THE CURRENT STUDY?**

Despite the 0% error rate, the results of this research have certain shortcomings when considered in light of the rigorous requirements demanded by field applications with major consequences. Although all determinations were correct and very far from a false positive or false negative error, the statistical confidence of some determinations was low enough to be close to an indeterminate. This contrasts with previous studies (Farwell and Smith, 2001; Farwell et al., 2013), where all determinations were correct and also far from an indeterminate result.

One reason for this shortcoming may be the relatively low number of trials presented in this research, and consequently a lower signal-to-noise ratio. [This does not, however, explain why the FBI agent study (Farwell et al., 2013) produced higher statistical confidences than this research, without more trials. Further research may identify other differentiating factors]. This research used only 84 probe trials and 84 target trials in the averages. In previous studies where we have used at least 100 probe trials and an equal number of targets, statistical confidences have been considerably higher. Moreover, the results of these two studies demonstrate that while brain fingerprinting standard 13 (use at least 100 probe trials—see Appendix 1) has been shown to be useful for producing optimal results, it is not absolutely requisite for achieving high levels of accuracy or statistical confidence. In other words, standard 13 is part of the well-established set of sufficient conditions, but is not a necessary condition for low error rate and high statistical confidences.

#### **SUMMARY**

We used the classification CIT to detect information gained by subjects in the course of real life. They gained the tested information in real-life events over a period of years before, and completely unrelated to, the experimental procedures. This was a specific screening or focused screening test, rather than a specific issue test. That is, rather than detecting information obtained at a particular place and time (such as while committing a crime), we detected information known to people with specific training, expertise, and organizational affiliation, specifically knowledge of military medicine by US Navy military medical experts. Subjects obtained this knowledge through a variety of experiences at different times and places for different individuals.

In detecting this concealed information, the classification CIT with the P300 + LNP produced 0% error rate and median 99.9% statistical confidence for individual determinations, a significantly lower error rate and higher statistical confidences than those produced by the comparison CIT.

Although the classification-CIT methods using both the P300 and the P300 + LNP produced the same 0% error rate, the P300 + LNP produced significantly higher statistical confidences for individual determinations. In continued field use, with the concomitant demanding conditions, eventually errors (or at least indeterminates) may occur with these methods. If so, then the higher statistical confidences produced by the P300 + LNP (rather than the P300 alone) can be expected to result in lower error rates when the error rate is non-zero.

In our view, to reliably produce the predicted experimental effect and to be viable for field use, a technique must consistently produce less than 1% error rate, along with high statistical confidences for both information-present and information-absent determinations.

The results of this study, together with the results of similar studies such as the FBI agent study and the bomb-maker study of Farwell et al. (2013), suggest that the classification CIT methods specified here, when the full P300 + LNP epoch is employed in data analysis, can be used effectively in specific screening tests to detect knowledge characteristic of individuals with specific training, expertise, and/or affiliation with a particular agency or organization. In our current study, this was accomplished in a specific screening test under controlled conditions, with the limitations inherent thereto. Prior research has applied these same methods in field conditions in a specific issue test in investigating actual crimes, with the concomitant complications related to motivation, emotions, logistics, experimental control, and other uncontrollable factors. Taken together with previous successful field applications in real-world criminal investigations, our results suggest that these methods may have application in both national security and law enforcement, for instance in identifying trained terrorists, bomb makers, members of a terrorist cell, hostile intelligence agents, members of an organized crime organization, and others with specific knowledge, expertise, training, and/or affiliations of interest.

### **ACKNOWLEDGMENTS**

Funding was provided by the US Central Intelligence Agency (CIA), Contract No. 92-F138600-000. We are grateful to the US Navy and USUHS for providing subjects and facilities. We are grateful to Dr. Christine Furedy (York University) for assistance in editing the manuscript. Study design; collection, analysis, and interpretation of data; and writing of this report were undertaken solely by the authors. The views expressed herein are solely the views of the authors.

# **REFERENCES**


**Conflict of Interest Statement:** Research contract 92-F138600-000 US Central Intelligence Agency (CIA). Richardson was an FBI agent at time of research. US Navy and USUHS provided facilities and subjects. Farwell is inventor in US patents (#7,689,272; 5,363,858; 5,406,956; 5,467,777) and one UK patent (# GB2421329) relevant to the research. Farwell is the Chairman and Chief Scientist of Brain Fingerprinting Laboratories, Inc., member of Brain Fingerprinting, LLC and Brainwave Science, LLC, commercial neuroscience companies.

*Received: 31 January 2014; accepted: 23 November 2014; published online: 23 December 2014.*

*Citation: Farwell LA, Richardson DC, Richardson GM and Furedy JJ (2014) Brain fingerprinting classification concealed information test detects US Navy military medical information with P300. Front. Neurosci. 8:410. doi: 10.3389/fnins.2014.00410*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Farwell, Richardson, Richardson and Furedy. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# **APPENDIX 1**

# **BRAIN FINGERPRINTING SCIENTIFIC STANDARDS**


him the probe and the corresponding irrelevants, without revealing which is the probe. Ask the subject if he knows (for any non-situation-related reason) which stimulus in each group is situation-relevant/crime-relevant. Describe the significance of the probes and targets that will appear in each test block immediately before the block.


standard of general acceptance in the scientific community), and one using the P300 + LNP (P300-MERMER) to provide the current state of the art.


# **APPENDIX 2**

# **PREVIOUS RESEARCH ON SUFFICIENT AND NECESSARY CONDITIONS FOR CRITERION LOW ERROR RATES AND HIGH STATISTICAL CONFIDENCES**

#### *Summary of previous results*

The most striking feature of the data reported to date, including this research, is the sharp bimodal distribution of error rates and statistical confidences. So far, the only proposed explanation for this bimodal distribution is that a specific set of classification-CIT methods, as described here and in the 20 brain fingerprinting standards (Appendix 1), have consistently produced less than 1% error rate and greater than 95% median statistical confidences for individual determinations (Farwell, 2012, 2014; Farwell et al., 2013). Alternative methods, exemplified by Rosenfeld et al. (2004, 2008), Dietrich et al. (2014), and Meixner and Rosenfeld (2014), have produced an order of magnitude higher error rates, as well as statistical confidences averaging no better than chance (50%) for information-absent ("innocent"/"nonknowledgeable") determinations. By applying two different data-analysis methods to the same data, our research directly addresses a fundamental difference in methods: the classification CIT vs. comparison CIT. This fundamental difference in methods accounts for a major difference in the results reported by previous studies applying the respective methods.

Previous results can be summarized as follows. All data are consistent with the hypothesis that standards 1–20 (Appendix 1) provide sufficient conditions for an ERP-based CIT to obtain less than 1% error rate and greater than 95% median statistical confidence, including greater than 90% statistical confidence for both information-present and information-absent determinations. At least some of these standards constitute necessary conditions. Several previous studies (e.g., Johnson and Rosenfeld, 1992; Rosenfeld et al., 2008; Dietrich et al., 2014; Meixner and Rosenfeld, 2014) demonstrated that applying comparison-CIT methods that fail to meet Standards 3–6, 8–15, and 17–20 resulted in high error rates, low statistical confidences, and invalid results. For reviews of all relevant studies in English to date, including a detailed discussion of which of the 20 standards were found to be necessary conditions in which studies, see Farwell (2012, 2014). Among the necessary conditions are standards 14, 15a, and 17, which describe the classification CIT and distinguish it from the comparison CIT. Standard 11, which requires subjects to read and process each stimulus and prove with a differential button press that they have done so, is a necessary condition for tests on motivated subjects in field conditions with major consequences to the outcome, but not for tests with accommodating subjects in the absence of non-trivial consequences. Standard 13, which requires a minimum number of probe trials in the average, is not a necessary condition, but does contribute to higher statistical confidences and potentially to higher accuracy.

#### *Why the classification CIT (standards 14, 15a, and 17) is a necessary condition for high statistical confidences*

In the classification CIT, probes and targets are both relevant details about the investigated situation. For information-present subjects, both are expected to result in essentially the same ERP, containing a large P300 and LNP. For an information-absent subject, the irrelevants and (unrecognized) probes are indistinguishable and equally irrelevant, and are expected to elicit identical ERPs lacking a large P300 and LNP. Classification-CIT bootstrapping asks the statistical question, "What is the probability that the probes are more similar to the targets than to the irrelevants?" 100% minus this is the probability that the probes are more similar to the irrelevants. The expected value of this statistic is 100% in the information-present direction (probes resemble targets) for information-present subjects, and 100% in the information-absent direction (0% in the information-present direction; probes resemble irrelevants) for information-absent subjects (see **Figure 1**). Thus, if the data are as predicted, the classification CIT provides a high statistical confidence for both information-present and information-absent determinations.

The comparison CIT ignores targets and asks the statistical question, "What is the probability that the probe ERPs are larger than the irrelevant ERPs?" For an information-present subject, the expected value is 100%. If the ERPs are as predicted, this method can deliver a high statistical confidence for information-present subjects. That is, it can provide a high probability that information present is the correct determination. If the bootstrap probability that information present is correct is over 90%, the subject is determined to be information present (see **Figure 2**). If the bootstrap probability that information present is correct is less than 90%, which is equivalent to a probability of greater than 10% (i.e., 100–90%) that information absent is correct, the subject is determined to be information absent. Subjects are determined to be information absent with as low as a 10.1% statistically computed probability that this determination is correct. For an information-absent subject, probe and irrelevant ERPs are expected to be identical, so the expected value of the probability that probes are larger is 50%. This is the expected information-present statistical confidence, which makes the expected information-absent statistical confidence also 50% (i.e., 100% – 50% = 50%). This means that if the ERPs are as predicted and the statistics work as designed, the average statistical confidence for information-absent determinations will be 50% (chance). In all comparison-CIT studies reported to date (Farwell, 2012, 2014), the average statistical confidence for information-absent subjects has been approximately 50% (or less), and subjects have been determined to be information absent with as low as 10.1% statistical confidence or computed probability that this determination is correct.

In summary, the comparison CIT, in accord with the predictions of the statistical model, has in every study to date produced statistical confidences averaging no better than chance for information-absent determinations. Thus, applying the classification CIT rather than the comparison CIT is a necessary condition for obtaining high statistical confidences, or even better-than-chance statistical confidences for information-absent subjects.<sup>3</sup>
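The comparison-CIT decision rule described above, and the resulting statistical confidence for the determination actually made, can be written out as a short sketch. The function name `comparison_cit_determination` is hypothetical, and treating a probability of exactly 90% as information absent is an assumption, since the text specifies only the cases above and below the criterion.

```python
def comparison_cit_determination(p_present, threshold=0.90):
    """Apply the comparison-CIT rule to a bootstrap probability.

    p_present is the bootstrap probability (0 to 1) that the probe response
    is larger than the irrelevant response, i.e., that 'information present'
    is correct. Returns the determination and the statistical confidence
    that this determination is correct.
    """
    if not 0.0 <= p_present <= 1.0:
        raise ValueError("p_present must be a probability between 0 and 1")
    if p_present > threshold:
        return "information present", p_present
    # Below the criterion the subject is called information absent, and the
    # confidence in that call is the complement of p_present.
    return "information absent", 1.0 - p_present


for p in (0.95, 0.85, 0.50, 0.10):
    determination, confidence = comparison_cit_determination(p)
    # A confidence below 50% means the statistic favors the opposite call.
    flag = "below chance" if confidence < 0.5 else "at or above chance"
    print(f"p_present={p:.2f}: {determination}, "
          f"confidence={confidence:.2f} ({flag})")
```

Under this rule a bootstrap probability of 85% yields an information-absent determination with a confidence of about 15%, which corresponds to the kind of case the text describes as invalid.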

#### *Corrections regarding previous studies*

Two major errors have led to considerable confusion and misinformation in the literature. Johnson and Rosenfeld (1992) cited Farwell and Donchin (1988a, 1991) and Wasserman and Bockenholt (1989) as the origin of the bootstrapping method, but they did not apply the classification-CIT method introduced by Farwell, Donchin, Wasserman, and Bockenholt. They applied a comparison CIT. Several other subsequent studies by the same group and others have cited Farwell and Donchin and Wasserman and Bockenholt as the source of their bootstrapping method, but have applied the comparison CIT instead. This has led to confusion and misinformation because the results of these two methods are substantially different. In particular, the high error rates and low statistical confidences characteristic of the comparison CIT have sometimes been falsely attributed to the classification CIT (e.g., Rosenfeld et al., 2004, 2008) or to all ERP-based CITs (see Farwell, 2012, 2014; Farwell and Richardson, 2013). Our data demonstrate that these two methods are substantially different not only in experimental design but also in results, for both error rate and statistical confidence.

Another error and consequent (good faith) misrepresentation has led to some apparently (but not actually) highly anomalous results in the literature. Rosenfeld et al. (2004) reported the high error rates and low statistical confidences that are typical of comparison-CIT studies that did not implement the 20 brain fingerprinting scientific standards, but mistakenly characterized their study as a replication of Farwell and Donchin (1991), the original classification-CIT study substantially implementing the standards. The error rate in the "FIT" condition, which was mistakenly described as a replication of Farwell and Donchin, was 46% in one group and 31% in another (without countermeasures), and even higher with countermeasures. Both of these are obviously higher than the less-than-1% error rate that has been achieved (both with and without countermeasures) in all instances in which the Farwell and Donchin methods were actually applied. The reason for this discrepancy is that, although Rosenfeld et al. characterized their methods as a replication of Farwell and Donchin, the actual methods they applied, as in that group's other studies, did not implement over half of the 20 brain fingerprinting standards, specifically standards 3–6, 8–10, 12–15, and 17–20 (Farwell, 2012, 2014; Farwell and Richardson, 2013). Among many other fundamental differences, they did not use the classification-CIT bootstrapping method of Farwell and Donchin and Wasserman and Bockenholt (1989). Thus, the high error rates Rosenfeld et al. reported are consistent with the high error rates of the other studies that did not implement many of the same standards. Rosenfeld et al. erroneously concluded that their results showed that the Farwell and Donchin method was inaccurate and susceptible to countermeasures (Rosenfeld et al., 2008), whereas in fact their results showed that their fundamentally different method is inaccurate and susceptible to countermeasures.

When the distinction between the classification CIT and the comparison CIT, and the other standards, are taken into account, the pattern of results in the literature is clear. Two different sets of methods produce two different sets of results. One set of methods, the classification CIT implementing the 20 standards, always has produced low error rates and high statistical confidences. Another, different set of methods implementing the comparison CIT has produced high error rates and low statistical confidences. In light of this distinction, the bimodal distribution of error rates and statistical confidences is explicable and even predictable. When this fundamental distinction is ignored or blurred, the literature inexplicably appears to contain two strikingly different groups of results for implementations of one undifferentiated method. Our experiment and results contribute to the clarification of this distinction in methods and the concomitant difference in results.

<sup>3</sup>We report the bootstrap probability that information present is correct for information-present determinations and the probability that information absent is correct for information-absent determinations. Some previous publications (e.g., Rosenfeld et al., 2004, 2008) report the probability that information present ("guilty") is correct for information-absent ("innocent") as well as information-present determinations. Such papers report bootstrap statistics ranging from 0.01% to 89.9% for information-absent determinations (i.e., the probabilities that information present is correct), with a lower bootstrap figure corresponding to a higher statistical confidence. In such terminology, a determination of "information absent (or innocent), bootstrap index 80%" actually means "information absent, bootstrap probability of being correct 20%." In such terminology, any bootstrap index for a correct information-absent determination that is higher than 50% is an invalid result (although Rosenfeld et al. and some other researchers do not clearly identify such results as invalid). In our terminology, any bootstrap index for a correct information-absent determination that is lower than 50% is an invalid result.
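The two reporting conventions described in footnote 3 differ only by a complement, which the following minimal sketch makes explicit. The function names are hypothetical and introduced here for illustration only; the 50% validity boundary is taken from the footnote.

```python
def confidence_in_determination(p_info_present: float, determination: str) -> float:
    """Convert P(information present) into the probability that the
    stated determination is correct (the convention used in this paper)."""
    if determination == "present":
        return p_info_present
    if determination == "absent":
        # An information-absent call is correct with the complementary probability.
        return 1.0 - p_info_present
    raise ValueError("determination must be 'present' or 'absent'")


def is_valid(confidence: float) -> bool:
    """A determination whose probability of being correct is below chance
    (50%) is treated as invalid, following footnote 3."""
    return confidence >= 0.5


# "Information absent, bootstrap index 80%" in the alternative convention
# corresponds to a 20% probability that the determination is correct.
conf = confidence_in_determination(0.80, "absent")
print(round(conf, 2), is_valid(conf))
```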

### *Why a required button-press discrimination for all stimuli (standard 11) is a necessary condition*

The comparison CIT has been implemented in two different experimental designs for stimulus presentation and subject responses. One is the same as the three-stimulus target-probe-irrelevant design applied in the classification CIT and in our comparison-CIT experiment. The difference between the classification and comparison CITs with this design is in the data analysis: *classifying* the probe ERPs as being more similar to the target ERPs or the irrelevant ERPs (or neither, i.e., indeterminate) vs. ignoring the targets and *comparing* the probe ERPs with the irrelevant ERPs to determine whether the probe ERPs are significantly larger. Another version of the comparison CIT uses a four-stimulus "complex trial protocol" design (Rosenfeld et al., 2008). Each trial presents two stimuli. The first is always either a target or a non-target. The second is always either a probe or an irrelevant. Targets and non-targets are a completely different type of stimuli from probes and irrelevants, e.g., meaningless numbers (target: "six," non-targets "one" through "five"). Thus, the targets do not provide a template for an information-present response, and without such a template the classification CIT cannot be used. The four-stimulus design must use the comparison-CIT data-analysis algorithm.
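The difference between the two data-analysis approaches can be illustrated with a minimal bootstrap sketch. It is an illustration under stated assumptions, not the published algorithms: similarity is measured here with a plain Pearson correlation between resampled average waveforms, "larger" is measured as mean amplitude in a fixed sample window, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def _resampled_mean(trials, rng):
    """Average ERP of trials drawn with replacement (one bootstrap sample)."""
    idx = rng.integers(0, len(trials), size=len(trials))
    return trials[idx].mean(axis=0)


def classification_bootstrap(probe, target, irrelevant, n_iter=1000, seed=0):
    """Fraction of iterations in which the probe ERP resembles the target ERP
    more than the irrelevant ERP (assumed similarity: Pearson correlation)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_iter):
        p = _resampled_mean(probe, rng)
        t = _resampled_mean(target, rng)
        i = _resampled_mean(irrelevant, rng)
        hits += int(np.corrcoef(p, t)[0, 1] > np.corrcoef(p, i)[0, 1])
    return hits / n_iter


def comparison_bootstrap(probe, irrelevant, window, n_iter=1000, seed=0):
    """Fraction of iterations in which the probe amplitude exceeds the
    irrelevant amplitude; targets are ignored (assumed measure: window mean)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_iter):
        p = _resampled_mean(probe, rng)[window].mean()
        i = _resampled_mean(irrelevant, rng)[window].mean()
        hits += int(p > i)
    return hits / n_iter


# Synthetic single-trial data (trials x samples), for illustration only.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 50)
p300 = np.exp(-((t - 0.5) ** 2) / 0.01)
target = p300 + rng.normal(0, 0.5, size=(40, 50))
probe = 0.8 * p300 + rng.normal(0, 0.5, size=(40, 50))
irrelevant = rng.normal(0, 0.5, size=(40, 50))

print(classification_bootstrap(probe, target, irrelevant))
print(comparison_bootstrap(probe, irrelevant, slice(20, 30)))
```

In the four-stimulus design only the second function could be applied, because the targets there are not the same type of stimulus as the probes and so cannot serve as a template.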

In the three-stimulus design, subjects are required to read and process every stimulus, decide if it is a target or not, and push the appropriate button. All subjects, regardless of motivation, are required to perform the same information-processing and button-press tasks. Subject strategies for responding to the four-stimulus complex trial protocol comparison-CIT design, by contrast, differ substantially depending on the motivation of the subject. Subjects are required to distinguish by a button press between targets and non-targets, so they must read and process them, regardless of motivation. After they have pushed the correct button in response to the target/nontarget, they know that no discrimination will be required in response to the next stimulus. They know for certain that either a probe or an irrelevant will appear, and they simply push the same button whatever appears. Accommodating laboratory subjects read and process the probe/irrelevant stimuli as instructed, push the button, and respond with different ERPs to probes and irrelevants, resulting in better-than-chance accuracy of the test (Meixner and Rosenfeld, 2014). Motivated subjects with life-changing consequences to the outcome recognize that they do not need to read and process the probe/irrelevant stimuli in order to push a button whenever something appears on the screen shortly after the target/nontarget discrimination and button press. Thus, motivated subjects do not read and process the probe/irrelevant stimuli. They simply press the single required probe/irrelevant button when something appears on the screen after the target/nontarget, without reading and processing that probe/irrelevant stimulus. 
Thus, for motivated subjects with life-changing consequences to the outcome, there is no processing of the information content of probes and irrelevants, there are no differences between probe and irrelevant ERPs, and accuracy rate is 0% for the four-stimulus complex trial protocol (Farwell et al., 2013). By contrast, the required button-press discrimination on every trial in the three-stimulus CIT ensures that the subjects read and process every stimulus, resulting in the predicted ERPs to targets, probes, and irrelevants and reliable results of the test. This behaviorally required button-press discrimination in response to every stimulus, including when a probe or irrelevant is presented, may not be necessary for tests with accommodating subjects and lacking any nontrivial consequences, as several studies (e.g., Rosenfeld et al., 2008; Dietrich et al., 2014; Meixner and Rosenfeld, 2014) have shown. It is, however, a necessary condition for reliable results in field use or any application when subjects are highly motivated not to reveal the concealed information, e.g., when they are facing major life-changing consequences to the outcome (Farwell et al., 2013).

### *Recent comparison-CIT publications provide additional evidence for necessary conditions*

Several recent comparison-CIT studies have provided additional evidence for the necessary conditions for low error rates and high statistical confidences. Dietrich et al. (2014) conducted a four-stimulus comparison-CIT test and obtained results similar to those of other previous comparison-CIT studies, including ours. They varied the number of trials in the analysis, and concluded that "even procedures that utilize as few as 33 trials can reliably detect the presence of concealed information." This is in accord with our finding that standard 13 is not a necessary condition. The statement that the four-stimulus complex trial protocol method of Dietrich et al. can "reliably detect..." depends on one's definition of "reliably." Dietrich et al.'s results are indeed no less reliable than those of previous comparison-CIT studies. They are, however, much less reliable than the results obtained with the classification CIT in this research and all previous studies meeting the same standards. In separate experiments Dietrich et al. applied two different comparison CITs: the three-stimulus method and the four-stimulus method. Their subjects were accommodating college students who presumably read and processed the probe and irrelevant stimuli as instructed, despite the fact that the behavioral (button-press) demands of the task did not require them to do so in the four-stimulus method. Thus the probe and irrelevant waveforms were significantly different. They also found that subjects later recalled stimuli in both the three-stimulus and four-stimulus experiments. All of this is expected behavior and results for accommodating subjects when there are no non-trivial consequences to the outcome. 
Their study does not address the phenomenon described above: with the four-stimulus method, motivated subjects in situations with life-changing consequences to the outcome of the test do not read and process the probe and irrelevant stimuli because the button-press task does not require it, and therefore the four-stimulus complex trial protocol has a 100% error rate (0% accuracy) with highly motivated subjects and life-changing consequences to the outcome (Farwell et al., 2013).

Meixner and Rosenfeld (2014) conducted a comparison CIT. They published only bootstrap probabilities and failed to identify a specific criterion for distinguishing between information present ("knowledgeable") and information absent ("nonknowledgeable") subjects<sup>4</sup>. Applying the usual 90% criterion that was applied in all of their previous studies (and in virtually all others), Meixner and Rosenfeld's results are as follows. With one of their two analysis techniques, 25% of determinations for information-present subjects are false negatives, and only 33% of the information-absent determinations are valid, i.e., 67% are invalid, having a computed bootstrap probability of less than chance (50%) of being correct. Their other analysis technique correctly classifies only 33% of the information-present subjects, with 67% false negatives. (Any higher criterion would produce even more errors; any lower criterion would produce unacceptably low statistical confidences for both information-present and information-absent determinations). Their results contribute to establishing the necessary conditions for a viable ERP-based CIT. Their results are comparable to those of the other comparison-CIT studies published to date, including the research we report here. They provide additional data in support of the hypothesis that the application of the classification CIT, rather than the comparison CIT, is a necessary condition for obtaining low error rates, high statistical confidences, reliability, and validity.

<sup>4</sup>Note that Meixner and Rosenfeld (2014), unlike our reports, report only the probability that information present (or "knowledgeable") is correct (or the corresponding number of bootstrap iterations), for all subjects, even if the determination would be information absent (or "nonknowledgeable").

# A brain-computer interface for potential non-verbal facial communication based on EEG signals related to specific emotions

# *Koji Kashihara\**

*Information Solution, Institute of Technology and Science, The University of Tokushima, Tokushima, Japan*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Dennis J. McFarland, Wadsworth Center, USA Rolando Grave De Peralta, Geneva Electrical Neuroimaging Group, Switzerland Sebastian Grissmann, LEAD Graduate School, Germany*

#### *\*Correspondence:*

*Koji Kashihara, Information Solution, Institute of Technology and Science, The University of Tokushima, 2-1 Minamijyousanjima, Tokushima 770-8506, Japan e-mail: kashihara.koji@tokushima-u.ac.jp*

Unlike assistive technology for verbal communication, the brain-machine or brain-computer interface (BMI/BCI) has not been established as a non-verbal communication tool for amyotrophic lateral sclerosis (ALS) patients. Face-to-face communication enables access to rich emotional information, but individuals suffering from neurological disorders, such as ALS and autism, may not express their emotions or communicate their negative feelings. Although emotions may be inferred by looking at facial expressions, emotional prediction for neutral faces necessitates advanced judgment. The process that underlies brain neuronal responses to neutral faces and causes emotional changes remains unknown. To address this problem, therefore, this study attempted to decode conditioned emotional reactions to neutral face stimuli. This direction was motivated by the assumption that if electroencephalogram (EEG) signals can be used to detect patients' emotional responses to specific inexpressive faces, the results could be incorporated into the design and development of BMI/BCI-based non-verbal communication tools. To these ends, this study investigated how a neutral face associated with a negative emotion modulates rapid central responses in face processing and then identified cortical activities. The conditioned neutral face-triggered event-related potentials that originated from the posterior temporal lobe statistically significantly changed during late face processing (600–700 ms) after stimulus, rather than in early face processing activities, such as P1 and N170 responses. Source localization revealed that the conditioned neutral faces increased activity in the right fusiform gyrus (FG). This study also developed an efficient method for detecting implicit negative emotional responses to specific faces by using EEG signals. A classification method based on a support vector machine enables the easy classification of neutral faces that trigger specific individual emotions. 
In accordance with this classification, a face on a computer morphs into a sad or displeased countenance. The proposed method could be incorporated as a part of non-verbal communication tools to enable emotional expression.

**Keywords: neutral faces, source localization, aversive conditioning, face recognition, electroencephalogram, brain computer interfaces**

# **INTRODUCTION**

Unlike assistive technology for verbal communication, the brain-machine or brain-computer interface (BMI/BCI) has not been sufficiently established as a non-verbal communication tool for amyotrophic lateral sclerosis (ALS) patients (Nijboer et al., 2009; Tomik and Guiloff, 2010). Although face-to-face communication provides rich emotional information, late-stage ALS patients may experience difficulties in expressing their emotions because the disorder causes severe muscular paralysis (Iversen et al., 2008; Nijboer et al., 2009). Similarly, autistic individuals cannot adequately communicate negative feelings during non-verbal communication (Dalton et al., 2005). Emotions in these individuals can be easily predicted by looking at the emotional facial expressions of visitors (e.g., positive feelings triggered by smiling faces and negative feelings stimulated by angry faces). However, people feel certain emotions even when observing an inexpressive face that is associated with an experience or socially relevant memory. Predicting the emotions induced by the inexpressive faces of visitors will necessitate advanced judgment based on various brain activities.

A crucial component of smooth communication is the ability to discern emotional states from facial expressions. Previous neuroimaging studies have implicated the fusiform gyrus (FG) and superior temporal sulcus (STS) as specifically involved in face processing activities (George et al., 1999; Leppänen and Nelson, 2009). This finding is supported by evidence that FG lesions in patients with prosopagnosia impair the ability to perceive facial configurations (Sergent and Signoret, 1992; Barton et al., 2002). The neural system for face perception is divided into a core system (the inferior occipital gyrus, lateral FG, and STS) for visual analysis and an extended system [intraparietal sulcus, amygdala (AMG), insula, etc.] for cognitive functioning in attention, mouth movement, facial expression or identity, and emotion (Haxby et al., 2000).

Healthy people easily respond to facial expressions that communicate happiness, sadness, and anger by using similar facial expressions. Although a patient with severe ALS cannot effectively form facial expressions, family members and carers could, to a certain extent, notice the patient's emotions from facial expressions in visitors. That is, the emotions in the patient will similarly correspond with the facial expressions at which he/she is looking (Keltner and Ekman, 2000). However, a more difficult requirement in coping with social situations is the ability to read an individual's emotional responses to neutral faces. In particular, a person may feel certain emotions when encountering neutral expressions that are associated with previous experiences. However, ALS or autistic individuals cannot directly express their emotions—a situation that diminishes their quality of life. For individuals who are unable to summon appropriate facial expressions during communication, desirable technologies are passive or affective BCIs (Nijboer et al., 2009; Zander et al., 2010) that enable the communication of specific emotions. Such technologies should facilitate real-time reception of and response to patients' emotions rather than be restricted to interpreting facial expressions.

Even a neutral face associated with previous information or memory (e.g., previously observed behavior and personal characteristics, familiar situations, etc.) can elicit various emotions (Todorov et al., 2007) and enhance brain activities (Kleinhans et al., 2007; Taylor et al., 2009). Experimental conditioning studies indicated that specific cues (e.g., angry or fearful faces) associated with negative feelings modulate AMG activity (Morris et al., 2001; Knight et al., 2009). Despite the progress made in research, however, a comprehensive conditioning study on inexpressive faces associated with negative emotions has not been conducted, with specific focus on the rapid neural dynamics (e.g., *<*1 s) in human cortex networks.

Although the effects of conditioned neutral faces remain unknown, previous studies on electromagnetic brain activity revealed significant electrical responses to faces at 100-ms (P1) and 170-ms (N170) latencies. P1 amplitudes in occipital regions are facilitated by visual stimuli, such as fearful faces (Pourtois et al., 2004). Specific N170 responses are characterized by posterior temporal negative deflection (Bentin et al., 1996; Eimer, 2000) and are slightly but significantly more enhanced in response to fearful faces than to other facial expressions (Batty and Taylor, 2003; Pegna et al., 2008). These face-specific amplitudes may therefore vary even when an individual looks at negatively conditioned inexpressive faces because of the modulation of emotional valences on the basis of individual experience.

Event-related potential (ERP) amplitudes for face stimuli have not been adequately elucidated. For example, some researchers showed that both famous and unfamiliar faces elicit identical amplitudes (Eimer, 2000; Schweinberger et al., 2002); they argue that late-period processing (e.g., 300–500 ms) is more important than early-period processing in the recognition of facial expressions or movements. By contrast, other studies revealed that familiar faces can enhance N170 amplitudes (Rossion et al., 1999; Caharel et al., 2002). In clarifying the reasons for the incongruence in results, a potentially promising approach is applying source localization to total channel responses to determine novel ways of explaining neuronal dynamics in the cortex, even under slightly differing EEG responses between experimental conditions.

Source localization supports the enhanced activation of the lateral FG at P1 and N170 latencies, such as the activation observed in the direct perception of facial expressions (Utama et al., 2009) and face processing during imaginary situations (Ganis and Schendan, 2008). Nevertheless, in the brain's neural response to inexpressive faces giving rise to negative emotions, definitive conclusions regarding the effects of source localization remain elusive. The primary purpose of this study, therefore, was to investigate the effects of aversively conditioned neutral faces on rapid central responses and evaluate the potential of electroencephalogram (EEG) signals as tools for detecting such responses. Source localization was performed to identify cortical activities. The emotions evoked by individual experience with a specific person may cause changes in physiological characteristics during rapid face processing.

If EEG signals can detect actual emotional responses to specific inexpressive faces, the results could be incorporated into the design and development of BMI/BCI-based non-verbal communication tools. Brain signals, such as EEG, magnetoencephalography (MEG), near-infrared spectroscopy, and electrocorticogram (ECoG) data, can be integrated with BMI/BCI functionality, thereby enabling the development of technologies that offer advantages to paralyzed patients (Lebedev and Nicolelis, 2006). In particular, ECoG or intracortical single-unit recordings show high spatial resolution and good signal-to-noise (S/N) ratio. Despite these advantages, however, the invasive nature of the measurement may present technical difficulties and pose clinical risks (Leuthardt et al., 2004). By contrast, EEG or MEG signals can be measured non-invasively, albeit with low spatial resolution and in the presence of artifactual noise (Gramfort et al., 2010). To improve poor S/N ratios, researchers have applied a wavelet transform with reasonable noise reduction to the single-trial classification of EEG signals (Tallon-Baudry et al., 1996; Hsu and Sun, 2009). EEG signals are also convenient for everyday measurement because EEG equipment is inexpensive and portable.

Because substantial changes in the EEG signals measured by scalp electrodes can be easily detected, extensive brain activity in the cortex (e.g., P300 responses) has been successfully employed in active BMI/BCI studies (Piccione et al., 2006; Mak et al., 2012). For non-verbal communication, the FG and STS regions are specifically involved in face processing (George et al., 1999; Hoffman and Haxby, 2000). For example, electrodes placed around posterior temporal regions can detect strong electrical responses to human faces, with the responses exhibiting 100-ms (P1) and 170-ms (N170) latencies (Bentin et al., 1996; Eimer, 2000). Especially for ALS patients who cannot freely move or speak, an important component of smooth communication is the ability to convey the emotions triggered by facial recognition to support persons. In this regard, therefore, real-time BMI/BCI-based face recognition is a desirable next-generation application of EEG signals for the facilitation of non-verbal communication.

Support vector machines (SVMs) are useful and efficient methods for classifying biological signals (Lotte et al., 2007). However, a few problems are presented by parameter settings for SVMs that are directly linked to accuracy. To consider approaches to the use of BMI/BCIs with EEG signals, an essential requirement may be the abstraction of brain activity at functional frequencies under reduced artificial noise (Tallon-Baudry et al., 1996). Such issues are effectively addressed by time–frequency analyses (Kashihara et al., 2009), which could also elucidate the neural activities that are crucial for face processing. The time–frequency data for an SVM classifier enable the efficient extraction of meaningful changes in ERP responses. Accordingly, the second purpose of this study was to develop and evaluate an analytical method in which an SVM classifier evaluates negative emotional responses to inexpressive faces. The method was developed on the basis of time–course and time–frequency EEG data. A face morphing application triggered by the SVM classifier was also tested to determine its utility in non-verbal communication.

# **STUDY 1: EEG MEASUREMENT**

Using EEG signals, this study investigated the effects of aversively conditioned neutral faces on rapid central responses. The emotions induced by individual experience with a specific person may modify the physiological characteristics that arise during face processing.

#### **MATERIALS AND METHODS**

#### *Participants*

A total of 22 right-handed healthy volunteers from Nagoya University were recruited for the study. The physiological responses of 12 participants (6 males and 6 females; age: 27.0 ± 0.9 years) to conditioned neutral faces were examined. The remaining 10 participants (see Section Basic Study of Neutral Face Stimuli) were asked to take part in a basic experiment to evaluate the equality of the face stimuli used in this work. All the participants had normal or corrected-to-normal vision and had no history of serious medical problems. This study was approved by the ethics committee of our institute. Written informed consent was obtained from all the participants after they were provided an adequate description of the experiment.

#### *Stimuli*

Five inexpressive faces, as visual stimuli, were selected from the Japanese Female Facial Expression (JAFFE) database<sup>1</sup>. Thirty scenery images without artificial objects (e.g., skies, mountains, seas, etc.) were collected from free web sources. All the images were converted into grayscale bitmap images. The visual stimuli (640 × 640 pixels) were presented on a 21-inch CRT monitor (resolution: 1024 × 768 pixels) that was positioned at the same height as the participants' eyes. The distance from the stimuli was set at 140 cm, corresponding to a visual angle within approximately 5°. During the experiments, a loudspeaker was placed behind the participant and a 100-dB white noise burst was used as the auditory stimulus (Morris et al., 1998).

#### *Procedure*

After the sensors for the EEG measurement were attached onto the participants, they were asked to view a short series of images (10 trials) during the acclimation period. They were first asked to rate three face images, after which the two-phase experiment was initiated. The two phases were (1) aversive conditioning to a neutral face (*conditioning* phase) and (2) physiological measurement using the conditioned stimuli (*data acquisition* phase).

*Conditioning phase.* **Figure 1A** shows the experimental procedure adopted in the conditioning phase. A 500-ms "start" stimulus was followed by an interstimulus interval (ISI) between 800 and 1200 ms, after which one of the two faces selected from the dataset was presented onscreen for 1 s. A single neutral face (conditioned stimulus: CS+) was always followed by a 500-ms noise burst; the other face (CS−) was not paired with the noise burst. Fifty percent of all the face stimuli were of CS+ type (or CS−), and the image types were presented in random order. To evaluate the difference between the two faces, the participants were asked to press the left and right keys for the first and second faces, respectively. Intertrial interval (ITI) was varied between 5 and 6 s. This phase comprised 40 trials (20 trials under each stimulus condition). After all the trials were completed and the participants finished their 5-min rest period, the next phase was initiated.
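The trial structure of the conditioning phase can be summarized in a short sketch. The function and field names are hypothetical, and uniform sampling of the ISI and ITI is an assumption, since the text states only the ranges.

```python
import random


def conditioning_schedule(n_trials=40, seed=0):
    """Build a randomized conditioning-phase schedule: half CS+ and half CS-
    trials, with the noise burst paired with every CS+ trial."""
    rng = random.Random(seed)
    types = ["CS+"] * (n_trials // 2) + ["CS-"] * (n_trials // 2)
    rng.shuffle(types)  # image types are presented in random order
    schedule = []
    for stim in types:
        schedule.append({
            "start_ms": 500,                    # "start" stimulus duration
            "isi_ms": rng.uniform(800, 1200),   # assumed uniform within the range
            "face_ms": 1000,                    # face shown for 1 s
            "type": stim,
            "noise_burst_ms": 500 if stim == "CS+" else 0,
            "iti_s": rng.uniform(5, 6),         # assumed uniform within the range
        })
    return schedule


schedule = conditioning_schedule()
print(sum(trial["type"] == "CS+" for trial in schedule), len(schedule))
```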

*Data acquisition phase.* **Figure 1B** shows the experimental procedure used in the data acquisition phase. A 500-ms "start" stimulus was followed by an ISI jittered between 800 and 1200 ms. One of the two faces (CS+ and CS−) used in the conditioning phase or a scenery image (randomly selected from the dataset) was presented for 1 s. The images in the three types (unconditioned and conditioned neutral faces and scenery) were presented in random order. Half of the CS+ images were followed by a 500-ms noise burst (paired CS+), whereas the rest were presented without sound (unpaired CS+). This method was intended to efficiently maintain the effect of aversive or fear conditioning and was based on a procedure discussed in previous studies (Büchel et al., 1998; Morris et al., 2001). The CS− faces were always followed by silence, and the ITI was between 7 and 8 s.

**FIGURE 1 | Experimental procedures in the (A) conditioning and (B) data acquisition phases.** One of the multiple images was presented at every trial. CS+, aversively conditioned face stimuli; CS−, unconditioned stimuli. All the faces were neutral; CS+ was followed by a noise burst in **(A)** (all) and **(B)** (50%).

<sup>1</sup>Lyons, M. J., Kamachi, M., and Gyoba, J. (1997). *Japanese Female Facial Expressions (JAFFE), Database of digital images*. Available online at: http://www.kasrl.org/jaffe\_info.html

This phase was initiated in three blocks, with each block involving 72 trials (24 trials for each image type). The rest period between the blocks was 5 min. To enable the participants to concentrate on the tasks, they were asked to press a key as quickly as possible when the cue stimulus of a railroad image was presented twice in a block. The participants were asked to refrain from body movements during the physiological measurement and to refrain from blinking during the image presentations. The face images (CS+, CS−, and a dummy) were also rated at the final period of the experiment.

#### *Data acquisition*

*EEG recording.* EEGs were recorded using EGI Inc.'s HydroCel Geodesic Sensor Net (65 channels) in accordance with the international 10-10 electrode system. The signals from an EEG amplifier (Net Amps 300) were sampled at 500 Hz with data acquisition software (Net Station ver. 4.2). The electrode impedances for all channels were kept below 50 kΩ, as recommended by EGI Inc. The EEG amplifier used in this study can record input with high electrode impedance, without the attachments causing scalp abrasions and without the need for a recording paste and gel. The recording net with electrodes uses a saline solution for the electrical conductor, thereby resulting in high electrode impedances. The high impedances are regulated by the amplifiers to guarantee recording accuracy (Ferree et al., 2001). The features of the EEG amplifier are also highly useful in EEG recordings for patients who cannot withstand lengthy setup procedures or painful scalp abrasions.

*Rating of face stimuli.* The participants were asked to rate neutral faces (0 = not at all, 1 = mild, 2 = moderate, 3 = strong, and 4 = extreme) to evaluate the changes in the emotions that arose [fear, anxiety, aversion, discomfort, anger, relief, favor, and pleasure in relation to social situations (Nesse, 2005)] before and after the experiment. Three images of neutral faces were randomly chosen from the database and presented onscreen; these images were the same as the CS+ and CS− faces and a dummy image that had not been presented in the previous experiments. The participants reported the emotions that they instantaneously experienced when they looked at the presented faces. In the final rating, the degree to which they experienced unpleasant feelings upon exposure to the noise burst (a five-point scale of 0–4) was confirmed by oral declaration.

#### *Data analysis and statistics*

All data are expressed as mean ± standard error. *p*-values less than 0.05 were considered statistically significant.

*Event-related potentials.* The average reference montage and a digital bandpass filter of 1–30 Hz were applied offline. For all the conditions, the EEG signals were segmented into epochs ranging from 100 ms before stimulus onset to 700 ms after stimulus onset. Baseline correction was performed by subtracting the mean of the 100-ms pre-stimulus interval from the data after stimulus onset. Trials in which ocular activity was greater than ±50 µV within a 50-ms period or movement artifacts had amplitudes exceeding ±200 µV were excluded from analysis. Although noise due to micromovements may contaminate EEG signals, sufficient signal averaging can attenuate much of this noise. After the trials were averaged in each condition, the grand average among the participants was calculated. Data on cue stimuli with key presses were excluded from the analysis. The regions of interest were the left (P7, P9, TP9) and right (P8, P10, TP10) posterior temporal regions that are correlated with face processing; the bilateral occipital electrodes (O1 and O2) that are related to attention in the primary visual cortex were used in the P1 analysis.
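The epoching, baseline correction, and amplitude-based rejection steps described above can be sketched for a single channel as follows. This is a simplified illustration rather than the Net Station processing actually used: the function name is hypothetical, only the ±200 µV amplitude criterion is implemented (the ocular criterion is omitted), and re-referencing and filtering are assumed to have been applied already.

```python
import numpy as np

FS = 500          # sampling rate in Hz (from the recording description)
PRE_MS = 100      # epoch start relative to stimulus onset
POST_MS = 700     # epoch end relative to stimulus onset


def epoch_and_average(signal_uv, onsets, fs=FS, reject_uv=200.0):
    """Cut epochs around stimulus onsets (sample indices), subtract the
    pre-stimulus mean, drop epochs exceeding +/- reject_uv, and average."""
    pre = int(PRE_MS * fs / 1000)
    post = int(POST_MS * fs / 1000)
    kept = []
    for onset in onsets:
        if onset - pre < 0 or onset + post > len(signal_uv):
            continue  # skip epochs that run past the recording
        epoch = np.asarray(signal_uv[onset - pre:onset + post], dtype=float)
        epoch = epoch - epoch[:pre].mean()   # baseline correction
        if np.abs(epoch).max() > reject_uv:
            continue  # amplitude-based artifact rejection
        kept.append(epoch)
    if not kept:
        raise ValueError("no artifact-free epochs available")
    return np.mean(kept, axis=0), len(kept)


# Toy recording: constant 10-µV offset with one large artifact.
signal = np.full(5000, 10.0)
signal[2010] += 500.0
erp, n_kept = epoch_and_average(signal, onsets=[1000, 2000, 3000])
print(erp.shape, n_kept)
```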

In the ERP responses at the posterior temporal regions, the maximum values between 60 and 120 ms and the minimum values between 100 and 200 ms were defined as P1 and N170 responses, respectively. For each amplitude in the early (P1 and N170) and late (average between 200 and 700 ms in 100-ms analysis windows) ERP components, a two-way repeated-measures ANOVA [two levels (left and right) for the bilateral recording sites; three levels (CS+, CS−, and scenery) for the presented images] was performed. For significant main effects, the Holm method was used for multiple comparisons.
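As a rough illustration of these component definitions, the sketch below picks the P1 as the maximum in 60–120 ms, the N170 as the minimum in 100–200 ms, and computes mean amplitudes in consecutive 100-ms windows from 200 to 700 ms. The function names and the toy waveform are hypothetical, and treating each late window as half-open (start included, end excluded) is an assumption.

```python
def peak_in_window(erp, times_ms, t_min, t_max, kind):
    """Return (latency_ms, amplitude) of the max or min within [t_min, t_max]."""
    window = [(amp, t) for amp, t in zip(erp, times_ms) if t_min <= t <= t_max]
    if not window:
        raise ValueError("no samples inside the requested window")
    chooser = max if kind == "max" else min
    amp, latency = chooser(window, key=lambda pair: pair[0])
    return latency, amp


def window_means(erp, times_ms, start=200, stop=700, width=100):
    """Mean amplitude in consecutive half-open windows [t0, t0 + width)."""
    means = []
    for t0 in range(start, stop, width):
        vals = [a for a, t in zip(erp, times_ms) if t0 <= t < t0 + width]
        means.append(sum(vals) / len(vals) if vals else float("nan"))
    return means


# Toy averaged waveform in µV (hypothetical values, coarse time axis).
times = [0, 50, 90, 140, 200, 300, 400, 500, 600, 700]
wave = [0.0, 1.0, 4.0, -6.0, -1.0, -2.0, -3.0, -2.5, -4.0, 0.0]

p1 = peak_in_window(wave, times, 60, 120, "max")     # P1: maximum in 60-120 ms
n170 = peak_in_window(wave, times, 100, 200, "min")  # N170: minimum in 100-200 ms
late = window_means(wave, times)                     # five late windows
print(p1, n170, late)
```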

*Source localization.* Standardized low-resolution electromagnetic tomography (sLORETA, the LORETA-KEY software package) was used for source localization; by solving the inverse problem, it can estimate the local regions in which the cortical generators originate in each time window (Pascual-Marqui, 2002). Because sLORETA analysis depends on noise levels (Grave de Peralta Menendez et al., 2009), the EEG data obtained after sufficient signal averaging were used for the source localization. The solution space of sLORETA is restricted to 6239 cortical gray matter voxels of 5 mm<sup>3</sup>.

The average of the post-stimulus period between 50 and 700 ms was compared with that of the baseline (100-ms prestimulus period) in each condition. The sLORETA images for the average of the ERP data in the P1 (60–120 ms), N170 (120–150 ms), and post-stimulus periods between 200 and 700 ms (100-ms time window without overlap: five windows) were then calculated to compare the difference between the CS+ and CS− conditions. The P1 and N170 periods that were analyzed were determined from the results on the ERP responses.

The statistical non-parametric map with smoothing and linear scaling (Nichols and Holmes, 2002) was used to estimate the significantly activated parts determined from the source localization between the CS+ and CS− conditions. Voxel-by-voxel *t*-tests of the LORETA images were performed. The significance threshold was based on a permutation test (6000 rounds). The corrected *t*-values were plotted onto a magnetic resonance imaging (MRI) template (Colin27 brain, T2-weighted images) with a color scale bar. As an experimental limitation, the correct location may have been slightly shifted because a standard brain template (Colin27 brain) was used for each subject. All the *p*-values were one-tailed. Results are presented in the Montreal Neurological Institute coordinates with assigned Brodmann's area labels.
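The idea behind a permutation-based significance threshold can be sketched as a max-statistic randomization test. The code below is a simplified stand-in, not the LORETA-KEY implementation: it assumes a paired design in which the sign of each participant's CS+ minus CS− difference is flipped at random, it omits the smoothing and scaling steps, and all names are hypothetical.

```python
import math
import random


def t_stat(diffs):
    """One-sample t statistic of paired differences (0.0 if variance is zero)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n) if var > 0 else 0.0


def max_t_threshold(diff_by_subject, n_perm=6000, alpha=0.05, seed=0):
    """One-tailed critical t from the permutation distribution of the max t.

    diff_by_subject[s][v] is the CS+ minus CS- value of subject s at voxel v.
    Taking the maximum over voxels controls the family-wise error rate.
    """
    rng = random.Random(seed)
    n_vox = len(diff_by_subject[0])
    max_ts = []
    for _ in range(n_perm):
        # Random sign flip per subject, applied to all of that subject's voxels.
        signs = [rng.choice((-1, 1)) for _ in diff_by_subject]
        max_ts.append(max(
            t_stat([s * subj[v] for s, subj in zip(signs, diff_by_subject)])
            for v in range(n_vox)
        ))
    max_ts.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return max_ts[index]


# Toy data (hypothetical): 8 subjects, 1 voxel with a consistent positive effect.
toy = [[float(d)] for d in range(1, 9)]
threshold = max_t_threshold(toy, n_perm=2000)
print(t_stat([row[0] for row in toy]) > threshold)
```

A voxel would then be declared significant when its observed *t*-value exceeds the returned threshold.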

*Rating of face stimuli.* The first rating scores were subtracted from the final scores. A positive value indicates increased appeal and a negative value indicates the opposite. The extent to which the participants experienced unpleasant feelings upon exposure to the noise burst was calculated as the average across all the participants.

For the changes in rating scores before and after the conditioning experiment, a two-way repeated-measures ANOVA [three levels for the image type (CS+, CS−, and dummy) and eight levels for the basic and social emotions (Nesse, 2005)] was carried out under the assumption of an equal-interval scale. For significant main effects, the Holm method was used for multiple comparisons.
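For readers unfamiliar with the Holm method, the following minimal sketch computes Holm step-down adjusted *p*-values: the sorted raw *p*-values are multiplied by the number of remaining hypotheses, and monotonicity is enforced with a running maximum. The function name and the example *p*-values are hypothetical and are not taken from this study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A comparison is significant at the 0.05 level when its adjusted value is below 0.05.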

#### **RESULTS**

#### *ERP responses*

**Figure 2** shows the grand average of the ERP responses in the (**Figure 2A**) bilateral posterior temporal regions and (**Figure 2B**) two-dimensional topography at the focused latencies. At around 90 ms from face stimulus onset, the peak positive potential (P1) appeared, followed by a considerable negative potential (N170) at around 140 ms. Especially for late latency (600 ms in **Figure 2B**), the topographical maps of EEG activity changed across the image types.

ANOVA revealed main effects of the image condition on the P1 [*F*(2, 44) = 4.15, *p* < 0.05] and N170 amplitudes [*F*(2, 44) = 63.67, *p* < 0.01] at the posterior temporal lobe. The N170 values for the face images were significantly (*p* < 0.01) greater than those for the scenery image in each hemisphere. However, these values did not significantly differ between CS+ and CS−. For the P1 amplitude at the occipital electrodes (O1 and O2), the main effects and interaction were statistically non-significant.

For the ERP responses between 200 and 700 ms (100-ms time windows), ANOVA indicated that the type of image exerted significant main effects (*p* < 0.01 in each period). For all the time windows, the average ERP values for the face images in each hemisphere were significantly lower than those for the scenery (*p* < 0.01). In particular, in the period between 600 and 700 ms [*F*(2, 44) = 28.72, significant main effect], the CS+ value in the right hemisphere decreased significantly more than did the CS− value (*p* < 0.05).

#### *Source localization*

**Table 1** shows the statistically significant regions among all the participants (baseline vs. post-stimulus response) in each experimental condition. In relation to the baseline, the significant areas localized by sLORETA (*p* < 0.05) were the bilateral FG, inferior temporal gyrus, and middle temporal gyrus for both the CS+ and CS− conditions. Especially in the CS+ condition, the right hemisphere exhibited stronger activity than did the left hemisphere. During the 600–700-ms latency, the right FG was more significantly activated under CS+ than under CS− (**Figure 3**; *p* < 0.05).

#### *Rating of face stimuli*

**Figure 4** illustrates the affective changes found under the three face conditions. Overall, the rating scores for CS+ tended to reflect increased negative emotions and decreased positive emotions. ANOVA revealed the main effect of emotion [*F*(7, 252) = 6.17, *p* < 0.01] and significant interaction [type of image × emotion, *F*(14, 252) = 4.54, *p* < 0.01]. In the items showing a significant simple main effect, the CS+ and CS− conditions significantly differed in the rating scores for aversion (*p* < 0.05), discomfort (*p* < 0.01), relief (*p* < 0.01), and pleasure (*p* < 0.05). The changes in the scores for aversion (*p* < 0.05) and discomfort (*p* < 0.05) in the face under CS+ were significantly higher than those derived when the dummy was presented. The extent to which unpleasant feelings were experienced upon exposure to the noise burst was 3.0 ± 0.3 (75% of the max. score).

#### **BASIC STUDY OF NEUTRAL FACE STIMULI**

The same procedure as that applied in the data acquisition phase (see Section Procedure) was performed to evaluate the effect of neutral face stimuli without conditioning (7 males and 3 females; age: 25.0 ± 0.7). The ITI was between 5 and 6 s because no aversive conditioning was applied. The parameters for the ERP responses (P1, N170, and latencies between 200 and 700 ms) and the significant source localization sites were statistically evaluated. The neutral faces exhibited no significant differences in terms of ERP response amplitudes and source estimation (*p* > 0.05), indicating equality across the image dataset.

**Table 1 | Brain areas of significant activity (*p* < 0.05) under the face image types in relation to the baseline.**

*CS+, conditioned stimulus; CS−, unconditioned stimulus. BA, Brodmann's areas; FG, fusiform gyrus; ITG, inferior temporal gyrus; MTG, middle temporal gyrus.*

#### **DISCUSSION 1**

Even for the inexpressive faces, aversive conditioning in which the noise burst was used induced a posteriori negative feelings.

Although greater attention was paid to the faces than to the scenery, the CS+ and CS− conditions exhibited statistically nonsignificant differences in the N170 amplitudes in the posterior temporal regions and the P1 amplitudes at the O1 and O2 sites. This result suggests that any extensive and meaningful process of distinguishing between the inexpressive faces operated outside early face processing. By contrast, the late ERP response (600–700 ms) evoked by the aversively conditioned neutral faces resulted in a statistically significant ERP component at the electrodes attached around the posterior temporal lobe; this result suggests the involvement of higher face recognition functions (Eimer, 2000) under situations wherein expressions are associated with a previous negative experience. A limitation of the experiment was that the meaningful electrode positions were carefully established to enable the detection of EEG responses. Consequently, other brain functions may be detected or missed, depending on the region of interest. Severe ALS may also attenuate brain activity in patients (Guo et al., 2010).

For facial expression or mental imagery, previous studies involving source estimation indicated that some brain areas, including the FG, are activated for early face processing at P1 and N170 (Ganis and Schendan, 2008; Utama et al., 2009). Even for the inexpressive faces with aversive conditioning, the source estimation in this study indicated continuous FG activation in both the CS+ and CS− conditions (**Table 1**). Furthermore, the right FG was significantly activated, with longer latencies of 600–700 ms under CS+ than under CS− (**Figure 3**). This result presumably reflects aversively conditioned responses to incoming noise burst.

The AMG is activated by short-term fear or aversive conditioning (Büchel et al., 1998; Morris et al., 2001; Knight et al., 2009), and viewing emotional faces effectively increases the connectivity between the AMG and FG (Leppänen and Nelson, 2009). The mental imaging of a face can also enhance FG activity (Ganis and Schendan, 2008). These findings suggest that recalling a conditioned face that reflects negative emotions similarly activates the FG and AMG. Because the results of the present study (**Table 1**, **Figure 3**) showed strong activation of the inferior temporal cortex (FG) rather than the STS, negative learning may have been performed via the inferior network for face recognition, through which the AMG and FG were accessed (Morris et al., 1998; Knight et al., 2009). The STS is generally activated by the recognition of gaze direction or facial expressions of emotion (Haxby et al., 2000). Because the visual targets (i.e., inexpressive faces) of the aversive conditioning in this study did not feature such dynamic movements, the STS activity may not have influenced conditioned facial perception and/or recognition.

The central responses were modulated primarily at the right sides of the FG; this modulation is related to the aversively conditioned neutral face stimuli. The activated area (**Table 1**) on the left-side FG (vs. the right side) under CS+ decreased to 1/7 (5 vs. 34 voxels), whereas the ratio under CS− (44 vs. 39 voxels) generally remained at around 1. This result presumably reflects the intensive recruitment of right-side neural functions in predicting impending danger. This finding is supported by previous studies in which the right-side FG activity in patients with prosopagnosia was dominant (Barton et al., 2002); this activity also dominated even during presentations of a fearful face to healthy participants (Pegna et al., 2008).

The aversion and discomfort reflected by the emotional ratings were significantly increased by the conditioned neutral face (**Figure 4**), which could trigger changes in brain activity. By contrast, the fearful emotion was non-significantly correlated with the conditioned neutral faces, which would have failed to induce the ERP responses (i.e., early face processing responses, such as P1 and N170) specific to typical fear stimuli (Batty and Taylor, 2003; Pegna et al., 2008). Note, however, that the participants in this study may have experienced unconscious fear during the EEG measurement.

For ERP studies that focus on a specific stimulus, sLORETA software is equipped with a procedure for baseline correction, which has been applied in numerous studies on source localization (e.g., Utama et al., 2009; Scharmüller et al., 2011). Nevertheless, baseline correction must be carefully used in source localization given the occurrence of local changes in each sensor location (http://www.electrical-neuroimaging.ch/faq.html).

The mathematical constraint in LORETA software is the smoothness of spatial activation, in which neighbor neurons are assumed to exhibit similar activations. From direct measurement on neural tissue, active states are characterized by a reduction in the synchrony of adjacent neurons (Cruikshank and Connors, 2008; Poulet and Petersen, 2008). Contrary to the LORETA hypothesis, sluggish states may be identified by the similarity in activity of neighbor neurons under a defined time window for source localization (Grave de Peralta Menendez and Andino, 2000).

The conditioned faces further emphasized activity in the right FG (i.e., face-selective cells) during late facial processing. The EEG equipment detected responses at the posterior temporal regions; these responses reflect activity in face-responsive neurons. Neutral stimuli other than faces would affect various brain areas with different EEG responses. Expert object recognition may stimulate the same face-selective cells (Tanaka and Curran, 2001), although such recognition is a rare response.

#### **STUDY 2: APPLICATION OF A BRAIN-COMPUTER INTERFACE**

Study 1 revealed that the posterior temporal lobe ERP responses to the aversively conditioned neutral faces significantly changed during late face processing. As previously stated, this study developed an efficient method for classifying implicit negative emotional responses to specific neutral faces by using EEG signals.

#### **MATERIALS AND METHODS**

#### *Classification by SVM*

The SVM classifier for determining a hyperplane that optimally separates samples from two classes with the largest margin (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) was used in this study. An optimal SVM separating hyperplane is calculated by solving constrained optimization thus:

$$\min_{\mathbf{z},b,\xi} \left( \frac{1}{2} \left\| \mathbf{z} \right\|^2 + C \sum_{i=1}^{l} \xi_i \right). \tag{1}$$

Equation (1) is subject to *y<sub>i</sub>*(*z* · *φ*(*x<sub>i</sub>*) + *b*) + *ξ<sub>i</sub>* ≥ 1 and *ξ<sub>i</sub>* ≥ 0 (*i* = 1, ···, *l*), where *l* is the number of training vectors, *y<sub>i</sub>* ∈ {−1, +1} denotes the class label of the output, and ‖*z*‖<sup>2</sup> = *z*<sup>T</sup>*z* represents the squared Euclidean norm. The weight vector *z* determines the orientation of the separating hyperplane, *b* is a bias, *ξ<sub>i</sub>* is the *i*th non-negative slack variable, and *φ* denotes a non-linear mapping function. Parameter *C* indicates the penalty term. With a large *C*, a high penalty is assigned to training errors. The two points closest to the hyperplane substantially affect the orientation, thereby resulting in a hyperplane that is close to other data points. With a small *C*, these points move inside the margin and the orientation of the hyperplane changes, thereby generating a large margin. To address this issue, a formulation of the SVM that uses the parameter 0 < *ν* ≤ 1 can be applied. This parameter can regulate the fractions of support vectors and margin errors (*ν*-SVM).
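To make the role of *C* concrete, the sketch below evaluates the objective of Equation (1) for a fixed hyperplane, using the smallest slack values that satisfy the constraints. It assumes an identity mapping *φ*(*x*) = *x* (a linear SVM) and uses hypothetical names and toy data; it evaluates the objective only and does not solve the optimization.

```python
def soft_margin_objective(z, b, xs, ys, C):
    """Evaluate 0.5 * ||z||^2 + C * sum(xi_i) for a fixed hyperplane (z, b).

    Each slack xi_i is the smallest non-negative value satisfying
    y_i * (z . x_i + b) + xi_i >= 1, with phi taken as the identity map.
    """
    slacks = []
    for x, y in zip(xs, ys):
        margin = y * (sum(zi * xi for zi, xi in zip(z, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(zi * zi for zi in z) + C * sum(slacks)
    return objective, slacks


# Toy data (hypothetical): the third point lies inside the margin.
points = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
classes = [1, -1, 1]
for penalty in (1.0, 10.0):
    print(penalty, soft_margin_objective([1.0, 0.0], 0.0, points, classes, penalty))
```

The same margin violation costs ten times more under the larger penalty, which is why a large *C* pushes the optimizer toward hyperplanes with fewer training errors and a narrower margin.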

The vector *φ*(*x<sub>i</sub>*) that corresponds to a non-zero coefficient *α<sub>i</sub>* is a support vector of the optimal hyperplane. A desirable approach is to use a small number of support vectors to obtain a compact classifier. The optimal separating hyperplane is calculated as a decision surface of the form:

$$f(\mathbf{x}) = \text{sgn}\left(\sum_{i=1}^{L} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right),\tag{2}$$

where sgn(·) ∈ {−1, +1}, *α<sub>i</sub>* are the Lagrange multipliers, and *K* is the non-linear kernel function, which projects samples to a high-dimensional feature space via a non-linear mapping function. *L* is the number of support vectors. As the non-linear kernel, the radial basis function is defined as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2\right),\tag{3}$$

where the value of the kernel parameter *γ* determines the variance of the function.
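The decision function of Equation (2) with the radial basis function kernel of Equation (3) can be written out directly, as in the sketch below. The support vectors, coefficients, and bias are hypothetical toy values rather than a trained model, the class labels of the support vectors are passed explicitly, and returning +1 when the sum is exactly zero is an assumption.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), as in Equation (3)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def svm_decide(x, support_vectors, alphas, labels, b, gamma):
    """Sign of the kernel expansion over the support vectors plus the bias."""
    total = sum(
        alpha * label * rbf_kernel(sv, x, gamma)
        for sv, alpha, label in zip(support_vectors, alphas, labels)
    )
    return 1 if total + b >= 0 else -1


# Toy model (hypothetical): one support vector per class.
svs = [[0.0, 0.0], [2.0, 0.0]]
coefs = [1.0, 1.0]
classes = [1, -1]
print(svm_decide([0.1, 0.0], svs, coefs, classes, b=0.0, gamma=1.0))
print(svm_decide([1.9, 0.0], svs, coefs, classes, b=0.0, gamma=1.0))
```

A larger *γ* makes the kernel fall off faster with distance, so each support vector influences a smaller neighborhood of the feature space.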

#### *Features for SVM*

**Figure 5A** shows the diagram of the BCI based on the SVM classifier and face morphing application. Time–course and time– frequency data for the SVM classifier were extracted from the ERP responses to the face and scenery images.


**FIGURE 5 | (A)** Diagram of the proposed method in which a support vector machine (SVM) classifier is used. **(B)** Illustration of the face morphing software triggered by the classifier's result.

$$w(t, f_0) = \exp\left(\frac{-t^2}{2\sigma_t^2}\right) \cdot \exp\left(2\pi i f_0 t\right) / \sqrt{\sigma_t \sqrt{\pi}}.\tag{4}$$

The standard deviation of the time domain (*σ<sub>t</sub>*) is inversely proportional to the standard deviation of the frequency domain [*σ<sub>f</sub>* = (2*πσ<sub>t</sub>*)<sup>−1</sup>]. The effective number of oscillation cycles in the wavelet (*f*<sub>0</sub>/*σ<sub>f</sub>*) was set at 6, with *f*<sub>0</sub> ranging from 4 to 12 Hz (i.e., theta and alpha bands) in increments of 0.1 Hz. After the subtraction of a linear trend, the continuous wavelet transform of a time series [*u*(*t*)] was calculated as the convolution of a complex wavelet with *u*(*t*): *ũ*(*t*, *f*<sub>0</sub>) = *w*(*t*, *f*<sub>0</sub>) ∗ *u*(*t*). The squared norm of the wavelet transform was calculated in a frequency band at around *f*<sub>0</sub>. **Figure 6** shows examples of the wavelet transform in the ERP responses to the presented images. The power spectrum of the wavelet transform for the face stimulus showed a characteristic pattern compared with that for the scenery stimulus.
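The wavelet of Equation (4) and the squared norm of its convolution with a signal can be sketched as follows. This is an illustrative implementation under several assumptions: the wavelet is truncated at ±3*σ<sub>t</sub>*, the signal is zero-padded at its edges, the linear detrending step is omitted, and the sampling rate and test signal are hypothetical.

```python
import cmath
import math


def morlet(t, f0, n_cycles=6.0):
    """Complex Morlet wavelet w(t, f0) with f0 / sigma_f = n_cycles."""
    sigma_t = n_cycles / (2.0 * math.pi * f0)  # from sigma_f = 1 / (2*pi*sigma_t)
    norm = 1.0 / math.sqrt(sigma_t * math.sqrt(math.pi))
    envelope = math.exp(-t * t / (2.0 * sigma_t ** 2))
    return norm * envelope * cmath.exp(2j * math.pi * f0 * t)


def wavelet_power(signal, fs, f0, n_cycles=6.0):
    """Squared norm of the convolution of the wavelet with the signal."""
    sigma_t = n_cycles / (2.0 * math.pi * f0)
    half = int(round(3.0 * sigma_t * fs))  # truncate the wavelet at +/- 3 sigma_t
    offsets = range(-half, half + 1)
    kernel = [morlet(k / fs, f0, n_cycles) for k in offsets]
    power = []
    for n in range(len(signal)):
        acc = 0j
        for k, w in zip(offsets, kernel):
            idx = n - k
            if 0 <= idx < len(signal):  # samples outside the signal count as zero
                acc += w * signal[idx]
        power.append(abs(acc / fs) ** 2)  # 1/fs approximates the integral step dt
    return power


# Hypothetical test signal: 1 s of a 10-Hz sine sampled at 200 Hz.
fs = 200.0
sine = [math.sin(2.0 * math.pi * 10.0 * n / fs) for n in range(200)]
print(wavelet_power(sine, fs, 10.0)[100] > wavelet_power(sine, fs, 5.0)[100])
```

Sweeping `f0` from 4 to 12 Hz in 0.1-Hz steps and stacking the resulting power vectors would give a time–frequency map of the kind used as SVM features.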

#### **VALIDATION OF THE SVM CLASSIFIER**

#### *Data and analysis*

A 36-fold cross-validation among the three types of categorized data (unconditioned and conditioned neutral faces and scenery) was performed to evaluate the accuracy of the SVM classifier. The tested EEG data were the same as those used in Study 1 (12 participants). An SVM classifier trained by using 35 data samples was evaluated by using the remaining sample (i.e., the test data); this procedure was repeated, with modifications to the training and test data (a round robin for all the data: 36 rounds), to calculate classification accuracy. The independent parameter *ν* of the *ν*-SVM was fixed at a constant value during this cross-validation (*ν* = 0.5), thereby resulting in changes to the other parameters, such as the optimal objective value of the dual SVM problem and the bias term in the decision function. The kernel parameter *γ* of Equation (3) was set to four values: 0.0001, 0.01, 1, and (number of features)<sup>−1</sup>, and the accuracy of the cross-validation was computed for each kernel parameter. The tolerance of the termination criterion for the SVM was set at *ε*−*6*.
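Because each of the 36 rounds holds out a single sample, the procedure amounts to leave-one-out cross-validation, which the sketch below illustrates. To stay dependency-free, a nearest-centroid classifier is used as a stand-in for the *ν*-SVM; the stand-in, the function names, and the toy features are all assumptions and do not reproduce the study's classifier or data.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Train on all samples but one, test on the held-out one, and repeat."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


def fit_centroids(xs, ys):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(g) for col in zip(*g)] for y, g in groups.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


# Toy features (hypothetical): three well-separated classes, three samples each.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
            [20, 0], [20, 1], [21, 0]]
targets = ["CS+"] * 3 + ["CS-"] * 3 + ["scenery"] * 3
accuracy = leave_one_out_accuracy(features, targets, fit_centroids, predict_centroid)
print(accuracy, "chance level:", 1 / len(set(targets)))
```

With an actual SVM, the `fit` and `predict` callables would wrap the chosen library, and the loop would be repeated for each candidate value of *γ*.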

#### *Classification results*

**Figure 7** shows the results of the 36-fold cross-validation regarding the performance of the SVM classifier in evaluating the negative emotional responses to inexpressive faces determined from the ERP responses. For both feature types, the SVM classifier exhibited an accuracy (80% at the maximum) higher than the chance level (i.e., 33% in each category). Overall, the classification accuracies for the faces and scenery tended to be higher than those for the CS+ and CS− conditions. However, classification accuracy depended on the kernel parameter value. Compared with the time–course data of the SVM classifier, the wavelet-transform data showed stable accuracy in each category (e.g., almost 70%, especially for the kernel parameters *γ* = 0.01 and 1), indicating the existence of an optimal parameter setting.

#### **MORPHING APPLICATION**

After the SVM classifier was trained by the above-mentioned training sample, the face morphing application (Microsoft Visual C++ 2012, OpenCV ver. 2.4, and LIBSVM ver. 3.17) was tested under the hypothesis of real-time data acquisition. When a neutral face associated with a negative emotion is predicted by the SVM classifier, the face morphing application is automatically triggered. As shown in **Figure 5B**, the target neutral face gradually (over about 1–2 s) changed into a negative one (JAFFE database). Therefore, individuals' emotions can be easily estimated using the proposed non-verbal communication tool.

#### **DISCUSSION 2**

In Study 2, the SVM classifier for non-verbal communication was evaluated by identifying the ERP responses to implicit emotional faces. For the control of mechanical or computer devices, active BMI/BCI studies have been based on the features of EEG data on various scalp regions (e.g., P300 responses) (Piccione et al., 2006; Mak et al., 2012). The present study especially focused on the activity of the posterior temporal lobe in relation to face processing; the SVM feature was used after the application of the wavelet transform. In the optimal range of the kernel parameter, the SVM classifier for the time–frequency domain showed stable accuracy (almost 70% in each category). Developing an autotuning method for appropriate parameter setting of the SVM classifier is an advantageous approach because of the individual differences of ERP responses. Thus, wavelet transform data will be effective for such a case. An accuracy higher than 60% will be regarded as a tentative indicator of a successful case for affective BCIs (Nijboer et al., 2009; Mak et al., 2011). A desirable future direction for BMI/BCI research is the development of small embedded systems for the everyday use of face morphing software that can reflect various emotions.

A critical safety problem arises when a malfunction in an active BCI for basic motion (e.g., a wheelchair or an artificial arm) causes a patient to have a severe accident. By contrast, the proposed system is a device intended to enhance quality of life through emotional estimation. When a patient is likely to be experiencing a negative emotion, the people around the patient (e.g., families, friends, and carers) can support him/her by paying particular attention to the emotional estimation. If they recognize or predict the patient's negative emotions (e.g., a 60 or 70% negative feeling rather than a perfect or maximum level for each emotion), the patient could live a full and humane life. Nevertheless, further studies on more accurate classifier methods are required. For example, a cascaded classifier may effectively increase classification accuracy.

Time–frequency analyses can identify the neural activities crucial for emotional face processing, but further modifications are required to obtain higher accuracy in SVM classification. This requirement may be satisfied with the direct image analysis (e.g., the bag of features scheme; Csurka et al., 2004) of contour maps in time–frequency data (Kashihara et al., 2011). Speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) can effectively search for local information on object boundary (Bay et al., 2008). In the bag of features scheme, visual words are generated by the k-means algorithm to cluster the feature vectors of SIFT or SURF and create a visual vocabulary (Duda et al., 2001). Each image can be represented by a histogram of visual words. Therefore, the bag of features scheme may serve as a means of novel interpretation that determines the effective features of an SVM classifier and could extract meaningful changes in time– frequency data.
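The bag of features scheme outlined above can be sketched in a few lines: local descriptors are clustered by k-means into visual words, and each image is then represented by a normalized histogram of word counts. The sketch uses hypothetical 2-D descriptors instead of real SIFT or SURF vectors, initializes the centroids with the first *k* points (an assumption made for determinism), and uses illustrative function names.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
    )


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def word_histogram(descriptors, centroids):
    """Normalized histogram of visual-word assignments for one image."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(d, centroids)] += 1
    return [c / len(descriptors) for c in counts]


# Toy descriptors (hypothetical) pooled over training images.
pooled = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
vocabulary = kmeans(pooled, k=2)
print(word_histogram([[0, 0], [0, 1], [1, 0], [10, 10]], vocabulary))
```

The resulting histograms have a fixed length equal to the vocabulary size, so they could be passed to an SVM as feature vectors regardless of how many descriptors each image yields.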

Several limitations constrain the practical application of the BCI approach adopted in this study. The specific brain region considered in the classification was limited to the right FG area identified by source localization from multiple electrodes attached onto a participant's scalp. Crucial issues for consideration are the separation of brain activities into multiple inputs and movement artifacts, which may influence the results of source localization. This study assumed application under a static measurement situation for patients and excluded active BCIs, such as moving artificial arms and wheelchairs. The next step, therefore, is to develop an accurate identifier that enables the practical application of BCIs and reduces noise; noise contamination of muscle or eye movements must be prevented through improvements in hardware, software, and algorithms.

To realize feature extraction from the frequency domain while reducing external noise, data obtained after the wavelet transform were used in machine learning (i.e., SVM); this application was implemented under the assumption that practical BCI analysis will be a component of future works. Experiment 2 was then performed to show the possibility of passive or affective BCI application for the detection of and response to strong emotions. Therefore, new techniques for wearable hardware devices (robust to body movements) and efficient algorithms (e.g., signal filtering methods, including wavelet analysis) to eliminate external noise are desirable innovations for advancing the practical application of BCIs. In such applications, accurate estimation from single trials is a further factor for consideration.

The conditioned neutral facial stimuli (Experiment 1) were used to induce negative emotions based on previous experiences and to identify the specific brain area that is activated during such responses because dynamic EEG response has not been sufficiently clarified under such situations. As the next step, actual non-verbal communication should be evaluated by using real faces (families, friends, etc.) associated with patients' experiences and emotions.

Finally, the difficulty encountered in BCI training for autistic patients must be considered in the design and development of a practical system. EEG responses to facial stimuli and emotions may not require extensive training on scenarios such as those featuring active BCIs. Because posterior temporal regions are directly correlated with face processing primarily in the right FG, EEG responses might be measured in a straightforward manner, regardless of a specific training program.

### **CONCLUSION**

Inexpressive faces associated with negative experiences induce aversive and unpleasant feelings. The ERP study and source localization revealed that the aversively conditioned neutral faces activated late face processing (600–700 ms), rather than early face processing (e.g., P1 and N170), in the right FG region. Further evaluations of emotionally conditioned faces would elucidate the complicated brain activities involved in social cognition. As a non-verbal communication tool for BMI/BCIs, the proposed SVM classifier may enable the easy detection of inexpressive faces that trigger specific individual emotions. In accordance with this classification, a face on a computer morphs into an unpleasant look. In future studies, the proposed classification method, which uses EEG signals, could be integrated with non-verbal communication tools to enable the expression of other emotions. More accurate classifiers should also be investigated to realize the practical application of BMI/BCIs.

#### **ACKNOWLEDGMENTS**

This study was partially funded by a Grant-in-Aid for Scientific Research (C) from Japan Society for the Promotion of Science (KAKENHI, 25330171). The author would like to thank Nagoya University and JST for providing assistance during the EEG experiment.

#### **REFERENCES**


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 December 2013; accepted: 22 July 2014; published online: 26 August 2014. Citation: Kashihara K (2014) A brain-computer interface for potential non-verbal facial communication based on EEG signals related to specific emotions. Front. Neurosci. 8:244. doi: 10.3389/fnins.2014.00244*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Kashihara. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# A new paradigm to induce mental stress: the Sing-a-Song Stress Test (SSST)

# *Anne-Marie Brouwer\* and Maarten A. Hogervorst*

*TNO, Perceptual and Cognitive Systems, Soesterberg, Netherlands*

#### *Edited by:*

*Cuntai Guan, Institute for Infocomm Research, Singapore*

#### *Reviewed by:*

*Albert Gjedde, University of Copenhagen, Denmark Angela J. Grippo, Northern Illinois University, USA Joanna Kitlinska, Georgetown University, USA*

#### *\*Correspondence:*

*Anne-Marie Brouwer, TNO, Perceptual and Cognitive Systems, PO Box 23, Kampweg 5, 3769 ZG Soesterberg, Netherlands e-mail: anne-marie.brouwer@tno.nl*

We here introduce a new experimental paradigm to induce mental stress in a quick and easy way while adhering to ethical standards and controlling for potential confounds resulting from sensory input and body movements. In our Sing-a-Song Stress Test, participants are presented with neutral messages on a screen, interleaved with 1-min time intervals. The final message is that the participant should sing a song aloud after the interval has elapsed. Participants sit still during the whole procedure. We found that heart rate and skin conductance during the 1-min intervals following the sing-a-song stress message are substantially higher than during intervals following neutral messages. The order of magnitude of the rise is comparable to that achieved by the Trier Social Stress Test. Skin conductance increase correlates positively with experienced stress level as reported by participants. We also simulated stress detection in real time. When using both skin conductance and heart rate, stress is detected for 18 out of 20 participants, approximately 10 s after onset of the sing-a-song message. In conclusion, the Sing-a-Song Stress Test provides a quick, easy, controlled and potent way to induce mental stress and could be helpful in studies ranging from examining physiological effects of mental stress to evaluating interventions to reduce stress.

**Keywords: stress, arousal, paradigm, heart rate, skin conductance**

# **INTRODUCTION**

When studying physiological effects of sudden arousing and negative emotional events, it is desirable to induce a considerable amount of emotional stress in an easy, controlled and efficient manner, while respecting ethical standards with regard to experimental participants. A range of protocols in this direction has been described in the literature. Kreibig (2010) lists experimental paradigms of 134 studies on effects of emotion on autonomic nervous system activity. With respect to the negative emotions fear and anxiety, paradigms that have been used include watching film clips (Eisenberg et al., 1988), viewing pictures from the standardized International Affective Picture System (IAPS) by Lang et al. (2005) (Ritz et al., 2005), listening to sounds from the standardized International Affective Digitized Sound system (IADS) by Bradley and Lang (2007) (Brouwer et al., 2013), listening to music (Krumhansl, 1997), expecting an electrical shock (Bloom and Trautt, 1977), mental imagery (Van Diest et al., 2006), negative events in a gaming context (Brouwer et al., 2011), recall of personal emotional events and directed facial expression (Ekman et al., 1983). The paradigm that has become the worldwide standard for inducing psychological stress is the Trier Social Stress Test (TSST—Kirschbaum et al., 1993). In this paradigm, participants are asked to take over the role of a job applicant. They are informed that they will be giving a speech to a "selection committee" consisting of three people, and that their speech performance will be video and audio recorded for analysis. At the start of the procedure, participants get to see the business-like equipped test room with the committee, wearing white coats. During a stress anticipation interval of 10 min, participants prepare a speech that should convince the selection committee that they are the perfect candidate for a job. Participants are provided with paper and pencil to help them prepare. 
Following the stress anticipation interval, they deliver their speech standing in front of the selection committee. The members of the committee respond in a standardized, non-empathic way. After 5 min of free speech, the participant is instructed to perform a serial subtraction task aloud, having to start over again after each mistake. In a later version, the paradigm was adapted slightly to involve two to three people in the selection committee and only 3 min of preparation time (Kudielka et al., 2007).

The TSST is considered the best available paradigm to reliably elicit strong neuroendocrine stress responses (Dickerson and Kemeny, 2004). Responses of the sympathetic nervous system are also strong in comparison to other paradigms. Heart rate increases in the order of 17 bpm (Kirschbaum et al., 1993; 20 bpm in a TSST for groups: von Dawans et al., 2011) when comparing a baseline measurement to a measurement taken at the end of the 10-min preparation interval. In a public speaking task that is comparable to the TSST but that made use of a virtual rather than a real audience (Westenberg et al., 2009), heart rate increased by about 10 bpm and skin conductance increased by approximately 2 µS. In their TSST review paper, Kudielka et al. (2007) mention a heart rate increase of 15–25 bpm in response to the psychosocial stressor. It is unclear, though, which heart rate values are compared to arrive at this estimate. It probably involves heart rate during performance of the task, which is arguably the time at which the most stress is experienced but which is also confounded by standing and talking, such that part of the increase in heart rate is likely due to physical activity. As a rough comparison to physiological effects found using the TSST, viewing arousing pictures with negative valence usually elicits short rises in skin conductance of at most a few tenths of a µS (Codispoti et al., 2001; Codispoti and De Cesarei, 2007) while heart rate decelerates by about one or at most a few bpm (Codispoti and De Cesarei, 2007; Brouwer et al., 2013).

The stress elicited by the TSST is multifaceted. Participants perform cognitive tasks while preparing and delivering a speech, and they are standing upright and talking, but the element that is considered to be mainly responsible for at least the strong neuroendocrine stress responses is social-evaluative threat (Dickerson and Kemeny, 2004). Social-evaluative threat occurs when an important aspect of the self-identity is, or could be, negatively judged by others.

While the TSST elicits reliable stress responses and is a well-studied paradigm, it is a relatively complicated procedure geared to examining neuroendocrine responses. We present an easy, short and effective stress-inducing paradigm that is geared toward examining stress responses controlled by the sympathetic nervous system such as skin conductance and heart rate. These stress responses typically occur faster than neuroendocrine responses. However, they can also be strongly affected by posture or body movements such as speech, so it is important to keep these constant across stress levels. In our paradigm, participants are presented with neutral messages on a screen, interleaved with 1-min time intervals. The final message is that the participant should sing a song aloud after the interval has elapsed. Participants sit still during the whole procedure. Skin conductance and heart rate are compared between time intervals that follow neutral messages and the one that follows presentation of the sing-a-song "stress" message. In contrast to the TSST and a number of other stress-inducing paradigms, both body movements and sensory input are kept constant while only mental stress is varied. This ensures that effects of the stress message on skin conductance and heart rate can only be attributed to psychological processes, and are not caused or partly caused by an increase in visual input or by the fact that the individual starts moving or talking.

In accordance with the stress-inducing potency of social-evaluative threat, earlier studies indicate that anticipating singing, and watching oneself sing together with an audience, does indeed elicit emotional stress. In a study by Harris (2001) participants sang "Star Spangled Banner" while being recorded on video. After singing, there was a 10-min break followed by a 6-min relaxation period during which physiological baseline measures were taken. Then, the participant watched the video tape together with the experimenter and two confederates. No instructions were given to the participants on how to behave during this viewing interval. During the first minute of the viewing interval, heart rate was 4.5 (first study) or 3 (second study) bpm higher than during the baseline measurement. Changes in heart rate were significantly correlated with subjective ratings of feeling calm (*r* = −0.43) and feeling embarrassed (*r* = 0.38). Hofmann et al. (2006) investigated autonomic correlates of social anxiety and embarrassment in shy and non-shy individuals. They informed their participants upon arrival in the lab that they would be asked to give a speech and to sing in front of a video camera. Their results indicated that anticipating singing results in a stronger increase of heart rate (compared to an eyes-closed baseline) than either anticipating giving a speech or watching videos of the performance together with confederates. Average heart rate increase was 11 bpm, and skin conductance increased by approximately 4 µS.

The main aim of the present study was to examine the potency of our Sing-a-Song Stress Test to elicit stress-related increases in heart rate and skin conductance. On the one hand, we expected the test to be effective given the relatively strong effects found in previous studies on social-evaluative threat and singing; on the other hand, effect sizes might be expected to be small since we excluded confounds that possibly contributed to effects in other studies. We checked whether any increases could be detected by a change detection algorithm on the level of an individual participant, simulating an online stress detection algorithm in a situation in which a participant is sitting still. Ultimately, we are interested in online and real-life detection of sudden stress based on physiological signals, if possible without attaching sensors to the body. Wieringa et al. (2005) showed that it is possible to recover heart rate from no-contact recorded images. Poh et al. (2010) described a robust algorithm to determine heart rate from the varying color of a face as recorded by a camera. A similar algorithm has been exploited in Vital Signs Camera technology by Philips. To compare the performance of this technology to heart rate as measured using traditional electrocardiography (ECG), we measured heart rate using ECG and a camera simultaneously. A previous experiment indicated that heart rate based on camera images is very similar to heart rate extracted from ECG when averaging across 2-min intervals (Hogervorst et al., 2013).

# **MATERIALS AND METHODS**

### **PARTICIPANTS**

Twenty-five participants (15 female, 10 male) took part in the experiment. They were between 19 and 50 years old with a mean age of 30 and a standard deviation of 11 years. None of them suffered from heart disease or took drugs that affect heart rate, as verified through a questionnaire. Participants received a monetary reward to make up for their travel and time. The study is in accordance with the Declaration of Helsinki and has been approved by the local ethics committee. All participants signed an informed consent form prior to taking part in the experiment.

#### **MATERIALS**

#### *Stimuli*

We selected nine phrases that were approximately of the same length and structure as the 10th sentence which was the following (translated into English): "Task: start singing a song aloud when the counter reaches zero. Keep sitting still until that moment." Since we did not want the other nine phrases to elicit stress, we picked neutral phrases from the Dutch Wikipedia site about vacuum cleaners. A translated example is "Phrase: the first vacuum cleaner was constructed by Sweep Company. This was in 1907 and the device was called hoover."

#### *ECG and skin conductance*

To record ECG, we used self-adhesive Kendall Neonatal ECG electrodes (TYCO healthcare, Neustadt, Germany). The ECG channel electrode was placed at the sixth left intercostal space (midclavicular line); the reference electrode was placed at the first right intercostal space (midclavicular line); the ECG ground electrode was placed at the sixth right intercostal space (midclavicular line). Recording frequency was 256 Hz. To record skin conductance, we used the g.GSRsensor2 from G.tec Medical engineering (G.tec GmbH, Schiedlberg, Austria). Electrodes were applied to the fingertips of the index finger and the middle finger. Physiological data was processed by a G.tec USB Biosignal amplifier.

### *Vital Signs camera*

Heart rate was also determined using a camera and Philips Vital Signs camera software. The camera was an IDS UI-2220SE-C-HQ (uEye SE) 12-bit color camera, recording raw RGB video, 8 bits per channel, at 20 Hz and 768 × 576 pixels. The Vital Signs camera software requires the user to initially indicate a skin location by drawing a rectangle around a suitable face region as displayed on a monitor. This region is then automatically tracked. The Vital Signs camera software derives a measure of heart rate from subtle variations in skin color. Each measure is based on a window of 25.6 s and the resulting heart frequency is reported with a resolution of 0.039 Hz or 2.34 bpm at an output frequency of 20 Hz.
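The two resolution figures are consistent with each other if one assumes that the frequency resolution is simply the reciprocal of the 25.6 s analysis window (an assumption; the internals of the software are not described here):

$$
\Delta f = \frac{1}{T} = \frac{1}{25.6\ \mathrm{s}} \approx 0.039\ \mathrm{Hz},
\qquad
\Delta \mathrm{HR} = 60\ \mathrm{s/min} \times \Delta f \approx 2.34\ \mathrm{bpm}.
$$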

### **PROCEDURE**

Participants were not given any instructions prior to the study, e.g., with respect to eating and drinking or physical activity prior to coming into the lab. The participant, as well as two confederate "participants" (one male and one female), were picked up by the experimental leader from the waiting area. In the experimental room, the experimental leader explained that they would be sitting in turn behind the monitor while heart rate and skin conductance were monitored and they were filmed by a camera. The task description was to sit as still as possible and silently read the messages that appeared on the monitor, alternating with a counter counting down from 60 to 0 s. They were told that one of the messages could entail a task that they then needed to carry out after the subsequent counter reached 0. Participants were not told that the experiment was about stress or involved singing. Subsequently, the participant and the confederates filled out the informed consent form as well as a questionnaire with general questions about age, use of cigarettes, coffee and alcohol, physical activity and cardiovascular health. The experimental leader then appointed the real participant as the one to start, seated him or her behind the monitor and then attached the physiological sensors. The monitor was in between the participant and the confederates, who were sitting on chairs facing a wall such that the angle between their gaze and that of the participant was 90 degrees. After the experiment, participants were asked whether they had believed that the confederates were experimental participants. They were also asked to indicate the stress they experienced during the minute before singing on a scale of 1 (not stressed at all) to 10 (extremely stressed). The experimental room did not have windows to the outside. Lighting was constant office lighting.

### **ANALYSIS**

# *ECG and skin conductance*

The ECG signal was filtered by the G.tec amplifier with a 0.5–100 Hz band-pass filter, and afterwards by a 2.5 Hz high-pass two-sided Butterworth filter. The R wave to R wave interval (RRI, the interval between successive heart beats) was extracted from this signal. Outliers resulting from the RRI derivation were determined by calculating differences from a moving median as calculated over 30 samples. Values outside the 5–95% quantile range were removed. For each participant, we then determined the mean RRI of each of the 1-min count-down intervals (blocks), and mean RRI was converted to HR (bpm). Also, mean skin conductance level was calculated over the 1-min blocks.
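The outlier removal and the conversion from RRI to heart rate can be sketched as follows. This is a minimal illustration, not the original analysis code: the function names are hypothetical, the moving median is assumed to be centered on the current sample, and the 5–95% range is assumed to apply to the deviations from that median.

```python
import statistics

def remove_rri_outliers(rri, window=30):
    """Drop RR intervals whose deviation from a moving median is extreme.

    Assumption: the quantile range is taken over the deviations from a
    centered moving median of about `window` samples.
    """
    half = window // 2
    deviations = []
    for i, value in enumerate(rri):
        neighbourhood = rri[max(0, i - half): i + half + 1]
        deviations.append(value - statistics.median(neighbourhood))
    # n=20 yields cut points at 5%, 10%, ..., 95%
    cuts = statistics.quantiles(deviations, n=20)
    low, high = cuts[0], cuts[-1]
    return [v for v, d in zip(rri, deviations) if low <= d <= high]

def mean_heart_rate_bpm(rri_seconds):
    """Convert the mean RR interval (in seconds) of a block to beats per minute."""
    return 60.0 / statistics.mean(rri_seconds)

# Example: a steady 0.8 s rhythm with one spurious double-length interval
rri = [0.8] * 20 + [1.6] + [0.8] * 20
clean = remove_rri_outliers(rri)
block_hr = mean_heart_rate_bpm(clean)
```

Note that heart rate is derived from the mean RRI of a block, as described above, rather than by averaging beat-by-beat heart rates; the two are not identical when RRI varies within a block.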

### *Statistical analysis*

In order to test whether heart rate and skin conductance show a sudden increase after presentation of the sing-a-song sentence, we performed paired *t*-tests on values of heart rate and skin conductance averaged over the 1-min blocks after offset of the 9th (vacuum cleaner) message and the 10th (sing-a-song) message, for each participant. Correlations were computed between subjective stress ratings and differences in heart rate between the 9th and the 10th block as well as between subjective stress ratings and differences in skin conductance between the 9th and the 10th block. Positive values indicate higher values in the 10th than in the 9th block.
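As an illustration of these two computations, the sketch below calculates the paired *t* statistic and a Pearson correlation using only the Python standard library. The function names and the example numbers are invented for illustration; they are not data from this study, and the *p*-value is omitted because it requires the *t*-distribution, which the standard library does not provide.

```python
import math
import statistics

def paired_t(block9, block10):
    """Return (t, degrees of freedom, mean difference) for paired samples.

    Differences are block 10 minus block 9, so positive values indicate
    higher values in the 10th block.
    """
    diffs = [b - a for a, b in zip(block9, block10)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error, n - 1, mean_diff

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical block means (bpm) for four participants
hr_block9 = [70.0, 65.0, 80.0, 72.0]
hr_block10 = [85.0, 75.0, 98.0, 80.0]
t_value, dof, mean_diff = paired_t(hr_block9, hr_block10)
```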

# *Stress detector*

By designing a set of stress detectors based on detecting changes in state, we wanted to get an impression of the extent to which the onset of the sing-a-song message could be detected for a single individual and in real time. Note that it was not our goal to design an optimal detector. Performance of these detectors can thus be considered to be the lower bound to performance under the circumstances tested.

# *Basic change detection algorithm*

The "normal" state of the signal was described by calculating the average and standard deviation of the signal during blocks 1–9 of the experiment, for each of the participants separately. Next, a normalized signal, akin to the *z*-score, was computed by subtracting the mean signal and dividing this difference by the standard deviation. A threshold was set above which a signal was considered to be deviating from normal, therewith signaling a change of state. Here, a change was assumed to correspond to the onset of stress. For each of the types of signals (as used in different detector types as described below) a suitable threshold was chosen, such that a change in physiological state was most often detected in block 10 and least often in other blocks (false alarms).
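A minimal sketch of this normalization and thresholding step is given below. The function names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not specify it.

```python
import statistics

def normalize(signal, n_baseline):
    """z-score-like normalization using the first `n_baseline` samples
    (here standing in for blocks 1-9) as the 'normal' state."""
    baseline = signal[:n_baseline]
    mu = statistics.mean(baseline)
    sd = statistics.stdev(baseline)  # assumption: sample standard deviation
    return [(s - mu) / sd for s in signal]

def threshold_crossings(z, threshold):
    """Indices at which the normalized signal exceeds the threshold."""
    return [i for i, value in enumerate(z) if value > threshold]

# Invented heart rate trace: ten baseline samples, then a sudden rise
trace = [70, 71, 69, 70, 72, 68, 70, 71, 69, 70, 70, 90]
z = normalize(trace, n_baseline=10)
detected = threshold_crossings(z, threshold=4)
```

With baseline statistics computed in this way, a crossing inside the baseline period counts as a false alarm and a crossing after it as a detection.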

#### *Detector types*

Detectors were designed that made use of the various types of information available. Two detectors were based on heart rate only: one was based on heart rate derived from the ECG signal and one on heart rate derived from the camera data. The stress detection threshold for these detectors was set to a value of 4.

For skin conductance we used two different types of detectors. One detector used the raw signal with a threshold value of 6. Another detector used a filtered version of skin conductance. Filtering was done by subtracting the mean over the interval starting at 33 s and ending 3 s before the current moment from the mean over the last 3 s. This filter acts as an edge detector and eliminates slow variations in the skin conductance. The filtered signal was used as signal input to the basic change detection algorithm. We used a threshold value of 6 for this skin conductance edge detector (i.e., the same as for the skin conductance detector based on the raw signal). The skin conductance edge detector turned out to perform better than the detector based on the raw signal due to the existence of slow variations in the skin conductance level.
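The edge filter described above can be sketched as follows. The function name, the `fs` sampling-rate parameter and the handling of the first 33 s (where no full history is available yet) are assumptions made for illustration.

```python
from statistics import mean

def edge_filter(signal, fs, recent_s=3.0, history_s=33.0):
    """Mean over the last `recent_s` seconds minus the mean over the interval
    from `history_s` seconds ago up to `recent_s` seconds ago.

    Returns None for samples that do not yet have a full history.
    """
    recent = int(round(recent_s * fs))
    history = int(round(history_s * fs))
    filtered = []
    for t in range(len(signal)):
        end = t + 1  # slices below include the current sample
        if end < history:
            filtered.append(None)
            continue
        recent_mean = mean(signal[end - recent:end])
        past_mean = mean(signal[end - history:end - recent])
        filtered.append(recent_mean - past_mean)
    return filtered

# Invented skin conductance trace sampled at 1 Hz: a step from 2 to 5 µS
step = [2.0] * 40 + [5.0] * 10
edges = edge_filter(step, fs=1)
```

A constant level yields an output of zero, whereas a sudden step produces a transient positive output, which is what makes the filtered signal a suitable input for the threshold-based detector.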

Finally, a detector was constructed in which the skin conductance edge output and ECG heart rate were combined. Both signals were converted to a normalized *z*-score before they were combined by calculating the square root over the sum of squares and dividing by the square root of 2: $z = \sqrt{x^2 + y^2}/\sqrt{2}$ (where $x$ is the *z*-score of the skin conductance signal, $y$ is the *z*-score of heart rate, and $z$ is the combined *z*-score). In correspondence with the individual thresholds (6 and 4) the threshold for the combined signal was set to 5.
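The combination rule is small enough to check numerically. In the sketch below (the function name is hypothetical), inserting the two individual thresholds gives a combined value of √26 ≈ 5.1, which is close to the combined threshold of 5 that was chosen.

```python
import math

def combined_z(x, y):
    """Combine the skin conductance z-score `x` and the heart rate z-score `y`."""
    return math.sqrt(x * x + y * y) / math.sqrt(2)

# Both signals exactly at their individual thresholds (6 and 4)
at_thresholds = combined_z(6.0, 4.0)

# Two equal z-scores are left unchanged by the combination
unchanged = combined_z(5.0, 5.0)
```

Because the inputs are squared, the combined score does not distinguish increases from decreases; in this sketch it is simply assumed that the relevant deviations are increases.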

# **RESULTS**

#### **GENERAL**

All 25 participants started singing. Three participants accidentally started to sing too early, that is, before the counter had reached zero. After the experiment had ended, four participants stated that they had doubted that the confederates were real participants. None of them was a participant who started singing early. The data of the three early singing participants are excluded from further analyses to prevent confounding the mental stress interval with body movements.

#### **HEART RATE RESPONSE**

Heart rate responses to the presentation of the sing-a-song sentence were generally very strong as indicated by the data of an example participant in **Figure 1A**. For every participant, heart rate was on average higher in the 10th than in the 9th block. **Figure 2A** indicates the heart rate for each of the 10 blocks averaged across participants. A paired *t*-test indicates a significant difference between the 9th and the 10th block [*t*(21) = 6.21, *p* < 0.001, mean difference: 15.3 bpm, Cohen's d: 1.63; paired *t*-tests comparing the 10th block to all other blocks resulted in *p*-values < 0.001 as well].

#### **SKIN CONDUCTANCE RESPONSE**

Like heart rate responses, skin conductance responses to the presentation of the sing-a-song sentence were generally very strong. Data of an example participant are presented in **Figure 1B**. For two of the participants, skin conductance recording failed. For all of the 20 remaining participants, skin conductance was on average higher in the 10th than in the 9th block. **Figure 2B** indicates the average skin conductance in the different blocks. A paired *t*-test indicates a significant difference between the 9th and the 10th block [*t*(19) = 5.37, *p* < 0.001, mean difference: 10.9 µS, Cohen's d: 0.81; paired *t*-tests comparing the 10th block to all other blocks resulted in *p*-values < 0.001 as well].

#### **SUBJECTIVE RATINGS**

The subjective rating of experienced stress was missing for one out of 22 participants. Ratings varied from 2 to 10 with a mean of 6.4 and a standard deviation of 2.1. As indicated in **Figure 3A**, subjective ratings correlated with skin conductance response (i.e., the difference in mean skin conductance level between the 9th and the 10th block: *r* = 0.50, *p* = 0.03). We did not find a significant correlation between subjective ratings and heart rate response (**Figure 3B**; *r* = 0.26, *p* = 0.26).

block following the sing-a-song sentence. Error bars indicate standard errors of the mean. Stars indicate that heart rate and skin conductance are significantly higher in block 10 compared to all other blocks at an alpha level of 0.001.

#### **VITAL SIGNS CAMERA SYSTEM**

There is a strong correspondence between heart rate as determined by the Vital Signs camera system and ECG. The example data in **Figure 1A** show that while the high-frequency heart rate variability cannot be closely followed due to limited resolution, the heart rate as given by the camera system closely follows the ECG data. The scatter plot presented in **Figure 4** represents average heart rate values for each of the 25 participants and each 1-min block as based on the camera and ECG. On average, ECG indicates a heart rate that is 0.55 bpm faster (*SD* = 2.22) than the Vital Signs Camera system. The median indicates a 0.25 bpm faster heart rate. The mean absolute difference is 0.92 bpm (*SD* = 2.09). The median absolute difference is 0.38 bpm. Of the heart rate values measured by the camera, 77% are within 1 bpm of the ECG value and 90% are within 2 bpm.
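The agreement measures reported above (signed and absolute differences, and the proportion of values within a given margin) can be computed along the following lines. The function name and the example values are hypothetical; each list entry is assumed to be one participant-block average in bpm.

```python
import statistics

def agreement(ecg_bpm, camera_bpm):
    """Summarize agreement between paired ECG and camera heart rate values."""
    diffs = [e - c for e, c in zip(ecg_bpm, camera_bpm)]  # positive: ECG faster
    abs_diffs = [abs(d) for d in diffs]
    n = len(diffs)
    return {
        "mean_diff": statistics.mean(diffs),
        "median_diff": statistics.median(diffs),
        "mean_abs_diff": statistics.mean(abs_diffs),
        "median_abs_diff": statistics.median(abs_diffs),
        "within_1_bpm": sum(d <= 1.0 for d in abs_diffs) / n,
        "within_2_bpm": sum(d <= 2.0 for d in abs_diffs) / n,
    }

# Invented block averages for illustration only
summary = agreement([70.0, 80.0, 90.0, 100.0], [69.5, 80.5, 88.5, 97.0])
```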

#### **STRESS DETECTOR**

**Table 1** shows the results of the various stress detectors for 20 participants, since the data of the three participants who started to sing early and the data of the two participants with failed skin conductance recording were excluded. Shown are the number of participants for which a state change was detected in block 10 ("Detections") and the total number of detected state changes during blocks 1–9 (false alarms: "FAs") over all participants. The false alarms often occurred at the start of the experiment and seemed to be at least partly due to signal recording artifacts. When false alarms occurring during the first 20 s of the experiment were not counted, this resulted in a 60% reduction of false alarms as indicated in the table. The last two columns show the mean and median detection times (DTs), indicating the time between the start of block 10 and detection. For determining detection times, we only used blocks in which a state change was detected. These detection times are quite short, especially considering that the participant and his/her physiological system need some time to react to the stimulus.

**FIGURE 3 | Association between subjective stress and (A) skin conductance and (B) heart rate response.** Responses are determined by the difference in average skin conductance level or heart rate between the 9th and the 10th block and plotted against subjective stress rating for each participant. The correlation between subjective stress and skin conductance response was significant; the correlation between subjective stress and heart rate response was not. The plotted lines are linear regression lines.

*"Detections" are the number (and %) of participants out of 20 for which stress was detected in the 10th block. "FAs" are the total number of false alarms ("stress detections" in block 1–9). "FAs after 20 s" represents the same but after excluding the first 20 s of block 1. "Mean DT" and "Median DT" (detection time in s) indicate when stress was detected relative to the onset of the sing-a-song sentence.*
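The quantities reported for each detector (detections, false alarms, false alarms after the first 20 s, and detection times) could be tallied as in the sketch below. The function name, the input format (one list of state-change times in seconds per participant) and the block 10 onset time are assumptions for illustration, not details given in the text.

```python
from statistics import mean, median

def summarize_detections(event_times, block10_onset_s, ignore_first_s=20.0):
    """Tally detections, false alarms and detection times over participants.

    `event_times` holds, per participant, the times (s from the start of the
    experiment) at which the detector signaled a state change.
    """
    detections = 0
    false_alarms = 0
    false_alarms_late = 0
    detection_times = []
    for times in event_times:
        before = [t for t in times if t < block10_onset_s]
        after = [t for t in times if t >= block10_onset_s]
        false_alarms += len(before)
        false_alarms_late += sum(1 for t in before if t >= ignore_first_s)
        if after:
            # Detection time is only defined when a state change was detected
            detections += 1
            detection_times.append(min(after) - block10_onset_s)
    return {
        "detections": detections,
        "false_alarms": false_alarms,
        "false_alarms_late": false_alarms_late,
        "mean_dt": mean(detection_times) if detection_times else None,
        "median_dt": median(detection_times) if detection_times else None,
    }

# Invented events for four participants; block 10 assumed to start at 540 s
result = summarize_detections(
    [[5.0, 560.0], [555.0], [], [300.0]], block10_onset_s=540.0
)
```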

Of the single sensor detectors the skin conductance edge detector performs best, detecting a change in the 10th block for 85% of the participants. The detector that uses the raw skin conductance signal shows the lowest detection performance (75% of the participants) indicating that preprocessing is required to compensate for the slow variations in the raw signal.

The detector that uses heart rate as derived from ECG seems to perform better than the one based on the camera in terms of the number of detections and the number of false alarms. Whereas both signals are largely the same, the resolution of the camera signal is limited resulting in more noise.

The detector that uses a combination of heart rate (from ECG) and filtered skin conductance performs best, with a detection rate of 90% (18 out of 20 subjects) and only two false alarms, both of which occurred during the first 20 s of the experiment. The missed detections by the five different detectors originate from eight participants, labeled "A" through "H" in **Table 2**. This table shows the overlap of participants with missed detections across the different detectors. For one of the participants ("A"), none of the detectors detected a state change in block 10. It is possible that this participant simply did not experience stress. Participants B and C did not show a detectable skin conductance response to the stress message while heart rate did show a response. The converse is true for participants D and E. Noisier variants of skin conductance (i.e., the raw data version) and heart rate (i.e., the camera version) resulted in effects below the detection threshold for three participants for one or the other signal (participants F, G, and H).

# **DISCUSSION**

#### **THE SING-A-SONG STRESS TEST**

The Sing-a-Song Stress Test proved to be an effective way to induce stress: rises in heart rate and skin conductance across 1 min, or even a fraction thereof, were of the same order as those elicited by the TSST viewed over 10 min. Rises in heart rate and skin conductance responses were an order of magnitude larger than those generally reported to be elicited by perceptual stimuli. Most subjective stress ratings reflected moderate to strong experienced stress, and individual subjective ratings correlated with rises in skin conductance. For heart rate the correlation was in the same direction, but it was not significant. As opposed to the TSST, our paradigm features a sudden stressor with known onset, which makes it suitable to investigate physiological responses in short time windows. This paradigm also offers a clear within-subject baseline that controls for sensory input and body movement, and that is close in time to the stress interval (i.e., the 9th "neutral" count-down block as compared to the 10th "stress" count-down block). Since the paradigm only involves a passive audience of two confederates who are, during the experiment, not directly in the line of sight of the participant, possible effects of confederates behaving differently toward different participants are minimized. The limited role of the confederates would also potentially enable the use of different confederates within one project. The Sing-a-Song Stress procedure is short, with a complete experiment lasting about 11 min. The only equipment that is needed is a computer and a screen presenting messages at fixed time intervals. The limited number and role of the confederates, the short procedure and the fact that no special equipment is needed make the paradigm low-cost. Since participants do not move and are in front of a monitor rather than in front of a real-life audience, the test could in principle be applied in an MRI scanner (as also addressed in Quaedflieg et al., 2013). Stress induction paradigms, such as the Sing-a-Song Stress Test, can be useful in a range of stress-related studies (Kudielka et al., 2007). Examples include validating stress response sensitivity of physiological sensors or signal processing algorithms, evaluating interventions to reduce stress or examining the association between personality characteristics and stress responses.

#### **Table 2 | Participants without detected stress response for at least one detector.**

*The missed detections per detector equal the number of detections as presented in Table 1 subtracted from 20 (the number of included participants).*

#### **LIMITATIONS OF THE PARADIGM, POSSIBLE IMPROVEMENTS AND FUTURE EXPERIMENTS**

The instruction to sing only after the 10th count-down interval was not followed by three out of 25 participants. They started singing earlier than that and we had to exclude these data to prevent confounding the mental stress interval with body movements. If we had included data of these participants, mean heart rate increase would still have been 15.3 bpm (as is the average heart rate increase without these participants), while the skin conductance increase would on average have been larger (12.6 µS rather than 10.9 µS). This stronger response would be consistent both with an effect of movement or a different kind of breathing, and with early singers being more stressed than the other participants, which would arguably be consistent with the fact that they were not able to follow the instruction.

The Sing-a-Song Stress Test induces stress by asking participants to sing a song after a designated interval is over. While physiological effects can only be due to mental processes, we do not know exactly which mental processes contribute to the observed effects, similar to the TSST. Both the TSST and the Sing-a-Song Stress Test involve mental workload besides social anxiety and (anticipated) embarrassment. The cognitive task of our participants was arguably easier than preparing a speech, but they did need to choose a song to sing. Increasing workload (thereby probably also increasing stress) has been found to be associated with increasing heart rate and skin conductance. In an experiment that varied workload while keeping perceptual input and movement constant, heart rate was about 4 bpm higher and skin conductance level about 1.5 µS higher in a quite difficult high workload condition compared to a very easy, low workload condition (Brouwer et al., 2014). While it may be impossible to completely disentangle mental processes such as workload and more emotion-related stress for all individuals, as well as to properly equate the strengths of these processes, it seems that in general effects of workload on physiology are less strong than effects of emotional stress (Harris, 2001; Hofmann et al., 2006). Future experiments using the Sing-a-Song Stress Test could dissociate workload and emotional stress to some extent by telling participants which well-known song they are supposed to sing or by replacing the neutral messages by neutral mental tasks. Also, overall mental stress may be increased by increasing the workload aspect of the task (e.g., "sing a song that is about a city and that none of the participants before you have sung").

Stress responses when repeating the paradigm in the same individual could reveal to what extent there is habituation. If stress can be repeatedly induced, this would increase the value of the test for fMRI studies. For the TSST and other stress protocols, habituation of the neuroendocrine processes has been reported but the sympathetic nervous system has been found to respond more uniformly to repeated exposure to psychosocial challenge (Kudielka et al., 2007).

Future studies that may help to further improve the paradigm include examining the effect of the number of confederates (with a minimum of zero, where the experimental leader is the only audience available) as well as the effect of the role they play. In the present study, the confederates waited in the waiting area and filled out forms so as to make the participants believe that they were really fellow participants, but this may not be necessary. Four of the 25 participants indicated after the experiment that they had doubted that the confederates were real participants. While this number of participants is obviously too low to allow firm statements on the possible effect of believing the scenario, these four participants did not show a trend toward weaker physiological or subjective stress responses. Another way of improving the paradigm could be to reduce the number of neutral messages. **Figure 2** shows that there is not much heart rate and skin conductance variation over the first nine blocks, suggesting that, e.g., four rather than nine blocks may be sufficient. Finally, future experiments should include a baseline subjective stress rating before the experiment starts, and not only after the stress interval. While the time of this baseline would differ from that of the physiological baseline at block 9, taking the difference between the baseline and the subjective rating of experienced stress in block 10 would remove the effect of differences between participants in how they rate their own subjective stress in general. This difference measure would probably constitute a more sensitive subjective stress measure that is expected to correlate better with skin conductance and heart rate responses.

### **DETECTING SUDDEN INCREASE IN STRESS: APPLICATIONS**

This study provides a clear and well-controlled example of the strong effects of mental stress on heart rate and skin conductance. We implemented simple algorithms that could be used in real time to identify a sudden increase in stress within an individual and showed that this was also possible, albeit to a lesser extent, using only a camera-based heart rate detection system. This may enable potential applications. Examples are the evaluation of interventions (e.g., scents) that reduce the stress-inducing quality of certain stimuli (e.g., a dentist drill) or supporting a therapist when administering anxiety-evoking stimuli in exposure therapy (Popovic et al., 2006). Applications of stress detectors would be particularly valuable in cases where individuals cannot or do not want to express the stress they experience. This could be the case in patients who cannot express themselves effectively due to physical or mental limitations. Stress detectors may help to clarify how they experience certain medical treatments and how these treatments can be modified to elicit less stress in general or in a particular individual. In the safety domain, stress detectors (from a distance) may be one way to increase the chances of picking the appropriate individuals from a waiting line for search by looking at responses to stimuli like the appearance of a drug-sniffing dog. Of course, many of these situations do involve potential effects of movement that we here carefully excluded. These effects will give rise to false alarms. False alarm rates could be reduced by taking into account concurrent measurements of movements. It may also be necessary to limit application of such a stress monitor to situations where there is very little movement. Regardless, in any application the presence of false alarms should be dealt with, e.g., by treating the system as one of several (fallible) stress indicators, or, in the case of evaluating interventions, by looking at the mean of large data sets.
Not discussed here, but essential when designing applications around mental stress detection are privacy concerns.

#### **CONCLUSIONS**

By making use of the relatively strong effects of social evaluative threat on physiological measures of stress, we designed a new experimental paradigm to induce stress at a clearly specified moment in time in an easy and effective way. The effects on heart rate and skin conductance turned out to be strong and detectable at the level of an individual participant. We hope that the Sing-a-Song Stress Test will facilitate standardized studies on mental stress that are controlled for sensory and movement confounds.

#### **ACKNOWLEDGMENTS**

Thanks to Manje Brinkhuis for running the experiment, and members of the project team Wouter Vos and Marc Menkhorst. Financially enabled by the Netherlands Ministry of Safety and Justice within the programme "Innovatie Maatschappelijke Veiligheid" initiated by the Center "Innovatie en Veiligheid," Utrecht. Additional support came from the BrainGain Smart Mix Programme of the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture and Science.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 January 2014; accepted: 08 July 2014; published online: 29 July 2014. Citation: Brouwer A-M and Hogervorst MA (2014) A new paradigm to induce mental stress: the Sing-a-Song Stress Test (SSST). Front. Neurosci. 8:224. doi: 10.3389/fnins.2014.00224*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Brouwer and Hogervorst. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# GASICA: generic automated stress induction and control application design of an application for controlling the stress state

# *Benny van der Vijgh<sup>1,2</sup>\*, Robbert J. Beun<sup>1</sup>, Maarten van Rood<sup>1,2</sup> and Peter Werkhoven<sup>1</sup>*

*<sup>1</sup> Buys Ballot Laboratory, Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands <sup>2</sup> Department of Neurology and Neurosurgery, University Medical Center Utrecht, Utrecht, Netherlands*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Dennis J. McFarland, Wadsworth Center for Laboratories and Research, USA Sebastian Grissmann, LEAD Graduate School, Germany*

#### *\*Correspondence:*

*Benny van der Vijgh, Department of Neurology and Neurosurgery, University Medical Center Utrecht, Universiteitsweg 100, Utrecht 3584 CG, Netherlands e-mail: b.h.vandervijgh@umcutrecht.nl*

In a multitude of research and therapy paradigms it is relevant to know, and desirably to control, the stress state of a patient or participant. Examples include research paradigms in which the stress state is the dependent or independent variable, or therapy paradigms where this state indicates the boundaries of the therapy. To our knowledge, no application currently exists that focuses specifically on the automated control of the stress state while at the same time being generic enough to be used for various therapy and research purposes. Therefore, we introduce GASICA, an application aimed at the automated control of the stress state in a multitude of therapy and research paradigms. The application consists of three components: a digital stressor game, a set of measurement devices, and a feedback model. These three components form a closed loop (called a *biocybernetic loop* by Pope et al., 1995 and Fairclough, 2009) that continuously presents an acute psychological stressor, measures several physiological responses to this stressor, and adjusts the stressor intensity based on these measurements by means of the feedback model, thereby aiming to control the stress state. In this manner GASICA presents multidimensional and ecologically valid stressors while remaining continuously in control of the form and intensity of the presented stressors. Furthermore, the application is designed as a modular open-source application in which different therapy and research tasks can easily be implemented using a high-level programming interface and configuration file, and it allows for the addition of (existing) measurement equipment, making it usable for various paradigms.

**Keywords: GASICA, stress state, stress state control, psychological stressor, physiological response, feedback model, stressor game**

# **INTRODUCTION**

For various cognitive and affective (neuroscience) research and therapy purposes, it is important to have information on the internal stress state of patients and participants and, desirably, to exert control over this stress state. For instance, during exposure therapy it is important to have information concerning the internal stress state of a patient and, depending on the preference of the therapist, either to control this stress state to keep psychological responses to the presented stimuli within certain boundaries, or to leave the state as is and use the received information to determine which states are optimal for a specific therapy. The same holds for various kinds of cognitive and affective research, e.g., memory and risk-taking research, where the stress state of a participant can act as a confounding factor.

Currently, to our knowledge, no application exists that focuses specifically on the automated control of the stress state while at the same time being generic enough to be used for various therapy and research purposes. Tasks exist that allow the presented stressor intensity to be adjusted, for example the Montreal Imaging Stress Task (MIST, Dedovic et al., 2005). However, this adjustment cannot be made in real time: the intensity can only be set at the start of the task, which does not allow control over the stress state during the therapy or research paradigm. Therefore, we propose a new application in this paper, dubbed GASICA: *Generic Automated Stress Induction and Control Application*. This application is aimed at automated control of the internal stress state in various therapy and research paradigms by online and continuous monitoring of the stress state.

The stress state generally refers to the current state of stress present in the subject (i.e., a patient or participant). However, over the years, stress research has produced a myriad of definitions for the concept of stress, expressing a wide variety of views on the matter. This variety has emerged due to several reasons, such as the multidisciplinary nature of stress research and the developing view on stress in the past decades.

Our definition of stress is based on the definition by Newport and Nemeroff (2002), in which stress is defined as any challenge to the homeostasis of an individual that requires an adaptive response of that individual. In order to use this definition in the construct of GASICA, we use it in the following form: "*Stress is the state resulting from the ensemble of responses that are aimed at (facilitating) restoration and/or maintenance of (psychological) homeostasis to internal or external stimuli that present (perceived) challenges to this (psychological) homeostasis.*"

To be more specific, the stimulus presenting the challenge to (psychological) homeostasis will be referred to as the *stressor*, and the response to this challenge will be referred to as the *stress response*. In line with the interactional approach to stress (Jones and Bright, 2001), we furthermore identify variables influencing the relation between the stressor and stress response, referred to as *intervening variables*.

Stressors, stress responses and intervening variables can be divided into different categories. With regard to stressors, two of the categorizations that are often made and that are relevant to our application, are the division of stressors in *physical* and *psychological* stressors, and in *acute* and *chronic* stressors (see, amongst others, Jones and Bright, 2001; Dickerson and Kemeny, 2004). Physical stressors are in general defined as metabolically demanding stressors, such as physical exercise or a cold pressor task, where the hand of a subject is cooled down. Psychological stressors are in most cases defined as non-metabolically demanding stressors, for example as used in the Trier Social Stress Test, in which a subject is asked to perform a public speech task and mental arithmetic (Kirschbaum et al., 1993). The division between acute and chronic stressors refers to the length a stressor is presented. The precise boundary between these categories varies in the literature, although a consensus can be discerned on the stressor lengths on the ends of the spectrum: stressors that are presented within the realm of minutes are regarded as acute stressors and stressors that are present for weeks or longer will be regarded as chronic.

The stress response can also be divided into different categories. An often-made categorization is the distinction between *physiological*, *psychological* and *behavioral* stress responses (see, amongst others, Jones and Bright, 2001; Lovallo, 2005). Physiological stress responses can include, among others, alterations in heart rate (variability), cerebral activity, and electrodermal activity, while psychological responses typically include alterations in affect and cognition, and behavioral responses include alterations of the exhibited behavior.

Intervening variables have been divided into a broad array of different categories in the literature, depending on the aim of the categorization made. However, there are several categories that are prevalent throughout this literature. These categorizations include, amongst others, *individual difference variables*, including variables such as age, gender and personality type, and *environmental variables*, including variables such as the surrounding temperature.

Within the stressor and stress response categories we introduce a further distinction between *type* and *form*. This distinction will aid us in the further design and description of GASICA. We define within this context stressor types to refer to the manifestation of a stressor, i.e., the concrete entity that produces the stressor: for example a public speaking task, a mathematical task, or a digital game producing a stressor. Stressor form will refer to the kind of stressor(s) that is or are presented through this stressor type, for example workload, social-evaluative threat (the possibility of being negatively judged by others), or frustration.

Analogously, we introduce the term stress response type to refer to the manifestation of physiological stress response(s), such as heart rate responses (e.g., an elevated heart rate), or cortisol responses (e.g., an increased cortisol level in the blood). Stress response form refers to the kind of physiological response system the response type originates from, e.g., stress responses of the form *sympathetic* originate from the sympathetic nervous system (responsible for the electrodermal stress response type), and *hemodynamic* responses are responses of the hemodynamic response system (responsible for changes in blood pressure).

The difference between the concepts stress and *arousal* remains an interesting point of discussion in the breadth of the stress research community. To our knowledge there is currently no consensus on how to separate the two. Of the different views that are currently present on this matter, we concur with the view expressed by Day and Walker (2007), in that the difference between these two concepts needs to be sought rather in a top-down characterization of qualitative appraisal, opposed to a bottom-up characterization of physiological responses. Based on this thought Day and Walker propose three ways in which stress differs from arousal. The first way is that stress differs from arousal in terms of the qualitative appraisal that precedes it. The second is that stress is only elicited by aversive challenges. And the third is that there is a difference in the resulting physiological state between stress and arousal, which the authors state based on, amongst others, memory research. However, the authors indicate that the exact differences between these physiological states are currently not elucidated.

In GASICA we will utilize an *acute psychological* stressor (i.e., a stressor that falls both in the psychological stressor category and the acute stressor category) and we will measure *physiological* stress responses (i.e., stress responses in physiological signals). An *acute* stressor is used since, for practical reasons, a stressor cannot be presented for days or weeks within the application, as this would not be suitable for most therapy and research paradigms. Furthermore, a *psychological* stressor is used, as this allows for more flexibility than a physical stressor. This is because the adjustment of physical entities, e.g., changing the temperature of water, is more cumbersome and less flexible than the adjustment of a psychological stressor; a psychological stressor therefore allows a more generic application. A *physiological* stress response is used as an indication of the internal stress state, as this category of stress response is the most objectively measurable and quantifiable of the three stress response categories. Therefore, in the context of our application, we further refine our definition of the stress state, defining it as the current overall stress response in spontaneously generated (neuro)physiological signals, i.e., the current sum of different (neuro)physiological stress response types.

In order to achieve the aim of automated control of the stress state, it is essential to control the stressor intensity *causing* the current stress state. Therefore, our proposed application consists of three components, depicted in **Figure 1**:

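(1) A stressor, allowing stress responses to be induced.

(2) A set of measurement devices, allowing the physiological stress responses to be measured.

(3) A feedback model, allowing the stressor intensity to be adjusted based on these measurements.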


In this paper, we will describe the design of GASICA by discussing each of the three separate components in detail in the upcoming sections.
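The closed loop formed by the three components can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `feedback_step` and `run_loop`, the 0 to 1 intensity scale, and in particular the simple proportional update rule, which merely stands in for the individually fitted feedback model described later in the paper.

```python
from typing import Callable, List


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Keep the stressor intensity inside its (assumed) valid range."""
    return max(low, min(high, value))


def feedback_step(intensity: float, measured: float, target: float,
                  gain: float = 0.1) -> float:
    """Hypothetical proportional rule standing in for the feedback model.

    Raises the intensity when the measured stress response is below the
    target and lowers it when the response is above the target.
    """
    return clamp(intensity + gain * (target - measured))


def run_loop(measure: Callable[[float], float], target: float,
             steps: int, intensity: float = 0.5) -> List[float]:
    """Iterate: present stressor -> measure response -> adjust intensity."""
    history = []
    for _ in range(steps):
        measured = measure(intensity)            # component 2: measurement
        intensity = feedback_step(intensity,     # component 3: feedback model
                                  measured, target)
        history.append(intensity)                # component 1 uses this next
    return history


if __name__ == "__main__":
    # Toy subject whose stress response is twice the stressor intensity.
    trace = run_loop(lambda i: 2.0 * i, target=0.6, steps=50)
    print(round(trace[-1], 3))
```

With the toy response function used here, the intensity settles near the value at which the simulated response equals the target. A real subject's response would of course be noisy and non-linear.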

# **GASICA**

In this section we will further design the different components in our application (i.e., the stressor component, measurement component and the feedback model) using a top-down approach. In this design process we aim to maximize the diagnosticity, sensitivity, and reliability of GASICA, i.e., we aim to maximize the extent to which we are able to measure different levels of psychological stress across inter- and intra-individual differences (for a detailed treatment of these concepts, see Fairclough, 2009). As the categories of the utilized stressor and stress response are already determined, we will first determine the form and type of the utilized stressor and stress responses, using requirements formulated based on our aim to continuously measure and exert control over the stress state in various research and therapy paradigms. Subsequently we will determine the characteristics of the different components using a recent meta-analysis we conducted (van der Vijgh et al., 2015) as a guideline. Finally we will determine the instantiations and/or implementations of the components, using sets of requirements and the meta-analysis as guidelines. This process is described in the following sections for the different components separately.

# **COMPONENT 1: STRESSOR**

#### *Stressor type*

For the first component, the stressor, we utilize an acute psychological stressor. To determine the *type* of acute psychological stressor best suited for use in our application we formulated four requirements:

(1) Multidimensionality

The stressor should allow for presentation of multiple stressor forms (e.g., workload, emotion induction, and frustration) to make it suitable for multiple therapy and research designs.

(2) Adjustability

In order to allow for presenting different stressor intensities, the stressor characteristics should be adjustable, in a way that results in different stressor intensities. Furthermore, through this adjustment it should also be possible to adjust the stressor form presented.

(3) Real-time

The adjustment of the stressor characteristics should be possible in real-time, in order to respond to changing stress states. In this context, real-time refers to the realm of seconds, so that the application can adjust the stressor within a couple of seconds when needed.

(4) Continuity

To allow continuous control over the stressor that is presented, it should be possible to both present and adjust the stressor continuously for a prolonged period of time.

When we look at acute psychological stressors to match against these requirements, we find a plethora of different stressor types. In an influential meta-analysis by Dickerson and Kemeny acute psychological stressors were subdivided in four mutually exclusive types: public speaking and verbal interaction tasks, cognitive tasks, emotion induction procedures, and noise exposure tasks (Dickerson and Kemeny, 2004).

Public speaking and verbal interaction tasks refer to tasks in which subjects have to verbally interact with other human subjects, such as in interviews or through public speaking. The stressor intensities of these tasks are hard to adjust, especially in real-time, and the tasks present a mostly one-dimensional stressor form, being mostly social stress (social-evaluative threat).

Cognitive tasks are tasks such as arithmetic tasks, the Stroop task, vigilance-reaction time tasks, and analytical tasks, e.g., puzzles. These kinds of tasks are, especially when presented digitally, well adjustable, also in real-time, and can be presented continuously. However, these tasks also present a mostly one-dimensional stressor form, being workload.

Emotion induction tasks are tasks that present emotion-eliciting material that elicits a negative affective state, such as the viewing of aversive pictures or film. These tasks are by definition one-dimensional in the sense that they only induce emotion as stressor form.

Noise exposure tasks consist of the presentation of loud noises. These kinds of tasks are also by nature one-dimensional.

Another widely used type of acute psychological stressor not discussed in this analysis consists of a combination of the latter three stressor types, thereby alleviating the limitation of one-dimensionality pertaining to the individual stressor types: a *digital stressor game*, i.e., a digital game producing a stressor. In a recent meta-analysis 5448 articles were found when using search phrases to find studies utilizing this type of stressor<sup>1</sup>, indicating the widespread use of this type of stressor (van der Vijgh et al., 2015). We adhere to the definition of a game used in this analysis: a game is defined as a type of play activity conducted in the context of a pretended reality, in which the player(s) try to achieve at least one arbitrary, nontrivial goal by acting in accordance with rules (Adams, 2010). Key elements in games are players, (inter)action, environment, goals, and rules. In a game, players interact with entities in the environment or with other players in accordance with a set of rules in order to achieve a set goal. When a player controls a specific entity, this entity is referred to as the avatar. Game characteristics are characteristics of any (part) of the key elements or of the game as a whole, such as the game type, the presence of game music or time pressure, or the amount of aversive stimuli present in the digital game.

As in digital games many, or even all, of the key elements are taken over by computer technology, it becomes possible to *adjust* the stressor intensity, in *real-time*, by adjusting the game characteristics of the digital game stressor. Also, by adjusting these game characteristics, it is possible to present *multi-dimensional* stressor forms, *continuously*, satisfying our requirements. Moreover, a (digital) game provides a *narrative*, the pretended reality in which the player tries to achieve the set goal. This narrative provides possibilities to conceal the adjustments made to the stressor characteristics and changes between stressor forms from the subject by presenting these adjustments as part of the narrative. The presence of a narrative also provides a way to incorporate different research and therapy paradigms in the application, as it allows a specific narrative to be created for each specific paradigm. Given these properties (adjustable, real-time, continuous, multidimensional, and providing a narrative), a digital game stressor is selected as the stressor type in our application.

#### *Characteristics*

In order to present different stressor intensities using this type of stressor, it is essential to have insight into which digital game characteristics elicit physiological stress responses, i.e., alter the stress state, and to what degree. Additionally, the effects of adjustments of these characteristics on the stress state should be predictable.

In the same meta-analysis by van der Vijgh et al. (2015), the relation between digital game characteristics and physiological stress responses is analyzed. Using meta-regression, this analysis identified four stressor game characteristics, presenting different stressor forms, that significantly moderated the physiological stress responses in a predictable and consistent manner. This indicates that these characteristics can be used to elicit stress responses and that adjustments of these characteristics are expected to result in predictable changes of the stress state. These four characteristics will therefore be instantiated in the digital game:

(1) Aversive stimuli

The first game characteristic identified is the presence and intensity of *aversive stimuli* in the digital game, such as the visual or auditory presentation of scenes of violence, blood, or gore. Aversive stimuli have been found to elicit physiological stress responses both inside as well as outside a digital game context. Outside the context of a game, the presentation of aversive stimuli such as in a picture rating task (Stegeren et al., 2008), passive viewing of aversive pictures (Sokhadze, 2007) or film viewing containing aversive stimuli (Miller et al., 1995) has been found to induce physiological stress responses such as alterations in heart rate, electrodermal activity and frontal EEG activity. Within the context of a game aversive stimuli can, for example, include the presence of violence, blood and gore (Hebert et al., 2005), and torture (Tafalla, 2007), also inducing physiological stress responses (Carnagey et al., 2007). This characteristic presents a stressor of the form emotion induction, as it induces a negative affect.

<sup>1</sup>The search phrases used were *game AND physiolog*\* and *stress\* AND game*, where *AND* indicates a logical conjunction and the asterisk ("\*") indicates a wildcard, i.e., any combination of characters. The search was performed in Pubmed, Scopus, PsycInfo, and the IEEE Xplore Digital Library.

(2) Realism

This characteristic concerns the amount of *realism* presented in the digital game. This refers to the degree to which a subject will identify the presented stimuli as realistic. This characteristic does not present a separate stressor form in itself, but rather heightens the immersion, resulting in heightened physiological stress responses. Several studies have found the amount of realism in digital games to be related to resulting physiological stress responses. For example, Ivory and Kalyanaraman (2007) found that more technologically advanced, although otherwise comparable, digital games elicited higher electrodermal stress responses and Barlett and Rodeheffer (2009) showed that more realistic digital games significantly heighten the heart rate stress response.

(3) Game music

*Game music* refers to whether or not music is presented in the game. A recent overview provided by Sokhadze (2007) makes clear that, although inconsistent results are to be found, music has the potential to elicit physiological stress responses. Examples include work by Nyklicek et al. (1997), who found significant differences in both cardiovascular and respiratory variables in response to different fragments of music and white noise. Other physiological stress responses, such as in skin temperature, have also been found by, amongst others, McFarland (1985), who found that music with different (subjectively determined) valence and arousal has different effects on skin temperature. Game music induces emotion, the same stressor form as aversive stimuli.

(4) Game type

This characteristic concerns the *game type*, the type of digital game. Examples of different game types include action, adventure, strategy and management, role playing games (RPG), simulation or board and card games (Ritterfeld et al., 2009). In the meta-analysis, it was found that puzzle games induce the highest physiological stress responses. Game type in itself does not present a stressor form, as it acts as a container characteristic: the characteristics that are contained within a specific game type are responsible for the resulting stressor form.

Besides these four characteristics, several other game characteristics were identified in this meta-analysis as being related to physiological stress responses, but were not used in the meta-regression because they could not be objectively qualified or were not reported sufficiently in the included studies. As we aim to design an application that can be applied in various therapy and research paradigms, we aim to include as many different game characteristics as possible, allowing a wider range of stressor forms to be presented and thereby providing greater flexibility for use in more paradigms. Therefore, we screened the characteristics that were excluded from the meta-regression to see whether they were fit for use in our stressor. Of these additional characteristics, three digital game characteristics were selected to be incorporated in the digital game, because these three characteristics can be adjusted during execution of the application. This brings the total number of game characteristics used in our digital game stressor to seven:

(5) Time pressure

*Time pressure* refers to the presence of limited time before a certain goal has to be achieved. Studies utilizing time pressure paradigms (Wahlstrom et al., 2002) have consistently been found to elicit physiological stress responses. The stressor form time pressure presents is workload, as it increases the demand placed on the subject.

(6) Sound level

*Sound level* concerns the sound level at which auditory stimuli are presented during the game. This characteristic presents a stressor of the form noise induction and it has been shown that high sound levels, mostly studied at 75 dB and above, elicit physiological stress responses (Smith et al., 1997; Selander et al., 2009).

(7) Disabling of input

*Disabling of input* refers to the disabling of the control the subject has in the game, making it harder to achieve the goals set and resulting in frustration and physiological stress responses (Reuderink et al., 2009).

In **Table 1** an overview is given of the seven included stressor game characteristics and the stressor forms these present.
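The mapping between characteristics and stressor forms described in the list above can be captured in a small lookup structure. The sketch below is illustrative only: the dictionary name `STRESSOR_FORM` and the helper `characteristics_for` are hypothetical, and `None` is used here as an assumed convention for the two characteristics (realism and game type) that, according to the text, do not present a stressor form of their own.

```python
from typing import Dict, List, Optional

# Characteristic -> stressor form, as stated in the descriptions above.
# None marks characteristics that do not present a separate stressor form.
STRESSOR_FORM: Dict[str, Optional[str]] = {
    "aversive stimuli": "emotion induction",
    "realism": None,              # heightens immersion instead
    "game music": "emotion induction",
    "game type": None,            # container characteristic
    "time pressure": "workload",
    "sound level": "noise induction",
    "disabling of input": "frustration",
}


def characteristics_for(form: str) -> List[str]:
    """Return the characteristics that present the given stressor form."""
    return sorted(name for name, f in STRESSOR_FORM.items() if f == form)


if __name__ == "__main__":
    print(characteristics_for("emotion induction"))
```

A structure of this kind would let a paradigm designer select which characteristics to adjust when a particular stressor form is required.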

# *Game design*

The stressor game is designed around the seven selected game characteristics, aiming to provide a narrative through which the game characteristics can be adjusted as part of the narrative, reducing the possibility of subjects noticing the adjustments as being part of the stress state control method.

The game is designed as a 3D puzzle game because the same meta-analysis (van der Vijgh et al., 2015) indicated that this game type elicited the highest stress response of the analyzed game types. The narrative provided is that an adolescent boy or girl (i.e., the avatar, the entity that is controlled by the subject; gender is not made explicit) has to find notes that are scattered across the house and garden of his or her uncle, an inventor and scientist, who needs these notes within a given amount of time in order to finish his work. The boy/girl sets out to find these notes wearing one of the latest inventions by this uncle, a heavy suit that should help the subject to find these notes, outfitted with a radio and head-up display that can present auditory and visual stimuli to the wearer. However, this suit turns out not to work as expected, malfunctioning from time to time, resulting in, for example, the possible presentation of images or sounds, and the restriction of movement. This suit allows the game characteristics to be altered as needed without the subject registering this as a conscious adaptation of the game, as these alterations are presented as malfunctions of the suit the subject is wearing. Furthermore, the narrative serves to heighten immersion and reduce boredom and annoyance by motivating the subject to keep participating in the current paradigm. The stressor game is designed to exclude fast-paced motor action or complicated cognitive tasks in order to prevent uncontrolled elements of the stressor from inducing physiological reactions. Furthermore, the setting of the game does not include political or ideological content, to prevent unwanted side-effects, nor fast-paced visual or auditory sequences (e.g., no flashing lights) to prevent uncontrolled stress responses and to make sure the digital game can be utilized in multiple paradigms.

**Table 1 | Overview of stressor game characteristics, with presented stressor form, instantiation within the stressor game, the way the instantiations fit in the narrative and how these are adjusted in order to control the stress state.**

Within the game there are two separate conditions: a *fitting* and *manipulation* condition. The fitting condition serves to fit an individual feedback model for each subject. This condition consists of a maze (the garden of the uncle) that presents a homogeneous environment that is suited for the fitting of the feedback model (for more details, see Section Fitting). To prevent unwanted stressor effects of getting lost in the maze we placed trees as landmarks to help subjects find their way. The manipulation condition serves to utilize the fitted feedback model to control the stress state and to present tasks or therapy elements (for more details, see Section Overview). This condition is placed inside the house of the uncle. Impressions of both conditions are given in **Figure 2**.

### *Instantiations of characteristics*

Within this design, the seven selected stressor game characteristics are instantiated in order to control the stress state. We chose instantiations that allowed changes to the characteristics to be immediately noticeable. An overview of the instantiations and how these fit in the narrative of the stressor game is given in **Table 1**.

Aversive stimuli are presented using the International Affective Picture System (IAPS) (Lang et al., 1988) and the International Affective Digitized Sounds (IADS) (Bradley and Lang, 1999), standard sets containing over 1000 pictures and 167 sounds, respectively. These sets are scored on valence, arousal and dominance by subjects in an on-going sequence of studies, currently containing the scoring from over 18 separate studies. By summing the scores on inverted dominance, valence, and arousal, we derived a new scale in which low values correspond to images and sounds scored as unhappy, arousing and being controlled, and high values correspond to images and sounds scored as happy, non-arousing and being in control. Pictures are presented full screen to achieve maximal effect; it is explained to the subject that they appear on the display incorporated in the suit. Sounds are presented over the built-in radio. Both are presented as malfunctions of the suit.
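A composite scale of this kind can be sketched as follows. The source does not specify exactly which dimensions are inverted before summing, so this sketch makes an explicit assumption: only arousal is inverted, because that choice reproduces the described endpoints (low = unhappy, arousing, being controlled; high = happy, non-arousing, in control). The 1 to 9 rating range, the `Rating` type, and the item identifiers are likewise assumptions for illustration and not actual IAPS or IADS data.

```python
from dataclasses import dataclass
from typing import List

# Assumed rating range; adjust if the normative ratings use another scale.
SCALE_MIN, SCALE_MAX = 1.0, 9.0


@dataclass(frozen=True)
class Rating:
    item_id: str      # hypothetical identifier of a picture or sound
    valence: float    # low = unhappy, high = happy
    arousal: float    # low = calm, high = arousing
    dominance: float  # low = being controlled, high = in control


def invert(score: float) -> float:
    """Mirror a score within the rating range (9 -> 1, 1 -> 9)."""
    return SCALE_MIN + SCALE_MAX - score


def composite(rating: Rating) -> float:
    """Sum of valence, dominance and inverted arousal (assumed inversion)."""
    return rating.valence + invert(rating.arousal) + rating.dominance


def most_aversive_first(ratings: List[Rating]) -> List[Rating]:
    """Order items from lowest (most aversive) to highest composite score."""
    return sorted(ratings, key=composite)


if __name__ == "__main__":
    items = [Rating("calm-scene", 8.0, 2.0, 7.0),
             Rating("threat-scene", 2.0, 8.0, 3.0)]
    print([r.item_id for r in most_aversive_first(items)])
```

Ordering the stimuli along a single scale in this way gives the feedback model one dimension along which the intensity of aversive stimuli can be raised or lowered.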

Realism is altered through changing the point of view to either 1st or 3rd person view, i.e., the view the subject has in the stressor game is either through the eyes of the avatar or from above, just beyond the avatar, having view of the complete avatar. Altering this view has been found by Dahlquist et al. (2010) to alter presence: in 1st person view a greater sense of presence was reported. This adjustment is also presented as a malfunction of the suit.

As game music has been found to induce different responses, it is hard to determine how to instantiate this characteristic. Based on the review provided by Sokhadze (2007), we used the results of the study by Nyklicek et al. (1997) as a basis, as this was found to be the most comprehensive study and to have clear, significant results. These results indicate that it is possible to elicit different emotional states (recognized based on physiological responses) using music excerpts that are characterized by valence and arousal dimensions. Based on these findings we chose to use a standard set of music excerpts, scored on several dimensions, including valence and arousal, by 116 subjects (Eerola and Vuoskoski, 2011). We derived a new scale by summing the values for valence, tension, and energy. In this scale low values correspond to negatively valenced, tense, and energetic music, and high values correspond to positively valenced, relaxed, and calm music. Music is presented over the built-in radio; in the narrative of the game this is explained as a malfunction of the suit.

The game type will not be altered during the execution of the game, and is set as a puzzle game. This is because changes in the game type also have consequences for the narrative and game design, which is undesirable.

Time pressure will be applied through a countdown presented to the subject indicating that he or she needs to find the next note before the timer reaches zero. This timer can be turned on or off and can start at any given number of seconds. Within the narrative of the game the time pressure is applied through the uncle announcing over the radio that he needs new notes quickly.

Sound level is adjusted by presenting the auditory stimuli at different intensities; this is presented as a malfunction of the radio in the suit.

Input is (partially) disabled by disabling the keys needed to control the game. In the narrative of the game this is also a malfunction of the suit the subject is wearing.

Each of the stressor characteristics instantiations can be presented either *transient* or *state-wise*, as can be determined by the experimenter or therapist. This refers to the duration an adjustment of the instantiation is present: either for a set duration (that can be determined by the experimenter or therapist) or constantly, until it is adjusted again.

An additional strength of the included stressor game characteristics and their respective instantiations is that the stressor forms they present have a high degree of ecological validity. This is due to the fact that these stressors correlate highly with stressors found in real life. Examples are the real-life depictions of aversive visual and auditory stimuli of the IAPS and IADS that are used, and the presentation of time pressure on task completion, as presented through the countdown in the stressor game.

#### *Implementation*

In order to implement the stressor game we aimed to utilize a software environment which fulfilled three requirements:

(1) Data exchange

It must be possible to both import information into and export information from the environment during execution. Import is needed to receive information from the feedback model specifying which game characteristics will be adjusted and in what way. Export is needed to send markers to the measurement component or any additional hardware.

(2) Characteristic implementation

All of the selected game characteristics instantiations must be implementable in the environment. For example, the environment must allow for a high level of realism in order to be able to vary the amount of realism presented.


In order to make the application suitable for different therapy and research paradigms, a high-level programming interface is needed that allows the required research or therapy tasks to be implemented easily within the application.

Based on these requirements the software environment Virtual BattleSpace 2 (VBS2) by Bohemia Interactive was selected (Simulations, 2011). This is a 3D simulation environment that allows for complete control over the simulation to adjust the relevant game characteristics and fulfill the above requirements. The subjects control the stressor game with the four directional keys of a standard alphanumeric keyboard.

#### **COMPONENT 2: MEASUREMENT**

#### *Stress response types*

In the measurement component we include a multitude of physiological stress response types, both to provide information suited for a variety of research and therapy paradigms as well as to gain as much information as possible regarding the current stress state. In order to determine which types and corresponding measurements are suitable for inclusion we drafted six requirements, the first three concerning the stress response types, the latter three concerning the corresponding measurements, given below:

(1) Responsivity

The stress response type should respond to the selected stressor game characteristics.

(2) Response consistency

In order to succeed in achieving the intended effect in the measured stress response upon adaptations of the stressor, the response must be consistent within a subject to a (digital game) stressor (adaptation), both quantitatively and temporally. This entails that the response should have a consistent sign (either increasing or decreasing) between repeated presentations of an identical stressor game characteristic, and that this response must occur within the same time span between repeated presentations. This consistency is only required for identical circumstances. For example, the response is not required to be consistent between cases where in one of the cases the stress response type is already at a physiologically possible maximum or minimum.

(3) Low response latency

In order to be able to measure the effect of a stressor characteristic adjustment and allow for any subsequent adjustments, the stress response must emerge after such an adjustment with as little delay as possible. In practice, this requires the latency to be within the realm of seconds, in order to reliably determine the effect of a presented stressor.


In order to reliably relate the stress responses to adjustments in the stressor game, the measurement must be applicable without inducing an additional stress response or disturbing the experiment or therapy.

(6) Measurement fMRI compatibility

In order to allow usage of the application in additional research and therapy paradigms, the measurement must be applicable inside an MRI scanner, without disturbing (f)MRI imaging, allowing the use of the application in combination with this technique.

We compiled a list of all stress response types and corresponding measurements encountered in the meta-analysis on digital game characteristics utilized in the previous section (van der Vijgh et al., 2015), as this provides an exhaustive overview of the kind of measurements performed with stressor games in the past 36 years. We reviewed these types and measurements to match these against the above requirements. The results are given in **Table 2**, using the same numbering of the requirements as above. **Table 2** contains 13 stress response types and corresponding measurements that meet all requirements. Although we aim for a multitude of measurements, including all these stress response types will require six different measurement devices<sup>2</sup>, which presents practical problems and is not expected to provide additional information concerning the overlap in the different stress response types and forms. Therefore, to reduce the number of needed measurement devices, we look at the stress response *form* the different stress response types have.

The types are either of the cardiac response form (measured using ECG and ICG), the hemodynamic response form (measured with a blood pressure monitor and a photoplethysmograph), the form of stress responses stemming from sympathetic activation (measured using electrodes), or the neural stress response form (measured using electroencephalography, EEG). In order to reduce the number of needed measurement devices, we chose one measurement from each stress response form by selecting the measurement of the stress response types with the highest mean effect. These mean effect sizes<sup>3</sup> are taken from the meta-analysis in which these stress response types were analyzed (van der Vijgh et al., 2015). For the different stress response forms and measurement combinations, these effect sizes are given in **Table 3**. Given this approach, we chose to measure heart rate and heart rate variability (using ECG), systolic and diastolic blood pressure (using a blood pressure monitor), the electrodermal stress response type (using electrodes) and the neural response (using EEG).

#### *Implementation*

In order to implement the measurement component we aimed to utilize hardware and software that fulfilled three requirements:

(1) fMRI compatible

One of the requirements on the stress response types is that it is possible to perform the measurement within an fMRI environment. Therefore, this requirement is extended to the hardware and software of the implementation as well.

(2) Continuous

We selected the stress response types based on, among other things, the possibility to measure them continuously; therefore, the hardware used to measure these types also needs to be able to measure continuously.

(3) Broad spectrum

In order to make the application suitable for multiple research and therapy paradigms we select hardware and software that is easily extendable with measurement equipment for additional stress response types or the measurement of other dependent physiological variables, such as the use of EEG.

We were not able to find a manufacturer that provided interconnected equipment that fulfilled all our requirements and also measured all selected stress response types at the same time. We selected the system that fulfilled all three requirements and was able to measure the most stress response types at the same time. This is the fMRI-compatible equipment by Biopac Systems Inc. that allows continuous measurement of all our selected stress response types, except EEG. This results in the use of heart rate (variability), blood pressure and electrodermal activity measurements in the measurement component. The base station of this equipment, the *MP150*, allows easy extension through a plug-and-play interface: additional equipment by the same manufacturer can be plugged in and is automatically recognized by the accompanying software, *Acqknowledge*. This ensures easy extension of the utilized stress response types, making the application suited for additional paradigms.

For the measurement of heart rate and variability with ECG we use the ECG 100C MRI amplifier with disposable electrodes in a lead-II type ECG. For the measurement of the electrodermal response we use two disposable electrodes on the hand or foot connected to an EDA 100C MRI amplifier. For measuring systolic and diastolic blood pressure we use a small pressure pad on the thumb or elbow that detects these measures using continuous pressure, allowing for continuous measurement, connected to an HLT-100C amplifier. This method utilizes an additional software package for extraction of the systolic and diastolic blood pressure from the raw blood pressure signal, *Caretaker*, which transfers the measurement data to the *Acqknowledge* software.

<sup>2</sup>Heart rate and heart rate variability are measured using electrocardiography (ECG); systolic, diastolic and mean blood pressure are measured using a blood pressure monitor; electrodermal response is measured using electrodes on the skin; cardiac output, pre-ejection period, cardiac index, left ventricular ejection time and vascular rigidity index are measured using impedance cardiography (ICG); digital blood volume pulse is measured using a photoplethysmograph; and neural spectral power or event-related potentials are measured using electroencephalography (EEG).

<sup>3</sup>These mean effect sizes represent the difference between baseline measurements and measurements during digital stressor presentation.

**Table 2 | Overview of stress response types and corresponding measurements matched against requirements for determining the utilized stress response types in the measurement component.**

*A plus sign (*+*) indicates that the requirement of the respective column is met, a minus sign (*−*) indicates this requirement is not met. In the latter case, the comment column indicates why this is the case. Rows corresponding to stress response types and measurements that meet all requirements are given in gray.*

*Gray rows indicate selected types and forms for use in application.*

#### **COMPONENT 3: FEEDBACK MODEL**

The feedback model serves to select the optimal adjustment of the stressor game characteristics, i.e., the adjustment that minimizes the difference between the current stress state and the desired stress state. To this end, it models the relations between the size of the stress response types selected in the previous section and the stressor game characteristics adjustments chosen in the preceding section. These relations are expressed in rules that are fitted for each individual subject during the fitting condition of the application. It is important to realize here that the formulas presented in this section can be used to calculate values to a precision that does not necessarily reflect the same precision of the entities these are presenting, i.e., the provided formulas and resulting values in this section should be seen as an approximation of the respective entities these present.

#### *Stress state*

In correspondence with our definition of the stress state as the ensemble of responses to internal or external stimuli that present (perceived) challenges to the (psychological) homeostasis, the stress state in the feedback model is expressed as the weighted summation of stress responses in the selected stress response types (i.e., heart rate, heart rate variability, blood pressure, and electrodermal response). In order to estimate this stress state, we calculate the *physiological activity state.* This latter state is derived by calculating the activity for each stress response type separately, expressed in the standardized mean difference effect between the baseline and stress response values, denoted as Hedges' *g* (Hedges, 1981). The formula for the size of the activity of a given stress response type expressed in *g* is given by:

$$g = \frac{\mu_{\text{stress response}} - \mu_{\text{baseline}}}{\sigma_{\text{baseline}}}$$

Here μbaseline refers to the mean value of the specific stress response type (e.g., heart rate) during a baseline measure. This baseline measure is a measure of the physiological activity state of the subject in rest, before the start of the respective paradigm. This measure needs to be performed shortly before the application is used, while the subject is instructed to relax in the same position as he or she will be in when using the application. Furthermore, the μstress response and σbaseline refer to the mean current physiological activity of a given type and the standard deviation of the corresponding baseline measurement values. In this manner, the sign of the resulting *g* will be positive when the physiological activity is higher compared to the corresponding baseline measurement, with the value indicating the change from baseline expressed in standard deviations. The physiological activity state, calculated as the weighted summation of these stress response types is given by:

$$p_c = \frac{\sum_i \left(g_i * w_i\right)}{\sum_i w_i}$$

Here *pc* stands for the current physiological activity state, summing over all *gi*, the current size of the respective stress response types, multiplied by *wi*, representing the respective weights assigned to these types. These weights are introduced to allow specific stress response types to have more influence on the physiological activity state than other response types, by assigning the respective weight of this response type a higher value than other weights. This can be desirable in certain therapy or research paradigms in which specific stress response types are more informative for the aim of the paradigm than others. By default the weights are all initialized to the value of 1, resulting in all stress response types having the same influence on the physiological activity state. The state is normalized by dividing it by the summation of all weights. In this manner the normalized weights sum up to one, resulting in a physiological activity state *pc* that is expressed as the weighted average change from baseline in standard deviations, making interpretation more intuitive. Calculating *pc* in this manner allows multiple stress response types to be combined, takes personal variance in the physiological signals of individual subjects into account and, by using the baseline measurements, ensures that the physiological activity state value is close to 0 at the beginning of each experiment.
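The two formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the GASICA implementation: the function names, the dictionary layout and the sample values are hypothetical, and the use of the sample standard deviation for the baseline is an assumption, as the text does not specify the estimator.

```python
from statistics import mean, stdev


def activity_size(response_values, baseline_values):
    """Size g of one stress response type, in baseline standard deviations."""
    # Assumption: sample standard deviation of the baseline recording.
    return (mean(response_values) - mean(baseline_values)) / stdev(baseline_values)


def physiological_activity_state(g_by_type, weights=None):
    """Weighted average p_c of the per-type sizes g_i."""
    if weights is None:
        # Default described in the text: every weight initialised to 1.
        weights = {name: 1.0 for name in g_by_type}
    weighted_sum = sum(g_by_type[name] * weights[name] for name in g_by_type)
    return weighted_sum / sum(weights[name] for name in g_by_type)


# Hypothetical heart-rate samples (beats per minute), for illustration only.
g_hr = activity_size([66, 68, 70], [60, 62, 64])
p_c = physiological_activity_state({"heart_rate": g_hr, "electrodermal": 1.0})
```

Raising the weight of one response type pulls *pc* toward the size of that type, which is the mechanism the text describes for paradigms in which one type is more informative than the others.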

We use this physiological activity state to estimate the current stress state *sc*. Based on the current stress state estimated in this manner and the given desired stress state *sd*, the feedback model selects a stressor characteristic adjustment that is predicted to result in a stress state that is closest to the desired stress state. In order to determine this adaptation, the feedback model utilizes *rules* that model the relation between stressor characteristic instantiations and the different stress response types.

#### *Rules*

For every stressor characteristic instantiation, the feedback model contains exactly one rule that models the relation between all the available adjustments of the instantiation and the corresponding responses in the different stress response types. For example, for the stressor characteristic "game music" the feedback model contains a rule that predicts the response for each of the different stress response types (i.e., heart rate (variability), blood pressure and the electrodermal response) for each of the possible adjustments, i.e., for each music sample that can be presented.

Two kinds of rules are contained in the feedback model: *discrete* and *continuous* rules. The discrete rule is used for modeling relations concerning stressor game characteristics that are instantiated in discrete levels, such as realism, which is instantiated by adjusting the point of view to either the first or the third person view, i.e., two levels. This kind of rule is represented as a matrix *Dij*, with the rows representing the change in the size of the respective response types, and the columns representing the transitions between levels of the stressor game characteristic instantiation. There are *i* rows, equal to the number of stress response types, and *j* columns, equal to the number of 2-permutations of the set of levels of the stressor game characteristic [i.e., P(*number of levels*, 2)], representing all the possible transitions from one level of a given characteristic to another. In this manner the element *Dij* refers to the predicted change in response type *i*, i.e., Δ*gi*, when the stressor game characteristic transition is applied that belongs to column *j*. For example, the rule for transitions of the realism characteristic instantiation would consist of a four by two matrix, with the four rows presenting the changes in size of the different stress response types and the two columns representing the two possible transitions: from first to third point of view, and vice versa. Within this presentation, the predicted change in the current stress state, i.e., Δ*sc*, when applying transition *j* on the stressor game characteristic presented by discrete rule *rd*, is therefore equal to the weighted summation (using the weights corresponding to the respective stress response types) of the elements in column *j* in matrix *D*:

$$r_d(j) = \widehat{\Delta s_c} \quad \text{with } \widehat{\Delta s_c} = \sum_i \left(D_{ij} * w_i\right)$$
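As an illustration of the discrete rule, the sketch below enumerates the transitions of the realism characteristic and evaluates one column of a rule matrix. The matrix values, the level names and the function names are hypothetical and chosen only to make the arithmetic easy to follow; they are not fitted values from the application.

```python
from itertools import permutations


def transitions(levels):
    """All ordered transitions between levels, i.e., the 2-permutations."""
    return list(permutations(levels, 2))


def discrete_rule(D, weights, j):
    """Predicted change in stress state for transition j: sum_i D[i][j] * w_i."""
    return sum(row[j] * w for row, w in zip(D, weights))


# Realism has two levels, hence P(2, 2) = 2 transitions (columns).
columns = transitions(["first_person", "third_person"])

# Hypothetical 4 x 2 matrix: one row per stress response type.
D = [
    [0.5, -0.5],  # heart rate
    [0.1, -0.1],  # heart rate variability
    [0.2, -0.2],  # blood pressure
    [0.2, -0.2],  # electrodermal response
]
weights = [1.0, 1.0, 1.0, 1.0]
predicted_change = discrete_rule(D, weights, columns.index(("first_person", "third_person")))
```

A characteristic with three levels would give six columns, since every ordered pair of distinct levels is a separate transition.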

The continuous rule is used for modeling relations of stressor characteristics that are instantiated in a continuous manner, such as aversive stimuli, which is instantiated by using sounds and pictures from the IAPS and IADS using a continuous scale. This kind of rule consists of a simple linear regression model for each response type, in which the respective predicted response sizes are regressed on the continuous measure of the stressor characteristic. This entails that the predicted change in the current stress state, Δ*sc*, when applying an adjustment with value *x* of the continuous measure of the stressor game characteristic instantiation presented by the rule *rc*, is equal to the weighted summation of the predicted changes in stress response type sizes, Δ*gi*. These changes are equal to the value predicted by the fitted regression model:

$$r_c(x) = \widehat{\Delta s_c} \quad \text{with } \widehat{\Delta s_c} = \sum_i \left(\widehat{\Delta g_i} * w_i\right)$$

In the example of the instantiation of aversive stimuli in the form of pictures from the IAPS, the desired change in the current stress state can be inserted, resulting in a value for *x* that corresponds to the value of the scale used for aversive pictures that is predicted to result in this desired change of stress state. Subsequently, the picture with the value closest to *x* is selected to be used as stressor game characteristic adjustment.
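The following sketch illustrates how a continuous rule could be evaluated and inverted. It assumes that each response type has a fitted line with an intercept and a slope, written here as hypothetical `(intercept, slope)` tuples; the picture identifiers, scale values and function names are likewise invented for illustration.

```python
def continuous_rule(models, weights, x):
    """Predicted change in stress state for scale value x.

    models holds one (intercept, slope) pair per stress response type.
    """
    return sum(w * (a + b * x) for (a, b), w in zip(models, weights))


def solve_for_x(models, weights, desired_change):
    """Scale value x whose predicted change equals desired_change."""
    intercept = sum(w * a for (a, _), w in zip(models, weights))
    slope = sum(w * b for (_, b), w in zip(models, weights))
    if slope == 0:
        raise ValueError("rule has no effect on the stress state; cannot invert")
    return (desired_change - intercept) / slope


def closest_stimulus(scale_values, x):
    """Identifier of the stimulus whose scale value lies closest to x."""
    return min(scale_values, key=lambda item: abs(scale_values[item] - x))


# Hypothetical fitted lines for two response types and three pictures.
models = [(1.0, -0.5), (0.0, -0.5)]
weights = [1.0, 1.0]
x_target = solve_for_x(models, weights, desired_change=-1.0)
picture = closest_stimulus({"pic_a": 1.0, "pic_b": 2.2, "pic_c": 5.0}, x_target)
```

Because the weighted sum of linear models is itself linear in *x*, the inversion has a closed form as long as the combined slope is non-zero.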

In order to select the stressor characteristic adjustment that is predicted to result in the stress state that is closest to the desired stress state, the feedback model will inspect the predicted stress state change of all rules when applied on the current stress state. Subsequently it will select the rule that results in the stress state that is closest to the desired stress state *sd* and apply this rule, i.e., stressor characteristic adjustment, to the stressor game. This entails that one rule is executed at a time. When we consider all possible rules *r* with all possible input values *x* or *j* as a set *R*, this selection process is given by:

$$\min_{r(x \vee j) \in R} \left\{ \left| s_d - \left(s_c + r(x \vee j)\right) \right| \right\}$$

In cases where two or more rules would result in identical predicted stress states that are both the closest to the desired stress state, the feedback model selects the rule that has the largest effect on the stress response type that has the value the farthest away from the desired stress state. This is applied to prevent flooring and ceiling effects of stress response types.
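The selection step, including the tie-break, might be sketched as follows. This is one possible reading of the description above rather than the application's code: candidate adjustments are given as hypothetical dictionaries of predicted per-type changes, and the tolerance `tol` used to decide that two predictions are identical is an added assumption.

```python
def select_adjustment(candidates, current_g, weights, s_c, s_d, tol=1e-9):
    """Pick the adjustment whose predicted stress state is closest to s_d.

    candidates maps a label to {response_type: predicted change in g}.
    current_g maps each response type to its current size g.
    """
    def predicted_change(changes):
        return sum(changes[t] * weights[t] for t in changes)

    distance = {
        label: abs(s_d - (s_c + predicted_change(changes)))
        for label, changes in candidates.items()
    }
    best = min(distance.values())
    tied = [label for label, d in distance.items() if d - best <= tol]
    if len(tied) == 1:
        return tied[0]
    # Tie-break: favour the largest effect on the response type that is
    # currently farthest from the desired stress state.
    farthest = max(current_g, key=lambda t: abs(current_g[t] - s_d))
    return max(tied, key=lambda label: abs(candidates[label][farthest]))


# Hypothetical predictions for three adjustments and two response types.
candidates = {
    "music": {"hr": 0.5, "eda": 0.5},
    "timer": {"hr": 1.5, "eda": 0.0},
    "view": {"hr": 0.2, "eda": 0.8},
}
weights = {"hr": 1.0, "eda": 1.0}
current_g = {"hr": 0.4, "eda": -0.4}
choice = select_adjustment(candidates, current_g, weights, s_c=0.0, s_d=1.0)
```

In this example "music" and "view" both predict the desired state exactly, and "view" is chosen because it has the larger predicted effect on the electrodermal response, which is farther from the desired state than heart rate.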

#### *Fitting*

In order to derive the rules of the feedback model for each individual subject, the subject is presented with a predetermined sequence of stressor characteristic adjustments in the fitting condition, i.e., the condition where the subject searches for notes in the maze in the garden. This sequence can be specified by the therapist or researcher, allowing them to determine which rules will be used, and therefore included in the feedback model, as well as which specific adjustments will be presented and how many times.

The responses to the adjustments are calculated by using the measurements x seconds before the adjustment as a baseline and the measurements y seconds after the adjustment as response, after which *g* is calculated for each stress response type as described in section Stress state. The values of x and y can be chosen by the therapist or researcher in order to allow for specific intervals of measurement suited for the specific stress response types being used. Because we only adjust one stressor characteristic at a time and keep all other variables that we identified as being relevant to the response constant, and do so in a homogeneous environment (the maze looks virtually the same at any given time), we take the measured response as representing the expected change in the stress response types after applying this specific adjustment. Although we aim to control all relevant variables and solely adjust the characteristic of interest, we recommend that therapists or researchers creating a sequence of adjustments use multiple presentations of a specific adjustment, resulting in multiple-trial measurements in the fitting condition. In this manner, any remaining effect of variation in non-controlled variables on the measured response will be reduced.

The measured sizes of the response types to these stressor characteristics are used to fit the feedback model rules. For the discrete rules, this entails creating a matrix of the possible transitions between the levels together with the measured responses. The continuous rules are derived by constructing a simple regression model using least squares regression estimation (LSRE), with the measured responses as data to fit the regression model.
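A minimal sketch of the fitting step is given below. The window boundaries (baseline up to but excluding the adjustment, response from the adjustment onward), the sample standard deviation and the data layout are assumptions made for illustration; the function names are hypothetical.

```python
from statistics import mean, stdev


def response_size(samples, t_adjust, x, y):
    """Size g of the response to an adjustment made at time t_adjust.

    samples is a list of (time_in_seconds, value) pairs. The x seconds
    before the adjustment serve as baseline, the y seconds after as response.
    """
    before = [v for t, v in samples if t_adjust - x <= t < t_adjust]
    after = [v for t, v in samples if t_adjust <= t <= t_adjust + y]
    return (mean(after) - mean(before)) / stdev(before)


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical heart-rate trace sampled once per second.
trace = [(0, 60), (1, 62), (2, 64), (3, 66), (4, 68), (5, 70)]
g = response_size(trace, t_adjust=3, x=3, y=2)

# Hypothetical scale values of presented stimuli and the measured sizes.
intercept, slope = fit_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

For a discrete rule, the sizes measured for each transition would fill the corresponding column of the matrix; with repeated presentations, one option (an assumption here) is to average the sizes per transition before storing them.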

#### *Intervening variables*

Intervening variables, i.e., variables influencing the relation between the stressor and stress response, can interfere with the stress state that is predicted by a rule executed by the feedback model. In these cases, the feedback model cannot reliably predict, and therefore cannot control, the stress state. Currently, the feedback model controls for two kinds of intervening variables.

First, the model controls for intervening variables of the individual difference variables category, such as gender and age, by fitting an individual feedback model for every subject separately.

Second, the feedback model aims to reduce intervening flooring and ceiling effects of stress response types by selecting rules that have the largest effect on the stress response type that has the value the farthest away from the desired stress state.

#### *Implementation*

The feedback model is implemented in Python (van Rossum, 1995), as this language provides classes for interaction with the measurement component software *Acqknowledge* and is a high-level programming language, allowing easier adaptation in the future when this is needed for a given paradigm.

#### **OVERVIEW**

In **Figure 3** the schematic overview of GASICA from **Figure 1** is revisited, and elaborated with the outcomes of the top-down design process of the different components as described in the previous sections. In this figure the constant loop performed in GASICA is depicted. In this loop the digital stressor game presents various stressor forms such as emotion induction, workload and frustration to the subject (**Table 1**) through the adjustment of (instantiations of) stressor game characteristics, e.g., the adjustment of time pressure through the presentation of a countdown. Simultaneously, five different physiological stress response types (i.e., heart rate, heart rate variability, diastolic and systolic blood pressure, and electrodermal response) of different forms (respectively cardiac, hemodynamic, and sympathetic) are measured (**Table 3**). These measurements are relayed to the feedback model, which determines the current stress state and selects a rule (resulting in an adjustment of an instantiation of a game characteristic) that will minimize the difference between the desired stress state and the predicted current stress state after applying the rule. This rule is then applied in the stressor component, resulting in an adjustment of the game characteristic instantiations, thereby closing the loop that will run continuously. Because this loop runs continuously, any changes in the current stress state that are not a consequence of the adjustment of stressor characteristics (e.g., spontaneous drifts or influences of non-controlled variables) are constantly measured and corrected for. A narrative is used in the stressor game to prevent subjects from identifying adjustments as part of the therapy or research intervention: the subject has to find notes wearing a heavy suit that contains a radio and display, and which malfunctions from time to time, resulting in, for example, the presentation of images and sounds and the restriction of movement. Intervening variables are taken into account by the feedback model in two ways, most prominently by constructing an individual feedback model for each subject.
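The closed loop described above can be summarized in a short sketch. The three callables stand in for the measurement component, the feedback model and the stressor game; they are hypothetical placeholders and not part of any published GASICA interface. The loop is bounded by `n_iterations` only so that the example terminates.

```python
def run_loop(read_stress_state, select_rule, apply_rule, desired_state, n_iterations):
    """Measure, select a rule, apply it, and repeat."""
    history = []
    for _ in range(n_iterations):
        s_c = read_stress_state()               # measurement component
        rule = select_rule(s_c, desired_state)  # feedback model
        apply_rule(rule)                        # stressor game
        history.append((s_c, rule))
    return history
```

Because the stress state is measured again on every pass, a drift that no rule predicted shows up in the next reading and is taken into account when the next rule is selected.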

It is important to realize here that GASICA utilizes single stressor characteristic adjustments to alter the stress state. However, the envisioned workings of GASICA do not rely on the expectation that single adjustments will result in the exact responses as found during the fitting condition. Because we *continuously* alter different characteristics, we get an ensemble of stressor manipulations that together have more power to alter the stress state toward the desired stress state. In other words, it is not the case that each specific, relatively mild, stressor is expected to elicit the exact response measured during the fitting, but rather the ensemble of all stressor characteristics that are continuously presented based on the current stress state, that is expected to result in the alteration of the stress state toward the desired stress state.

Instructions provided to subjects can be altered to fit the needs of the respective paradigm being used. Two general instructions that need to be given in any paradigm are that (1) the subject should perform to the best of their abilities, i.e., find as many notes as possible (monetary incentive could be employed here), and (2) the subject should remain as still as possible in order to ensure valid physiological measurements. It is important to note here that due to the first instruction, GASICA is expected to also present a certain amount of social-evaluative threat, as the subject feels that they are evaluated on how well they are performing.

Most of the properties of GASICA can be altered by the therapist or researcher using it, as indicated in relevant sections in this manuscript. In this manner we aim to present a generic application that can be used in a multitude of paradigms. Some of the most important alterations include:


#### *Implementation*

In **Figure 4** the complete architecture of GASICA is given, containing the implementations of the different components and the implemented connections between the components. The stressor game environment (VBS2) and the feedback model (Python) are run together on one pc, the *stimulus pc*, and the stress response measurement software (Caretaker and Acqknowledge) is run on another pc, the *acquisition pc*. This distinction is made because the software on the different pc's has different requirements: the stressor game environment requires more graphical power, whereas the measurement software mostly requires large memory and fast writing to the hard disk. By separating the components, we can utilize pc hardware that is better suited for the different software, and prevent interference between the software packages. Furthermore, an additional module, the *Connection and sync module*, is developed as a dynamic link library (DLL) in C++ and is utilized as a plugin in VBS2, allowing additional measurement equipment to be connected to GASICA using either the parallel or the serial port. Additionally, this module serves to synchronize the measurements in the measurement component with events in the stressor game or in any additional tasks that are used in different paradigms.

Furthermore, the therapist or researcher can determine several properties of GASICA as described above. In order to facilitate this adjustment, all properties can be set through a single configuration file. Moreover, all software components are coded as modular open-source code and will be made available in due time on gasica.com. This allows therapists or researchers to change any element of the application that cannot be altered using the configuration file, for example the way the stress state is determined, or the game design. In **Figure 5** a picture of GASICA in use is included, with the different components from **Figure 4** encircled.

#### **DISCUSSION**

We have presented GASICA, an application aimed at controlling the internal stress state in various therapy and research paradigms by online and continuous monitoring of the stress state through (neuro)physiological signals. Here we discuss the strengths and limitations of the application, and the future directions.

#### **STRENGTHS**

Through the fulfillment of the requirements of the different components, GASICA is an application that presents several strengths:

(1) Multidimensional

The application can present different stressor forms, such as workload, emotion induction, and frustration. This makes it possible to investigate the effects of these specific stressor forms, in isolation or in combination, in different paradigms.

(2) Ecologically valid

The stressor forms presented and the instantiations of the stressor game characteristics that present these forms allow for the presentation of ecologically valid stressors. This allows for the execution of paradigms that produce results that generalize better to real-world situations.

(3) Controllable

The application makes it possible to control both the stressor form and the stressor intensity that is presented, with the aim of controlling the stress state of the subject. If this succeeds, it opens many possibilities. For example, GASICA can be used to keep the stress state of the subject within desirable bounds given a specific paradigm, for example in therapies such as exposure therapy. Another example is to use the application to keep the stress state of a subject at a certain level for the duration of the paradigm, which is relevant, for example, in cases where the stress state is an intervening variable on the dependent variable in research.

(4) Generic

The application is generic in several respects. First, the configuration file allows any of the properties of GASICA to be adjusted, such as the addition of tasks to the application (the task itself can be constructed using the high-level programming language of VBS2), or the exclusion or inclusion of stressor game characteristics and the alteration of existing ones, making it possible to present the stressor forms that are required in the respective paradigm. Second, the narrative and 3D world can be adjusted (the environment ensures easy import of existing 3D worlds and objects). Third, a wide range of physiological measurements can be used by adding Biopac amplifiers through a simple plug-and-play interface. After the new amplifier is added, the complete GASICA application automatically detects this new signal and utilizes it for control of the stress state. Additionally, measurement equipment from other vendors can be added through the connection and sync module. These properties allow adjustment of the application to make it suitable for different therapy and research paradigms. Furthermore, the entire application will be provided as modular open-source software on gasica.com, in order to allow any adjustments that are not feasible through the configuration file. This generic nature also allows GASICA to be used in additional ways. Examples could be to add neural activity measurement devices and make the measurements from these devices available to the subject, thereby effectively using GASICA as a neurofeedback application. Other examples could be in the entertainment field, where GASICA can be used to optimize the user experience, or for training purposes, aimed at training people to function in stressful jobs, such as in the military, aviation, or firefighting.

#### **LIMITATIONS**

Several limitations can be anticipated with this application. First, it is to be expected that the rules fitted during the fitting condition capture a relation between stressor and response that is context-dependent, for example dependent on the time elapsed since the stressor was first presented. As such, these relations are prone to change during the course of the experiment or therapy. At this moment, the application does not control for this. We aim to investigate these effects during the upcoming validation study and determine possible solutions, such as using *adaptive rules* in the feedback model, i.e., rules that adjust to changing relations between stressors and responses.

Second, currently only a few intervening variables are controlled for in the application. In the upcoming validation study we will analyse the effect of several intervening variables, such as elapsed time of the current paradigm, and include these in the feedback model.

Third, the application utilizes linear regression models to model the relation between continuously instantiated stressor game characteristics and stress response type sizes. However, it is not certain whether this relation is linear. We will use the data from future research to assess what kind of model best fits these relations.
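As an illustration of the kind of linear model referred to here, the following minimal sketch fits a straight line between one stressor characteristic and one response measure by ordinary least squares, and then inverts the line to find the stressor intensity expected to produce a target response. The function names, the toy data, and the inversion step are assumptions made for illustration; they are not taken from the GASICA implementation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intensity_for(target_response, slope, intercept):
    """Invert the fitted line (assumed use): intensity expected to give the target."""
    return (target_response - intercept) / slope


# Hypothetical fitting data: stressor intensity vs. response size.
intensity = [0.0, 1.0, 2.0, 3.0]
response = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(intensity, response)
needed = intensity_for(4.0, slope, intercept)
```

If the true relation is not linear, as the limitation above acknowledges, such an inversion would be biased, which is one reason to compare alternative model families on the validation data.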

#### **STRESS INDUCTION**

One of the important questions regarding the use of GASICA is whether it induces stress, or whether the observed responses are due to other concepts, most prominently arousal. As stated in the treatment of these concepts in the introduction, we concur with the distinction proposed by Day and Walker (2007). We feel that, according to this distinction, GASICA should be considered a stressor, as GASICA presents many stimuli that have been found to be aversive, entailing qualitative appraisal in terms of aversiveness and meeting the requirement that aversive challenges must be utilized.

Given that it is pivotal to establish with the highest possible certainty that the responses to GASICA indeed represent stress, we have planned a follow-up validation study to this Technology Report. In this study a large study population is used, containing a control group to control for other factors of the digital stressor game that can contribute to physiological responses, such as motor activity. Furthermore, additional measurements are included, such as additional subjective measurements and cortisol measurements. As cortisol is an important and widely used biomarker for stress that could not be used as an online measurement in the measurement component, we will use it as an offline measurement in this study to gain more insight into the effects of GASICA on stress.

#### **ACKNOWLEDGMENTS**

We would like to thank our reviewers for their invaluable and insightful feedback, providing us with new angles to view our application from. This project is part of the research program *Treatment of cognitive disorders based on functional brain imaging*, funded by the Netherlands Initiative Brain and Cognition, a part of the Organization for Scientific Research (NWO) under grant number 056-14-014.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 January 2014; accepted: 18 November 2014; published online: 08 December 2014.*

*Citation: van der Vijgh B, Beun RJ, van Rood M and Werkhoven P (2014) GASICA: generic automated stress induction and control application design of an application for controlling the stress state. Front. Neurosci. 8:400. doi: 10.3389/fnins.2014.00400*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 van der Vijgh, Beun, van Rood and Werkhoven. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Fusion of electroencephalographic dynamics and musical contents for estimating emotional responses in music listening

# *Yuan-Pin Lin<sup>1,2</sup>\*, Yi-Hsuan Yang<sup>3</sup> and Tzyy-Ping Jung<sup>1,2</sup>*

*<sup>1</sup> Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California, San Diego, La Jolla, CA, USA*

*<sup>2</sup> Center for Advanced Neurological Engineering, Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, USA*

*<sup>3</sup> Music and Audio Computing Lab, Research Center for IT Innovation, Academia Sinica, Taipei, Taiwan*

#### *Edited by:*

*Jan B. F. Van Erp, Toegepast Natuurwetenschappelijk Onderzoek, Netherlands*

#### *Reviewed by:*

*Kenji Kansaku, Research Institute of National Rehabilitation Center for Persons with Disabilities, Japan Dezhong Yao, University of Electronic Science and Technology of China, China*

#### *\*Correspondence:*

*Yuan-Pin Lin, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, Mail code 0559, La Jolla, CA 92093-0559, USA e-mail: yplin@sccn.ucsd.edu*

Electroencephalography (EEG)-based emotion classification during music listening has gained increasing attention due to its promise of potential applications such as musical affective brain-computer interface (ABCI), neuromarketing, music therapy, and implicit multimedia tagging and triggering. However, music is an ecologically valid and complex stimulus that conveys certain emotions to listeners through compositions of musical elements, and using solely EEG signals to distinguish emotions remains challenging. This study aimed to assess the applicability of a multimodal approach by leveraging the EEG dynamics and acoustic characteristics of musical contents for the classification of emotional valence and arousal. To this end, this study adopted machine-learning methods to systematically elucidate the roles of the EEG and music modalities in the emotion modeling. The empirical results suggested that when whole-head EEG signals were available, the inclusion of musical contents did not improve the classification performance. The obtained performance of 74–76% using solely the EEG modality was statistically comparable to that using the multimodal approach. However, if EEG dynamics were only available from a small set of electrodes (likely the case in real-life applications), the music modality would play a complementary role and augment the EEG results from around 61% to 67% in valence classification and from around 58% to 67% in arousal classification. The musical timbre appeared to replace less-discriminative EEG features and led to improvements in both valence and arousal classification, whereas musical loudness contributed specifically to the arousal classification. The present study not only provided principles for constructing an EEG-based multimodal approach, but also revealed fundamental insights into the interplay of brain activity and musical contents in emotion modeling.

#### **Keywords: EEG, emotion classification, affective brain-computer interface, music signal processing, music listening**

# **INTRODUCTION**

Through monitoring ongoing electrical brain activity, electroencephalography (EEG)-based brain-computer interfaces (BCIs) allow users to voluntarily translate their intentions into commands to communicate with or control external devices and environments, instead of using conventional communication channels, e.g., speech and muscles (Millan et al., 2010). Several types of EEG signatures are theoretically defined and empirically proved to be robust in actively and reactively actuating BCIs (Zander and Kothe, 2011), such as evoked potentials, event-related potentials (ERPs), and sensorimotor rhythms (Wolpaw et al., 2002). More recently, a new category called passive BCI was introduced (Zander and Kothe, 2011). It enables users to involuntarily interact with machines by means of implicit user states, e.g., emotion. Researchers are attempting to augment BCIs with emotional awareness and intelligence in response to users' emotional states, giving rise to so-called affective brain-computer interfaces (ABCIs).

Emotion is a psycho-physiological process as well as a natural communication channel of human beings. Music is considered an extraordinary mediator for evoking emotions and concurrently modulating the underlying neurophysiological processes (Blood et al., 1999). Building upon findings on musical emotions, the use of machine-learning methods to characterize spatio-spectral EEG dynamics associated with emotions, namely EEG-based emotion classification, has gained increasing attention in the last decade due to its promise of potential applications such as musical ABCI (Makeig et al., 2011), neuromarketing (Lee et al., 2007), music therapy (Thaut et al., 2009), implicit multimedia tagging (Soleymani et al., 2012a; Koelstra and Patras, 2013) and triggering (Wu et al., 2008). Given diverse EEG patterns, the major efforts in previous EEG-based emotion classification works (not limited to music stimuli) were to seek an optimal emotion-aware model by leveraging feature extraction, selection and classification methods (Ishino and Hagiwara, 2003; Takahashi, 2004; Chanel et al., 2009; Frantzidis et al., 2010; Lin et al., 2010b; Petrantonakis and Hadjileontiadis, 2010; Koelstra et al., 2012; Soleymani et al., 2012b). Despite many approaches and advances in EEG analysis in the past decade, how to precisely categorize EEG signals into distinct emotional states remains challenging.

Music is an ecologically valid and complex stimulus that conveys emotions to listeners through compositions of musical elements, such as mode, tempo and timbre (Peretz et al., 1998; Schmithorst, 2005; Gomez and Danuser, 2007; Zatorre et al., 2007). Listeners are more or less able to perceive and recognize the same emotions as the music expresses (Schmidt and Trainor, 2001; Juslin and Laukka, 2003). Analogous to the EEG domain, researchers in the music signal processing field have devoted themselves to mapping acoustic characteristics of musical contents onto emotion semantics labeled by human annotators, namely music emotion recognition (Yang and Chen, 2011, 2012). Most previous works employed publicly available toolboxes, such as MIRtoolbox (Lartillot and Toiviainen, 2007), Marsyas (Tzanetakis and Cook, 2002), and PsySound (Cabrera, 1999), to extract a wide variety of musical features and then used machine-learning algorithms to automatically learn the associations between extracted features and emotions expressed in music (Yang et al., 2008a; Aljanaki et al., 2013). The aforementioned evidence raises the natural question of whether the acoustic characteristics of musical contents can further improve the EEG classification results.

Using EEG features in conjunction with other information sources has recently shed light on this issue; for example, peripheral biosignals (Chanel et al., 2009; Koelstra et al., 2012; Soleymani et al., 2012a), eye gaze (Soleymani et al., 2012a,b), musical structures (Koelstra et al., 2012), and facial expression (Koelstra and Patras, 2013) have been proposed as complementary sources. Particularly for the music study, Koelstra et al. (2012) reported that a multimodal approach, fusing decision outputs from EEG and music classifiers, marginally improved the classification performance over using solely the EEG modality. It remained unclear whether the acoustic characteristics of musical contents effectively contribute to the emotion modeling.

This study attempted to examine the roles of the EEG and music modalities in the multidisciplinary emotion classification problem in music listening, based on two hypotheses. The first hypothesis was that the EEG modality, reflecting spatio-spectral activities of the whole brain related to implicit emotion responses, should dominate the multimodal approach for emotion classification, as compared to the music modality, where implicit emotions are the responses automatically induced by the stimulus itself (Gyurak et al., 2011). This study adopted machine-learning methods, i.e., feature extraction, selection, and classification, to systematically assess a composite feature space in which EEG dynamics and musical characteristics are synchronized along the time scale. The relative contributions of the EEG and music modalities can then be explored. Furthermore, one can imagine that the use of a high-density EEG montage over the whole head might be difficult or impractical for real-life ABCI applications. The applicability of whole-head EEG dynamics (in the first hypothesis) might no longer hold if only a few electrodes are available over a certain region or regions (Lin et al., 2010b). Thus, this study posed a second hypothesis that the musical contents might complement less informative EEG dynamics for emotion classification and consequently improve over the result of the EEG modality alone. This study explored the minimal set of informative electrodes from multiple subjects for emotion classification. These few electrodes, mostly located over the fronto-central regions, were used to simulate the absence of whole-brain EEG dynamics. Exploring the validity of these two hypotheses might elucidate potential advantages and limitations in fusing EEG dynamics and musical contents for the emotion classification problem.

#### **MATERIALS AND METHODS**

#### **EEG DATASET AND MUSIC EXCERPTS**

This study adopted the Oscar movie soundtrack dataset (Lin et al., 2010b) to test the feasibility of using a multimodal approach for emotion classification. The EEG signals were collected from 26 healthy subjects who were undergraduate and graduate students (16 males, 10 females; age 24.40 ± 2.53), mostly from engineering-related colleges. The experiment protocol and EEG recording were approved by the Human Research Protections Program of National Taiwan University. The music-listening experiment targeted four emotion classes (joy, anger, sadness, and pleasure) in accordance with the two-dimensional circumplex emotion model composed of valence (positive-negative) and arousal (high-low) axes (Russell, 1980). Sixteen music excerpts from the soundtracks of Oscar-winning movies were used to induce the targeted emotions. Each subject underwent a four-block music experiment; each block contained four counterbalanced 30-s music trials corresponding to the four targeted emotions. After music listening, the subjects labeled their felt emotions on a discrete scale, i.e., joy (positive valence and high arousal), anger (negative valence and high arousal), sadness (negative valence and low arousal), and pleasure (positive valence and low arousal). In the experiment, a 32-channel Neuroscan EEG module placed according to the International 10–20 system (**Figure 1**) and referenced to the linked mastoids (algebraic average of left and right) was adopted to acquire EEG signals with a sampling rate of 500 Hz and a bandpass filter at 1–100 Hz. Subjects were asked to keep their eyes closed, remain seated, and minimize head/body movements. After the music experiment, each subject's data consisted of 16 30-s EEG segments labeled by self-reported emotional states (joy, anger, sadness, or pleasure).

Referring to the recent works (Koelstra et al., 2012; Soleymani et al., 2012a,b; Koelstra and Patras, 2013), most EEG-based classification tasks have been addressed and performed on the basis of emotional valence and arousal, e.g., categorizing EEG signals into positive or negative valence, instead of discrete emotion states. To make a direct comparison with the latest reports, this study addressed the binary emotion classification problem. The self-reported emotion labels of the Oscar movie soundtrack dataset were separately merged into the binary categories of valence and arousal. The valence scale comprised positive (joy and pleasure) and negative (anger and sadness) levels, whereas the arousal scale contained high (joy and anger) and low (pleasure and sadness) levels. There were 16 pairs of 30-s EEG signals and music excerpts for each of the 26 subjects available for analysis and comparison.
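The label merging described above amounts to a lookup from the four discrete emotions to one level on each axis. A minimal sketch follows; the dictionary and function names are illustrative choices, and the example trial labels are hypothetical.

```python
# Mapping from the four self-reported emotions to binary valence/arousal
# levels, following the quadrant assignment described in the text.
EMOTION_TO_BINARY = {
    "joy":      {"valence": "positive", "arousal": "high"},
    "anger":    {"valence": "negative", "arousal": "high"},
    "sadness":  {"valence": "negative", "arousal": "low"},
    "pleasure": {"valence": "positive", "arousal": "low"},
}


def merge_labels(trial_labels, dimension):
    """Convert per-trial emotion labels into binary labels for one dimension."""
    return [EMOTION_TO_BINARY[label][dimension] for label in trial_labels]


# Hypothetical labels for one block of four trials.
labels = ["joy", "sadness", "anger", "pleasure"]
valence = merge_labels(labels, "valence")
arousal = merge_labels(labels, "arousal")
```

Because the labels are self-reported, the two binary classes need not be balanced within a subject, which is why a majority-class baseline is relevant when interpreting accuracies.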

#### **EEG FEATURE EXTRACTION**

Previous neurophysiological studies documented EEG spectral changes either in distinct regions or between hemispheres (Davidson, 1992; Schmidt and Trainor, 2001; Aftanas et al., 2004; Sarlo et al., 2005; Sammler et al., 2007). Such evidence might in part have facilitated the use of spectral dynamics within/between channels for EEG-based emotion classification, e.g., spectra in individual channels and spectral asymmetry in left-right channel pairs (Lin et al., 2010b; Koelstra et al., 2012; Soleymani et al., 2012a; Koelstra and Patras, 2013). In the literature, the patterns of spectral differences along anterior and posterior brain regions have also been explored (Schmidt and Trainor, 2001; Sarlo et al., 2005). However, no study has attempted to address the feasibility of using spectral differences in fronto-posterior channel pairs in this domain. Prior to constructing a multimodal approach, this study aimed to explore the optimal EEG feature type among several candidates, including the power spectral density in individual channels and the power spectral asymmetry in the left-right and fronto-posterior channel pairs.

For each of the 16 30-s EEG trials, the short-time Fourier transform with a non-overlapping 1-s Hamming window was applied to extract the power spectral density in five frequency bands, including delta (δ: 1–3 Hz), theta (θ: 4–7 Hz), alpha (α: 8–13 Hz), beta (β: 14–30 Hz), and gamma (γ: 31–50 Hz), over 30 channels (two reference channels were excluded). The band-specific power spectra of the individual channels formed a feature dimension of 150 (5 bands × 30 channels) and are labeled as PSD hereafter. To characterize the spectral-band asymmetry with respect to laterality (in the left-right direction) and caudality (in the fronto-posterior direction), this study defined two feature types, namely DLAT and DCAU, to separately extract the differential spectral asymmetry of 12 left-right and 12 fronto-posterior channel pairs from the 30 individual channels, each forming a feature dimension of 60 (5 bands × 12 channel pairs). Furthermore, this study also defined a feature type MESH by merging PSD, DLAT and DCAU, with a dimension of 270, for comparison. **Table 1** summarizes the aforementioned four EEG feature types. Note that the feature vectors of each type were separately normalized to the range from 0 to 1.
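The windowing and band-averaging steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: band power is taken as the mean squared FFT magnitude within each band, the asymmetry is taken as a simple power difference between paired channels, the channel pairing is hypothetical, and the input is synthetic noise standing in for one 30-s, 30-channel trial.

```python
import numpy as np

FS = 500  # sampling rate in Hz, as reported for the recordings
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}


def band_power(trial, fs=FS):
    """Return band power with shape (n_seconds, n_bands, n_channels).

    trial: array of shape (n_channels, n_samples).
    Each non-overlapping 1-s segment is Hamming-windowed before the FFT.
    """
    n_channels, n_samples = trial.shape
    n_seconds = n_samples // fs
    segments = trial[:, :n_seconds * fs].reshape(n_channels, n_seconds, fs)
    spectrum = np.fft.rfft(segments * np.hamming(fs), axis=-1)
    power = np.abs(spectrum) ** 2            # (channels, seconds, freqs)
    freqs = np.fft.rfftfreq(fs, d=1.0 / fs)  # 1-Hz resolution for 1-s windows
    out = np.empty((n_seconds, len(BANDS), n_channels))
    for b, (lo, hi) in enumerate(BANDS.values()):
        mask = (freqs >= lo) & (freqs <= hi)
        out[:, b, :] = power[:, :, mask].mean(axis=-1).T
    return out


def differential_asymmetry(power, pairs):
    """Power difference for each (channel_a, channel_b) index pair (assumed form)."""
    a, b = zip(*pairs)
    return power[:, :, list(a)] - power[:, :, list(b)]


# Synthetic stand-in for one 30-s, 30-channel trial.
rng = np.random.default_rng(0)
trial = rng.standard_normal((30, 30 * FS))
psd = band_power(trial)
pairs = [(i, i + 12) for i in range(12)]          # hypothetical pairing
dlat = differential_asymmetry(psd, pairs)
psd_features = psd.reshape(psd.shape[0], -1)      # 30 samples x 150 features
dlat_features = dlat.reshape(dlat.shape[0], -1)   # 30 samples x 60 features
```

With one feature vector per second, each 30-s trial contributes 30 samples, which matches the one-sample-per-second alignment used for the musical features.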

#### **MUSIC FEATURE EXTRACTION**

Emotion expression in music is usually associated with different acoustic characteristics (Juslin, 2000; Gabrielsson and Lindström, 2010). This study employed commonly used music information retrieval toolboxes, i.e., MIRtoolbox (Lartillot and Toiviainen, 2007) and PsySound (Cabrera, 1999), to extract the acoustic features that represent various perceptual dimensions of music listening, including pitch, dissonance, loudness, and timbre. The data samples of the musical features were aligned to the EEG features with one sample per second. The music feature types are summarized in **Table 2** and described as follows.

Pitch is the auditory attribute of sounds which can be ordered on a scale from low to high. The harmonic aspect of music can be described in terms of the relationship between two or more simultaneous pitches, whereas the melodic aspect is related to the temporal succession of pitches (Muller et al., 2011). This study used the MIRtoolbox to extract three major elements describing the pitch properties in music, including the key clarity, mode, and harmonic flux. The key clarity refers to the similarity (or key strength) that best describes one of the 24 musical keys, e.g., C major. Next, the musical mode represents the difference between the best major key and the best minor key in key strength, which is often related to the sensation of valence in music (Gabrielsson and Lindström, 2010). The harmonic flux indicates a large difference in harmonic content between consecutive frames, such as chord changes, strong melody or bass line movement. This feature may be relevant as some psychology studies have found that large melodic intervals are perceived as more powerful (i.e., high-arousal) than small ones (Gabrielsson and Lindström, 2010).

Dissonance measures the harshness or roughness of the acoustic spectrum (Cabrera, 1999). Dissonance generally implies a combination of notes that sound harsh or unpleasant to people when played at the same time. Empirically, many musical pieces involve a balanced combination of consonant and dissonant sounds, e.g., the release of harmonic tension might create pleasure (Parncutt and Hair, 2011). Four elements describing the dissonance were calculated by PsySound, including tonal dissonance (HK, S) and spectral dissonance (HK, S). The tonal dissonance measures the dissonance among tonal components, whereas the spectral dissonance models the degree of deviation from the noisiness of the sound. Note that HK and S are two methods that yield results on different scales.

Loudness is the perceptual intensity of sounds and depends primarily on the physical intensity as well as frequency and duration. This study employed PsySound to derive five features depicting the human sensation of sound loudness across frequency, including loudness, sharpness (Z, A), timbral width, and volume. The loudness is an integral of the spectral distribution of loudness sensation. In general, loud music tends to be associated with high arousal and potency, whereas soft music relates to low arousal. Next, sharpness Z and A are two models that characterize the sharpness of the sound sensation on a scale from dull to sharp (Cabrera, 1999). The former model emphasizes high frequencies, whereas the latter is sensitive to the positive influence of loudness on sharpness. The timbral width is defined as the flatness, i.e., width of the peak, of the spectral distribution of loudness, whereas the volume is derived from the relative strength between total loudness and sharpness (Cabrera, 1999). The relationship between these two features and emotion processing is relatively less understood.

Timbre, which reflects the acoustic spectro-temporal characteristics, is often considered the quality of sound that makes a particular musical sound different from another. To model the timbre, this study employed the MIRtoolbox and computed the Mel-frequency cepstral coefficients (MFCC). MFCC characterizes the spectral shape of the sound by taking the coefficients of the discrete cosine transform of log-power spectra expressed on a non-linear, perceptually related Mel-frequency scale (Davis and Mermelstein, 1980). Typically, only the 10–20 lowest coefficients are retained for analysis (Muller et al., 2011). Following Koelstra et al. (2012), this study adopted only the first 13 coefficients. The timbre feature type is named MFCC hereafter.

#### **FUSION OF EEG AND MUSICAL FEATURES**

By using multidisciplinary signals, a multimodal approach can usually boost single-modality results. Decision-level and feature-level fusions are two commonly used schemes for integrating multiple signal sources (Kittler et al., 1998; Sargin et al., 2007). The feature-level fusion works by concatenating features of different modalities and then feeding the composite feature vector to a classifier, whereas the decision-level fusion allows single modalities to be processed independently and then derives a final decision from the multiple outputs. Since this study attempted to evaluate the relative contributions of the EEG and music modalities, the feature-level fusion, which synchronizes the features of different modalities along time, better conforms to the objective. After applying a feature selection procedure (described in the next section), this study defined a term, namely percent composition, to reveal the percentages of contributions of EEG and musical features to a multimodal feature composition. Prior to classification, each of the addressed EEG and musical features was independently normalized between 0 and 1, making the features equally weighted for the classifier.
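The feature-level fusion and the percent composition can be sketched as below. This is a minimal illustration: the feature names and values are hypothetical, and computing the percent composition as the share of each modality among the selected features is an assumed reading of the term, which the text defines only in words.

```python
def min_max_normalize(column):
    """Scale one feature (a list of samples) to the range 0..1."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: map to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def fuse(eeg_columns, music_columns):
    """Feature-level fusion: normalize each feature, then concatenate.

    Both inputs map a feature name to its per-second samples; the two
    modalities must already be aligned in time (same number of samples).
    Returns (samples, names, modality).
    """
    names, modality, columns = [], [], []
    for tag, group in (("EEG", eeg_columns), ("music", music_columns)):
        for name, column in group.items():
            names.append(name)
            modality.append(tag)
            columns.append(min_max_normalize(column))
    samples = [list(row) for row in zip(*columns)]  # one row per second
    return samples, names, modality


def percent_composition(selected, modality):
    """Share (in %) of each modality among the selected feature indices."""
    counts = {}
    for i in selected:
        counts[modality[i]] = counts.get(modality[i], 0) + 1
    return {m: 100.0 * c / len(selected) for m, c in counts.items()}


# Hypothetical toy data: two EEG features and two musical features, 3 s.
eeg = {"alpha_Fz": [2.0, 4.0, 6.0], "theta_Cz": [1.0, 1.5, 3.0]}
music = {"loudness": [10.0, 30.0, 20.0], "mfcc_1": [-1.0, 0.0, 1.0]}
samples, names, modality = fuse(eeg, music)
composition = percent_composition([0, 1, 3], modality)
```

Normalizing each feature separately before concatenation keeps a feature with a large numeric range, such as loudness, from dominating a distance-based kernel.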

#### **FEATURE SELECTION**

Feature selection plays a chief role in solving classification problems. Given plenty of raw features, the selection procedure is capable of extracting only a subset of task-relevant features while removing redundant/irrelevant ones. Feature reduction not only leads to computational efficiency, but also reduces the number of electrodes required in real-life applications (Lin et al., 2010b). This study employed an F-score index, a ratio of between- and within-class variations (Chen and Lin, 2006), to pinpoint the most emotion-relevant features/electrodes, which has been proven effective for the EEG-based emotion classification problem (Lin et al., 2010b). The F-score index of the *i*th feature is defined as follows:

$$F(i) = \frac{\sum_{l=1}^{g} \left(\overline{x}_{l,i} - \overline{x}_{i}\right)^2}{\sum_{l=1}^{g} \frac{1}{n_l} \sum_{k=1}^{n_l} \left(x_{k,l,i} - \overline{x}_{l,i}\right)^2}$$

where $\overline{x}_i$ and $\overline{x}_{l,i}$ are the mean values of the *i*th feature for the entire dataset and for class *l* (*l* = 1, ..., *g*; *g* = 2 for positive and negative classes in valence or high and low classes in arousal), respectively; $x_{k,l,i}$ is the *k*th sample value of the *i*th feature for class *l*, and $n_l$ is the number of samples in class *l*. A larger F-score value indicates higher discrimination power. It was assumed that the features with the highest F-score values account for the most emotion-tagged information and contribute more to emotion classification.
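To make the ratio concrete, the sketch below computes the F-score of a single feature and ranks features by it. It is a minimal illustration with hypothetical toy values; the within-class term is taken as the per-class mean squared deviation summed over classes, which is one reading of the denominator, and the function names are illustrative.

```python
def f_score(values, labels):
    """F-score of one feature: between-class over within-class variation."""
    overall_mean = sum(values) / len(values)
    between, within = 0.0, 0.0
    for cls in set(labels):
        members = [v for v, y in zip(values, labels) if y == cls]
        class_mean = sum(members) / len(members)
        between += (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in members) / len(members)
    return between / within


def rank_features(columns, labels):
    """Return feature indices sorted from highest to lowest F-score."""
    scores = [f_score(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return order, scores


labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
informative = [0.9, 1.0, 1.1, 0.1, 0.0, -0.1]   # classes well separated
uninformative = [0.5, 0.1, 0.9, 0.4, 0.2, 0.9]  # class means coincide
order, scores = rank_features([uninformative, informative], labels)
```

In this toy case the well-separated feature scores about 37.5 while the overlapping one scores about 0, so the ranking places the informative feature first. A feature that is constant within every class would make the denominator zero and would need separate handling.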

To test the first hypothesis, the F-score based feature selection was applied to each subject's EEG dataset separately to generate a subject-dependent EEG feature set. To test the second hypothesis, this study simulated the consequences of the unavailability of whole-head EEG data. This study applied the F-score feature selection to explore the commonality of the informative EEG features across the 26 subjects, i.e., a subject-independent set. More specifically, an objective index, namely the level of feature independency (LFI), was defined based on the number of subjects having the same informative features, expressed as a fraction of all subjects. After sorting and accumulating the F-score-sorted subject-dependent EEG features, the LFI-guided subject-independent EEG feature sets were then explored. The LFI value was empirically set and tested from 0.1 up to 0.6. Note that no informative features were commonly observed in 18 or more subjects (LFI = 0.7). The subject-independent EEG feature set with LFI = 0.6 was expected to return a minimal set of electrodes to test the second hypothesis. It is also important to explore the common EEG patterns across subjects in emotion processing.
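One way to picture the LFI-guided selection is as a threshold on how many subjects share a feature. The sketch below is an interpretation rather than the authors' procedure: it treats LFI as the fraction of subjects whose informative set contains a feature, and it assumes each subject's informative set has already been chosen. Feature names and subject data are hypothetical.

```python
from collections import Counter


def subject_independent_set(per_subject_features, lfi):
    """Features shared by at least a fraction `lfi` of the subjects.

    per_subject_features: one collection of informative feature names per
    subject (how many features each subject contributes is an assumption).
    """
    n_subjects = len(per_subject_features)
    counts = Counter(f for feats in per_subject_features for f in set(feats))
    return sorted(f for f, c in counts.items() if c / n_subjects >= lfi)


# Hypothetical informative features for five subjects.
subjects = [
    {"alpha_Fz", "theta_Cz", "beta_F3-F4"},
    {"alpha_Fz", "theta_Cz"},
    {"alpha_Fz", "gamma_Pz"},
    {"alpha_Fz", "theta_Cz", "gamma_Pz"},
    {"theta_Cz", "delta_Oz"},
]
loose = subject_independent_set(subjects, lfi=0.4)
strict = subject_independent_set(subjects, lfi=0.8)
```

Raising the threshold shrinks the shared set, which mirrors the use of the highest attainable LFI to obtain a minimal electrode set.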

#### **FEATURE CLASSIFICATION AND VALIDATION**

Support vector machine (SVM) is a popular machine-learning algorithm that projects input data onto a higher dimensional feature space via a kernel function, in which classification can be made more easily than in the original feature space. The iterative learning process of an SVM eventually converges to optimal hyperplanes giving maximal margins between classes. This study used the LIBSVM software (Chang and Lin, 2011) to build the SVM classifier and employed a radial basis function (RBF) kernel to non-linearly map the original data onto a higher dimensional space.

Regarding the classification validation, this study applied a leave-trial-out (LTO) validation method to each individual's dataset to obtain the emotion classification results. The LTO validation provides a generalized performance estimate by averaging the classification results over *N* repetitions, with each of the *N* trials tested once (*N* = 16 in this study). In each repetition, the SVM model was trained with 15 trials and then tested against the remaining trial. Note that prior to the LTO validation, a grid-search procedure (Chang and Lin, 2011) was applied to the entire dataset to select, from various pairs (γ: 2<sup>−1</sup> ∼ 2<sup>3</sup>, *C*: 2<sup>−4</sup> ∼ 2<sup>1</sup>), the parameter pair (γ, *C*) for the size of the RBF kernel and the penalty of the decision boundary that gave the best SVM training accuracy. The classification accuracy was defined as the ratio of the number of correctly classified samples to the total number of samples. The averaged classification performance was obtained by averaging the classification results across the 26 subjects. This study employed a paired *t*-test to assess the statistical significance of differences in classification performance between feature types or modalities. As a baseline, the majority-voting accuracy, defined by the majority class of the training data, was also provided, i.e., random guessing. For example, given a training set consisting of positive (63%) and negative (37%) samples in the valence classification, the majority accuracy was 63% for assigning a new sample as positive valence. The significance of the difference between the obtained classification accuracy and majority voting was tested using a one-sample *t*-test.
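The leave-trial-out scheme and the majority-voting baseline described above can be sketched as below. The sketch assumes that every sample carries the identifier of the trial it came from, and that accuracy is averaged per held-out trial; all function names are hypothetical. The classifier is passed in as a callable, so an RBF-kernel SVM (for instance scikit-learn's `SVC`, which is built on LIBSVM) could be plugged in where the majority baseline is used here.

```python
from collections import Counter


def leave_trial_out_folds(trial_ids):
    """Yield (train_idx, test_idx), holding out all samples of one trial per fold."""
    for held_out in sorted(set(trial_ids)):
        train = [i for i, t in enumerate(trial_ids) if t != held_out]
        test = [i for i, t in enumerate(trial_ids) if t == held_out]
        yield train, test


def accuracy(predicted, actual):
    """Ratio of correctly classified samples to the total number of samples."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_fit_predict(train_x, train_y, test_x):
    """Baseline: always predict the majority class of the training data."""
    guess = Counter(train_y).most_common(1)[0][0]
    return [guess] * len(test_x)


def lto_accuracy(samples, labels, trial_ids, fit_predict):
    """Average accuracy over folds; fit_predict(train_x, train_y, test_x) -> labels."""
    fold_scores = []
    for train, test in leave_trial_out_folds(trial_ids):
        predicted = fit_predict([samples[i] for i in train],
                                [labels[i] for i in train],
                                [samples[i] for i in test])
        fold_scores.append(accuracy(predicted, [labels[i] for i in test]))
    return sum(fold_scores) / len(fold_scores)


# Toy data (hypothetical): four trials with two samples each.
toy_trials = [0, 0, 1, 1, 2, 2, 3, 3]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0]
toy_samples = [[0.0]] * len(toy_labels)
baseline = lto_accuracy(toy_samples, toy_labels, toy_trials, majority_fit_predict)
```

Holding out whole trials, rather than individual samples, keeps samples from the same trial out of both the training and test sets at once.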

# **RESULTS**

#### **TESTING THE FIRST HYPOTHESIS: EEG DYNAMICS DOMINATED A MULTIMODAL APPROACH IN EMOTION CLASSIFICATION COMPARED TO MUSICAL CONTENTS**

**Figure 2** summarizes the valence and arousal classification results of the subject-dependent EEG feature types (DLAT, DCAU, PSD, and MESH). The condition "without feature selection" shows the results using all the features, while the condition "with feature selection" shows the maximum accuracy obtained through the add-one-feature-in procedure and the number of features eventually used. In general, the different EEG feature types without feature selection tended to yield comparable results that were notably worse than majority voting. Using only informative features (those with high F-score values), the classification accuracies for all the feature types were markedly improved (*p* < 0.01) over the results without feature selection and were significantly better than majority voting (*p* < 0.01). MESH generated maximum accuracies of 76.08 ± 6.39% and 74.27 ± 4.82% for valence and arousal classification, respectively, which significantly outperformed the other feature types (*p* < 0.01). The feature selection also considerably reduced the feature dimensionality from 270 to below 30. This was very likely because the F-score feature selection effectively pruned the less informative features from the whole feature space, largely alleviating the interference caused by redundant or irrelevant features. Thus, MESH was merged with the musical contents to form a multimodal approach in the following sections.
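The add-one-feature-in procedure mentioned above can be read as growing the feature set one feature at a time in F-score order and keeping the subset that gives the maximum accuracy. The sketch below follows that reading; the function name is hypothetical, and `evaluate` stands for whatever validation routine returns an accuracy for a given subset (for example, a leave-trial-out run of the classifier).

```python
def add_one_feature_in(ranked_features, evaluate):
    """Return (best subset, best score) over the nested subsets of ranked features.

    ranked_features: feature names sorted by descending F-score.
    evaluate: callable taking a list of feature names and returning an accuracy.
    Ties are resolved in favor of the smaller subset (strict improvement only).
    """
    best_subset, best_score = [], float("-inf")
    current = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy evaluation (hypothetical): accuracy depends only on the subset size.
toy_accuracy_by_size = {1: 0.60, 2: 0.70, 3: 0.65, 4: 0.70}
chosen, chosen_score = add_one_feature_in(
    ["f3", "f1", "f4", "f2"], lambda subset: toy_accuracy_by_size[len(subset)]
)
```

Because only nested subsets are evaluated, the search costs one evaluation per feature rather than one per possible subset.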

**Figure 3** summarizes the classification results using the subject-dependent EEG features (i.e., MESH), musical features (i.e., MUSIC), and the subject-dependent multimodal approach. Note that the multimodal features were obtained by applying the F-score feature selection to the composite features of the MESH and MUSIC features. The multimodal approach obtained maximum accuracies of 76.97 ± 6.18% and 76.25 ± 4.88% for valence and arousal classification, respectively. The results using musical features alone were around 65%; they significantly outperformed majority voting for arousal classification but not for valence classification, and were significantly worse than both the EEG and multimodal approaches (*p* < 0.01). The subject-dependent EEG features did not notably benefit from the inclusion of musical features: the classification performance using the multimodal features was comparable (*p* > 0.1) to that using the EEG features.

**Figure 4** further shows the percent composition of contributions of EEG and musical features to the subject-dependent multimodal approach. The composition was derived based on how many informative features led to the maximum classification accuracy. This result indicated that the EEG feature types, especially DLAT and DCAU, dominated the composition of multimodal features for valence and arousal classifications, while the musical features barely contributed. This might explain the marginal improvement using the multimodal approach versus the EEG-only modality.
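The percent composition can be illustrated by counting, per feature type, the selected features that led to the maximum accuracy. The sketch below assumes that each selected feature is mapped to its type (DLAT, DCAU, PSD, or MUSIC); the function name and the feature labels are hypothetical.

```python
from collections import Counter


def percent_composition(selected_features, feature_type):
    """Percentage of selected features that belong to each feature type."""
    counts = Counter(feature_type[f] for f in selected_features)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# Hypothetical selection for one subject.
toy_types = {
    "FT7-FT8_theta": "DLAT",
    "FC3-FC4_alpha": "DLAT",
    "F3-P3_beta": "DCAU",
    "loudness": "MUSIC",
}
toy_composition = percent_composition(list(toy_types), toy_types)
```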

In sum, the MESH feature type, which consisted of the two-directional power asymmetry and individual power spectra across the whole scalp and frequency bands, characterized the EEG dynamics of emotional responses better than the musical features did. The above empirical results supported the first hypothesis that the EEG modality, which accessed spatio-spectral activity of the whole brain, dominated the classification of emotional responses in the multimodal approach.

**feature selection.** The numbers above the bars represent the mean values of the results, whereas the numbers in bold indicate the accuracies significantly better (*p* < 0.01) than the majority voting accuracy (valence: ∼63%, arousal: ∼61%). †Indicates that the accuracy with feature selection significantly outperformed that without feature selection (*p* < 0.01).

#### **TESTING THE SECOND HYPOTHESIS: MUSICAL CONTENTS CAN COMPLEMENT EEG DYNAMICS UNAVAILABLE IN WHOLE-HEAD EEG RECORDINGS**

To test the second hypothesis that musical contents can complement EEG dynamics unavailable in whole-head EEG recordings, this study simulated the circumstance of classifying emotional states based on fewer informative EEG features/electrodes. The LFI index (0.1 ∼ 0.6) was used to systematically reduce the whole-brain montage (30 electrodes) to different subsets of electrodes located over certain regions. Under such constraints, the relationship between the EEG dynamics and musical contents can be evaluated.

**Figure 5** presents the valence and arousal classification results using the LFI-sorted subject-independent EEG features (i.e., MESH) with/without feature selection. Overall, the number of features progressively decreased as the LFI value increased from 0.1 to 0.6, and the number of electrodes required for the feature sets decreased in turn. These feature sets, however, gave very limited estimates of emotional responses compared with majority voting. The likely reason is that the discriminative power of subject-independent features, which represent a compromise across the subject population, cannot be guaranteed for each individual subject. At LFI = 0.6, the required electrodes were dramatically reduced from the whole-scalp montage (30 electrodes) to ten and seven electrodes for valence and arousal classification, respectively. As shown in **Figure 6**, most of the informative EEG features (listed in **Table 3**) were extracted from the fronto-central electrodes rather than from other regions. It is worth noting that DLAT, extracted from left-right electrode pairs, dominated the composition of the EEG features compared with the other types (DCAU and PSD). According to these results, the subject-independent EEG feature set (LFI = 0.6), which involved a low-density fronto-central montage, was adopted for emotion classification in the rest of the study.

**Figure 7** shows the classification results using the subject-independent EEG features (i.e., MESH given LFI = 0.6), musical features (i.e., MUSIC), and the subject-independent multimodal approach. Note that the sorted multimodal features were derived by applying the F-score feature selection to the composite feature vector of the MESH and MUSIC features. The multimodal approach resulted in maximum accuracies of 66.93 ± 7.10% and 67.04 ± 5.78% for valence and arousal classification, respectively, followed by the musical features and the EEG features. Most importantly, the multimodal approach outperformed the EEG-only features by around 6% for valence (*p* < 0.05) and 9% for arousal (*p* < 0.01) classification. There was no significant difference between the multimodal approach and the musical features (*p* > 0.3).

**FIGURE 5 | The valence and arousal classification results of the subject-independent EEG features (type: MESH) in terms of the average number of features, electrodes, and accuracies with/without feature selection under the LFI criteria (0.1 ∼ 0.6).** The numbers near the nodes represent the mean values of the results. †Indicates that the accuracy with feature selection significantly outperformed that without feature selection (*p* < 0.01), yet was comparable (*p* > 0.1) to the majority voting accuracies (valence: ∼63%, arousal: ∼61%).

**Figure 8** shows the percent composition of contributions of EEG and musical features to the subject-independent multimodal features. As a baseline, the composition of the subject-independent EEG features is also provided. The comparison showed that the EEG and musical features performed complementarily in the multimodal approach. The musical features competed with the EEG features and replaced those with relatively low discriminative power, especially for the arousal scale. This explains why the subject-independent multimodal approach led to significant improvements over the subject-independent EEG results. **Table 4** lists these informative musical features, which consistently appeared in more than half of the subjects.

**Table 3 | The informative EEG features that consistently appeared across multiple subjects.**


In sum, the EEG features extracted from a subset of brain regions were unable to effectively encompass the complex brain dynamics of emotions. The corresponding EEG features simply returned a classification accuracy equivalent to random guessing. Under this circumstance, the EEG modality could benefit from the inclusion of the acoustic characteristics of musical contents. This simulation result supported the second hypothesis that musical contents can compensate for EEG dynamics unavailable in whole-head EEG recordings and improve the classification performance to some extent.

**FIGURE 7 | The valence and arousal classification results using the subject-independent multimodal approach (LFI = 0.6) with/without feature selection.** The results of the subject-independent EEG modality (feature type: MESH) and the music modality (feature type: MUSIC) are also provided for comparison. The numbers above the bars represent the mean values of the results, whereas the numbers in bold indicate the accuracies significantly better (*p* < 0.02) than the majority voting accuracy (valence: ∼63%, arousal: ∼61%). †Indicates that the accuracy with feature selection significantly outperformed that without feature selection (*p* < 0.01).

**Table 4 | The informative musical features in the subject-independent multimodal approach.**


# **DISCUSSION**

Music is an ecologically valid and complex stimulus that conveys emotions to listeners through musical composition. Using only EEG signals to classify music-induced emotional responses remains challenging. The fusion of EEG and musical dynamics, which exploits the complementary nature of multidisciplinary modalities, has recently been reported (Koelstra et al., 2012). However, it remains unclear when the acoustic characteristics of musical contents effectively contribute to the modeling of emotional responses. To this end, this study adopted machine-learning methods, including feature extraction, selection, and classification, to systematically assess a composite feature space formed by aligning EEG and musical features in time. The empirical results suggested that when EEG signals from the whole head were available, the inclusion of musical contents contributed little to the emotion classification model. In contrast, if EEG dynamics were available only from a small set of electrodes (likely the case in real-life BCI applications), the music modality tended to play a complementary role and enhance the EEG-based classification performance. To the best of our knowledge, no previous study has attempted to elucidate the roles of the EEG and music modalities in the emotion classification problem. The present study not only provided principles for building an EEG-based multimodal approach, but also offered fundamental insights into the interplay of brain activity and musical contents in emotion modeling.

#### **INDIVIDUAL VARIABILITY AND COMMONALITY OF THE EEG DYNAMICS FOR EMOTION CLASSIFICATION**

Individual variability has been reported in emotion regulation (Gross and John, 2003). Such variability may introduce disparity in the informative EEG patterns across individuals or subgroups (Lin et al., 2010a, 2011). To estimate emotional states, it is plausible to expect that a subject-specific classification model learned from an individual's own data would yield optimal classification accuracy (Lin et al., 2010b). In the present study, the comparison of valence and arousal classification using subject-dependent and subject-independent features addressed this issue. The classification performance using the LFI-guided subject-independent EEG features (cf. **Figure 5**) was notably worse than that using the subject-dependent EEG set (cf. **Figure 2**). The commonality of the valence- and arousal-specific EEG features/electrodes across multiple subjects was rather small. Only seven and four informative EEG features consistently appeared in over 15 of the 26 subjects for valence and arousal classification, respectively (cf. **Table 3**). These results suggested that individual variability substantially affected emotion classification, especially for the arousal scale, and thereby posed a great challenge to learning a subject-independent emotion model using only the EEG signals.

However, it is worth noting that exploring a consensus set of emotion-relevant EEG activity across multiple subjects is of great importance to normative emotion research. In this study, the electrodes placed over the fronto-central region were relatively discriminative for most subjects (cf. **Figure 6**), which was in line with previous studies (Altenmuller et al., 2002; Lin et al., 2010a). Over this region, the lateralized power asymmetry (in the left-right direction) characterized the changes of emotional states well, which may be supported by the role of frontal cortical lateralization in emotion processing (Altenmuller et al., 2002; Allen et al., 2004). Specifically, the frontal theta asymmetry (FT7-FT8) and the fronto-central alpha asymmetry (FC3-FC4) associated with the valence scale were in line with other studies (Davidson, 1992; Aftanas et al., 2001; Schmidt and Trainor, 2001), whereas the fronto-central theta asymmetry (F7-F8 and FC3-FC4) related to the arousal scale was supported by Aftanas et al. (2004). Furthermore, several informative spectral asymmetries in the delta band for both emotional valence and arousal partially conformed to previous works (Lin et al., 2010a,b). Accordingly, the index of rhythmic lateralization presumably differentiated the brain activity into emotional states better, and acted more consistently across subjects, than the caudality (power asymmetry in the fronto-posterior direction) and individual spectra.
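A left-right asymmetry feature of the kind discussed above can be sketched as follows. The sketch assumes that the asymmetry is the difference in band power between a left electrode and its right-hemisphere counterpart; whether the study used a difference, a ratio, or log-transformed power is not stated in this section, so this choice, the function name, and the toy values are assumptions.

```python
def lateral_asymmetry(band_power, pairs):
    """Left-minus-right band-power difference for each electrode pair and band.

    band_power: dict mapping electrode -> dict mapping band -> power.
    pairs: list of (left_electrode, right_electrode) tuples.
    """
    features = {}
    for left, right in pairs:
        for band in band_power[left]:
            name = f"{left}-{right}_{band}"
            features[name] = band_power[left][band] - band_power[right][band]
    return features


# Toy band powers (hypothetical values, arbitrary units).
toy_power = {
    "FC3": {"alpha": 4.0, "theta": 2.0},
    "FC4": {"alpha": 1.5, "theta": 2.5},
}
toy_asymmetry = lateral_asymmetry(toy_power, [("FC3", "FC4")])
```

A positive value indicates more power over the left electrode of the pair, and a negative value indicates more power over the right one.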

#### **THE ROLE OF EEG AND MUSIC MODALITIES IN EMOTION CLASSIFICATION**

The empirical results of this study suggested that the inclusion of the acoustic characteristics of musical contents was not guaranteed to complement EEG dynamics in the emotion classification problem. One key factor is whether the EEG signals can be extracted from the whole brain and across the entire frequency range to encompass the full emotion-modulated spatio-spectral dynamics.

The optimized subject-dependent results showed that the EEG modality performed comparably with and without the inclusion of the music modality (cf. **Figure 3**) and that the EEG features tended to dominate the feature composition in the multimodal model (cf. **Figure 4**). This indicated that the musical content brought very limited or redundant discriminative power to the classification of emotional responses. The aforementioned individual variability might explain such results. The music modality, which lacks correlates of internal psychophysiological reactions, might more or less conflict with the brain signals, i.e., the EEG modality, in reflecting the felt emotional responses. Indeed, listeners might not actually perceive and experience the same emotion that the music was intended to express (Gabrielsson, 2002). Accordingly, it is reasonable to conclude that if the informative EEG features can be obtained from the whole brain and the entire frequency range, the inclusion of musical contents contributes little to the classification model, and the multimodal approach might not be necessary.

However, in practical ABCI applications, an EEG cap with whole-brain coverage might be impractical, and such coverage is unavailable in consumer-level headsets, e.g., the MindWave headset (NeuroSky, Inc.) and the Emotiv EPOC headset (Emotiv Systems, Inc.). In this case, the EEG features are measured by electrodes sparsely placed over certain brain regions. The suboptimal EEG features returned very poor emotion classification performance (even lower than random guessing, cf. **Figure 7**). The music modality under this circumstance provided complementary information and replaced a set of EEG features of lower discriminative power with the musical characteristics of timbre and loudness (cf. **Figure 8**). The musical dynamics tended to dominate the multimodal feature composition more in the arousal scale than in the valence scale. This phenomenon might be attributed to the fact that the music modality faced a great challenge in modeling emotional valence (Macdorman et al., 2007; Yang et al., 2008b). This might also explain why the improvement in the classification performance was more noticeable in the arousal classification. Thus, the music modality could be expected to boost the EEG-based emotion classification performance if the EEG dynamics were substantially limited to certain brain regions.

#### **INFORMATIVE MUSICAL CHARACTERISTICS FOR EMOTION CLASSIFICATION**

It is intuitively plausible that music conveys emotions through the manipulation of musical structures (Peretz et al., 1998; Schmithorst, 2005; Gomez and Danuser, 2007; Zatorre et al., 2007). Several neurophysiological studies devoted to the brain correlates of musical perception and emotion perception reported that some music-modulated brain activity is known to be involved in emotion processing (Blood et al., 1999; Tsang et al., 2001; Khalfa et al., 2005). It is reasonable to expect that a considerable amount of EEG rhythmicity is not only engaged in emotion processing but also modulated by music perception. Thus, the acoustic characteristics of musical contents and EEG dynamics could, to some degree, perform complementarily. Previous neurophysiological and music signal processing studies supported the findings listed in **Table 4**. Several neurophysiological studies found that mode and consonance were relevant to the distinction of emotional valence (Tsang et al., 2001; Sammler et al., 2007), whereas harmonic processing was very closely associated with emotional affect and intensity (Schmithorst, 2005). From the music signal processing perspective, Yang et al. (2008b) reported that the valence scale was better characterized by dissonance and pitch-related features, whereas the arousal scale was better modeled by timbre features. This was in line with the present findings of spectral dissonance and mode for the valence scale and a timbre element (8th MFCC) for the arousal scale. Aljanaki et al. (2013) also recently documented that the most important feature in the distinction of the arousal scale was loudness, which supported our findings for the arousal scale. It is encouraging that consistent findings on the musical structures were obtained with different musical datasets.

#### **COMPARING THE EMOTION CLASSIFICATION RESULTS WITH PREVIOUS WORKS**

Recent works that adopted an EEG-based multimodal approach are described here. Koelstra et al. (2012) proposed a decision-level fusion scheme to construct a multimodal pipeline (EEG, peripheral biosignals, music) for classifying emotional valence, arousal, and liking while subjects watched music videos. The classification performance using the EEG signals was marginally worse than that using musical features for valence (EEG: 58%, biosignals: 63%, music: 62%, majority: 59%) and arousal (EEG: 62%, biosignals: 57%, music: 65%, majority: 64%) classification. The fusion of EEG and musical features resulted in an optimal classification accuracy of around 63%, which marginally outperformed the EEG modality only for arousal classification. In the same year, Soleymani et al. (2012a) also adopted the decision-level approach and explored an optimal fusion pair among EEG signals, peripheral biosignals, and eye gaze for affective recognition during video appreciation. The classification performance using the EEG-gaze fusion was better than the single-modality results for valence (biosignals: 46%, EEG: 57%, gaze: 69%, fusion: 76%, random: 34%) and arousal (biosignals: 46%, EEG: 52%, gaze: 64%, fusion: 68%, random: 36%) classification. The authors later performed a follow-up study (Soleymani et al., 2012b) to compare schemes for fusing the EEG and gaze modalities at the feature and decision levels. They reported that the decision-level fusion returned better classification results than the single modalities for valence (EEG: 50%, eye: 67%, decision: 69%, random: 33%) and arousal (EEG: 62%, eye: 71%, decision: 76%, random: 33%) classification, whereas the feature-level fusion (valence: 58%, arousal: 66%) outperformed only the EEG modality. A year later, Koelstra and Patras (2013) similarly assessed the feasibility of feature- and decision-level multimodality (EEG dynamics and facial expression characteristics). The authors documented that the feature-level fusion in general marginally improved the performance over the single modalities for valence (EEG: 72%, face: 65%, fusion: 73%, majority: 62%) and arousal (EEG: 68%, face: 68%, fusion: 69%, majority: 62%) classification, whereas the decision-level approach using an optimal weighting scheme led to a more convincing improvement (valence: 74%, arousal: 72%). In the present study, the feature-level multimodal approach (EEG and musical features) was adopted to validate its feasibility for emotion classification in music listening. The empirical results showed that the subject-dependent multimodal approach marginally outperformed the single modalities for valence (EEG: 76%, music: 65%, fusion: 77%, majority: 63%) and arousal (EEG: 74%, music: 66%, fusion: 76%, majority: 61%) classification, whereas the subject-independent multimodal approach provided a more convincing improvement for valence (EEG: 61%, fusion: 67%) and arousal (EEG: 58%, fusion: 67%) classification.
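The distinction between the two fusion schemes compared above can be illustrated with a minimal sketch. Feature-level fusion is shown as concatenating the feature vectors of the modalities before a single classifier is trained, and decision-level fusion as a weighted average of per-modality class probabilities. The function names, the weights, and the 0.5 decision threshold are assumptions made for illustration and do not reproduce any of the cited pipelines.

```python
def feature_level_fusion(eeg_features, music_features):
    """Concatenate modality feature vectors into one composite vector."""
    return list(eeg_features) + list(music_features)


def decision_level_fusion(prob_by_modality, weights):
    """Weighted average of per-modality probabilities of the positive class.

    prob_by_modality: dict mapping modality -> P(positive class).
    weights: dict mapping modality -> non-negative weight.
    Returns (fused probability, predicted label using a 0.5 threshold).
    """
    total_weight = sum(weights[m] for m in prob_by_modality)
    fused = sum(weights[m] * p for m, p in prob_by_modality.items()) / total_weight
    return fused, (1 if fused >= 0.5 else 0)


# Toy inputs (hypothetical values).
composite = feature_level_fusion([0.1, 0.2], [0.3])
fused_prob, fused_label = decision_level_fusion(
    {"eeg": 0.8, "music": 0.4}, {"eeg": 0.75, "music": 0.25}
)
```

In the feature-level case a subsequent feature selection step, such as the F-score ranking, operates on the composite vector, so features from either modality can be retained or pruned.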

It is worth mentioning that a comparison based only on classification accuracy might not be fair, as a variety of factors might affect the classification results, including but not limited to experimental conditions, stimulus types, multimodal sources, and signal processing steps. Thus, for a fair comparison in multimodality, this study summarized the differences between this work and another related work (Koelstra et al., 2012) that also focused on the multimodality of EEG and musical dynamics. The subject-dependent results of this study were higher than theirs by at least 10%, whereas the subject-independent results of this study were marginally higher by around 3%, yet with fewer EEG features and electrodes. Despite the disparity in the selected musical features, the music modality results for emotional valence and arousal were comparable. Furthermore, compared with the studies' results using solely the EEG modality (Koelstra et al., 2012; Soleymani et al., 2012a,b; Koelstra and Patras, 2013), the proposed subject-dependent EEG features (MESH) should be comparable to or even better than previous reports. In contrast, the classification performance using the proposed subject-independent EEG set might only compare favorably to that of one study (Koelstra et al., 2012).

#### **OUTPERFORMED EEG PATTERNS FOR EMOTION CLASSIFICATION**

The MESH features, in conjunction with the F-score feature selection, produced a compact set of informative features and consequently optimized the classification performance compared with the other feature types (DLAT, DCAU, and PSD) (cf. **Figure 2**). The performance improvement might be attributed to the fact that emotion processing might be accompanied by EEG dynamics that vary distinctly within and between brain regions (Schmidt and Trainor, 2001; Aftanas et al., 2004; Sarlo et al., 2005; Sammler et al., 2007; Lin et al., 2010b). The MESH features, composed of two-directional power asymmetry (laterality and caudality) and individual spectra over the scalp, allow an optimal set to be sought for constructing a classification model for each individual. Regarding the feature composition (cf. **Figure 4**), both DLAT and DCAU dominated the EEG composition over PSD. Specifically, DLAT consistently appeared in multiple subjects (cf. **Table 3**). These results suggested that features depicting the directional spectral differences between brain regions might be of importance in EEG-based emotion modeling.

#### **THE CHOICE OF EEG ELECTRODE REFERENCE**

The EEG signals analyzed in this study were recorded with reference to the linked mastoids. The potentials recorded over the mastoids are conventionally believed to be neutral to the measured neural activities of interest, and this reference was also adopted in previous music studies (Koelsch et al., 2007; Sammler et al., 2007). However, a few reports demonstrated that the linked-mastoids reference might introduce non-neutrality to the recorded EEG signals and distort the EEG spectra (Yao, 2001; Marzetti et al., 2007; Qin et al., 2010). Comparing the effects of different reference strategies on emotion classification is an important issue, but it is beyond the scope of this study. Interested readers can refer to the studies on reference techniques by Yao (2001), Marzetti et al. (2007), and Qin et al. (2010).

#### **FUTURE DIRECTION**

Future efforts can be devoted to augmenting the multimodal classification performance as follows. First, data-driven approaches, e.g., principal component analysis (Lin et al., 2009) and independent component analysis (Lin et al., 2010a), might be feasible means of further elaborating the EEG spatio-spectral dynamics associated with implicit emotional responses. Second, advanced music signal processing techniques can be incorporated to extract other musical characteristics, e.g., rhythm. Lastly, decision-level multimodal fusion has been reported to obtain convincing improvements in classification performance over feature-level fusion (Soleymani et al., 2012b; Koelstra and Patras, 2013). Building on the EEG and musical features explored in this study, fusion at the decision level can be further explored and compared.

# **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 January 2014; accepted: 12 April 2014; published online: 01 May 2014. Citation: Lin Y-P, Yang Y-H and Jung T-P (2014) Fusion of electroencephalographic dynamics and musical contents for estimating emotional responses in music listening. Front. Neurosci. 8:94. doi: 10.3389/fnins.2014.00094*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Lin, Yang and Jung. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Hybrid fNIRS-EEG based classification of auditory and visual perception processes

#### *Felix Putze<sup>1</sup>\*, Sebastian Hesslinger<sup>1</sup>, Chun-Yu Tse<sup>2,3</sup>, YunYing Huang<sup>4</sup>, Christian Herff<sup>1</sup>, Cuntai Guan<sup>5</sup> and Tanja Schultz<sup>1</sup>*

*<sup>1</sup> Cognitive Systems Lab, Institute of Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany*

*<sup>2</sup> Department of Psychology, Center for Cognition and Brain Studies, The Chinese University of Hong Kong, Hong Kong, China*

*<sup>3</sup> Temasek Laboratories, National University of Singapore, Singapore, Singapore*

*<sup>4</sup> Nuffield Department of Clinical Neurosciences, John Radcliffe Hospital, Oxford, UK*

*<sup>5</sup> Institute for Infocomm Research (I2R), A\*STAR, Singapore, Singapore*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Ricardo Chavarriaga, Ecole Polytechnique Fédérale de Lausanne, Switzerland; Clemens Brunner, Graz University of Technology, Austria; Jonas Brönstrup, Institute of Technology Berlin, Germany*

#### *\*Correspondence:*

*Felix Putze, Cognitive Systems Lab, Institute of Anthropomatics and Robotics, Karlsruhe Institute of Technology, Adenauerring 4, Karlsruhe 76131, Germany e-mail: felix.putze@kit.edu*

For multimodal Human-Computer Interaction (HCI), it is very useful to identify the modalities on which the user is currently processing information. This would enable a system to select complementary output modalities to reduce the user's workload. In this paper, we develop a hybrid Brain-Computer Interface (BCI) which uses Electroencephalography (EEG) and functional Near Infrared Spectroscopy (fNIRS) to discriminate and detect visual and auditory stimulus processing. We describe the experimental setup we used for collection of our data corpus with 12 subjects. On this data, we performed cross-validation evaluation, of which we report accuracy for different classification conditions. The results show that the subject-dependent systems achieved a classification accuracy of 97.8% for discriminating visual and auditory perception processes from each other and a classification accuracy of up to 94.8% for detecting modality-specific processes independently of other cognitive activity. The same classification conditions could also be discriminated in a subject-independent fashion with accuracy of up to 94.6 and 86.7%, respectively. We also look at the contributions of the two signal types and show that the fusion of classifiers using different features significantly increases accuracy.

**Keywords: brain-computer interface, EEG, fNIRS, visual and auditory perception**

# **1. INTRODUCTION**

For the last decade, multimodal user interfaces have become omnipresent in the field of human-computer interaction and in commercially available devices (Turk, 2014). Multimodality refers to the possibility to operate a system using multiple input modalities but also to the ability of a system to present information using different output modalities. For example, a system may present information on a screen using text, images and videos or it may present the same information acoustically by using speech synthesis and sounds. However, such a system has to select an output modality for each given situation. One important aspect it should consider when making this decision is the user's workload level which can negatively influence task performance and user satisfaction, if too high. The output modality of the system which imposes the smaller workload on the user does not only depend on the actions of the system itself, but also on concurrently executed cognitive tasks. Especially in dynamic and mobile application scenarios, users of a system are frequently exposed to external stimuli from other devices, people or their general environment.

According to the multiple resource theory of Wickens (2008), the impact of a dual task on the workload level depends on the type of cognitive resources which are required by both tasks. If the overlap is large, the limited resources have to be shared between both tasks and overall workload will increase compared to a pair of tasks with less overlap, even if the total individual task load is identical. For example, Yang et al. (2012) presented a study in which they combined a primary driving task with additional auditory and visual tasks of three difficulty levels. They showed that the difference in the performance level of the driving task depends on the modality of the secondary task: According to their results, secondary visual tasks had a stronger impact on driving than secondary auditory tasks, even if the individual workload of the auditory tasks was slightly higher than that of the visual tasks. For Human-Computer Interaction (HCI), this implies that when the interaction strategy of the system must select from different output channels by which it can transfer information to the user, its behavior should take into account the cognitive processes which are already ongoing. It is possible to model the resource demands of cognitive tasks induced by the system itself (see for example Cao et al., 2009). For example, we know that presenting information using speech synthesis requires auditory perceptual resources while presenting information using a graphical display will require visual perceptual resources. However, doing the same for independent parallel tasks is impossible in an open-world scenario where the number of potential distractions is virtually unlimited. Therefore, we have to employ sensors to infer which cognitive resources are occupied.

To some degree, perceptual load can be estimated from context information gathered using sensors like microphones or cameras. However, if, for example, the user wears earmuffs or headphones, acoustic sensors cannot reliably relate acoustic scene events to processes of auditory perception. Therefore, we need a more direct method to estimate those mental states. A Brain-Computer Interface (BCI) is a "system that measures central activity and converts it into artificial output that replaces, restores, enhances, supplements, or improves natural central nervous system output" (Wolpaw and Wolpaw, 2012). BCIs can therefore help to detect or discriminate perceptual processes for different modalities directly from measures of brain activity, which makes them strong candidates for reliably discriminating and detecting modality-specific perceptual processes. As BCIs have many additional uses for active interface control or for passive user monitoring, they may already be in place for other tasks and would not require any additional equipment.

Our system combines two different signal types [Electroencephalography (EEG) and functional Near Infrared Spectroscopy (fNIRS)] to exploit their complementary nature and to investigate their individual potential for classifying modality-specific perceptual processes: EEG is the traditional signal for BCIs, recording electrical cortical activity using electrodes. fNIRS on the other hand captures the hemodynamic response by exploiting the fact that oxygenated and de-oxygenated blood absorb different proportions of light of different wavelengths in the near-infrared spectrum. fNIRS captures different correlates of brain activity than EEG: While EEG measures an electrical process, fNIRS measures metabolic response to cognitive activity. This fact makes it plausible that a fusion of both signal types can give a more robust estimation of a person's cognitive state.

BCIs based on EEG have been actively researched since the 1970s, for example in computer control for locked-in patients (e.g., Wolpaw et al., 1991; Sitaram et al., 2007). BCIs based on fNIRS have become increasingly popular since the middle of the last decade (Sitaram et al., 2007). The term hybrid BCI generally describes a combination of several individual BCI systems (or the combination of a BCI with another interface) (Pfurtscheller et al., 2010). A sequential hybrid BCI employs two BCIs one after another. One application of a sequential BCI is to have the first system act as a "brain switch" to trigger the second system. A sequential hybrid BCI usually resorts to different types of brain activity measured by a single signal type (e.g., correcting mistakes of a P300 speller by detecting error potentials, Spüler et al., 2012). In contrast, a simultaneous hybrid BCI system usually combines entirely different types of brain signals to improve the robustness of the joint system. The first simultaneous hybrid BCI based on synchronous measures of fNIRS and EEG was proposed by Fazli et al. (2012) for classification of motor imagery and motor execution recordings. The authors reported an improvement in recognition accuracy by combining both signal types.

Zander and Kothe (2011) defined Passive BCI as follows: "a passive BCI is one that derives its outputs from arbitrary brain activity arising without the purpose of voluntary control, for enriching a human-machine interaction with implicit information on the actual user state." A number of such systems exist to classify the user's workload level, for example those presented by Heger et al. (2010) or Kothe and Makeig (2011). Those systems used different EEG feature extraction techniques that are usually related to the frequency power distribution to classify low and high workload conditions. Other researchers derived features from Event Related Potentials (ERPs) in the time domain (Allison and Polich, 2008; Brouwer et al., 2012) or used Common Spatial Patterns (Dijksterhuis et al., 2013) to discriminate workload levels. Workload level is typically assessed from subjective questionnaires or task difficulty. Sassaroli et al. (2008) placed fNIRS optodes on the forehead to measure concentration changes of oxyhemoglobin and deoxyhemoglobin in the prefrontal cortex during memory tasks and discriminated between three different levels of workload in three subjects. Similarly, Bunce et al. (2011) discriminated different workload levels for a complex Warship Commander Task, for which task difficulty was manipulated to create different levels of workload. They recorded fNIRS from 16 optodes at the dorsolateral prefrontal cortex and saw significant differences in oxygenation between low and high workload conditions. They also observed a difference in signal response to different difficulty settings for expert and novice users, which was mirrored by the behavioral data. Herff et al. (2014) showed that it is possible to classify different levels of n-back difficulty corresponding to different levels of mental workload on single trials from prefrontal fNIRS signals with an accuracy of up to 78%. Hirshfield et al. (2009) combined EEG and fNIRS data for workload estimation in a counting task and saw better results for fNIRS in comparison to frequency-based EEG features. The authors reported surprisingly low accuracy for their EEG-based classifier and suspected problems with coverage of relevant sites and montage-specific artifacts. In contrast, Coffey et al. (2012) presented results from a similar study but showed worse results for the fNIRS features. From the available literature, it is hard to judge the relative discriminative power of the different signal types. On the one hand, Coffey et al. (2012) and Hirshfield et al. (2009) cover only a small aspect of general passive BCI research as they both concentrate on the classification of workload and use similar fNIRS montages. On the other hand, the experiments are too different to expect identical results (different cognitive tasks, different features, etc.). Therefore, there is too little data available for a final call on the synergistic potential between both modalities and their applicability to specific classification tasks. This paper contributes to answering this question by investigating a very different fNIRS montage, by including different types of EEG features to ensure adequate classification accuracy and by looking at a more specific aspect of cognitive activity, namely processing of different input modalities.

All the systems mentioned above modeled workload as a monolithic construct and did not classify the resource types which contributed to a given overall workload level. While there exist user studies, e.g., Heger et al. (2011), which show that it is possible to improve human-computer interaction using this construct, many use cases (such as the selection between auditory and visual output modalities mentioned above) require a more fine-grained model of mental workload, like the multiple resource theory (Wickens, 2008). Neural evidence from a study by Keitel et al. (2012) of subjects switching between bimodal and unimodal processing also indicated that cognitive resources for visual and auditory processing should be modeled separately. Most basic visual processing takes place in the visual cortex of the human brain, located in the occipital lobe, while auditory stimuli are processed in the auditory cortex located in the temporal lobes. This clear localization of important modality-specific areas in the cortex, accessible to non-invasive sensors, hints at the feasibility of separating both types of processing modes.

In this paper, we investigate how reliably a hybrid BCI using synchronous EEG and fNIRS signals can perform such classification tasks. We describe an experimental setup in which natural visual and auditory stimuli are presented in isolation and in parallel to the subject, from whom both EEG and fNIRS data are recorded. On a corpus of 12 recorded sessions, we train BCIs using features from one or both signal types to differentiate and detect the different perceptual modalities. This paper contributes a number of substantial findings to the field of passive BCIs for HCI: We trained and evaluated classifiers which can either discriminate between predominantly visual and predominantly auditory perceptual activity or which were able to detect visual and auditory activity independently of each other. The latter is ecologically important as many real-life tasks demand both visual and auditory resources. We showed that both types of classifiers achieved a very high accuracy both in a subject-dependent and subject-independent setup. We investigated the potential of combining different feature types derived from different signals to achieve a more robust and accurate recognition result. Finally, we look at the evaluation of the system on continuous data.

### **2. MATERIALS AND METHODS**

#### **2.1. PARTICIPANTS**

Twelve healthy young adults (6 male, 6 female), aged between 21 and 30 years (mean age 23.6, *SD* 2.6 years), without any known history of neurological disorders participated in this study. All of them had normal or corrected-to-normal visual acuity and normal auditory acuity, and were paid for their participation. The experimental protocol was approved by the local ethical committee of National University of Singapore, and performed in accordance with the policy of the Declaration of Helsinki. Written informed consent was obtained from all subjects and the nature of the study was fully explained prior to the start of the study. All subjects had previous experience with BCI operation or EEG/fNIRS recordings.

#### **2.2. EXPERIMENTAL PROCEDURE**

Subjects were seated in a sound-attenuated room with a distance of approximately one meter from a widescreen monitor (24-inch BenQ XL2420T LED Monitor, 120 Hz, 1920 × 1080), which was equipped with two loudspeakers on both sides (DELL AX210 Stereo Speaker). During the experiment, subjects were presented with movie and audio clips, i.e., silent movies (no sound; VIS), audiobooks (no video; AUD), and movies with both video and audio (MIX). We have chosen natural, complex stimuli in contrast to more controlled, artificially generated stimuli to keep subjects engaged with the materials and to achieve a realistic setup.

Besides any stimulus material, the screen always showed a fixation cross. Subjects were given the task to look at the cross at all times to avoid an accumulation of artifacts. When there was no video shown, e.g., during audio clips and during rest periods, the screen showed the fixation cross on a dark gray background. In addition to the auditory, visual and audiovisual trials, there were IDLE trials. During IDLE, we showed a dark gray screen with a fixation cross in the same way as during the rest period between different stimuli. Therefore, subjects were not able to distinguish this condition from the rest period. In contrast to the rest periods, IDLE trials did not follow immediately after a segment of stimulus processing and can therefore be assumed to be free of fading cognitive activity. IDLE trials were assumed not to contain any systematic processing of stimuli. While subjects received other visual or auditory stimulations from the environment during IDLE trials, those stimulations were not task relevant and of lesser intensity compared to the prepared stimuli. In contrast to AUD, VIS, and MIX trials, there was no additional resting period after IDLE trials.

The entire recording, which had a total duration of nearly 1 h, consisted of five blocks. **Figure 1** gives an overview of the block design. The first block consisted of three continuous clips (60 s audio, 60 s video, 60 s audio and video) with a break of 20 s between each of them. This block had a fixed duration of 3 min 40 s. The remaining four blocks had random durations of approximately 13 min each. Blocks 2–5 followed a design with random stimulus durations of 12.5 ± 2.5 s (uniformly distributed) and rest periods of 20 ± 5 s (uniformly distributed). The stimulus order of different modalities was randomized within each block. However, there were no two consecutive stimuli of the same modality. **Figure 2** shows an example of four consecutive trials in the experiment. Counted over all blocks, there were 30 trials of each category AUD, VIS, MIX, and IDLE.
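The timing rules of blocks 2–5 can be illustrated with a short sketch. It is not the authors' stimulus software: the function name `draw_block`, its parameters, and the choice to draw modalities independently (without balancing the 30 trials per category) are assumptions made only for illustration.

```python
import random

MODALITIES = ["AUD", "VIS", "MIX", "IDLE"]


def draw_block(n_trials, seed=None):
    """Draw one block as a list of (modality, stimulus_s, rest_s) tuples.

    Stimulus durations are uniform on [10, 15] s (12.5 +/- 2.5 s) and rest
    periods are uniform on [15, 25] s (20 +/- 5 s). IDLE trials get no
    additional rest period, as described in the text. Category counts are
    NOT balanced here; that simplification is an assumption of this sketch.
    """
    rng = random.Random(seed)
    trials, previous = [], None
    for _ in range(n_trials):
        # Exclude the previous modality so no two consecutive trials match.
        modality = rng.choice([m for m in MODALITIES if m != previous])
        stimulus = rng.uniform(10.0, 15.0)
        rest = 0.0 if modality == "IDLE" else rng.uniform(15.0, 25.0)
        trials.append((modality, stimulus, rest))
        previous = modality
    return trials


# Fixed first block: three 60 s clips separated by two 20 s breaks.
first_block_s = 3 * 60 + 2 * 20  # 220 s, i.e., 3 min 40 s
```

The last line reproduces the stated duration of the first block from the clip and break lengths.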

The stimuli of one modality in one block formed a coherent story. During the experiment, subjects were instructed to memorize as much of these stories (AUD/VIS/MIX story) as possible. In order to ensure that subjects paid attention to the task, they filled out a set of multiple choice questions (one for each story) after each block. This included questions on contents, e.g., "what happens after. . . ?", as well as general questions, such as "how many different voices appeared?" or "what was the color of . . . ?". According to their answers, all subjects paid attention throughout the entire experiment. In the auditory condition, subjects achieved an average correct answer rate of 85%, whereas in the visual condition the correct answer rate was 82%.

#### **2.3. DATA ACQUISITION**

For fNIRS recording, a frequency-domain oximeter (Imagent, ISS, Inc., Champaign, IL, USA) was employed. Frequency-modulated near-infrared light from laser diodes (690 nm or 830 nm, 110 MHz) was conducted to the participant's head with 64 optical source fibers (32 for each wavelength), pairwise co-localized in light source bundles. A rigid custom-made head-mount system (montage) was used to hold the source and detector fibers to cover three different areas on the head: one for the visual cortex and one on each side of the temporal cortex. The multi-distance approach as described in Wolf et al. (2003) and Joseph et al. (2006) was applied in order to create overlapping light channels. **Figure 3** shows the arrangement of sources and detectors in three probes (one at the occipital cortex and two at the temporal lobe). For each probe, two columns of detectors were placed between two rows of sources each to the left and the right, at source-detector distances of 1.7–2.5 cm. See **Figure 3A** for the placement of the probes and **Figure 3B** for the arrangement of the sources and detectors. After separating source-detector pairs of different probes into three distinct areas, there were a total of 60 channels on the visual probe and 55 channels on each auditory probe. Thus, there was a total number of *n<sub>c</sub>* = 170 channels. The sampling frequency used was 19.5 Hz.

EEG was simultaneously recorded with an asalab ANT neuro amplifier and digitized with a sampling rate of 256 Hz. The custom-made head-mount system used for the optical fibers also enabled us to place the following 12 Ag/AgCl electrodes according to the standard 10–20 system: Fz, Cz, Pz, Oz, O1, O2, FT7, FT8, TP7, TP8, M1, M2. Both M1 and M2 were used as reference.

After the montage was positioned, the locations of fNIRS optodes, EEG electrodes, as well as the nasion, pre-auricular points and 123 random scalp coordinates were digitized with Visor (ANT BV) and ASA 4.5 3D digitizer. Using each subject's structural MRI, these digitized points were then coregistered, following Whalen et al. (2008), in order to have all subjects' data in a common space.

**Figure 3 |** … schematic in **(B)** for the two auditory probes (top left and right) and the visual probe (bottom).

#### **2.4. PREPROCESSING**

The preprocessing of both fNIRS and EEG data was performed offline. Optical data included an AC, a DC, and a phase component; however, only the AC intensities were used in this study. Data from each AC channel were normalized by dividing them by their mean, pulse-corrected following Gratton and Corballis (1995), median filtered with a filter length of 8 s, and downsampled from 19.5 to 1 Hz. The downsampled optical density changes Δ*OD*<sub>*c*</sub> were converted to changes in concentration of oxyhemoglobin (HbO) and deoxyhemoglobin (HbR) using the modified Beer-Lambert law (MBLL) (Sassaroli and Fantini, 2004).
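As a rough illustration of the MBLL conversion step, the sketch below inverts the standard two-wavelength linear model, in which the optical density change at each wavelength equals the path length (distance times differential path-length factor) times the extinction-weighted sum of the HbO and HbR concentration changes. The function name `mbll_invert` and all numeric values are placeholders chosen for this example; they are not the HOMER2 parameters or tabulated extinction coefficients used in the study.

```python
import numpy as np


def mbll_invert(delta_od, extinction, distance_cm, dpf):
    """Convert optical density changes at two wavelengths to Hb changes.

    delta_od:    array (2, n_samples), one row per wavelength
    extinction:  array (2, 2), rows = wavelengths, columns = (HbO, HbR)
    distance_cm: source-detector distance
    dpf:         array (2,), differential path-length factor per wavelength
    Returns an array (2, n_samples): row 0 = delta HbO, row 1 = delta HbR.
    """
    delta_od = np.asarray(delta_od, dtype=float)
    # Forward model: delta_od[i] = distance * dpf[i] * (extinction[i] @ delta_c)
    path = distance_cm * np.asarray(dpf, dtype=float)
    system = np.asarray(extinction, dtype=float) * path[:, None]
    return np.linalg.solve(system, delta_od)


# Round trip with arbitrary placeholder numbers (NOT real coefficients).
EXT = np.array([[0.35, 2.1], [1.0, 0.78]])
DPF = np.array([6.0, 6.0])
true_c = np.array([[1e-3, 2e-3], [-5e-4, -1e-3]])
od = (EXT * (2.0 * DPF)[:, None]) @ true_c
recovered = mbll_invert(od, EXT, 2.0, DPF)
```

The round trip only checks that the inversion undoes the forward model; it says nothing about the physiological plausibility of the placeholder values.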

The parameters for differential path-length factor and wavelength-dependent extinction coefficient within this study were based on standard parameters in the HOMER2 package, which was used for the conversion process (Huppert et al., 2009). Values of molar extinction coefficients were taken from http://omlc.ogi.edu/spectra/hemoglobin/<sup>1</sup>. Finally, common average referencing (CAR) was applied to the converted data in order to reduce noise and artifacts that are common to all channels (Ang et al., 2012). Thereby, the mean of all channels is subtracted from each individual channel *c*. It is performed on both ΔHbO and ΔHbR.

<sup>1</sup>Compiled by Scott Prahl using data from: W. B. Gratzer, Med. Res. Council Labs, Holly Hill, London, and N. Kollias, Wellman Laboratories, Harvard Medical School, Boston.
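Common average referencing as described above amounts to removing the across-channel mean at every time point. A minimal sketch, assuming the data are stored as a channels × samples array (the function name is hypothetical):

```python
import numpy as np


def common_average_reference(data):
    """Subtract the mean over all channels from each individual channel.

    data: array (n_channels, n_samples) of concentration changes
          (applied separately to the HbO and HbR data).
    """
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=0, keepdims=True)


referenced = common_average_reference([[1.0, 2.0], [3.0, 6.0]])
```

After referencing, the channel mean is zero at every sample, so any component shared equally by all channels is removed.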

EEG data were preprocessed with EEGLAB 2013a (Delorme and Makeig, 2004). First, the data were bandpass filtered in the range of 0.5–48 Hz using a FIR filter of standard filter order of 6 (= 3/low cutoff · sampling rate). Then, non-brain artifacts were rejected using Independent Component Analysis (ICA) as proposed by Jung et al. (2000). In this process, all 10 channels were converted to 10 independent components. One component of each subject was rejected based on prefrontal eye blink artifacts. Finally, the prestimulus mean of 100 ms was subtracted from all stimulus-locked data epochs.

#### **2.5. GRAND AVERAGES**

In the following, we calculate Grand Averages of both fNIRS and EEG signals (in time domain and frequency domain) for the different types of stimuli. This is done to investigate the general sensitivity of the signals to differences in modality and to motivate the feasibility of different feature types which we define later for classification.

**Figure 4** shows the averaged haemodynamic response function (HRF) for selected channels of all 12 subjects for labels AUD (blue), VIS (red), and IDLE (black). The stimulus-locked data trials (blocks 2–5) were epoched by extracting the first 10 s of each stimulus, and a 2 s prestimulus baseline was subtracted from each channel. There was a clear peak in the HRF in response to a VIS stimulus on channels from the occipital cortex (channels 141 and 311 in the figure) and a return to baseline after the stimulus ended (on average after 12.5 s). This effect was absent for an AUD stimulus. Conversely, the channels from the auditory cortex (channels 30 and 133 in the figure) reacted much more strongly to an AUD than to a VIS stimulus.

**Figure 5** shows the first second of ERP waveforms of conditions AUD (blue), VIS (red), and IDLE (black), averaged across all 12 subjects. It shows distinctive patterns for auditory and visual stimuli when comparing electrodes at the visual cortex with electrodes at more frontal positions. It is also widely known that frequency responses can be used to identify cognitive processes. **Figure 6** shows power spectral density on a logarithmic scale at a frontal midline position (Fz), at the occipital cortex (Oz) and the temporal lobe (FT7). The plots indicate that especially visual activity can be easily discriminated from auditory activity and no perceptual activity. This fact becomes especially evident at electrode site Oz. The alpha peak for the AUD condition is expected, but unusually pronounced. We attribute this to the fact that the VIS stimuli are richer compared to the AUD stimuli as they often contain multiple parallel points of interest and visual attractors at once. The difference between VIS and AUD trials also involves not only perceptual processes but also other aspects of cognition, as they differ in content, processing codes and other parameters. On the one hand, this is a situation specific to the scenario we employed. On the other hand, we argue that this difference between visual and auditory information processing holds for most natural conditions. We will investigate this issue by looking at the discriminability of AUD and IDLE conditions and also at the influence of alpha power on overall performance.

#### **2.6. CLASSIFICATION**


In this study, we first aimed to classify auditory against visual perception processes. Second, we wanted to detect auditory or visual processes, i.e., we classify modality-specific activity vs. no activity. Third, we wanted to detect a certain perception process in presence of other perception processes.

To demonstrate the expected benefits of combining the fNIRS and EEG signals, we first explored two individual classifiers for each signal domain, before we examined their combination by estimating a meta classifier. The two individual fNIRS classifiers were based on the evoked deflection from baseline HbO (HbO classifier) and HbR (HbR classifier). The EEG classifiers were based on induced band power changes (POW classifier) and the downsampled ERP waveform (ERP classifier).

#### *2.6.1. fNIRS features*

Assuming an idealized haemodynamic stimulus response, i.e., a rise in HbO (HbO features) and a decrease in HbR (HbR features), stimulus-locked fNIRS features were extracted by taking the mean of the first few samples (i.e., *t*<sub>opt</sub> − *w*/2, ..., *t*<sub>opt</sub>) subtracted from the mean of the following samples (i.e., *t*<sub>opt</sub>, ..., *t*<sub>opt</sub> + *w*/2) in all channels *c* of each trial, similar to Leamy et al. (2011). Equation 1 illustrates how the feature was calculated.

$$\begin{split} f_{c}^{\text{HbO}} &= \frac{2}{w} \left( \sum_{t = t_{\text{opt}}}^{t_{\text{opt}} + \frac{w}{2}} \overline{\Delta[\text{HbO}]}_{c}(t) - \sum_{t = t_{\text{opt}} - \frac{w}{2}}^{t_{\text{opt}}} \overline{\Delta[\text{HbO}]}_{c}(t) \right) \\ f_{c}^{\text{HbR}} &= \frac{2}{w} \left( \sum_{t = t_{\text{opt}}}^{t_{\text{opt}} + \frac{w}{2}} \overline{\Delta[\text{HbR}]}_{c}(t) - \sum_{t = t_{\text{opt}} - \frac{w}{2}}^{t_{\text{opt}}} \overline{\Delta[\text{HbR}]}_{c}(t) \right) \end{split} \tag{1}$$
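Equation 1 can be sketched in a few lines of code. The version below assumes 1 Hz data stored as a channels × samples array, an even window width *w* given in samples, and inclusive summation bounds as written in the equation; the function name `fnirs_feature` is hypothetical.

```python
import numpy as np


def fnirs_feature(signal, t_opt, w):
    """Evaluate Equation 1 for one trial and one chromophore.

    signal: array (n_channels, n_samples), e.g. delta HbO at 1 Hz
    t_opt:  sample index separating the early and late half-window
    w:      even window width in samples
    Returns one feature value per channel.
    """
    signal = np.asarray(signal, dtype=float)
    half = w // 2
    if w % 2 or t_opt - half < 0 or t_opt + half >= signal.shape[1]:
        raise ValueError("window must be even and lie inside the trial")
    # Both sums include the sample at t_opt, as in the equation.
    late = signal[:, t_opt:t_opt + half + 1].sum(axis=1)
    early = signal[:, t_opt - half:t_opt + 1].sum(axis=1)
    return (2.0 / w) * (late - early)


# A unit-slope ramp in a single channel as a sanity check.
ramp = np.arange(10, dtype=float)[None, :]
feature = fnirs_feature(ramp, t_opt=5, w=4)
```

For a rising signal the feature is positive and for a flat signal it is zero, which matches the intended use as a measure of the deflection around *t*<sub>opt</sub>.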

#### *2.6.2. EEG features*

For POW, the entire 10 s of all 10 channels were transformed to the spectral domain using Welch's method, and every other frequency component in the range of 3–40 Hz was concatenated to a 38-dimensional feature vector per channel. ERP features were always based on the first second (onset) of each trial. First, the ERP waveform was passed through a median filter (*k*<sub>med</sub> = 5 ≈ 0.02 s), followed by a moving average filter (*k*<sub>avg</sub> = 13 ≈ 0.05 s). A final downsampling of the resulting waveform (*k*<sub>down</sub> = *k*<sub>avg</sub>) produced a 20-dimensional feature vector for each channel.
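The ERP feature chain (median filter, moving average, downsampling) can be sketched for a single channel as follows. Edge padding at the epoch borders and taking every 13th sample starting at index 0 are assumptions of this sketch, since the text does not specify border handling; the function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def erp_features(epoch, k_med=5, k_avg=13):
    """Smooth and downsample one channel of a 1 s epoch sampled at 256 Hz.

    epoch: array (n_samples,)
    Returns the downsampled waveform (20 values for 256 input samples).
    """
    epoch = np.asarray(epoch, dtype=float)
    # Median filter of length k_med; borders are edge-padded (assumption).
    padded = np.pad(epoch, k_med // 2, mode="edge")
    med = np.median(sliding_window_view(padded, k_med), axis=-1)
    # Moving average of length k_avg, again with edge padding.
    padded = np.pad(med, k_avg // 2, mode="edge")
    smooth = np.convolve(padded, np.ones(k_avg) / k_avg, mode="valid")
    # Downsample by the moving-average length (k_down = k_avg).
    return smooth[::k_avg]


features = erp_features(np.full(256, 2.0))
```

With 256 samples and a step of 13, the decimation keeps indices 0, 13, ..., 247, which gives the 20 values per channel mentioned in the text.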

In the end, all features, i.e., HbO, HbR, POW, and ERP, were standardized to zero mean and unit standard deviation (z-normalization).

Four individual classifiers were trained based upon these four different feature types. Each classifier yielded a probability distribution across (the two) classes. Using those individual class probability values, we further evaluated a META classifier, based on decision fusion: The META classifier was based on the weighted sum *p*<sub>meta</sub> = Σ<sub>*m*</sub> *w<sub>m</sub>* · *p<sub>m</sub>* of the class probability values *p<sub>m</sub>* of each of the four individual classifiers (*m* = HbO, HbR, POW, and ERP) with weight *w<sub>m</sub>*. The class with the higher *p*<sub>meta</sub>, i.e., the maximum likelihood class, was then selected as the result of the META classifier.

The weights *w<sub>m</sub>* were estimated based on the classification accuracy on evaluation data (i.e., labeled data which is not part of the training data but available when building the classifier). Specifically, those classification accuracies that were higher than baseline (pure chance, i.e., 0.5 for the balanced binary classification conditions) were linearly scaled to the interval [0, 1], while those that were below baseline were weighted with 0, and thus, not incorporated. Afterwards, the weight vector *w* = [*w*<sub>HbO</sub>, *w*<sub>HbR</sub>, *w*<sub>POW</sub>, *w*<sub>ERP</sub>]<sup>*T*</sup> was divided by its 1-norm in order to sum all of its elements to 1.
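The decision fusion can be illustrated with a small sketch. Two details are assumptions rather than statements from the text: the linear scaling is taken to map the range from baseline to 1 onto [0, 1], and a uniform weighting is used as a fallback when no classifier exceeds the baseline. The function names are hypothetical.

```python
import numpy as np


def fusion_weights(accuracies, baseline=0.5):
    """Turn evaluation accuracies into fusion weights that sum to 1."""
    acc = np.asarray(accuracies, dtype=float)
    # Map [baseline, 1] linearly onto [0, 1]; below-baseline values get 0.
    raw = np.clip((acc - baseline) / (1.0 - baseline), 0.0, None)
    total = raw.sum()
    if total == 0.0:
        # Fallback when no classifier beats chance (assumption of this sketch).
        return np.full(acc.shape, 1.0 / acc.size)
    return raw / total  # division by the 1-norm


def meta_predict(probabilities, weights):
    """Weighted sum of class probabilities; returns (class index, p_meta).

    probabilities: array (n_classifiers, n_classes)
    """
    p_meta = np.asarray(weights) @ np.asarray(probabilities, dtype=float)
    return int(np.argmax(p_meta)), p_meta


# Order of classifiers: HbO, HbR, POW, ERP (illustrative numbers only).
weights = fusion_weights([0.75, 0.40, 1.00, 0.50])
label, p_meta = meta_predict(
    [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]], weights
)
```

In this toy example the two classifiers at or below chance receive zero weight, so only the first and third classifiers contribute to the fused decision.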

For the first three classifiers (HbO, HbR, and POW), a regularized linear discriminant analysis (LDA) classifier was employed (implemented following Schlogl and Brunner, 2008, with a shrinkage factor of 0.5, as determined on evaluation data), while a soft-margin linear support vector machine (SVM) was used for the ERP classifier (using the LibSVM implementation by Chang and Lin, 2011, with default parameters). This was done because we expected the first three feature sets to be normally distributed (i.e., LDA is optimal), while we expected the more complex and variable temporal patterns of an ERP to require a more robust classification scheme. Note that this design choice was validated by evaluating both types of classifiers for all types of features on a representative subset of the data corpus. This ensured that in the reported results we used the classifier which leads to the optimal classification accuracy for every feature set.

For evaluation of the proposed hybrid BCI, we define a number of binary classification tasks. We call each different classification task a *condition*. Classification was performed for each modality and feature type separately as well as for the combined META classifier. In the subject-dependent case, we applied leave-one-trial-out cross-validation (resulting in 60 folds for 60 trials per subject). To estimate the parameters of feature extraction and classification (*t*<sub>opt</sub> and *w* from Equation 1, fusion weights *w<sub>m</sub>*), we performed another nested 10-fold cross-validation on the training set of each fold (i.e., in each inner fold, we have 53 trials for training and 6 trials for evaluation, with 5 evaluation trials in the last fold). The averaged accuracy in the inner cross-validation is used for parameter selection in the outer cross-validation. This procedure avoided overfitting of the parameters to the training data. In the subject-independent case, we performed leave-one-subject-out cross-validation, resulting in a training set of 660 trials and a test set of 60 trials per fold.
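The fold sizes quoted above follow directly from the nesting, as the following index-only sketch shows. It assumes contiguous, unshuffled inner folds, which the text does not specify; the generator name is hypothetical.

```python
import numpy as np


def nested_splits(n_trials=60, n_inner=10):
    """Yield (test_idx, inner_folds) for leave-one-trial-out evaluation.

    inner_folds is a list of (train_indices, eval_indices) pairs built
    from the outer training set (all trials except the test trial).
    """
    indices = np.arange(n_trials)
    for test_idx in indices:
        outer_train = indices[indices != test_idx]
        # 59 trials split into 10 parts: nine of size 6 and one of size 5.
        chunks = np.array_split(outer_train, n_inner)
        inner = [(np.setdiff1d(outer_train, c), c) for c in chunks]
        yield test_idx, inner


splits = list(nested_splits())
test_idx, inner = splits[0]
eval_sizes = [len(e) for _, e in inner]
train_sizes = [len(t) for t, _ in inner]
```

The held-out test trial never appears in any inner fold, so parameters selected in the inner loop cannot be tuned on the trial used for the reported accuracy.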

To evaluate those classifiers for the discrimination and detection of modality-specific processing, we define a number of binary classification conditions. **Table 1** lists all defined classification conditions with the corresponding classes. All classification conditions are evaluated in a cross-validation scheme as described above. For each condition, we investigate both a subject-dependent classifier and a subject-independent classifier setup. As evaluation metric, we look at classification accuracy. Furthermore, we compare the performance of the individual classifiers (which only use one type of feature) with the META classifier and analyze the contribution of the two types of signals (EEG and fNIRS) to the different classification conditions. Additionally, we analyze the generalizability of the different detectors for modality-specific activity (lines 1–3 in **Table 1**) by evaluating the classifiers on trials with and without other independent perceptual and cognitive activity. Finally, we look at the classification performance on continuous data. For this purpose, we evaluate a subset of the classification conditions on windows extracted from continuous recordings without alignment to a stimulus onset.

# **3. RESULTS**

**Table 1 | Binary classification conditions for evaluation.**

*For each condition, we list the class labels which define the corresponding classes.*

**Table 2 | Stimulus-locked classification accuracies (in %) for** *subject-dependent* **classification.**

*An asterisk in the* META *column indicates a significant improvement (α = 0.05) over the best corresponding individual feature type. Given in parentheses are standard errors of the mean. The last column indicates the p value of the statistical comparison of* META *and the best single-feature classifier. Highest classification accuracy for each condition is given in bold font.*

**Table 2** summarizes the recognition accuracy for all different conditions for the subject-dependent evaluation. The first entry is a discriminative task in which the classifier learns to separate visual and auditory perceptual activity. We see that for all four individual classifiers, a reliable classification is possible, although EEG-based features perform much better (HbO: 79.4% vs. POW: 93.6%). The fusion of all four classifiers, META, yields the best performance, significantly better (paired, one-sided *t*-test, *α* = 0.05 with Bonferroni-Holm correction for multiple comparisons) than the best individual classifier by a difference of 4.2% absolute. This is in line with the results of the meta-analysis by D'Mello and Kory (2012), who found modest, but consistent improvements by combining different modalities for the classification of inner states. **Figure 7** shows a detailed breakdown of recognition results across all subjects for the example of AUD vs. VIS. We see that for every subject, recognition performance for every feature type was above the trivial classification accuracy of 50% and the performance of META was above 80% for all subjects.

In the next step, we evaluated subject-independent classification on the same conditions. The results are presented in **Table 3**. Averaged across all conditions, classification accuracy degrades by 6.5% compared to the subject-dependent results, resulting from higher variance caused by individual differences. Still, we managed to achieve robust results for all conditions, i.e., subject-independent discrimination of visual and auditory processes is feasible. We therefore decided to report subsequent analyses for the subject-independent systems as those are much preferable from an HCI perspective.

The AUD vs. VIS condition denotes a discrimination task, i.e., it classifies a given stimulus as either auditory or visual. However, for an HCI application, those two processing modes are not mutually exclusive as auditory and visual perception can occur in parallel and can also be both absent in idle situations. We therefore need to define conditions which train a detector for specific perceptual activity, independently of the presence or absence of the other modality. Our first approach toward such a detector for auditory or visual perceptual activity is to define the AUD vs. IDLE and the VIS vs. IDLE conditions. A classifier trained on these conditions should be able to identify neural activity induced by the specific perceptual modality. In **Tables 2**, **3**, we see that those conditions can be classified with high accuracy of 95.6% and 96.4% (subject-dependent), respectively. To test whether this neural activity can still be detected in the presence of other perceptual processes, we evaluate the classifiers trained on those conditions also on MIX trials. We would expect a perfect classifier to classify each of those MIX trials as VIS for the visual detector and AUD for the auditory detector. The top two rows of **Table 4** summarize the results and show that the classifier still correctly detects the modality it is trained for in most cases.

A problem of those conditions is that it is not clear that a detector trained on them has actually detected specific visual or auditory activities. Instead, it may be the case that it has detected general cognitive activity which was present in both the AUD and VIS trials, but not in the IDLE trials. To analyze this possibility, we evaluated the classifier of the AUD vs. IDLE condition on VIS trials (and accordingly for VIS vs. IDLE evaluated on AUD). We present the results in the bottom two rows of **Table 4**. Both classifiers were very inconsistent in their results and "detected" modality-specific activity in nearly half of the trials, which actually did not contain such activity.

**Table 3 | Stimulus-locked classification accuracies (in %) for** *subject-independent* **classification.**

*An asterisk in the* META *column indicates a significant improvement (α = 0.05) over the best corresponding individual feature type. Standard errors of the mean are given in parentheses. The last column indicates the p-value of the statistical comparison of* META *and the best single-feature classifier. The highest classification accuracy for each condition is given in bold font.*

**Table 4 | Subject-independent classification accuracy of classifiers (in %) for AUD vs. IDLE and VIS vs. IDLE, evaluated on different trials from outside the respective training set.**

**Table 5 | Subject-independent correct classification rate (in %) and confusion matrix for the allAUD vs. nonAUD and the allVIS vs. nonVIS conditions, broken down by original labels.**

To train a classifier which is more sensitive to the modality-specific neural characteristics, we needed to include non-IDLE trials in the training data as negative examples. For this purpose, we defined the condition allAUD vs. nonAUD, where the allAUD class was defined as allAUD = {AUD, MIX} and the nonAUD class was defined as nonAUD = {IDLE, VIS}. Now, allAUD contains all data with auditory processing, while nonAUD contains all data without, but potentially with other perceptual activity. The condition allVIS vs. nonVIS was defined analogously. **Tables 2**, **3** document that a detector trained on these conditions was able to achieve a high classification accuracy. This result shows that the new detectors did not only learn to separate general activity from a resting state (as did the detectors defined earlier). If that had been the case, we would have seen a classification accuracy of 75% or less: For example, if we make this assumption in the allVIS vs. nonVIS condition, we would expect 100% accuracy for the VIS, MIX, and IDLE trials, and 0% accuracy for the AUD trials, which would be incorrectly classified as they contain general activity but none which is specific to visual processing. This baseline of 75% is outperformed by our detection classifiers. This result indicates that we were indeed able to detect specific perceptual activity, even in the presence of other perceptual processes. For additional evidence, we look at how often the original labels (AUD, VIS, IDLE, MIX) were classified correctly in the two new detection setups by the META classifier. The results are summarized in **Table 5** as a confusion matrix. We see that all classes are correctly classified in more than 75% of all cases, indicating that we detected the modality-specific characteristics in contrast to general cognitive activity.
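The 75% baseline in the argument above follows from simple arithmetic, sketched here for the allVIS vs. nonVIS condition. The sketch assumes that the four trial types are equally frequent, which is our assumption for illustration.

```python
# Expected per-label accuracy of a hypothetical detector that only separates
# "any activity" from rest, evaluated in the allVIS vs. nonVIS condition.
# VIS and MIX (allVIS) contain activity, so they are labeled correctly.
# IDLE (nonVIS) contains no activity, so it is labeled correctly.
# AUD (nonVIS) contains activity, so it is wrongly labeled as allVIS.
per_label_accuracy = {"VIS": 1.0, "MIX": 1.0, "IDLE": 1.0, "AUD": 0.0}

# Assumption: all four trial types occur equally often.
baseline = sum(per_label_accuracy.values()) / len(per_label_accuracy)
print(f"{baseline:.0%}")  # 75%
```

Any detector that exceeds this value must classify at least some of the AUD trials correctly, which a pure activity-versus-rest detector cannot do.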

The results we presented in **Tables 2**, **3** indicate that fusion was useful to achieve a high recognition accuracy. Still, there was a remarkable difference between the results achieved by the classifiers using fNIRS features and by classifiers using EEG features. This was true across all investigated conditions and for both subject-dependent and subject-independent classification. We suspect that the advantage of the META classifier was mostly due to the combination of the two EEG-based classifiers. In **Figure 8**, we investigated this question by comparing two fusion classifiers, EEG-META and fNIRS-META, which combined only the two EEG features or the two fNIRS features, respectively. The results show that for the majority of the conditions, the EEG-META classifier performed as well as or even better than the overall META classifier. However, the fNIRS features contributed significantly to the classification accuracy for both conditions AUD vs. IDLE and VIS vs. IDLE (*p* = 0.003 and *p* = 0.01, respectively, for the difference of EEG-META and META in the subject-dependent case).

To exclude the possibility that the difference was due to the specific fNIRS feature under-performing in this evaluation, we repeated the analysis with other established fNIRS features (average amplitude, value of largest amplitude increase or decrease). The analysis showed that we could not achieve improvements over the original feature by exchanging the fNIRS feature calculation. We conclude that the difference in accuracy was not caused by decisions during feature extraction. Overall, we see that fNIRS-based features were outperformed by the combination of EEG-based features for most of the investigated conditions but that they could still contribute to a high classification accuracy in some of the cases.

There are, however, some caveats to the dominance of EEG features. First, the ERP classifier is the only one of the four feature types which is fundamentally dependent on temporal alignment to the stimulus onset and is therefore not suited for many applications of continuous classification. While the employed fNIRS features also use information on the stimulus onset (as they essentially characterize the slope of the signal), only the ERP features rely on specific oscillatory properties in a range of milliseconds (compare **Figures 5** and **4**), which cannot be extracted reliably without stimulus locking. Second, concerning the POW classifier, we see in **Figure 6** a large difference in alpha power between VIS and AUD. As both types of trials induce cognitive activity, we did not expect the AUD trials to exhibit alpha power (i.e., idling rhythm) nearly at an IDLE level. We cannot completely rule out that this effect is caused at least in part by the experimental design (e.g., because visual stimuli and auditory stimuli differed in complexity) or subject selection (e.g., all subjects were familiar with similar recording setups and therefore easily relaxed). Therefore, we need to verify that the discrimination ability of the POW classifier does not solely depend on differences in alpha power. For that purpose, we repeated the evaluation of AUD vs. VIS with different sets of band-pass filters, some of which excluded the alpha band completely. Results are summarized in **Figure 9**. We see that, as expected, feature sets including the alpha band performed best. Accuracy dropped by at most 9.4% relative when the alpha band was removed (for the subject-dependent evaluation, from 1–40 Hz to 13–40 Hz). This indicates that the upper frequency bands still contain useful discriminating information.

The previous analysis showed that different features contributed to different degrees to the classification result. Therefore, we were interested in studying which features were stable predictors of the ground truth labels on a single trial basis. The successful person-independent classification was already an indication that such stable, generalizable features exist. To investigate which features contributed to the detection of different modalities, we calculated the correlation of each feature with the ground truth labels for the conditions VIS vs. IDLE and AUD vs. IDLE.

For the POW features, we ranked the electrodes by their highest absolute correlation across the whole frequency range for each subject. To see which features predicted the ground truth well across all subjects, we averaged those ranks. The resulting average rankings are presented in the first two columns of **Table 6**. We note that for the VIS vs. IDLE condition, electrodes at the occipital cortex were most strongly correlated with the ground truth. In contrast, for the AUD vs. IDLE condition, those electrodes can be found at the bottom of the ranking. For this condition, the highest ranking electrodes were at the central-midline (it was expected that electrodes above the auditory cortex would not contribute strongly to the AUD vs. IDLE condition, as activity in the auditory cortex cannot be captured well by EEG). The low *SD* also indicates that the derived rankings are stable across subjects. We can therefore conclude that the POW features were generalizable and neurologically plausible.
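The ranking procedure described above can be sketched as follows. This is a minimal illustration under our own assumptions: Pearson correlation is used as the correlation measure, binary labels are coded as 0 and 1, and the electrode names and feature values are made up.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_electrodes(features, labels):
    """features maps electrode -> list of per-frequency feature vectors,
    each with one value per trial. Rank 1 is the electrode whose best
    frequency bin has the highest absolute correlation with the labels."""
    best = {
        name: max(abs(pearson(f, labels)) for f in bins)
        for name, bins in features.items()
    }
    ordered = sorted(best, key=best.get, reverse=True)
    return {name: r for r, name in enumerate(ordered, start=1)}


def average_ranks(per_subject_ranks):
    """Average the rank of every electrode across subjects."""
    electrodes = per_subject_ranks[0].keys()
    return {e: mean(r[e] for r in per_subject_ranks) for e in electrodes}


labels = [0, 0, 1, 1]  # e.g., IDLE = 0, VIS = 1 (hypothetical coding)
subject_a = {"Oz": [[0.1, 0.2, 0.9, 1.0]], "Cz": [[0.5, 0.4, 0.6, 0.5]]}
subject_b = {"Oz": [[0.0, 0.1, 0.8, 0.7]], "Cz": [[0.3, 0.3, 0.3, 0.4]]}
ranks = [rank_electrodes(s, labels) for s in (subject_a, subject_b)]
print(average_ranks(ranks))
```

The standard deviation of the per-subject ranks, computed in the same loop, would give the stability measure referred to in the text.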

**Table 6 | Average rankings of electrode positions derived from correlation of POW and ERP features to ground truth labels.**


We then ranked the frequency band features by their highest absolute correlation across the whole electrode set for each subject and averaged those ranks across subjects. We observed the highest average ranks at 9.5 Hz and at 18.5 Hz. Especially for the first peak in the alpha band, we observed a low *SD* of 6.2, which indicates that those features were stable across subjects.

For the ERP features, we repeated this analysis (with time windows in place of frequency bands). The two rightmost columns of **Table 6** show a similar picture as for the POW features regarding the contribution of individual electrodes: Features from electrodes at the occipital cortex were highly discriminative in the VIS vs. IDLE condition, features from central-midline electrodes carried most information in the AUD vs. IDLE condition. Regarding time windows, we observe the best rank for the window starting at 312 ms, which corresponds well to the expected P300 component following a stimulus onset. With a *SD* of 2.9, this feature was also ranked highly across all subjects.

To investigate the reliability of the derived rankings, we conducted Friedman tests on the rankings of all participants. Those showed that all investigated rankings (with one exception) yielded a significant difference in average ranks of the items. The resulting *p*-values are given in **Table 7**. This indicates that the rankings actually represent a reliable, person-independent ordering of features.
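The Friedman test statistic for such rankings can be sketched as below. The sketch assumes complete rankings without ties and uses made-up rank tables; the function name is hypothetical. A p-value would be obtained by comparing the statistic with a chi-square distribution with k − 1 degrees of freedom.

```python
def friedman_statistic(rank_table):
    """rank_table[i][j] is the rank (1..k, no ties) that subject i
    assigned to item j. Returns the Friedman chi-square statistic."""
    n = len(rank_table)       # number of subjects
    k = len(rank_table[0])    # number of ranked items
    rank_sums = [sum(row[j] for row in rank_table) for j in range(k)]
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 * sum_sq / (n * k * (k + 1)) - 3.0 * n * (k + 1)


# Three hypothetical subjects who rank three items identically.
agreeing = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
# Three hypothetical subjects whose rankings cancel out completely.
disagreeing = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
print(friedman_statistic(agreeing), friedman_statistic(disagreeing))
```

Perfect agreement yields the maximum value n(k − 1), whereas rankings that cancel out yield zero, so a large statistic supports a consistent ordering across subjects.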



The analysis for fNIRS features differed from the EEG feature analysis because of the signal characteristics. For example, the fNIRS channels were spatially very close to each other and highly correlated. Therefore, we did not look at features from single fNIRS channels. Instead, we differentiated between the different probes. For the VIS vs. IDLE condition, the channel which yielded the highest absolute correlation was located above the visual cortex for 75% of all subjects (averaged across both HbO and HbR). For the AUD vs. IDLE condition, the channel with the highest absolute correlation was located above the auditory cortex for 91.6% of all subjects. This indicates that the fNIRS signals also yielded neurologically plausible features which generalized well across subjects. When comparing HbO and HbR features, the HbO features were correlated slightly more strongly with the ground truth (19.6% higher maximum correlation) than the HbR features, which corresponds to their higher classification accuracy.

The classification setups which we investigated up to this point are all defined on trials which are locked to the onset of a stimulus. The detection of onsets of perceptual activity is an important use case for HCI applications: The onset of a perceptual activity often marks a natural transition point to react to a change of user state. On the other hand, there are use cases where the detection of ongoing perceptual activity is relevant. To investigate how the implemented classifiers perform on continuous stimulus presentation, we evaluated classification and detection on the three continuous segments (60 s each of AUD, VIS, and MIX) which were recorded in the first block for each subject. As data is sparse for those segments, we only regard the subject-independent approach. To extract trials, the data was segmented into windows of a certain length (overlapping by 50%). We evaluated the impact of the window size on the classification accuracy: For window sizes of 1, 2, 4, 8, and 16 s, we end up with 120, 60, 30, 15, and 8 windows per subject and class, respectively. Those trials are not aligned to a stimulus onset. We used the same procedure to extract POW features as for the onset-locked case. The ERP feature was the basis of the best non-fusion classifier but is limited to detecting stimulus onsets. Therefore, we excluded it from the analysis to investigate the performance of the remaining classifiers. For both feature types based on fNIRS, we modified the feature extraction to calculate the mean of the window, normalized by the mean of the already elapsed data. The other aspects of the classifier were left unchanged.
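The segmentation into 50% overlapping windows can be sketched as follows. The function name is hypothetical, and how partial windows at the end of a segment were handled is not stated in the text. The reported counts (120, 60, 30, 15, and 8) coincide with the segment duration divided by the step size, rounded up, whereas counting only windows that fit completely gives slightly smaller numbers.

```python
import math


def window_starts(duration_s, window_s, overlap=0.5):
    """Start times (in s) of all windows that fit completely
    into a segment of length duration_s."""
    step = window_s * (1.0 - overlap)
    n_full = int((duration_s - window_s) / step) + 1
    return [i * step for i in range(n_full)]


sizes = (1, 2, 4, 8, 16)
# Strict count: only windows that lie entirely inside the 60 s segment.
full_counts = [len(window_starts(60, w)) for w in sizes]
# Duration divided by step size, rounded up (matches the reported counts).
ratio_counts = [math.ceil(60 / (w * 0.5)) for w in sizes]
print(full_counts)
print(ratio_counts)
```

In either reading, doubling the window size roughly halves the number of available trials, which is the trade-off discussed in the results.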

**Figures 10**, **11** summarize the results of continuous evaluation. The results are mostly consistent with our expectations and the previous results on stimulus-locked data. For all three regarded classification conditions, we achieve an accuracy of more than 75% for META, i.e., reliable classification does not solely depend on low-level bottom-up processes at the stimulus onset. Up to the threshold of 16 s, there was a benefit of using larger windows for feature calculation. Note that with growing window size, the number of trials for classification drops, which also has an impact on the confidence interval for the random baseline (Mueller-Putz et al., 2008). The upper limit of the 1% confidence interval is 52.4% for a window size of 1 s, 53.4% for 2 s, 54.9% for 4 s, 56.9% for 8 s, and 59.5% for 16 s. This should be kept in mind when interpreting the results, especially for larger window sizes. The EEG feature yields a better classification accuracy than the two fNIRS-based classifiers in two of the three cases. For the allAUD vs. nonAUD condition, however, the POW classifier does not exceed the random baseline and only the two fNIRS-based classifiers can achieve satisfactory results. Therefore, we see that when ERP features are missing in the continuous case, the fNIRS features can substantially contribute to classification accuracy in the case of allAUD vs. nonAUD.
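The dependence of the chance-level confidence limit on the number of trials can be sketched with a normal approximation to the binomial distribution. This is an illustrative assumption and not necessarily the exact interval used for the values above; the trial counts are hypothetical, and reading the "1% confidence interval" as a significance level of 0.01 is our interpretation.

```python
from statistics import NormalDist


def chance_upper_limit(n_trials, p_chance=0.5, alpha=0.01):
    """Upper limit of the two-sided (1 - alpha) normal-approximation
    interval around chance-level accuracy for n_trials independent trials."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard_error = (p_chance * (1.0 - p_chance) / n_trials) ** 0.5
    return p_chance + z * standard_error


for n in (200, 800, 3200):  # hypothetical trial counts
    print(n, round(100 * chance_upper_limit(n), 1))
```

The limit shrinks toward 50% with the square root of the trial count, which is why the longer windows, with fewer trials, require a higher accuracy before a result can be distinguished from chance.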

# **4. DISCUSSION**

The results from the previous section indicate that both the discrimination and the detection of modality-specific perceptual processes in the brain are feasible with high recognition accuracy, in a subject-dependent as well as a subject-independent setup. We see that the fusion of multiple features from different signal types led to a significant improvement in recognition accuracy. However, in general, fNIRS-based features were outperformed by features based on the EEG signal. In the future, we will look more closely into other reasons for this gap and potential remedies for it. One difference between fNIRS and EEG signals in our setup is the lack of advanced artifact removal techniques for fNIRS, which have been applied with some success in other research on fNIRS BCIs (Molavi and Dumont, 2012). Another difference is that the coverage of fNIRS optodes was limited mainly to the sensory areas, but our EEG measures may include robust effects generated from other brain regions, such as the frontal-parietal network. Activities in these regions may reflect higher cognitive processes triggered by the different modalities, rather than purely perceptual ones. It may be worthwhile to extend the fNIRS setup to include those regions as well. Still, we already saw that fNIRS features can contribute significantly to certain classification tasks. While evaluation on stimulus-locked data allows a very controlled evaluation process and is supported by the very high accuracy we can achieve, this condition is not very realistic for most HCI applications. In many cases, stimuli will continue over longer periods of time. Features like the ERP feature explicitly model the onset of a perceptual process but will not provide useful information for ongoing processes. In future work, we will investigate such continuous classification on the longer, continuous data segments of the recorded corpus.

Following the general guidelines of Fairclough (2009), one limitation in validity of the present study is the fact that there may be other confounding variables that can explain the differences in the observed neurological responses to the stimuli of different modalities. Subjects were following the same task for all types of stimuli; still, factors like different memory load or increased need for attention management due to multiple parallel stimuli for visual trials may contribute to the separability of the classes. We address this partially by identifying the expected effects, for example in **Figure 4**, comparing fNIRS signals from visual and auditory cortex. Also, the fact that detection of both visual and auditory processing worked on MIX trials shows that the learned patterns were not only present in the dedicated data segments but were to some extent generalizable. Still, we require additional experiments with different tasks and other conditions to reveal whether it is possible to train a fully generalizable detector and discriminator for perceptual processes. Finally, we also have to look into a more granular model with a higher sensitivity than the presented dichotomous characterization of perceptual workload.

The evaluation was performed in a laboratory setting but with natural and complex stimulus material. The results indicate that such a system is robust enough to be used for the improvement of an HCI system in a realistic scenario. We saw that both EEG and fNIRS contributed to a high classification accuracy; in most cases, the results for the EEG-based classifiers were more accurate than for the fNIRS-based ones. Whether the additional effort which is required to apply and evaluate a hybrid BCI (compared to a BCI with only one signal type) is justified depends on the specific application. When only one specific classification condition is relevant (e.g., to detect processing of visual stimuli), there is always a single optimal signal type which is sufficient to achieve robust classification. The benefit of a hybrid system is that it can potentially cover multiple different situations for which no generally superior signal type exists. Another aspect for the applicability of the presented system for BCI is the response latency, which also depends on the choice of employed features. The ERP features react very rapidly but are limited to situations in which a stimulus onset is present. Such short response latency (less than 1 s) may be useful when an HCI system needs to immediately switch communication channels or interrupt communication to avoid perceptual overload of the user (for example, when the user unexpectedly engages in a secondary task besides communicating with the HCI system). In such situations, the limitation to onsets is also not problematic. On the other hand, if the system needs to assume that the user is already engaged in a secondary task when it starts to observe him or her (i.e., to determine the initial communication channel at the beginning of a session), it is no longer sufficient to respond only to stimulus onsets.
For those cases, it may be worthwhile to accept the latency required by the fNIRS features and also the POW feature for a classification of continuous perceptual activity.

We conclude that we demonstrated the first passive hybrid BCI for the discrimination and detection of perceptual activity. We showed that robust classification is possible both in a subject-dependent and a subject-independent fashion. While the EEG features outperformed the fNIRS features for most parts of the evaluation, the fusion of multiple signals and features was beneficial and increased the versatility of the BCI.

# **ACKNOWLEDGMENTS**

This work was supported by the IGEL (Informatik-GrEnzenLos) scholarship of the KIT. We are also thankful to Prof. Trevor Penney of the National University of Singapore for providing access to the fNIRS system used in this project as well as the staff and students from Temasek Lab at NUS (Kian Wong, Tania Kong, Xiao Qin Cheng, and Xiaowei Zhou) for their intensive support throughout the entire data collection. We acknowledge support by Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of Karlsruhe Institute of Technology.

# **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 January 2014; accepted: 29 October 2014; published online: 18 November 2014.*

*Citation: Putze F, Hesslinger S, Tse C-Y, Huang Y, Herff C, Guan C and Schultz T (2014) Hybrid fNIRS-EEG based classification of auditory and visual perception processes. Front. Neurosci. 8:373. doi: 10.3389/fnins.2014.00373*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Putze, Hesslinger, Tse, Huang, Herff, Guan and Schultz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# EEG sensorimotor correlates of translating sounds into actions

# *Jaime A. Pineda<sup>1,2</sup>\*, Mark Grichanik<sup>1</sup>, Vanessa Williams<sup>1</sup>, Michelle Trieu<sup>1</sup>, Hailey Chang<sup>1</sup> and Christian Keysers<sup>3,4</sup>*

*<sup>1</sup> Department of Cognitive Science, University of California, San Diego, La Jolla, CA, USA*

*<sup>2</sup> Neurosciences Group, University of California, San Diego, La Jolla, CA, USA*

*<sup>3</sup> Netherlands Institute for Neuroscience, KNAW, Amsterdam, Netherlands*

*<sup>4</sup> Department of Neuroscience, University Medical Center Groningen, University of Groningen, Netherlands*

#### *Edited by:*

*Jan B. F. Van Erp, TNO, Netherlands*

*Reviewed by: Dennis J. McFarland, Wadsworth Center for Laboratories and Research, USA Reinhold Scherer, Graz University of Technology, Austria*

#### *\*Correspondence:*

*Jaime A. Pineda, Department of Cognitive Science, University of California, 9500 Gilman Drive, San Diego, La Jolla, CA 92093-0515, USA*

*e-mail: pineda@cogsci.ucsd.edu*

Understanding the actions of others is a necessary foundational cornerstone for effective and affective social interactions. Such understanding may result from a mapping of observed actions as well as heard sounds onto one's own motor representations of those events. To examine the electrophysiological basis of action-related sounds, EEG data were collected in two studies from adults who were exposed to auditory events in one of three categories: action (either hand- or mouth-based sounds), non-action (environmental sounds), and control sounds (scrambled versions of action sounds). In both studies, triplets of sounds of the same category were typically presented, although occasionally, to ensure an attentive state, trials containing a sound from a different category were presented within the triplet and participants were asked to respond to this oddball event either covertly in one study or overtly in another. Additionally, participants in both studies were asked to mimic hand- and mouth-based motor actions associated with the sounds (motor task). Action sounds elicited larger EEG mu rhythm (8–13 Hz) suppression, relative to control sounds, primarily over left hemisphere, while non-action sounds showed larger mu suppression primarily over right hemisphere. Furthermore, hand-based sounds elicited greater mu suppression over the hand area in sensorimotor cortex compared to mouth-based sounds. These patterns of mu suppression across cortical regions to different categories of sounds and to effector-specific sounds suggest differential engagement of a mirroring system in the human brain when processing sounds.

#### **Keywords: auditory mirror neuron system, action comprehension, mu rhythm, sensorimotor cortex, mu suppression**

# **INTRODUCTION**

The discovery of motor neurons in the primate premotor cortex that also exhibit visual "mirroring" properties has spurred a significant amount of research into how we understand the actions of others both within and across species having similar biological effectors (di Pellegrino et al., 1992; Rizzolatti and Craighero, 2004; Iacoboni and Dapretto, 2006). Such visuomotor neurons fire both when a monkey performs a motor action and when it observes another conspecific or human agent perform a similar goal-directed action (Rizzolatti and Craighero, 2004; Iacoboni and Dapretto, 2006; Rizzolatti and Sinigaglia, 2010). Evidence for the existence of a human mirror neuron system (MNS) has been obtained through a variety of indirect population-level measures (Iacoboni et al., 1999; Fadiga et al., 2005) including transcranial magnetic stimulation (TMS) (Fadiga et al., 1999; Maeda et al., 2002), positron emission tomography (PET) (Parsons et al., 1995), functional magnetic resonance imaging (fMRI) (Grezes et al., 2003; Buccino et al., 2004; Iacoboni and Dapretto, 2006), and electroencephalography (EEG) (Cochin et al., 1998, 1999; Pineda et al., 2000; Muthukumaraswamy and Johnson, 2004; Muthukumaraswamy and Singh, 2008; Oberman et al., 2008), and is thought to include premotor cortices (dorsal and ventral) and the inferior parietal cortex in which mirror neurons have also been measured in monkeys (Keysers and Gazzola, 2009; Caspers et al., 2010; Rizzolatti and Sinigaglia, 2010), with more recent evidence also pointing to an important role for the somatosensory cortex in the MNS (Keysers et al., 2010; Caspers et al., 2011). 
While no overall consensus exists as to the role of mirror neurons in social cognition (Hickok, 2009), one prominent hypothesis suggests that the observer's ability to embody the observed action as his or her own provides a neural scaffolding that facilitates behaviors and cognitive outcomes involved in social cognition, such as understanding actions, imitation, speech and language, theory of mind, social communication, and empathy (Rizzolatti and Craighero, 2004; Ferrari et al., 2009).

In the monkey premotor cortex, mirror neurons were also found to be sensitive to the acoustic correlates of actions, and the corresponding action sound by itself is sufficient to activate these premotor cells (Kohler et al., 2002; Keysers et al., 2003). Similarly, the premotor, posterior parietal and somatosensory cortices of humans show voxels that are active both while performing an action and listening to a similar action (Gazzola et al., 2006), and this activity is somatotopically organized, with more dorsal aspects of the premotor and parietal cortex more active during the execution and sound of hand actions, and more ventral aspects more active during the execution and sound of mouth actions. This somatotopical pattern allows classification of what action someone has performed by using the activity pattern while listening to actions (Etzel et al., 2008). Thus, by extension, the human MNS would appear to be multimodal, i.e., activated by motor, visual, auditory, as well as perhaps other sensory inputs associated with the action (Aglioti and Pazzaglia, 2010). Relatively few studies, however, have investigated the auditory properties of the human MNS, although it has been argued that nonaction (environmental) and action-related sounds (those that are reproducible by the body) are likely processed by separate neural systems (Pizzamiglio et al., 2005). Non-action related sounds appear to involve the temporal poles, while action-related sounds appear to involve the same neural machinery as the visual MNS. Because increasing the exposure and proficiency with a given physical action provides for greater activation of auditory mirroring circuits (Ricciardi et al., 2009) and a perceptual-motor link that is rapidly established (Lahav et al., 2007), it has been proposed that these associations result from the Hebbian association between motor programs and what they sound like while we perform an action. 
Re-afference, input that results from the agent's movement, ensures that premotor neurons that cause the action will have firing that is temporally correlated with that of the auditory neurons that represent the re-afferent sound of the action (Keysers and Perrett, 2004). Perhaps one main difference between auditory and visual mirroring is that while visual aspects of mirroring appear to involve bilateral activity in left and right hemispheres, auditory aspects of mirroring have been reported to be primarily left-lateralized, particularly in the parietal cortex, including the posterior parietal and somatosensory cortex (Gazzola et al., 2006; Lahav et al., 2007). This is not true for environmental sounds, which do trigger robust right hemispheric activations as well (Gazzola et al., 2006). The left lateralization may reflect semantic associations of the sounds that have been previously established.

Mirroring activity cannot be directly recorded in humans except under special circumstances, (see Mukamel et al., 2010), but a number of recent studies have suggested that mirroring may be indirectly measured in the mu frequency band of the EEG (alpha: 8–13 Hz and beta: 15–25 Hz recorded over sensorimotor cortex) (Hari et al., 2000; Muthukumaraswamy et al., 2004; Oberman et al., 2005; Pineda, 2005). Sensorimotor neurons fire synchronously at rest, leading to high-amplitude mu oscillations, and asynchronously during self-movement and the observation of movement, leading to reduced amplitude of the mu band (called mu suppression or event related desynchronization-ERD) (Pineda, 2005). Mu suppression or ERD during action observation, in the absence of self-performed action, has been hypothesized to reflect the downstream modulation of sensorimotor neurons by premotor mirror neurons (Muthukumaraswamy et al., 2004; Oberman et al., 2005; Pineda, 2005). A recent combined EEG-fMRI study has explored the relationship between mu suppression (as measured through EEG) and blood oxygen level dependent (BOLD) activity in regions normally associated with the MNS (as measured using fMRI) during both action observation and action execution (Arnstein et al., 2011). The study found that mu suppression in the alpha band during both action observation and action execution went hand-in-hand with increases in BOLD activity in the dorsal premotor cortex, the inferior parietal lobe and the posterior aspects of the somatosensory cortex (BA2). A weaker association was also found between activity in the ventral premotor cortex and mu suppression. All of these regions are associated with the MNS and have strong cortico-cortico connections in the human and non-human brain with the primary sensorimotor region lining the central sulcus (BA4 and BA3) where the mu rhythm is thought to be generated (Shimazu et al., 2004). 
Therefore, it is plausible that activity in these MNS regions, during action observation and execution, could desynchronize activity around the central sulcus, and thereby cause the mu suppression in the EEG signal. Furthermore, results of several human mu suppression studies parallel primate single-cell recordings in terms of the object-directedness and sensitivity of the electrophysiology to action observation (Muthukumaraswamy et al., 2004). Consistent with this hypothesis, Keuken et al. (2011) recently showed that using TMS to disrupt activity in the inferior frontal gyrus directly impacts the modulation of mu rhythms over sensorimotor cortex.

To date, no investigations have examined the relationship between auditory aspects of mirroring and EEG mu rhythm suppression in humans in order to inform models of connectivity between these domains. One goal of this study was to test the hypothesis that action-related sounds are processed differently compared to non-action related sounds, as reflected in mu rhythm oscillations. We specifically assessed whether mu rhythm suppression reflects action or non-action related activity. In Studies 1 and 2, participants listened to sounds and, while blindfolded, performed actions that corresponded to those sounds. During the active listening portion of Study 1, participants made overt physical responses to oddball sounds. To assess responses to sounds alone, participants made covert responses in Study 2. In both studies, we predicted that representation of action-based sounds would elicit greater mu suppression, reflecting greater engagement of mirroring processes, compared to environmental sounds. That is, we expected that action-related sounds (those interpreted vis-a-vis the observer's own bodily representation) would cause greater mu suppression compared to non-action related sounds.

### **MATERIALS AND METHODS**

### **PARTICIPANTS**

Twenty-eight healthy undergraduate students, including one older student (12 males and 16 females of varied ethnicities; mean age = 20.3 ± 6.2 years; range = 17–47 years) attending the University of California, San Diego (UCSD) participated in Study 1. In Study 2, a different group of twenty-eight undergraduate students (13 males and 15 females; mean age = 20.4 ± 1.1 years; range = 18–23 years) participated. All participants were assumed to have normal hearing if they were able to identify the stimuli during an initial auditory identification task. Those who described themselves as left-handed on a self-report questionnaire were excluded from the study. All gave written informed consent prior to taking part and were compensated with course credit for their voluntary participation. The experiment was reviewed and approved by the UCSD Internal Review Board.

### **AUDITORY STIMULI**

The auditory stimulus set was obtained from one of the coauthors (C.K.) who previously used it in an fMRI study (Gazzola et al., 2006) that investigated the human auditory MNS. The stimulus set consisted of three categories of sounds (see **Table 1**): Action (mouth- and hand-based sounds), Non-Action (environmental sounds), and Control or "Fuzzy" (phase-scrambled versions of both mouth and hand control sounds). There were five unique sounds in each category (25 sounds in total, each presented for 4 s). The Action sounds (e.g., mouth: crunching candy and hand: ripping paper) were tangible sounds that could be easily reproduced by the listener. The Non-Action sounds comprised environmental sounds (e.g., howling wind or water dripping). Each control sound was based on one of the action sounds, and resulted from a reverse Fourier transform in which frequencies up to 125 Hz preserved their original phase and all frequencies above 125 Hz had their phase exchanged with that of another frequency. Accordingly, the control sounds were equivalent with respect to the bottom-up global frequency composition of the Action sounds, but were perceived as "fuzzy" because they were phase-scrambled and unrecognizable. For more information on how the sounds were developed see Gazzola et al. (2006).

**Table 1 | Auditory stimuli used for Studies 1 and 2.**
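The phase scrambling described for the control sounds can be illustrated with a minimal sketch. It is not the procedure used to build the original stimuli (see Gazzola et al., 2006, for that); it assumes that "exchanged with that of another frequency" can be approximated by a random permutation of phases among the bins above the cutoff, and it uses a naive discrete Fourier transform so that it runs without external libraries. The function names, the seed, and the toy noise input are all hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform (dependency-free, for short signals)."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
            for k in range(n)]


def idft(spec):
    """Inverse of dft(); returns complex samples."""
    n = len(spec)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spec)) / n
            for t in range(n)]


def phase_scramble(signal, fs, cutoff_hz=125.0, seed=0):
    """Keep phases up to cutoff_hz; permute phases among the bins above it."""
    n = len(signal)
    spec = dft(signal)
    last = (n - 1) // 2  # highest bin that has a distinct conjugate partner
    high = [k for k in range(1, last + 1) if k * fs / n > cutoff_hz]
    phases = [cmath.phase(spec[k]) for k in high]
    random.Random(seed).shuffle(phases)  # assumption: random exchange of phases
    out = list(spec)
    for k, ph in zip(high, phases):
        out[k] = cmath.rect(abs(spec[k]), ph)  # original magnitude, borrowed phase
        out[n - k] = out[k].conjugate()        # keep the spectrum Hermitian (real signal)
    return [c.real for c in idft(out)]


# Toy input: 0.2 s of noise sampled at 1 kHz (hypothetical values, not the study's audio).
rng = random.Random(1)
fs = 1000.0
original = [rng.uniform(-1.0, 1.0) for _ in range(200)]
control = phase_scramble(original, fs)
```

Because only the phases above the cutoff are rearranged, the magnitude spectrum of `control` matches that of `original`, which is the sense in which the control sounds share the global frequency composition of the Action sounds.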

#### **GENERAL EXPERIMENTAL PROCEDURE**

All participants took part in an auditory task first and a motor task second. The auditory task was always run before the motor task in order to avoid the possibility that the memory of executing the actions would bias perceptual brain activity. Participants initially completed a general screening questionnaire that enabled the experimenter to exclude individuals due to claustrophobia, prior experience, etc. Prior to EEG recording, participants took part in an Auditory Identification Task in which they were asked to listen attentively and identify the various auditory stimuli. Prior to the presentation of a sound, the experimenter announced the category to which the sound belonged (e.g., Hand Sound). If the participant still misidentified the sound after three attempts, the experimenter identified it for the participant. The sound was then played once more before moving on to the next sound to ensure that participants could now identify it. Participants were not asked to identify or discriminate control sounds. The experimenter played one of these control sounds and explained that they belonged in their own category of "fuzzy sounds" and did not need to be identified. Prior to the motor task, participants "practiced" performing the motor actions at least once following an experimenter's cue to make sure they followed the protocol.

#### **EEG RECORDING**

Participants were instructed not to consume caffeine or use any hair products the day of the experiment. During EEG capping, they were seated in a comfortable recliner chair inside an acoustically- and electromagnetically-shielded chamber. After light abrasion of EEG electrode sites using NuPrep Gel, disk electrodes were applied using 10–20 conductive paste on the orbital bone below the left eye and the mastoid bone behind both ears. The eye electrode was used to monitor eye blinks and horizontal eye movements. The mastoid electrodes were computationally linked and used as reference electrodes. Seventeen electrodes embedded in a cap were positioned using the International 10–20 system at the following sites: F7, F8, F3, F4, FZ, C3, CZ, C4, P3, PZ, P4, T3, T4, T5, T6, O1, O2. Electrolytic gel was injected at each electrode site and the scalp lightly abraded with a thin wooden dowel to reduce the electrode-skin impedance to below 10 kΩ. The EEG data were recorded with a Neuroscan Synamps system (500 Hz sampling rate, 0.30–30 Hz bandpass filter). Participants were blindfolded and asked to keep their eyes closed during the various task conditions. A video monitoring system was used to ensure that participants remained still and that their hands were in resting position during EEG data collection.

#### **AUDITORY TASK**

#### *Study 1*

A trial consisting of a sequence of three pseudo-randomly selected sounds from the same category (e.g., Hand sound— Hand sound—Hand sound) was presented at an inter-stimulus interval (ISI) of 1 s. No individual sound appeared twice within the same trial. Sixty trials (twelve from each sound category) were randomized and presented (inter-trial interval or ITI of 3 s) through circumaural headphones (Sony MDR XD 100) using Presentation software (NeuroBehavioral Systems). An additional five randomly interspersed sequences contained "oddball" events. These sequences contained a final sound that was out-of-category (e.g., Mouth sound—Mouth sound—Hand sound). Participants were instructed to click a computer mouse placed under their right hand when detecting such an oddball event. This task was necessary to ensure that participants were *actively* listening to all the stimuli. Control sounds were presented as triplets but were not used as oddball events. While analyzed for accuracy as a behavioral task, these oddball event trials were excluded from further analysis in order to avoid motor contamination of the EEG data. After the task, participants were given the opportunity to take a short break prior to the start of the motor task.
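The trial structure of the Study 1 auditory task can be sketched as follows. This is an illustrative reconstruction, not the Presentation script used in the experiment: the sound names are placeholders, the five categories (hand, mouth, environmental, and the two control sets) are inferred from the 60 trials at twelve per category, and it is assumed that oddball sequences draw both their standard and deviant sounds from the non-control categories.

```python
import random

N_SOUNDS = 5  # unique sounds per category
CATEGORIES = ["hand", "mouth", "environmental", "control_hand", "control_mouth"]
ODDBALL_CATEGORIES = ["hand", "mouth", "environmental"]  # controls never serve as oddballs


def sounds(category):
    """Placeholder sound identifiers, e.g. 'hand_1' ... 'hand_5'."""
    return [f"{category}_{i}" for i in range(1, N_SOUNDS + 1)]


def make_trials(rng, per_category=12, n_oddballs=5):
    trials = []
    # Standard trials: three different sounds drawn from one category.
    for category in CATEGORIES:
        for _ in range(per_category):
            trials.append({"sounds": rng.sample(sounds(category), 3), "oddball": False})
    # Oddball trials: the final sound comes from a different category.
    for _ in range(n_oddballs):
        standard, deviant = rng.sample(ODDBALL_CATEGORIES, 2)
        first_two = rng.sample(sounds(standard), 2)
        trials.append({"sounds": first_two + [rng.choice(sounds(deviant))], "oddball": True})
    rng.shuffle(trials)  # intersperse oddballs among the standard trials
    return trials


trials = make_trials(random.Random(42))
print(len(trials))  # 65 trials: 60 standard + 5 oddball
```

Sampling without replacement inside each triplet enforces the constraint that no individual sound appears twice within the same trial.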

#### *Study 2 modifications to auditory task*

In the second study, the ISI was increased to five seconds and the ITI to seven seconds to increase the amount of time participants had to process the individual sound stimuli. The type of response to oddball events also differed, with covert responses in Study 2 to avoid the effects of motor actions. As in Study 1, sequences with "oddball" events were randomly interspersed between the normal trials. However, instead of actively using a mouse to indicate the oddball events, participants were asked to covertly count the total number of oddballs presented throughout the auditory task. At the end of the block of trials, subjects were asked for the total number of oddball events counted. This task was necessary to ensure that participants were *actively* listening to all the stimuli. Control sounds were not used as oddball events. While analyzed for accuracy as a behavioral task, these oddball event trials were excluded from further analysis in order to avoid any type of motor contamination of the EEG data. After the task, participants were given the opportunity to take a short break prior to the start of the motor task.

#### **MOTOR TASK**

In both Studies 1 and 2, participants were also asked to execute four actions associated with the sounds they heard. One was "zipping a zipper" (hand action) in which participants were provided a zippered jacket to put on prior to the start of the motor task. When cued, they were to move their hands from resting position on the armchair, zip the jacket up and down four times and return their hands to resting position. A second action, "ripping paper" (hand action) involved cueing the participant to move both hands from resting position on the armchair to the center where an experimenter would hand them a paper towel to rip. Participants would rip the paper towel three times, drop the scraps on the floor, and move their hands back to resting position. A third action, "sipping from a straw" (mouth action) involved cueing the participant by placing the straw to their lips and asking them to pretend drinking through the straw three times. The final action was "kissing" (mouth action) in which subjects were cued to make three kissing motions (i.e., purse and release their lips, thus making a smooching sound). During these actions, participants were blindfolded to prevent them from seeing their own actions, and circumaural headphones delivered white noise loud enough to prevent the participants from hearing the sound effects of their own actions. Experimenters used haptic cues [e.g., touching of hand] to prompt the participant to begin each action (the cues and actions were explained to the participant and practiced prior to EEG recording). Each of the four actions was performed twice, for a total of eight motor actions within a single block of trials. The order of the actions was determined through a random number generator, and was unique to each participant. In addition, the block of trials was repeated eight times for a total of 64 motor actions per subject. 
One experimenter delivered the haptic cues while a second confederate used a keyboard to send pulses to the EEG computer demarcating the onset and offset of each action (see cues in **Table 2**).

#### **DATA ANALYSIS**

Eye blinks and movement artifacts were digitally identified in the EOG recording and removed. Other types of EEG artifacts were also automatically and manually removed prior to analysis. Data were analyzed only if sufficiently "clean" EEG, with no movement or eye blink artifacts, was present. Between 10 and 30% of the data were removed for individual participants. For each cleaned segment the integrated power in the 8–13 Hz mu range was computed using a Fast Fourier Transform. Data were segmented into epochs of 2 s beginning at the start of the segment. Fast Fourier Transforms were performed on the epoched data (1024 points or 2048 ms at the 500 Hz sampling rate). A cosine window was used to control for artifacts resulting from data splicing.
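The band-power step can be sketched as below. This is a minimal illustration rather than the analysis code used in the study: it assumes a Hann taper as the cosine window, defines integrated power as the sum of squared spectral magnitudes over the 8–13 Hz bins, and replaces a library FFT with a direct DFT evaluated only at those bins so that the example stays dependency-free. The synthetic sine-wave epochs stand in for cleaned EEG.

```python
import cmath
import math

FS = 500.0    # sampling rate (Hz) given in the EEG recording section
N_FFT = 1024  # samples per transformed epoch


def hann_window(n):
    """Raised-cosine (Hann) taper; the specific cosine window is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def band_power(epoch, fs=FS, low_hz=8.0, high_hz=13.0):
    """Sum of squared DFT magnitudes over bins whose frequency lies in [low_hz, high_hz]."""
    n = len(epoch)
    tapered = [x * w for x, w in zip(epoch, hann_window(n))]
    total = 0.0
    for k in range(n // 2 + 1):
        if low_hz <= k * fs / n <= high_hz:
            coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                        for t, v in enumerate(tapered))
            total += abs(coeff) ** 2
    return total


# Synthetic epochs (hypothetical test signals, not recorded EEG).
alpha_like = [math.sin(2.0 * math.pi * 10.0 * t / FS) for t in range(N_FFT)]
beta_like = [math.sin(2.0 * math.pi * 20.0 * t / FS) for t in range(N_FFT)]
print(band_power(alpha_like) > band_power(beta_like))  # True
```

With 1024 samples at 500 Hz the frequency resolution is about 0.49 Hz, so ten bins fall inside the 8–13 Hz range.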

Suppression of the 8–13 Hz band in both the auditory and motor tasks was computed as the ratio of power in response to Action sounds, Non-Action sounds, and motor movements, relative to control sounds. A ratio was used to control for variability in absolute power as a result of individual differences in scalp thickness and electrode impedance, as opposed to absolute differences in electrical activity. Since ratio data are inherently non-normal as a result of lower bounding, a log transform was used for statistical analyses. A log ratio of less than zero indicates suppression whereas a value of zero indicates no suppression and values greater than zero indicate enhancement.
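The suppression index can be written as a one-line computation. The sketch below assumes a natural logarithm (the base is not stated above, and the sign of the result does not depend on it); the function name and the example power values are hypothetical.

```python
import math


def mu_suppression(power_condition, power_control):
    """Log ratio of condition power to control-sound power in the 8-13 Hz band."""
    return math.log(power_condition / power_control)


# Hypothetical band-power values in arbitrary units.
suppressed = mu_suppression(8.0, 10.0)   # condition power below control
unchanged = mu_suppression(10.0, 10.0)   # equal power
enhanced = mu_suppression(12.0, 10.0)    # condition power above control
print(suppressed < 0, unchanged == 0, enhanced > 0)  # True True True
```

Because the index is a ratio, multiplying both powers by the same participant-specific gain (for example, from scalp thickness or electrode impedance) leaves it unchanged.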

Accuracy on the Auditory Oddball Task was calculated as hits plus correct rejections divided by total number of blocks. For EEG suppression, an omnibus ANOVA was first run followed by distinct ANOVAs for midline (Fz, Cz, Pz), frontal (F3, F4, F7, F8), centro-parietal (C3, C4, P3, P4), temporal (T3, T4, T5, T6), and occipital (O1, O2) sites. Stimulus type (mouth sound, hand sound, motor mouth, motor hand, environmental sounds) and electrodes were used as within-subject factors with Greenhouse-Geisser corrections applied to the degrees of freedom and only the corrected probability values reported. Partial eta squared values (η<sub>p</sub><sup>2</sup>) are also reported. Pairwise comparisons were conducted using Tukey's Honest Significant Difference (HSD), while a Bonferroni correction was applied to correct for multiple comparisons.
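For readers who want to relate the reported effect sizes to the test statistics, partial eta squared can be recovered from an F value and its degrees of freedom with the standard identity η<sub>p</sub><sup>2</sup> = F·df<sub>effect</sub> / (F·df<sub>effect</sub> + df<sub>error</sub>). The sketch below is a consistency check using the omnibus stimulus-type effect reported in the Results, F(4, 220) = 63.2; the function name is hypothetical and this is not the statistical software used in the study.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Omnibus main effect of stimulus type, F(4, 220) = 63.2.
print(round(partial_eta_squared(63.2, 4, 220), 3))  # 0.535
```

A Greenhouse-Geisser correction multiplies both degrees of freedom by the same factor, so it does not change the value obtained from this identity.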

#### **RESULTS**

Trial lengths were longer in Study 2 (7 s ITI) than in Study 1 (3 s ITI) to allow for more processing of the stimulus, and the type of response to oddball trials differed, with overt responses used in Study 1 and covert responses in Study 2. Nonetheless, no statistical differences were found between the studies and therefore only the combined results are reported.

#### **AUDITORY IDENTIFICATION TASK**

During the Auditory Identification Task, participants were able to identify sounds with a high degree of accuracy within three presentations of a sound. The specific results for Hand, Mouth, and Environment sounds were 95, 98, and 93%, respectively. During the oddball trials, participants were able to correctly detect the oddball event with 97% accuracy.

#### **EEG 8–13 Hz SUPPRESSION**

There was a main effect of stimulus type across all electrodes, *F*(4, 220) = 63.2, *p* < 0.01, η<sub>p</sub><sup>2</sup> = 0.535. Pairwise comparisons showed that all but audio hand vs. environmental sounds were highly significant (*p* < 0.01). There was also a main effect of electrodes, *F*(16, 880) = 7.9, *p* < 0.01 and a stimulus type × electrode interaction, *F*(64, 3520) = 5.99, *p* < 0.01.

Individual ANOVAs for subsets of the electrodes (frontal, centro-parietal, midline, temporal, occipital) showed results similar to those of the omnibus ANOVA, as shown in **Figure 1** for centro-parietal sites. Likewise, frontal electrodes showed a main effect of stimulus type, *F*(4, 220) = 48.8, *p* < 0.01, η<sub>p</sub><sup>2</sup> = 0.470. Pairwise comparisons for this subset of electrodes indicated significant differences for all comparisons except the audio hand vs. environmental sounds. There was a significant stimulus type × electrodes interaction, *F*(12, 660) = 3.29, *p* < 0.01, with greater suppression over right (F4) compared to left (F3) anterior frontal sites (see **Figure 2**). This difference occurred primarily for hand sounds.

Centro-parietal electrodes exhibited a main effect of stimulus type, *F*(4, 220) = 31.6, *p* < 0.01, η<sub>p</sub><sup>2</sup> = 0.365. Like the frontal electrodes, pairwise comparisons indicated significant differences for all comparisons except the audio hand vs. environmental sounds. There was also a main effect of electrodes, *F*(3, 165) = 3.08, *p* < 0.05, with significantly more suppression over left (C3, P3) compared to right (C4, P4) hemisphere sites. A stimulus type × electrode interaction, *F*(12, 660) = 3.09, *p* < 0.05, showed that in the audio hand and the environmental sounds conditions there was greater suppression compared to other conditions, more over the left hemisphere for the audio hand condition and over the right hemisphere for environmental sounds (see **Figure 3**).

Midline electrodes showed a statistically significant main effect of stimulus type, *F*(4, 220) = 55.0, *p* < 0.01, η<sub>p</sub><sup>2</sup> = 0.500, with all pairwise comparisons showing significant differences. A main effect of electrodes, *F*(2, 110) = 7.5, *p* < 0.01, showed greater suppression occurring at the central (Cz) compared to frontal (Fz) and parietal (Pz) sites. There was also an interaction between stimulus type and electrodes, *F*(8, 440) = 9.5, *p* < 0.01, with audio hand showing the greatest suppression.

**FIGURE 1 | Mu suppression over centro-parietal regions (C3, C4, P3, P4) in the action (hand, mouth) and non-action (environmental) auditory conditions as well as during the motor actions (hand, mouth).** Mu suppression is determined as the log of the ratio between experimental condition and baseline condition. Error bars are standard error of the mean.

**FIGURE 2 | Mu suppression across stimulus type at frontal regions (F3, F4, F7, F8) showing the stimulus type by electrode interaction in which greater mu suppression was recorded over right (F4) compared to left (F3) anterior frontal sites but not over more ventral anterior sites (F7, F8).**

Temporal electrodes exhibited a statistically significant main effect of stimulus type, *F*(4, 220) = 52.0, *p* < 0.01, electrodes, *F*(3, 165) = 25.5, *p* < 0.01, and an interaction between stimulus type and electrodes, *F*(12, 660) = 4.28, *p* < 0.01.

Occipital electrodes showed a main effect of stimulus type, *F*(4, 220) = 42.1, *p* < 0.01, with all audio sounds (hand, mouth, environmental) showing suppression compared to more positive responses during motor actions. There was also a main effect of electrodes, *F*(1, 55) = 15.2, *p* < 0.01, such that greater suppression was seen over left (O1) compared to the right (O2) hemisphere.

**FIGURE 3 | Mu suppression over centro-parietal regions (C3, C4, P3, P4) showing that although no differences occurred between these sounds (audio hand vs. environmental), a stimulus type by electrode interaction showed that audio hand sounds exhibited greater suppression over left hemisphere sites (C3, P3) while environmental sounds exhibited greater suppression over right hemisphere sites (C4, P4).**

Finally, there was a stimulus type × electrodes interaction, *F*(4, 220) = 4.2, *p* < 0.01, such that the greatest amplitude differences between left and right hemisphere sites occurred for environmental sounds (see **Figure 4**).

#### **DISCUSSION**

Results from this study show that EEG 8–13 Hz mu rhythms exhibit amplitude modulation not only during the performance of an action (synchronization) but also during hearing of action related sounds as well as non-action related sounds (desynchronization). Synchronization during action execution has been previously reported and while differences in the direction of modulation may reflect motor vs. mirroring processes, they may also involve increases in sensitivity to motivationally meaningful events (Pineda and Oberman, 2006). Differences in the spatial distribution of the mu suppression triggered by the action vs. the environmental sounds may reflect different neural sources, with action related sounds displaying a locus over left hemisphere in the anterior-posterior axis while non-action environmental sounds display primarily a right hemisphere locus.

A concurrent fMRI-EEG experiment (Arnstein et al., 2011) has shown that mu suppression during action observation and execution, as measured in the EEG over central electrodes (C3), directly correlated with BOLD increases in the dorsal premotor and parietal cortex (posterior SI and adjacent posterior parietal lobe). Activity in the ventral premotor cortex was less correlated with mu suppression over C3. Because the exact same sound stimuli used here have been used previously in an experiment using fMRI (Gazzola et al., 2006), we can use the results of the fMRI-EEG study to link the results from the current EEG study and the past fMRI experiment. In particular, Gazzola et al. (2006) found that in the left hemisphere, the hand action sounds recruited the dorsal premotor and somatosensory cortex more than the environmental sounds [Figure S1 in Gazzola et al. (2006)], while the reverse was true for the right hemisphere. Here, we find the same lateralization pattern, with hand action sounds producing more mu suppression than environmental sounds over C3 (left) and the environmental sounds producing more mu suppression than the hand-action sounds over C4 (right). In addition, Gazzola et al. (2006) found a somatotopic activation pattern, with hand action sounds and hand-action execution recruiting the dorsal premotor and mid-parietal cortex, while mouth action sounds and mouth action execution activated the ventral premotor cortex and the ventral-most parietal cortex. In the fMRI-EEG study, the dorsal but not the ventral premotor cortex, and the mid- rather than ventral parietal cortex reliably predicted mu suppression over C3. Accordingly, one would expect more mu suppression over C3 for the hand than mouth-actions. For the sounds, this was exactly what was found in the present experiment. For the action execution, mu suppression was not found for either of the action types, but the mu-power over C3 was lower for the hand than the mouth action execution.
Accordingly, the two experiments using fMRI and EEG, respectively, find compatible results in terms of lateralization and somatotopical arrangement of the activation triggered using the same auditory stimuli.

These observations are also congruent with previous studies that have used different sounds but indicated that action sounds (those that are reproducible by the body) and non-action related sounds are processed by separate neural systems in the human brain (Pizzamiglio et al., 2005). More specifically, Pizzamiglio et al. report that left posterior superior temporal and premotor areas appear to reflect action-related sounds, while bilateral areas in the temporal pole appear to respond to non-action related sounds. The present findings of a left hemisphere locus for action-related sounds underscore the fact that auditory aspects of mirroring, as reflected in the dynamics of the EEG mu rhythm, exhibit a functional specialization similar to that inherent in processing auditory sounds with semantic meaning (Zahn et al., 2000). Furthermore, sounds associated with different effectors, e.g., hand compared to mouth action sounds, show distinct modulations of this rhythm. Hence, the data are consistent with the idea that biological sounds engage mirroring processes (both synchronization and desynchronization actions) in a manner similar to that which occurs during the observation and reproduction of motor actions (Pfurtscheller and Lopes da Silva, 1999; Neuper et al., 2006). The multi-sensory properties of the human MNS are thus assumed to help build a more accurate representation of sensorimotor activity from the visual, auditory, and other information embedded in the observation of other individuals. Whether these various aspects of the input are processed simultaneously or treated equally is left for future research.

Desynchronization or suppression of EEG rhythms has generally been interpreted as a correlate of an activated cortical area with increased excitability, while synchronization has been interpreted as a correlate of a deactivated cortical region (Pfurtscheller, 2001; Pineda, 2005). Mu rhythm oscillations in the present study were enhanced (meaning that the underlying neurons were less active and more synchronized) for mouth compared to hand stimuli, both while performing the movements in the Motor task and hearing the sounds in the Auditory task. In addition to fitting with the data of Gazzola et al. (2006), who used the same stimuli, the interpretation of these differences in mouth- and hand-related processing is also compatible with the findings of Pfurtscheller and Neuper (1994), who reported that excitation of one sensorimotor area is typically accompanied by inhibition of a neighboring sensorimotor area as a result of lateral inhibitory connectivity. Consistent with this idea, mu rhythms recorded at central sites (C3, C4), which are located closer to the hand than mouth area in the motor strip, showed suppression to hand-related sounds and enhancement to mouth-related sounds. These center-surround effects, however, were asymmetrical since enhancements to mouth-related sounds were typically >20%, while suppressions to hand-related sounds were typically <10%. The basis for such an asymmetry is unclear although it appears consistent with greater suppression occurring to hand-related sounds at the C3 and C4 electrode sites. This difference cannot be attributed to task performance because no difference in counting of oddball control trials occurred for mouth- and hand-related sounds.

Participants understood and actively listened to the sounds as indicated by the results of the Auditory Identification Task and responses to the oddball events during the Auditory Task condition. The EEG findings are therefore not likely due to differences in perceptual or attentional factors but more consistent with the assumption that mirroring is reflected in the dynamics of the mu rhythm. That is, meaningful action sounds trigger greater mirroring as reflected in more mu suppression because they are embodied by the listener in order to understand them. This is supported by the different patterns of mu suppression to action (Mouth and Hand-based sounds) and non-action (Environmental) sounds. That is, greater suppression was recorded over the left hemisphere (especially over parietal areas) in response to action compared to non-action sounds but this is reversed over the right hemisphere. Thus, it is congruent with the idea that the mu rhythms (and the premotor to sensorimotor cortex connection) index auditory activity related to mirroring and that such a system exhibits lateralized processing as a function of the semantic associations with the sounds.

Studies 1 and 2 of the present work were designed to address the influence of motor preparation to report oddballs on the mu suppression results. We required participants in one study to make overt responses by clicking a mouse and in a separate study to make covert responses by mentally counting the oddball trials. Both overt and covert responding require motor planning. However, in one case it specifically involves an effector, such as the hand, while in the other it avoids such effector-based preparation. Since no significant differences occurred between the two studies, it suggests that EEG mu rhythms are either unaffected by the type of motor preparation or similar motor preparation occurs for both overt and covert responding.

When we observe another person moving, we only see the external consequences of their actions. To reproduce this action, we instead need to generate motor programs that produce a similar action. Clearly, the visual signals entering the eye during action observation are fundamentally different from the motor commands that need to be generated to perform a similar action. Mapping observed actions onto similar states in oneself, in order to understand or imitate the actions of others, poses what has been called the correspondence problem in mirroring (Brass and Heyes, 2005). Specifically, how do observed movements actually map onto the observer's own motor system to enable everything from simple motor imitation to visceral discomfort upon seeing a queasy face? That is, how do we actually translate what we see into what we do (Brass and Heyes, 2005; Pineda, 2005)? In the visual domain, this correspondence problem is constrained by the fact that the observer can witness what body part the agent has used to perform the action. In the auditory domain, such information is lacking. When you hear the crunching of the soda can, it is impossible to know whether the left hand or the right was used. Perhaps the left or right foot was used instead. The sound alone does not reveal which. Nevertheless, hand-action sounds and mouth-action sounds generated different patterns of mu suppression, which mirrors the relative amount of mu suppression during action execution, and previous fMRI studies have shown the existence of somatotopic brain activity (Gazzola et al., 2006) that allows classification as to which effector was used from sounds alone (Etzel et al., 2008). It has been proposed that this somatotopy is the result of Hebbian learning: while we crush a coca-cola can with our right hand, we simultaneously perform the motor program, and hear (through what is called re-afference) the sound of this action.
Through Hebbian learning, neurons in high-level auditory cortex that respond to the sound of this action then would enhance their synaptic connections with motor neurons in the parietal and premotor cortex that caused the action and with neurons in SI that sense the tactile consequences of performing such hand actions (Keysers and Perrett, 2004). Thereafter, listening to the sound would trigger, through these Hebbian associations, the motor programs corresponding to that action and the somatosensory representations of what such actions feel like. Because such motor programs and somatosensory fields are located more dorsally in the premotor, somatosensory and posterior parietal cortices than those for mouth motor programs (Gazzola et al., 2006), the sound of such actions that we normally perform with our hands will trigger activity preferentially in these more dorsal regions in fMRI (Gazzola et al., 2006) and cause maximum mu suppression under C3 in the present study. While performing mouth actions like gurgling, we can hear and feel ourselves perform the action, and neurons in the high-level auditory cortex responding to the sound of actions we normally do with the mouth will wire together with the more ventrally located mouth motor and somatosensory representations in the premotor and parietal lobe. Accordingly, hearing such sounds will later trigger more ventral activity in the premotor and parietal lobe (Gazzola et al., 2006) and less mu suppression over C3. In support of the notion that brain activity in premotor and posterior parietal cortex is triggered by sounds as a result of Hebbian associations rather than inborn processes, Lahav et al. (2007) have shown that no premotor activity occurs to the sound of piano music in piano-naïve listeners.
However, a few hours of piano lessons, during which participants repeatedly experience the temporal contingencies between pressing piano keys and musical notes suffices to train new neural connections: after the training, piano music suddenly did trigger premotor activation in regions used to perform hand actions while listening to the learned piano melodies.

Environmental sounds such as the sound of a train passing are at first glance not considered body action-related, and it may seem odd that they should trigger any mu suppression at all. However, it is not unthinkable that they could be variably embodied, via Hebbian learning processes, in different individuals. Does John associate the sound of a train passing with the vibrations of his morning commute on the locomotive? Does Suzie associate the sound of a train passing with using her arm to move her electric train along the play tracks? The fact that such environmental sounds in the present study exhibit significant levels of mu suppression is consistent with this embodied argument for non-action stimuli. Nonetheless, much more research is needed to clarify these explanations.

#### **CONCLUSION**

The findings of this study provide support for a mirroring system in the human brain that responds to specific sounds, as well as greater understanding of the distinctions and similarities between processing the auditory and visual aspects of mirroring. The patterns of mu suppression across cortical regions to different categories of sounds and to effector-specific sounds suggest differential engagement of this mirroring system in the human brain when processing different categories of sounds and may offer a potential set of signals to explore for the development of a passive brain computer interface. Clearly, future studies are needed to specifically investigate that possibility, as well as the effects of auditory and motor tasks on special populations, such as autistic individuals, which in turn could provide insight into the degree of importance of the auditory MNS in social interactions and language.

#### **ACKNOWLEDGMENTS**

We would like to thank Drs. Marta Kutas, Andrea Chiba, Inna Fishman, and Sarah Creel for their help and feedback throughout this project. We would also like to thank all those from the Cognitive Neuroscience Laboratory at UCSD for their logistical and moral support: Alan Kiang, Adrienne Moore, Max Keuken, Heather Pelton, Jia-Min Bai, Conny Soest, Dan Lotz, Nick Pojman, Alexy Andrade, Albert Anaya, and Alicia Trigeiro. We would like to thank Valeria Gazzola and Lisa Aziz-Zadeh for helping to record the original auditory stimuli.

#### **FUNDING**

Christian Keysers was supported by VIDI grant 452-04-305 of the Netherlands Science Foundation (N.W.O.) and ERC grant 312511 of the European Research Council.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 12 August 2013; accepted: 13 October 2013; published online: 11 December 2013.*

*Citation: Pineda JA, Grichanik M, Williams V, Trieu M, Chang H and Keysers C (2013) EEG sensorimotor correlates of translating sounds into actions. Front. Neurosci. 7:203. doi: 10.3389/fnins.2013.00203*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2013 Pineda, Grichanik, Williams, Trieu, Chang and Keysers. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Classifying visuomotor workload in a driving simulator using subject specific spatial brain patterns

*Chris Dijksterhuis<sup>1</sup>\*, Dick de Waard<sup>1</sup>, Karel A. Brookhuis<sup>1,2</sup>, Ben L. J. M. Mulder<sup>1</sup> and Ritske de Jong<sup>1</sup>*

*<sup>1</sup> Department of Psychology, University of Groningen, Groningen, Netherlands*

*<sup>2</sup> Department of Infrastructure Systems and Services, Delft University of Technology, Delft, Netherlands*

#### *Edited by:*

*Anne-Marie Brouwer, Tilburg University, Netherlands*

#### *Reviewed by:*

*Tzyy-Ping Jung, University of California San Diego, USA Scott E. Kerick, US Army Research Laboratory, USA*

#### *\*Correspondence:*

*Chris Dijksterhuis, Department of Psychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, Netherlands e-mail: c.dijksterhuis@rug.nl*

A passive Brain Computer Interface (BCI) is a system that responds to the spontaneously produced brain activity of its user and could be used to develop interactive task support. A human-machine system that could benefit from brain-based task support is the driver-car interaction system. To investigate the feasibility of such a system to detect changes in visuomotor workload, 34 drivers were exposed to several levels of driving demand in a driving simulator. Driving demand was manipulated by varying driving speed and by asking the drivers to comply with individually set lane keeping performance targets. Differences in the individual driver's workload levels were classified by applying the Common Spatial Pattern (CSP) and Fisher's linear discriminant analysis to frequency filtered electroencephalogram (EEG) data during an offline classification study. Several frequency ranges, EEG cap configurations, and condition pairs were explored. It was found that classifications were most accurate when based on high frequencies, larger electrode sets, and the frontal electrodes. Depending on these factors, classification accuracies across participants reached about 95% on average. The association between high accuracies and high frequencies suggests that part of the underlying information did not originate directly from neuronal activity. Nonetheless, average classification accuracies up to 75–80% were obtained from the lower EEG ranges that are likely to reflect neuronal activity. For a system designer, this implies that a passive BCI system may use several frequency ranges for workload classifications.

**Keywords: passive brain computer interface, common spatial pattern, driving simulator, workload classification, adaptive automation, lateral control**

# **INTRODUCTION**

In contrast to an active Brain-Computer Interface (BCI), which allows users to engage in volitional thought control of a device, several BCI researchers have proposed to advance human-computer interaction by triggering machine actions based on inferences of the user's current mental state, known as passive BCI (Kohlmorgen et al., 2007; Cutrell and Tan, 2008; Zander et al., 2010; Zander and Kothe, 2011). For example, Kohlmorgen et al. (2007) showed that it is possible to classify mental workload elicited by a secondary task mimicking cognitive processes in a real driving environment. Moreover, these classifications were then used to switch on and off a tertiary task that mimicked an interaction with the vehicle's electrical devices, which in turn improved performance on the secondary task.

In the human factors and ergonomics literature, which traditionally focuses on overall system performance and safety critical tasks, the potentially detrimental effects of both mental underload and overload have been a major research topic for decades. Mental workload can be defined as a "reaction to demand" and "the proportion of capacity that is allocated for task performance" (de Waard, 1996). Mental underload and overload both represent compromised functional states during which a breakdown of primary task performance is more likely (e.g., Hockey, 1997, 2003; see also Brookhuis and de Waard, 2010). Preventing these hazardous functional states by maintaining mental workload or task demand within an acceptable range in real-time has been the central goal of adaptive automation since the seventies (Chu and Rouse, 1979; Hancock and Chignell, 1988; Rouse, 1988; Parasuraman et al., 1992; Kaber and Prinzel, 2006).

A large part of adaptive automation literature is devoted to determining the right moment of providing or withdrawing task support, and several types of triggers may be available to optimize performance of a human-machine system (e.g., critical events and human task performance; see Parasuraman et al., 1992). Therefore, the question arises as to what physiological measures could offer in terms of improving the overall system's performance. The most important argument for the inclusion of physiological measures in a control loop is their potential for detecting user states that would otherwise remain hidden. Human beings may exhaust themselves to protect primary task performance in demanding situations. While performance protection is important for dealing with short bursts of task demand, when exposed to longer periods of high workload, it may have affective costs such as increases in anxiety, but also compensatory performance costs, such as neglecting secondary tasks (Hockey, 1997, 2003). Since straining effort expenditure has a neurophysiological base, the ability to reliably classify workload using physiological measures could be used to offload a person, before performance effects become apparent.

Traditional research approaches might not be well suited for uncovering the underlying neurophysiological mechanisms that could be used in a support system. As pointed out by Fairclough (2009), the fundamental problem of using physiological measures is the complex relationship between user states, such as mental overload, and their associated physiological variables. Specifically, four physiology-to-state mappings can be distinguished (Cacioppo et al., 2000). In the most straightforward case, there is a unique *one-to-one* mapping between a physiological variable and the psychological construct. Such a unique, one-to-one mapping would be ideal for an interactive system. However, a one-to-one mapping that holds true in both the laboratory and the field has to date not yet been found. A *many-to-one* mapping is more complicated as several signals are needed to infer a mental state. For example, heart rate, heart rate variability and blood pressure have been combined to infer mental workload (e.g., Mulder et al., 2009). In a *one-to-many* mapping, one physiological signal responds to a range of user states. For instance, systolic blood pressure information was found to be sensitive to excitement, frustration, and stress (Cacioppo and Gardner, 1999). Lastly, the most common finding is a *many-to-many* mapping where many signals are in fact sensitive to many mental states. Ultimately, in an implicit human-machine control loop, a one-to-one or a many-to-one relation is required. As briefly mentioned, another factor complicating the relationship between physiological measures and user state is lack of generalizability outside the laboratory setting where a mapping was found. Simply put, a relation between a physiological measure and a user state found in the laboratory may not hold true in a real world setting where environmental conditions are less controlled.

Furthermore, due to large individual differences in physiological responsiveness, traditional statistical tests might not be suitable to uncover relationships that are valuable for implicit machine control. Even in a repeated measures analysis of variance, where the variations due to individual differences are partly taken out of the error term, the directions of effects within the individuals need some consistency across individuals to reach statistical significance. While significant effects on a group level are interesting from a fundamental point of view, individual patterns are more relevant, when physiology is applied in human-machine systems. In this respect, the feature extraction and classification algorithms used by BCI researchers offer a promising way of dealing with these limitations.

As shown by Kohlmorgen et al. (2007), driver support may be linked to electroencephalogram (EEG) signals. Given the fact that the driving task is increasingly demanding, due to increased complexity of the road network, increased traffic intensity, and the availability of potentially distracting in-vehicle information systems, such as phones (e.g., Carsten and Brookhuis, 2005), accurate assessment of user state while driving might be used to benefit driving performance. From driving behavior literature, it is clear that besides mental workload, other, related psychological constructs might be investigated for use in a support system. At this point there is no consensus about the exact psychological processes underlying driving behavior. Depending on the theoretical framework, the level of (subjective) risk, workload, or a general feeling of comfort is either maintained or avoided (e.g., risk homeostasis theory, the zero-risk theory, risk allostasis theory, and the safety margin model; Näätänen and Summala, 1976; Wilde, 1982; Fuller, 2005; Summala, 2005; see also Lewis-Evans et al., 2011). To make it even more complex, drivers alter the level of workload in practice through behavioral adaptations. For example, in demanding situations with high information density (e.g., complex variable message signs), narrow lanes or a winding road, a driver may reduce speed, which will reduce the reaction time requirements, thereby avoiding high workload levels (Hockey, 2003; Lewis-Evans and Charlton, 2006).

Ultimately, we would like to provide a proof of concept for a passive brain-car interface that changes driving speed in response to visuomotor workload, thereby keeping workload levels within an acceptable range, similar to a human driver. However, in preparation for this, we have first investigated the feasibility of using EEG signals to classify between levels of lane keeping demand in a driving simulator. For this, we applied subject-specific Common Spatial Patterns (CSPs; e.g., Blankertz et al., 2008). The main advantage of using the CSP technique is that it maximizes the difference between two conditions by creating linear combinations of all included electrodes; these linear combinations are the spatial filters used to produce CSP components. In this way, some electrodes contribute more to the filtered signal(s) than others. These CSP components are determined per participant and therefore, individual differences are accounted for. The most discriminative components are then used to distinguish conditions.

Lane keeping demand was manipulated by changing driving speed, mimicking drivers' natural behavior. Driving speed was set relative to the participants' comfortable speed, since the effort that is required to keep the car safely on the road may vary between drivers for absolute driving speeds. A relatively high driving speed is hypothesized to result in a relatively high visuomotor workload. In addition, since the Standard Deviation of the car's Lateral Position (SDLP) reflects workload (e.g., Dijksterhuis et al., 2011), an individually set target SDLP was presented to the participants on the virtual windshield, urging drivers to show less swerving behavior in the driving lane. A relatively low target SDLP is hypothesized to result in a relatively high workload level.

### **MATERIALS AND METHODS**

### **PARTICIPANTS**

A total of 17 males and 17 females were recruited through social media and poster announcements throughout the University of Groningen and were paid 20 Euros for participation. A large part of the participants were either Dutch or German students at this university. Ages ranged from 21 to 34 years (*M* = 24.0; *SD* = 3.0) and the participants had held their driver's license for 3 to 15 years (*M* = 5.3; *SD* = 2.8). Self-reported total mileage ranged from 3000 to 350,000 km (*M* = 53,000; *SD* = 76,000), while the reported average annual mileage over the past 3 years ranged from 1000 to 50,000 km (*M* = 9000; *SD* = 11,000). None of the participants reported using prescribed drugs that might affect driving behavior. The Ethical Committee of the Psychology Department of the University of Groningen approved the study.

# **SIMULATOR AND DRIVING ENVIRONMENT**

The study was conducted using a fixed-base vehicle mock up with functional steering wheel, indicators, and pedals. The simulator runs on ST Software©, which is capable of simulating fully interactive traffic. The three computers dedicated to the simulator compute the road environment and traffic, which are displayed on three 32-inch plasma screens and provide a total view of the driving environment of 210°. In addition, three rear-view mirrors are projected on the screens. A detailed description of the driving simulator used can be found in Van Winsum and Van Wolffelaar (1993).

For the experiment a two-lane road (each 2.75 m wide) was prepared, without intersections and winding through rural scenery. The road consisted mainly of easy curves (about 80%) with a constant radius of 380 m and ranging in length from 120 to 800 m. The road surface was marked on the edges by a continuous line (0.20 m wide) and in the center by a discontinuous (dashed) line (0.15 m wide). Outside the edges a soft shoulder was present and there were no objects in the direct vicinity of the road. In the driving direction of the participants, no traffic was present. However, oncoming traffic, travelling between 76 and 84 km/h, was generated with a random interval gap between 1 and 2 s, resulting in 40 passing private vehicles per minute on average. The speed of the participant's vehicle was controlled by the simulator for all rides during the experimental session, except for the initial ride during which the participants drove the simulator car (width: 1.60 m) in automatic gear mode.

# **DESIGN AND PROCEDURE**

Upon arrival at the experimental site of the University of Groningen, a participant was informed in general terms with respect to the experimental design, was requested to sign an informed consent form, and asked to fill in a short questionnaire mainly related to their driving experience. Hereafter, the participant was given some time (ca. 7 min) to practice driving in the simulator, before the sensors were attached. Next, a three minute baseline recording was made while the participant sat in the simulator car chair and an aquatic movie played on the center screen of the simulator.

After this baseline recording the participant completed 16 short rides. After each ride, the participant was requested to park the vehicle on the side of the road and to provide an answer to two brief questions (on subjective mental effort and estimated driving speed). During the initial ride (140 s) the participant exerted both longitudinal and lateral control over the vehicle and was asked to find and drive at a speed that felt most natural and comfortable in this situation while the speedometer was turned off to prevent rule-based speed setting. The speedometer remained turned off for the entire experiment. The mean speed and standard deviation of the vehicle's lateral position (SDLP) on the road during the last 110 s of the initial ride represented the participant's personal, comfortable driving style. These parameters were saved and used to set driving speed and target SDLP during the 15 remaining rides.

During these 15 rides (130 s each), speed was set relative to the participant's comfortable speed (either −40, −20, 0, +20, or +40 km/h). In addition, while speed was set at the comfortable driving speed, the participant was requested to keep SDLP at either 0, −0.05, or −0.10 m relative to the initial SDLP, which represent a normal, hard, or very hard task. For the other driving speeds, the target SDLP was determined as follows. From a pilot study (*n* = 9), using a similar roadway environment, it was found that SDLP naturally increases as a function of speed. To compensate for this effect and thereby create five roughly comparable steering challenges across speeds, another 0.03 m per speed level was either added to or subtracted from the target SDLP. For example, when driving 40 km/h slower than the comfortable speed while the target SDLP condition was set at "very hard," the numerical target SDLP was set 0.10 + 2 × 0.03 = 0.16 m lower than the comfortable SDLP as established during the initial ride. Current values of SDLP were derived from a 15 s moving window, which was updated every second, and these were projected onto the bottom of the windshield of the simulator while driving, adjacent to the target SDLP. In this way a driver could monitor real SDLP and compare it to the target. Accounting for the time window and for the time the simulator needed to get to the required speed, only the last 110 s of each ride was used in subsequent analyses. To be clear, the data used for this analysis were the raw, not averaged, vehicle parameters. In total, the experimental manipulations resulted in a within-subject design consisting of two repeated measures factors with several levels: speed (5) and target SDLP (3). The participants were exposed to these driving conditions according to a randomized schedule.
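The target-setting rule described above can be sketched in a few lines. This is a minimal illustration, not the simulator's actual code: the function and constant names are hypothetical, and the sign convention for faster speeds (raising the target by 0.03 m per speed level) is an assumption inferred from the statement that SDLP naturally increases with speed.

```python
# Hypothetical helper names; only the numeric values come from the text.
SPEED_STEP_KMH = 20           # spacing between adjacent speed levels (km/h)
SDLP_PER_SPEED_LEVEL = 0.03   # compensation per speed level (m)
DIFFICULTY_OFFSET = {         # how far below the comfortable SDLP the target sits (m)
    "normal": 0.00,
    "hard": 0.05,
    "very hard": 0.10,
}


def target_sdlp_offset(relative_speed_kmh, difficulty):
    """Target SDLP relative to the comfortable SDLP of the initial ride (m).

    Assumption: slower speeds lower the target and faster speeds raise it,
    by 0.03 m per speed level.
    """
    levels = relative_speed_kmh / SPEED_STEP_KMH  # -2, -1, 0, +1, +2
    offset = levels * SDLP_PER_SPEED_LEVEL - DIFFICULTY_OFFSET[difficulty]
    return round(offset, 2)


def target_sdlp(comfortable_sdlp, relative_speed_kmh, difficulty):
    """Absolute target SDLP (m) shown to the driver."""
    return round(comfortable_sdlp + target_sdlp_offset(relative_speed_kmh, difficulty), 2)
```

For the worked example in the text, `target_sdlp_offset(-40, "very hard")` gives a target 0.16 m below the comfortable SDLP.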

After finishing the last ride, the baseline measurement was repeated once more before all physiological sensors were removed. Finally, the participants were debriefed and were paid upon leaving.

# **DEALING WITH COLLISIONS**

Occasionally, the participants were challenged to the point that a collision with oncoming traffic could not be avoided. In total, six participants were involved in 10 collisions which is 1.8% of all experimental rides. Eight of these collisions occurred in a +40 km/h speed condition. When a collision occurred, that particular ride was repeated. Data acquired during the crash rides were not used for further analyses.

# **DATA ACQUISITION**

# *Vehicle parameters*

Driving speed and lateral position (LP) were sampled at 10 Hz. LP is defined as the difference in meters between the center of the participant's car and the middle of the (right hand) driving lane. Positive LP values correspond to deviations toward the right hand shoulder and negative values correspond to deviations toward the left hand shoulder. The sampled LP values were processed while driving and used to calculate mean LP and SDLP for each of the 16 rides. In addition, LP values were used to feed current values of SDLP back to the participant. These values were calculated by using moving, overlapping time windows (see Design and Procedure for more details) and represented an indication of the participant's lane keeping performance.
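The SDLP feedback described here can be sketched as a moving standard deviation over the 10 Hz lateral position samples. The window length (15 s) and update rate (1 s) are taken from the Design and Procedure section; the function name and the use of the sample standard deviation (`ddof=1`) are assumptions, since the text does not specify the estimator.

```python
import numpy as np

SAMPLE_RATE_HZ = 10   # vehicle parameters were sampled at 10 Hz
WINDOW_S = 15         # length of the feedback window (s)
UPDATE_S = 1          # feedback was refreshed every second


def moving_sdlp(lateral_position, fs=SAMPLE_RATE_HZ, window_s=WINDOW_S, update_s=UPDATE_S):
    """Return one SDLP value per update, each computed over the preceding window.

    Assumption: sample standard deviation (ddof=1); no value is produced
    until a full window of data is available.
    """
    lp = np.asarray(lateral_position, dtype=float)
    window = window_s * fs
    step = update_s * fs
    ends = range(window, len(lp) + 1, step)
    return np.array([lp[end - window:end].std(ddof=1) for end in ends])
```

Under these assumptions, a 130 s ride yields its first feedback value after 15 s and a new value every second thereafter.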

# *Subjective ratings*

After each ride, a rating on the one-dimensional Rating Scale Mental Effort (RSME) was requested (Zijlstra, 1993). The RSME ranges from 0 to 150 and several effort indications are visible alongside the scale which may guide the participant in marking the scale. Indications start with "absolutely no effort" (RSME score of 2) and end with "extreme effort" (RSME score of 112). The participants, who did not receive speed information from the speedometer, were also asked to write down an estimate of the driving speed they just experienced.

# *Physiological measures*

Physiological signals were sampled at 250 Hz. Firstly, the electrocardiogram (ECG) was registered using three Ag-AgCl electrodes, which were placed on the sternum (the ground electrode) and on the right and left side between the lower ribs. However, given the emphasis on brain activity in this paper, the ECG results are not reported here. Secondly, the electro-oculogram (EOG) was measured by Ag-AgCl electrodes attached next to the lateral canthus of each eye and above and below either the right or left eye. The EEG was measured using an electro-cap with 21 tin electrodes (at the following sites: FP1, FP2, Afz, F7, F5, Fz, F4, F8, T7, C5, C3, Cz, C4, T8, P7, P3, Pz, P4, P8, O1, and O2). The amplifier was a REFA 8–72 (Twente Medical Systems International, Enschede, The Netherlands). Portilab 2 software was used to record all physiological signals. The ground electrode used for the ECG recording also served as the participant's ground for the EEG recording. EEG and EOG signals were amplified with a 10 s time constant (0.016 Hz high-pass). All EEG channels were referenced against the average activity of all channels during the registrations. In addition, a reference electrode was attached to each mastoid. Impedances were kept below 10 kΩ for all electrodes.

# **EEG DATA PROCESSING**

Starting from the raw EEG signals, the sampled EEG and EOG data were first high-pass filtered (cutoff = 0.3 Hz, 12 dB/oct Butterworth filter) before the EEG data segments of the 15 experimental conditions (110 s each) were corrected for eye movements and blinks, using Brain Vision Analyzer (Gratton et al., 1983). The corrected data segments were then exported into binary files. No data epochs were removed before further processing.

The exported data files were processed using MATLAB R2010a (The MathWorks, Inc., USA, www.mathworks.com). After importing two data sets (two rides or conditions) of a particular participant, the EEG was band-pass filtered in the frequency domain (FFT filter) for the frequency band of interest, using an edge frequency of 1 Hz below and above the lower and upper frequency band limit, respectively. The imported data (110 s for each condition) were then segmented into one-second epochs and baselined relative to each epoch's mean activity. The first and last 10% of the epochs were omitted, leaving the 88 middle, non-overlapping epochs per condition in the cross-validation design. This entailed a repeated (50 times) random partitioning of two data classes (a condition pair) into a set of 66 training epochs (75%) and a set of 22 test epochs for each data class. The training sets were used to train the participant-specific classifier that was subsequently used to classify the testing epochs of each data class. This iteration process was carried out for each included participant, frequency band, EEG cap configuration, and data pair. The accuracies reported in the Results section reflect the average accuracies across all 50 iterations and all included participants.
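The epoching and repeated random partitioning described above can be sketched as follows. This is an illustrative Python reconstruction rather than the authors' MATLAB code; the function names, the array layout, and the fixed random seed are assumptions.

```python
import numpy as np

FS = 250          # EEG sampling rate (Hz)
EPOCH_S = 1       # epoch length (s)
N_REPEATS = 50    # number of random partitions
N_TRAIN = 66      # training epochs per class (75% of 88)


def make_epochs(data, fs=FS, epoch_s=EPOCH_S):
    """Cut (channels, samples) data into non-overlapping epochs.

    Each epoch is baselined by removing its own per-channel mean, and the
    first and last 10% of the epochs are dropped.
    Returns an array of shape (epochs, channels, samples_per_epoch).
    """
    n_channels, n_samples = data.shape
    per_epoch = fs * epoch_s
    n_epochs = n_samples // per_epoch
    epochs = data[:, :n_epochs * per_epoch].reshape(n_channels, n_epochs, per_epoch)
    epochs = epochs.transpose(1, 0, 2)
    epochs = epochs - epochs.mean(axis=2, keepdims=True)
    trim = n_epochs // 10
    return epochs[trim:n_epochs - trim]


def random_splits(n_epochs, n_train=N_TRAIN, n_repeats=N_REPEATS, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random partitioning."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        order = rng.permutation(n_epochs)
        yield order[:n_train], order[n_train:]
```

With 110 s of data per condition this leaves 88 epochs, and each of the 50 partitions assigns 66 of them to training and 22 to testing, matching the numbers in the text.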

To improve discriminatory power of the data classifier, the contrast between two data classes was optimized by using the CSP technique. This technique determines CSP filters in such a way that they maximize the variances of spatially filtered signals for one training set while minimizing them for the other (Blankertz et al., 2008). A CSP filter is a coefficient vector by which the original channels can be transformed. This results in a new spatially filtered channel (a CSP component), which is a linear combination of all original channels, and the total number of filters, and therefore the number of components, is equal to the number of original channels. The matrix of CSP filters is determined by solving a generalized eigenvalue problem. The filter corresponding to the largest eigenvalue yields a high variance signal in one condition, while producing a low variance signal in the other; and vice versa for the filter corresponding to the smallest eigenvalue. The CSP filters are therefore ranked according to these eigenvalues and the first and last filters in this sorted W matrix are usually used for further classification. To be more specific, in the current study, the two, four, or six filters (always an equal number from each side of the sorted W matrix) that resulted in the largest difference in variance between two training sets were used. Next, the total variance per training epoch and per CSP component was calculated and their logarithms were taken before being entered into Fisher's linear discriminant analysis. This analysis again transforms the data by determining the linear weights of the discriminant function that combines data points of the two training sets in such a way that the distance between them is maximized. Finally, the CSP filters and classifier weights were used to classify the remaining testing epochs of the two conditions.
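The CSP and Fisher discriminant steps can be illustrated with a compact sketch. It is a generic textbook-style formulation under stated assumptions, not the authors' MATLAB implementation: the function names are hypothetical, the class covariances are simple averages of per-epoch covariances, and the generalized eigenvalue problem is solved against the sum of both class covariances with `scipy.linalg.eigh`.

```python
import numpy as np
from scipy.linalg import eigh


def fit_csp(epochs_a, epochs_b, n_filters=2):
    """Fit CSP filters on two classes of epochs, each (epochs, channels, samples).

    Returns (n_filters, channels): an equal number of filters from each end
    of the eigenvalue-sorted filter matrix.
    """
    def mean_cov(epochs):
        return np.mean([e @ e.T / e.shape[1] for e in epochs], axis=0)

    cov_a, cov_b = mean_cov(epochs_a), mean_cov(epochs_b)
    # Generalized eigenvalue problem: cov_a w = lambda (cov_a + cov_b) w.
    # eigh returns the eigenvalues in ascending order.
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    half = n_filters // 2
    picks = np.r_[np.arange(half), np.arange(len(eigvals) - half, len(eigvals))]
    return eigvecs[:, picks].T


def log_var_features(epochs, filters):
    """Log of the variance of each CSP component, per epoch."""
    projected = np.einsum("fc,ncs->nfs", filters, epochs)
    return np.log(projected.var(axis=2))


def fit_fisher_lda(feat_a, feat_b):
    """Fisher's linear discriminant: weights and bias for two classes."""
    mean_a, mean_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    within = np.cov(feat_a, rowvar=False) + np.cov(feat_b, rowvar=False)
    weights = np.linalg.solve(within, mean_a - mean_b)
    bias = -weights @ (mean_a + mean_b) / 2
    return weights, bias


def predict_is_a(features, weights, bias):
    """True where an epoch is assigned to class A."""
    return features @ weights + bias > 0
```

In use, the filters and discriminant weights would be fitted on the 66 training epochs of each class and then applied to the 22 held-out epochs; repeating this over the random partitions and averaging gives an accuracy estimate comparable in spirit to the one reported.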

A wide range of EEG frequency bands were explored to investigate where useful discriminatory information might be present. Four frequency search strategies were deployed. The first frequency search strategy was characterized by both an increasing high-pass cut-off point (increasing 1 Hz for each iteration) and an increasing frequency bandwidth (the upper band limit was set to 1.5 times the lower band limit). At the first iteration, frequencies between 3 and 4.5 Hz were passed. At the last iteration, frequencies between 72 and 108 Hz were passed. The second strategy entailed exploring all frequencies between 3 and 70 Hz using a fixed bandwidth of 1 Hz. For the third strategy, bandwidth was set to 4 Hz and iterations ran from 4 to 72 Hz. Lastly, data were filtered in broad bands to classify between conditions: 8–30, 32–54, 56–78, and 80–102 Hz.
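As an illustration of the first search strategy, the band edges can be generated from the stated end points (3–4.5 Hz at the first iteration, 72–108 Hz at the last). The function name and defaults are hypothetical; only this first strategy is sketched, because the text gives explicit end points for it.

```python
def proportional_bands(first_low=3, last_low=72, step=1, ratio=1.5):
    """Bands whose upper limit is `ratio` times the lower limit.

    The lower limit increases by `step` Hz per iteration, so the bandwidth
    grows with frequency.
    """
    return [(low, low * ratio) for low in range(first_low, last_low + 1, step)]


bands = proportional_bands()
```

The resulting list starts at (3, 4.5) and ends at (72, 108.0), so bandwidth grows from 1.5 Hz to 36 Hz across the search.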

In addition, several EEG cap configurations were explored. To start with, all 21 EEG channels were included. To explore whether classification accuracy may differ between scalp regions, several subsets were defined and tested. Firstly, a peripheral set was defined, consisting of 14 electrodes (FP1, FP2, Afz, F7, F5, F4, F8, T7, C5, T8, P7, P8, O1, and O2). Secondly, a frontal set was defined, consisting of 7 electrodes (FP1, FP2, F7, F5, Fz, F4, and F8), which are associated with executive functions that are important in driving. Thirdly, a posterior set was defined, consisting of 7 electrodes (P7, P5, Pz, P4, P8, O1, and O2) associated with visuomotor processing. Lastly, the electrode set identified by Prinzel III et al. (2001) (P3, Pz, P4, Cz) was included; this set has often been used in adaptive automation research to compute the "engagement index" [defined as the ratio beta/(alpha + theta)].

Lastly, five condition pairs were selected from a total of 105 possible combinations [15!/(2!(15 − 2)!)]. An experimental condition can be defined in terms of its driving speed level and target SDLP difficulty level. To improve comparability, one factor was kept constant for each condition pair. In this way, four speed differences for the normal target level were classified: −40 vs. +40 km/h, −20 vs. +20 km/h, −20 vs. 0 km/h, and 0 vs. +20 km/h. The normal target level was chosen since this target resembles the individuals' natural driving behavior. Focusing on classifying between speed differences in this way was done because of the envisioned application. A brain-based adaptive cruise control would change speeds and therefore, the effect of speed interventions has to be assessed. In addition, as it turned out, the very hard target conditions required more subjective effort compared to the normal target level, and therefore, these two conditions were compared in the 0 km/h relative speed condition.
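The count of possible condition pairs and the five selected pairs can be enumerated directly. The tuple representation of a condition (relative speed, target level) is an illustrative choice, not something defined in the study.

```python
from itertools import combinations, product

speeds = [-40, -20, 0, 20, 40]              # km/h relative to comfortable speed
targets = ["normal", "hard", "very hard"]   # target SDLP difficulty

conditions = list(product(speeds, targets))     # 5 x 3 experimental conditions
all_pairs = list(combinations(conditions, 2))   # every unordered condition pair

# The five pairs analyzed in the study; one factor is constant within each pair.
selected_pairs = [
    ((-40, "normal"), (40, "normal")),
    ((-20, "normal"), (20, "normal")),
    ((-20, "normal"), (0, "normal")),
    ((0, "normal"), (20, "normal")),
    ((0, "normal"), (0, "very hard")),
]
```

Here `len(conditions)` is 15 and `len(all_pairs)` is 105, in line with the binomial coefficient given in the text.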

Due to data anomalies such as missing channels, eight participants were excluded from the offline classification phase of this study. Despite a smaller participant pool, the number of condition pair comparisons is very large: 161 frequency bands × 5 condition pairs × 5 EEG cap configurations × 26 participants × 3 numbers of components = 313,950. Given these large numbers, only a selection of aggregated classification accuracy values can be reported (**Figures 2**, **3**) next to examples of scalp topographies of CSP components (see **Figure 4** for an impression) reflecting how the information sources project to the scalp (retrieved from the inverse of W; see Blankertz et al., 2008).

#### **RESULTS**

#### **VEHICLE PARAMETERS AND SUBJECTIVE RATINGS**

Subjective ratings and vehicle parameters are shown in **Figure 1** and their test outcomes are listed in **Table 1**. To begin with, the participants' preferred speed during the initial ride ranged between 62 and 120 km/h, averaging at 90 km/h (see the black dot in **Figure 1A**). This is slightly faster than estimated for this ride (*M* = 74 km/h; **Figure 1B**). This pattern of underestimating driving speed is present for all speed levels (Pearson's product moment correlation = 0.99 over all conditions).

The dimensions of the vehicle and driving lane allowed for 0.58 m of swerving margin on both sides of the vehicle. As can be seen in **Figure 1C**, the participants stayed well within their driving lane on average and positioned the vehicle slightly toward the right hand shoulder (0.07 m on average). As can be read in **Table 1**, there was a significant effect of speed on LP. The participants' mean position on the road curves toward the right-hand shoulder, both when driving slower and faster than the preferred speed [polynomial contrasts showed a quadratic trend; $F(1, 33) = 15.35$, $p < 0.001$, $\eta_p^2 = 0.317$]. Next, as speed increased, so did the participants' mean SDLP (see **Figure 1D**), representing swerving behavior, from 0.18 m during the slowest speed to 0.30 m during the fastest speed. This is mainly a linear increase [$F(1, 33) = 182.81$, $p < 0.001$, $\eta_p^2 = 0.847$], although SDLP increases slightly more rapidly toward the higher speeds [quadratic trend; $F(1, 33) = 24.33$, $p < 0.001$, $\eta_p^2 = 0.424$]. Note that the factor target SDLP, which indicates the difficulty of keeping current SDLP values under the target SDLP while driving, had no effect on the actual SDLP.

**FIGURE 1 | Vehicle parameters and subjective ratings as a function of set driving speed condition. (A)** Real driving speed. **(B)** Estimated driving speed. **(C)** Lateral Position (LP). **(D)** Standard Deviation of the Lateral Position (SDLP). **(E)** Rating Scale Mental Effort (RSME). On the x-axes, values for the initial ride (black dots) are shown in addition to five driving speeds that were set, relative to the individual's preferred driving speed established during the initial ride. Error bars represent the standard error. LP values represent the middle of the car (car width = 1.60 m) in relation to the middle of the right (driving) lane (width = 2.75 m). Normal, hard, and very hard indicate the difficulty of keeping current SDLP values under the target SDLP: see section Design and Procedure for details. Positive LP values indicate a position to the right of the middle of the lane. Maximum score for mental effort is 150. *n* = 34.


**Table 1 | Multivariate test results for vehicle parameters and subjective effort ratings (Figure 1).**

*LP, Lateral Position; SDLP, Standard Deviation Lateral Position; RSME, Rating Scale Mental Effort. Significant effects (p < 0.05) are shown in bold. Speed effect relates to speed condition, Target to SDLP target.*

**discriminant analyses after spatial filtering for several condition pairs. (A–F)** The accuracy values represent the average subject-specific classification accuracy over all participants that resulted from the cross-validation scheme. Classifications were based on applying the two most contrasting CSP components to the EEG channels. *N* = 26. For each row of subfigures, a different EEG cap configuration was used. For the left column **(A,D)**, the frequency bandwidth is 1.5 times the

43–64.5 Hz. For the middle column **(B,E)**, 4 Hz bands were used with a step size of 4 Hz. For the right column **(C,F)**, a broad band frequency search (22 Hz) was deployed. All electrodes: FP1, FP2, Afz, F7, F5, Fz, F4, F8, T7, C5, C3, Cz, C4, T8, P7, P3, Pz, P4, P8, O1, O2. Peripheral set: FP1, FP2, Afz, F7, F5, F4, F8, T7, C5, T8, P7, P8, O1, O2. Frontal set: FP1, FP2, F7, F5, Fz, F4, F8. Posterior set: P7, P3, Pz, P4, P8, O1, O2. Engagement index (EI) set: P3, Pz, P4, Cz.

In addition, interactions between speed and target SDLP are not present in the data.
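As a reading aid for the effect sizes reported with the LP and SDLP trend tests: for a single effect, partial eta squared can be recovered from the *F* ratio and its degrees of freedom as η<sup>2</sup><sub>p</sub> = *F* · df<sub>effect</sub> / (*F* · df<sub>effect</sub> + df<sub>error</sub>). The short Python sketch below illustrates this relation; the function name is ours and is not taken from the statistics software used in the study.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios of the LP and SDLP trend contrasts reported in the text, all with df = (1, 33).
reported = {
    "LP quadratic": 15.35,
    "SDLP linear": 182.81,
    "SDLP quadratic": 24.33,
}
effect_sizes = {
    name: round(partial_eta_squared(f, 1, 33), 3) for name, f in reported.items()
}
print(effect_sizes)
```

The computed values agree with the reported effect sizes of 0.317, 0.847, and 0.424.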

**Figure 1E** shows that the mental effort ratings increased from between "a little effort" and "some effort" (a mean RSME score of 33) for the slowest speeds to between "rather much effort" and "considerable effort" (a mean RSME score of 34) for the fastest speeds [linear trend; *F*(1, 33) = 88.48, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.728]. Also, similar to SDLP, this increase is stronger toward the faster speeds [quadratic trend; *F*(1, 33) = 86.04, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.327]. In addition, even though target SDLP did not have an effect on vehicle parameters, there was a main effect on mental effort ratings. Bonferroni-corrected pairwise comparisons revealed that the "very hard" level was perceived as more difficult than the other two, while "normal" and "hard" did not show a difference.

#### **CLASSIFICATION RESULTS**

#### *Average classification accuracies*

In **Figure 2**, the classification accuracies for several condition pairs are shown. **Figure 2** (and **Figure 3**) shows classification accuracies for only two condition pairs (−20 km/h vs. +20 km/h and normal performance target vs. very hard performance target). Although more extreme driving speed conditions could have been shown (e.g., −40 vs. +40 km/h), we feel that more similar speed conditions better reflect real driving circumstances and are therefore more relevant. Also, accuracy levels across condition pairs tended to be similar, and therefore the number of shown condition pairs was limited.

and F8. "−20 km/h vs. +20 km/h" indicate set driving speeds relative to the

The graphs in **Figure 2** reveal several general trends. Firstly, accuracy tends to increase as frequency increases. This can be seen across electrode sets and condition pairs, with accuracies reaching levels of 95% on average over all participants when a relatively high number of electrodes is included (21 and 14). This increase is most pronounced in the frequencies from 5 to 20 Hz, after which accuracy continues to rise more gradually, indicating a ceiling effect (see for example **Figure 2A**). This ceiling is about 5–10% lower for the middle column of subplots in **Figure 2** (displaying the 4 Hz search strategy). Secondly, a broader frequency band tends to yield higher accuracies, which is most apparent when comparing the middle column (**Figures 2B,E**; 4 Hz frequency bands) to the right column (**Figures 2C,F**; 22 Hz frequency bands). For example, when including all electrodes, the 4 Hz frequency bands in the 8–32 Hz range in **Figure 2B** produced about 15% lower accuracy than the first broad band (8–30 Hz) in **Figure 2C**. Thirdly, there are distinct differences in accuracies as a result of using different channel sets. For example, the larger electrode sets (21 and 14 electrodes) yielded very comparable high accuracies, while the smallest (4 electrodes) consistently resulted in lower classification accuracies (about 15–25% less, depending on frequency band). Such differences can be understood in part from the fact that more channels provide a richer, higher-dimensional data set from which the CSP technique can extract discriminatory information. Note, however, that the seven frontal electrodes outperformed the seven posterior electrodes by about 5–15%, again depending on frequency band. The shape of the frontal curve in all subfigures (the red lines) resembles that of the upper two curves (all electrodes and 14 peripheral electrodes), while the posterior curves resemble the bottom EI curves. Finally, when focusing on the somewhat lower EEG frequency ranges of **Figures 2A,D** (e.g., 10–21 Hz), which are more likely to reflect neuronal activity, the mean accuracy in that range over both subfigures is 80% for the two larger electrode sets. The frontal set led to a classification accuracy of 76% on average, while the posterior and the engagement index sets resulted in 62 and 55%, respectively.

row, a broad band frequency (22 Hz) search was deployed.

#### *Cumulative classification accuracies*

**FIGURE 4 | Examples of CSP analyses. (A–D)** The scalp topographies of the components illustrate how the physiological sources project to the scalp. The components are determined such that projected signals are optimally discriminative. The filters and topographies correspond to the first and last vector of the sorted W matrix and its inverse respectively (see section Design and Procedure for more details). Absolute coloring is arbitrary; however, dense red or blue areas show where the greatest differences in the projected signals' amplitudes were found between the −20 km/h and the +20 km/h set driving speed (at normal target level). These driving speeds were set relative to the participants' preferred speed as determined during the initial ride. Included electrodes: FP1, FP2, Afz, F7, F5, Fz, F4, F8, T7, C5, C3, Cz, C4, T8, P7, P3, Pz, P4, P8, O1, and O2. **(A)** Subject = 13, frequency band = 72–108 Hz, classification accuracy = 100%. **(B)** Subject = 27, frequency band = 8–12 Hz, classification accuracy = 98%. **(C)** Subject = 21, frequency band = 24–28 Hz, classification accuracy = 82%. **(D)** Subject = 25, frequency band = 8–30 Hz, classification accuracy = 74%. Please note that in the CSP literature, a scalp topography of a component is usually referred to as a spatial pattern.

In **Figure 3**, cumulative classification accuracies are shown for a selection of classification results. This figure indicates the consistency of classification accuracies across all 26 included participants. For instance, **Figure 3A** shows that in the high frequency range (e.g., 43–64.5 Hz), test data from 10 participants were accurately classified 99% of the time or better. **Figure 3** also confirms that higher frequencies usually yield better accuracies, as the top frequencies in all subfigures display more red/yellow than the bottom frequencies. The green/yellow colors indicate that about half to two thirds of the participants were above the classification threshold indicated on the x-axes. When viewing these colors in **Figure 3** with squinted eyes, it can be seen that, especially for the larger electrode set (**Figures 3A,E,I** and **3C,G,K**), data from a substantial number of participants still yielded 85%+ accuracy in the lower (alpha and beta) frequency ranges (e.g., 10–20/30 Hz). For instance, the classifier could accurately classify (85% or better) between −20 and +20 km/h in the 16–20 Hz frequency range for 16 out of 26 participants (**Figure 3G**). For the smaller, frontal electrode set (**Figures 3B,F,J** and **3D,H,L**), the number of participants yielding highly accurate classifications is somewhat lower in the lower frequency range, as indicated by the larger presence of blue colors.
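To make the reading of the cumulative accuracies in **Figure 3** concrete, the sketch below computes, for one frequency band, the fraction of participants whose accuracy reaches each threshold on the x-axis. It is a minimal illustration only: the accuracy values and thresholds are hypothetical and are not data from this study, and the function name is ours.

```python
def fraction_at_or_above(accuracies, threshold):
    """Fraction of participants whose accuracy is at least `threshold`."""
    return sum(acc >= threshold for acc in accuracies) / len(accuracies)


# Hypothetical per-participant accuracies for one frequency band (not study data).
accuracies = [0.99, 0.95, 0.91, 0.88, 0.86, 0.80, 0.74, 0.62]
thresholds = [0.60, 0.70, 0.85, 0.95]

# One row of a cumulative plot: one value per accuracy threshold on the x-axis.
cumulative_row = [fraction_at_or_above(accuracies, t) for t in thresholds]
print(cumulative_row)
```

A row that stays close to 1 up to high thresholds corresponds to a frequency band in which nearly all participants were classified accurately.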

#### *Example common spatial pattern analysis*

**Figure 4** displays several CSP filter-topography pairs which are meant to illustrate the diversity of CSP scalp topographies. A common topography across participants, reflecting how the neurological sources project to the scalp, was not identified. Instead, we selected these topographies based on their resulting classification accuracies and/or the fact that the frequencies are within the normal EEG range. To start with, **Figure 4A** shows that for participant 13, the perfect classification accuracy in the broad 72–108 Hz frequency range originates mainly from the frontal electrodes (Fp1 and Fp2), which were highly specific for the −20 km/h driving condition, and from C5, which was highly specific for the +20 km/h driving condition. This is illustrative of the general finding that the frontal electrodes were often the main contributors to very high classification accuracies. The other subfigures show topographies linked to frequencies below 30 Hz. In **Figure 4B**, topographies are shown that resulted in an unusually accurate classification for this relatively low frequency band (98% in the 8–12 Hz, alpha, frequency band). In this case, the topographies are more distributed over the scalp, although the left temporal and frontal regions were important physiological sources for discriminating between the two data classes. The scalp topography for the +20 km/h condition in **Figure 4C** shows a central-parietal distribution, illustrating that the EI electrodes P3, Pz, and P4 contributed to the 82% classification accuracy in the high beta range of participant 21. The −20 km/h topography suggests that C5 was by far the most distinctive electrode when maximizing the variance of the projected signals in this data class while minimizing it for the other. Finally, **Figure 4D** shows the CSP resulting in 74% classification accuracy for subject 25 in the broad 8–30 Hz (alpha plus beta) frequency band. These topographies suggest that discriminative power was distributed over the posterior electrodes in the −20 km/h condition and more evenly distributed over the scalp in the +20 km/h condition.
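For readers less familiar with CSP, the filter-topography pairs can be read against the textbook two-class formulation summarized below. The notation is ours: Σ<sub>1</sub> and Σ<sub>2</sub> denote the channel covariance matrices of the band-pass filtered EEG in the two conditions, W the filter matrix, and A the matrix of spatial patterns. The exact normalization and sorting used in this study are those of the Design and Procedure section, so this should be taken as a generic sketch rather than a description of the authors' implementation.

```latex
% Two-class CSP as a generalized eigenvalue problem (standard textbook form).
% \Sigma_1, \Sigma_2: channel covariance matrices of the two conditions.
\begin{align}
  \Sigma_1 \mathbf{w}_j &= \lambda_j \,(\Sigma_1 + \Sigma_2)\, \mathbf{w}_j,
    \qquad 0 \le \lambda_j \le 1, \\
  W &= [\mathbf{w}_1, \dots, \mathbf{w}_C]
    \quad \text{(columns sorted by } \lambda_j\text{)}, \\
  A &= \left(W^{-1}\right)^{\top}
    \quad \text{(column } j \text{ is the scalp topography of component } j\text{)}.
\end{align}
```

Under this formulation, a filter with λ close to 1 yields a projected signal with high variance in the first condition and low variance in the second, and the reverse holds for λ close to 0. This is why the first and last vectors of the sorted W matrix are the most contrasting components.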

# **DISCUSSION**

The aim of the study was to investigate the feasibility of using EEG for monitoring the level of visuomotor workload in a driving environment, which can potentially be used by a user-adaptive driver support system. To manipulate workload, we exposed drivers to five levels of driving speed that were set relative to their preferred driving speed. In addition, since increasing steering effort normally decreases swerving behavior within the driving lane given a particular speed, participants were presented with three explicit swerving performance targets represented as the standard deviation of the lateral position of the car with respect to the driving lane. To distinguish between workload levels, subject-specific classification models based on CSP and linear discriminant analysis were used.

To begin with, subjective mental effort data show that driving at a higher speed is indeed experienced as requiring more effort. Furthermore, estimated driving speed was slightly lower than the real driving speed. Previous research has shown that driving speed in a simulator, when driving on straight roads or easy curves, tends to be higher than it would be on real roads (e.g., Bella, 2008). This effect could be caused by a difference in speed perception between the real world and (fixed-base) driving simulators due to the absence of several speed cues, such as car movements and stereoscopic depth perception. During the current study these factors may also have contributed to misjudging driving speed, especially since the speedometer was hidden from view at all times. The standard deviation of the lateral position (SDLP), indicating lane keeping performance, increased as a function of speed, which is normal (e.g., Peng et al., 2013). However, the performance target (target SDLP) did not have an effect on vehicle parameters, suggesting that this manipulation failed, since a decrease of SDLP was expected if participants were exposed to more difficult target SDLPs. Nevertheless, participants did rate the "very hard" target SDLP condition as the most difficult, perhaps demonstrating that participants were trying hard but could not manage. Also, EEG data from the very hard SDLP condition could be accurately discriminated from data acquired during the normal SDLP condition, which is another indication that participants did not simply ignore the instructions. Since other task manipulations aimed at increasing steering difficulty, such as decreasing lane width, have proven to affect SDLP (e.g., Dijksterhuis et al., 2011), the absence of an effect on SDLP may be explained by this particular manipulation. In contrast to the automatic nature of the steering task during normal driving, participants had to actively engage themselves in transferring numerical information about their lane keeping behavior, as presented on their windshield, to steering wheel movements.

EEG activities during the experimental conditions were classified, yielding several interesting results. Firstly, applying CSP to a variety of frequencies and frequency band widths revealed that, overall, broader bands and higher frequencies result in higher classification accuracies. This could be taken to suggest that neuronal gamma synchronization correlated with the task manipulations, in which case these results are in line with other research suggesting that activity in the gamma frequencies reflects sensory-motor coordination (Schoffelen et al., 2005; see also Fries et al., 2007). However, this conclusion should be drawn with caution since muscle activity as represented in the EMG has power in the same frequencies, which is picked up by EEG electrodes as well (Whitham et al., 2007; Muthukumaraswamy and Singh, 2013). This view of muscular activities contributing to high classification accuracies in the gamma band is supported by graphs showing the projections of the CSP components. **Figure 4A** demonstrates just one case where the perfect classification for high frequencies can mostly be traced to the EEG electrodes close to the eyes. However, a relatively high contribution of the peripheral electrodes to extremely high classification accuracies is an emerging pattern. Moreover, when performing a semi-real task, such as driving in a simulator, EMG activity can be expected to be more dominantly present compared to more controlled laboratory tasks, making classifications based on neuronal gamma activities less likely.

High accuracies were also found for a substantial number of participants in the lower frequency ranges, as shown in **Figures 2** and **3**, and these are unlikely to be confounded by EMG activity. As shown by Whitham et al. (2007), who recorded EEG during paralysis by neuromuscular blockade, EMG activity is largely absent from frequencies below 20 Hz. Therefore, we suggest that classifications in the lower frequency ranges were likely determined by underlying neuronal activity.

CSP component topographies showed no readily discernible degree of consistency across participants, as illustrated in **Figure 4**. This indicates that the effects of changes in a psychological construct such as mental workload on electrical activity on the scalp are very subject dependent, which confirms that individually tuned classification approaches are required for accurate classifications. In the case of high frequencies, this implies that the (perhaps subconsciously produced) muscular activities show large inter-individual variations. In the case of the lower frequencies, it is likely that there are also large variations on a neurological level. Finding consistent topographies would have been promising for future applications. For example, it could lead to a theory-driven pre-selection of scalp locations, thereby excluding possibly irrelevant information from the classification model. Yet, a large inter-individual variability may be expected when classifying rather abstract mental states compared to, for example, classifying the difference between left and right hand motor imagery, for which the neuroanatomical base is much clearer.

A limitation of the current study is that the experimental conditions (rides) could not be randomized within each participant; for example, changing speed conditions every couple of seconds would have resulted in a highly unnatural driving experience. The drawback of the approach used is that there was an average of about 15 min between one condition and the other within each condition pair that was used for the classifications. Since neighboring epochs can be similar to each other, a difference in time may have led to an inflation of the classification accuracies. For future research, it is advised to repeat conditions within subjects to assess the potential effects of time dependencies, for example, by training the classifier on one condition pair and validating it on the other, identical condition pair. While it is important to realize that time dependencies cannot be ruled out, it should also be noted that they probably did not affect other effects, such as the accuracy difference between the posterior and frontal electrode sets or the difference between low and high EEG frequencies.

Overall, these findings imply that the subject-specific CSP approach provides very good discriminatory power between visuomotor workload conditions over a large range of frequency bands. With respect to the high (gamma) frequency ranges, it is important to realize that major contributions from muscular activities cannot be ruled out. Moreover, this will probably be true for most passive BCI applications, as real-life tasks, such as driving a car, usually require a lot of motor activity. A workload classification strategy based on EMG activity, which requires a relatively low number of electrodes, would therefore be worth investigating in future research. However, high classification accuracies were also found for the lower EEG frequencies, implying a large contribution of neurological activities. These high accuracies are promising for future applications; however, several issues need to be addressed before a system works from the user's point of view. Some of these issues will be further discussed below.

Even if classification accuracies of up to 80% may be considered quite high for 1-s epochs, this raises the issue of applicability; especially when performing a safety-critical task, this seems insufficient. However, depending on the temporal responsiveness requirements of an application, these accuracy levels might suffice. For example, using longer data epochs can be expected to result in more accurate classifications, since more information is available to the classifier (e.g., Brouwer et al., 2012). Although not further reported in the results section, increasing the epoch length from 1 to 2 s was found to increase accuracies by about 3% for the lower frequency ranges. Another option would be to combine several successive small data epochs. As an illustration, assuming that successive classifications are independent and applying a simple binomial chance distribution, combining five successive epochs, each having an 80% chance of being classified accurately, would lead to a 94% accuracy when using a majority vote (i.e., three or more epochs are classified correctly). This would decrease the negative effects of small periods of noisy data, which may be expected in real-life tasks, and should improve a system's behavior from the user's point of view.
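The majority-vote figure can be reproduced with a few lines of Python. The sketch below is illustrative only and the function name is ours; it sums the binomial probabilities of obtaining a strict majority of correct epochs under the independence assumption stated above.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_epochs: int) -> float:
    """Probability that more than half of n independent epochs are classified correctly."""
    needed = n_epochs // 2 + 1  # smallest strict majority
    return sum(
        comb(n_epochs, k) * p_single**k * (1 - p_single) ** (n_epochs - k)
        for k in range(needed, n_epochs + 1)
    )


print(round(majority_vote_accuracy(0.80, 5), 3))  # 0.942
```

Because neighboring EEG epochs tend to resemble each other, successive classifications are unlikely to be fully independent in practice, so this value is best read as an optimistic upper estimate.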

Another important issue that needs to be solved before reliable applications can be built is that of the so-called non-stationarities in EEG signals, which refer to shifts in EEG signals between the initial calibration session during which a model is trained and online application. Non-stationarities negatively impact the transfer of classification accuracies between calibration and application of a model (e.g., Shenoy et al., 2006). One solution to this issue could be to update the classification model from time to time by adding additional calibration periods when the task at hand allows for it. Another solution is to use adaptive classifiers, which use data that are acquired while the user is interacting with the system in real-time (Shenoy et al., 2006). The drawback of using an adaptive classifier is that it requires immediate labeling of new, incoming data while the user is engaged in task performance. In some active BCI systems, for example, when controlling a game, it is plausible that the required information is available. In the case of a passive BCI system, however, this is most likely not the case. Again, using longer periods of time may offer a solution to this problem. For example, assuming that mental workload does not vary every second, all EEG data measured over a somewhat longer period reflect one particular level of workload. If the classifier therefore classifies most epochs as data class A, then all epochs in that period could be labeled as such and subsequently used to update the classification model. Finding acceptable and robust methods of updating the classification model is likely to be a necessary development before (passive) BCI systems can be applied to task situations.
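The labeling idea described above could look roughly as follows. This is a sketch under our own assumptions rather than a tested procedure: the function name, the window of ten epochs, and the `min_agreement` threshold are hypothetical and would need to be chosen empirically.

```python
from collections import Counter


def pseudo_label_window(predicted_labels, min_agreement=0.6):
    """Return the majority label for a window of epochs, or None if agreement is too low."""
    label, count = Counter(predicted_labels).most_common(1)[0]
    if count / len(predicted_labels) < min_agreement:
        return None  # too ambiguous to use for updating the model
    return label


# Hypothetical classifier output for ten successive 1-s epochs.
window = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "A"]
label = pseudo_label_window(window)

# All epochs in the window would be relabeled and added to the calibration set.
relabeled = [label] * len(window) if label is not None else []
print(label, len(relabeled))
```

Rejecting windows with low agreement is one way to limit the risk that misclassified periods are fed back into the model and reinforce its own errors.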

For the viability of future applications it is also important that the binary approach of discriminating between two data classes is expanded to the multiclass situation. For instance, workload levels during task performance may be either too high, too low, or within an acceptable range. In an adaptive system, where support may be changed, activated, or deactivated based on workload classifications, it is therefore of equal importance that the conditions for no change are defined. Thus, in terms of a passive BCI application, a homeostatic system aimed at keeping workload at or around optimal levels must also "know" when not to initiate changes. One way to accomplish multi-class analyses is to combine several pairwise classifications through voting procedures (Friedman, 1996; see also Dornhege et al., 2004; Grosse-Wentrup and Buss, 2008).
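A pairwise voting scheme of this kind can be sketched as follows. The three class names, the one-dimensional "workload score", and the threshold rules are hypothetical stand-ins for trained binary classifiers (for instance, one CSP and LDA model per class pair); they only serve to show how the votes are combined.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(sample, classes, binary_classifiers):
    """Combine one-vs-one classifiers by majority vote over all class pairs."""
    votes = Counter()
    for pair in combinations(classes, 2):
        winner = binary_classifiers[pair](sample)  # returns one of the two labels in `pair`
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained binary models: thresholds on a 1-D workload score.
classes = ("low", "ok", "high")
binary_classifiers = {
    ("low", "ok"): lambda x: "low" if x < 0.3 else "ok",
    ("low", "high"): lambda x: "low" if x < 0.5 else "high",
    ("ok", "high"): lambda x: "ok" if x < 0.7 else "high",
}

predictions = [pairwise_vote(x, classes, binary_classifiers) for x in (0.1, 0.5, 0.9)]
print(predictions)
```

With three classes, each class can receive exactly one vote, so a real system would need an explicit tie-breaking rule; treating ties as "no change" would fit the homeostatic idea described above.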

In conclusion, depending on temporal responsiveness requirements, a system's designer may have the option to either focus on high EEG frequencies and accept that muscular activities likely contribute to classification accuracies, or to focus on lower EEG frequencies that mainly reflect neurological activities but accept slightly lower accuracies. Although it is clear that the very high classification accuracies found in this offline study by themselves do not guarantee a well-functioning online system, it is a promising start in realizing a CSP based passive BCI system that can reliably be used to monitor visuomotor load in real-time.

# **ACKNOWLEDGMENTS**

Firstly, the authors would like to thank graduate students Lucas van Assen and Camilla de Jong for their contributions in recruiting participants, carrying out the experiment, and data processing. Secondly, the technical support given by Mark Span is greatly appreciated. Thirdly, the reviewers are thanked for their helpful suggestions during the publication process. Finally, this research was supported by the European Commission Project No. 215893 REFLECT, Responsive Flexible Collaborating Ambient, within the 7th framework programme. The REFLECT project investigated ways of realizing pervasive-adaptive environments.

### **REFERENCES**

and ERPs in the n-back task. *J. Neural Eng*. 9, 045008. doi: 10.1088/1741-2560/9/4/045008


and G. G. Berntson (Cambridge: Cambridge University Press), 3–26.


situations. *IEEE Trans. Syst. Man Cybern*. 9, 769–778. doi: 10.1109/TSMC.1979.4310128


workload: a cognitive-energetical framework. *Biol. Psychol*. 45, 73–93. doi: 10.1016/S0301-0511(96)05223-4


*Traffic Accidents*. Amsterdam: North-Holland.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 16 May 2013; accepted: 01 August 2013; published online: 21 August 2013.*

*Citation: Dijksterhuis C, de Waard D, Brookhuis KA, Mulder BLJM and de Jong R (2013) Classifying visuomotor workload in a driving simulator using subject specific spatial brain patterns. Front. Neurosci. 7:149. doi: 10.3389/fnins.2013.00149*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2013 Dijksterhuis, de Waard, Brookhuis, Mulder and de Jong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Combining and comparing EEG, peripheral physiology and eye-related measures for the assessment of mental workload

# *Maarten A. Hogervorst\*, Anne-Marie Brouwer and Jan B. F. van Erp*

*TNO Human Factors, Netherlands Organisation for Applied Scientific Research, Soesterberg, Netherlands*

#### *Edited by:*

*Cuntai Guan, Institute for Infocomm Research, Singapore*

#### *Reviewed by:*

*Reinhold Scherer, Graz University of Technology, Austria Minna Huotilainen, Finnish Institute of Occupational Health, Finland*

#### *\*Correspondence:*

*Maarten A. Hogervorst, TNO Human Factors, Netherlands Organisation for Applied Scientific Research, PO Box 23, 3769 ZG Soesterberg, Netherlands e-mail: maarten.hogervorst@tno.nl*

While studies exist that compare different physiological variables with respect to their association with mental workload, it is still largely unclear which variables supply the best information about momentary workload of an individual and what is the benefit of combining them. We investigated workload using the n-back task, controlling for body movements and visual input. We recorded EEG, skin conductance, respiration, ECG, pupil size and eye blinks of 14 subjects. Various variables were extracted from these recordings and used as features in individually tuned classification models. Online classification was simulated by using the first part of the data as training set and the last part of the data for testing the models. The results indicate that EEG performs best, followed by eye related measures and peripheral physiology. Combining variables from different sensors did not significantly improve workload assessment over the best performing sensor alone. Best classification accuracy, a little over 90%, was reached for distinguishing between high and low workload on the basis of 2 min segments of EEG and eye related variables. A similar and not significantly different performance of 86% was reached using only EEG from single electrode location Pz.

**Keywords: EEG, physiology, eye, workload, classification, combination, ECG, skin conductance**

# **INTRODUCTION**

In the literature, mental workload has been associated with a range of physiological variables. These include heart rate (e.g., studies as reviewed by Vogt et al., 2006), different types of heart rate variability (reviewed by Hancock et al., 1985; Aasman et al., 1987), pupil size (reviewed by Beatty, 1982; May et al., 1990; Porter et al., 2007; Hampson et al., 2010), eye blink frequency and duration (Wilson and Fisher, 1991; Brookings et al., 1996; Veltman and Gaillard, 1996, 1998), electrodermal measures (Kohlisch and Schaefer, 1996; Reimer and Mehler, 2011), respiration frequency (Wientjes, 1992; Mehler et al., 2009; Karavidas et al., 2010) and various variables derived from EEG (most prominently power in the alpha and theta bands; reviewed by Brouwer et al., 2012).

A question that arises when one aims to put this knowledge into practical use is which variable(s) one should measure in order to get the best workload assessment for a specific individual. It is not easy to answer this question based on the current literature because of several complications. Firstly, only a limited set of variables is recorded and analyzed in each study, precluding easy comparison of performance across variables. Secondly, variables are often analyzed and reported at a group level rather than used to assess workload in an individual. Associations between physiological variables and workload as found using a group level analysis may not generalize to the case of assessing workload in an individual since they may not be sufficiently strong to reliably assess workload at a certain moment in time for a single individual. On the other hand, physiological responses to workload may be consistent within and not between individuals, which would result in variables that are seemingly non-responsive to workload at a group level while they are actually valuable for assessing workload on an individual basis. Finally, many workload studies suffer from experimental flaws in which workload levels are confounded with for instance body movements (potentially affecting heart rate and related variables) or visual information processing (potentially affecting eye- and EEG based variables). We here aim to provide an overview of the workload assessment performance of a rather broad range of variables within the context of an experiment in which visual input and the amount of body movements are constant across workload levels. Classification analyses are used to get an impression of the quality of workload estimation within an individual. 
While analyses are performed offline, we simulate an online<sup>1</sup> situation, where our classification models are trained on data acquired at the start of the experiment and tested on data acquired at the end of the experiment, thereby avoiding inflation of classification accuracy due to time dependencies. The same data have been analyzed on a group level in Brouwer et al. (2014). That study gives an overview of the general magnitude and direction of effects of the different conditions on the studied variables.
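The simulated online evaluation amounts to a chronological split of the data. The sketch below illustrates the principle only: the segment structure, the 2 min spacing, and the 50/50 split are hypothetical, since the proportion of data used for training is not specified at this point in the text.

```python
def chronological_split(segments, train_fraction=0.5):
    """Split time-ordered segments into an early training part and a late test part."""
    ordered = sorted(segments, key=lambda seg: seg["start_s"])
    n_train = int(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


# Hypothetical 2-min segments, labeled with the workload level of the block they came from.
segments = [
    {"start_s": 120 * i, "label": "high" if i % 2 else "low"} for i in range(8)
]
train, test = chronological_split(segments)

# Every training segment precedes every test segment, so the model never sees
# data recorded after (or interleaved with) the data it is tested on.
print(len(train), len(test), train[-1]["start_s"] < test[0]["start_s"])
```

In contrast to randomly shuffled cross-validation, such a split does not let temporally adjacent, and therefore similar, segments end up on both sides of the train/test boundary.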

<sup>1</sup>Note that here "online" and "real time" refer to using information as collected over the last half- or several minutes. Especially for certain non-EEG measures, it is not possible to retrieve reliable information from very short intervals.

Besides examining how well different variables can be used to estimate workload on their own, we examine to what extent combination of different variables improves performance. As discussed later, while some studies seem to suggest that assessment of mental state improves when combining physiological variables, reported improvements often are modest and not statistically significant or not statistically tested. We examine different ways of combining variables. Below we review the literature and formulate hypotheses as to what we expect to find.

#### **SENSITIVITY OF SINGLE VARIABLES TO WORKLOAD**

Studies on (neuro)physiological correlates of workload (or mental load) go back to at least the early sixties (Kalsbeek and Ettema, 1963). A range of variables has been examined over the years such as heart rate, different types of heart rate variability, pupil size, eye blink frequency and duration, saccade and fixation related measures, electrodermal measures, respiration, blood pressure, chemical measures, EMG and neurophysiological variables derived from EEG. To our knowledge, a substantial, recent review of physiological responses to workload is lacking. There does not seem to be an obvious "winning" variable that can effectively be used to determine workload. One review study (Hancock et al., 1985) suggested heart rate variability as the most reliable measure, whereas another (Vogt et al., 2006) reviewed 19 studies in which heart rate variability was not even recorded. In these studies, heart rate seemed to be relatively reliable. Most studies that examined EEG spectral variables next to physiological variables such as different eye and heart related measures concluded or suggested EEG to be the most sensitive or promising indicator of workload (Brookings et al., 1996; Taylor et al., 2010; Christensen et al., 2012). The study by Christensen et al. (2012) showed that classification accuracy using only EEG data was only marginally lower compared to adding information about heart rate, blink rate, blink amplitude, blink duration and EOG. Berka et al. (2007) argue in their introduction that EEG is the only physiological signal that has been shown to accurately reflect subtle shifts in workload. However, in the three studies favoring EEG just mentioned, as well as in many other workload studies, workload was manipulated in the context of simulated realistic tasks involving potential confounds such as speech, body movement and visual information. 
In a recent study (Brouwer et al., 2014) we examined effects of workload and time using a task that controls for these kinds of confounds. Repeated measures ANOVA analyses did not mark EEG as the source of information that "best" indicated workload. Highly significant effects of workload were found for EEG in the alpha frequency band but also for mean and minimum skin conductance level, respiration frequency, heart rate, high frequency heart rate variability and pupil size. No significant effects were found for EEG in the theta frequency band, mid frequency heart rate variability, number of blinks and blink duration. Still, this study does not indicate which variables would be most useful for assessing workload based on a limited amount of physiological data of a single individual. This is especially the case since Brouwer et al. (2014) highlighted (strong) effects of time on most of the measured variables which could potentially complicate their use.

#### **COMBINING VARIABLES - PROCESSES UNDERLYING THE ASSOCIATION BETWEEN WORKLOAD AND PHYSIOLOGICAL VARIABLES**

Because we are interested in combining physiological variables in order to arrive at a better assessment of workload, it is especially important to examine the background of the association between the variables and workload. A combination of variables reflecting workload is expected to improve workload assessment mainly if these variables are associated with different aspects of workload rather than all with the same aspect. As described below, high workload likely goes hand in hand with increased cognitive processing, increased (emotional) arousal and increased energy demand; these aspects of workload are presumably reflected by different physiological variables that have all been associated with workload before.

#### **COGNITIVE PROCESSING—EEG**

EEG alpha activity (power in the 8–12 Hz band) has been linked to idling (Pfurtscheller et al., 1996), default mode brain activity (Laufs et al., 2003; Jann et al., 2009) and cortical inhibition (Foxe et al., 1998; van Dijk et al., 2008; Brouwer et al., 2009). This suggests that this measure would reflect different levels of workload, with high alpha for low levels of workload which indeed was reported in several workload studies (e.g., Fink et al., 2005; Brouwer et al., 2012). Another EEG frequency band that has been related to workload associated processes is theta (4–8 Hz). Evidence for an association between theta and working memory processes or mental effort has been summarized in several reviews by Klimesch (1996, 1997, 1999). Theta increases as task requirements increase (e.g., Miyata et al., 1990; Raghavachari et al., 2001; Jensen and Tesche, 2002; Esposito et al., 2009). A number of studies on workload reported both alpha and theta effects (e.g., Gundel and Wilson, 1992; Brookings et al., 1996; Gevins et al., 1998; Fournier et al., 1999).

Not only EEG spectral variables but also event-related potentials (ERPs) have been found to reflect different levels of workload. The P300 component of the ERP is a peak occurring 300 ms or somewhat later after an attended stimulus has been presented. It is thought to reflect attentional and working memory processes (Polich and Kok, 1995; Polich, 2007) and it is in particular this component that has been reported to decrease with increasing levels of memory or workload (Watter et al., 2001; Kida et al., 2004; Raabe et al., 2005; Allison and Polich, 2008; Evans et al., 2011; Pratt et al., 2011). Besides the P300, earlier ERP components like the N100 (Kramer et al., 1995; Ullsperger et al., 2001; Allison and Polich, 2008), the N200 (Kramer et al., 1995), the P1 (Pratt et al., 2011) and a positive-negative component between 140 and 280 ms (Missonnier et al., 2003, 2004) have been found to respond to task difficulty or workload. Finally, late positive or negative slow waves have been related to high memory load (Ruchkin et al., 1990) and amount of resource allocation (Rösler et al., 1997).

#### **AROUSAL AND ENERGY DEMAND—PERIPHERAL PHYSIOLOGY**

High mental workload is associated with high mental effort (Hockey, 1986; Gaillard and Wientjes, 1994). Mental workload or mental effort is associated with a decrease of the parasympathetic ("rest or digest") autonomic nervous system activity and an increase in sympathetic ("fight or flight") activity (Mulder and Mulder, 1987; Gawron et al., 1989). These changes in autonomic nervous system activity can be estimated through several peripheral physiological measures such as skin conductance (Roth, 1983), heart rate and heart rate variability (Berntson et al., 1997).

Electrical skin conductance varies with the moisture level of the skin. Since the sweat glands are controlled by the sympathetic part of the autonomic nervous system (Roth, 1983), electrodermal measures indicate the level of sympathetic activity or arousal. A large body of literature describes the positive effect of arousal on skin conductance (e.g., Winton et al., 1984; Greenwald et al., 1989; Boucsein, 1992, 1999; Brouwer et al., 2013). While increases in skin conductance may be viewed as reflecting sympathetic activity as a consequence of arousal due to mental effort, Reimer and Mehler (2011) and Kohlisch and Schaefer (1996) interpret their findings of heightened skin conductance with increased workload as reflecting emotional arousal.

Heart rate and its variability are affected by activation and suppression of both the sympathetic and parasympathetic nervous systems (Berntson et al., 1997). At normal breathing frequencies, fast changes in heart rate (0.15–0.50 Hz) reflect the adjustment of heart rate to breathing: breathing causes changes in blood pressure and by adapting heart rate, blood pressure is kept around a certain point (Mulder, 1980; Aasman et al., 1987). Also, the adaptation to breathing facilitates gas exchange between the lungs and the blood (Grossman and Taylor, 2007). High frequency heart rate variability reflects only the (fast) parasympathetic nervous system (Berntson et al., 1997). Mental effort has been reported to have the largest effect upon the mid-band (0.07–0.14 Hz; Mulder, 1980; Aasman et al., 1987). This band reflects not only parasympathetic but also sympathetic activity (Berntson et al., 1997; Veltman and Gaillard, 1998). For both bands, suppression of parasympathetic activity (associated with high workload) results in lower adaptation to changes in blood pressure and hence less heart rate variability.

Because mental workload is associated with increased arousal and neural activity, it increases metabolic demand, which is probably the cause of the observed increases in heart rate and respiration frequency with workload (Veltman and Gaillard, 1998).

#### **EYE-RELATED MEASURES**

Pupil dilation is not only caused by decreasing luminance but also by increasing workload (Beatty, 1982; May et al., 1990; Porter et al., 2007; Hampson et al., 2010). Consistent with this, the frontal cortex is involved in controlling pupil dilation (Hampson et al., 2010). The underlying function is unclear, but the fact that the effect has been observed in studies that varied task difficulty without varying the visual environment (Kahneman and Beatty, 1966; Kahneman et al., 1969) indicates that it does not primarily serve purposes related to visual perception. Reduction of blink frequency and duration with workload could be attributed to maximizing detection of visual information (Bauer et al., 1987; Fogarty and Stern, 1989). In this sense, the sensitivity of these parameters can often be explained by high workload being confounded by the presence of much visual information.

#### **THREE SENSOR GROUPS**

In sum, we can loosely divide the physiological variables found to be associated with workload into three groups, which we call "sensor groups," that are assumed to reflect different aspects of workload. EEG measures are expected to mainly reflect cognitive processes. Peripheral physiological measures reflect arousal and energy demand. The third group, eye related measures, has probably partly been found to covary with workload due to the often occurring confound of the amount of visual information, but for pupil dilation, the reason for its association with workload is unclear. Considering the idea that they reflect different aspects of workload, combination of these groups is expected to lead to better classification accuracy than either group alone, especially for the combination of EEG and peripheral physiology.

#### **COMBINING VARIABLES—FUSION TECHNIQUES**

In previous workload studies, EEG has been combined with other physiological signals for assessing workload. Coffey et al. (2012) found that classification of workload based on EEG was more accurate than when based on fNIRS (functional Near Infrared Spectroscopy), and that combining the two did not increase classification performance. Wilson and Russell (2003, 2007) combine respiration (Wilson and Russell, 2003), EEG, EOG and heart rate in their classification models to assess workload in simulated aviation-related tasks. However, they do not report on the relative contribution of these different signals to classification performance. Christensen et al. (2012) assessed workload in simulated remote piloting. Their classification models were based on EEG, EOG, heart rate, blink rate, blink amplitude, and blink duration. They did not extensively report on the relative contribution of these variables, but mention that when classification was performed on the basis of EEG only, classification accuracy hardly decreased (about 2%). Chanel et al. (2006) studied the relative contribution of EEG and peripheral physiological signals (skin conductance, heart rate, blood pressure, respiration and temperature) on classifying mental states as elicited by emotional pictures. They also did not find a strong advantage of fusion of EEG and physiology over EEG alone.

In all of these studies, combination of variables from different domains was achieved by simple concatenation of the input feature vectors. However, when combining EEG data with physiology, the large difference in length of the feature vectors forms a potential problem. While EEG spectral features are captured by power values in different frequency bands at different electrodes, amounting to a large number of features, physiological or eye related features such as pupil size and heart rate are typically each represented by just one (average) value. This could lead to an a priori small added value of these features. A possible solution is to use higher order combination of information by combining the assessments based on the various types of features. Such a method was used by Chanel et al. (2009), who studied classification of different emotions as elicited by emotional recall. Classification decisions were made by two different EEG based classifiers and one classifier based on physiology (skin conductance, heart rate, blood pressure and respiration) and these decisions were then combined. Adding the worst performing EEG set to the best performing one increased classification accuracy (that was generally between 70 and 80% for a two class problem) by about 2–4%, and adding physiology on top of that resulted in an additional increase of up to almost 3%. There was no direct comparison with the concatenation method (using all features as input to a single classifier), though the authors mention that this method did not lead to an increase in accuracy.

The improvements in classification accuracy as found by Chanel et al. (2009) are relatively small and are probably not statistically significant. However, the trend is positive and we think it is worthwhile to examine the case for workload where EEG and other types of signals are expected to complement each other. We will compare the discussed ways of combining information, i.e., fusion at the feature level or at the decision level. When combining decisions from different classifiers, a confidence measure of the decision is useful. Such a confidence measure is given by an elastic net model with logistic regression (Friedman et al., 2010). Therefore, besides using a linear Support Vector Machine as the more standard classification model, we also use an elastic net model. This enables us to weigh information of the different sources before averaging (a similar method was used by Chanel et al., 2009). The potential advantage of fusion at the decision level is that smaller feature vectors reflecting physiology or eye related measures do not run the risk of being "flooded" by EEG; the disadvantage is that interactions between different features or feature sets may be missed.

#### **CURRENT STUDY: OVERVIEW AND HYPOTHESES**

We study workload in an experiment in which we control for visual input and the amount of body movements by using an n-back task to vary workload. This task requires participants to indicate for each of the successively presented letters whether it is a target or not. Workload is low when the target letter is an "x" (0-back), intermediate when the target letter is the same as the one before (1-back) and high when the target letter is the same as two letters before (2-back). In this task, visual input and number of button presses are the same across workload levels. This means that effects of workload can be attributed to differences in mental processes and cannot be due to different amounts of hand or eye movements in the high workload condition compared to the low workload condition.

We determine the value of individual features and combinations for the assessment of individual workload level using individually trained classification models. We simulate an on-line situation in which a model is tuned to an individual using data from the first part of the experiment and in which the workload is predicted for the last part of the experiment. We record EEG, skin conductance, respiration, ECG, pupil size and eye blinks. Various variables are extracted from these measurements and used as features in classification models. Firstly, we examine how well classification models based on the various individual features perform. We expect EEG features to perform best given indications from earlier studies and given the fact that EEG is expected to reflect what can be considered to be the core of mental workload, namely cognitive processing. Next, we will look at combinations of features. We start by combining features originating from the same sensor (e.g., heart rate and heart rate variability that can both be determined from ECG). While these are not expected to strongly improve classification performance since they are probably largely reflecting the same underlying process, we think it is worth trying for the practical reason that these features are available without additional costs (i.e., without having to use an additional sensor). Subsequently, we combine features from different sensor groups "EEG," "Physiology," and "Eyes." Especially the combination of EEG with Physiology is expected to improve classification performance since these groups are assumed to reflect different general physiological processes associated with workload. For analyses at the sensor group level, we check whether taking time into account improves classification performance. An improvement of including information about time of measurement may be expected based on finding general effects of time on physiological variables (e.g., Fairclough et al., 2005; Brouwer et al., 2014). 
For analyses at the sensor group level, we also compare fusion at the feature level to fusion at the decision level. We use both SVM and elastic net classification models.

# **MATERIALS AND METHODS**

### **PARTICIPANTS**

Data of 14 participants are analyzed in this study. Participants were aged between 23 and 40 years (mean age 27.9), 8 female and 6 male. The experiment was performed in accordance with the local ethics guidelines and participants gave written informed consent<sup>2</sup>.

### **MATERIALS**

Stimuli (letters), subjective workload scales and announcements about the type of the n-back task to follow were presented on a Tobii T60 Eye Tracker monitor, at a distance of about 50 cm from the participants' eyes. Feedback about task performance was presented through Labtec LCS-1050 speakers in the form of beeps. Participants used a keyboard to indicate whether presented letters were targets or non-targets. Which of the keys (1 or 2 on the numerical pad) indicated "target" and which "non-target" was counterbalanced between participants. Participants used the mouse to rate subjective workload on a scale (RSME) between the stimulus blocks.

EEG (electroencephalogram) was recorded through a g.tec USBamp and g.tec Au electrodes placed at Fz, FCz, Pz, C3, C4, F3, and F4, referenced to linked mastoid electrodes. A ground electrode was placed at FPz. Impedance was kept below 5 kΩ. EEG data were filtered by a 0.1 Hz high-pass and a 100 Hz low-pass filter and sampled with a frequency of 256 Hz (USB Biosignal Amplifier, g.tec medical engineering GmbH).

ECG (electrocardiogram) and skin conductance were recorded using a MindWare BioNex 8-slot chassis with a 3 channel Bio-Potential and GSR amplifier. A 4-channel transducer amplifier was used to measure respiration. For ECG measurement, self-adhesive 1 1/2-inch electrodes with 7% chloride wet gel were attached just below the right collarbone, just below the left lower rib and above the right hip. To record skin conductance, two self-adhesive 1 5/8-inch electrodes with 1% chloride wet gel were attached to the palm of the left hand, which was not used for pressing the keys: one below the thumb and one below the little finger. Respiration was recorded using an elastic band around the waist at the height of the lower side of the sternum. MindWare's BioLab software was used to acquire ECG, skin conductance and respiration. These signals were sampled with a frequency of 300 Hz. They were acquired with a gain setting of 1000, 10, and 500 and filtered with 0.5, 1, and 5 Hz high-pass filters, respectively.

<sup>2</sup>A total of 35 participants took part in the original experiment (see also Brouwer et al., 2012, 2014). However, we here only considered participants with complete data sets. We also performed similar analyses for all participants for which a subset of data was available. The results from such partial analyses showed the same patterns as presented here.

Pupil size, blink rate and blink duration were measured using a Tobii T60 Eye Tracker that was integrated into a 17-inch monitor. Recording frequency was 60 Hz. All signals were synchronized using the TCAP signal from The Observer XT (Zimmerman et al., 2009).

We used the RSME scale (Rating Scale Mental Effort, Zijlstra, 1993) to measure subjectively experienced mental effort. This scale runs from 0 to 150 with higher values reflecting higher workload. It has nine descriptors along the axis, e.g., "not effortful" at value 2 and "rather effortful" at value 58. Verwey and Veltman (1996) concluded this simple one-dimensional scale to be more sensitive than the often-used NASA-TLX (Hart and Staveland, 1988).

# **TASK**

Participants viewed letters, successively presented on a screen. For each letter, they pressed a button to indicate whether the letter was a target or a non-target. In the 0-back condition, the letter x is the target. In the 1-back condition, a letter is a target when it is the same as the one before. In the 2-back condition, a letter is a target when it is the same as two letters before. With this version of the n-back task, the level of workload is varied without varying visual input or frequency and type of motor output (button presses). A 3-back condition was not used, due to evidence that many participants find it too difficult and tend to give up (Ayaz et al., 2007; Izzetoglu et al., 2007).
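The target rule for the three conditions can be illustrated with a minimal Python sketch. The function name `nback_targets` and the example letter sequence are hypothetical and serve only to make the rule explicit; they are not taken from the experiment software.

```python
def nback_targets(letters, n):
    """Return a list of booleans marking which letters are targets.

    n = 0: the letter "x" is the target.
    n >= 1: a letter is a target when it equals the letter n positions back.
    """
    targets = []
    for i, letter in enumerate(letters):
        if n == 0:
            targets.append(letter == "x")
        else:
            # The first n letters have no letter n positions back.
            targets.append(i >= n and letter == letters[i - n])
    return targets


# Hypothetical letter sequence, for illustration only.
sequence = ["k", "k", "t", "k", "x", "x"]
print(nback_targets(sequence, 0))  # 0-back
print(nback_targets(sequence, 1))  # 1-back
print(nback_targets(sequence, 2))  # 2-back
```

The same stream of letters thus yields different targets per condition, while the visual input and the number of required button presses stay identical.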

Participants were informed after every button press whether it was a correct decision by a high (correct) or a low (incorrect) pitched tone. This was intended to help the participant, who in our experiment switched rather often between n-back conditions, and to increase the likelihood that participants would decide to invest effort since the participant knew the experiment leader would hear the sounds as well.

#### **STIMULI**

The letters used in the n-back task were black (font style: Matlab standard, approximately 3 cm high) and were presented on a light gray background. The letters were presented for 500 ms followed by a 2000-ms inter-stimulus interval during which the letter was replaced by a fixation cross. In all conditions, 33% of letters were targets. Except for the letter x in the 0-back task, letters were randomly selected from English consonants. Vowels were excluded to reduce the likelihood of participants developing chunking strategies which reduce mental effort, as suggested in Grimes et al. (2008).

#### **DESIGN**

The three conditions (0-back, 1-back, 2-back) were presented in 2-min blocks divided across four sessions. Each session consisted of two repetitions of each of the three blocks. Thus, for each of the three conditions participants performed 4 sessions × 2 repetitions = 8 blocks. In each block, 48 letters were presented, 16 of which were targets. The blocks were presented in pseudorandom order, such that each condition was presented once in the first half of the session and once in the second half of the session, and that blocks of the same condition never occurred directly after each other. Each session was preceded by a baseline block of 2 min in which the participant quietly fixated a cross on the screen. With 4 sessions × 2 repetitions × 3 conditions, plus 4 sessions × 1 baseline block, the total duration of the n-back task was 56 min.
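As an illustration of the ordering constraints and the resulting task duration, the following Python sketch draws one possible block order per session. The rejection-sampling procedure, the fixed seed and all names are assumptions made for this sketch; the randomization procedure actually used is not specified here.

```python
import random

CONDITIONS = ["0-back", "1-back", "2-back"]


def session_order(rng):
    """Draw one session: each condition once per half, no immediate repeats."""
    while True:
        first = rng.sample(CONDITIONS, 3)
        second = rng.sample(CONDITIONS, 3)
        # Within a half all conditions differ, so only the boundary
        # between the two halves can contain a direct repetition.
        if first[-1] != second[0]:
            return first + second


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
sessions = [session_order(rng) for _ in range(4)]

n_task_blocks = sum(len(s) for s in sessions)  # 4 sessions x 6 blocks
n_baseline_blocks = len(sessions)              # one baseline per session
total_minutes = 2 * (n_task_blocks + n_baseline_blocks)
print(total_minutes)  # 56
```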

#### **PROCEDURE**

After entering the lab, participants read about the experimental procedure and had it explained to them. They then signed an informed consent form. The physiological sensors were attached and the Tobii eye tracker was calibrated. The three conditions were practiced up to the point that the participant was familiar with the task. Regardless of this, all participants completed at least one block of the 2-back task in order to also practice the RSME rating that appeared at the end of the block. It was stressed that the 2-back task could be difficult, but that even when the participant thought it was too difficult he or she should keep trying to do as well as possible. Participants were asked to avoid movement as much as possible while performing the task and to use the breaks in between the blocks to make necessary movements. Before the start of each block, the participant was informed about the nature of the block (rest, 0-back, 1-back, or 2-back) via the monitor. After each block, the RSME scale was presented and the participant rated subjective mental effort by clicking the appropriate location on the scale using the mouse. The next block started after the participant indicated to be ready by pressing a button. Between sessions, participants had longer breaks, chatting with the experiment leader or having a drink.

# **DEFINITION OF FEATURES**

EEG data were filtered by a 0.1 Hz high-pass and a 100 Hz low-pass filter and sampled with a frequency of 256 Hz (USB Biosignal Amplifier, g.tec medical engineering GmbH). Afterwards, data were processed and analyzed using Matlab and the FieldTrip open source Matlab toolbox (Oostenveld et al., 2011). Epochs starting at 500 ms before stimulus onset and ending 2000 ms after were shifted such that the mean of the first 500 ms was zero. No eye blink artifacts were removed before classification, which makes the implementation of online classification easier. Our previous analysis (Brouwer et al., 2012) showed that EOG did not improve or contribute to EEG-based workload classification, indicating that performance is only expected to get better when these artifacts are removed. Over each block and each of the 7 EEG channels (C3, F3, C4, F4, Fz, Pz, FCz) we calculated the average ERP over all trials after resampling the data to 100 Hz. The (*N* = 101) samples between 0 and 1 s were used as ERP features. Similarly, for each of the trials and channels the spectral power over complete trials (from −0.5 to +2.0 s) was calculated in (*N* = 37) bands ranging from 2 to 20 Hz (in steps of 0.5 Hz) following an FFT approach using a single Hanning taper. Next, the average spectral power was determined for each block and channel (by averaging over all trials within a block). Trials with extreme variance in the signal, as defined by a standard deviation above 100 µV, were discarded before calculating the average ERP and spectral power features (1% of the data). Apart from using the "raw" ERP and power spectra of the various EEG channels, we also used alpha power and theta power as feature input for classification. As a measure of alpha power we used the average over the natural log transformed power within the frequency band ranging from 8 to 13 Hz. As a measure of theta power we used the average over the natural log transformed power of frequencies between 4 and 8 Hz.
Models that included alpha power and/or theta power as features did not also include the raw power values. Additionally, we examined alpha power of EEG as only recorded at Pz, theta power as only recorded at Fz and ERP as only recorded at Pz, since it would be practical to attach only one electrode to the scalp, and those are the location-feature combinations that we a priori expect to produce the clearest results. Effects of workload on the alpha band are particularly expected around Pz (for effortful and attentive processing, alpha reduction is observed at parietal regions; Klimesch et al., 2000; Keil et al., 2006). Effects of workload on the theta band are particularly expected around frontal electrode locations such as Fz (e.g., Miyata et al., 1990; Raghavachari et al., 2001; Jensen and Tesche, 2002; Esposito et al., 2009). The P300 is expected to be most clearly visible at Pz (e.g., Ravden and Polich, 1999; Srinivasan, 2007). Since a priori Pz seems to be the most informative electrode, we also looked at EEG data in general coming only from this electrode.
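The band-power features can be sketched in Python as follows. This is not the Matlab/FieldTrip code used in the study: the function name `band_log_power`, the example spectrum and the choice to treat both band edges as inclusive are assumptions made for illustration.

```python
import math

# Frequency axis described in the text: 2 to 20 Hz in steps of 0.5 Hz.
FREQS = [2.0 + 0.5 * i for i in range(37)]


def band_log_power(power, freqs, low, high):
    """Average of the natural-log power over bins with low <= f <= high.

    Treating both band edges as inclusive is an assumption of this sketch.
    """
    values = [math.log(p) for p, f in zip(power, freqs) if low <= f <= high]
    return sum(values) / len(values)


# Hypothetical block-averaged power spectrum for one channel (arbitrary units).
spectrum = [1.0 / f for f in FREQS]

alpha = band_log_power(spectrum, FREQS, 8.0, 13.0)  # alpha feature
theta = band_log_power(spectrum, FREQS, 4.0, 8.0)   # theta feature

# Length of the raw spectral feature vector: 37 bins for each of 7 channels.
n_spectral_features = len(FREQS) * 7
print(n_spectral_features)  # 259
```

Each band measure collapses a channel's spectrum into a single value, which is why the alpha and theta features are far shorter than the raw power spectra.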

Skin conductance level was determined by averaging skin conductance over each block. Inspection of the raw data showed that frequently, skin conductance peaks around the onset of a block (i.e., after rating subjective workload of the previous block) after which skin conductance rapidly decreases and remains around the same level. This led us to also use minimum skin conductance of each block as a feature.

As a measure of heart rate, we determined the mean RRI for each block. RRI is the interval between successive heart beats or more precisely, the interval between subsequent R-peaks in the ECG. Three measures of heart rate variability were computed. The root mean squared successive difference (RMSSD: Goedhart et al., 2007) between the RRIs reflects high frequency heart rate variability. High-frequency heart rate variability was also computed as the power in the high frequency range (0.15–0.5 Hz) of the RRI over time using Welch's method applied after spline interpolation; similarly, for mid-frequency heart rate variability the power in the frequency range of 0.07–0.15 Hz was used.
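The two time-domain measures can be written down compactly. The Python sketch below assumes that R-peak times (in seconds) have already been detected; the function names and the example values are hypothetical.

```python
import math


def rr_intervals(r_peak_times):
    """Intervals (s) between subsequent R-peaks."""
    return [b - a for a, b in zip(r_peak_times, r_peak_times[1:])]


def mean_rri(r_peak_times):
    """Mean R-R interval (s) over a block."""
    rri = rr_intervals(r_peak_times)
    return sum(rri) / len(rri)


def rmssd(r_peak_times):
    """Root mean squared successive difference between R-R intervals (s)."""
    rri = rr_intervals(r_peak_times)
    diffs = [b - a for a, b in zip(rri, rri[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical R-peak times; intervals alternate between 0.8 and 0.9 s.
peaks = [0.0, 0.8, 1.7, 2.5, 3.4]
print(mean_rri(peaks), rmssd(peaks))
```

The spectral measures (Welch's method after spline interpolation) are not covered by this sketch.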

The respiration signal was filtered using a running Gaussian blurring window (with a kernel width of 0.39 s). Subsequently, peaks and troughs were detected using the derivative of the signal. Breathing frequency was defined as the mean time interval between the peaks. Modulation depth was defined as the average difference between peak and trough.

Pupil size as determined by the Tobii Eyetracker and the ClearView algorithms was averaged for each block. When the eyetracker did not detect the pupil for both eyes for minimally two successive frames (i.e., 33 ms) and maximally 25 successive frames (416 ms), this was considered to be a blink. For each blink, blink duration was determined. Blink rate is the average number of blinks per minute.
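The blink criterion can be illustrated with a short Python sketch. It assumes a per-frame list in which an entry is `False` when the pupil was missing for both eyes; counting a run that is still open at the end of the block, as well as all names, are assumptions of the sketch rather than details from the study.

```python
FRAME_MS = 1000.0 / 60.0  # duration of one frame at 60 Hz


def detect_blinks(pupil_found, min_frames=2, max_frames=25):
    """Return blink durations (ms) from a per-frame 'pupil detected' list.

    A run of consecutive frames without a detected pupil counts as a blink
    when its length lies between min_frames and max_frames (inclusive).
    """
    durations = []
    run = 0
    for found in list(pupil_found) + [True]:  # sentinel closes a trailing run
        if not found:
            run += 1
        else:
            if min_frames <= run <= max_frames:
                durations.append(run * FRAME_MS)
            run = 0
    return durations


def blink_rate(durations, block_seconds=120.0):
    """Average number of blinks per minute within one block."""
    return len(durations) / (block_seconds / 60.0)


# Hypothetical frames: gaps of 3 frames (blink), 1 frame (too short)
# and 30 frames (too long).
frames = [True] * 5 + [False] * 3 + [True] * 5 + [False] + [True] * 5 + [False] * 30 + [True]
blinks = detect_blinks(frames)
print(len(blinks), blink_rate(blinks))
```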

The feature "time" was operationalized as the mid-time of the corresponding data segment in seconds from the start of the experiment, discarding breaks and periods in between blocks in which the RSME was registered. For instance, the mid-time of the first 2-min workload block is 60 s and that of the second is 180 s.

Physiological features that were considered with respect to their capacity to estimate workload in this study are summarized in **Table 1**. This table also indicates the length of the corresponding feature vector ("Dimension"), as well as the single sensors and the sensor groups that the features belong to. For examining the usability of different variables for assessing workload, we follow the list of features as summed up in **Table 1**. Only for EEG, features can consist of multiple values (dimension larger than 1). The rationale behind this is that the mentioned features are the smallest possible pieces of information that are expected to reflect workload. For examining EEG sensors "all electrodes," "Pz," and "Fz," we only include features reflecting both ERP and spectral properties of the EEG signal as printed in italics. The EEG sensor group only includes ERP and spectral power features of "All electrodes."

#### **CLASSIFICATION ANALYSIS**

The first three sessions, each containing two blocks of each n-back condition, were used to train the model parameters to individual participants. The last session was used to evaluate the model's classification accuracy. This simulates estimating workload online, using model parameters that are adjusted to the individual participant in a training phase. As a default, the classification models were trained and applied to distinguish between 0- and 2-back blocks, each containing 2 min of data or 48 trials (letters). Average classification performance (fraction correct in the last session) over all participants was used as measure of model performance.

Feature vectors were constructed for each of the data segments. For instance, for the model that includes all spectral power values over 120 s blocks of data, each feature vector contains 259 features (power at 37 frequencies × 7 channels, see **Table 1** second row), and there are 16 such vectors (4 sessions × 4 blocks). The data from the first 3 sessions were used to train a classifier model for each individual participant. The features were standardized to have mean 0 and standard deviation 1 on the basis of data from the training set. The same standardization transformation was applied to the test data (the data of the 4th session). After training the model using the training data (12 blocks of 259 features in the example above), the classification was applied to the test data and the performance score of each of the individual models was determined. Finally, overall performance is calculated by taking the average score over all individual models.
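The key point of the standardization step is that the mean and standard deviation are estimated on the training blocks only and then reused for the test blocks. A minimal Python sketch follows; the function names and toy data are hypothetical, and the use of the population standard deviation is an assumption, since the exact form is not stated.

```python
import statistics


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation from the training blocks."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Population SD is an assumption of this sketch; constant features
    # get a divisor of 1.0 to avoid division by zero.
    sds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, sds


def standardize(rows, means, sds):
    """Apply a previously fitted transformation to any set of blocks."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Toy data: 3 training blocks and 1 test block with 2 features each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

means, sds = fit_standardizer(train)
train_z = standardize(train, means, sds)
test_z = standardize(test, means, sds)  # reuses the training statistics
print(test_z)
```

Fitting the transformation on the training data alone keeps the evaluation on the last session free of information from that session, in line with the simulated online setting.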

Classification accuracy was determined for a range of models differing in the (types of) features that were included in the model, differing in the type of classifier and differing in the fusion rule that was used. Classification was performed using the Donders machine learning toolbox (DMLT) developed by van Gerven et al. (2013). Two types of classifiers were used. We used a linear Support Vector Machine as representing a more standard model and, in order to obtain confidence measures that can be used to fuse information, we used an elastic net model with logistic regression (Friedman et al., 2010).

#### **Table 1 | Examined (neuro)physiological features, sensors and sensor groups.**

*For EEG, the sensors "All electrodes," "Pz" and "Fz" are examined using the features as defined by ERP and spectral power (printed in italics). The EEG sensor group only includes ERP and spectral power features of "All electrodes."*

For combining information across sensor groups, both fusion at feature level and fusion at the decision level were investigated. In the first (default) case the concatenated feature vector containing all features was used as the input to a single model. In the latter case, the final decision was based on the average of the probability estimates supplied by the logistic regression from the different elastic net models, each based on the individual features (one model output for each feature). For instance, if estimated probabilities on high workload would be based on mean heart rate, mean skin conductance and blink rate with model output probabilities of *p*1 = 0.2, *p*2 = 0.6, *p*3 = 0.5, the average probability of the combination model is 0.43. Thus, these data would be assessed to reflect low workload.
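The decision-level fusion rule from the example above amounts to averaging the model output probabilities and comparing the result with a decision boundary. In the Python sketch below, the function name and the boundary of 0.5 are assumptions; the boundary is implied by, but not stated in, the example.

```python
def fuse_decisions(probabilities, threshold=0.5):
    """Average per-model probabilities of high workload and pick a label.

    The threshold of 0.5 is an assumption of this sketch.
    """
    mean_p = sum(probabilities) / len(probabilities)
    label = "high" if mean_p > threshold else "low"
    return mean_p, label


# Probabilities from the example: heart rate, skin conductance, blink rate.
mean_p, label = fuse_decisions([0.2, 0.6, 0.5])
print(round(mean_p, 2), label)  # 0.43 low
```

Because every model contributes one probability regardless of how many features it was trained on, a single-valued feature such as mean heart rate carries the same weight in the average as a model based on hundreds of EEG features.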

#### **STATISTICAL ANALYSIS**

We used one-tailed binomial tests to determine whether classification accuracy was significantly higher than chance. This works as follows. In the default situation of classifying the 2 min high and low workload blocks (2- vs. 0-back), classification accuracy per participant could only take the values 0, 0.25, 0.50, 0.75, and 1 (correct classification of 0–4 blocks in the last session). To test whether, on the whole, classification performance is above chance, we computed the average score over all 14 individual models as a measure of performance. This average score can take the values 0/56, 1/56, 2/56, ..., 1 (a resolution of 1/56, where 56 is 4 test blocks × 14 participants). To determine whether this average score is significantly higher than chance, we calculated the probability that this score or a higher one is obtained when using a random classification model (i.e., a model that classifies a block as one or the other with a probability of 0.5). In this way, one can determine that the probability of obtaining a value of 0.61 or higher given a probability of 50% is equal to 0.05 (the level corresponding to *p* = 0.05 in e.g., **Figure 1**). We also calculated Bonferroni corrected levels per comparison/figure, and found that when the *p* = 0.05 significance level is corrected for multiple testing (using Bonferroni), this level goes up to the same level as the uncorrected *p* = 0.01 level. This means that the conditions that reach an uncorrected level of *p* = 0.01 maintain significance after Bonferroni correction (at *p* = 0.05).
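The one-tailed test can be sketched with the exact binomial tail probability. This is an illustration under stated assumptions (56 independent test blocks, chance probability 0.5); the function names are hypothetical, and the cut-off it yields is not guaranteed to coincide with the level drawn in the figures, since that depends on how rounding was handled.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, alpha, p=0.5):
    """Smallest number of correct blocks k with P(X >= k) <= alpha."""
    for k in range(n + 1):
        if binomial_tail(k, n, p) <= alpha:
            return k
    return None  # only reached if alpha is smaller than p**n

n_blocks = 4 * 14  # 4 test blocks for each of 14 participants
k_05 = critical_count(n_blocks, 0.05)
k_01 = critical_count(n_blocks, 0.01)
threshold_05 = k_05 / n_blocks  # accuracy needed for p <= 0.05
threshold_01 = k_01 / n_blocks  # accuracy needed for p <= 0.01
```

A Bonferroni-corrected level would follow by passing `0.05 / m` as `alpha`, where `m` is the number of comparisons in a figure.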

Pairwise comparison tests were used to determine whether two accuracies were significantly different from each other. To indicate the level of significance, chance and alpha levels are shown in the various figures. The figures also include estimates of the standard error in the fractions correct based on a binomial distribution. We did not correct for multiple testing, which means that estimates of significance levels are on the low side.
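The standard error of a fraction correct under a binomial distribution is sqrt(p(1 − p)/n). A minimal sketch follows, assuming n = 56 test blocks as in the default case; the function name and the example accuracy are illustrative only.

```python
from math import sqrt

def binomial_standard_error(fraction_correct, n_trials):
    """Standard error of a proportion under a binomial model."""
    return sqrt(fraction_correct * (1.0 - fraction_correct) / n_trials)

# Example: an accuracy of 0.86 over 56 test blocks.
se = binomial_standard_error(0.86, 56)
```

The standard error is largest at an accuracy of 0.5 and shrinks as accuracy approaches 0 or 1, so error bars of well-performing models are narrower than those of models near chance.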

Since the results suggested that EEG models might perform at ceiling level and that we could get a higher benefit of combining variables for a more difficult case where workload assessment did not reach ceiling, we also analyzed performance for classifying smaller workload differences (2- vs. 1-back and 1- vs. 0-back) and for classifying 30 rather than 120 s segments of data. In the latter case two of the participants' data were incomplete due to segments without blinks resulting in undefined blink duration and were discarded (leaving 12 participants). Using parts of blocks rather than complete blocks resulted in having 16 data sets (each 30 s long) per participant available to test the trained classification model rather than 4 (each 2 min long).
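The step from 4 to 16 test segments per participant can be sketched as follows. The sampling rate and the placeholder data are hypothetical, and it is assumed that segments do not overlap.

```python
def split_block(samples, sampling_rate_hz, segment_s=30):
    """Cut one block into consecutive, non-overlapping segments."""
    step = int(sampling_rate_hz * segment_s)
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]

SAMPLING_RATE_HZ = 10  # hypothetical value, chosen only for illustration
block = list(range(120 * SAMPLING_RATE_HZ))  # one 120 s block of placeholder data

segments = split_block(block, SAMPLING_RATE_HZ)
test_segments_per_participant = 4 * len(segments)  # 4 test blocks in session 4
```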

#### **RESULTS**

Task difficulty and subjective effort (workload) were successfully manipulated as indicated by the expected effects of n-back level on performance and subjective ratings (Brouwer et al., 2012). The different n-back levels resulted in the expected differences in performance for the 14 subjects with decreasing fraction correct (0.96, 0.94, and 0.90 for the 0-back, 1-back, and 2-back conditions) and increasing response times (560, 616, and 730 ms respectively). Perceived mental effort as measured by RSME increased with n-back level (31, 39, and 55 for the 0-back, 1-back, and 2-back conditions respectively).

#### **SINGLE VARIABLES**

**Figure 1** shows the performance of models that include a single variable or feature (as defined in the second column of **Table 1**), separately for the SVM and elastic net classification approaches (see "Classification Approach"). The horizontal lines indicate chance level, and levels corresponding to a significant difference from chance, for *p* = 0.05 and *p* = 0.01. In general, performance of models based on EEG variables is much better than that of models based on the other (single) variables.

ERP and spectral power (when using SVM) lead to approximately the same high classification performance of over 0.85 as using all EEG features. Moreover, when information is reduced from EEG or ERP recorded at all electrodes to Pz only, classification performance remains at the same level. Also, when only alpha is used instead of all frequency bands, performance does not deteriorate. Using only Pz for alpha alone does reduce performance relative to using all electrodes (*p* < 0.05). A model based on the theta band alone does not perform as well as one using all frequencies (*p* < 0.05), indicating that the theta band is less informative in our case (in correspondence with our previous findings, see Brouwer et al., 2012).

Models based on the physiological variables perform relatively poorly with respiration frequency being the only feature that reaches the 0.01 significance level with an accuracy of 0.68. High frequency HRV (as defined by RMSSD and spectrally defined) is the only other physiological feature that, depending on the classification approach, just reaches significance (*p <* 0*.*05).

In comparison, models based on eye measures show relatively good performance, with a classification accuracy of 0.75 for pupil size. Blink rate performs significantly above chance as well, but blink duration does not.

#### **SINGLE SENSORS**

**Figure 2** shows the performance of the "single sensor" models, i.e., the models that include all features belonging to a certain sensor. Also shown is the performance of the best performing single variable model for each sensor type (in which ERP and spectral power are regarded as the corresponding single features for EEG and EEG\_Pz). Skin conductance reaches the significance level when features are combined using the elastic net model, while it remains below significance level for each individual feature. However, and as hypothesized, models using combinations of features from a single sensor do not perform significantly better than models using only the best performing single feature, for any of the sensors. Again, the EEG model shows the best performance (with an accuracy of 0.86 for both classification approaches). Second best is the performance of the models based on eye measures (accuracy of 0.75 for SVM). The model based on respiration also reaches a relatively high performance level (accuracy of 0.70 for both classification approaches). Models based on skin conductance (accuracy of 0.63 for elastic net) and ECG (accuracy of 0.61 for elastic net) show relatively poor performance, just reaching a level that is significantly higher than chance (*p* < 0.05). Pairwise comparison tests (using the SVM-data) show that the EEG models are significantly different from the other models (*p* < 0.01), and that performance of the eye model is significantly better than that of the skin conductance and ECG models (*p* < 0.05).

#### **COMBINATIONS ACROSS SENSOR GROUPS**

**Figure 3** shows the results for different (combinations of) sensor groups for SVM and elastic net, as well as the outcome of combining the outputs of different elastic net models ("decision level"). Shown are the results for the default case of classifying 2- vs. 0-back over 2 min data segments (a) as well as for more difficult cases: using 30 s data segments (b), or classifying 2- vs. 1-back (c), or 1- vs. 0-back (d). Comparing performance in the default case (**Figure 3A**) with that of separate sensors (**Figure 2**) shows that for Physiology, combining the three sensors leads to a (non-significant) increase in performance (accuracy of 0.75 for SVM, compared to 0.70 for respiration, the best performing single physiological sensor). Models that include EEG perform significantly better than physiology and eye models (SVM, pairwise comparisons, *p* < 0.05). Adding sensor groups to the already well performing EEG improves classification accuracy by 3–5% (for adding Physiology or Eye variables with elastic net). For SVM, the combination of physiology and eye measures tends to improve performance relative to either one alone by 7% as well. However, none of these improvements reaches statistical significance. Using the assumption of a binomial distribution, significance (*p* < 0.05) is reached for differences of around 10%.

We may not observe a larger, statistically significant improvement of combining EEG with physiology (as we had hypothesized) because EEG alone is already performing very well. Therefore, we performed the same analyses on more challenging classification tasks, namely classification of shorter time segments (30 s rather than 120 s) and classification of more similar workload levels (2- vs. 1-back and 1-back vs. 0-back). In addition, this gives us an impression of how much lower classification performance is under these circumstances. Shorter time segments are expected to be more difficult to classify because the extracted information will be less reliable. This is especially obvious for some of the non-EEG measures (e.g., for high frequency heart rate variability, minimum durations of 1 min are advised: Task Force of the European Society of Cardiology and the North American Society of Pacing Electrophysiology, 1996; Berntson et al., 1997).

**Figure 3B** shows the performance of the models for classifying data segments of 30 s instead of 120 s. Note that this includes data from 12 instead of 14 participants (see Materials and Methods). The thresholds for significance decrease in this case since the test set contains 16 samples instead of 4. As expected, performance for classifying 30 s segments is lower than for classifying 120 s segments, with a performance that is (on average) 8% (SVM) and 5% (elastic net) lower. Still, performance for models that include EEG is around 0.8 or higher for the elastic net model. The pattern of results is highly similar to that for classifying 120 s segments. Thus, there does not seem to be a larger benefit of sensor group combination than in the previous case, suggesting that the lack of improvement of adding non-EEG variables to the EEG model is not due to a ceiling effect.

**Figure 3C** shows classification performance for classifying 2-back vs. 1-back using 2-min blocks. **Figure 3D** shows 1-back vs. 0-back classification performance. As expected, performance for discriminating smaller differences in workload is lower than for discriminating 2-back vs. 0-back. Performance is on average 10% (SVM) and 6% (elastic net) lower for 2- vs. 1-back, with performance around 0.85 for elastic net models including EEG. It is 22% (SVM) and 18% (elastic net) lower for 1- vs. 0-back, with performance just below 0.70 for elastic net models including EEG. This suggests a larger increase in workload from 1-back to 2-back than from 0-back to 1-back, in accordance with earlier findings (Brouwer et al., 2012, 2014). Again, models that include EEG variables show the best performance. Performance of classification models based on Physiology is not significantly different from chance for both small workload differences, and the Eye based model drops to chance level when distinguishing 1- from 0-back.

Also included in **Figure 3** is the performance of the "decision level" model that combines the output of different elastic net models (based on single features). We did not find any significant differences between performance of the elastic net models that use the two types of fusion; trends indicate an advantage of fusion at the feature level compared to fusion at the decision level.

**Figure 4** shows the effect of adding the feature time (i.e., the time of measurement since the start of the experiment) to the model input for the default case (2- vs. 0-back, 120 s of data). Adding time leads to an increase in performance of 9% (significant at *p* < 0.05) and 5% (not significant) for the physiological and eye sensor group models, respectively (SVM). For EEG and "All" the inclusion of time information does not improve performance. Further analysis of the data shows that when classification is more difficult due to shorter time intervals or smaller workload differences (see **Figures 3B–D**), the potentially beneficial effect of including time decreases.

# **CONCLUSION AND DISCUSSION**

#### **SUMMARY OF FINDINGS**

In this study, we compared how well different physiological variables can be used to assess workload in (simulated) real time for a single individual, when the amount of body movement and visual information are controlled for. We also examined to what extent (different ways of) combining information leads to better classification performance, as well as whether taking time into account improves performance.

Classification models based on data from each of the three sensor groups perform above chance (distinguishing high from low workload) at a 0.01 significance level, where EEG reached around 86% classification accuracy and classification models based on peripheral physiology and eye-related variables reached between 70 and 75% accuracy. As hypothesized, the difference in classification accuracy between models based on EEG variables on the one hand, and models based on peripheral physiology and eye-related variables on the other hand was statistically significant. The best performing single variable was ERP at Pz with 88% accuracy (elastic net). All EEG variables (except power in the theta band measured at Fz) performed well above chance (*p <* 0*.*01). The only non-EEG variables exceeding the 0.01 chance level were respiration frequency (69% accuracy) and pupil size (75% accuracy).

As hypothesized, combining variables recorded using a single sensor (i.e., only the EEG electrodes, electrode Pz, skin conductance electrodes, respiration belt, ECG electrodes or eye camera) does not significantly improve performance over the best performing single variable from that sensor. Variables from the same sensor are likely highly correlated and, in our experiment, combining them has no added value. For four out of the six sensors the trend was even that performance worsened when combining data.

Combining variables of the three physiological sensors (skin conductance electrodes, respiration belt and ECG electrodes) resulted in a modest, non-significant improvement to around 75% classification accuracy with respect to the best performing single physiological sensor (respiration—around 70%). In contrast to what we expected, combining EEG with another sensor group (physiology and eyes) does not lead to a significant improvement in classification accuracy over EEG alone. Using elastic net, we only found non-significant trends of better performance for EEG combined with eye data (91% accuracy) and EEG combined with physiology (89%) than EEG alone (86% accuracy). Adding physiology to fused EEG and eye data does not further improve, or tend to improve, performance. Analysis of data from shorter time segments or smaller workload differences indicates that the fact that we did not get a stronger, significant improvement of adding physiological or eye data to EEG is not caused by a ceiling effect.

Fusion of variables by concatenating feature vectors could not be improved upon by fusing variables at the decision level. The latter approach results in a more balanced weighting of the different indicators, compensating for the low number of features from physiological and eye-based variables relative to EEG, and could therefore have improved classification. As it turned out, the variables that might have profited from the decision level approach, i.e., physiological and eye data, performed worse than EEG variables. Giving more weight to variables with a lower performance is not necessarily beneficial. In addition, classification models based on concatenation may make better use of interactions between variables that are missed when fusion occurs at the decision level.

Including the time of measurement relative to the start of the experiment as a parameter leads to statistically significant improvements of up to 9% for physiological variables. Adding time did not significantly improve performance of classifiers based on EEG, eye-related variables or combinations of variables. The latter cannot be attributed to a ceiling effect, as indicated by analysis of data from shorter time segments or smaller workload differences.

# **COMBINATION OF INFORMATION**

The notion that combining physiological variables that reflect a certain mental state will result in a more accurate assessment of this mental state compared to using these variables on their own seems very sensible and has frequently been suggested in the literature as a potential way to improve mental state assessment. However, we are not aware of studies that tried this and showed a statistically reliable and strong improvement. A few studies explicitly mention that combination of physiological information did not result in reliable improvement (e.g., Christensen et al., 2012; Coffey et al., 2012; Severens et al., 2013) or did so only to a modest degree in one of multiple conditions (Brouwer et al., 2012). Other studies report that models combining information increase classification accuracy (by a small amount) but do not provide statistical evidence to show that the effect is reliable (e.g., Chanel et al., 2009). We had anticipated that our study could provide clear evidence for the benefit of combination given the nature of workload, which is a mental state that involves multiple processes that are presumably reflected by different types of physiological variables (e.g., cognitive processes by EEG and arousal by peripheral physiological measures). Also, while most studies combine information by fusion at the feature level, we thought that fusion of information at the decision level could have contributed to finding a strong reliable advantage of combining information. However, we did not find significant differences between the two methods, with the overall trend indicating worse rather than better performance for fusion at the decision level. The fact that with this study, there is still no evidence of large benefits when combining physiological variables reflecting workload suggests that they are too highly related to gain a benefit of combination.
It could be the case that meta-analyses or a study such as ours including more participants would turn non-significant trends of fusion benefits into statistically significant effects. However, if present at all, the effects of data fusion are small. Also, it may be the case that for other tasks (perhaps involving stronger emotional processing besides cognitive processes) benefits of feature fusion can be found more easily.

#### **WORKLOAD ASSESSMENT PERFORMANCE**

Classification accuracy for distinguishing 2-min segments of high vs. low workload for a single individual is relatively high (with an average over participants up to 91%), especially when considering the fact that the amount of movements and visual information was the same across workload levels, and that classification simulated a real time situation.

When the duration of the data segments to be classified was decreased to 30 s, this resulted in a relatively small decrease in performance, of on average 5% (elastic net) to 8% (SVM). The results further show that discrimination between more subtle differences in workload is possible as well. These results are good news for potential use in applications. Another finding that is useful for practical applications is that performance based on a single Pz-channel was found to be comparable to that of a model using all EEG-channels. This means that in our case a single channel suffices to characterize the EEG.

As reported, we found EEG variables to be most informative when assessing workload. However, we also found pupil size and blink rate to reflect workload. This is interesting given the fact that lighting conditions and visual input were strictly controlled in our experiment. These results thus indicate that not only pupil size but also blink rate is affected by mental workload level apart from visual demands.

#### **GROUP vs. INDIVIDUAL LEVEL**

In the present study, we examined how well different variables can be used to assess workload in real time for a single individual by training classification models on different types of information and comparing performance. Sensitivity of physiological variables is often examined using group level analyses (e.g., using repeated measures ANOVAs). However, for various reasons (see Introduction) one cannot draw straightforward conclusions about assessing workload on an individual level from the results on a group level. For instance, in the current study we found that classification models based on heart rate performed badly, while, for basically the same set of data, this variable was found to be among the ones most strongly associated with workload in a repeated measures ANOVA (Brouwer et al., 2014). Brouwer et al. (2014) also found that heart rate strongly decreased over the time course of the experiment. Since we presented workload conditions in 2-min segments equally dispersed over time, even strong time effects are averaged out in repeated measures ANOVAs, whereas they could overrule the comparatively small workload effects in classification type of analyses, especially when classification models are trained on data acquired at the start of the experiment and tested on data at the end. Thus, caution should be taken when generalizing results from studies using group level analysis to situations where momentary data of individuals are used.
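The contrast between the two kinds of analysis can be made concrete with a synthetic sketch. All numbers here (baseline, drift, effect size, block order) are invented for illustration and are not taken from the study; the simple midpoint threshold stands in for a trained classifier.

```python
# Synthetic illustration only: none of these values come from the experiment.
ORDER = [0, 1, 1, 0]  # low, high, high, low within each session (assumed)
BASELINE, DRIFT, EFFECT = 80.0, 2.0, 1.0  # large time drift, small workload effect

blocks = []
for session in range(4):
    for position, label in enumerate(ORDER):
        index = 4 * session + position
        heart_rate = BASELINE - DRIFT * index + EFFECT * label
        blocks.append((session, label, heart_rate))

def class_mean(rows, label):
    values = [hr for _, lab, hr in rows if lab == label]
    return sum(values) / len(values)

# Group-style view: high-minus-low difference within each session.
session_effects = []
for s in range(4):
    rows = [b for b in blocks if b[0] == s]
    session_effects.append(class_mean(rows, 1) - class_mean(rows, 0))

# Classification view: train on sessions 1-3, test on session 4.
train = [b for b in blocks if b[0] < 3]
test = [b for b in blocks if b[0] == 3]
threshold = (class_mean(train, 0) + class_mean(train, 1)) / 2
predictions = [1 if hr > threshold else 0 for _, _, hr in test]
accuracy = sum(p == lab for p, (_, lab, _) in zip(predictions, test)) / len(test)
```

In this toy setting the within-session workload effect is identical in every session, so a group-level comparison would detect it, while the drift pushes all test values below the training threshold and the classifier ends up at chance.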

#### **NOISE AND CONFOUNDS IN REAL LIFE**

In real-life, out-of-the-lab situations, the presence of factors like body movement and varying light conditions may act as noise, thereby diminishing the value of certain variables (e.g., pupil size). Also, levels of workload can be confounded with different levels of stimulus processing or motor actions. For example, workload in Air Traffic Control may be confounded by speech, with controllers talking more during high than during low workload situations. Such confounds may affect physiology (e.g., speech affects respiration), resulting in improved classification accuracy. One of the reasons for the fact that we only found a minor improvement in performance by fusing different workload measures may have been that the experiment controlled for many of such confounding factors, thus decreasing the additional benefit of recording various physiological measures. In this way we were better able to determine which factors directly reflect mental workload. However, in practical situations, physiological measures may supply information about the context and contribute to workload assessment in a more indirect way, i.e., via the confounds. In such a case one should determine whether physiological measures are the most convenient measures to supply workload information, or whether task or behavioral measures (such as the detection of speech through audio sensors with respect to our previous ATC example) are more suitable.

# **ACKNOWLEDGMENTS**

We would like to thank Pjotr van Amerongen, Rob van de Pijpekamp, Tobias Heffelaar and Patrick Zimmerman (Noldus) for technical assistance, Marcel van Gerven and Jason Farquar for contributions to the multivariate analysis tools and the SVM method in the FieldTrip toolbox, Boris Reuderink and Robert Oostenveld for fruitful discussions. This research has been supported by the GATE project, funded by the Netherlands Organization for Scientific Research (NWO) and the Netherlands ICT Research and Innovation Authority (ICT Regie). Furthermore, the authors gratefully acknowledge the support of the BrainGain Smart Mix Programme of the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture and Science.

# **REFERENCES**


Berntson, G. G., Bigger, J. T., Eckberg, D. L., Grossman, P., Kaufmann, P. G., Malik, M., et al. (1997). Heart rate variability: origins, methods, and interpretive caveats. *Psychophysiology* 34, 623–648. doi: 10.1111/j.1469-8986.1997.tb02140.x

Boucsein, W. (1992). *Electrodermal Activity*. New York, NY: Plenum Press.


modes in spontaneous brain activity fluctuations at rest. *Proc. Natl. Acad. Sci. U.S.A.* 100, 11053–11058. doi: 10.1073/pnas.1831638100


measurement, physiological interpretation and clinical use. *Circulation* 93, 1043–1065. doi: 10.1161/01.CIR.93.5.1043


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 January 2014; accepted: 25 September 2014; published online: 14 October 2014.*

*Citation: Hogervorst MA, Brouwer A-M and van Erp JBF (2014) Combining and comparing EEG, peripheral physiology and eye-related measures for the assessment of mental workload. Front. Neurosci. 8:322. doi: 10.3389/fnins.2014.00322*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Hogervorst, Brouwer and van Erp. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# EEG-based workload estimation across affective contexts

#### *Christian Mühl <sup>1</sup> \*, Camille Jeunet <sup>1,2</sup> and Fabien Lotte <sup>1,3</sup>*

*<sup>1</sup> Institut National de Recherche en Informatique et en Automatique, Bordeaux Sud-Ouest, Talence, France*

*<sup>2</sup> Laboratoire Handicap et Système Nerveux, University of Bordeaux, Bordeaux, France*

*<sup>3</sup> Laboratoire Bordelais de Recherche en Informatique (LaBRI), Talence, France*

#### *Edited by:*

*Jan B. F. Van Erp, TNO - Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *Reviewed by:*

*Stephen Fairclough, Liverpool John Moores University, UK*

*Maarten Andreas Hogervorst, TNO - Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *\*Correspondence:*

*Christian Mühl, Institut National de Recherche en Informatique et en Automatique, Bordeaux Sud-Ouest, 200, Rue de la Vieille Tour, 33405 Talence, France e-mail: c.muehl@gmail.com*

Workload estimation from electroencephalographic signals (EEG) offers a highly sensitive tool to adapt the human–computer interaction to the user state. To create systems that work reliably in the complexity of the real world, robustness against contextual changes (e.g., mood) has to be achieved. To study the resilience of state-of-the-art EEG-based workload classification against stress, we devised a novel experimental protocol in which we manipulated the affective context (stressful/non-stressful) while the participant solved a task with two workload levels. We recorded self-ratings, behavior, and physiology from 24 participants to validate the protocol. We test the capability of different, subject-specific workload classifiers using either frequency-domain, time-domain, or both feature varieties to generalize across contexts. We show that the classifiers are able to transfer between affective contexts, though performance suffers independent of the used feature domain. However, cross-context training is a simple and powerful remedy, allowing the extraction of features in all studied feature varieties that are more resilient to task-unrelated variations in signal characteristics. Especially for frequency-domain features, across-context training leads to a performance comparable to within-context training and testing. We discuss the significance of the result for neurophysiology-based workload detection in particular and for the construction of reliable passive brain–computer interfaces in general.

**Keywords: workload, stress, brain–computer interface, classification, electroencephalography, passive brain computer interface**

# **INTRODUCTION**

The increasing complexity and autonomy of information systems rapidly approaches the limits of human capability. To avoid overload of the users in highly demanding situations, a dynamic and automatic adaptation of the system to the user state is necessary. Reliable knowledge about the user state, especially his workload, is a key requirement for a timely and adequate system adaptation (Erp et al., 2010). Examples are systems supporting air traffic control, pilots, as well as medical and emergency applications.

Conventional means of workload assessment, such as self-assessment and behavior, are intrusive or limited in their sensitivity, respectively (Erp et al., 2010). Physiological sensors, assessing for example the galvanic skin response (GSR) or electrocardiographic activity (ECG), offer an unobtrusive and continuous measure that has been found sensitive to workload (Verwey and Veltman, 1984; Boucsein, 1992). In the last two decades, neurophysiological activity became popular as a modality for the measurement of mental states in general and of workload in particular. So-called "passive brain-computer interfaces" (pBCI, Zander and Kothe, 2011) are able to measure neuronal activity in terms of the electrophysiological activity of neuron populations, as in the case of EEG, or the oxygenation of the cerebral blood flow, as for functional near-infrared spectroscopy (fNIRS). Both approaches have been found informative regarding the detection of cognitive load (Brouwer et al., 2012; Solovey et al., 2012), and there is evidence for a partially superior sensitivity of neural measurements compared to other physiological sensors (Mathan et al., 2007) or self-report (Peck et al., 2013).

Most experiments on passive BCI use a very controlled approach, which naturally limits the range of real-world conditions they reflect. While this control is necessary to ensure the psychophysiological validity of the mental state detection, the results lack a certain ecological validity: they cannot be generalized to other contexts. This might be one of the most impeding problems for the creation of passive BCI systems that work in the real world, since daily life is characterized by the variability of the conditions we function under. A prominent example is a change of affect while working, for example working under the pressure of an impending evaluation vs. working without pressure. A system that is supposed to work in such contexts needs to be calibrated and tested in them. Previous research in the domain of pBCI largely ignored the problem. To shed light on the interaction of mental state classification and change of affective context, we devised a protocol that recreates conditions of work, requiring different effort, during relaxed conditions and under psychosocial stress in a controlled environment. To study the resilience of a state-of-the-art workload detection system to changes in affective context, we train subject-specific classifiers in either stressed or non-stressed context and test their performance within the same and in the other context.

In summary, the contributions of this paper for the study of the effect of affective context on workload classification are:


Below, we will give the reader some background to neurophysiology-based detection of workload under varying (affective) user states and its potential interactions with stress responses. Then, we will introduce the employed approaches to manipulate the user's mental state, the used devices, and the applied signal processing and classification algorithms. We will then report the nature of the found effects, discuss their relevance, and conclude with the general consequences and limitations of the presented findings.

# **RELATED WORK**

#### **DETECTION OF WORKLOAD FROM NEUROPHYSIOLOGY**

Mental workload can be defined as the (perceived) relationship between the amount of mental processing capability available and the amount required by the task (Hart and Staveland, 1988). The closer the requirements are to the actual capabilities, the higher the (perceived) workload. Therefore, a general strategy for workload manipulation is the manipulation of task demand or difficulty (Gevins et al., 1998; Grimes et al., 2008; Brouwer et al., 2012), though alternative strategies, such as the manipulation of feedback or participant motivation (Fairclough and Roberts, 2011), exist.

Already in 1998, Gevins et al. (1998) showed that EEG is a viable source of information regarding the workload of a person, enabling 95% accuracy when using about 30 s of signal. However, there are many factors that can affect the performance of classification algorithms, such as the amount of training data available, their distribution, their separability between classes, the data signal-to-noise ratio, the similarity (in terms of data distribution) of the training data and testing data, etc. (Duda et al., 2001). The estimation of these performances also depends on the amount of testing data available (Müller-Putz et al., 2008), and the way they are estimated (cross-validation, independent test set). Finally, more BCI-specific factors affect the performances, such as whether the classification is subject-specific or subject-independent (see, e.g., Lotte et al., 2009), which subjects are used (there is a huge between-subject variability), whether the training and testing data are from the same session (e.g., same day) or not, etc. (Lotte et al., 2007). In this regard, Grimes et al. (2008) showed that a number of factors, such as the number of channels, amount of training data, or length of trials, have a strong influence on classification performance of workload classifiers. For example, reducing the length of the signal from 30 to 2 s reduces the classification performance on two workload levels from almost 92% to about 75%. Similar tradeoffs between optimal and practical signal processing settings are reported for channel number and training time. Another work, by Brouwer et al. (2012), studied in a similar setup the feasibility of different types of features (i.e., from the time- and frequency-domain, and combined) to differentiate workload levels, finding that the different feature types work comparably well with accuracies of about 85% after 30 s. Reducing the signal length to 2 s reduced the accuracy to about 65%. Zarjam et al. (2013) showed that workload manipulated by an arithmetic task can be classified with a performance of 83% for seven workload levels. Walter et al. (2013) tested the generalization of a workload classifier from simple tasks, such as go/no-go, reading span, n-back tasks, to complex tasks involving diagram and algebra problems. While they were able to train well-performing classifiers for the simple tasks, reaching performances of about 96% for two classes on signals of a few seconds length, a cross-testing of a workload classifier trained on a simple task to a complex task did not succeed. However, since in both studies the order of workload levels was not randomized, a temporal trend present in the features could have biased the results toward a higher accuracy. Overall, these studies show that the workload level can be classified from neurophysiological activity. Indeed, it has also been suggested that neurophysiological information is more sensitive than information from other physiological signals (Mathan et al., 2007). Most importantly, these studies show that different factors, mainly methodological differences in workload induction, signal acquisition and processing, can have significant influences on the classification results.

However, to date there have only been a few studies regarding the influence of mental state changes during training and testing on classifier performance. For active BCI, Reuderink and colleagues studied the influence of frustration on left and right hand movement classification during a computer game, using freezing screens and button malfunctions as induction tools (see Reuderink et al., 2013). The resulting loss of control (LOC) during "frustrating" episodes surprisingly led to higher classification performance than during normal, relaxed game play (Reuderink et al., 2011). Zander and Jatzev (2012) induced LOC in a similar way during a simple behavioral task, the RLR paradigm, which resulted in lower classification performance. For passive BCI and specifically for workload level detection, only Roy et al. (2012) tested the impact of fatigue on EEG signal characteristics and workload classification performance. With increasing fatigue, the differentiating signal characteristics diminished and, consequently, the classification performance declined. This lack of research on interactions between passive BCI and changes in user state is problematic, since BCIs in general have been found susceptible to changes in task-unrelated mental states during classification, such as attention, fatigue or mood. Specifically, it is believed that variations in task-unrelated mental states are partially responsible for what is called non-stationarity of the signal, the change of its statistical properties over time, and are thereby the source of one of the most notorious problems for BCI (Krusienski et al., 2011; van Erp et al., 2012).

In the next section, we will briefly introduce the concept of stress, which is another possible contextual factor influencing workload estimation that is occurring during daily life and work, and thus might be a relevant source of variance for workload detection devices.

<sup>1</sup>The validation of the administered stress-induction protocol was presented at the PhyCS 2014 conference (see Jeunet et al., 2014 for more details).

#### **STRESS RESPONSES AND WORKLOAD**

The psychophysiological concept of "stress" was introduced in 1936 by Selye (1936) to describe "the non-specific response of the body to any demand for change." In that sense, it is an organism's response to an environmental situation or stimulus perceived negatively (called a "stressor"), which can be real or imagined, that taxes the capacities of the subject, and thus has an impact on the body's homeostasis (that is to say that the constants of the internal environment are modified). To face the demand (i.e., to restore homeostasis), two brain circuitries can be activated during a "stress response cascade" (Sinha et al., 2003; Dickerson and Kemeny, 2004; Taniguchi et al., 2009): the sympatho-adrenomedullary axis (SAMa, also called the noradrenergic circuitry) and the hypothalamus-pituitary gland-adrenal cortex axis (HPAa). On the one hand, the SAMa induces the release of noradrenaline, which allows immediate physical reactions (such as increased heart rate and skin conductance, or auditory and visual exclusion phenomena) associated with a preparation for violent muscular action (Dickerson and Kemeny, 2004). On the other hand, the HPAa activation (which is lower) results in the release of cortisol, the purpose of which is to redistribute energy in order to face the threat. Thus, more energy is allocated to the organs that need it most (brain and heart), while organs not necessary for immediate survival (reproductive, immune and digestive systems) are inhibited. This stress response cascade ends when homeostasis is restored.

However, stress can be of different types, such as physical, psychological or psychosocial (Dickerson and Kemeny, 2004), each kind of stress being associated with a specific response. Indeed, physical stress, induced by extreme temperatures or physical pain for example, is associated with an increase of heart rate (Loggia et al., 2011), skin conductance (Boucsein, 1992; Buchanan et al., 2006) and subjective stress ratings but with only a weak cortisol response (Dickerson and Kemeny, 2004). These results suggest that this kind of stress induces an activation of the SAMa but only a weak activation of the HPAa. Psychological or mental stress, associated with difficult cognitive tasks, uncontrollability or negative emotions is associated with a weak release of cortisol (weak HPAa activation), but strong effects on heart rate and skin conductance (strong SAMa activation) (Boucsein, 1992; Reinhardt et al., 2012). Finally, psychosocial stress, triggered by a social evaluation threat (that is to say a situation in which the person's own estimated social value is likely to be degraded), and added to by a feeling of uncontrollability (in particular during the Trier Social Stress Task (TSST) Kirschbaum et al., 1993), has been shown to induce a strong activation of both the SAMa (Hellhammer and Schubert, 2012) and the HPAa (Dickerson and Kemeny, 2004).

Psychosocial stress and workload potentially can interact on physiological, neurophysiological and behavioral levels. Since workload can also be understood as the response to a particular type of psychological stressor, such as increased task demand, both concepts are associated with the activation of the sympathetic nervous system (see SAMa above). Furthermore, psychosocial stress and workload share also neurophysiological responses. From research in the neurosciences, and consistent with the notion of neural response systems, we know that stress has strong correlates in the EEG as well. One of the most prominent correlates of anxiety, as induced by psychosocial stress, is found in the alpha band, and specifically in brain asymmetry. Tops et al. (2006) proposed that cortisol administration (which simulates a stress situation) leads to a global decrease of cortical activity (except for the left frontal cortex in which activity is increased). However, other studies (Lewis et al., 2007; Hewig et al., 2008) showed that stress was associated with a higher activity in the right hemisphere, and that the right hemisphere activation was correlated with negative affect. For Crost et al. (2008), the explanation of these conflicting results would be that an association between EEG-asymmetry and personality characteristics, such as anxiousness, may only be observed in relevant situations to the personality dimensions of interest. For workload, on the other hand, we know that the alpha band plays a role in terms of increased sensory processing leading to decreased occipitoparietal alpha power (Gevins et al., 1998; Brouwer et al., 2012), as well as for frontal alpha asymmetry covarying with changes in engagement (Fairclough and Roberts, 2011). 
From a theoretical point of view, Eysenck and Derakshan (2011) suggested that increasing anxiety, for example due to psychosocial stress, has effects on different cognitive processes, leading to impaired processing efficiency and performance effectiveness. Specifically for workload-related processes, their "attentional control theory" suggests that anxiety impairs efficient function of inhibition and shifting mechanisms of the central executive, subsequently decreasing attentional control and increasing distraction effects of irrelevant stimuli. However, these deficits might not necessarily lead to decreases of performance if they are compensated by alternative strategies, such as enhanced effort.

Summarizing, increases in workload, as induced by higher task demand, can be subsumed under the concept of psychological stress and have been found to lead to increasing physiological and neurophysiological activity that has also been found responsive to anxiety as induced by psychosocial stress. Furthermore, cognitive theories propose links between anxiety and pre-attentional and attentional cognitive processes, which are expressed in behavior and physiology. Due to these possible interactions of workload and stress, it seems relevant to experimentally study the effect of stress on workload detection.

# **RESEARCH QUESTIONS**

The work on the effects of potential contextual factors, such as moods or fatigue, on the stability of BCI performance, and the physiological and psychological links between stress and cognitive processes suggest that stress can be a relevant factor influencing the classification of workload levels. More generally, the findings of context-dependency of BCI performance make it seem imperative to explore the effect of factors, such as mood, on brain signals and classifier performance, to gain insight into the relevance of task-unrelated mental states for classifier performance, and to find ways to render classifiers robust against such changes. Specifically, for the development of reliable passive BCIs in the wild, i.e., those functioning robustly in private or work environments, the influence of contextual changes of mental states that are predominant in the context of application has to be explored. That is why we test the robustness of three workload classifiers, using features from either frequency-, time-, or both domains, to the influence of (psychosocial) stress. We let participants work under different levels of workload, while either under the impression of being observed and validated, or while being relaxed and free from this kind of pressure. We are interested in the effect of the contextual manipulation of stress on the classifier performance and in testing cross-context training as a simple and straightforward remedy to the problem. Thus, we address the following questions:

*Q1: Can we induce stress and workload in a controlled manner?* We validated stress and workload manipulation of our experimental protocol using participants' self-assessments, behavioral performance, and physiological indicators of sympathetic nervous system (SNS) activation (i.e., GSR, ECG). Stress is expected to increase perceived anxiety and SNS activity, while workload increase should be reflected in increased perceived arousal and mental effort, decreased performance, as well as increased SNS activity (Verwey and Veltman, 1984; Boucsein, 1992).

*Q2: Can we train a workload classifier based on the data collected via this protocol?* To ensure that we are using a state-of-the-art workload classifier, we trained the classifier on all data, irrespective of context, as done in conventional studies. We expect a performance of about 70% as shown by Grimes et al. (2008) and Brouwer et al. (2012) under similar conditions.

*Q3: Does the classifier generalize across affective contexts, and if so, how well?* To study the effect of different affective contexts on the classification performance, we compared the results from classifiers trained in either a stressful or a non-stressful context and then applied to test data from the same ("within") or the other context ("across"). We expect a higher "within" than "across" performance, which would indicate the difficulty of the classifier to generalize.

*Q4: Does training based on multiple contexts render the classifier resistant against changes in affective context, and if so, how resistant?* To test whether training with combined data from both affective contexts affects the classifier's capability to generalize, we compare the performance depending on the training context ("single," that is training on only stress or non-stress context, or "combined," that is training over contexts) and expect higher performance for a classifier trained on data from the combined contexts.

# **MATERIALS AND METHODS**

As mentioned before, we designed a protocol in which subjects had to do cognitive tasks involving two levels of mental workload, manipulated via task difficulty, while being exposed to two levels of psychosocial stress. We used the EEG signals collected with this protocol to design and assess a workload classifier across different stress conditions. This section describes in detail the subjects involved, the protocol and the method to validate it, the EEG-based workload classifier used and the evaluations performed with it.

#### **PARTICIPANTS**

Twelve female and twelve male participants were recruited for our experiment. The participants were between 18 and 54 years old, with a mean age of 24.7 ± 7.9 years, and all but four were right-handed. Education varied between high school degree and Ph.D., with a mean of 3.1 ± 2.4 years of education after high school. To be admitted, people had to be at least 18 years old, to speak the local language and to sign an informed consent form. Furthermore, non-inclusion criteria were applied: bad vision, heart condition, neurological or psychological diseases, and affective troubles. Moreover, people were asked to select a time for the experiment at which they would feel alert. Finally, we asked them not to drink coffee or tea in the 2 h before the experiment.

#### **MATERIAL**

For our recordings, we used the following sensors: ElectroEncephaloGram (EEG, 28 active electrodes in a 10/20 system without T7, T8, Fp1, and Fp2), ElectroCardioGram (ECG, two active electrodes), facial ElectroMyoGram (EMG, two active electrodes), ElectroOculoGram (EOG, four active electrodes), breath belt (SleepSense), pulse (g.PULSEsensor), and a galvanic skin response sensor (g.GSRsensor). All sensors were connected and amplified with three synchronized g.USBAmp amplifiers (g.tec, Austria). The workload task was designed in the Presentation software (Neurobehavioral Systems, www.neurobs.com/presentation) and EEG signals were recorded and visually inspected with OpenViBE (Renard et al., 2010). **Figure 1** shows a participant sitting fully-wired in the experimental environment.

Subjects were first asked to sign an informed consent form and to fill out three questionnaires: one assessing personal characteristics (such as gender, age and education), and forms Y-A (anxiety state) and Y-B (anxiety trait) of the State-Trait Anxiety Inventory (STAI) (Spielberger et al., 1970) (see below for details). Then, all the sensors were installed and a 3 min baseline was recorded. To avoid order effects, we counterbalanced the order of stress and relax condition (affective context) and 0-back and 2-back task (workload blocks), resulting in four scenarios (see **Figure 2A**). Each scenario was composed of 12 workload blocks in the stressful and 12 workload blocks in the relaxed context. The scenarios therefore begin with either relaxation or stress induction, and the workload blocks either start with the low workload (0-back) or high workload (2-back) condition. In each affective context, the subject performs, in alternating order, six times each workload condition (low/high) (6 × 2 × 2 = 24 min per context), with a short break after six tasks (i.e., after about 12 min). After each context was completed, that is after the induction phase and the 12 workload blocks, the STAI form Y-A questionnaire was administered again to assess the anxiety state. Finally, the sensors were removed and the participant was debriefed about the aim of the experiment.

**FIGURE 1 | A fully wired participant in the experimental environment during the relaxation induction period.**
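The counterbalancing scheme can be illustrated with a short Python sketch that enumerates the four scenarios and the block sequence within each. The names `build_scenarios`, `"relax"` and `"stress"` are hypothetical, and the choice that both contexts of a scenario start with the same workload condition is an assumption made here for illustration, since the text does not state it.

```python
from itertools import product

BLOCKS_PER_CONTEXT = 12  # six 0-back and six 2-back blocks, alternating
BLOCK_MINUTES = 2        # each n-back block lasts 2 min


def build_scenarios():
    """Enumerate the four scenarios: first context x first workload condition."""
    scenarios = []
    for first_context, first_task in product(("relax", "stress"), ("0-back", "2-back")):
        second_context = "stress" if first_context == "relax" else "relax"
        second_task = "2-back" if first_task == "0-back" else "0-back"
        # Alternate the two workload conditions within a context.
        tasks = [first_task if i % 2 == 0 else second_task
                 for i in range(BLOCKS_PER_CONTEXT)]
        # Assumption: the second context reuses the same starting condition.
        schedule = [(ctx, task)
                    for ctx in (first_context, second_context)
                    for task in tasks]
        scenarios.append(schedule)
    return scenarios


scenarios = build_scenarios()
minutes_per_context = BLOCKS_PER_CONTEXT * BLOCK_MINUTES
print(len(scenarios), "scenarios,", minutes_per_context, "min of n-back per context")
```

Each schedule lists 24 (context, task) pairs, i.e., 12 blocks per affective context.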

#### *Stress and relaxation inductions*

In order to manipulate stress, we used a stress-induction protocol based on the Trier Social Stress Task (TSST) (Kirschbaum et al., 1993) and a relaxation condition using a resting phase, music and/or videos. The stress-induction protocol is composed of three parts lasting together about 15 min and it requires the participation of three people, "the committee," who are presented as being body language experts. In the first part, a member of the committee asks the subject to prepare, during 5 min, a fake job interview for a position fitting the professional profile of the subject. During the second part, the committee asks the person to do this job interview and to speak about himself for 5 min. They tell the subject that he is filmed for a future behavioral analysis and take notes during the whole interview. The committee acts serious and neutral/unresponsive toward the subject. The third part is a 3 min long arithmetic task: the subject has to count from 2083 to 0 in steps of 13 and to begin again at any mistake or hesitation. At the end of this protocol, in order to keep the stress level high, the committee tells the subject he will be filmed during the workload tasks and that he will have to do another interview, which will be longer, and a self-evaluation based on the recorded film material after it. Furthermore, during the experiment, participants receive visual feedback about their performance in the workload tasks. During the stress condition, this feedback was modified to display a performance 5–10% below their actual performance. Thereby, this protocol includes psychosocial stress and uncontrollability in order to maximize the chance to trigger a stress response for all the participants (Dickerson and Kemeny, 2004). 
On the other hand, the goal of the relaxation induction was to create a condition (referred to as "relax" condition) in which participants would be able to relax and thus execute the workload task without the influence of additional psychosocial and psychological stressors. To allow for an effective relaxation, participants were allowed to choose between resting in silence or select music/videos that would help them to feel calm (Krout, 2007). In order to measure the level of anxiety of the subjects and thereby to validate the stress/relax manipulation, the "State Trait Anxiety Inventory" (Spielberger et al., 1970) is used. It is composed of two scales of 20 propositions each: STAI form Y-A and STAI form Y-B. STAI form Y-A score measures anxiety state and is increased when the person currently experiences psychological stress. A college student (female/male) has a mean state anxiety index of 35/36, while values higher than 39/40 have been suggested to detect clinically significant symptoms (see Julian, 2011).

#### *Workload tasks*

We used the n-back task (Kirchner, 1958) as workload task (see **Figure 2B**), as it makes it easy to modify workload while keeping visual stimulation and behavioral motor requirements the same. Similar to Grimes et al. (2008) and Brouwer et al. (2012), we opted for a manipulation of task difficulty to manipulate workload. Specifically, we used 0-back (low workload) and 2-back (high workload) variants of the n-back task, which were presented in blocks of 2 min each. In both tasks, a stream of 60 white letters appears on a black background on the screen. Each letter is presented for 500 ms, followed by an inter-stimulus interval of 1500 ms. Among these letters, 25% are targets. In both tasks, when a letter appears, the subject is asked to perform a left mouse click if this is a target letter, and a right mouse click otherwise. For the 0-back task, the low workload condition, the target is the letter "X": each time an "X" appears, the subject has to do a left click, and in all the other cases he has to do a right click. For the 2-back task, the high workload condition, the subject has to do a left click if the letter that appears is the same as the one preceding the last letter. For example, if the sequence "C A C" appeared, the second "C" would be a target. At the end of each 2-min block, the subject has to report his level of arousal (on a scale from 1 to 9) (Bradley and Lang, 1994) and the perceived effort necessary to perform the task (Rating Scale of Mental Effort, RSME; Zijlstra, 1993). Finally, a screen with his performance during the block (see section 4.3.2) appears. As mentioned before, during the stressful condition, this displayed performance is lower than the actual performance to induce additional uncertainty.
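The target rules of the two task variants can be written down compactly. The following Python sketch is only an illustration of the rules as described above; the function names `nback_targets` and `expected_click` are hypothetical and do not come from the Presentation script used in the experiment.

```python
def nback_targets(letters, n):
    """Return one boolean per letter: True if that letter is a target.

    n == 0: the target is the fixed letter "X".
    n >= 1: a letter is a target if it equals the letter shown n positions earlier.
    """
    if n == 0:
        return [letter == "X" for letter in letters]
    return [i >= n and letters[i] == letters[i - n] for i in range(len(letters))]


def expected_click(is_target):
    """Left mouse click for targets, right mouse click for non-targets."""
    return "left" if is_target else "right"


stream = list("CACXBXB")
print(nback_targets(stream, 0))  # 0-back: only the "X" letters are targets
print(nback_targets(stream, 2))  # 2-back: compare with the letter two steps back
```

With the example sequence "C A C" from the text, `nback_targets(list("CAC"), 2)` marks only the second "C" as a target.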

#### **PROTOCOL VALIDATION METHODS**

#### *Self-assessment data*

To investigate the effect of the psychosocial stress induction on the STAI score, we computed an ANOVA on this score with the three factor levels "baseline," "after relaxation," and "after stress induction." To assess the effect of both stress and workload manipulation, we conducted 2 (stress) × 2 (workload) ANOVAs for the averaged-over-blocks ratings on the arousal scale of the Self-Assessment Manikin (SAM; Bradley and Lang, 1994) and on the RSME.

#### *Behavioral data*

To investigate the effects of the experimental manipulations on behavior, we calculated the performance per block based on the number of true positive (*TP*), true negative (*TN*), false negative (*FN*), and false positive (*FP*) responses resulting from the button presses within the n-back task (left click for targets, right click for non-targets) using the following equation:

$$\mathit{Perf} = \frac{TP + TN}{TP + TN + FP + FN}$$

As for ratings, we analyzed the data in a 2 (stress) × 2 (workload) ANOVA.
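As a small worked example of the performance measure, the sketch below computes it for one block. The function name `block_performance` and the response counts are hypothetical values chosen for illustration, not data from the study.

```python
def block_performance(tp, tn, fp, fn):
    """Proportion of correct responses (hits plus correct rejections) in one block."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("block contains no responses")
    return (tp + tn) / total


# Hypothetical block: 60 letters, 15 targets (25%), 2 misses and 3 false alarms.
perf = block_performance(tp=13, tn=42, fp=3, fn=2)
print(round(perf, 3))
```

Here 55 of the 60 responses are correct, so the block performance is 55/60.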

#### *Physiological data*

Physiological responses were analyzed with respect to heart rate (HR) and galvanic skin response (GSR). Before applying statistical methods, the GSR data were pre-processed by extracting the mean GSR value (μS) for each block and then averaging these values over blocks as described above. The ECG signal was band-pass filtered between 5 and 200 Hz, and a 48–52 Hz notch filter was applied to reduce power line noise, before the mean HR for each of the blocks was extracted. As for the former analyses, we analyzed the data with a 2 (stress) × 2 (workload) ANOVA. We report effects as significant if $p < 0.05$ and as a trend if $p < 0.1$. For all ANOVAs, partial eta squared values ($\eta_p^2$) are calculated as a measure of effect size.
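For reference, partial eta squared is conventionally obtained from the ANOVA sums of squares of the effect of interest and of its error term. The expression below is the standard textbook definition and is not a formula given in the original text; the subscripts "effect" and "error" are generic labels.

$$\eta_p^2 = \frac{SS_{\mathrm{effect}}}{SS_{\mathrm{effect}} + SS_{\mathrm{error}}}$$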

#### **EEG SIGNAL PROCESSING**

Our system aims at estimating the level of mental workload of the user from their EEG signals. To do so, we employed a machine learning approach based on state-of-the-art algorithms developed for Brain-Computer Interface (BCI) technologies (Lotte et al., 2007; Blankertz et al., 2008; Ang et al., 2012). This section describes the way EEG signals were preprocessed and segmented into trials, the machine learning algorithms used, as well as the approach followed for evaluating our method (see **Figure 3** for a schematic overview of these procedures).

#### *EEG preprocessing and segmentation*

We first cleaned the signals from eye movement (EOG) contamination using the automatic method proposed in Schlögl et al. (2007). The EEG signals from each 2 min n-back task were then divided into 60 EEG trials, i.e., one EEG trial per letter appearance. More precisely, each EEG trial was defined as starting at a letter appearance onset and ending 2 s later, i.e., just before the next letter appearance. This resulted in 60 EEG trials per task, i.e., 720 trials per workload level (360 trials in the stressful condition, 360 in the non-stressful condition). Among them, trials corresponding to target letters were discarded in order to avoid confounding and interfering effects that may result from Event Related Potentials (ERP, notably a P300) likely to be triggered by target identification. This left 540 trials per workload level (270 trials per psychosocial stress condition).

**from EEG signals. Top:** training set, aiming at identifying the relevant frequency bands (i.e., spectral filters) and channels (i.e., spatial filters), using the Filter Bank CSP and REFSF approach. **Bottom:** testing set, using the optimized spectral and spatial filter to estimate the workload level from an unknown EEG trial. (CSP, Common Spatial Patterns; REFSF, regularized Fisher spatial filter; mRMR, maximum Relevance Minimum Redundancy; LDA, Linear Discriminant Analysis).
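The segmentation and the resulting trial counts can be checked with a minimal Python sketch. The sampling rate `FS`, the regular target pattern and the function names are assumptions made for illustration; the text does not state the sampling rate, and in the experiment the target positions were of course not regular.

```python
def epoch_bounds(onsets_s, fs, length_s=2.0):
    """Return (start, stop) sample indices of one trial per letter onset."""
    n = int(round(length_s * fs))
    return [(int(round(t * fs)), int(round(t * fs)) + n) for t in onsets_s]


def keep_non_targets(epochs, is_target):
    """Discard trials that are time-locked to target letters."""
    return [ep for ep, target in zip(epochs, is_target) if not target]


FS = 512                                    # assumed sampling rate in Hz
onsets = [2.0 * i for i in range(60)]       # one letter every 2 s in a 2 min block
targets = [i % 4 == 0 for i in range(60)]   # placeholder pattern with 25% targets

epochs = keep_non_targets(epoch_bounds(onsets, FS), targets)
blocks_per_level = 12                       # six blocks per affective context
trials_per_level = len(epochs) * blocks_per_level
print(len(epochs), "trials kept per block,", trials_per_level, "per workload level")
```

Under these assumptions each block keeps 45 of its 60 trials, which gives the 540 trials per workload level reported above.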

#### *Machine learning algorithms*

In order to estimate workload levels from EEG signals, we investigated two different types of neurophysiological information: (1) oscillatory activity and (2) Event Related Potentials (ERP), both of which having been shown to be useful for such a task (Brouwer et al., 2012). We set up state-of-the-art signal processing pipelines in order to estimate workload using these two types of information, both individually and in combination (see **Figure 3**). They are described below:

*Oscillatory activity.* To classify low mental workload vs. high mental workload in EEG signals based on oscillatory activity, we used a variant of the Filter Bank Common Spatial Patterns (FBCSP) algorithm (Ang et al., 2012) in order to learn optimal spatial and spectral features, i.e., EEG frequency bands and channels. The FBCSP is one of the most efficient algorithms to extract spatio-spectral features from EEG signals. It was indeed the algorithm used by the winners of the last BCI competition on all EEG data sets (Ang et al., 2012; Tangermann et al., 2012), showing the superiority of this method over other approaches. The FBCSP-based approach we employed works as follows. The first step, the training step, consists of identifying the most relevant frequency bands (i.e., spectral filters) and EEG channels (i.e., spatial filters), using examples of EEG signals from the high and low workload conditions (see below for details on the definition of the training sets). To do so, we first filter each training EEG trial into multiple frequency bands using a bank of band-pass filters. Here we used band-pass filters in the following frequency bands, which correspond to classical EEG rhythms: *δ* (1–4 Hz), *θ* (4–8 Hz), *α* (8–12 Hz), *β* (12–30 Hz), *γ* (30–47 Hz), and high *γ* (53–90 Hz). Then for each of these bands, the band-pass filtered EEG trials are used to optimize spatial filters, i.e., linear combinations of the original EEG channels. These spatial filters are optimized using the Common Spatial Pattern (CSP) algorithm (Blankertz et al., 2008), which finds the optimal channel combination such that the power of the resulting spatially filtered signals is maximally discriminant between the two conditions (here, low and high workload). We optimize 12 (6 pairs) such CSP filters for each frequency band. Then, the power of the spectrally and spatially filtered EEG signals is used as features, resulting in each EEG trial being described by 72 features (12 CSP filters × 6 frequency bands). From these 72 features, the 18 most relevant ones are selected using the maximum Relevance Minimum Redundancy (mRMR) feature selection algorithm (Peng et al., 2005). This amounts to selecting the 18 most relevant pairs of spectral and spatial filters. Finally, the 18 selected power features are used to train a shrinkage Linear Discriminant Analysis (LDA) classifier (Blankertz et al., 2010; Lotte and Guan, 2010) to discriminate low workload EEG trials from high workload ones. This concludes the training step. For testing, i.e., to predict the workload level of a given EEG trial, the EEG signals are first filtered using the 18 selected pairs of spectral and spatial filters, then the power of the resulting signals is computed and given as input to the previously trained LDA classifier, whose output indicates the workload level (high or low).
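To make the CSP step more concrete, the following NumPy sketch learns spatial filters for a single frequency band and computes log-power features. It is a simplified illustration under stated assumptions, not the implementation used in the study: the band-pass filter bank, mRMR selection and shrinkage LDA are omitted, the data are synthetic, the logarithm of the variance is used as the power feature, and all names (`csp_filters`, `log_power`, `make_trials`) are hypothetical.

```python
import numpy as np


def csp_filters(trials_a, trials_b, n_pairs=6):
    """Learn 2 * n_pairs CSP spatial filters from (channels x samples) trials."""
    def mean_cov(trials):
        return np.mean([np.cov(t) for t in trials], axis=0)

    cov_a, cov_b = mean_cov(trials_a), mean_cov(trials_b)
    # Whiten the composite covariance, then diagonalize class A in that space.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    d, v = np.linalg.eigh(whitener @ cov_a @ whitener.T)  # ascending eigenvalues
    filters = v.T @ whitener                              # one spatial filter per row
    # Keep the filters from both ends of the eigenvalue spectrum.
    picks = np.r_[0:n_pairs, len(d) - n_pairs:len(d)]
    return filters[picks]


def log_power(filters, trial):
    """Log-variance of the spatially filtered trial, one feature per filter."""
    return np.log(np.var(filters @ trial, axis=1))


rng = np.random.default_rng(0)


def make_trials(strong_channel, n_trials=30, n_channels=8, n_samples=256):
    """Synthetic trials in which one channel carries more power (toy data)."""
    trials = rng.standard_normal((n_trials, n_channels, n_samples))
    trials[:, strong_channel, :] *= 3.0
    return list(trials)


low, high = make_trials(0), make_trials(1)
W = csp_filters(low, high, n_pairs=2)
feat_low = np.mean([log_power(W, t) for t in low], axis=0)
feat_high = np.mean([log_power(W, t) for t in high], axis=0)
print(W.shape, feat_low.round(2), feat_high.round(2))
```

In the pipeline described above, this step would be repeated for each of the six bands with `n_pairs=6`, yielding the 72 candidate features per trial.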

*Event related potentials.* To classify low mental workload vs. high mental workload in EEG signals based on ERP, we first band-pass filtered the signals between 0.5 and 16 Hz, and downsampled them to 36 Hz, to reduce the signal dimensionality. We only used the first second of EEG signals from each trial (i.e., the first second after letter presentation in the n-back task) to analyse ERP, i.e., 36 samples per channel. Then, based on this first second of EEG signals from the training set, we learned optimal spatial filters for the discrimination of ERP based on EEG samples, by using the Fisher Spatial Filters (FSF) proposed by Hoffmann et al. (2006). We extracted 6 such spatial filters, which resulted in 216 features (6 filters × 36 EEG samples per filter), using a regularization parameter *λ* = 0.4 for optimizing the FSF for all subjects. We finally selected 18 features (i.e., 18 EEG samples) out of these 216 initial ones, using mRMR feature selection. These 18 selected features were used to train a shrinkage LDA. For testing, the EEG signals were preprocessed in the same way (i.e., band-pass filtered in 0.5–16 Hz and downsampled to 36 Hz), spatially filtered using the 6 Fisher Spatial Filters optimized during training, and the 18 resulting selected features were used as input to the previously trained LDA classifier, whose output indicates the workload level (high or low).

*Combination of oscillatory activity and ERP.* In order to combine both oscillatory activity and ERP information, we extracted 18 FBCSP features as described above and 18 ERP features, as described above as well, from each trial. These 36 features were concatenated into a single feature vector, which was used as input to a shrinkage LDA classifier.

# *Evaluation scheme*

The performance of our workload-level estimator was assessed using sixfold stratified Cross-Validation (CV), separately for each subject. This means the data from each subject was divided into six parts, each part containing the same number of trials from each class (high/low workload). Five of these parts were used for training, i.e., to identify the relevant spectral and spatial filters, as well as to train the LDA classifier. The 6th part was used for testing the resulting workload-level estimator for that subject. This process was repeated six times, with each part used exactly once as the testing set. For three subjects we used only three- and fourfold CV due to missing blocks at the end of the recording. The performances obtained on each testing part, here the classification accuracies (i.e., the rate of trials with correctly estimated workload level), were then averaged to give a final performance of the workload-level estimator for that subject.
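The evaluation scheme can be sketched as follows. This is a minimal illustration, assuming that trials are assigned to folds at random within each class; the study may instead have split the data by task block. The function names and the `train_and_score` callback are hypothetical placeholders for the filter optimization, feature selection, and LDA training described above.

```python
import random


def stratified_folds(labels, n_folds=6, seed=0):
    """Split trial indices into folds with equal class counts per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled trials of this class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def cross_validate(labels, train_and_score, n_folds=6):
    """Average the test accuracy over folds, each fold serving once as test set.

    train_and_score(train_idx, test_idx) is a user-supplied (hypothetical)
    callback that trains on train_idx and returns the accuracy on test_idx.
    """
    folds = stratified_folds(labels, n_folds)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)
```

Keeping all filter learning and feature selection inside `train_and_score` ensures that the held-out part never influences the trained estimator.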

The goal of our work is to design a generic workload-level estimator, usable in practice, i.e., that can work across different affective contexts (here, different psychosocial stress levels). To do so, we performed different evaluations to estimate (1) the general performance of our system, independently of the affective context; (2) how it behaves *within* a given affective context; (3) how it behaves *across* different affective contexts, i.e., whether a workload-level estimator calibrated on data from a given affective context (e.g., a relaxed condition) can be used to estimate workload in another affective context (e.g., a stressful condition); (4) whether effects of time can explain the across-context loss of classification performance; and (5) whether calibrating our system with data from different affective contexts makes the system better or worse, even if used in a single affective context. Different sub-parts of the data were thus used for training and testing within our CV scheme, in particular:


affective contexts, we devised an analysis similar to the above two analyses, but with the first and second half of each context instead of the relax and stress contexts. Therefore, we trained our classifiers on the data of 4 blocks and tested them on 2 blocks from either the same or the other half of the context. This was done in a threefold cross-validation scheme and resulted in two within and two across classification performance values (one from the first half to the second half, and one backwards) for each affective context. These were averaged over the affective contexts and yielded one value for the workload classification accuracy for within- and across-context (i.e., "half") per participant per half<sup>2</sup>. For a genuine effect of affective context, rather than an effect of the time passing between both contexts, the "within vs. across halves" performance loss for a classifier trained on only one half should be smaller than the "within vs. across affective context" performance loss for a classifier trained on only one affective context.

5. **Calibration across affective context performance estimation:** When considering different affective contexts, an interesting question is whether using data from different contexts to calibrate the workload-level estimator will make it better or worse, notably as compared to the within affective context evaluation. Indeed, on the one hand, using data from different contexts can force the machine learning approach to identify workload indices that are invariant to the affective context, thus improving the system, but on the other hand it adds more noise and variability to the data, which can impede the machine learning process. Therefore, with this evaluation, in each fold of the cross-validation, 20 blocks were available for training, coming from both the relaxed and stressful condition, and 2 blocks were available for testing, coming from either the stressful or the relaxed condition (but not both). To ensure that the comparison of this approach with the within-context approach is fair, we had to use the same number of training trials for each approach. Indeed, using all the trials available in the 20 training blocks would mean using more training trials than in the within-context evaluation, which could result in higher performance simply due to a larger number of training trials. Therefore, for this last evaluation, we randomly selected 6 blocks from each context for training, from 4 of which all trials were used, while we selected every other trial from the remaining 2 blocks to keep the workload classes balanced within context. A further two blocks were selected from each context for testing. This procedure was repeated six times for a cross-validation comparable to the within-/across-context evaluation.

# **RESULTS**

In this section, we first present the validation analysis, suggesting that our protocol indeed induced different levels of workload and stress (Q1). Then the results of the EEG-based workload classification over, within, and across affective contexts are presented, showing that a state-of-the-art subject-specific workload classifier (Q2) has difficulty generalizing over affective contexts (Q3), but can be rendered less context-sensitive by calibration across affective contexts (Q4).

<sup>2</sup>For three subjects, the averaging only contained data from the stress context due to missing blocks in the 2nd half of the relax context.

#### **VALIDATION OF THE PROTOCOL**

#### *Subjective indicators*

Each subject filled in three "STAI form Y-A" (state) questionnaires: one at the beginning of the experiment (*STAI*<sub>BL</sub>) and one at the end of each affective context, that is, after performing the n-back tasks under the stress or relax condition (stress: *STAI*<sub>S</sub>; relax: *STAI*<sub>R</sub>) (see **Figure 4A**). Three data sets were excluded due to incompleteness. A repeated-measures ANOVA (*N* = 21) with the factor levels "baseline," "stress," and "relax" showed a significant difference in perceived anxiety between the conditions [*F*(2, 20) = 3.6225, *p* < 0.05, *η*<sub>p</sub><sup>2</sup> = 0.108]. We conducted *post hoc* analyses using paired *t*-tests with the hypothesis that subjectively perceived anxiety increases due to the stress induction procedure relative to the baseline and relaxation conditions. The results suggest that the stress-induction protocol indeed increases anxiety compared to the baseline and relaxation conditions, and keeps it significantly higher until measured at the end of the affective context (see **Figure 4A**): *STAI*<sub>S</sub> scores (mean = 37.5 ± 12.6) are significantly higher [*t*(20) = 2.87, *p* = 0.01] than *STAI*<sub>BL</sub> scores (mean = 30.1 ± 4.6) and they are significantly higher [*t*(20) = 2.37, *p* = 0.028] than *STAI*<sub>R</sub> scores (mean = 32.2 ± 8.6). This increased anxiety seems mainly due to the interview and the apprehension of a final evaluation, rather than due to the n-back task as such: we found no difference between *STAI*<sub>R</sub> and *STAI*<sub>BL</sub> [*t*(20) = 1.27, *p* = 0.22], that is, when they performed the n-back tasks knowing that there would be no evaluation.

We furthermore asked the subjects after each block to rate their arousal on the respective scale of the Self-Assessment Manikin (see **Figure 4B**) and to rate the mental effort on the Rating Scale Mental Effort (see **Figure 4C**). Two data sets were excluded due to incompleteness. We submitted the data of each scale to a 2 (stress) × 2 (workload) repeated-measures ANOVA. Regarding the subjectively perceived arousal, we only found a main effect of the workload manipulation [*F*(1, 21) = 4.444, *p* = 0.047, *η*<sub>p</sub><sup>2</sup> = 0.175] with higher perceived arousal for the 2-back task (mean = 4.7 ± 1.4) compared to the 0-back task (mean = 4.3 ± 1.7). Regarding the subjectively perceived workload, we only found a main effect of the workload manipulation [*F*(1, 21) = 63.216, *p* < 0.0001, *η*<sub>p</sub><sup>2</sup> = 0.751] with higher perceived effort for the 2-back task (mean = 48.1 ± 11.5) compared to the 0-back task (mean = 28.6 ± 12.9).

#### *Objective indicators*

For the analysis of the objective indicator of behavioral performance, we logged all responses and computed the task accuracy for each task block (see **Figure 5A**). Two data sets were excluded due to incompleteness. We submitted the accuracy to a 2 (stress) × 2 (workload) repeated-measures ANOVA. As for the subjective indicators of perceived arousal and effort, we found a main effect of the workload manipulation [*F*(1, 21) = 65.251, *p* < 0.0001, *η*<sub>p</sub><sup>2</sup> = 0.757] with higher accuracy for the simple 0-back task (mean = 97.3 ± 2.0) compared to the hard 2-back task (mean = 91.1 ± 4.8).

As a further objective indicator, we computed skin conductance level and heart rate. Four data sets were excluded due to incompleteness. For heart rate analysis a further data set was excluded due to malfunctioning sensors. We submitted the data of the physiological signals to a 2 (stress) × 2 (workload) repeated-measures ANOVA. For GSR (see **Figure 5B**), we found an increase of the skin conductance level [*F*(1, 19) = 4.4806, *p* = 0.048, *η*<sub>p</sub><sup>2</sup> = 0.191], indicating higher sympathetic arousal during the stress condition (mean = 3.83 ± 2.05) compared to the relax condition (mean = 3.52 ± 2.07). Skin conductance level also increased for the high compared to the low workload condition, though not significantly. For HR (see **Figure 5C**), we found a trend toward an increase of the heart rate [*F*(1, 18) = 3.2123, *p* = 0.089, *η*<sub>p</sub><sup>2</sup> = 0.151], indicating higher sympathetic arousal during the stress condition (mean = 79.41 ± 10.23) compared to the relax condition (mean = 78.30 ± 10.08). More importantly, we found a highly significant effect of the workload manipulation on HR [*F*(1, 18) = 36.1431, *p* < 0.0001, *η*<sub>p</sub><sup>2</sup> = 0.667], with a higher HR for the more challenging 2-back task (mean = 80.4 ± 9.89) compared with the relatively easy 0-back task (mean = 77.27 ± 10.19).

In summary, we found evidence for the validity of the stress and workload induction (Q1) in both the subjective (questionnaires) and objective (performance and physiological sensors) measures. This ensures that calibrating and evaluating a workload classifier on the EEG recorded with this protocol is meaningful.

#### **CLASSIFICATION OF EEG**

#### *General performance estimation*

In this section we report the general classification performance for training on the whole data set, showing that our setup is state-of-the-art compared to similar studies, hence positively answering question Q2. Specifically, we obtained performances similar to the best performances that were presented more recently with the n-back task paradigm and with short 2 s trials by Grimes et al. (2008) and Brouwer et al. (2012). The data of two participants were excluded due to incompleteness and those of another one due to malfunctioning EEG sensors.

For the training and testing on the basis of all available data, i.e., the trials recorded during the stress *and* relax contexts, we achieved an average classification accuracy of 76.1% when using only frequency-domain features, with performances between 58.7% and 95.4% (see **Figure 6**). Following Müller-Putz et al. (2008), we determined the above chance-level performance via a binomial test. For a two-class problem and given the 1080 trials used in our sixfold cross-validation scheme, the chance level is at 53.1% for *p* = 0.05. Consequently, the classification performance was above chance for each subject, with a highly significant better-than-random performance for the average result over all subjects (*p* < 0.0001).
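To make the chance-level reasoning concrete, the following sketch computes the smallest number of correct decisions that is significantly better than guessing in a balanced two-class problem, using an exact one-sided binomial tail. The function name and the one-sided formulation are assumptions made for illustration; the exact threshold depends on whether the test is one- or two-sided and on the interval method used, so the result need not match the reported value to the decimal.

```python
from fractions import Fraction
from math import comb


def chance_threshold(n_trials, alpha=Fraction(1, 20)):
    """Smallest k with P(X >= k) <= alpha for X ~ Binomial(n_trials, 0.5).

    Exact integer arithmetic is used so that large trial counts do not
    underflow or overflow floating-point numbers.
    """
    total = 2 ** n_trials          # number of equally likely outcome sequences
    tail = 0                       # sequences with at least k correct decisions
    for k in range(n_trials, -1, -1):
        tail += comb(n_trials, k)
        if Fraction(tail, total) > alpha:
            return k + 1           # the previous k was the last significant one
    return 0


# Threshold expressed as an accuracy for the 1080 trials of one subject.
n = 1080
print(f"chance-level accuracy: {chance_threshold(n) / n:.3f}")
```

The threshold shrinks toward 50% as the number of trials grows, which is why a relatively modest accuracy can still be significantly above chance with 1080 trials.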

Subsequently, we tested the previously observed increase of performance for increasing decision intervals, that is, when more data is available for testing (Grimes et al., 2008; Brouwer et al., 2012). A majority vote over the classifier decisions for all 45 relevant trials of a given block, using only frequency-domain features, leads to an accuracy of 96%, well over the 71% chance level resulting from a binomial test on the basis of 24 decisions (one per block). For time-domain features, we observed an average accuracy of 74% for 2 s trials (of which only the first second was used), and 96% for the judgement after 45 trials. For both feature varieties in combination, the 2-second accuracy was the highest with 80.4%, though the block-wise accuracy was only 94.4%. Since all accuracies are well over chance level, the classification schemes used provide solid classification performance for all feature varieties, with the combined frequency- and time-domain features performing best for short estimation intervals and the separate feature varieties performing best for the long decision intervals.
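The block-wise decision can be sketched as a simple majority vote over the per-trial classifier outputs. The label strings and the function name below are illustrative assumptions; with an odd number of trials per block (here 45) and two classes, ties cannot occur.

```python
def majority_vote(trial_labels):
    """Return the workload label predicted for most trials of a block."""
    high_votes = sum(1 for label in trial_labels if label == "high")
    # More than half of the trials must vote "high" for a "high" decision.
    return "high" if high_votes > len(trial_labels) / 2 else "low"


# Example: 45 per-trial predictions from one block, 30 of them "high".
block_predictions = ["high"] * 30 + ["low"] * 15
print(majority_vote(block_predictions))  # prints "high"
```

Aggregating in this way trades temporal resolution for reliability: one decision per block is more accurate than one decision per 2 s trial, but it becomes available only after the whole block has been recorded.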

From a scientific point of view it is necessary to know the source of the classification performance: is the information of neural origin or is it derived from muscular activity, which is known to contaminate higher frequency bands of the EEG (Goncharova et al., 2003)? Although this question is often avoided in previous works (Grimes et al., 2008), we tried to answer it by first computing the percentage of the features selected from each frequency band in the FBCSP algorithm. As **Figure 6** indicates, the majority (about 65%) of the features selected with the mRMR feature selection algorithm came from lower frequency bands (i.e., delta, theta, alpha). However, the remaining 35% originated in high frequency bands, those over 12 Hz (beta, gamma, gamma2). To ensure that the classifier performance does rely on neuronal sources and not on muscle activity, we repeated the workload classifier evaluation excluding these potentially contaminated high frequency bands, both for training and testing. We achieved a somewhat lower, but again much better-than-random (*p* < 0.0001) classifier performance of 74.2%, with accuracies between 53.9% and 88.2%. This suggests that our workload classifier does rely mostly on neural information from low frequency bands.

#### *Within- vs. across-context estimation*

In this section we tested the generalization of the classifier to a different affective context (question Q3). To evaluate how the testing performance depends on the training context, we conducted a 2 (training context: relax, stress) × 2 (testing context: same-as-training, different-from-training) repeated-measures ANOVA for each feature type. **Figures 7**, **8** depict, respectively, the average classifier performance when tested within and across affective context, and the average loss of performance for the three feature varieties used (and the loss for the specific frequency bands).

The main effect found for the testing context when using frequency-domain features alone [*F*(1, 20) = 5.610, *p* = 0.028, *η*<sub>p</sub><sup>2</sup> = 0.219] shows that the transfer from one context to another is problematic and results in a decrease of classifier performance (mean = 69.4 ± 9.7%) compared to testing on the same context as for the training (mean = 72.4 ± 9.4%). An exploratory analysis of the effect of context change on classifiers using only specific frequency bands revealed a significant contribution of the low frequency bands to the performance decline, while the less relevant high frequency bands were not or only minimally contributing (see **Figure 8**).

For time-domain features alone, the decrease of classifier performance across contexts is also significant, and stronger [*F*(1, 20) = 21.002, *p* < 0.001, *η*<sub>p</sub><sup>2</sup> = 0.512], with a lower across-context classification performance (mean = 69.1 ± 5.5%) compared to within-context classification performance (mean = 73.3 ± 5.1%).

For frequency- and time-domain features combined, the decrease of classifier performance across-context (mean = 73.2 ± 8.8%) compared to within-context (mean = 77.3 ± 7.9%) is also marked [*F*(1, 20) = 12.104, *p* = 0.002, *η*<sub>p</sub><sup>2</sup> = 0.377].
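For readers who want to relate the reported effect sizes to the test statistics, partial eta squared can be recovered from an *F* value and its degrees of freedom through the standard identity η<sub>p</sub><sup>2</sup> = *F*·df<sub>effect</sub> / (*F*·df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies this identity to the three testing-context main effects reported above; the function name is an illustrative choice.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Testing-context main effects reported above, all with F(1, 20).
reported = [("frequency-domain", 5.610), ("time-domain", 21.002), ("combined", 12.104)]
for feature_type, f_value in reported:
    print(feature_type, round(partial_eta_squared(f_value, 1, 20), 3))
# frequency-domain 0.219
# time-domain 0.512
# combined 0.377
```

The recomputed values agree with the effect sizes given in the text, which serves as a quick consistency check on the reported statistics.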

To rule out that the differences between within-context and across-context training were caused by the time passing between affective contexts, we divided each context into two parts (1st half, 2nd half) and trained and tested the classifiers in the same manner as done for the within (e.g., training and test on 1st half) and across affective context (e.g., training on 1st half and test on 2nd half) tests. With the data averaged over affective contexts, we conducted a 2 (training context: 1st half, 2nd half) × 2 (testing context: same-as-training, different-from-training) repeated-measures ANOVA for each feature type. We did not find the pattern of performance loss that we observed for within vs. across affective context testing. Surprisingly, the only effect we found was an increase of performance for across vs. within context (half) testing for the frequency-domain only feature variety [*F*(1, 20) = 5.142, *p* < 0.04, *η*<sub>p</sub><sup>2</sup> = 0.204] from 61.1% to 63.7%.

**performance of a classifier trained in different training contexts (relax, stress, combined) and tested on data from relax and stress context.** The differences between the testing performance for stress and relax context show generalize to another context. The higher performance for the combined training set relative to the training on data from a single context indicates a gain of the classifier in invariance and hence a protection against over-fitting.

In summary, all feature varieties were found to be susceptible to changes in affective context. For the frequency-domain features, only classifiers using the low frequency bands of delta, theta and alpha decline significantly in performance when tested in an affective context different from the training context (see **Figure 8**). However, as we showed, these frequency bands are the most informative regarding the workload level. An additional test of the within vs. across effects between the 1st and 2nd half of the affective contexts on classifier performance showed that the time effect alone does not lead to a consistent decrease of performance.

#### *Across-context calibration*

To evaluate the use of a combined training context to increase the capability of the classifier to generalize over affective contexts (question Q4), we conducted a 2 (training context: average single, combined) × 2 (testing context: stress, relax) repeated-measures ANOVA for each feature type. The specific effects of across-context calibration in comparison to single context (stress and relax) calibration are depicted in **Figure 7**.

The main effect of the training context for frequency-domain features alone [*F*(1, 20) = 6.816, *p* = 0.017, *η*<sub>p</sub><sup>2</sup> = 0.254] indicates a higher performance for training with combined (mean = 72.4 ± 9.5%) vs. with single affective context (mean = 70.9 ± 9.3%). There is no significant difference between testing on the (optimal) same context vs. combined testing.

For time-domain features the increase of classifier performance between single (mean = 71.2 ± 5.2%) and combined context (mean = 72.1 ± 4.9%) training is also significant [*F*(1, 20) = 6.703, *p* = 0.017, *η*<sub>p</sub><sup>2</sup> = 0.251]. Despite the observed increase due to training with combined data from both contexts, there is still a significant decrease of performance of about 1.2% relative to training and testing on the same context [*t*(20) = −3.526, *p* < 0.01].

For frequency- and time-domain features combined, we observed an increase of classifier performance from single (mean = 75.2 ± 8.1%) to combined context training (mean = 76.7 ± 7.6%) with [*F*(1, 20) = 6.306, *p* = 0.021, *η*<sub>p</sub><sup>2</sup> = 0.240]. There is no difference between testing on the (optimal) same context vs. combined testing.

In summary, for those classifiers trained with frequency-domain and combined frequency- and time-domain features, training on combined contexts leads to an increase of performance comparable with (optimal) same context training and testing. For classifiers trained with time-domain features only, we observe a significant increase of classification performance when training on combined context, but there is still a loss of performance compared to the (optimal) same context training and testing. Since the number of trials for both conditions is kept equal, this is evidence for a gain in resilience of the workload classifier against contextual changes, especially for classifiers based on frequency-domain features.

#### **DISCUSSION**

If we want to create passive brain-computer interfaces that work in the wild, we need to take the variability of such environments into account. To test how well a workload classifier would be able to cope with variability due to changes in affective context, we trained it on the data from a subject performing a task under the evaluative pressure of an impending interview, the same subject in a non-stressful setting, and from both contexts.

We validated the experimental protocol using subjective and objective indicators of the psychophysiological activation expected due to stress/relaxation induction and different workload levels. Though we did not see a significant difference in the perceived arousal measure (SAM), higher values for the STAI and increased sympathetic nervous system activity (as indicated by significant differences for GSR and a trend for HR) support a successful induction of anxiety in the stressful compared to the non-stressful condition. Higher perceived arousal and mental demand, higher sympathetic nervous system activity (as indexed by HR) as well as lower behavioral performance for high compared to low workload levels support the efficacy of the workload induction paradigm.

We showed that workload can be classified on the basis of 2 s of neurophysiological signals with an accuracy of 76.1%. This is comparable to previously reported results for such short intervals of data (Grimes et al., 2008; Brouwer et al., 2012). It was shown that the accuracy can be increased using decision-level fusion over the results of several trials (Brouwer et al., 2012) or simply by using longer signal epochs (Grimes et al., 2008), however, with the tradeoff of a less fine-grained, more discrete, and lagging measure of workload. We observed a similar increase of classifier performance to between 94.4% and 96% using a majority vote based on the classifier outcome of the relevant 45 trials of a given block.

While the source of information measured via EEG, neuronal or myographical, might seem of no immediate significance for an application on able-bodied users, it seems relevant to us to ensure that we indeed measure the neural activity implied by pBCI. In this regard, it is noteworthy that the distribution of relevant frequencies varies between subjects. While in general the majority of features (65%) is selected from low frequency bands (delta, theta, alpha), some subjects have a strong contribution of high frequencies (beta, gamma, gamma2) of up to 50%. Since these higher frequency bands are notorious for their response to muscle activity in addition to neuronal information (Goncharova et al., 2003), we tested if the workload classification would suffer considerably when excluding them from the feature pool. The average performance did indeed decrease slightly to 74.2%. However, the highly significant above-chance performance over all subjects indicates only a marginal role of muscular activity in workload estimation<sup>3</sup>. This is in line with other studies that suggest a relevance of low frequency bands for workload (Jensen et al., 2002; Jensen and Tesche, 2002) and its estimation (Zarjam et al., 2013). Consequently, we showed that the trained classifier uses the neural correlates of workload to discern two workload levels with a performance equaling that reported in similar studies.

Regarding the classifier generalization to different affective contexts, we show that a classifier created in a non-stressful context can generalize to a stressful context and vice versa. However, the training context has a significant influence on the classification performance, with decreasing performance for cross-context classification (i.e., from 72.4% to 69.4% for frequency-domain features, from 73.3% to 69.1% for time-domain features, and from 77.3% to 73.2% for features from both domains). Interestingly, we found that a training which takes several relevant contexts into account enables the generalization of the classifier to a certain degree. Classifiers based on frequency-domain and on combined frequency- and time-domain features perform comparably well after training with data from both affective contexts (72.4% and 76.7%, respectively) as after being trained and tested within a specific context. Classifiers based on time-domain features profit as well from a training with data from both affective contexts (72.1%), but still show a declined performance relative to optimal, within-context training and testing.

The current study is limited in its generality by the use of a stress induction paradigm which manipulates affective context only once. We chose the TSST because it is a recognized standard of social stress induction and a powerful elicitor that allows stimuli and task to be kept comparable during the workload session of the stressful and non-stressful conditions. However, since we have only two stress conditions and not several interleaved stress conditions, the stress manipulation is synonymous with a change in time, though with a counter-balanced order. Both affective contexts are separated by at least 10 min and we cannot exclude that signal changes with time played a role for classifier performance. The analysis of effects of time within the affective contexts, however, did not reveal general performance decreases due to time passing and thus adds to the evidence of context-related performance loss. Similarly, the spread of training blocks over a larger time in combined compared to single testing contexts limits comparability of both performance measures. To ensure that our results hold for stress specifically, interleaved stress induction methods can be used, though a viable experiment length, reliability of stress induction, and comparability of stimuli and task need to be guaranteed.

Another limitation of the paradigm can result from a potential interaction of (psychosocial) stress and workload. For example, impaired cognitive processes or increased engagement in the face of evaluative pressure could lead to differences in participant performance between affective contexts (Eysenck and Derakshan, 2011). Despite the lack of such interaction effects in our analysis, the possibility of participants' performance-related differences being reflected in brain activity is a general issue that needs to be considered, since such changes in brain activity would be only indirectly related to stress. Therefore, future research needs to identify the processes that are responsible for the signal variability in the face of psychosocial stress. On a related note, other stressors could be manipulated to identify the source of the performance decrease, for example in terms of impaired cognitive processes.

The result of our study suggests that classification performance for passive BCIs can be increased not only by using a larger quantity of training data, but also by introducing qualitative variations. Here, we varied the stress level of our participants during the task performance. This manipulation is comparable to the variation of the affective context of a task in real-world scenarios, for example task performance under pressure vs. normal task performance. Consequently, to create more reliable BCIs for workload detection, robust against alterations in contextual conditions, such as affective factors (emotions, moods), the training data should include data collected under the relevant contextual conditions.

<sup>3</sup>Alternatively, the decrease might be due to the removal of relevant neural information represented in beta or gamma bands.

Zander and Jatzev (2012) found that certain metrics might enable the identification of phases of changed contexts and therefore identify phases where additional calibration might be necessary. One could then use transfer learning (Pan and Yang, 2010) or other re-calibration strategies to enable an adaptation of the transfer algorithm to the new context. However, the suggested metric specifically enables the detection of LOC, which is useful for the detection of perceived LOC and subsequent reliability decrease of active BCIs when environmental and internal factors of the user change. Passive BCIs are not directly related to a feeling of control since they neither enable nor aim at the intentional control of machines. Therefore, for passive BCI one needs other indicators of reliability.

Currently, several groups are investigating the cognitive, affective, and demographic factors that influence active BCI performance (see Lotte et al., 2013). We argue that a similar research program would allow building more robust passive BCIs by (1) taking into account changes in relevant contextual factors (e.g., stress), (2) exploring indicators of such changes or the subsequent loss of reliability, and (3) exploring strategies to update the classifier in the face of the loss of reliability due to contextual changes.

#### **CONCLUSION**

The current work has relevance for the development of passive brain-computer interfaces that are able to specifically classify one psychophysiological construct (e.g., workload), while being invariant to others (e.g., stress). We devised and validated a protocol to test the effect of stress on pBCI approaches. We showed that a classifier has trouble transferring from stressful training data to non-stressful test data and vice versa, indicating an influence of affective task context on the performance of a workload classifier. Moreover, we found that the classification profits from the training on a mix of the varied affective task contexts. Such classifiers perform comparably well to those trained and tested on the same affective context. More generally, the results suggest that the classification performance is not only dependent on quantitative factors, such as the number of channels, amount of training data, or length of trials, but also on qualitative factors, such as the affective context. This underlines the need for studies that identify such contextual factors and that elucidate ways to deal with detrimental effects related to their influence. Future research and development of workload classification systems using physiological sensors needs to take the contextual factors into account to increase the generality and ecological validity of the system.

#### **REFERENCES**


Kirchner, W. K. (1958). Age differences in short-term retention of rapidly changing information. *J. Exp. Psychol.* 55, 352–358. doi: 10.1037/h0043688


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 31 January 2014; accepted: 30 April 2014; published online: 12 June 2014. Citation: Mühl C, Jeunet C and Lotte F (2014) EEG-based workload estimation across affective contexts. Front. Neurosci. 8:114. doi: 10.3389/fnins.2014.00114*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Mühl, Jeunet and Lotte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Modeling temporal sequences of cognitive state changes based on a combination of EEG-engagement, EEG-workload, and heart rate metrics

*Maja Stikic<sup>1</sup>\*, Chris Berka<sup>1</sup>, Daniel J. Levendowski<sup>1</sup>, Roberto F. Rubio<sup>1</sup>, Veasna Tan<sup>1</sup>, Stephanie Korszen<sup>1</sup>, Douglas Barba<sup>2</sup> and David Wurzer<sup>2</sup>*

*<sup>1</sup> Advanced Brain Monitoring Inc., Carlsbad, CA, USA*

*<sup>2</sup> Center for Performance Psychology, National University, Carlsbad, CA, USA*

#### *Edited by:*

*Anne-Marie Brouwer, Netherlands Organisation for Applied Scientific Research, Netherlands*

#### *Reviewed by:*

*Lauren E. Reinerman-Jones, University of Central Florida, USA; Ali Bahramisharif, Radboud University Nijmegen, Netherlands*

#### *\*Correspondence:*

*Maja Stikic, Advanced Brain Monitoring Inc., 2237 Faraday Avenue, Suite 100, Carlsbad, CA 92008, USA e-mail: maja@b-alert.com*

The objective of this study was to investigate the feasibility of physiological metrics such as ECG-derived heart rate and EEG-derived cognitive workload and engagement as potential predictors of performance on different training tasks. An unsupervised approach based on a self-organizing neural network (NN) was utilized to model cognitive state changes over time. The feature vector comprised EEG-engagement, EEG-workload, and heart rate metrics, all self-normalized to account for individual differences. During the competitive training process, a linear topology was developed in which feature vectors similar to each other activated the same NN nodes. The NN model was trained and auto-validated on combat marksmanship training data from 51 participants who were required to make "deadly force decisions" in challenging combat scenarios. The trained NN model was cross-validated using 10-fold cross-validation. It was also validated on a golf study in which an additional 22 participants were asked to complete 10 sessions of 10 putts each. Temporal sequences of the activated nodes for both studies followed the same pattern of changes, demonstrating the generalization capabilities of the approach. Most node transition changes were local, but important events typically caused significant changes in the physiological metrics, as evidenced by larger state changes. This was investigated by calculating a transition score as the sum of subsequent state transitions between the activated NN nodes. Correlation analysis demonstrated statistically significant correlations between the transition scores and subjects' performances in both studies. This paper explored the hypothesis that temporal sequences of physiological changes comprise the discriminative patterns for performance prediction. These physiological markers could be utilized in future training improvement systems (e.g., through neurofeedback), and applied across a variety of training environments.

**Keywords: unsupervised learning, self-organizing map, cognitive state, electroencephalography (EEG), electrocardiography (ECG)**

#### **INTRODUCTION**

A passive Brain Computer Interface (BCI) utilizes spontaneously occurring brain signals to implicitly infer information about the cognitive or affective state of the user. Zander and Kothe (2011) presented a wide range of scenarios for the general population, i.e., healthy users who have no difficulty moving or speaking. This opens a number of new challenges, as it is difficult to compete with the standard communication channels such as speech and interaction devices (e.g., a mouse or keyboard) available to healthy users (Brouwer et al., 2013). However, passive BCIs provide an implicit input modality that carries information about the user's actual cognitive or affective state that is not intentionally sent by the user. This offers numerous opportunities for adaptive human-computer interaction techniques that would adjust to the user's ongoing mental state (Cutrell and Tan, 2008; Zander et al., 2008; George and Lecuyer, 2010). Novel applications for passive BCI technology in domains such as entertainment, gaming, marketing, and industry continue to emerge (van Erp et al., 2012), e.g., implicit tagging of multimedia content, avatar appearances that reflect the gamer's mental state, adaptive automation in driving environments depending on the driver's workload, etc. Insight into the cognition and internal mental state of the user provides valuable information for detecting emergencies caused by a lack of concentration, drowsiness, or excessive mental workload in safety-critical applications, such as driving or security surveillance (Zander and Kothe, 2011). Furthermore, training and education could also benefit from a more in-depth understanding of cognitive and affective states during learning with technology designed for cognitive enhancement and neurofeedback (Berka et al., 2010; van Erp et al., 2012).

A large number of previous studies in the context of training and education focused on error detection and performance prediction based on brain activity (Gehring et al., 1993; Falkenstein et al., 2000; Ferrez and Millan, 2008; Lehne et al., 2009; Stikic et al., 2011). This line of research has shown great promise in providing the means to increase the pace of skill learning, which is typically characterized as a three-stage process (Fitts and Posner, 1967). In the initial cognitive stage, new knowledge is assembled to understand the skill to be learned (i.e., what to do) by effortful processing of perceptual cues. In the next, associative stage, sensory processing of relevant stimuli is refined through practice (i.e., how to do it) until reaching the third, autonomous stage, which is characterized by automated execution of the task with minimal conscious mental effort. A growing body of research has been investigating physiological patterns associated with skill acquisition, mostly in sport environments that require complex visuomotor coordination, ranging from archery (Salazar et al., 1990; Landers et al., 1994), through marksmanship (Haufler et al., 2000; Hillman et al., 2000; Kerick et al., 2001; Deeny et al., 2003), to golf (Crews and Lander, 1993; Babiloni et al., 2008). These tasks are also characterized by a motionless preparatory period ("pre-shot routine") before the skilled movement occurs. Since there are minimal movement artifacts during this preparatory period, EEG recordings in such settings are feasible. Gevins et al. (1999) suggested that EEG is more effective than other neuroimaging techniques in providing dynamic measures of skill acquisition due to its high temporal resolution, its online monitoring ability, and the potential for real-time feedback based on cognition, attention, and arousal.

All of the above-mentioned skill acquisition studies focused on the analysis of low-level EEG features such as Power Spectral Densities (PSDs) grouped into the standard bandwidths, or EEG coherence analysis. Salazar et al. (1990) found alpha power in the left hemisphere to be dominant during the aiming period in archery. Haufler et al. (2000) compared the EEG of novices and expert shooters, finding that the experts exhibited less activation than the novice shooters during the aiming period, with a pronounced difference in the left central-temporal-parietal area. Crews and Lander (1993) suggested a decrease in left hemisphere, motor cortex activity as golf players prepared to putt. Babiloni et al. (2008) found that stronger event-related desynchronization (ERD) in high frequency alpha power (10–12 Hz) over the frontal midline (Fz and Cz) and right primary sensorimotor cortex (C4) is correlated with successful golf putts. The results of a study conducted by Deeny et al. (2003) revealed that, compared to less skilled shooters, experts engaged in less cortico-cortical communication, particularly between left temporal association and motor control regions, which implies decreased involvement of cognition with motor processes.

Changes in heart rate (i.e., the number of heartbeats per unit of time) have also been analyzed in relation to expert performance, and they are believed to reflect the focusing of attention and the skill related aspects of sensory-motor preparation for performance. The pre-shot period is typically characterized by heart rate deceleration (Landers et al., 1994; Kontinnen et al., 1998; Tremayne and Barry, 2001).

Building upon these findings, Berka et al. (2010) went one step further by developing a device called the Adaptive Peak Performance Trainer (APPT) that provides continuous psychophysiological monitoring and feedback (i.e., visual, auditory, and haptic) to the trainee in real-time. The device was based on a unique Pre-Shot Peak Performance (PSPP) expert profile identified in three different tasks (i.e., marksmanship, archery, and golf), which was characterized by an increase in midline theta and left temporal-parietal alpha EEG frequency bands and a local deceleration of heart rate. Skill learning was accelerated significantly when novices were trained with the APPT compared to controls, suggesting its feasibility as a training aid.

None of these past studies incorporated higher level cognitive state metrics such as EEG-engagement and EEG-workload that could potentially also relate to an individual's skill development. These metrics could contribute to better generalizability across different tasks compared to the low level EEG based features. Engagement reflects the allocation of processing resources (Anderson, 2004) including the cognitive processes associated with decision making, such as information gathering, visual scanning, audio processing, and selectively sustaining concentrated attention on one aspect of the environment while ignoring other distractions. Workload is typically defined as the amount of mental or physical resources required to perform a particular task (Son and Park, 2011). Seidler et al. (2012) have shown that individual differences in spatial working memory capacity predict the rate of motor skill learning. Bainbridge (1989) argued that workload is reduced with the development of a certain skill. However, Brouwer et al. (2014) did not find physiological variables to indicate decreasing effort associated with learning. EEG-based neural indices of cognition that reflect levels of alertness, selective attention, and working memory have been identified in previous studies (Hillyard et al., 1973; Gevins et al., 1998; Mikulka et al., 2002). However, these indices have only started to be explored in skill acquisition and performance prediction studies. Moreover, the neurophysiological data are typically analyzed during the pre-shooting period only, neglecting the changes over the entire process that include, for example, observation of the dynamic environment, evaluating possible threats, and eliminating enemies in the marksmanship scenarios. Lastly, the previous studies did not account for the individual variability in the EEG data, and many of them lacked evaluation of the generalizability of the proposed approaches across different users and tasks.

Performance prediction algorithms are based on either supervised or unsupervised machine learning methods. The main difference between these two types of methods is whether they utilize labeled data (supervised learning) or unlabeled data (unsupervised learning). Many of the supervised approaches are not based on EEG data, but on kinematical models, e.g., using computer vision. Couceiro et al. (2013) compared different classifiers (linear and quadratic discriminant analysis, Support Vector Machines (SVM), and naive Bayes) in their golf study. The experimental results showed the superiority of the SVM classifier; however, a large individual variability in the data was present due to the different game styles of each player. Jensen et al. (2012) evaluated AdaBoost, Fisher LDA, k-Nearest Neighbor (kNN), naive Bayes, and SVM classifiers in a study of kinematic golf putt data classification based on the inertial sensor. Nagashima et al. (2009) assessed marksmanship performance by sensor-based breath and trigger control measures that were fitted with a logistic regression model. Muangjaroen and Wongsawat (2012) developed an EEG-based real-time index for predicting putt success based on high alpha power in the C4 sensor site, theta power in Fz, and both high alpha and theta power in Pz. A major bottleneck of the supervised learning methods is the substantial amount of labeled training data required to train a classifier. Labeling of data is typically a labor-intensive, time-consuming, and error-prone process that scales poorly to the large number of users and skill learning sessions necessary to achieve the required level of expertise. This limits the applicability of the supervised learning approaches in challenging skill acquisition settings where fine-grained labels of all relevant session periods would need to be precisely obtained. Another line of research avoids the labeling efforts by unsupervised discovery of structure in the data.
However, there are only a few studies where unsupervised approaches were applied for performance prediction, such as data fusion of EEG and ECG signals through Self-Organizing Maps (SOM) (Bandeira et al., 1999). In this approach, shooting data were classified by majority voting of the labeled neurons of the trained SOM. However, this study focused on the data fusion and analyzed only low-level EEG features. The study included a single constrained shooting scenario with only seven subjects. Labeling of the neurons consisted of finding the "winning" neuron in a SOM for each feature vector in the dataset, and assigning the feature vector's label to the neuron's label. Thus, this approach also required labeled training data. The activated NN nodes represented the single subject's data, and did not include similar performance data. Data fusion slightly improved performance of the classifier, but the authors did not conduct cross-validation of the algorithm on an independent dataset or make any effort to determine whether the model would generalize across tasks.

In this paper, we present an alternative approach toward performance estimation in dynamic environments requiring rapid decisions. We focus on the two tasks that were often analyzed in the past: combat marksmanship (e.g., Haufler et al., 2000; Deeny et al., 2003) and golf training (e.g., Crews and Lander, 1993; Babiloni et al., 2008). These two tasks both require complex visuomotor coordination, mental and sensory-motor preparation, focused attention, and concentration. None of the previous studies in the literature explored the potential similarities of the neurophysiological performance predictors on these tasks. This novel approach is based on a combination of the two types of higher level EEG-based cognitive state measures associated with decision-making, alertness, attention, and working memory: engagement (Johnson et al., 2011) and workload (Berka et al., 2007). Furthermore, heart rate is also included in the analysis, as it adds the dimensions of stress, anxiety, frustration, and arousal experienced by a shooter (Berka et al., 2010). EEG-engagement reflects information-gathering, visual processing, and allocation of attention, while EEG-workload is associated with increased working memory load during problem solving, integration of information, and analytical reasoning (Berka et al., 2007). The EEG-based metrics are used as an indicator of how well a shooter is processing information, as well as an indicator of the focused and relaxed mental state that is a prerequisite for good shooting performance (Raphael et al., 2009). We tested the following hypotheses: (1) temporal sequences of EEG-based engagement and workload and ECG-based heart rate comprise discriminative patterns related to performance on the combat marksmanship task, (2) the discovered patterns are good indicators of performance on the golf training task, and (3) individual differences in the neurophysiological metrics could be overcome by data normalization. 
In order to overcome the main drawback of the supervised approaches that require large amounts of labeled training data, we explored the potential of the unlabeled data by applying an unsupervised approach. The approach is based on the one-dimensional self-organizing NN inspired by the work of Stevens et al. (2009, 2013). However, we extended their algorithm by combining all three metrics (i.e., EEG-engagement, EEG-workload, and heart rate) into a single NN. Multiple neurophysiological metrics could facilitate developing a deeper understanding of skill acquisition, as they have different functional properties reflecting various aspects of the learning process. Furthermore, we incorporated the temporal modeling of these metrics on a fine-grained scale by including 3 consecutive seconds into a feature vector. All metrics were self-normalized to account for the individual differences inherent in the physiological data. The NN model was trained and auto-validated on the combat marksmanship training data. The trained model was then cross-validated in two different ways. First, we utilized 10-fold cross-validation (Kohavi, 1995) to evaluate generalization capabilities of the approach across different subjects. Second, we also evaluated the generalization capabilities of the model by cross-validating it on the golf-training data. Correlation analysis of the temporal cognitive state changes was performed to investigate if they comprise the discriminative patterns for performance prediction that could ultimately allow for faster learning of different skills through neurofeedback. Lastly, the model was compared to the linear regression model on the same set of variables to investigate whether the utilized variables or their temporal modeling with the NN approach contributed to the results.
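The feature construction and the transition score mentioned in the abstract can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration: self-normalization is taken to be a per-subject z-score, the 3 s windows are taken to overlap with a 1 s step, and a "state transition" is taken to be the absolute distance between node indices on the linear topology. The function names and the sample values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def self_normalize(values):
    """Z-score a metric against the same subject's own mean and SD (assumed scheme)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def build_feature_vectors(engagement, workload, heart_rate, window=3):
    """Concatenate `window` consecutive seconds of each normalized metric."""
    metrics = [self_normalize(m) for m in (engagement, workload, heart_rate)]
    n = len(metrics[0])
    if any(len(m) != n for m in metrics):
        raise ValueError("each metric needs one value per 1 s epoch")
    vectors = []
    for start in range(n - window + 1):  # overlapping windows are an assumption
        vector = []
        for metric in metrics:
            vector.extend(metric[start:start + window])
        vectors.append(vector)
    return vectors


def transition_score(node_sequence):
    """Sum of absolute index jumps between consecutively activated nodes (assumed)."""
    return sum(abs(b - a) for a, b in zip(node_sequence, node_sequence[1:]))


# Made-up per-second values for one subject (illustration only).
engagement = [0.2, 0.4, 0.9, 0.7, 0.3]
workload = [0.5, 0.6, 0.6, 0.8, 0.7]
heart_rate = [62.0, 64.0, 70.0, 68.0, 65.0]
vectors = build_feature_vectors(engagement, workload, heart_rate)
score = transition_score([3, 3, 4, 9, 8])
```

With three metrics and a 3 s window, each feature vector has nine elements. Under the assumed definition, a sequence that stays within neighboring nodes yields a small score, while a jump across the linear topology contributes a large term.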

# **MATERIALS AND METHODS**

Two studies were conducted: combat marksmanship training on the military simulator, and golf putting training at the indoor golf facility. In this section, we describe the protocols of both studies, and we detail data recording setup, signal processing, and algorithm development procedures.

#### **PARTICIPANTS**

For the combat marksmanship study, a total of 51 participants (22 females and 29 males, mean age 27.4 ± 6.7 years) were recruited from local colleges and newspaper/online advertisements. All participants had normal or corrected-to-normal vision. No participants that had undergone formal marksmanship training were admitted to the study. Informed consent was obtained from all participants in accordance with the guidelines and approval of the Biomedical Research of America IRB.

A total of 22 subjects were recruited for the golf training study using the National University Golf Academy's (NUGA) IRB approved protocol. The participants were split into two groups: experienced players and novices, based on their previous experience. Each group had 11 participants and in each group 45% of the population was male. Average overall age was 33.9 ± 11.1.

No study procedures were conducted until the consent form had been fully explained and signed. There was no overlap between the participants in these two studies. All participants in both studies were right-handed.

#### **STUDY PROTOCOL**

The study protocols comprised screening, baseline, and experimental sessions.

#### *Screening procedure*

The eligibility screening criteria aimed to cover anything known to effectively alter the EEG signals, so those subjects with general health problems such as psychiatric, neurological, behavioral, attention, or sleep disorders, as well as any pulmonary or eating disorder, diabetes, high blood pressure, or history of stroke were excluded from the studies. Those who reported using pain medications regularly, stimulants such as amphetamine, illicit drugs, or those who consumed excessive alcohol or tobacco on a daily basis were also excluded. Exclusion criteria also included head injuries within the past 5 years and pregnancy. Eight subjects failed this initial screener. Participants who passed the screener were scheduled for the baseline session. Participants were asked to ensure they got a full night of sleep (between 7.5 and 9 h) in the night leading up to their study appointments. Participants were also requested to refrain from drinking alcohol at least 24 h before each study visit, and abstain from caffeine on the day of the study and nicotine 1 h before the start of the study visit.

#### *Baseline session*

In both studies, the participants completed a 15 min baseline (BL) using the Alertness Memory Profiler (AMP). The AMP was developed by Advanced Brain Monitoring Inc. (ABM) to integrate EEG, ECG, and performance measures in an easy-to-administer platform designed for quantitative assessment of neurocognitive functions, including alertness, attention, learning, and memory. The AMP uses a multivariate approach that allows simultaneous acquisition and analysis of data.

The baseline session comprised a Visual Passive Vigilance Task (VPVT), an Auditory Passive Vigilance Task (APVT), and a 3-Choice Active Vigilance Task (3CVT). During the 5 min of VPVT and APVT, subjects were required to press the space bar on the keyboard every 2 s. Subjects were prompted to maintain the 2 s time intervals by a 10 cm diameter red circle that appeared in the center of the monitor during the VPVT, or by an audio tone during the APVT. The 3CVT (**Figure 1**) is a sustained attention task that requires subjects to discriminate one primary target (presented 70% of the time) from two secondary non-target geometric shapes that were presented for 0.2 s and randomly interspersed over the test period. It challenges the ability to sustain attention by increasing the inter-stimulus interval (ISI) from 1.5 s up to 10 s at the end of the task. Participants were instructed to respond as quickly and as accurately as possible to each stimulus presented by pressing the left arrow to indicate target stimuli, and the right arrow to indicate non-target stimuli. A brief training period was provided prior to the start of the testing period to minimize practice effects for the 3CVT. The training period lasted until a certain number of correct responses to both targets and non-targets was reached (2 targets and 2 non-targets). The applied criterion was the same for all participants in both studies. On average, the training period lasted 30 s.

**FIGURE 1 | A subject wearing the wireless EEG sensor headset while performing the 3CVT task.**

## *Combat marksmanship training experiment*

The military training platform Virtual Battle Space 2 (VBS2), Tactical Warfare Simulation running on the Real Virtuality 2 simulation engine and developed by Bohemia Interactive, was used to create the combat marksmanship training scenarios. The VBS2 platform was designed in close cooperation with the United States Marine Corps and Australian Defence Force, and it has been used as a training battlefield simulation system by federal, state, and local government agencies worldwide for tactical training and mission rehearsal in military organizations. It is an interactive, 3D training system that allows users to construct specific missions based on their individual needs. In collaboration with Laser Shot Inc. and a subject-matter expert (SME) from Washington State University, we developed five custom combat scenarios using VBS2. The scenarios (**Figure 2**) had realistic settings and contexts in which participants (acting as soldiers) were required to make deadly force decisions. In order to mimic reality as closely as possible, the room was equipped with life-size projection of threats, stereo sound delivered via earbuds, and other paraphernalia found in the battlefield environment. The participant used a demilitarized "airsoft" replica of an M4 rifle that interacted with the platform using a wireless laser-based training system from Laser Shot Inc. The M4 was mounted with a holographic weapon sight, or red dot scope, commonly used in a combat environment for quick target acquisition. Sandbags were provided to support the weight of the weapon, to both simulate combat firing procedures and investigate the effect of muscle fatigue on performance. The weapon also had a CO2 gas recoil system that simulated the kickback of the weapon.

**FIGURE 2 | Combat marksmanship scenario scenes.**

Participants were initially given marksmanship instructions and requested to undergo a set of training tasks. Training addressed the fundamentals of marksmanship (aiming, breath control, trigger control, etc.) and the rules of engagement applicable in the testing scenarios. During the three training scenarios, the participants practiced firing the M4: (1) at close-range static targets, (2) at targets of various distances, and (3) at targets that were randomly displayed among non-targets. Five testing scenarios were administered in a randomized order for each subject. Each scenario was set in a unique environment that replicated typical firefight situations for soldiers (e.g., Afghanistan compound, checkpoint, and market; Improvised Explosive Device (IED) compound, or urban alley). In order to avoid excessive fatigue in participants, the scenarios were designed from a fixed point of view and each scenario lasted only 3–4 min, on average. Throughout the scenarios, a mixture of enemy and friendly units (both stationary and moving) appeared at varying distances. Two scenarios were enemy-heavy, in which 90 and 76% of the units were enemies, two were friendly-heavy, in which 76 and 60% of the units were friendly, and one had an approximately equal mix of enemy and friendly units (47% enemy units and 53% friendly units). Participants were instructed to evaluate threats and eliminate all enemy units as quickly as possible.

To enable detailed event tracking, an event logging and synchronization platform was utilized as an essential piece of the study testbed. An External Sync Unit (ESU) was used to synchronize the physiological signals with the relevant events in the scenarios (such as friend/enemy became visible, enemy fired, etc.), as well as user responses (e.g., rifle shots, hit friend/enemy, etc.). Synchronization in the Windows environment is dependent on the Windows task scheduler and cannot guarantee an upper bound for user-level tasks, while the ESU is a general purpose data integration platform that can synchronize multi-source digital data (serial and/or parallel port protocols) with physiological signals to millisecond-level precision.

#### *Golf training experiment*

The golf training experiment was designed in close cooperation with SMEs from NUGA. The participants were asked to come to the indoor golf facility at the NUGA's premises. Upon arrival, they were asked to complete 10 sessions of 10 putts each wearing the wireless EEG headset (**Figure 3**). The average session duration was 5.5 min. It was decided that 100 putts from each subject would be required for the EEG data analysis. To avoid possible fatigue, those 100 putts were broken down into 10 sessions. All sessions were administered within the same day, with 3 min breaks between the sessions. The putting distance was far enough (10 feet) to be challenging for both experienced players and novices. Each putt comprised a series of steps including: (1) preparation period: the participants were asked to stand a few feet behind the ball before stepping up to the ball; (2) step up to putt: when the subjects were ready, they would walk up to the ball and get into their putting stance; (3) start putt: when the putter made contact with the ball; and (4) end putt: motion of the ball was stopped or off the green. All relevant events were manually marked and logged during the real-time EEG acquisition and for each putt the three signal sequences of interest were analyzed: preparation period, pre-putt period, and post-putt period (shown in **Figure 4**).

**FIGURE 3 | NUGA's golf facility.**

#### **DATA RECORDING AND SIGNAL PROCESSING**

EEG and ECG data were collected using the wireless EEG sensor headset (**Figure 1**) developed by ABM (Berka et al., 2004). For the combat marksmanship study, nine Ag/AgCl EEG electrodes were located at F3, Fz, F4, C3, Cz, C4, P3, POz, P4, according to the international 10–20 system. For the golf study, 20 EEG channels were utilized: Fp1, Fp2, Fz, F3, F4, F7, F8, T3, T4, T5, T6, Cz, C3, C4, Pz, P3, P4, O1, O2, and POz sites. The EEG channels were referenced to linked reference electrodes located behind each ear on the mastoid bone. Even though different montages were used in these two studies, only the data from the EEG channels that are necessary for calculating EEG-based engagement and workload metrics were analyzed: Fz, F3, Cz, C3, C4, and POz. The most discriminative variables for both models (i.e., EEG-engagement and EEG-workload) were selected in the previous study (Berka et al., 2007) using stepwise regression. ECG was recorded with electrodes placed on the clavicle and opposite lower rib. All data were sampled at 256 Hz and transferred in real-time via Bluetooth link to a nearby host computer where the proprietary data acquisition software stored the data onto the disk.

The EEG signals were filtered with a band-pass filter (0.5–65 Hz) before the analog-to-digital conversion. To remove environmental artifacts from the power network, notch filters at 50, 60, 100, and 120 Hz were applied. The decontamination algorithm (Berka et al., 2004) detected and removed any sudden changes in amplitude, i.e., artifacts in the time-domain EEG signal, including spikes caused by tapping or bumping of the sensors, amplifier saturation, and excursions. Eye blinks were identified and decontaminated by wavelet transform based upon an algorithm presented in Berka et al. (2007). The wavelet decomposition coefficients for differential Fz-POz and Cz-POz EEG channel derivations in the exponential 0–2, 2–4, 4–8, 8–16, 16–32, 32–64, and 64–128 Hz bands were calculated by applying the Coiflet order-1 wavelet filter (Wei, 1998). These coefficients were further utilized in the discriminant function analysis to detect the eye blink regions. Decontamination of eye blinks was accomplished by computing mean wavelet coefficients from the nearby non-contaminated regions and replacing the contaminated data points. The EEG signal was then reconstructed by wavelet composition. From the filtered and decontaminated EEG signal, the log-transformed absolute power spectral densities (PSD) were calculated for each 1 s epoch of data by applying Fast Fourier Transformation (FFT). Relative PSD values were derived by subtracting the PSD for each 1 Hz bin from the total PSD power in the range of 1–40 Hz. Lastly, excessive muscle activity (EMG) was detected by identifying epochs in which (a) PSD bins from 35 to 40 Hz were above a certain threshold and (b) the square root of the PSD bins' sum from 70 to 128 Hz was above a defined cut-off value. The threshold values were customized empirically by visual inspection of the large EEG data cohort contaminated with EMG during the EEG sensor development. Epochs with detected EMG artifacts were discarded from further analysis.
On average, 21.9 and 23.4% of the epochs per session were contaminated with artifacts in the combat marksmanship and golf study, respectively. There were no significant differences in the number of detected artifacts for different combat marksmanship scenarios (19.9, 20.6, 21.2, 23.4, and 24.2%).

The ECG signal was filtered with a 4th-order band-pass IIR filter (15–35 Hz). This improved the contrast between the QRS complex and the T wave and minimized double peak detection, leading to more robust peak detection. The filtered signal was processed by a real-time algorithm that computed a sample-by-sample running mean and standard deviation to derive a threshold for peak detection. The inter-beat (R-R) interval was determined as the number of seconds between the current and the previous peak. Heart rate was estimated as the number of beats per minute, i.e., 60/(R-R interval), and logged in real time. The algorithm assessed the quality of detected beats by monitoring the standard deviation of up to 6 consecutive beats.
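The peak-detection and heart-rate steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' real-time algorithm: the function names, the threshold multiplier `k`, the refractory period, and the synthetic spike train are all assumptions introduced here, and the band-pass filtering and beat-quality check are omitted.

```python
import math

FS = 256  # sampling rate in Hz, as stated in the text


def detect_r_peaks(ecg, fs=FS, k=2.0, refractory_s=0.25):
    """Return sample indices of candidate R peaks in a filtered ECG trace.

    A sample counts as a peak when it is a local maximum, exceeds
    running_mean + k * running_std, and lies outside the refractory period.
    `k` and `refractory_s` are illustrative values, not the authors' settings.
    """
    n, mean, m2 = 0, 0.0, 0.0  # Welford accumulators for running mean/variance
    refractory = int(refractory_s * fs)
    peaks = []
    for idx, x in enumerate(ecg):
        # Update the running statistics sample by sample.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if idx < 2:
            continue
        threshold = mean + k * math.sqrt(m2 / n)
        cand = idx - 1  # the previous sample can now be tested as a local maximum
        is_local_max = ecg[cand - 1] < ecg[cand] >= ecg[idx]
        far_enough = not peaks or cand - peaks[-1] >= refractory
        if is_local_max and ecg[cand] > threshold and far_enough:
            peaks.append(cand)
    return peaks


def heart_rate_bpm(peaks, fs=FS):
    """Convert peak indices to R-R intervals (s) and heart rate = 60 / R-R."""
    rr = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return [60.0 / interval for interval in rr]


# Synthetic trace: one unit spike per second for 5 s (not real ECG data).
ecg = [1.0 if i % FS == 128 else 0.0 for i in range(5 * FS)]
peaks = detect_r_peaks(ecg)
rates = heart_rate_bpm(peaks)  # one spike per second gives 60 beats per minute
```

Because the threshold is derived from statistics accumulated so far, the sketch needs no look-ahead beyond a single sample, which is in the spirit of the real-time processing described in the text.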

#### **EEG-BASED ENGAGEMENT AND WORKLOAD METRICS**

In order to explore the applicability of alertness quantification in performance estimation, we incorporated the B-Alert® model (Johnson et al., 2011) into the analysis. It is an individualized model that classifies a subject's cognitive state into different levels of alertness: distraction/relaxed wakefulness, low engagement, or high engagement. It utilizes the absolute and relative PSD values from the midline Fz-POz and Cz-POz derivations during VPVT, APVT, and 3CVT BL data to derive coefficients for a discriminant function that generates classification probabilities for each 1 s epoch. During the model training procedure, APVT, VPVT, and 3CVT represent distraction/relaxed wakefulness, low engagement, and high engagement, respectively. The quality of the individual models was assessed by auto-validation on these three tasks (APVT, VPVT, and 3CVT), and for all subjects the majority of the epochs were classified into the expected class. This model has been validated in a number of previous studies across a range of applications, such as sleep deprivation (Westbrook et al., 2004), team collaboration (Stevens et al., 2009), and emergent leadership (Waldman et al., 2013). The high engagement output posterior probabilities of the model were added to the feature vector for further analysis.

The EEG-based workload model (Berka et al., 2007) was utilized to extract the subject's cognitive workload levels on a second-by-second basis. This is a general model that was trained on the EEG data from a large population performing the Forward Digit Span (FDS) and Backward Digit Span (BDS) tasks. During these two tasks, the subjects sit still in front of a computer screen and memorize sequences of 2 up to 9 digits that are shown on the computer screen as a series of single digits of increasing lengths, followed by an empty box prompting the participant to reproduce the sequence by typing in the memorized digits in the presented order (FDS) or the reverse order (BDS). For both FDS and BDS, the task difficulty was manipulated by increasing the number of digits at each level. The model utilizes the absolute and relative PSDs from the differential EEG derivations (C3-C4, Cz-POz, F3-Cz, Fz-C3, and Fz-POz) into a 2-class linear Discriminant Function Analysis classifier (i.e., low workload and high workload). The present study employed the posterior probability of the high workload class in order to identify a continuous measure of EEG-workload and analyze its temporal changes during skill training.

It has been shown in previous studies (Stevens et al., 2013) that the EEG-engagement and EEG-workload metrics have different functional properties in response to different tasks and are poorly correlated with one another. The correlation between EEG-engagement and EEG-workload was *R* = −0.24 ± 0.21, with an *R*<sup>2</sup> of 0.11 ± 0.11, across the marksmanship training sessions. Across the golf training sessions, the correlation between EEG-engagement and EEG-workload was *R* = −0.11 ± 0.22, with an *R*<sup>2</sup> of 0.06 ± 0.11. The correlation between these two metrics in Stevens et al. (2013) was in the same range: *R* = −0.19 ± 0.24 (*R*<sup>2</sup> = 0.09 ± 0.05).

#### **PERFORMANCE METRICS**

In both the combat marksmanship and golf training studies, the subjects' performance scores were calculated. For the combat marksmanship data, the following metrics were included in the analysis: percent of enemy hits, enemy deaths, and misses. The enemy was hit if they were shot in a non-vital area such as hand, foot, or shoulder, while they were killed when shot in a vital area like head or chest. For the golf study, we calculated the percent of putts that ended up in the hole as the main performance metric.

#### **ALGORITHM DEVELOPMENT**

EEG-derived engagement and workload, and ECG-derived heart rate metrics were combined together in order to analyze their common changes over time. To normalize the data for individual variability, all three metrics were self-normalized (i.e., z-scored) with respect to the training task in question. This accommodated for individual differences inherent in the physiological data. The feature vector was derived by applying a 3 s sliding window to each second of data (i.e., the window was shifted in 1 s increments). Thus, the feature vector comprised the three analyzed metrics (i.e., EEG-engagement, EEG-workload, and heart rate) over three consecutive seconds, summing up to 9 features.
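As a rough illustration of the feature construction described above, the following sketch z-scores three per-second metric series and stacks a 3 s window, shifted in 1 s steps, into 9-element vectors. The function names, the use of the population standard deviation, and the ordering of the features within a vector (grouped by metric) are assumptions; the text does not specify them.

```python
from statistics import mean, pstdev


def zscore(values):
    """Self-normalize one metric over the whole training task."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def feature_vectors(engagement, workload, heart_rate, window=3):
    """Build one vector per window position: 3 metrics x `window` seconds.

    Assumed layout: [eng(t), eng(t+1), eng(t+2), wl(t), ..., hr(t+2)].
    """
    metrics = [zscore(engagement), zscore(workload), zscore(heart_rate)]
    n = len(engagement)
    vectors = []
    for start in range(n - window + 1):  # shift the window in 1 s increments
        vec = []
        for series in metrics:
            vec.extend(series[start:start + window])
        vectors.append(vec)
    return vectors


# Made-up per-second values for a 5 s excerpt (illustrative only).
eng = [0.1, 0.4, 0.3, 0.8, 0.6]
wl = [0.5, 0.2, 0.7, 0.6, 0.9]
hr = [62.0, 64.0, 63.0, 70.0, 68.0]
vecs = feature_vectors(eng, wl, hr)  # 3 vectors of 9 features each
```

A series with zero variance would cause a division by zero in `zscore`; a production version would need to guard against that case.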

These vectors were used for training an unsupervised self-organizing artificial NN (Kohonen, 1990) consisting of 20 nodes placed on a 1D grid (20 × 1), which is a discrete representation of the continuous input space with the preserved spatial properties of that space (i.e., topological relationships within the training set are maintained). In the NN, each node is represented by a weight vector (i.e., code vector) whose dimension is equal to the dimension of the input vectors and by a neighborhood function that decays with distance and dictates the topology of the map. Training consists of a competitive learning process that develops the topology in the following manner. The feature vectors most similar to each other (based on the Euclidean distance) activate the same nodes, i.e., in the end similar nodes are closer to each other on the grid than more dissimilar ones. For each input pattern, the node with the closest weight vector is declared the Best Matching Unit (BMU). The weights of the BMU and its neighbors are then adjusted toward the input vector. The neighborhood size is large at first, during the ordering phase, and is then reduced in each iteration until only the BMU's weights are updated during the convergence phase, when fine tuning of the NN weights is performed. After training, mapping is carried out by assigning each new feature vector to the NN node it activated. In our experiments, both training and mapping were performed in MATLAB® using the Neural Network Toolbox. The batch version of the learning algorithm was applied: instead of presenting only a single feature vector per iteration, all vectors are presented to the NN in each iteration and the weights change so that each node's new weight vector is the weighted average of the input vectors that activated that node. The weight vectors were initialized with the most significant principal components of the input space to start with a reasonable ordering.
By distributing the weights in this manner and using the batch learning algorithm, the number of iterations needed to reach convergence was reduced, and we ran 200 iterations in our experiments.
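The batch training procedure described above can be approximated in a few lines of plain Python. This is a sketch under stated assumptions, not the MATLAB toolbox implementation used in the study: the Gaussian neighborhood, the exponential decay of its width, and the parameter names are choices made here, and the principal-component initialization is replaced by a simpler straight-line initialization between the component-wise minimum and maximum of the data.

```python
import math


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest (Euclidean) to x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))


def train_som_1d(data, n_nodes=20, n_iter=200, sigma_start=None, sigma_end=0.5):
    """Batch-train a 1D self-organizing map and return its weight vectors."""
    dim = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]
    # Simplified initialization (assumption): nodes spread along a line.
    weights = [[lo[d] + (hi[d] - lo[d]) * k / (n_nodes - 1) for d in range(dim)]
               for k in range(n_nodes)]
    sigma_start = sigma_start or n_nodes / 2
    for it in range(n_iter):
        # Shrink the neighborhood width from sigma_start to sigma_end.
        frac = it / max(n_iter - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        bmus = [best_matching_unit(weights, x) for x in data]
        new_weights = []
        for k in range(n_nodes):
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                # Neighborhood weight decays with grid distance to the BMU.
                h = math.exp(-((k - b) ** 2) / (2 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            # Each node moves to the neighborhood-weighted mean of the inputs.
            new_weights.append([v / den for v in num] if den > 0 else weights[k])
        weights = new_weights
    return weights


# Toy 2D data on a line (illustrative only); mapping assigns each vector a node.
toy = [[t / 59, t / 59] for t in range(60)]
toy_weights = train_som_1d(toy, n_nodes=5, n_iter=50)
toy_nodes = [best_matching_unit(toy_weights, x) for x in toy]
```

After training, `best_matching_unit` performs the mapping step: each new feature vector is assigned to the node it activates. Pure Python keeps the sketch dependency-free, at the cost of speed on datasets of realistic size.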

The goal of the analysis was to find relevant patterns in the trained NN nodes' activations related to the skill training process. In our experiments, the NN model was trained on the combat marksmanship data that were also used for auto-validation as a proof of concept. The NN model was validated by utilizing 10 fold cross-validation. In the first nine cross-validation iterations, the data from five subjects were used for training and the rest of the data were used for testing. In the last round, the data from six subjects were employed for training and the rest of the data was employed for testing. By choosing the training folds in this manner, we evaluated person-independent modeling (i.e., the test data did not comprise the data from the subjects that were used for training) and every subject in the study was used only once in the training phase. The trained NN model was also tested on the golf study data to evaluate generalization capabilities of the model across different tasks. An extensive set of analyses was performed. First, we characterized the NN nodes by aggregating the feature vectors that activated the same node and analyzed the size of each node, i.e., how many times each node was activated. Second, the distributions of activated NN nodes during different relevant events and periods in both datasets were analyzed. The distributions were aggregated by counting all node activations whenever the event of interest appeared in the combat marksmanship dataset or during a certain period of interest in the golf dataset. The middle point of the 3 s long feature vector window was taken as a reference point, i.e., we analyzed EEG-engagement, EEG-workload, and heart rate metrics of the 1 s epoch when the event occurred, and their changes 1 s before and 1 s after the event. 
One-Way Multivariate Analysis of Variance (MANOVA) was applied to the estimated distributions to investigate if they are statistically different, i.e., if different neurophysiological patterns are associated with these events and periods of interest. Third, time series analysis of the adjacent NN node activations was performed by deriving a transition matrix that shows the overall number of the subsequent NN node activations for all possible pairs of the NN nodes. Fourth, based on the observed patterns in the transition matrix, we introduce a transition score that measures the size of temporal state changes as a sum of transitions between the subsequent activations of the NN nodes in the following manner. Let *t* be an overall number of activated nodes during the subject's skill training session (i.e., an overall number of feature vectors presented to the NN model), and let *i(j)* be the NN node activated during epoch *j* (*i* ∈ [1*, N*]*, N* = 20*, j* ∈ [1*, t*]). We define transition score as:

$$\text{Transition score} = \sum_{j=2}^{t} \left| i(j) - i(j-1) \right|$$
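The transition matrix, the aggregation of transitions by grid distance, and the transition score defined above can be computed directly from a sequence of activated node indices. The sketch below uses 1-based node indices as in the definition; the function names and the short example sequence are hypothetical.

```python
from collections import Counter


def transition_score(nodes):
    """Sum of absolute grid distances between consecutive activated nodes."""
    return sum(abs(b - a) for a, b in zip(nodes, nodes[1:]))


def transition_matrix(nodes, n_nodes=20):
    """Count transitions from node a (row) to node b (column), 1-based nodes."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for a, b in zip(nodes, nodes[1:]):
        matrix[a - 1][b - 1] += 1
    return matrix


def transitions_by_distance(nodes):
    """Number of transitions for each grid distance |i(j) - i(j-1)|."""
    return Counter(abs(b - a) for a, b in zip(nodes, nodes[1:]))


# Hypothetical sequence of activated nodes over five consecutive epochs.
sequence = [8, 8, 9, 6, 11]
score = transition_score(sequence)   # 0 + 1 + 3 + 5 = 9
matrix = transition_matrix(sequence)
by_distance = transitions_by_distance(sequence)
```

Entries on the diagonal of `matrix` correspond to epochs in which the same node stayed active, so a session dominated by local transitions concentrates its counts near the diagonal and yields a low score.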

Correlation analysis was performed to investigate if these transition scores were related to the subject's performance scores in both studies. Lastly, these performance prediction results were compared to the linear regression model on the same set of variables used for training the NN model.

#### **RESULTS**

In this section, the following results are presented: the frequency of activated NN nodes and their patterns during the important events and periods of interest in both the combat marksmanship and golf training datasets, time series analysis of activated NN nodes throughout the entire training sessions obtained by calculating transition matrices, analysis of performance scores, correlations between transition scores and performance scores, and comparison with the linear regression model.

#### **ANALYSIS OF THE NN NODES**

After training on the combat marksmanship dataset was completed, we grouped together and analyzed the feature vectors that activated the same NN nodes. Average EEG-engagement, EEG-workload, and heart rate values of the grouped feature vectors for each node, together with node sizes (i.e., the number of feature vectors that activated each node), are shown in **Figure 5**. Nodes 8, 6, and 11 were activated most often (2071, 1955, and 1827 times, i.e., 7.9, 7.5, and 6.9% of the time), and they represent three distinct neurophysiological states in which there are no large changes in any of the three analyzed metrics. Node 8 represents a state of low EEG-engagement: its z-scored value is negative (i.e., below the average value over the subject's entire training session) but increases over the observed 3 s. Heart rate is also slightly increasing and is above the subject's average value, while EEG-workload decreases slightly during the third second but remains above the subject's average value. Node 6 represents EEG-engagement and heart rate below the subject's average value, while EEG-workload is above the average value. Lastly, EEG-engagement and EEG-workload of node 11 are positive, while heart rate is around the subject's average value with a slight tendency to increase. The least activated nodes were 2 and 13, which were activated 518 and 589 times, i.e., 1.9 and 2.2% of the time, respectively. The main characteristic of these nodes is that they have a large decrease (node 2) or increase (node 13) in EEG-engagement and, at the same time, a relatively large opposite change in EEG-workload. These two nodes were also the least often activated during the validation of the trained NN on the golf training data (0.8% for node 2 and 0.9% for node 13). The most often activated nodes for the golf dataset were again nodes 11 (10.4%) and 6 (7%), in addition to node 5 (7.5%), which has around average EEG-engagement, above average EEG-workload, and below average heart rate values.

**nodes (i.e., the number of feature vectors that activated the corresponding node).**

For the combat marksmanship dataset, we analyzed the changes in the subjects' EEG-engagement, EEG-workload, and heart rate when an enemy appeared and compared them to the subjects' typical reactions to the appearance of a friendly unit by examining the distributions of activated NN nodes for these two events. **Figure 6** shows the distributions of the activated NN nodes for the two analyzed events (i.e., "Became visible friend" and "Became visible enemy"). MANOVA showed a statistically significant difference between the NN node activations for these two events [*F*(20, 427) = 1.716, *p* = 0.028, effect size partial η<sup>2</sup> = 0.07, observed power = 0.966]. Among the individual nodes, the largest differences in activation were for nodes 15 and 11. Node 15 was activated more often in response to the appearance of friendly units, and node 11 was a frequent response to the enemy's appearance. Node 15 represents a relatively relaxed state in which all three metrics (i.e., EEG-engagement, EEG-workload, and heart rate) are below their average values, while node 11 represents above average EEG-engagement, EEG-workload, and heart rate values. Furthermore, node 15 shows a slight decrease in EEG-engagement and heart rate during the third second, when friendly units appeared, while node 11 shows an increase in EEG-engagement and heart rate during the third second, i.e., after enemies appeared.

We also characterized the three periods of interest in the golf study (the preparation, pre-putt, and post-putt periods) by their distributions of activated NN nodes. These results are shown in **Figure 7**. The plots show that the most frequent node during the pre-putt period was node 11, while node 5 was the most often activated during the post-putt period. Comparison of these two nodes' EEG-engagement, EEG-workload, and heart rate values (shown in **Figure 5**) indicated above average EEG-engagement, EEG-workload, and heart rate levels during the pre-putt period, while during the post-putt period EEG-engagement was back at its average value, heart rate decreased below its average value, and EEG-workload was still above average but slightly lower than in the pre-putt period. Furthermore, we compared the node activation distributions of the pre-putt and post-putt periods for the hits and the misses, and MANOVA showed statistically significant differences between these two types of putts in both periods [*F*(20, 2178) = 4.018, *p* < 0.001, effect size partial η<sup>2</sup> = 0.036, observed power = 1 for the pre-putt period, and *F*(20, 2128) = 1.743, *p* = 0.02, effect size partial η<sup>2</sup> = 0.016, observed power = 0.974 for the post-putt period].

#### **TRANSITION MATRICES**

Time series analysis of the activated NN nodes was performed, and transition matrices were constructed on a second-by-second basis. Transition matrices for the combat marksmanship and golf studies are shown in **Figures 8** and **9**, respectively. The plots show that both transition matrices followed the same pattern of changes, i.e., most of the transitions were local, which appears in the transition matrix as movement around the diagonal. This is confirmed in **Figure 10**, which shows the number of transitions aggregated by distance between the nodes (based solely on their position on the NN grid, not on the Euclidean distances between them). The number of transitions decreased almost exponentially with distance between the nodes in both datasets. This indicates that the analyzed physiological metrics were not changing dramatically over time. Larger state changes were less common; however, their timestamps showed that they typically corresponded to significant events in the datasets that caused a large increase or decrease in the EEG-engagement, EEG-workload, or heart rate metrics. Consequently, more distant nodes of the trained NN model were activated, and in such cases the average distance between the nodes on the NN grid was 4. In order to quantify this phenomenon, we calculated transition scores for each subject's skill training session and performed correlation analysis to investigate whether they were associated with the subject's performance.

**FIGURE 8 | Transition matrix during auto-validation on the combat marksmanship dataset: the number of subsequent transitions between each pair of NN nodes.**

#### **ANALYSIS OF PERFORMANCE SCORES**

In the combat marksmanship dataset, even though the scenarios had varying number of enemies, there were no statistically significant differences in performance across different scenarios. The order of the scenarios was randomized across subjects, and there was a slight trend of decrease (about 5%) in performance in the later scenarios as shown in **Figure 11**, but it was not statistically significant (*p* = 0*.*45).

In the golf dataset, experienced players performed better on average (by 16%), but after ranking all subjects by the achieved performance, the set of the best performers comprised an equal split of novices and experienced players. Averaged subjects' performance for each session is shown in **Figure 11**, and there was an increase of 12% in performance during the first half of the sessions. Afterwards, performance started to decrease, but when comparing the first and the last session, there was an overall slight improvement in performance (by 4.8%).

#### **CORRELATION ANALYSIS**

**marksmanship and golf dataset.**

**Table 1 | The Pearson's correlation coefficients during auto-validation between transition scores and performance scores for the combat marksmanship dataset.**

**Table 2 | The Pearson's correlation coefficients during 10-fold cross-validation between transition scores and performance scores for the combat marksmanship dataset.**

Correlations during auto-validation between transition scores and performance scores for combat marksmanship are shown in **Table 1**. **Table 2** shows the correlations during 10-fold cross-validation between transition scores and performance scores for combat marksmanship. **Table 3** shows the correlations between transition scores and performance scores for the golf study. In all three cases, the correlations were statistically significant based on the two-tailed *t*-test. Statistically significant positive correlations between transition scores and percent of enemy hits/enemy deaths for the combat marksmanship dataset, as well as between transition scores and percent of "in" putts for the golf study, indicate that the larger the transition score, the better the performance. Negative correlations between transition scores and percent misses in the combat marksmanship dataset during both auto-validation (**Table 1**) and 10-fold cross-validation (**Table 2**) show a trend of lower transition scores for the worst performing subjects.
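For readers who want to reproduce this kind of analysis, the sketch below computes a Pearson correlation coefficient and the *t* statistic used in the two-tailed test, t = r·sqrt((n − 2)/(1 − r²)). The function names and the five data pairs are invented purely for illustration and are not values from either study.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Invented transition scores and performance scores (illustrative only).
transition_scores = [1, 2, 3, 4, 5]
performance = [2, 4, 5, 4, 5]
r = pearson_r(transition_scores, performance)
t = t_statistic(r, len(transition_scores))
```

The two-tailed *p*-value is then obtained by comparing `t` with a Student's *t* distribution with n − 2 degrees of freedom, which requires a statistics library or a lookup table and is left out here.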

#### **COMPARISON WITH THE LINEAR REGRESSION MODEL**

**Table 3 | The Pearson's correlation coefficients between transition scores and performance scores for the golf dataset.**

**Table 4 | Multicollinearity test: VIF and tolerance of the EEG-engagement, EEG-workload, and heart rate variables.**

**Table 5 | The linear regression model performance prediction results: coefficient of determination (*r*<sup>2</sup>), *p*-value for the *F*-test of the overall linear regression model, and the *p*-value for the *t*-test of the predictor variables.**

We built a linear regression model using the same set of variables as in the NN model to predict the performance scores in the combat marksmanship study. We tested the variables for multicollinearity by calculating tolerance and the corresponding variance inflation factor (VIF). The results are shown in **Table 4**. As tolerance > 0.2 (i.e., VIF < 5), one can conclude that multicollinearity was not present. Thus, the linear regression model was built without regularization. The regression results are shown in **Table 5**. One can observe that only the model that predicted the "percent enemy deaths" performance score was statistically significant (*F*-test, *p*-value = 0.018). In that particular model, only 3 predictor variables were statistically significant: EEG-engagement at second 2 (*t*-test, *p*-value = 0.004), EEG-workload at second 1 (*t*-test, *p*-value = 0.05), and EEG-workload at second 3 (*t*-test, *p*-value = 0.05). When comparing the *r*<sup>2</sup> scores in **Table 1** (i.e., based on the NN model) and **Table 5** (i.e., based on the linear model), one can conclude that the NN model outperformed the linear regression model, as it was able to predict the performance scores more accurately.
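The equivalence between the two multicollinearity thresholds quoted above follows from the standard definitions, where $R_k^2$ (notation introduced here) is the coefficient of determination obtained by regressing predictor $k$ on the remaining predictors:

$$\text{tolerance}_k = 1 - R_k^2, \qquad \mathrm{VIF}_k = \frac{1}{1 - R_k^2} = \frac{1}{\text{tolerance}_k}$$

Hence a tolerance of 0.2 corresponds to a VIF of 1/0.2 = 5, and any tolerance above 0.2 gives a VIF below 5.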

# **DISCUSSION**

In this paper, we explored a novel way of combining EEG-based metrics of engagement and workload, and an ECG-based heart rate metric to capture temporal cognitive state changes related to performance on two different skill training tasks, namely combat marksmanship and golf training. The main benefit of the algorithm is that it does not require prior labeling of the data as it employs the unsupervised self-organizing NN approach. The supervised learning approaches require large amounts of labeled data to train a classifier, but obtaining that large amount of labeled training data is a time-consuming and error-prone process that has limited the real-world applicability of many previous algorithms. Instead of utilizing low-level EEG features, we investigated the higher level EEG-based metrics of engagement and workload that contributed to the better generalizability of the approach across different tasks. The richness of the combat marksmanship dataset in the different EEG-engagement and EEG-workload levels allowed for successful training of the NN model. The model was able to cover all important patterns in cognitive state changes that were relevant for the golf task as well. In order to overcome individual differences inherent in the physiologically induced metrics, we normalized the data. Our algorithm was cross-validated in two different ways. First, person-independent training was performed by utilizing 10-fold cross-validation. Second, the NN model trained on the combat marksmanship data was tested on the golf dataset to evaluate generalization capabilities of the approach across different tasks. The temporal sequences of physiologically estimated cognitive state changes detected in the combat marksmanship dataset occurred in the golf dataset as well. 
The model was able to capture differences in the subject's psychophysiological states in response to different events, such as the appearance of friendly and enemy units in the combat marksmanship scenarios, or during different stages of golf putting. This could also be attributed to a difference in motor activity. We aimed to minimize that effect by excluding the epochs with detected EMG artifacts from the analysis. This demonstrated the richness of a model combining EEG-engagement, EEG-workload, and heart rate metrics that successfully provide a window on the internal mental state of the user. Analysis of the dynamic changes over time of these three metrics combined together contributes to a more in-depth understanding of different aspects and stages of skill training. For example, EEG-engagement plays an important role in gathering the information necessary for performing a task, EEG-workload is essential in processing the gathered information and comparing it to internal mental models, while heart rate is an indicator of the subject's preparation to receive sensory inputs and reaction to the processed information.

Statistically significant correlations between the introduced NN nodes' transition scores and performance were found in both datasets, which supports our first two hypotheses that temporal sequences of EEG-based engagement and workload and ECG-based heart rate comprise patterns related to performance on the combat marksmanship dataset, and that the discovered patterns are good indicators of performance on the golf training task as well. This is in line with the findings of Stevens et al. (2013) that showed relationship between the fluctuations in the entropy of the activated NN nodes and the relevant events in the task that require significant cognitive re-organization. These large cognitive state transformations might reflect adaptation to changes in the task. In their study, decreases in entropy were also associated with periods of poorer task performance. Though preliminary, our results also suggest physiological signatures may distinguish elements of perception indicative of good performance. Unlike previous studies that analyzed the relationship between the performance scores on the combat marksmanship and golf training tasks (e.g., Crews and Lander, 1993; Haufler et al., 2000; Deeny et al., 2003; Babiloni et al., 2008), and low-level EEG features such as PSD bandwidths over different scalp regions, we focused on the higher level metrics such as EEG-engagement, EEG-workload, and heart rate. The discriminative patterns in these three metrics were general across the two analyzed tasks and could be included in future neurofeedback-based or other intelligent training systems to accelerate learning across different tasks. 
Beyond performance estimation and skill acquisition scenarios, the proposed unsupervised approach based on time series analysis holds potential to also enrich information systems with additional data on cognitive state changes of the user that could be employed, for example, for testing the effectiveness of different marketing advertisements, user's reactions to different movie storyline endings, or even to political speeches. Due to the general nature of the developed algorithm, different NN nodes would be more active under diverse application scenarios, and the speed of the feature vector changes specific to different applications could be captured.

Even though the average ages of the subjects in these two datasets are different (27.4 ± 6.7 in the combat marksmanship dataset vs. 33.9 ± 11.1 in the golf dataset) and physiological signals are age dependent, the transfer of learning over these two datasets was successful, presumably due to the normalization of data. Furthermore, the EEG-based engagement metric is derived from the model individualized for each subject based on the EEG data during the baseline VPVT, APVT, and 3CVT tasks. Utilization of individualized metrics such as EEG-based engagement further facilitates knowledge transfer between different tasks. Although the algorithm proposed in our work is built upon the approach of Stevens et al. (2009, 2013) we extended it in a number of ways. First, we did not analyze the three metrics separately, but we combined EEG-engagement, EEG-workload, and heart rate into a single NN to capture synchronous changes of the metrics. Second, we did not apply a quantization into the upper, middle, and lower quartile, rather, we analyzed the metrics on a fine-grained continuous scale to be able to capture subtle cognitive changes during the tasks. Third, we incorporated temporal modeling in the NN itself by including 3 consecutive seconds of data into a feature vector. Thus, already the NN nodes were, to some extent, able to capture relevant temporal patterns in the data. Further investigations are needed to estimate the optimal window size, but increasing it from 1 to 3 s allowed for better temporal modeling and interpretation of the NN nodes. Fourth, we reduced the number of the NN nodes from 25 [that was used in Stevens et al. (2009, 2013)] to 20, which proved to be sufficient for our model and at the same time, reduced the time required to train the model, resulting in a more efficient algorithm. Fifth, we introduced a simple, yet effective NN state transition score that was correlated with performance scores in both analyzed datasets. Lastly, Stevens et al. 
(2009, 2013) focused on the analysis of team settings, which present a number of confounding factors, but the goal of this study was to analyze the individual changes in cognitive states and to eventually reveal the patterns relevant for characterizing performance on different tasks. We will also investigate the potential of better feature interactions coverage in the two-dimensional NN grid.

The current study sought to develop a method for capturing cognitive state changes that could be integrated in a number of applications, but this is only the first step in developing a performance estimation system that could be implemented in real-time and utilized in real-world settings. In order to apply the proposed approach in real-time, after the NN model is trained, one would need to collect the appropriate baseline data for new users to allow for individual data normalization (i.e., z-scoring utilized in our algorithm). Another option would be to compute mean and standard deviation of the subject's EEG-engagement, EEG-workload, and heart rate metrics (that are necessary for z-scoring) adaptively online during the task itself. Thus, the algorithm could be relatively easily implemented in real-time. The main limitation is the 1 s delay required to capture the third second of data in our feature vector. In most of the envisioned potential applications of the proposed approach, this delay would be acceptable.
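The adaptive online option mentioned above could, for example, rely on Welford's running mean and variance update, so that each incoming metric value is z-scored without storing the session history. This is one possible approach rather than the authors' implementation; the class name and the choice to return 0.0 until a variance estimate exists are assumptions.

```python
import math


class RunningZScore:
    """Online z-scoring of a metric stream using Welford's algorithm."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Add one sample and return its z-score against the updated stats."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2 or self.m2 == 0.0:
            return 0.0  # no variance estimate yet (assumed fallback)
        std = math.sqrt(self.m2 / self.n)
        return (x - self.mean) / std


# One normalizer per metric; values below are made up for illustration.
hr_norm = RunningZScore()
z_values = [hr_norm.update(v) for v in [1.0, 2.0, 3.0]]
```

Early z-scores are unstable because they are based on very few samples, so a practical system would probably still need a short warm-up period or baseline recording before trusting the normalized values.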

Furthermore, we analyzed different performance scores, such as percent of enemy hits and enemy deaths in the combat marksmanship study, to evaluate if we can estimate subtle differences in performance by capturing neurophysiological differences in the subjects' reactions to relatively similar events in this very rich dataset. However, larger studies are necessary in order to further refine the NN model to be able to also capture neurophysiological correlates of performance improvement. In the combat marksmanship study there was no clear overall improvement in performance throughout the scenarios, while in the golf study, performance was slightly improved at the end of the study compared to the initial performance, but there was a drop in performance seen in the second half of the golf sessions. As the golf study sessions lasted longer than the combat marksmanship sessions, the decrease in performance in the golf study was presumably due to a loss in concentration as the study progressed. Thus, a larger number of skill training sessions across different days is needed. We utilized simulated training scenarios in these initial studies to minimize the noise that could be introduced by environmental variables in the real-world settings, however, the training tasks were made as realistic as possible by working closely together with established military and golf training domain experts. We plan to extend the combat marksmanship study and include professional soldiers in order to obtain more realistic data, and compare their physiological profiles with the novices' profiles. Further validation on the additional tasks, such as driving, is also planned. 
Nonetheless, the statistically significant correlations between the NN-based transition scores and performance are a promising first step toward the final goal of performance prediction that could be further used across a wide range of real-world application scenarios to improve, accelerate, and increase efficiency of the skill learning process by identifying an adaptive, intelligent, and multimodal neurofeedback based on the relevant neurophysiological patterns discovered in our datasets.

# **ACKNOWLEDGMENTS**

This work was supported by the Defense Advanced Research Projects Agency contract NBCH090054. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency.



**Conflict of Interest Statement:** Authors Stikic, Berka, Levendowski, Rubio, Tan, and Korszen receive salaries from and/or are shareholders of Advanced Brain Monitoring, Inc. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 February 2014; accepted: 08 October 2014; published online: 05 November 2014.*

*Citation: Stikic M, Berka C, Levendowski DJ, Rubio RF, Tan V, Korszen S, Barba D and Wurzer D (2014) Modeling temporal sequences of cognitive state changes based on a combination of EEG-engagement, EEG-workload, and heart rate metrics. Front. Neurosci. 8:342. doi: 10.3389/fnins.2014.00342*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Stikic, Berka, Levendowski, Rubio, Tan, Korszen, Barba and Wurzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Cardiovascular state changes in simulated work environments

#### *Arjan Stuiver <sup>1</sup> \* and Ben Mulder <sup>2</sup>*

*<sup>1</sup> Neuropsychology, Behavioural and Social Sciences, University of Groningen, Groningen, Netherlands*

*<sup>2</sup> Experimental Psychology, Behavioural and Social Sciences, University of Groningen, Groningen, Netherlands*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Michal Lavidor, Bar Ilan University, Israel*

*Johanna Wagner, Graz University of Technology, Austria*

#### *\*Correspondence:*

*Arjan Stuiver, Neuropsychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, Netherlands e-mail: a.stuiver@rug.nl*

The usefulness of cardiovascular measures as indicators of changes in cognitive workload has been addressed in several studies. This paper explores whether cardiovascular patterns in heart rate, blood pressure, baroreflex sensitivity and HRV are consistent within and between two simulated working environments. Two studies were performed, both with 21 participants: one in an ambulance dispatch simulation and one in a driving simulator. In the ambulance dispatcher task an initial strong increase in blood pressure is followed by a moderate on-going increase in blood pressure during the next hour of task performance. This pattern is accompanied by a strong increase in baroreflex sensitivity while heart rate decreases. In the driving simulator study, blood pressure initially increases but decreases almost to baseline level in the next hour. This pattern is accompanied by a decrease in baroreflex sensitivity, while heart rate decreases. Results of both studies are interpreted in terms of autonomic control (related to both sympathetic and para-sympathetic effects), using a simplified simulation of a baroreflex regulation model. Interpretation of the results leads to the conclusion that the cardiovascular response patterns in both tasks combine an initial defensive reaction with compensatory blood pressure control. The level of compensatory blood pressure control, however, is quite different for the two tasks. This helps to explain the differences in response patterns between the two studies in this paper and may also help in understanding differences in cardiovascular response patterns in general. A substantial part of the effects observed during task performance are regulatory effects that are not always directly related to workload manipulations. Making this distinction may also contribute to the understanding of differences in cardiovascular response patterns during cognitive workload.

**Keywords: cardiovascular reactivity, mental workload, state assessment, simulated work, baroreflex**

# **INTRODUCTION**

Cardiovascular measures are extensively studied in both laboratory and applied environments to gain insight into either responsiveness or operator state changes during continuing mental work. In recent years, research on operator state assessment has frequently focused on developing applications for adaptive automation. Cardiovascular measures are mentioned by different authors as good candidates for the assessment of operator state because they can be measured relatively easily and continuously (Hockey et al., 2003; Mulder et al., 2003, 2004).

In several studies it has been shown how psychophysiological measures can be used in adaptive automation (Pope et al., 1995; Prinzel et al., 2000; Fairclough and Venables, 2006; Ting et al., 2010). Knowledge about a person's current state, in combination with information about task load and task performance, can be applied to adapt the working environment to fit the user's demands or needs (Haas and Hettinger, 2001). In this context, it is important to take into account the demand a task is placing on the user. The basic idea of adaptive automation or adaptive support is to design a system that will give valuable help, being a companion for the operator during periods of expected over- or underload (Hoogeboom and Mulder, 2004), for example by controlling task demands.

Physiological information that can be used in adaptive automation may consist of response patterns to momentary changes in task load, of state changes related to continuous work, or of both. In this context, a reaction to changes in workload that last only a short period of time, e.g., 30 s to 1 min, is (part of) a response pattern to momentary changes. State changes indicate the long-term effects up to a couple of hours. One of the prerequisites for using either short-term response patterns or state changes is that consistent cardiovascular patterns have to be available at an individual level for the task under focus. Another requirement is that a distinction between periods of low and high workload can be made on the basis of changes found in these patterns. Knowing that this is a multidimensional problem, the focus of the present paper will be on the consistency of cardiovascular state changes in two task environments: an ambulance dispatching task and a driving simulator. Based on the results, suggestions will be made on how to apply such measures and which issues have to be addressed for applications in the world of adaptive automation.

Effects of mental effort on heart rate (HR) and heart rate variability (HRV) have been extensively studied during laboratory tasks (Mulder and Mulder, 1987; Backs and Seljos, 1994), simulated work (Brookings et al., 1996; Veltman and Gaillard, 1998; De Rivecourt et al., 2008; Dijksterhuis et al., 2011) and during real work (Roscoe, 1992; Wilson, 1993; De Waard et al., 1995; Hankins and Wilson, 1998). In general, during effortful working periods in these tasks, a pattern is found of increased HR in combination with decreased HRV, compared to resting baselines or compared to conditions of lower workload.

However, the sensitivity of these effects depends strongly on the task load differences between conditions, the type of task and time on task (Mulder and Mulder, 1987; Althaus et al., 1998; Backs and Boucsein, 2000). Effects of increased task demand correspond with elevated (finger) blood pressure (BP), lowered baroreflex sensitivity (Mulder and Mulder, 1981; Reyes del Paso et al., 2004) and higher respiration rate (Wientjes, 1992). Some authors have characterized this response as a defense reaction or a preparatory fight-or-flight response (Mulder, 1980; Jordan, 1990; Berntson et al., 1991). Generally speaking, the effects in laboratory tasks and simulated work are more transparent than those in real life work environments, due to varying task demands, unknown task characteristics, time on task effects and adaptivity of the short term blood pressure control system (baroreflex, e.g., Mulder, 1992) to these changing task demands. In laboratory conditions HRV clearly showed more consistent effort effects than HR and is therefore proposed as a good indicator of invested mental effort in cognitively demanding tasks (Mulder and Mulder, 1981).

A well-known fatigue and monotony effect on HR, i.e., a gradual decrease during the working day, is reported in many studies (Myrtek et al., 1994; Raggatt and Morrissey, 1997). A similar phenomenon is found during much shorter sessions, e.g., an hour of continuous task load. Mulder et al. (1993) showed that initial HR(V) effects resembling a defense reaction disappeared after 10–20 min in a memory search task that lasted 45 min, while BP and baroreflex sensitivity (BRS) remained the same after the initial effects, i.e., BP remained high and BRS remained low. The authors concluded that these effects were directly related to short-term blood pressure control (baroreflex, Van Roon et al., 2004; Mulder et al., 2009), a control system in the body that tries to maintain stable blood pressure. Baroreceptors monitor changes in blood pressure and influence sympathetic and parasympathetic activation. A continuing elevation of blood pressure increases sensitivity of this system and increases parasympathetic and decreases sympathetic activation, lowering heart rate and indirectly blood pressure. The question arises how generalizable and how consistent the patterns described above (an initial defense response followed and possibly overshadowed by effects of the short-term blood pressure regulation mechanism) actually are and how to interpret such changes in terms of autonomic control (Berntson et al., 1991; Backs, 1998).

The effects of the typical adaptive baroreflex response described above may obscure the direct effects of task demand manipulations, because these adaptive effects might overshadow or at least diminish the effects of changes in task demand. It has to be kept in mind that both types of effects occur simultaneously and are in principle present at any moment, which makes interpretation of effects of task demand more difficult. This may be a reason to split the research regarding these effects in two directions: the first perspective is to look at specific cardiovascular state changes over longer time periods. The second is to look at smaller time segments connected to specific short lasting changes in task demand. In this paper we will focus on the first approach.

One of the two environments was an ambulance dispatchers' simulation, in which the participants' main task was to send the ambulances to accidents or schedule them for non-emergency rides (Blandford and Wong, 2004; Mulder et al., 2009). Most of the workload was related to the cognitive aspects of the task, such as perceiving the environment and keeping an up to date model of what is happening at every moment. Handling the telephone calls, planning scheduled rides and deciding which ambulances to use also placed strain on the participant. Because it was a computerized task the work required little physical effort. Together with the time pressure in some periods, working memory aspects were the main contributing factors to the workload the participant experienced. In conclusion, the planning of non-urgent rides and keeping coverage of the region may be seen as a continuous loading task that was interrupted every now and then (increasing task difficulty) by incoming emergency calls.

The other environment was a fixed-base high fidelity driving simulator. In this study participants had to drive eight road sections on which they had to pass crossroads with heavy traffic. Driving in a simulator and more specifically passing intersections is predominantly a cognitive task. Monitoring traffic and deciding when to cross imposes more mental than physical load. There was, however, a higher degree of physical activity and less cognitive activity involved in the driving task compared to the dispatcher task. Participants had to steer and use the pedals, which may have evoked some minor physical load effects as well.

The studies reported here can be compared to previous research. Mulder et al. (2009), using the same ambulance dispatcher environment, found a cardiovascular pattern as a function of time of an ongoing increase in blood pressure in combination with decreased HR and baroreflex sensitivity, while HRV in the mid frequency band (0.07–0.14 Hz) increased as well as a function of time. No (continuous) blood pressure data from former driving simulator experiments are available. HR and HRV data are available, however. The main findings are that HR effects as a function of task load are in general stronger (larger increases) than those of HRV (expected decreases). The same pattern of effects was found in simulated flight (Veltman and Gaillard, 1998; De Rivecourt et al., 2008). Since flying also requires physical activity to a certain degree, this may be linked to the motor aspects of the task.

As described above, based on previous research, an increase in blood pressure over time and an accompanying lowering of heart rate and increase in heart rate variability can be expected in the dispatcher task. Since there is little explicit information about changes in blood pressure during driving, the same time course of changes in blood pressure, heart rate, baroreflex sensitivity and HRV may be expected to occur during driving as found in the ambulance dispatcher task.

To conclude this introduction the following research questions can be formulated:


# **METHODS**

The main goal of the present paper was to compare the patterns of cardiovascular state changes in two different simulated working environments: an ambulance dispatcher task and a driving task. In each of these environments an experiment has been conducted. In this section, the experimental settings and procedures are described for each of the experiments separately, while the common (data analysis) parts are described together at the end of this section.

# **AMBULANCE DISPATCHER TASK**

#### *Participants*

A total of 22 participants (between 19 and 27 years of age, 12 female) took part in the experiment. The data of 21 were used for analysis; the results of one participant were excluded due to difficulties with blood pressure measurement. All participants were students of the University of Groningen. They received a financial reward for their participation and signed an informed consent form at the start of the first training session. The study was approved by the ethical committee of the faculty of Behavioural and Social Sciences at the University of Groningen.

#### *Dispatcher task*

Participants had to perform three main tasks. Firstly, they had to activate emergency rides as a response to emergency calls. They had to select the optimal or most economic ambulance, which might have been an ambulance driving in the vicinity of where the emergency occurred or an ambulance from a post nearby. To make this choice, the participant needed a good overview of the current location of ambulances and the current situation in the region. The second task was scheduling ambulances that carried out non-urgent transport rides, transporting patients to and from hospitals as scheduled. In more detail: non-urgent rides were either in the "transport-list" at the beginning of each scenario or would come in during task performance as a telephone call (nonurgent). This part of the task required a lot of scheduling and planning. The last part of the task was to make sure that every place in the entire area could be reached by an ambulance within 15 min. To preserve this coverage the participant had to choose optimal ambulances for emergency rides and non-urgent transport. This also required good insight in the location and activities of the ambulances.

In this experiment the dispatch center was simulated on a computer with two screens. On one screen the communication and planning interface was shown. Participants used a regular mouse and keyboard as interface. On the other screen a map of the region was shown on which ambulances (moving) and hospitals, ambulance stations and emergency locations (stationary) were represented. Communication was achieved through screen messages to keep the system uniform and suitable for experiments.

#### *Training*

Participants completed an extensive training period of 6 h in three sessions. One of the goals of the training was learning the topography of the province of Groningen used in the task. Participants were trained in knowing the location of the most important places and the time it would take ambulances to drive between these places. However, if during the experiment a participant could not remember where a city was located or how long it would take an ambulance to reach a destination, he or she could use an interactive map included in the task interface to find that information.

A second goal of the training was to learn how to perform the task, i.e., how to use the application, find ambulances on the map, find cities and driving times between places, how to dispatch ambulances, plan non-emergency rides, and keep coverage. Participants were trained gradually and their progress was evaluated at several moments. Toward the end of the training period the scenarios became more realistic and were more similar to and at the same complexity-level as the scenarios used in the experiment. Since the simulated task consisted mainly of the planning aspect of the real dispatcher task, planning was the focal point of training and testing. Therefore, after training all participants were tested on their topographical knowledge and their skill in estimating how much time it would take to drive from one place to another. If they did not attain an acceptable level (60% correct), they were asked to spend extra time on training and were re-tested before the experiment. Of course, participants received less elaborate training than real dispatchers, and their training was only sufficient because the task had been simplified.

#### *Experimental procedure*

Participants attended two experimental sessions on two different days; each took about 3 h, including a break of 15 min in the middle. Each experimental session consisted of a 5-min baseline (Rest), four scenarios (each lasting 15 min), a break of 15 min, a 5-min Rest measurement, and again four scenarios of 15 min each.

In half of the scenarios, participants only had to respond to emergency calls and keep coverage. The scheduled rides were planned and activated "automatically", so that they did not need to be activated during the scenarios and could effectively be ignored. In the other four scenarios they had to plan and activate the scheduled rides and respond to emergency calls while guarding coverage. The order of presentation of scenarios was balanced so that order effects would average out in the results. Participants were randomly assigned to conditions.

# **SIMULATED DRIVING TASK**

#### *Participants*

From an initial number of 23 participants, 20 finished the experimental sessions (age between 19 and 25 years, 9 female). Participants were required to have held their license for at least a year and to have driven at least 5000 km. They received a financial reward for their participation. At the start of the experiment they filled in a questionnaire about their age and driving experience and signed an informed consent form. The study was approved by the ethical committee of the faculty of Behavioural and Social Sciences of the University of Groningen.

#### *Virtual driving environment and task*

The study was conducted using a ST Software© driving simulator, consisting of a fixed-base vehicle mock up with functional steering wheel, indicators, and pedals. The simulator was surrounded by three 32-inch diagonal plasma screens. Each screen provided a 70° view, leading to a total 210° view. A detailed description of the driving simulator functionality can be found in Van Winsum and Van Wolffelaar (1993). Participants steered with only the right hand to allow taking finger blood pressure measurements on the other hand. For the same reason the simulator car had automatic transmission.

Participants completed a route with a total of four rural and urban areas, with traffic at crossings coming either from both sides or from one side. Participants drove the same track twice, thus driving a total of eight sections, each of which took about 10 min to complete when driving at an average speed of 80 km/h. An urban segment with traffic coming from both sides on the crossing was followed by a rural section also with traffic from both sides, then an urban section with traffic coming only from the right side on the crossings, and finally a rural section with traffic also only from the right. Participants were randomly assigned to start at one of the four segments. The starting locations were balanced to prevent order effects.

In each section there were six crossings, without road-priority, while normal European traffic rules had to be followed. Other road characteristics such as lane width, curvature, number of lanes (one for each direction) were not different for the sections.

#### *Experimental procedure*

Participants completed one experimental session, which took about 2.5 h; in contrast to the dispatcher task there was only a 5-min rest measurement and no further break in the middle. The experimental session started with a short training ride of about 10 min to get acquainted with driving in the simulator. After that, there was a 5-min baseline (Rest), four road sections (lasting about 10 min each), a 5-min Rest, and another four road sections. After each road section participants were asked by a computerized voice to stop at a lay-by and report the mental effort they had invested during the preceding road section on the Rating Scale Mental Effort (RSME, Zijlstra, 1993). After four sections, participants had completed one round, after which they drove the same round again, meaning that they drove a total of eight sections. These eight sections were the driving task equivalent of the eight scenarios created for the dispatcher task.

#### *Cardiovascular measures*

Cardiovascular measurement and analysis procedures were similar for the two experiments and will be described together in this section.

The electrocardiogram (ECG) was recorded with three Ag-AgCl electrodes. The common electrode was placed at the sternum and the other two electrodes at the right and left side between the two lowest ribs. Blood pressure was measured with a FIN.A.PRES device (Finometer®). Both ECG and blood pressure were sampled at 250 Hz. R-peaks were detected online from the ECG by using a hardware ECG-trigger as part of a TMSi Porti measuring system (Twente Medical Systems International). Interbeat intervals (IBIs) were derived from R-peak time points and automatically corrected using the CARSPAN program (Mulder, 1992), followed by visual inspection and manual correction where necessary.
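As an illustration of how interbeat intervals follow from R-peak time points, the sketch below derives IBIs, flags implausible values, and computes a mean heart rate. The screening rule (a fixed fractional deviation from the median) and all function names are assumptions made for this example; they do not reproduce the correction procedure implemented in CARSPAN.

```python
def interbeat_intervals(r_peak_times_s):
    """Convert R-peak time points (seconds) into interbeat intervals (ms)."""
    return [
        (t1 - t0) * 1000.0
        for t0, t1 in zip(r_peak_times_s, r_peak_times_s[1:])
    ]


def flag_suspect_ibis(ibis_ms, tolerance=0.3):
    """Flag IBIs deviating more than `tolerance` (fraction) from the median.

    Hypothetical screening rule for illustration only; it is not the
    correction procedure implemented in CARSPAN.
    """
    ordered = sorted(ibis_ms)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return [abs(ibi - median) > tolerance * median for ibi in ibis_ms]


def mean_heart_rate(ibis_ms):
    """Mean heart rate in beats/minute, computed from the mean IBI."""
    return 60000.0 / (sum(ibis_ms) / len(ibis_ms))


# Made-up R-peak times with one missed beat between 2.4 s and 4.0 s.
peaks = [0.0, 0.8, 1.6, 2.4, 4.0, 4.8]
ibis = interbeat_intervals(peaks)
flags = flag_suspect_ibis(ibis)
```

In this toy series only the doubled interval caused by the missed beat is flagged; flagged intervals would then be candidates for the visual inspection and manual correction described above.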

Spectral analysis of all cardiovascular data was also done with CARSPAN, as well as calculation of mean values of heart rate and systolic blood pressure. Spectral HRV values, on the basis of heart rate changes, were only derived for the mid frequency band (0.07–0.14 Hz) based on previous findings in the dispatcher task environment (Mulder et al., 2009). HRV estimates were calculated as values normalized to the mean, i.e., modulation index (Mulder, 1992; Veldman et al., 1998). The variables were logarithmically transformed to obtain normally distributed variables (Van Roon et al., 2004). An index of baroreflex sensitivity (BRS) was created by calculating the transfer gain (modulus) from systolic blood pressure changes to interbeat interval changes in the mid frequency range (Robbe et al., 1987).
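The two spectral measures described above can be sketched in a few lines. The example assumes evenly resampled series, a plain discrete Fourier transform, and spectra pooled over the mid-frequency bins; CARSPAN's actual windowing and smoothing are not specified here, so the function names and the synthetic signals are illustrative assumptions only.

```python
import cmath
import math


def dft_bin(signal, k):
    """k-th discrete Fourier coefficient of a mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    return sum(
        (x - mean) * cmath.exp(-2j * math.pi * k * i / n)
        for i, x in enumerate(signal)
    )


def band_bins(n, fs, f_lo=0.07, f_hi=0.14):
    """Indices of DFT bins whose frequency lies in [f_lo, f_hi] Hz."""
    return [k for k in range(1, n // 2) if f_lo <= k * fs / n <= f_hi]


def mid_band_gain(sbp, ibi, fs):
    """Transfer gain (modulus) from SBP to IBI in the mid band.

    Gain = |cross-spectrum| / SBP auto-spectrum, both pooled over the bins.
    """
    cross, auto = 0j, 0.0
    for k in band_bins(len(sbp), fs):
        x, y = dft_bin(sbp, k), dft_bin(ibi, k)
        cross += x.conjugate() * y
        auto += abs(x) ** 2
    return abs(cross) / auto


def log_squared_modulation_index(signal, fs):
    """Natural log of mid-band variance divided by the squared mean."""
    n = len(signal)
    mean = sum(signal) / n
    band_var = sum(
        2.0 * abs(dft_bin(signal, k)) ** 2 / n**2 for k in band_bins(n, fs)
    )
    return math.log(band_var / mean**2)


# Synthetic 5-min series at 1 Hz with a 0.1 Hz oscillation (made-up values).
fs, n = 1.0, 300
t = [i / fs for i in range(n)]
sbp = [120.0 + 2.0 * math.sin(2 * math.pi * 0.1 * ti) for ti in t]
ibi = [850.0 + 16.0 * math.sin(2 * math.pi * 0.1 * ti) for ti in t]
hr = [70.0 + 3.0 * math.sin(2 * math.pi * 0.1 * ti) for ti in t]
gain = mid_band_gain(sbp, ibi, fs)  # close to 16 / 2 = 8 ms/mmHg
log_mi = log_squared_modulation_index(hr, fs)
```

Because the synthetic IBI oscillation is exactly eight times the SBP oscillation, the gain comes out near 8 ms/mmHg by construction. Real recordings would additionally need coherence checks before such a gain is interpreted as BRS.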

#### *Applying a baroreflex simulation model*

For estimation of (para-)sympathetic effects of workload in the simulated working environments we did a simplified simulation study using a baroreflex model. This model was initially developed by Wesseling and colleagues (Wesseling and Settels, 1985) and further extended, tested and configured for mental workload studies by van Roon (1998). In this model the basic mechanisms of short-term blood pressure control are implemented, such as baroreflex function, and effector systems like heart function (rate and contraction force), peripheral resistance changes and venous filling of the heart. The main assumptions in this model-approach are that after tuning model-parameters to a baseline measurement, all subsequent adaptations during task performance are related to sympathetic and vagal control. Van Roon et al. (2004) extensively described this model, including some working examples. A complete application study was performed by Althaus et al. (2004).

It would be beyond the scope of this paper to perform and describe complete simulations for the current data sets. Therefore, a simplified procedure is applied, making use of the main characteristics of the model. One of these characteristics is that mean blood pressure is nearly independent of vagal activation and therewith almost completely determined by sympathetic control (Van Roon et al., 2004). With this knowledge, sympathetic gain changes in respect to baseline can be determined. Having this information, subsequently, changes in vagal gain can be resolved from HR data. Finally, the obtained vagal and sympathetic gain estimates can be checked against the measured BRS and HRV values.

#### *Variables, experimental design and statistical analysis*

In the ambulance dispatcher task, analysis periods of 5 min were selected; for each of the scenarios three such periods were averaged to get one value per scenario. Taking averages of spectral values over 5-min periods instead of one spectral value for the total 15-min scenario helps overcome the problems that may arise from non-stationarities in the signal (Weber et al., 1992). A total of 10 values were obtained per variable, per session (two resting periods, eight scenarios). Data of the two sessions were averaged. For the driving task, analysis periods of 10 min were used for the driving segments and periods of 5 min for the resting phase. For the driving task this also resulted in 10 values: two resting periods and eight road segments.
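The aggregation for the dispatcher task can be written out as a small sketch. The function names and the heart rate values below are hypothetical and serve only to show how three 5-min periods per scenario, two rest periods, and two sessions reduce to 10 values per variable.

```python
def scenario_value(period_values):
    """Average the 5-min analysis periods of one scenario into one value."""
    return sum(period_values) / len(period_values)


def session_profile(rest_values, scenario_periods):
    """Build the 10-value profile in the order R1, S1-S4, R2, S5-S8.

    rest_values: [R1, R2]; scenario_periods: eight lists of three values.
    """
    scen = [scenario_value(p) for p in scenario_periods]
    return [rest_values[0]] + scen[:4] + [rest_values[1]] + scen[4:]


def average_sessions(profile_a, profile_b):
    """Average two session profiles element by element."""
    return [(a + b) / 2 for a, b in zip(profile_a, profile_b)]


# Made-up heart rate values (beats/minute) for two sessions.
session1 = session_profile([80.0, 74.0], [[78.0, 77.0, 76.0]] * 8)
session2 = session_profile([78.0, 72.0], [[75.0, 74.0, 73.0]] * 8)
profile = average_sessions(session1, session2)
```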

Analyses on the following variables will be reported: mean heart rate (in beats/minute), mean systolic finger blood pressure (in mmHg), HRV values from the mid-frequency band (in natural log-transformed squared modulation index values), BRS in the mid-frequency band (in ms/mmHg) and a rating of subjective mental effort on the RSME (Zijlstra, 1993).

The data were analyzed using the General Linear Model Repeated Measures test in SPSS. The same design was applied for both experiments. Repeated Measures MANOVAs were run on all four variables (heart rate, systolic blood pressure, baroreflex sensitivity and heart rate variability) simultaneously. Values for the first rest period vs. the second rest period were compared. Rest vs. task effects (two levels; first rest vs. task and second rest vs. task) were tested using the data from the first task segment following the rest period. It has to be noted that in the dispatcher task the first 5-min segment of the first scenario after a rest measurement was used, while for the driving experiment the first road segment (10 min) after a rest measurement was applied. Level differences as well as trend lines were tested for the first and the second task part (first four scenarios vs. the last four, first four road segments vs. the last four). This resulted in six different tests for each variable. To control the resulting familywise error rate, a corrected alpha (originally set to 5%) was calculated with the Holm-Bonferroni method (Holm, 1979). In the Results section it is indicated when a *p*-value was larger than the corrected alpha and the result was therefore not statistically significant.
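For readers unfamiliar with the Holm-Bonferroni correction, the step-down procedure can be sketched as follows. The six *p*-values are made up to stand in for the six tests on one variable; they are not results from this study, and the function name is an assumption for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per test using Holm's step-down method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with
        # alpha / (m - k + 1); testing stops at the first non-rejection.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Six made-up p-values standing in for the six tests on one variable.
p_vals = [0.001, 0.03, 0.004, 0.20, 0.012, 0.011]
decisions = holm_bonferroni(p_vals)
```

With these example values, the test with *p* = 0.03 falls below the uncorrected 5% level but is not rejected after correction, which is the kind of case flagged in the Results section.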

# **RESULTS**

# **AMBULANCE DISPATCHER TASK**

**Figure 1** shows the response patterns for HR, SBP, HRV, and BRS for the dispatcher task. Large changes only occur in the first part of the session, while values remain at the same (extreme) level after the first part. HR strongly decreases over the session, from about 80 beats per minute during the first rest, to about 70 beats/min during the last scenarios. Statistical analysis shows that there are no rest-task differences. There is a decrease in HR during the first half of the session [*F*(1, 20) = 39.6, *p* < 0.001], while the level does not change in the second half of the session. Overall, the HR level of the second part of the session is lower than the first part [*F*(1, 20) = 160.2, *p* < 0.05]. HR during the first rest period is lower than during the second one [*F*(1, 20) = 76.7, *p* < 0.001].

Systolic blood pressure increases strongly during the first scenario, compared to the preceding rest (about 7 mmHg). This increase continues during the first scenarios and SBP reaches very high values in the last scenario before the pause (about 15 mmHg higher compared to the first rest). After the pause, SBP stays at a high level and increases only slightly further. Results from statistical tests confirm an on-going blood pressure increase from rest to the first task [*F*(1, 20) = 15.6, *p* < 0.05], an ongoing increase in the first part [*F*(1, 20) = 13.1, *p* < 0.05], and a SBP difference between the first and the second rest measurement [*F*(1, 20) = 16.4, *p* < 0.001]. Finally there is a small but significant ongoing increase in the second part of the task [*F*(1, 20) = 8.01, *p* < 0.05].

Baroreflex sensitivity shows more or less the opposite effect from heart rate. BRS increases strongly during the first part of the session (about 3 ms/mmHg) and stays at an even higher level after the pause, during both the rest and the subsequent scenarios (a very large (significant) difference of about 7 ms/mmHg with the rest at the start of the session). A large difference in BRS between the first and the second rest measurement [*F*(1, 20) = 76.7, *p* < 0.001] was found, in combination with a clear increase [*F*(1, 20) = 39.6, *p* < 0.001] of BRS during the first part of the session and no further increase in the second part. There was no initial rest-task difference, while the BRS level was distinctly higher during task performance after the pause than before [*F*(1, 20) = 61.6, *p* < 0.05].

HRV shows a gradual increase during the session: a linear increase in HRV can be seen during the first [*F*(1, 20) = 29.4, *p* < 0.001] and the second [*F*(1, 20) = 9.1, *p* < 0.05] part of the session. The HRV level was lower in the first than in the second part of the session [*F*(1, 20) = 13.7, *p* < 0.05].

Participants reported a higher rating of subjective effort in the first session than in the second [*F*(1, 20) = 18.46, *p* < 0.001], with an average for both sessions of 45.8. A difference of 16.6 between easier and more difficult scenarios was also found, with 37.5 for the easier scenarios and 54.1 for the more difficult ones [*F*(1, 20) = 49.50, *p* < 0.001]. The results are summarized in **Table 1**.

#### **SIMULATED DRIVING TASK**

**Figure 2** shows the response patterns for HR, SBP, HRV, and BRS for the simulated driving task. The overall time course of the cardiovascular variables in the driving task is quite different from that in the dispatcher task. This holds in particular for SBP and BRS.

The heart rate pattern can be characterized by an initial increase during driving compared to the preceding rest in both the first and the second part of the task. The initial increase is followed by a gradual decrease, which is strongest in the first part of the session. Heart rate is lower during the second part of the session, both during rest and driving. Statistical analysis confirms the initial HR increase from rest to task [first part: *F*(1, 19) = 8.19, *p* < 0.05] and the gradual decrease during driving [*F*(1, 19) = 14.7, *p* < 0.05] in the first part. HR is indeed lower during the second rest [*F*(1, 19) = 13.9, *p* < 0.05] and the second part of the task [*F*(1, 19) = 21.1, *p* < 0.001] compared to the corresponding periods in the first part of the session.

The blood pressure pattern resembles the heart rate pattern to a large extent. There is a very strong initial increase of SBP (about 17 mmHg) in the first part of the driving task compared to the preceding rest. SBP level shows a gradual and strong decrease during the first part of the driving task (more than 10 mmHg). In the second part of the task the pattern is quite different: a smaller initial increase after rest is found, as well as a small gradual increase during driving. Statistical analysis confirms the initial rest-task difference [first part: *F*(1, 19) = 63.4, *p* < 0.001; second part: *F*(1, 19) = 19.6, *p* < 0.001] and the SBP decrease in the first part [*F*(1, 19) = 34.6, *p* < 0.001].

*[Figure 1: response patterns in the ambulance dispatchers' task. Heart rate in beats per minute; systolic blood pressure in millimeters of mercury. R1, rest period 1; R2, rest period 2; S1–S8, scenarios 1–8.]*

BRS results show only trends for rest-task differences that are not statistically significant (after Holm-Bonferroni correction), for both the first part [*F*(1, 19) = 5.16, *p* = 0.035] and the second part [*F*(1, 19) = 4.16, *p* = 0.056]. There were no other effects on BRS.
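The Holm-Bonferroni step-down procedure mentioned above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the size of the test family is not stated in the text, so the family below combines the two reported BRS p-values with two made-up placeholder values, and the function name `holm_bonferroni` is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank), with rank starting at 0.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail as well
    return reject


# Hypothetical family: the two BRS rest-task p-values from the text,
# plus two invented placeholders standing in for other contrasts.
p_family = [0.035, 0.056, 0.001, 0.004]
decisions = holm_bonferroni(p_family)
```

Under these assumptions an uncorrected p of 0.035 no longer passes, because it is compared against 0.05/2 rather than 0.05, which illustrates how a nominally significant effect can be reported as a trend.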

#### **Table 2 | Results of the driving simulator study.**


HRV shows a gradual increase during the first part of the session [*F*(1, 19) = 19.6, *p* < 0.001]. HRV does not seem significantly higher in the second part than in the first part of the session, nor does it seem to increase within the session [*F*(1, 19) = 4.8, *p* < 0.05]. The results of the driving simulator study are summarized in **Table 2**.

Subjective mental effort measured by RSME shows no effects over time. The average level of rated effort was 40.8, with a difference of 7.3 between easier and more difficult sections.

#### *Results of the baroreflex model simulations*

The results of the baroreflex model simulation are depicted in **Figure 3**. For the ambulance dispatcher task, sympathetic gain decreases by 20% in the first 15 min compared to the baseline. This can be derived from the 20% increase in blood pressure from the first rest to the first task (**Figure 1**). It is important to note that, although it sounds illogical, both in the model and in real life a decrease of sympathetic gain corresponds with increased sympathetic activity. This inverse relationship does not occur for the vagal control loop. After the initial decrease in sympathetic gain, a further decrease toward 30% occurs during the remaining first hour of task performance. In the second half of the task, sympathetic gain stays 20% below baseline. From the relationship between vagal gain, sympathetic gain and heart rate, given by the model, it can be derived that vagal gain does not change in the first 15 min. The decrease in heart rate from rest to task is therefore in this case mainly due to sympathetic gain changes. The same relationship given by the model suggests that the lower heart rate during task performance is due to an increase in vagal gain, partly compensated by a decrease in sympathetic gain. More specifically, vagal gain increases strongly and gradually by 40% in the remaining part of the first hour. In the second hour it remains constant at an even higher level of about 60% compared to baseline.

For the simulated driving task, the initial decrease of sympathetic gain is strong, between 35 and 40% as derived from the increase in blood pressure. The large decrease diminishes gradually in the remaining part of the first hour toward 20%. In the second rest measurement sympathetic gain returns to baseline (even 5% higher), while during the second half of the driving task this level is decreased by about 20%. After the first 15 min, vagal gain shows quite a different pattern compared to the ambulance planning task. During the first 15 min vagal gain does not change compared to baseline (changes in heart rate are again mainly due to sympathetic changes). Toward the end of the first driving hour it gradually decreases by about 15%, when the effects of vagal gain are almost totally compensated by sympathetic changes, which is indicated by heart rate returning to its original level. During the subsequent resting period, vagal gain returns to baseline level (even 5% higher) when blood pressure is at or below baseline and heart rate as well. It decreases when driving starts again to stay at a constant level of 20% gain reduction indicated by a rise in both blood pressure and heart rate.

# **DISCUSSION AND CONCLUSION**

The main research questions were: can we find a consistent pattern of state changes, will these patterns be similar in different test environments, can these patterns be explained in terms of cardiovascular regulation and what can be concluded with respect to the usability of such patterns for operator state assessment?

Two characteristic, but different cardiovascular patterns were found as a function of time-on-task for the two different tasks. The pattern of the ambulance dispatcher task can be characterized by a small but distinct initial increase in blood pressure, followed by an on-going increase during the first hour of task performance. The break did not reduce this level and blood pressure remained high during the second hour. This pattern was accompanied by strongly reduced heart rate and extremely increased baroreflex sensitivity. The whole pattern may be summarized as a relatively small initial task effect (defense/fight or flight response) followed by strong regulatory effects of baroreflex short-term blood pressure control.

The patterns found in the driving task (**Figure 4**) have other characteristics than those found in the dispatcher task. The initial effects are stronger, while no ongoing increase in blood pressure is seen, which is reflected in a quite different baroreflex pattern. A strong initial increase of systolic blood pressure in the first 10 min of task performance is followed by an ongoing decrease. The initial increase is reduced by more than 50% in the subsequent hour. The heart rate pattern resembles the blood pressure pattern, with the exception that the decrease in heart rate was stronger. Remarkable is the baroreflex pattern, which shows no significant initial rest-task differences and remains at the same level during each of the two driving hours. BRS seems to have much lower values during driving than in the ambulance dispatcher's task, though this has not been tested. The magnitude of change in the ambulance dispatcher's task is quite large at 5.5 ms/mmHg, whereas in the driving task it is not significant at 1.2 ms/mmHg.

The question arises what the reason is for these different time-on-task cardiovascular patterns for these two task environments and whether similar patterns have been found in other, comparable studies. First of all, reported mental effort was higher in the first session of the dispatcher task and decreased over time. This decrease was not found in the driving task. The difference between easier and more difficult scenarios was larger in the dispatcher task and on average higher, suggesting that workload was higher in the dispatcher task and even more so in the more difficult scenarios. This might partly explain the differences in baroreflex patterns between the two tasks.

The ambulance dispatcher task has been applied in a series of experiments in our laboratory in recent years. Mulder et al. (2009) reported two of these studies. The first showed the same pattern of results in an ambulance dispatcher task with alternating easy and difficult task periods, lasting for about 2 h. Heart rate and baroreflex changes were of the same magnitude as in the present experiment, while blood pressure changes were even somewhat larger. Results of the second study reported in Mulder et al. (2009) also showed the same pattern in a 1 h lasting task period, but the magnitudes of the responses were slightly lower for all variables, including systolic blood pressure, heart rate and baroreflex sensitivity. In a different planning task, carried out by Laumann (2004), studying restorative effects of a walk in nature after a preceding heavy cognitive workload session with planning work, the pattern of cardiovascular results also coincided with the present data with respect to blood pressure, heart rate, baroreflex sensitivity and HRV.

The most important aspect of the studies mentioned above is the cognitive demand these place on the participants. For comparison, in a study on visual fatigue, Veldman et al. (1998) found partly comparable and partly different results. They studied cardiovascular changes during a 2.5 h lasting editing and error-correction task (visual work). During the first part of this task exactly the same pattern of results was found as in the present task. The pattern in the second part, however, showed the opposite: blood pressure returned to original levels, while heart rate increased and baroreflex sensitivity decreased to starting levels.

Unfortunately, there are only very few studies in which blood pressure is measured continuously during driving or comparable tasks. The reasons are clear: both hands are needed for steering and controlling the car. For the current study we used a driving simulator with automatic transmission, keeping one hand available for finger blood pressure measurements. After a short training period of 10 min participants had no trouble controlling the vehicle and were used to the blood pressure measurement. The data of the present car driving study completely resemble the results of two laboratory studies reported in Mulder et al. (1993), including the initial strong rise of blood pressure followed by a gradual decrease and a (continuing) relatively small decrease of baroreflex sensitivity. In both studies a fast-paced memory search task was included (without counting) lasting for 45 min. Several other studies, in which no blood pressure was measured, show the heart rate and HRV pattern of the present study (an initial increase of HR and a decrease of HRV, followed by a gradual decrease of HR and an increase of HRV). In a 1.5 h lasting car-driving study on the road, testing vigilance, De Waard and Brookhuis (1991) found a gradually decreasing heart rate (of about 8–10%) and a corresponding increase in HRV. These results suggest that the experienced workload and related invested effort in the current driving simulator might have been very low as well. The similarity of effects in the current study seems to be confirmed by the effects on reported mental effort.

In an overview, Backs and Boucsein (2000) list studies in which heart rate decreases as a function of time-on-task, sometimes in combination with an increasing HRV. Such studies include, for instance, monotony in train drivers (Myrtek et al., 1994), prolonged city bus driving (Milosovic, 1997) and long-haul bus driving (Raggatt and Morrissey, 1997). It must be noted that in most of these studies this pattern is considered to be connected to vigilance, diminished arousal or fatigue. Moreover, it has to be mentioned that this HR(V) pattern does not distinguish between the cardiovascular patterns of the present driving task and the ambulance planning task.

In conclusion, the above shows that results similar to those found in the dispatcher task are found in studies that are very comparable in nature to this planning task, while results similar to those of the driving task are found in other visually demanding tasks and other driving studies. As may be expected, this suggests that the nature of the tasks determines the cardiovascular response patterns to a great extent.

#### **BAROREFLEX MODEL SIMULATIONS**

The next question to be answered is in which way the results of the present two studies can be characterized in terms of autonomic activation and short-term blood pressure (baroreflex) control. We performed a simplified simulation study using a baroreflex model for this purpose. This simplified procedure worked very well for the present data set, although it has of course restrictions with respect to the accuracy of the estimates. Complete simulations would have obtained better results but the present approach is good enough to describe patterns of autonomic activation in simulated task environments.

In terms of autonomic activation, the main differences between the two task situations can be summarized as follows: in the driving task the initial sympathetic activation is higher than in the ambulance planning task, while this activation as a function of time on task reduces during driving and increases during ambulance planning. This pattern completely corresponds with the blood pressure pattern. The main difference, however, is seen in vagal activation, which is decreased during driving and strongly increased during ambulance planning. The differences are clearly reflected in baroreflex sensitivity, being at a high level during the planning task and at a low level during driving.

#### **CONCLUSION**

Looking at the overall results, one might conclude that there are distinctly different response patterns. For the ambulance planning task this pattern is very consistent over a series of studies, while this still has to be confirmed in future experiments for the driving task. The basic differences between the tasks are at the level of working memory and planning in the dispatcher task and active control (steering) in the driving environment. The difference in response patterns may perhaps be explained by looking at the level of control in the two tasks. Rasmussen (1987) differentiated between different levels of control: skill-based, rule-based and knowledge-based. According to him, skill-based task performance requires fast and almost effortless processing, in contrast with knowledge-based performance, which requires slow, serial and effortful processing. Within the context of driving, Michon (1985) made a distinction between decision making at the tactical and operational level, which would correspond to knowledge-based and skill-based, respectively. An important difference between the two current studies is that the demands between intersections in the driving simulator task are mainly on an operational or skill-based level whereas those in the dispatcher task are most of the time on the knowledge-based or tactical level. There are, however, no conclusive arguments why this aspect would completely explain these large differences in response patterns. One possible additional explanation is that the dispatcher task imposes such a high, continuing workload, resulting in an ongoing increase in blood pressure, that the baroreflex is activated as strongly as possible in an effort to reduce blood pressure (Julius, 1988).

The question is what this means for applications in adaptive automation. With respect to the time duration of the working period, one could imagine that health consequences may build up in case an operator continuously goes on with working hard, having corresponding increases in blood pressure in combination with a maximal functioning baroreflex, for many hours as may occur in the ambulance planning task. For operators, doing this kind of work every day, it might be helpful or even necessary to reduce workload, for instance by having a "digital companion" that helps to reduce working memory activity. Although, in the short term this may not really help to reduce blood pressure, in the long run it might become important for keeping blood pressure within healthy boundaries.

The present (state) data are not really sensitive to changing short-term task demands; it is advised to use more short-term variables, such as HRV, RSA and respiratory rate, for overload estimation. In this way overload might be detected at a short timescale and task load can be adapted on the basis of such variables in combination with task performance measures.

What is clear from these two studies is that the effects of time-on-task, or more specifically the blood pressure regulation mechanisms that cause these effects, have a large influence on the studied measures. When the measures studied in this paper are applied for workload assessment, it has to be recognized that a substantial part of the effects observed may be related to short-term blood pressure control and not always directly related to workload manipulations. To find useful indices of mental workload (and invested effort) for applications in adaptive automation, it is necessary to take into account that the effects of time-on-task may actually overshadow the effects of workload manipulations. Success in adaptive automation based on psychophysiological measures may depend on the development of measures that are more sensitive to workload manipulations.

Furthermore, although different tasks elicit different effects, consistency within tasks is very high. Therefore, it can be concluded that these measures give insight into the cardiovascular effects of complex task performance but must be studied within the environment they are to be used in. Measuring and analyzing blood pressure next to heart rate is very helpful, if not necessary, in understanding these cardiovascular patterns.

#### **ACKNOWLEDGMENTS**

Parts of this work were completed within the REFLECT project funded by the European Commission within the 7th Framework programme. REFLECT investigates ways of realizing pervasive-adaptive environments. The contributions and comments of Karel Brookhuis, Chris Dijksterhuis, Arie van Roon and Dick de Waard to this work have been very valuable and are gratefully acknowledged.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 13 December 2013; accepted: 18 November 2014; published online: 05 December 2014.*

*Citation: Stuiver A and Mulder B (2014) Cardiovascular state changes in simulated work environments. Front. Neurosci. 8:399. doi: 10.3389/fnins.2014.00399*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Stuiver and Mulder. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Estimating endogenous changes in task performance from EEG

# *Jon Touryan\*, Gregory Apker, Brent J. Lance, Scott E. Kerick, Anthony J. Ries and Kaleb McDowell*

*U.S. Army Research Laboratory, Human Research and Engineering Directorate, Aberdeen Proving Ground, MD, USA*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Ricardo Chavarriaga, Ecole Polytechnique Fédérale de Lausanne, Switzerland Jose Luis Contreras-Vidal, University of Houston, USA*

#### *\*Correspondence:*

*Jon Touryan, U.S. Army Research Laboratory, Human Research and Engineering Directorate, RDRL-HRS-C, Aberdeen Proving Ground, MD 21005, USA e-mail: jonathan.o.touryan.ctr@mail.mil*

Brain wave activity is known to correlate with decrements in behavioral performance as individuals enter states of fatigue, boredom, or low alertness. Many BCI technologies are adversely affected by these changes in user state, limiting their application and constraining their use to relatively short temporal epochs where behavioral performance is likely to be stable. Incorporating a passive BCI that detects when the user is performing poorly at a primary task and adapts accordingly may increase overall user performance. Here, we explore the potential for extending an established method to generate continuous estimates of behavioral performance from ongoing neural activity; evaluating the extended method by applying it to the original task domain, simulated driving; and generalizing the method by applying it to a BCI-relevant perceptual discrimination task. Specifically, we used EEG log power spectra and sequential forward floating selection (SFFS) to estimate endogenous changes in behavior in both a simulated driving task and a perceptual discrimination task. For the driving task the average correlation coefficient between the actual and estimated lane deviation was 0.37 ± 0.22 (*μ* ± *σ*). For the perceptual discrimination task we generated estimates of accuracy, reaction time, and button press duration for each participant. The correlation coefficients between the actual and estimated behavior were similar for these three metrics (accuracy = 0.25 ± 0.37, reaction time = 0.33 ± 0.23, button press duration = 0.36 ± 0.30). These findings illustrate the potential for modeling time-on-task decrements in performance from concurrent measures of neural activity.

**Keywords: EEG, performance estimation, BCI, fatigue, driving, rapid serial visual presentation (RSVP)**

# **INTRODUCTION**

Brain-Computer Interaction (BCI) technologies that enable computer systems to adapt to the current cognitive or affective state of the user provide a promising avenue for developing systems that will improve human interaction with computers, the environment, and even each other (Zander and Kothe, 2011; Lance et al., 2012). Among the broad range of BCI technologies, the majority of systems and approaches have been within the active and reactive BCI paradigms (see Zander and Kothe, 2011 for review). These two classes of BCIs seek to decode volitionally induced or externally elicited patterns of neural activity over a relatively short timescale, on the order of milliseconds to seconds. In contrast, passive BCIs utilize implicit or ongoing neural responses for the purpose of detecting an operator's current cognitive or affective state. Typically, passive BCIs assess changes in neural activity over relatively longer timescales, on the order of seconds to minutes. To date, active and reactive BCI technologies have shown limited success outside of specific patient populations. One reason for this is a lack of robustness, partly due to the non-stationarity of neural signals (Von Bünau et al., 2009; Liyanage et al., 2013). In addition to large inter-session variability, fatigue and other sources of time-on-task decrements in performance can be reflected as a non-stationarity in the neural activity. These effects can be particularly pronounced for tasks that require sustained levels of attention (Tonin et al., 2013). In fact, many active and reactive BCI paradigms are specifically designed to minimize task-induced fatigue and often operate on relatively short timescales (Gao et al., 2014). In this study, we investigate the possibility of addressing this form of non-stationarity through the incorporation of an algorithm designed to identify fatigue-based decrements in performance.

The ability to detect changes in performance directly from biological markers has been an area of growing interest over recent decades. One particularly relevant application is the detection of fatigue, drowsiness, or reduced alertness during driving. Because fatigue is a major cause of accidents and injury when operating motor vehicles (Connor, 2002), robust identification of fatigue before it impairs behavior would be of significant value. To this end, numerous studies have identified indicators of fatigue-induced changes in driver performance from both physiological observables (Vural et al., 2009; Sommer and Golz, 2010; Vogel et al., 2010) and neural signals (Borghini et al., 2012), predominantly via electroencephalography (EEG). Furthermore, research groups have recently developed systems for real-time detection of attentional lapses, due to drowsiness or fatigue, from concurrent measures of brain activity (Davidson et al., 2007; Lin et al., 2010a). These approaches fall within the broader passive BCI framework, and are ideal for capturing slow fluctuations in behavioral performance.

Unlike active and reactive BCI paradigms, studies of time-on-task decrements in performance use sustained and monotonous tasks such as highway driving. With such tasks, performance begins to degrade as a function of time, presumably induced by fatigue or inattentiveness due to boredom. Features of the EEG signal, such as fluctuations in power along certain frequencies or changes in evoked amplitudes, can then be correlated with this degradation in performance. Many studies exploring the neural correlates of fatigue use changes in the EEG log power spectrum as principal features in their analysis (Jung et al., 1997; Lal and Craig, 2005; Jap et al., 2009; Balasubramanian et al., 2011). This idea is based on a large body of literature that has linked EEG frequency bands, such as theta (4 to 8 Hz) and alpha (8 to 13 Hz), to changes in task-relevant behavior. In contrast, a more general but potentially powerful approach was originally proposed by Lin et al. (2005a). This approach makes no *a priori* selection of frequency bands; instead, it uses principal component analysis to identify the sets of frequencies that explain the most variance in the EEG power spectrum. The power distribution along these frequencies is then linearly integrated, via a data-driven model, to produce a time-varying estimate of behavior.
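The general idea of projecting log power spectra onto principal components and linearly combining the scores into a behavioral estimate can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of Lin et al. (2005a) or of the present study: the function names, the number of components, the ordinary least-squares fit, and the synthetic data are all hypothetical, and no feature selection step is included.

```python
import numpy as np


def fit_spectral_regression(log_power, behavior, n_components=3):
    """Fit PCA on log power spectra, then a linear map from PC scores to behavior.

    log_power: array (n_windows, n_frequencies); behavior: array (n_windows,).
    """
    mean_spectrum = log_power.mean(axis=0)
    centered = log_power - mean_spectrum
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    # Least-squares weights, with a trailing column of ones for the intercept.
    design = np.column_stack([scores, np.ones(len(scores))])
    weights, *_ = np.linalg.lstsq(design, behavior, rcond=None)
    return mean_spectrum, components, weights


def estimate_behavior(log_power, mean_spectrum, components, weights):
    """Project new spectra onto the stored components and apply the weights."""
    scores = (log_power - mean_spectrum) @ components.T
    return scores @ weights[:-1] + weights[-1]


# Synthetic demonstration (made-up data, not from the study).
rng = np.random.default_rng(0)
drift = np.linspace(0.0, 1.0, 400)                  # slow time-on-task trend
profile = rng.normal(size=30)                       # hypothetical spectral signature
spectra = np.outer(drift, profile) + 0.1 * rng.normal(size=(400, 30))
lane_deviation = 3.0 * drift + 0.1 * rng.normal(size=400)

train, test = slice(0, None, 2), slice(1, None, 2)  # interleaved split
model = fit_spectral_regression(spectra[train], lane_deviation[train])
estimate = estimate_behavior(spectra[test], *model)
r = np.corrcoef(estimate, lane_deviation[test])[0, 1]
```

The correlation `r` between estimated and actual behavior plays the same role as the correlation coefficients reported in the abstract, although the synthetic signal here is far cleaner than real EEG.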

While identification of fatigue-induced decrements in driver performance is of obvious importance, other perceptually demanding tasks suffer similar time-on-task decrements, including air-traffic control (Grandjean et al., 1971), and x-ray screening (Basner et al., 2008). Importantly, it is for these types of tasks that the next generation of reactive BCI technology is being developed. However, less is known about the neural correlates of behavior for these more complex tasks. As BCI technologies transition into wider application domains and extended use scenarios, they must be able to adapt to the inevitable fluctuations in human performance. Accordingly, the first step in this process is to understand the link between the neural state and corresponding behavior in the context of time-on-task induced decrements in performance.

Here, we sought to address this important issue by using a data-driven approach to link endogenous changes in behavioral performance to concurrent measures of neural activity. Our goal was to use slow fluctuation in the EEG log power spectrum to estimate time-on-task decrements in performance, based on an extension of a similar BCI paradigm used for drowsiness detection (Lin et al., 2010a), and apply this method to an image triage paradigm increasingly common in BCI technologies (Gerson et al., 2006; Sajda et al., 2010; Touryan et al., 2011, 2013b; Yu et al., 2012). To accomplish this, we designed a study in which participants engaged in both a monotonous driving task and a prolonged rapid serial visual presentation (RSVP) task. Importantly, to quantify the nature and degree of the time-on-task decrements in performance, we acquired subjective, behavioral, and neurophysiologic measures throughout the experiment. We extended the behavior-estimation method found in Lin et al. (2005a), evaluated the method using the data from the simulated driving task, and applied it to the RSVP image triage task. The results of this study suggest that opportunistic identification of time-on-task performance decrements within an event-based BCI is both feasible and advantageous.

# **METHODS**

Twenty-five participants were recruited from the general population. They ranged in age from 21 to 57 (*μ* = 34.6) and included ten males. Twenty-one of the participants were right handed, two were left handed, and two were ambidextrous. All individuals participated in a single multi-hour session containing three phases and received compensation of \$20 per hour. The voluntary, fully informed consent of the persons used in this research was obtained as required by Title 32, Part 219 of the Code of Federal Regulations and Army Regulation 70-25. The investigator adhered to the policies for the protection of human subjects as prescribed in AR 70-25. None of the participants were excluded from the analysis due to noise, movement artifacts, or low behavioral performance. The study design involved three tasks (**Figure 1**): calibration, driving, and rapid serial visual presentation (RSVP). The calibration session was always performed first, but the order of the driving and RSVP tasks alternated across participants.

# **CALIBRATION**

This task consisted of a standard driving simulator, developed with SimCreator® (Real Time Technologies; Dearborn, MI), that utilized steering wheel and foot pedal controls. In this task the vehicle was moving down a straight highway at a constant speed (computer controlled) in the rightmost lane. Participants were asked to maintain the vehicle position within the current cruising lane by correcting for any perturbation or drift. At random intervals a lateral perturbation to the right or left was applied to the vehicle, causing it to begin to veer off course. The strength of the perturbation increased until a corrective steering adjustment (greater than 4°) was made, at which point the perturbation ceased, allowing the participant to return the vehicle to the center of the rightmost lane. The perturbations would only resume once the vehicle was back in the cruising lane for at least 8 s. If the vehicle drifted far beyond the edge of the simulated roadway, participants would receive audible feedback (i.e., rumble strip noise). The simulated environment was minimal and included no traffic or scenery in order to induce boredom and task fatigue. The calibration task consisted of a single 15 min block and was designed for the acquisition of EEG baseline activity.
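The perturbation logic described above can be sketched as a small state machine. This is an illustrative sketch only, not the SimCreator implementation: the class name, the linear ramp, the `ramp_rate` value, the sign convention (positive means rightward), and the reading of "corrective" as steering against the perturbation direction are all assumptions; only the 4° threshold and the 8 s dwell time come from the text.

```python
from dataclasses import dataclass

STEER_THRESHOLD_DEG = 4.0  # corrective steering threshold, from the text
LANE_DWELL_S = 8.0         # required time back in the cruising lane, from the text


@dataclass
class PerturbationController:
    ramp_rate: float = 0.5     # hypothetical: strength units added per second
    active: bool = False
    strength: float = 0.0
    direction: int = 0         # +1 = rightward, -1 = leftward (assumed convention)
    time_in_lane: float = 0.0

    def start(self, direction):
        """Try to begin a perturbation; return True if it was started."""
        if self.active or self.time_in_lane < LANE_DWELL_S:
            return False
        self.active, self.direction, self.strength = True, direction, 0.0
        return True

    def step(self, dt, steering_deg, in_lane):
        """Advance by dt seconds and return the signed lateral perturbation."""
        if self.active:
            # Assumed: a corrective adjustment steers against the perturbation.
            corrective = steering_deg * self.direction < -STEER_THRESHOLD_DEG
            if corrective:
                self.active = False
                self.strength = 0.0
                self.time_in_lane = 0.0
            else:
                self.strength += self.ramp_rate * dt
        else:
            # Count continuous time in lane; leaving the lane resets the timer.
            self.time_in_lane = self.time_in_lane + dt if in_lane else 0.0
        return self.direction * self.strength
```

A caller would invoke `start` at its own random intervals; the controller simply refuses to start until the dwell condition is met.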

# **DRIVING**

This task was similar to the calibration task except that participants were now given control over the vehicle speed via accelerator and brake pedals. Current vehicle speed was indicated by a digital speedometer at the bottom of the screen. Participants were asked to maintain both the vehicle position and speed. Speed limit signs were posted at regular intervals with values of either 25 or 45 miles per hour. Again, the simulated environment was minimal and included no traffic or scenery. The driving task consisted of 6 blocks of 10 min each with breaks of approximately one minute between blocks.

# **RSVP**

This task consisted of a rapid presentation of color photographs (512 × 662 pixels) of indoor and outdoor scenes. The images were presented at 5 Hz (200 ms per image) and subtended a visual angle of approximately 9°. Every 10 s a blank screen with the word "blink" was presented to give participants a chance to blink without missing stimuli. The RSVP task consisted of 6 blocks of 10 min each (to mirror the driving task). All scenes contained only inanimate objects and were manually scaled and cropped. Some scenes contained target objects and others did not. Before each block participants were instructed as to the class of target objects for that block. The target classes for this experiment were: stair, container, poster, chair, and door. Before the task began, participants were familiarized with exemplars from each target class. During the RSVP, participants were instructed to press a button only when they saw an object from the current target class. The order of the target classes was randomly chosen for each participant (blocks 1–5); however, the last block (block six) always had the same target class as the first block. In addition to target class, target probability varied across each block. Six target probability values (0.01, 0.03, 0.05, 0.07, 0.09, and 0.11), one for each block, were randomly assigned at the beginning of the task.

#### *Subjective measures*

In addition to biographical information, various cognitive and personality metrics were obtained via standard questionnaires or timed assessments at the beginning of the experiment. The data from these cognitive and personality assessments were not included in the present study. Self-reports of fatigue were obtained using three different questionnaires: (i) the Visual Analog Scale for Fatigue (VAS-F; Monk, 1989), (ii) the Task-Induced Fatigue Scale (TIFS; Matthews and Desmond, 1998), and (iii) the Karolinska Sleepiness Scale (KSS; Akerstedt and Gillberg, 1990). The VAS-F was administered once after each task (calibration, driving and RSVP). The TIFS and KSS were administered once after the calibration task, after each 10 min block in the driving and RSVP tasks, and once at the end of the experiment. In order to account for individual differences in basal fatigue level, scores were normalized by the mean value over the experiment for each participant.

#### *Behavioral measures*

During the driving simulator task, various vehicle state measures were acquired at 100 Hz. Since the task objective was to maintain vehicle position within the rightmost lane, lane deviation (the difference between the vehicle's lateral position and the center of the lane) was the metric used to assess driver performance. During the RSVP task, participants pressed a button only when they saw a target object. Accuracy, reaction time (RT) and press duration were determined from this button response. Because the image duration (200 ms) was much less than the average RT, button responses were assigned to images in the following manner. For each button press, images within the time window of 300 to 1000 ms preceding the response were identified. If one or more of these preceding images was a target, the button press was assigned to the first (oldest) target image. RT was then calculated from the onset of that target image. If no targets occurred within the preceding time window, the button press was assigned to the nontarget image that preceded the button press by 600 ms (a standard RT value). However, due to the ambiguity of assigning a button press to a non-target image, RT statistics were not calculated for non-target images and this assignment process was only used to determine the false alarm rate.
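The assignment rule for button presses can be sketched in a few lines. This is only an illustrative reading of the procedure described above: the function name, argument names, and the choice to treat "the image that preceded the button press by 600 ms" as the image on screen at that moment are assumptions, not the authors' implementation.

```python
from typing import List, Optional, Tuple


def assign_button_press(
    press_ms: float,
    onsets_ms: List[float],
    is_target: List[bool],
    window_ms: Tuple[float, float] = (300.0, 1000.0),
    standard_rt_ms: float = 600.0,
) -> Tuple[Optional[int], Optional[float]]:
    """Return (image index, RT in ms) for a single button press.

    RT is None when the press is assigned to a non-target image,
    because RT statistics are not computed for false alarms.
    """
    lo, hi = window_ms
    # Images whose onset precedes the press by 300 to 1000 ms.
    in_window = [i for i, t in enumerate(onsets_ms) if lo <= press_ms - t <= hi]
    targets = [i for i in in_window if is_target[i]]
    if targets:
        # Assign to the first (oldest) target in the window.
        first = min(targets, key=lambda i: onsets_ms[i])
        return first, press_ms - onsets_ms[first]
    # Assumption: otherwise use the image on screen 600 ms before the press.
    on_screen = [i for i, t in enumerate(onsets_ms) if t <= press_ms - standard_rt_ms]
    if not on_screen:
        return None, None
    return max(on_screen, key=lambda i: onsets_ms[i]), None
```

With images every 200 ms, a press at 1100 ms following targets at 200 ms and 600 ms would be credited to the 200 ms target, giving an RT of 900 ms.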

For the RSVP task, behavioral responses were strongly influenced by perceptual difficulty; some targets were obvious and identified in all instances while other targets were subtle and only identified in some instances. The effect of perceptual difficulty was even evident at the aggregate level of target class (see **Table 3**), where the average accuracy was greater for some classes of target objects (e.g., chair) relative to others (e.g., poster). Therefore, to mitigate the influence of perceptual difficulty, we calculated a normalized behavioral metric. Specifically, average accuracy, RT, and duration were calculated for each target image across all participants (grand average). Then, for each instance of that target within the RSVP stream this grand average was subtracted from the behavioral response. This difference was then added to the nominal value for each measure. For accuracy, the nominal value was one: accuracy values greater than one indicated the participant was more accurate than average for that target image. For RT, the nominal value was 600 ms: RT values greater than 600 ms indicated the participant was slower than average for that target image. Lastly, the nominal value for duration was 300 ms: duration values greater than 300 ms indicated the participant depressed the button longer than average for that target image. These nominal values were chosen as round numbers that reflect the average behavioral response across participants. In addition to perceptual difficulty, target probability had a pronounced effect on RT and button press duration. Across participants, both RT and duration decreased as target probability increased (RT: slope = −262 ms, *p <* 0*.*05; duration: slope = −277 ms, *p <* 0*.*05). Using the inverse of these slopes, we adjusted both RT and duration as a function of target probability, on a block-by-block basis.
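As a minimal sketch of the perceptual-difficulty normalization, the snippet below subtracts the grand average for each target image and adds the nominal value. The function and variable names are hypothetical, the input is assumed to pool responses from all participants, and the subsequent target-probability adjustment is not shown.

```python
import numpy as np

# Nominal values stated in the text (accuracy is unitless, others in ms).
NOMINAL = {"accuracy": 1.0, "rt_ms": 600.0, "duration_ms": 300.0}


def normalize_by_image(values, image_ids, nominal):
    """Remove the per-image grand average and re-center on the nominal value.

    values    : one behavioral response per target presentation,
                pooled over all participants (assumption)
    image_ids : identifier of the target image for each response
    """
    values = np.asarray(values, dtype=float)
    image_ids = np.asarray(image_ids)
    out = np.empty_like(values)
    for img in np.unique(image_ids):
        mask = image_ids == img
        grand_average = values[mask].mean()
        out[mask] = values[mask] - grand_average + nominal
    return out
```

For example, two RTs of 500 and 700 ms to the same image (grand average 600 ms) stay at 500 and 700 ms, whereas two RTs of 650 ms to a harder image both map to 600 ms.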

To capture temporal fluctuations in performance during both driving and RSVP tasks, we averaged the behavioral metrics (driving: absolute lane deviation, RSVP: accuracy, RT, and duration) via a centered, 90 s mean filter (Jung et al., 1997; Lin et al., 2005b; Chuang et al., 2012). This window size provided a robust, time-varying estimate of accuracy, RT and duration (average number of trials per window = 445) even when the target probability was low. The filtered data were center-aligned such that each time point included an average of data over the preceding and following 45 s. The edges of the filtered data were padded with the first and last valid value after smoothing (i.e., 45 s after the beginning of the first block and 45 s before the end of the last block).
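A centered mean filter with this kind of edge padding might look as follows. The sketch assumes a uniformly sampled series, which is a simplification for trial-based RSVP data, and the function name and odd-length window are illustrative choices.

```python
import numpy as np


def centered_mean_filter(x, fs_hz, window_s=90.0):
    """Centered moving average; edges are padded with the first/last valid value."""
    x = np.asarray(x, dtype=float)
    half = int(round(window_s * fs_hz / 2))  # samples on each side (45 s)
    n = x.size
    if n < 2 * half + 1:
        raise ValueError("signal is shorter than the smoothing window")
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    # 'valid' keeps only points with a full window on both sides.
    valid = np.convolve(x, kernel, mode="valid")
    out = np.empty(n)
    out[half:n - half] = valid
    out[:half] = valid[0]        # pad the first 45 s
    out[n - half:] = valid[-1]   # pad the last 45 s
    return out
```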

#### *Electroencephalography measures*

Electrophysiological recordings were digitally sampled at 1024 Hz from 256 scalp electrodes over the entire cortex using a BioSemi Active Two system (Amsterdam, Netherlands). External leads were placed on the outer canthus, and above and below the orbital fossa of the right eye to record electrooculography (EOG). For power spectrum analysis, EEG was referenced to the average mastoids, down-sampled to 256 Hz, and digitally band-pass filtered between 0.5 and 50 Hz using the EEGLAB toolbox (Delorme and Makeig, 2004). A down-selected montage was created by (i) identifying the subset of channels that matched the BioSemi 32 configuration and, (ii) averaging the electrodes directly adjacent to those channels (between 5 and 8 depending on location). The purpose of the neighborhood averaging was merely to mitigate the influence of noise or high impedance from a single channel. In our approach, spatial filtering was accomplished through principal component analysis (PCA, see below) rather than the application of a spherical Laplacian or the identification of independent components.

Moving-average power spectra were based on an approach described by Lin et al. (2005a). Briefly, the power spectral density (PSD) estimates were calculated in sliding 750-point epochs (∼3 s) with a 500-point step size (∼2 s). Each epoch was subdivided into 125-point Hanning windows with a 25-point step size. A 256-point FFT was then used to calculate the power spectrum for each window and a 5th order median filter was applied across windows for artifact mitigation. The windowed spectra were then averaged and converted into a logarithmic scale to produce the time-varying PSD estimate for each channel. Frequencies between 1 and 40 Hz were kept for subsequent analysis. Finally, the power estimates at these frequencies were smoothed with a 90 s mean filter in the identical fashion as the behavioral metrics described above. **Figure 2** outlines the sequence of steps in the EEG preprocessing, behavior integration, and model building components of the analysis.
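The moving-average PSD procedure can be sketched with NumPy alone. The code below follows the window sizes given above; the function names, the edge handling of the median filter, and the use of decibels (10·log10) for the logarithmic scale are assumptions made for illustration.

```python
import numpy as np


def epoch_log_psd(epoch, fs_hz=256, win_len=125, win_step=25, n_fft=256, med_order=5):
    """Log power spectrum (1 to 40 Hz) of one 750-point epoch."""
    epoch = np.asarray(epoch, dtype=float)
    taper = np.hanning(win_len)
    starts = range(0, epoch.size - win_len + 1, win_step)
    # Power spectrum of each Hanning-tapered window (zero-padded to n_fft).
    spectra = np.array(
        [np.abs(np.fft.rfft(epoch[s:s + win_len] * taper, n=n_fft)) ** 2 for s in starts]
    )
    # 5th-order median filter across windows, per frequency bin
    # (edge windows are replicated; this padding is an assumption).
    half = med_order // 2
    padded = np.pad(spectra, ((half, half), (0, 0)), mode="edge")
    filtered = np.array(
        [np.median(padded[i:i + med_order], axis=0) for i in range(spectra.shape[0])]
    )
    # Average across windows, then convert to a log scale (dB assumed).
    log_power = 10.0 * np.log10(filtered.mean(axis=0) + np.finfo(float).tiny)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs_hz)
    keep = (freqs >= 1.0) & (freqs <= 40.0)
    return freqs[keep], log_power[keep]


def sliding_log_psd(signal, epoch_len=750, epoch_step=500, **kwargs):
    """Time-varying log PSD: one row per sliding epoch, one column per frequency."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < epoch_len:
        raise ValueError("signal is shorter than one epoch")
    rows = []
    for start in range(0, signal.size - epoch_len + 1, epoch_step):
        freqs, row = epoch_log_psd(signal[start:start + epoch_len], **kwargs)
        rows.append(row)
    return freqs, np.vstack(rows)
```

At 256 Hz, a 256-point FFT gives 1 Hz resolution, so each epoch yields 40 values per channel, matching the 40 entries per electrode in Equation (1).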

#### *Regression models*

Two modeling schemes were used for each participant and task, a standard modeling scheme and an adaptive modeling scheme. In the standard scheme, regression models were built with the PSD estimates from four midline electrodes: Fz, Cz, Pz, and Oz. PSD estimates from these channels were combined to form a high-dimensional vector of the EEG log power spectrum.

$$X\_4 = \begin{pmatrix} (\text{Fz}\_1 \dots \text{Fz}\_{40})\_1 & \cdots & (\text{Oz}\_1 \dots \text{Oz}\_{40})\_1 \\ \vdots & \ddots & \vdots \\ (\text{Fz}\_1 \dots \text{Fz}\_{40})\_n & \cdots & (\text{Oz}\_1 \dots \text{Oz}\_{40})\_n \end{pmatrix} \tag{1}$$

Here, *X*<sub>4</sub> is the matrix of combined PSD estimates from the 4 channels and *n* overlapping time epochs. PCA was then applied to the combined PSD estimates (2). The set of eigenvectors *V* that explained at least 1% of the variance were then selected to represent the subspace of EEG log power (3),

$$C\_X = V \begin{pmatrix} \lambda\_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda\_{160} \end{pmatrix} V^{-1} \tag{2}$$

$$V = \left\{ \nu\_i \, \middle| \, \frac{\lambda\_i}{\sum \lambda} \ge 0.01 \right\} \tag{3}$$

where *C<sub>X</sub>* is the covariance matrix of the combined PSD estimates over the experiment (*X*<sub>4</sub>) and *v<sub>i</sub>* and *λ<sub>i</sub>* correspond to the *i*th eigenvector and eigenvalue respectively. A linear regression model, with a least-squares-error cost function, was fit to the behavioral data using the PSD projections onto these eigenvectors. No explicit temporal offset or lag is included in the regression model as the eigenvectors represent only a single PSD epoch.
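The PCA-plus-regression step can be illustrated as below. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the data are mean-centered before projection, and an intercept term is included in the least-squares fit.

```python
import numpy as np


def fit_pca_regression(X, y, min_var=0.01):
    """Fit y from projections of X onto eigenvectors explaining >= 1% variance.

    X : (n_epochs, n_features) combined log-PSD matrix
    y : (n_epochs,) smoothed behavioral metric
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigendecomposition of the covariance matrix (symmetric, so eigh applies).
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    keep = eigvals / eigvals.sum() >= min_var
    V = eigvecs[:, keep]
    Z = Xc @ V  # projections onto the retained eigenvectors
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
    return {"mean": mean, "V": V, "intercept": coef[0], "beta": coef[1:]}


def predict(model, X):
    """Estimate behavior for new PSD epochs with a fitted model."""
    Z = (np.asarray(X, dtype=float) - model["mean"]) @ model["V"]
    return model["intercept"] + Z @ model["beta"]
```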

For the adaptive modeling scheme, regression models were built using a subset of electrodes selected from the entire 32 channel montage. A different number and subset of channels was used for each participant and task to maximize the model's performance. Specifically, sequential forward floating selection (SFFS) was utilized to rank channels in order of significance (Pudil et al., 1994). An iterative process added and removed channels from the rank-ordering by maximizing the criterion function *J*(*X<sub>k</sub>*) at each step (**Table 1**). The criterion function, one over the root-mean-squared error, was calculated as follows:

$$J(\mathbf{X}\_k) = \frac{1}{\left(\frac{1}{n}\sum\_{i=1}^n \left(y - y\_{\text{est}}(\mathbf{X}\_k)\right)^2\right)^{1/2}}\tag{4}$$

where *X<sub>k</sub>* is the combined PSD from *k* channels, *y* is the actual behavior, and *y<sub>est</sub>*(*X<sub>k</sub>*) is the estimated behavior using these channels. During each iteration, PSD estimates from *k* channels were combined to form a high-dimensional vector of EEG log power. PCA was then applied to the combined PSD estimates. As in the standard scheme, eigenvectors that explained at least 1% of the variance were then selected to represent the subspace of EEG log power. A linear regression, with a least-squares-error cost function, was fit to the behavioral data using the PSD projections onto these eigenvectors.

By iteratively including and excluding channels, the SFFS algorithm is less prone to becoming trapped in local maxima than purely sequential selection, although it does not guarantee a globally optimal feature set. For this dataset, the criterion function tended to peak well before all channels were included in the rank-ordering. Therefore, to reduce computational time we included a maximum-iteration number of 500 for our SFFS implementation. With this value, the criterion function for each participant achieved its peak value and an increase in iteration number did not improve performance (data not shown). The final behavioral estimate was generated using the set of channels with the largest *J*(*X<sub>k</sub>*). For the current study, *k* ranged between 1 and 12 channels.
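A compact sketch of a floating forward search of this kind is given below. It is a generic rendering of SFFS in the spirit of Pudil et al. (1994), not the implementation summarized in Table 1; the function names are hypothetical, and the caller supplies the criterion (for instance, the inverse RMSE of Equation (4) computed from a PCA regression on the candidate channels).

```python
import numpy as np
from typing import Callable, Dict, List, Tuple


def inverse_rmse(y, y_est):
    """Criterion J: one over the root-mean-squared error."""
    err = np.asarray(y, dtype=float) - np.asarray(y_est, dtype=float)
    return 1.0 / np.sqrt(np.mean(err ** 2))


def sffs(
    n_channels: int,
    criterion: Callable[[List[int]], float],
    max_channels: int,
    max_iter: int = 500,
) -> Tuple[float, List[int]]:
    """Return (best score, best channel subset) found by floating search."""
    selected: List[int] = []
    best: Dict[int, Tuple[float, List[int]]] = {}  # subset size -> (score, subset)
    for _ in range(max_iter):
        remaining = [c for c in range(n_channels) if c not in selected]
        if len(selected) >= max_channels or not remaining:
            break
        # Forward step: add the channel that maximizes the criterion.
        score, added = max((criterion(selected + [c]), c) for c in remaining)
        selected = selected + [added]
        k = len(selected)
        if k not in best or score > best[k][0]:
            best[k] = (score, list(selected))
        # Floating step: drop channels while that beats the best smaller subset.
        while len(selected) > 2:
            score, dropped = max(
                (criterion([s for s in selected if s != c]), c) for c in selected
            )
            if score > best[len(selected) - 1][0]:
                selected = [s for s in selected if s != dropped]
                best[len(selected)] = (score, list(selected))
            else:
                break
    return max(best.values(), key=lambda item: item[0])
```

Because a backward step is taken only when it strictly improves the best score recorded for the smaller subset size, the search cannot cycle indefinitely; the iteration cap mainly bounds run time.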

For both the standard and adaptive modeling schemes, the EEG and behavioral data were split into leave-one-out cross-validation sets corresponding to the experimental blocks. Specifically, models were built with data from five blocks and tested on data from the remaining block. This cross-validation procedure ensured a temporal separation between the training and testing sets roughly the size of the 90 s smoothing window (see Supplementary Material). The model performance was quantified using Pearson's correlation coefficient between the actual (*y*) and estimated (*y<sub>est</sub>*) behavior.

$$R = \frac{\sum (\mathbf{y} - \overline{\mathbf{y}}) \* \left(\mathbf{y}\_{est} - \overline{\mathbf{y}\_{est}}\right)}{\sqrt{\sum (\mathbf{y} - \overline{\mathbf{y}})^2 \* \sum \left(\mathbf{y}\_{est} - \overline{\mathbf{y}\_{est}}\right)^2}}\tag{5}$$

Here, significance was established using a bootstrap reshuffling technique. Specifically, values of the estimated behavior vector (*yest*) were randomly permuted and then smoothed by a 90 s mean filter. The correlation coefficient between the random estimate and the actual behavior was then calculated. The correlation coefficients from 1000 permutations were used to estimate the mean and variance of the random distribution for each behavior vector and establish a significance threshold (*p <* 0*.*05).
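The permutation procedure can be sketched as follows. The helper names are hypothetical, the smoothing window is expressed in samples, and the threshold is taken as the null mean plus 1.645 standard deviations (a one-sided normal approximation at *p* < 0.05), which is an assumption about how the mean and variance were converted into a threshold.

```python
import numpy as np


def pearson_r(y, y_est):
    """Pearson's correlation coefficient, as in Equation (5)."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    y_est = np.asarray(y_est, dtype=float) - np.mean(y_est)
    return float(np.sum(y * y_est) / np.sqrt(np.sum(y ** 2) * np.sum(y_est ** 2)))


def moving_mean(x, n_points):
    """Simple centered mean filter (zero-padded at the edges)."""
    kernel = np.ones(n_points) / n_points
    return np.convolve(x, kernel, mode="same")


def permutation_threshold(y, y_est, n_points, n_perm=1000, z=1.645, seed=0):
    """Significance threshold for R from shuffled, re-smoothed estimates."""
    rng = np.random.default_rng(seed)
    null = np.array([
        pearson_r(y, moving_mean(rng.permutation(y_est), n_points))
        for _ in range(n_perm)
    ])
    # Assumption: normal approximation to the null distribution.
    return float(null.mean() + z * null.std(ddof=1))
```

Re-smoothing the shuffled estimate matters because the 90 s filter introduces strong autocorrelation, which widens the null distribution relative to unsmoothed random data.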

To determine the spectral characteristics of the regression model, we calculated the relative contribution of each frequency component (1 to 40 Hz) to the overall behavioral estimate. Specifically, the relative weight for each frequency was calculated as follows:

**Table 1 | Sequential forward floating selection (SFFS) algorithm.**


$$W(f) = \frac{1}{\lVert \beta \rVert} \sum\_{i=1}^{M} \hat{X}(f)\, v\_i(f)\, \beta\_i \tag{6}$$

Here, *X̂*(*f*) and *v<sub>i</sub>*(*f*) are the value of the average PSD and *i*th eigenvector at frequency *f*, respectively. The average PSD and eigenvector are weighted by the corresponding linear model coefficient *β<sub>i</sub>*. The resulting sum is normalized by the magnitude of the model coefficients to compare across participants. Relative weights were calculated for all *k* channels and averaged across training sets. We used the Benjamini and Hochberg (1995) false discovery rate (FDR) algorithm to determine which frequencies had relative weights significantly different from zero (*p <* 0*.*05) across all channels and participants.
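As an illustration, the relative weight can be computed by reading Equation (6) as the sum over the *M* retained eigenvectors of the average PSD times the eigenvector entry times the model coefficient, divided by the magnitude of the coefficient vector. The sketch below uses that reading together with a textbook Benjamini-Hochberg step-up rule; the function names are hypothetical, and how the per-frequency *p*-values were obtained is not specified here.

```python
import numpy as np


def relative_weights(mean_psd, V, beta):
    """Relative weight per frequency.

    mean_psd : (n_freqs,) average log PSD
    V        : (n_freqs, M) retained eigenvectors
    beta     : (M,) regression coefficients
    """
    mean_psd = np.asarray(mean_psd, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # Sum over eigenvectors of X(f) * v_i(f) * beta_i, normalized by |beta|.
    return mean_psd * (np.asarray(V, dtype=float) @ beta) / np.linalg.norm(beta)


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank passing the test.
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject
```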

## **RESULTS**

#### **SUBJECTIVE MEASURES**

Over the population, self-reported fatigue increased during both the driving and RSVP tasks (**Figure 3**). To assess the significance of this trend we performed repeated-measures ANOVA for each task type and survey with block or interval as the main factor (see **Table 2**). The Karolinska Sleepiness Scale (KSS) showed significant time-on-task effects in the driving and RSVP portion of the experiment (*p <* 0*.*001). Similarly, the Task-Induced Fatigue Scale (TIFS) showed time-on-task effects along 3 of the 4 dimensions (*p <* 0*.*001): boredom, visual fatigue, and muscle fatigue. Malaise showed a significant time-on-task effect for only the driving portion of the experiment (*p <* 0*.*01). The TIFS also revealed a significant task type effect for boredom and visual fatigue (*p <* 0*.*01), where the driving task was perceived as more boring, and the RSVP task was perceived as inducing more visual fatigue.


Not surprisingly, many of the subjective measures seemed to plateau or decrease prior to the last block of the task. In the beginning of the experiment, participants were informed how many blocks would be included in each task. Thus, individuals seemed to experience an increased alertness as they neared the end of the task, a phenomenon shown in previous studies (Lorist et al., 2009). In contrast to the KSS and TIFS, the Visual Analog Scale for Fatigue did not show a significant time-on-task effect. However, this survey was only administered three times during the entire experiment (at the beginning, once after the driving task, and once after the RSVP task). Overall, the participant reports of fatigue indicated that both the driving and RSVP task induced fatigue and boredom.

#### **BEHAVIORAL MEASURES**

A number of previous studies have sought to quantify driver performance with a range of metrics including lane position, reaction time to perturbation onset, and corrective steering wheel deflections, among others. For simplicity we used absolute lane deviation as a general proxy for driver performance and level of alertness (Sandberg et al., 2011). Across participants, there was no significant increase in either the mean or standard deviation of absolute lane deviation across blocks. However, we did observe that lane deviation had a tendency to increase throughout each block, returning back to a lower value at the beginning of the subsequent block (**Figure 6A**). To quantify this effect, we fit a linear function to the absolute lane deviation within each block. We found that this slope exhibited a time-on-task effect [*F*(5*,* 24) = 2*.*42, *p <* 0*.*05], with sharper increases in lane deviation during later experimental blocks.
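The within-block trend can be estimated with an ordinary first-order polynomial fit. The sketch below is illustrative only: the function name is hypothetical, the 100 Hz rate is taken from the vehicle state measures described in the methods, and the subsequent ANOVA across blocks is not shown.

```python
import numpy as np


def block_slopes(abs_lane_dev, block_ids, fs_hz=100.0):
    """Slope of absolute lane deviation (units per second) within each block."""
    abs_lane_dev = np.asarray(abs_lane_dev, dtype=float)
    block_ids = np.asarray(block_ids)
    slopes = {}
    for block in np.unique(block_ids):
        segment = abs_lane_dev[block_ids == block]
        t = np.arange(segment.size) / fs_hz  # time since block start, in s
        slopes[int(block)] = float(np.polyfit(t, segment, 1)[0])
    return slopes
```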

**Figure 4** shows the temporal dynamics of the three RSVP behavioral metrics (accuracy, RT and button press duration) for a typical participant. Notably, there were large fluctuations both within and across blocks. Across blocks, these fluctuations could be due to changes in either task parameters (target class or target probability) or alertness level. Within-block fluctuations, however, could only have arisen from endogenous changes such as perceptual learning, fatigue, or boredom. To further quantify these fluctuations we calculated the average behavioral performance across each block and identified significant linear trends within each block.

First, to assess the influence of the task parameters we performed separate repeated-measures ANOVAs on accuracy, RT and button press duration with block, target class, and target frequency as factors (see **Table 3**). Across participants, one clear modulator of performance in the RSVP task was target class. This was true for both accuracy and RT (*p <* 0*.*001 for both), and to a lesser extent button press duration (*p <* 0*.*05). Likewise, target probability had a significant effect on RT and duration (*p <* 0*.*001) but not accuracy. Although none of the raw behavioral metrics showed a significant time-on-task effect across blocks, most participants had at least one block with a significant decrease in accuracy or increase in RT, reflecting a within block time-on-task performance decrement. To quantify this, for each participant we identified all blocks in which the accuracy or RT had a significant linear trend (*p <* 0*.*05). Time-on-task performance decrements were defined as blocks with either significantly negative trends in accuracy or significantly positive trends in RT. On average, participants exhibited this type of performance decrement in multiple blocks (*μ* = 2*.*12 blocks per participant). Similarly, performance improvements were defined as blocks with either significantly positive trends in accuracy or significantly negative trends in RT. In contrast to decrements, participants exhibited these performance improvements far less often (*μ* = 0*.*52 blocks per participant). This difference was significant across participants (*p <* 0*.*001; paired *t*-test), indicating that fatigue or boredom had a more pronounced influence on within block performance as compared with perceptual learning.

**Table 3 | ANOVA for behavioral measures in the RSVP task.**

Since the goal of this study was to identify endogenous changes in performance, particularly task-induced fatigue or boredom, we wanted to mitigate the influence of task parameters on behavior. To accomplish this we normalized the behavior (accuracy, RT, and duration) for perceptual difficulty and target probability. **Figure 5** shows the results of this normalization process for all three behavioral metrics. While a subtle time-on-task trend is evident in the raw accuracy, it is masked by the effect of target class. In contrast, with normalized accuracy the time-on-task trend becomes highly significant (see **Table 3**). Interestingly, normalized RT and duration did not exhibit a similar time-on-task effect across blocks. Using these normalized metrics, we then developed EEG-based models of the RSVP behavior for each participant.

#### **ESTIMATING PERFORMANCE FROM EEG**

Previous studies have shown a clear relationship between the EEG power spectrum and time-on-task decrements in performance, especially in monotonous driving (Ting et al., 2008) or vigilance tasks (Stikic et al., 2011). Less is known about the link between the EEG power spectrum and behavior in perceptual tasks, such as the RSVP paradigm described here. To explore this relationship further, we constructed linear regression models to estimate each participant's behavior from their EEG power spectral density (PSD). A separate set of linear models was created from the PSD for both the driving and RSVP tasks using an adaptive modeling scheme. **Figure 6A** shows the actual and estimated behavior (absolute lane deviation) for one participant in the driving task. Notably, there was substantial variability in model performance across blocks. This indicated that the underlying relationship between the PSD and driving performance was variable between training and testing sets (Apker et al., 2013). **Figure 6B** shows the actual and estimated behavior (normalized accuracy) for one participant in the RSVP task. As with the regression models of driving behavior, these models show some degree of variability across blocks.

**FIGURE 5 | Raw and normalized RSVP behavioral measures. (A)** Grand average target detection accuracy over blocks. **(B)** Grand average RT (ms). **(C)** Grand average button press duration (ms). **(D–F)** Normalized accuracy, RT, and duration over the same blocks.

To quantify the accuracy of these estimates we calculated Pearson's correlation coefficient, between the actual and estimated behavior, over the entire task. In line with our previous studies (Touryan et al., 2013a), we found that we were able to produce a behavioral estimate with a significant correlation coefficient for the majority of participants (see **Table 4**). This was true for both the driving (21 of 25 participants) and RSVP tasks (14 of 25 participants). Across participants, there was no significant difference in the accuracy of the behavioral estimate in the two tasks (*R* = *μ* ± *σ*; driving *R* = 0*.*374 ± 0*.*224, RSVP *R* = 0*.*248 ± 0*.*368, *p* = 0*.*14; Wilcoxon signed-rank test). In contrast, these results were substantially better than could be achieved using a standard, fixed-montage approach. Specifically, the standard modeling scheme only yielded significant estimates for 6 participants in the driving task and 4 participants in the RSVP task. The average accuracy of the behavioral estimate was also significantly lower than the adaptive approach for both the driving (*R* = 0*.*079 ± 0*.*224, *p <* 0*.*001) and RSVP (*R* = −0*.*137 ± 0*.*332, *p <* 0*.*001) tasks.

In addition to target detection accuracy, we wanted to quantify the relationship between the PSD and the two other behavioral measures within the RSVP task. To accomplish this, we used the same adaptive modeling scheme described above to fit regression models and construct estimates for both normalized RT and button press duration. **Figure 7B** shows the actual and estimated RT for one participant in the RSVP task, while **Figure 8B** shows the actual and estimated button press duration for another participant. Here, estimates of lane deviation from their corresponding driving tasks are included for comparison (**Figures 7A**, **8A**). Interestingly, our adaptive approach was able to produce significant behavioral estimates for both the RT and duration metrics in the majority of participants (RT *n* = 14, duration *n* = 17). The average correlation coefficients from these behavioral estimates were similar (RT *R* = 0*.*332 ± 0*.*225, duration *R* = 0*.*360 ± 0*.*302) to the normalized accuracy metric.

We observed no significant difference in the average correlation coefficient from the three RSVP metrics. Likewise, estimation accuracy from the three metrics was not significantly correlated across participants. This suggests that substantial individual differences exist in the link between the PSD and behavior in the RSVP task. For example, we were able to generate highly significant behavioral estimates along all three RSVP metrics for some participants (**Table 4**, participant 6). In other cases, only one of the three RSVP metrics produced a significant estimate (**Table 4**, participant 2). Across the population, the adaptive modeling scheme failed to produce a significant behavioral estimate for only 2 of the 25 participants in the RSVP task. Together, all participants had at least one significant estimate in either their driving or RSVP tasks.

#### **TOPOLOGICAL AND SPECTRAL FEATURES**

**Figure 9** shows the average topological distribution of included channels in the adaptive modeling scheme for both the driving and RSVP tasks. To quantify the gross features of this topology, we used the following approach. First, for each participant we normalized the channel distribution by the total number of channels included in their optimal model (between 1 and 12). We then separated the normalized distribution by hemisphere in two ways: anterior-posterior and left-right. For the first comparison we utilized the driving and RSVP-accuracy distributions. We performed an ANOVA with two factors (task × location) but did not identify any significant topological effects between tasks. We then performed an additional two factor ANOVA (metric × location) for the distributions within the RSVP task. While there was no significant clustering of channels across all metrics, there was a significant interaction between metric and left-right distribution [*F*(2*,* 24) = 3*.*497, *p <* 0*.*05]. Here, the accuracy and duration models tended to select more channels from the right hemisphere.

*<sup>a</sup>Root-mean-squared error (RMSE) values have been normalized by participant standard deviation for that task and metric.*

*\*Denotes significance (\*p < 0.05, \*\*p < 0.01, \*\*\*p < 0.001).*

In addition to the topological distributions, we wanted to quantify the spectral characteristics of the regression model. To accomplish this we calculated the relative weight for each frequency component within the linear model (1 to 40 Hz). **Figure 9** shows the average spectrum of relative weights for both tasks. For the driving task, we found that models tended to include an inverse but balanced weighting of theta (4 to 7 Hz) and alpha (8 to 12 Hz) bands, very similar to previous reports (Lin et al., 2012). For the RSVP task, the accuracy and RT metrics exhibited a near-opposite spectral weighting, including components of both the alpha and beta (13 to 30 Hz) bands. This likely reflected their complementary relationship to time-on-task changes in performance (i.e., decreases in accuracy and increases in RT over time). Interestingly, the spectral weights for the duration metric included both positive and negative values within the beta band.

The topological distributions show some level of commonality in the electrodes selected by the adaptive modeling scheme for both the driving and RSVP tasks. The most commonly selected electrode was Oz for models estimating RT in the RSVP task (**Figure 9C**). We wanted to determine the relative influence of the most commonly selected channels in each task. To do this we built fixed-montage models using the four most commonly selected electrodes in each task and for all three RSVP metrics (i.e., a different fixed-montage for each condition). We found that the performance of these models was very similar to the standard modeling scheme (utilizing Fz, Cz, Pz, and Oz). For the driving task, the models using the four most commonly selected electrodes produced an average correlation coefficient of 0.081 ± 0.279, similar to the standard scheme (*R* = 0*.*079 ± 0*.*224). For the RSVP task, the average correlation coefficients were similar for accuracy (common electrodes: *R* = −0*.*051 ± 0*.*347, standard scheme: *R* = −0*.*137 ± 0*.*332), RT (common electrodes: *R* = −0*.*004 ± 0*.*292, standard scheme: *R* = 0*.*031 ± 0*.*266), and duration (common electrodes: *R* = −0*.*034 ± 0*.*388, standard scheme: *R* = −0*.*073 ± 0*.*388). These results suggest that the power of the adaptive modeling scheme is in the ability to capture the large, inter-subject variability in both electrode number and location.

#### **MODEL GENERALIZATION**

The regression models described above are optimized for each participant and task. However, some elements of these models may have the ability to generalize across tasks. In particular, one of the three behavioral metrics in the RSVP task may produce more generalizable models than the others. To explore this, we used the regression models from the driving task to estimate RSVP behavior and models from the RSVP task to estimate driving behavior. For each participant, we used the six models from the cross-validation process and applied them to the entire PSD data from the alternate task, resulting in six complete estimates of the behavior for each participant and task. This process was repeated for each of the behavioral metrics in the RSVP task: normalized target detection accuracy, RT, and button press duration. Importantly, we needed to scale the estimate to the new behavioral metric. For driving, the metric (absolute lane deviation) typically increases with time-on-task fatigue or boredom; in contrast, the RSVP metric (target detection accuracy) typically decreases under the same conditions. Thus, we added an additional linear transform (slope and offset) to match the novel behavioral metric. While this additional transform corrects for the sign and scale of the linear relationship, it does not affect the magnitude of the correlation coefficient (i.e., the measure of estimation accuracy). **Figure 10** shows the distribution of these cross-task correlation coefficients (mean and standard error) for all participants.
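The claim that a slope-and-offset transform leaves the magnitude of the correlation unchanged is easy to check numerically. The snippet below is a small sketch with a hypothetical function name; it assumes the slope and offset are obtained by least squares against the new behavioral metric, which the text does not specify.

```python
import numpy as np


def rescale_estimate(y_est, y_target):
    """Apply a least-squares slope and offset so y_est matches y_target's scale."""
    y_est = np.asarray(y_est, dtype=float)
    slope, offset = np.polyfit(y_est, np.asarray(y_target, dtype=float), 1)
    return slope * y_est + offset
```

If the cross-task estimate is negatively related to the new metric (for example, lane deviation rises while accuracy falls), the fitted slope is negative, so the rescaled estimate has a correlation equal to the absolute value of the original one.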

Across participants, regression models constructed from the driving data were able to estimate some degree of the behavioral variation in the RSVP task, and vice versa. Specifically, for the regression models built on the driving task the average correlation coefficient between the actual and estimated RSVP accuracy was 0.195 ± 0.150. Only 7 of the 25 participants had significant correlation coefficients. For the RSVP task models based on accuracy, the average correlation coefficient between the actual and estimated lane deviation was 0.176 ± 0.140. In this instance a different subset of participants (7 of the 25) had significant correlation coefficients. For RT and duration, the results were similar. The average correlation coefficient between the actual and estimated RSVP behavior was 0.183 ± 0.152 (5/25 significant) for RT and 0.259 ± 0.148 for duration (12/25 significant). Estimates of lane deviation yielded average correlation coefficients of 0.218 ± 0.151 (11/25 significant) for RT-based models and 0.177 ± 0.123 (8/25 significant) for duration-based models. As the scatter plots suggest (**Figure 10**), we did not observe a significant correlation between the accuracy of the behavioral estimates within these two tasks.

# **DISCUSSION**

In this paper, we have extended an approach for modeling elements of instantaneous driver performance based on changes in the EEG log power spectra, and we have evaluated this approach by estimating continuous performance in a simulated driving task. We were able to generalize this approach to estimate fluctuations in RSVP behavior (target detection accuracy, RT, and button press duration) to a similar degree. Furthermore, when regression models fit under the driving paradigm were applied to the RSVP task, explained variance remained significant for some participants, despite its reduction overall. Together, our results show the potential for estimating time-on-task performance decrements in current and future BCI-relevant paradigms. While average accuracy of the estimated driver performance was lower than previous reports (Lin et al., 2005a,b), our results are from a larger cohort of participants that were not particularly fatigued at the time of the experiment. In addition, our driving simulator incorporated more complex vehicle dynamics, including operator control over vehicle speed, which further extended the realism of this study.

It is important to note that the major source of behavioral variance in our RSVP paradigm was not time-on-task. Despite strong indicators of increasing subjective fatigue and boredom, target class was a primary modulator of performance across blocks (**Figures 4**, **5**). This phenomenon was due to the difference in the average perceptual difficulty in the identification of objects from each target class. While target images from each class were roughly matched along low-level visual dimensions such as luminance, object size, and eccentricity, we observed a significant effect of target class in all three behavioral metrics (target detection accuracy, RT and button press duration). Likewise, we observed a similarly strong effect of target probability on both RT and duration. Once we normalized the behavioral metrics for these factors, a clear time-on-task effect was evident in target detection accuracy. While RT and duration did not show the same effect at the block level, significant linear trends were observed within experimental blocks. Interestingly, all three behavioral metrics were able to produce models with similar explanatory power.

The adaptive modeling scheme produced topological distributions (**Figure 9**) that reflect elements of the underlying neural processes. The driving and RSVP tasks likely engaged a range of brain networks with some degree of overlap between tasks. The majority of work exploring the link between neural activity and driver performance has focused on the central to occipital regions (Borghini et al., 2012; Lin et al., 2012). In contrast, the majority of RSVP target classification studies have implicated frontal-parietal networks (Gerson et al., 2005; Luo and Sajda, 2009). Our results indicated that a broad range of channels were selected in both the driving and RSVP tasks. However, since the adaptive modeling scheme was a primarily data-driven approach to behavior estimation, inferences regarding the foci of underlying network activity are limited. In general, the number of channels included in each participant's optimal model was small (driving = 3.72, RSVP accuracy = 4.32, RSVP RT = 4.24, RSVP duration = 4.08). SFFS ranks the channels based on cumulative predictive power, thereby minimizing redundancy in the feature selection process. Hence, the lowest ranking (most significant) channels tend not to be spatially adjacent, obscuring the underlying scalp distribution. In addition, there was substantial variability in both the number and location of selected channels, making average topological distributions difficult to interpret, even with a relatively large sample size (*N* = 25).

In contrast to the topology, the spectral features showed some commonality across participants. Every linear model contained a set of weights associated with each eigenvector, which in turn represents a combination of spectral features. By calculating the relative weight for each frequency across all participants, we were able to assess the spectral features associated with the behavior in each task. Not surprisingly, the models for the driving task incorporated the theta and alpha bands in their estimation of lane deviation. This relative combination of theta and alpha was very similar to previous reports linking neural activity to changes in driver performance (Lin et al., 2010b, 2012; Chuang et al., 2012). For the RSVP task, the spectral features were dependent on the behavior of interest. Target detection accuracy was found to be positively associated with frequencies in the alpha band. This result is in contrast to previous research which has shown that visual discrimination performance is negatively associated with alpha power during pre-stimulus periods, primarily in the bilateral parieto-occipital regions (Thut et al., 2006; Hanslmayr et al., 2007; Romei et al., 2008; Van Dijk et al., 2008; Mathewson et al., 2009, 2011). However, these studies support the notion that pre-stimulus alpha activity can function as an attentional gating mechanism by which task-irrelevant information is inhibited (Jensen and Mazaheri, 2010; Foxe and Snyder, 2011). Thus, the observed positive relation between alpha power and accuracy may be a reflection of inhibitory responses to the relatively more frequent non-target stimuli.
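The relative frequency weights described above can be illustrated with a short derivation. This is a sketch under assumed notation, not the authors' exact formulation: $\mathbf{x}(t)$ denotes the mean-centered vector of $F$ spectral features at time $t$, $\mathbf{v}_k$ the $k$-th retained eigenvector, $\beta_k$ the corresponding regression coefficient, and $\hat{y}(t)$ the behavioral estimate. The final normalization is one plausible choice among several.

```latex
\hat{y}(t) = \beta_0 + \sum_{k=1}^{K} \beta_k \, \mathbf{v}_k^{\top}\mathbf{x}(t)
           = \beta_0 + \sum_{f=1}^{F} w_f \, x_f(t),
\qquad
w_f = \sum_{k=1}^{K} \beta_k \, v_{k,f},
\qquad
\tilde{w}_f = \frac{w_f}{\sum_{f'=1}^{F} \lvert w_{f'} \rvert}
```

Under these assumptions, a positive $\tilde{w}_f$ indicates that increased power in spectral feature $f$ raises the behavioral estimate, and a negative value indicates the opposite.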

The spectral profile of the RT models exhibited a near-opposite weighting compared with models of accuracy. This is consistent with the observed behavior, where RT and accuracy showed a complementary relationship to time-on-task changes in performance. Again, our results are in contrast to previous research relating pre-stimulus alpha to decreased accuracy and increased RT. However, there is a clear difference between pre-stimulus measures of spectral power and PSD estimates calculated during the RSVP (Macdonald et al., 2011). The steady state visually evoked potential (SSVEP) has a strong influence on the spectral profile and produces peaks at the presentation frequency (in this case 5 Hz) and corresponding harmonics. In addition, the SSVEP itself is modulated by the attentional state of the participant (Kim et al., 2007). Thus, the spectral features contained within the models of RSVP behavior likely reflect an interaction between the SSVEP and endogenous changes in oscillatory activity.

For button duration, a negative relationship was observed in the alpha and low beta bands while a positive relationship was observed in the high beta band. This finding appears consistent with the event-related alpha desynchronization/synchronization (ERD/ERS) and post-movement beta synchronization literature. Previous research has consistently shown alpha (10 to 12 Hz mu rhythm) and a harmonic beta (20 to 24 Hz) desynchronization preceding finger movements, as well as post-movement beta synchronization in the 12 to 16 Hz and 26 to 30 Hz bands (Pfurtscheller and Lopes da Silva, 1999). However, as with accuracy and RT, these results may reflect a complex interaction between stimulus- and response-related activities. Further experimentation is required in order to differentiate the neural and functional sources of the spectral features contained within our behavior estimation models.

Given the nature of our adaptive modeling scheme, exploring the link between the topological and spectral distributions would be a substantial challenge. The relative spectral weights are extracted from the eigenvector loadings. The spatially distributed channels that constitute each model are therefore spectrally linked through the PCA process. Thus, determining the independent influence of PSD spectra at each spatial location would be difficult. In contrast, alternative approaches often have an initial, separate feature selection process. For example, other groups have used independent component analysis (ICA) to first identify a small number of components with unique scalp topologies that are maximally correlated with the behavior of interest (Lin et al., 2012). Similarly, other groups first calculate the average power within established frequency bands: most notably delta, theta, and alpha (Balasubramanian et al., 2011). While substantially limiting the dimensionality of the feature space, these initial selection processes allow for a more direct association between spatial or spectral features and changes in behavior.

#### **LINEAR MODEL CONSIDERATIONS**

Interestingly, there was substantial variability in model performance across blocks (**Figure 6**). This may have resulted from several factors. First, the relationship between the neural signature of fatigue and the resulting behavior may have itself changed over time. During the course of the task, participants could have engaged in different compensatory strategies for perceived time-on-task fatigue (Hockey, 1997) and these strategies could have differentially affected the observed behavior. This was especially true for the RSVP task in which there was a tradeoff between the speed and accuracy of the response. Second, the observed increase in perceived boredom over each task could have negatively influenced the motivation of participants to mitigate fatigue-induced performance decrements. For example, the time it takes to return the vehicle to the cruising lane after the onset of a perturbation (i.e., response time) could have been negatively affected by motivation. In turn, this would have had a large impact on the smoothed lane deviation values with only an indirect link to time-on-task fatigue. Finally, while our PSD estimation process utilized band-pass filters and the power spectra were smoothed over a 90 s integration window, external noise and muscle artifacts could have degraded the model fitting process. Indeed, ICA has been used in similar approaches to mitigate artifacts and improve the signal-to-noise ratio (Lin et al., 2005b, 2006, 2012).

While the current study employed a data-driven approach designed to estimate changes in observed behavior, we sought to constrain the dimensionality of the problem and leverage previous work within this area. Several studies have identified the general time course of behavioral fluctuation from endogenous sources such as fatigue or alertness level (Jung et al., 1997; Lin et al., 2005b; Chuang et al., 2012). Our 90 s mean filter was directly borrowed from this previous work. However, this parameter imposes a constraint on the distribution of random correlation coefficients. Low-pass filtered random signals can achieve spurious correlations of relatively high magnitude. While a broader temporal integration window would likely increase the average accuracy, it would equivalently increase the significance threshold. Thus, given that our results predominantly captured slow changes in performance (e.g., across blocks), our integration window was well suited to the nature of the behavioral fluctuations we attempted to estimate.
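The effect of the integration window on the significance threshold can be illustrated with a small simulation. The sketch below is not the procedure used in this study. It assumes one sample per second, white Gaussian noise, and a boxcar mean filter, and the function names (`moving_average`, `pearson`, `null_threshold`) are hypothetical. It estimates the 95th percentile of the absolute correlation between pairs of independent smoothed random signals.

```python
import math
import random


def moving_average(x, width):
    """Boxcar mean over `width` samples; returns only fully overlapped windows."""
    running = sum(x[:width])
    out = [running / width]
    for i in range(width, len(x)):
        running += x[i] - x[i - width]  # slide the window by one sample
        out.append(running / width)
    return out


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def null_threshold(n_samples, width, n_sim=200, quantile=0.95, seed=0):
    """Empirical quantile of |r| between independent smoothed noise signals."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_sim):
        a = moving_average([rng.gauss(0, 1) for _ in range(n_samples)], width)
        b = moving_average([rng.gauss(0, 1) for _ in range(n_samples)], width)
        rs.append(abs(pearson(a, b)))
    rs.sort()
    return rs[int(quantile * (n_sim - 1))]


if __name__ == "__main__":
    # Hypothetical 10 min recording at 1 sample/s, unsmoothed vs. 90-sample window.
    print("no smoothing:", round(null_threshold(600, 1), 3))
    print("90-sample window:", round(null_threshold(600, 90), 3))
```

Under these assumptions the threshold for the 90-sample window should come out considerably larger than the unsmoothed one, because smoothing reduces the effective number of independent samples.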

The analytical approach described here employs a relatively simple model, combined with feature selection, to generate continuous estimates of behavior. Similar approaches have been extended to incorporate ICA (Lin et al., 2005b) and fuzzy neural networks (Lin et al., 2006, 2012). However, the linear approach still represents a solid and interpretable framework for exploring the relationship between the EEG power spectra and behavior in a variety of tasks. In addition, this method is computationally simple and utilizes universal signal processing components such as continuous PSD estimation from channel data. Thus, as we have demonstrated here, it remains a practical approach for an embedded application in current BCI systems (Lin et al., 2010a).

A number of other approaches for predicting changes in performance, due to fatigue or workload, have targeted specific frequency bands within the PSD (e.g., alpha and theta). As described above, the benefit of this more directed approach is the substantial reduction in the dimensionality of the feature space. In turn, this allows for the creation of more robust models with less data and a tractable exploration of frequency band interactions (e.g., theta-alpha ratio). However, *a priori* selection of features can limit a model's explanatory power by averaging over or ignoring PSD features that potentially contain information. The method utilized here, in contrast, takes a data-driven approach to feature selection, both in the channel montage and PSD components. While this more flexible approach is ideal for catching variations across individuals, it may be constrained in its ability to estimate behavior in novel tasks.

#### **INCREASING BCI ROBUSTNESS**

As BCI technologies improve, one potential application of this approach is as an opportunistic tool (Lance et al., 2012) within current and future applications, such as an image triage BCI system (Gerson et al., 2006; Sajda et al., 2010). While driver performance estimation systems that seek to use EEG must significantly outperform other approaches and modalities in order to justify the imposition of EEG signal acquisition, active and reactive BCI systems already include real-time neural signal processing. Thus, the addition of time-on-task behavior estimation algorithms would opportunistically take advantage of that data at little or no additional cost. The results of this study provide an initial proof-of-concept showing that existing fatigue-based performance estimation algorithms can be repurposed for additional BCI-specific tasks. This BCI-within-a-BCI framework potentially opens several future possibilities, most notably by using the performance estimation algorithm to increase the robustness of extended use active or reactive BCI technologies through a variety of mitigation methods (Zander and Kothe, 2011; Zander and Jatzev, 2012).

There are many possible approaches to developing these mitigation methods. One is to have different target-detection classifiers optimized to match individual fatigue levels. Another is to use the performance estimation algorithm to adapt the RSVP task to the user's current performance level, for example by re-presenting images seen during periods of low performance or simply slowing the presentation rate. It should also be possible to affect the interaction of an RSVP BCI with a computer vision (CV) system, such as the system described in Gerson et al. (2006) or Touryan et al. (2013b), for example by weighting the target labels provided by the CV system more highly when the user is in a period of low performance. A related hybrid approach, combining measures of covert spatial attention with a motor imagery BCI, was recently proposed by Tonin et al. (2013).
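As an illustration of the last mitigation idea, the following sketch shows one way a performance estimate could modulate the relative weight of BCI and CV target scores. It is a hypothetical linear scheme, not a method described in the cited systems. The functions `cv_weight` and `fuse`, the `floor` and `ceiling` parameters, and the assumption that all scores and the performance estimate are scaled to [0, 1] are introduced here purely for illustration.

```python
def cv_weight(estimated_performance, floor=0.2, ceiling=0.8):
    """Weight given to the CV label; grows as estimated performance drops.

    `estimated_performance` is assumed to be scaled to [0, 1] and is clipped
    to that range. `floor` and `ceiling` are hypothetical bounds on the weight.
    """
    p = min(1.0, max(0.0, estimated_performance))
    return ceiling - (ceiling - floor) * p


def fuse(bci_score, cv_score, estimated_performance):
    """Convex combination of BCI and CV target scores (both assumed in [0, 1])."""
    w = cv_weight(estimated_performance)
    return (1.0 - w) * bci_score + w * cv_score


if __name__ == "__main__":
    # Same image, same scores; only the user's estimated performance differs.
    print(fuse(bci_score=0.9, cv_score=0.1, estimated_performance=1.0))
    print(fuse(bci_score=0.9, cv_score=0.1, estimated_performance=0.0))
```

With this scheme the fused score follows the BCI output when the user is performing well and shifts toward the CV label during estimated performance decrements. A practical system would need to calibrate the mapping empirically.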

# **CONCLUSION AND FUTURE WORK**

Typically, EEG-based performance estimation algorithms, like the one described here, are designed for tracking slow fluctuations in behavior. These approaches are intended to provide an objective measure of changes in task-induced fatigue (Lin et al., 2012) or mental workload (Kohlmorgen et al., 2007; Brouwer et al., 2012) that directly affect behavior over longer periods of time. In this study, we adapted one such approach and extended its application into a novel task paradigm. Our results show the potential for using performance estimation algorithms to inform current and future BCI applications by addressing some of the non-stationarities inherent within neural signals. By identifying time-on-task performance decrements, this approach could ultimately lead to more robust BCI systems.

Unfortunately, while our adaptive approach produced significant behavioral estimates in each task, a challenge remains in developing a universal, task-independent model of performance decrements. The adaptive modeling scheme described here is optimized for a particular individual, task, and behavior. However, brain activity clearly varies between individuals, across tasks and over time. Thus, behavioral models based on a particular ensemble of EEG data tend to degrade in their ability to extrapolate across these factors. However, there are analysis methods such as transfer learning (Pan and Yang, 2010), that may improve this extrapolation process. These techniques have recently been applied to BCIs (Lu et al., 2009; Jin et al., 2013; Samek et al., 2013; Wu et al., 2013), and provide a potential next step for extending this work.

# **ACKNOWLEDGMENTS**

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-12-2-0019. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. The authors would like to thank T. Johnson, M. Cannon, M. Jaswa, C. Manteuffel, and J. Sidman for their help developing the driving simulator. We would also like to thank P. Weber for developing the RSVP display software; L. Gibson and K. Turner for creating the RSVP stimuli; C. Argys, K. Corby and T. Chiappone for running the experiments.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnins.2014.00155/abstract

# **REFERENCES**


Heidelberg: Springer), 521–530. Available online at: http://link.springer.com/chapter/10.1007/978-3-642-39454-6_56


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 04 February 2014; accepted: 25 May 2014; published online: 13 June 2014. Citation: Touryan J, Apker G, Lance BJ, Kerick SE, Ries AJ and McDowell K (2014) Estimating endogenous changes in task performance from EEG. Front. Neurosci. 8:155. doi: 10.3389/fnins.2014.00155*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Touryan, Apker, Lance, Kerick, Ries and McDowell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Electrode replacement does not affect classification accuracy in dual-session use of a passive brain-computer interface for assessing cognitive workload

# *Justin R. Estepp\* and James C. Christensen*

*Applied Neuroscience Branch, Human Effectiveness Directorate, 711th Human Performance Wing, Air Force Research Laboratory, Wright-Patterson AFB, OH, USA*

#### *Edited by:*

*Anne-Marie Brouwer, TNO Human Factors, Netherlands*

#### *Reviewed by:*

*Santosh Mathan, Honeywell, USA Jean Vettel, Army Research Laboratory, USA*

#### *\*Correspondence:*

*Justin R. Estepp, Air Force Research Laboratory, 711th Human Performance Wing, 711 HPW/RHCP, BLDG 840, W200, 2510 Fifth Street, Wright-Patterson AFB, 45433–7951 OH, USA e-mail: justin.estepp@us.af.mil*

The passive brain-computer interface (pBCI) framework has been shown to be a very promising construct for assessing cognitive and affective state in both individuals and teams. There is a growing body of work that focuses on solving the challenges of transitioning pBCI systems from the research laboratory environment to practical, everyday use. An interesting issue is what impact methodological variability may have on the ability to reliably identify (neuro)physiological patterns that are useful for state assessment. This work aimed at quantifying the effects of methodological variability in a pBCI design for detecting changes in cognitive workload. Specific focus was directed toward the effects of replacing electrodes over dual sessions (thus inducing changes in placement, electromechanical properties, and/or impedance between the electrode and skin surface) on the accuracy of several machine learning approaches in a binary classification problem. In investigating these methodological variables, it was determined that the removal and replacement of the electrode suite between sessions does not impact the accuracy of a number of learning approaches when trained on one session and tested on a second. This finding was confirmed by comparing to a control group for which the electrode suite was not replaced between sessions. This result suggests that sensors (both neurological and peripheral) may be removed and replaced over the course of many interactions with a pBCI system without affecting its performance. Future work on multi-session and multi-day pBCI system use should seek to replicate this (lack of) effect between sessions in other tasks, temporal time courses, and data analytic approaches while also focusing on non-stationarity and variable classification performance due to intrinsic factors.

**Keywords: passive brain computer interface, cognitive state, electroencephalography, machine learning, non-stationarity**

# **INTRODUCTION**

Practical applications of brain-computer interface (BCI) systems, whether used for direct control or passive monitoring (pBCI; Zander et al., 2010), require stable performance over sustained usage. pBCI performance may be unstable for many reasons, such as changes in the physical properties of the sensors used, the location of sensors, variance in other cognitive states of a participant (e.g., fatigue), and drift or non-stationarity in the signals collected. Non-stationarity in physiological signals can severely hamper routine use of pBCI (Christensen et al., 2012), regardless of the true underlying cause for their non-stationarity. This issue has been addressed in previous BCI work via recalibration of the learning algorithm (e.g., Pfurtscheller and Neuper, 2001) or the use of adaptive algorithms that continually update the mapping between signals and output class (e.g., Vidaurre et al., 2006). Nevertheless, an improved understanding of the source and nature of non-stationarity would support continued improvement in the long-term stability of BCI and pBCI systems. In order to properly explore non-stationarity in the context of pBCI system performance, an essential first step is to rule out sensor and data collection system-related variance. Methodological variability due to electrode replacement could arise from a number of factors including, but not limited to, changes in transducer and sensor properties such as electrode impedance (Ferree et al., 2000), degradation from use and wear (Geddes et al., 1969), technician technique (self or third party), and uncontrolled ambient and environmental conditions. While these issues are unique in their own right, a practical approach is to collapse across all possible factors and treat the act of removing and replacing the electrode array as encompassing these nuances, as well as others not detailed here.

In BCI applications, sustained usage will be most dependent on the ability to demonstrate longitudinal usability over satisfactory periods of time. The time course of declines in pBCI performance suggests that non-stationarity in physiological signals is significant after, at most, a few hours. pBCI system accuracy has previously been observed to decline significantly when training and test data were separated by minutes or hours, but the additional decline when separated by days rather than hours was relatively negligible (Christensen et al., 2012). This time course also suggests that inter-day effects such as consolidation or sleep quality are not likely to be comparatively significant contributors to signal non-stationarity. While this may be a floor effect, the fact that accuracy was still significantly above chance may serve as evidence that cognitive and affective states of interest can be sufficiently mapped by using feature spaces that are observed and aggregated in the learning set over long temporal periods (cross-session learning).

Separating between-session effects with a cognitive or physiological origin from those with a methodological source remains a practically difficult challenge for multi-day studies. However, given the similar performance in pBCI system accuracy between time courses of hours and days, methodological sources of variability may be instantiated as a factor in dual- (or multi-) session, intra-day experimental designs in order to observe any subsequent effects. As in Christensen et al. (2012), a decrease in cross-session BCI system performance as compared to within-session has been observed in other studies with explicit design considerations for multi-session use, most notably in the area of a reactive BCI framework known as rapid serial visual presentation (RSVP; Bigdely-Shamlo et al., 2008; Meng et al., 2012). Thus, the intra-day, dual-session task design is appropriate to investigate methodological variability given similar decreases in system performance at the multi-day time scale. While not the focus of this work, it is noteworthy that cross-session learning paradigms have been successful in mitigating cross-session performance decrements in the RSVP paradigm (Huang et al., 2011) as well as the pBCI paradigm (Christensen et al., 2012).

Electrophysiological methods, both neural and peripheral in origin, have some drawbacks for BCI applications. Electrodes placed on the skin may move relative to the underlying sources; if removed and replaced, the placement may not be identical, resulting in the spatial sampling of a slightly different distribution of electrical potentials. Systems that use gel to provide a conductive, coupling medium at the electrode interface are generally less susceptible to motion-related problems as compared to dry systems since the gel interface allows some electro-mechanical stability (Estepp et al., 2009, 2010). Electrophysiological signals are also dependent on impedance, and impedance at each electrode will inevitably drift due to changes in the skin interface, sweat, and drying of the gel or other electrolyte used. Dry electrode systems are not without similar problems as well, such as physical shifting of the electrode resulting in a decoupling of the hybrid electrical interface with the skin (Estepp et al., 2010) and stabilization of the electrochemical balance between the electrode and skin over time (Geddes and Valentinuzzi, 1973).

Regarding specific electrophysiological methods, electroencephalography (EEG) has been used in many BCI applications for a variety of theoretical and practical reasons (Donchin et al., 2000; Cheng et al., 2002; Wolpaw and McFarland, 2004). EEG is also a relatively practical technology, as it can be portable, inexpensive, noninvasive, and user-acceptable, particularly with systems requiring little or no skin preparation (e.g., Estepp et al., 2009; Grozea et al., 2011; Chi et al., 2012). EEG is also commonly used in pBCI applications for assessing cognitive (Wilson and Fisher, 1995; Gevins et al., 1997; Jung et al., 1997; Lin et al., 2005) and affective (Harmon-Jones and Allen, 1998; Davidson, 2004; Lin et al., 2010) states. Peripheral physiological measures, such as heart period and blink rate (e.g., Veltman and Gaillard, 1998; Wilson and Russell, 2003b) have also been used as sensitive indicators of cognitive workload. Combining both neural and peripheral physiological sources as features in a pBCI context may lead to overall improved system performance when compared to using neural features alone (e.g., Chanel et al., 2009; Christensen et al., 2012); however, the use of fused physiological sources in the context of pBCI systems is relatively underserved compared to those using neural sources only. An emerging trend in BCI research called hybrid BCI (Millán et al., 2010; Pfurtscheller et al., 2010) may be well-suited to exploring beneficial roles for passive cognitive and affective state assessment that incorporates peripheral physiological sources in combination with active and reactive schemas.

While the effects of physiological non-stationarity can be investigated at the individual signal or feature level, another reasonable approach is to study the system behavior at the learning algorithm decision level, or pattern classifier output stage, as it relates to the design of the protocol. A common practice in pBCI system analysis of this type is to reduce the likelihood of results that are unique to any single learning method (e.g., Garrett et al., 2003; Christensen et al., 2012) by investigating a number of varying approaches for the underlying paradigm being studied; this, of course, necessitates an open-loop system design whereby the learning algorithm segment of the system can be substituted *post-hoc* using data collected a priori. While many variants of common learning methods exist in both the BCI and pBCI literature, popular choices include Linear Discriminant Analysis (LDA; for use in cognitive task classification, see Wilson and Fisher, 1995; Berka et al., 2004; Thatcher et al., 2005; for use in traditional active BCI, see Pfurtscheller et al., 1998; Guger et al., 2001; Blankertz et al., 2002; Parra et al., 2005), Support Vector Machines (SVM; e.g., Kaper et al., 2004; Lal et al., 2004; Schlögl et al., 2005; Thulasidas et al., 2006; Sitaram et al., 2007), and Artificial Neural Networks (ANN; for use in cognitive task classification, see Wilson and Russell, 2003a,b, 2007; Christensen and Estepp, 2013; for use in traditional, active BCI applications, see Pfurtscheller et al., 1996; Piccione et al., 2006).

The present work sought to examine the contribution of neural and peripheral physiological sensor (electrode) removal and replacement between sessions in a dual-session task paradigm to learning algorithm performance (the decision-level of the pBCI system) decrement over time. Based on previous work in open-loop (Wilson and Russell, 2003a,b; Estepp et al., 2010; Christensen et al., 2012) and closed-loop (Wilson and Russell, 2007; Christensen and Estepp, 2013) systems analysis, cognitive workload monitoring in a complex, multitask environment was chosen as the state paradigm. Following thorough task training, one set of participants completed two pBCI sessions in a single day without change to their electrode montages while an independent set of participants had their electrode montages removed and replaced with a new set of electrodes between the first and second session. Electrocardiography (ECG) and electrooculography (EOG) data were collected simultaneously with the EEG. Additional subjective state (workload) assessment and task performance data were also collected. Electrode impedances were measured before and after each session. Using a common feature set, k-folded learning trials were performed using four unique learning approaches, thus mitigating the likelihood of spurious results due to any single, *ad-hoc* method or test result. This design enabled direct comparison of between-session classifier accuracy with and without montage replacement, thus quantifying the impact of between-session methodological variability on pBCI performance.
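The core comparison in this design, training on one session and testing on the other, can be sketched as follows. The example is illustrative only. It uses a nearest-centroid classifier as a simple stand-in (it is not claimed to be one of the four learning approaches used in this study), synthetic two-dimensional features, and hypothetical function names. The `shift` parameter is an assumed way to mimic a between-session change in one feature and does not model any measured electrode effect.

```python
import random


def fit_centroids(features, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def cross_session_accuracy(train, test):
    """Train on one session's (features, labels); score on another session."""
    (x_train, y_train), (x_test, y_test) = train, test
    model = fit_centroids(x_train, y_train)
    hits = sum(predict(model, x) == y for x, y in zip(x_test, y_test))
    return hits / len(y_test)


def synthetic_session(rng, n_per_class=100, shift=0.0):
    """Two workload classes with unit-variance Gaussian features.

    `shift` offsets the first feature to mimic a between-session change.
    """
    features, labels = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            features.append([rng.gauss(centre + shift, 1.0),
                             rng.gauss(centre, 1.0)])
            labels.append(label)
    return features, labels


if __name__ == "__main__":
    rng = random.Random(1)
    s1 = synthetic_session(rng)
    s2_stable = synthetic_session(rng)
    s2_shifted = synthetic_session(rng, shift=3.0)
    print("S1 -> S2 (stable): ", cross_session_accuracy(s1, s2_stable))
    print("S1 -> S2 (shifted):", cross_session_accuracy(s1, s2_shifted))
```

In this toy setting, a lower cross-session accuracy for the shifted session than for the stable one plays the role of the between-group comparison described above.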

# **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Twenty participants (13 male, age range of 18–28 years, mean age of 21.45 years) were recruited to participate in this study. This protocol was reviewed and approved by the Air Force Research Laboratory Institutional Review Board and performed in accordance with all relevant institutional and national guidelines and regulations. All prospective participants received a study briefing and completed comprehensive written informed consent prior to their voluntary participation in this study. Participants were compensated for their time unless otherwise employed by the Department of Defense at the time of their participation.

# **BETWEEN-SESSION ELECTRODE PREPARATION: THE BETWEEN-SUBJECTS FACTOR**

To investigate the effect of methodological variability due to electrode replacement, a between-subjects group factor was introduced between two sessions (S1 and S2) in a dual-session study design. Half of the available participants (10 of 20) were randomly selected to keep their electrode montage in place (referred to as the "Remained" group), while the other half had all electrodes removed and replaced between sessions (referred to as the "Replaced" group). The Replaced group washed and dried their hair (all using the same baby shampoo without conditioner) after having the first set of electrodes removed. Any markings that may have been used to ensure appropriate electrode cap placement prior to S1 were also removed. Prior to the beginning of S2, the electrode montage was reapplied for the "Replaced" group using a different set of electrodes than was used in S1. This procedure was designed to introduce methodological variability due to electrode replacement, if existing, on a shorter time scale than between days such that its potential effects could be reasonably isolated from previously observed between-day effects in learning algorithm performance (Christensen et al., 2012).
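A random split of this kind can be sketched in a few lines. The text does not specify how the randomization was carried out, so the function name `assign_groups`, the participant identifiers, and the optional `seed` are assumptions made for illustration.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Randomly split participants into equal-sized Remained/Replaced groups."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "Remained": sorted(shuffled[:half]),  # electrodes left in place
        "Replaced": sorted(shuffled[half:]),  # electrodes removed and reapplied
    }


if __name__ == "__main__":
    groups = assign_groups(range(1, 21), seed=42)  # 20 hypothetical IDs
    for name, members in groups.items():
        print(name, members)
```

Fixing the seed makes the assignment reproducible, which can be convenient for documenting the allocation.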

#### **AF-MATB SIMULATION ENVIRONMENT**

The Air Force Multi-Attribute Task Battery (AF-MATB; Miller, 2010) was used as a realistic, ecologically-valid multitask environment in which a participant's workload could be varied. The AF-MATB task interface is shown in **Figure 1**. The task is broadly representative of aircraft operation (particularly remote piloting), and can include compensatory manual tracking, visual and auditory monitoring, and a dynamic resource allocation task. Both AF-MATB and its original instantiation, MATB (Comstock and Arnegard, 1992), have been utilized in numerous studies concerning the use of pBCI architectures and the assessment of cognitive workload in individuals (e.g., Wilson and Russell, 2003b; Christensen et al., 2012) and, when coupled with adaptive automation rule sets, in closed-loop studies (e.g., Freeman et al., 1999; Prinzel et al., 2000, 2003; Wilson and Russell, 2003b). For this study, the visual (System Monitoring) and auditory (Communications) monitoring, compensatory manual tracking (Tracking), and Resource Management tasks were presented simultaneously during all task conditions. The remaining two panels (Scheduling and Pump Status) are informational panels only. Scheduling, although disabled for this study, can be used to convey information about future task state of the Tracking (T) and Communications (C) subtasks. Pump Status displays the current flow rate of the pumps in the Resource Management subtask. For additional details about the AF-MATB simulation environment and its properties, please refer to the online Supplementary Materials for this manuscript.

The demands of each task were varied so that, overall, two levels of individualized difficulty were presented. Participants were trained for a minimum of 2 h per day over 5 different days on AF-MATB until their performance parameters attained asymptote with minimal errors. This procedure helped to reduce learning effects and allowed participants to reach a desired level of familiarity and comfort with the laboratory setting. Task difficulty was increased over the training sessions in order to find a high difficulty level for each individual that met minimum task performance criteria. Participants were not instructed to prioritize any one task over the others. For additional details of the task and training procedure, please see the online Supplementary Materials for this manuscript.

#### **AF-MATB TESTING AND DATA COLLECTION**

On the testing day, participants completed four AF-MATB trials. These trials, each 15 min in length, were divided between two sessions S1 and S2. Trial type within each session was balanced to one each of low and high task difficulty. The order of trials in each session was randomized for all participants. The end of S1 and start of S2 were chronologically separated by 45 min.

**Figure 2** depicts an approximate timeline for the data collection. Data collection began with initial electrode preparation and placement and an initial measurement of impedance (Z1). A 5 min practice trial (P) was given to participants to re-familiarize themselves with the task interface before beginning data collection. Session 1 (S1) consisted of two, 15-min AF-MATB trials (one at each task difficulty level and paired with a NASA-TLX assessment administered at the end of the trial) followed by a second impedance measurement (Z2). The between-subjects factor of electrode replacement was introduced between the two sessions. S2 also consisted of two, 15-min AF-MATB trials bookended by two additional impedance measurements (Z3 and Z4).

#### **ELECTROPHYSIOLOGICAL RECORDING**

**FIGURE 1 | User interface for the AF-MATB task environment.** The four subtasks (System Monitoring, Tracking, Communications, and Resource Management) are shown in the left and center columns on the interface. The right column shows Scheduling and Pump Status information windows. The Scheduling information window was disabled for this study. More information on the AF-MATB task can be found in AF-MATB User's Guide (Miller, 2010) and in the Supplementary Material for this manuscript.

**FIGURE 2 | Experimental timeline for data collection.** Data collection began with initial electrode preparation and placement and an initial measurement of impedance (Z1). A 5-min practice trial (P) was given to participants to re-familiarize themselves with the task interface before beginning data collection. Session 1 (S1) consisted of two, 15-min AF-MATB trials (one at each task difficulty level and paired with a NASA-TLX assessment administered at the end of the trial) followed by a second impedance measurement (Z2). The between-subjects factor of electrode replacement was introduced between the first session and the second session (S2). S2 also consisted of two, 15-min AF-MATB trials bookended by two additional impedance measurements (Z3 and Z4). Order of the AF-MATB trials (with respect to task difficulty) was randomized independently within each session.

Prior to completing the practice trial on testing day (the sixth and final day of the protocol), participants were outfitted with a standard elastic fabric EEG electrode cap (Electro-Cap International, Inc., Eaton, OH, USA) containing 9 mm, tin cup electrodes positioned according to the International 10–20 System (Jasper, 1958) and its 10-10 (Chatrian et al., 1985) and 10-5 (Oostenveld and Praamstra, 2001) extensions. The EEG cap was sized and fitted according to measured head circumference above the nasion. After measuring the nasion-to-inion distance, frontal poles (Fp1 and Fp2) were placed at the first 10% distance marker above the nasion, and alignment of Fz was verified to be consistent with the 50% distance markers (nasion-inion and intra-preauricular). Five EEG channels on the electrode cap (Fz, F7, Pz, P7, and O2) were used during data acquisition. Matching, single-lead tin cup electrodes were also placed on the outer canthus of each eye (forming a bipolar channel for horizontal EOG, or HEOG), inferior to and superior to the left eye on the orbital bone (forming a bipolar channel for vertical EOG, or VEOG), and on the left (common reference) and right (amplifier ground) mastoid processes. Disposable Ag/AgCl pediatric/neonatal electrodes (Huggables, CONMED Corp., Utica, NY, USA) were positioned on the left clavicle and sternum, forming a bipolar channel for ECG. 
All peripheral channels were prepared by cleaning the skin with 70% isopropyl alcohol prep pads and gently scrubbing the cleaned surface with NuPrep (Weaver and Company, Aurora, CO, USA). EEG scalp sites were prepared via syringe with a blunted needle and then filled with Electro-Gel (Electro-Cap International, Inc., Eaton, OH, USA). The full electrode montage is displayed in **Figure 3** (electrodes below the horizon of the axial view are shown with a flattened projection perspective). All electrophysiological channels were chosen based on a previous saliency analysis and sensor downselect from a similar study using the MATB task environment (Russell and Gustafson, 2001).

A BioRadio 110 (Great Lakes NeuroTechnologies, Cleveland, OH, USA) telemetry system was used to acquire the 8 aforementioned channels of electrophysiological data (using a common reference montage for the five EEG channels) during task performance. All available channels were recorded at 200 Hz, with 12-bit resolution, using an AC-coupled amplifier (bandpass filtered between 0.5 and 52.4 Hz).

#### **SUBJECTIVE WORKLOAD ASSESSMENT**

Participants' subjective workload ratings were assessed using the National Aeronautics and Space Administration's Task Load Index (NASA-TLX; Hart and Staveland, 1988). The NASA-TLX was administered immediately following each of the four AF-MATB trials (**Figure 2**). Participants completed both the individual six subscale ratings as well as the Sources of Workload subscale comparison.

#### **ELECTRODE IMPEDANCE**

Complex electrode impedance was monitored and recorded pre- and post-session for both S1 and S2. This was done to quantify changes in impedance during and between sessions, regardless of whether the recording system was replaced or left in place. Upper limit thresholds for accepting an electrode preparation were 5 kΩ for EEG and 20 kΩ for EOG and ECG electrodes. Electrodes were re-prepped during the pre-session impedance check if any of these thresholds were exceeded.

#### **ELECTROPHYSIOLOGICAL DATA PROCESSING**

All preprocessing and generation of electrophysiological feature data was accomplished in real-time as part of the data acquisition in a software suite developed in the LabVIEW (National Instruments Corporation, Austin, TX, USA) development environment (Krizo et al., 2005). The primary user interface for this software is shown in **Figure 4**. All feature data, as well as the raw electrophysiological data, were saved for further *post-hoc* (offline) processing. Electrophysiological feature data were created using an averaging window with an overlap to define the rate at which this data was updated. This update rate was synchronized to 1 Hz across all feature types. A total of 37 features consisting of EEG, VEOG, and HEOG band powers, inter-beat interval (IBI) between consecutive R-wave peaks of the ECG, and blink rate derived from the VEOG channel were used in this study.

#### *EEG data processing*

EEG channels were first corrected for gross artifact due to eye movement using a recursive least-squares implementation of a noise canceling adaptive filter (He et al., 2004, 2007). An example of the original and noise-canceled time series for F7 is shown in **Figure 5**. Following ocular artifact correction, each EEG channel was then used to create power spectral densities (PSD) via the Discrete Fourier Transform (DFT) algorithm with a corresponding Hanning window (also known as the periodogram method) over a 1 s window. Band power estimates were then derived from the PSD of each of the channels using commonly defined traditional clinical frequency bands. The frequency band ranges used in this pipeline were: delta (0.5–3 Hz), theta (4–7 Hz), alpha (8–12 Hz), beta (13–30 Hz), and gamma (31–42 Hz). EEG features were created by averaging these 1 s band power estimates over a 10 s window (with a 9 s overlap) and then applying a base 10 logarithmic transform to improve the normality of the band power feature distributions (Gasser et al., 1982). This resulted in 25 features (5 EEG channels × 5 frequency bands) from the EEG data.
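The band power pipeline described above can be sketched in a few lines. This is a minimal illustration only, not the LabVIEW implementation used in the study: the function names, the PSD normalization, the periodic form of the Hann window, and the choice to include both band edges are assumptions made for the example. The sampling rate (200 Hz), the band limits, the 1 s DFT window, the 10 s averaging window with 9 s overlap, and the base 10 logarithm follow the text.

```python
import cmath
import math

FS = 200  # sampling rate in Hz, as stated for the recording system
BANDS = {  # band limits in Hz, taken from the text
    "delta": (0.5, 3), "theta": (4, 7), "alpha": (8, 12),
    "beta": (13, 30), "gamma": (31, 42),
}

def hann_periodogram(x, fs=FS):
    """One-sided PSD of a single epoch from a Hann-windowed DFT (assumed scaling)."""
    n = len(x)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    norm = fs * sum(v * v for v in w)
    psd = []
    for k in range(n // 2 + 1):
        s = sum(x[i] * w[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        # Double every bin except DC and (for even n) Nyquist to get a one-sided PSD.
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        psd.append(scale * abs(s) ** 2 / norm)
    return psd  # bin k corresponds to k * fs / n Hz

def band_powers(epoch, fs=FS):
    """Integrate the PSD over each band (edges inclusive, an assumption)."""
    psd = hann_periodogram(epoch, fs)
    df = fs / len(epoch)
    return {name: sum(p for k, p in enumerate(psd) if lo <= k * df <= hi) * df
            for name, (lo, hi) in BANDS.items()}

def log_band_features(signal, fs=FS, avg_s=10):
    """1 Hz feature stream: mean of the last avg_s one-second band powers, then log10."""
    per_second = [band_powers(signal[i:i + fs], fs)
                  for i in range(0, len(signal) - fs + 1, fs)]
    feats = []
    for end in range(avg_s, len(per_second) + 1):
        window = per_second[end - avg_s:end]
        feats.append({b: math.log10(sum(w[b] for w in window) / avg_s) for b in BANDS})
    return feats
```

With a 1 s window at 200 Hz the DFT bins fall on integer frequencies, so the five bands listed in the text do not share any bins.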

#### *EOG data processing*

VEOG and HEOG channels were processed using the same frequency band pipeline as the EEG data. This resulted in an additional 10 frequency band features (2 EOG channels × 5 frequency bands). VEOG was also used in a real-time implementation of a blink detection algorithm (Kong and Wilson, 1998). Blink counts were summed over a 30 s window (with a 29 s overlap) to calculate average blink rate as a feature (in blinks/min). In total, 11 additional features were derived from the VEOG and HEOG channels. An example of the output of this algorithm, as well as the resulting blink rate feature time series, is shown in **Figure 6**.
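As a rough illustration of the blink rate feature, the sketch below converts a list of detected blink times into a 1 Hz blink rate series using the 30 s window with 29 s overlap given in the text. The blink detector itself (Kong and Wilson, 1998) is not reproduced; `blink_times_s` is a hypothetical input standing in for its output, and the half-open window convention is an assumption.

```python
def blink_rate_series(blink_times_s, duration_s, window_s=30):
    """Blink rate in blinks/min, updated once per second.

    blink_times_s: detected blink times in seconds (assumed detector output).
    Each value counts blinks in the half-open window [t_end - window_s, t_end).
    """
    rates = []
    for t_end in range(window_s, int(duration_s) + 1):
        t_start = t_end - window_s
        count = sum(1 for t in blink_times_s if t_start <= t < t_end)
        rates.append(count * 60.0 / window_s)  # scale the window count to one minute
    return rates
```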


#### *ECG data processing*

A single feature, related to heart rate, was derived from the ECG data. Individual cardiac cycles, as defined by the R-wave, were first identified using a real-time algorithm developed by Pan and Tompkins (1985) and Hamilton and Tompkins (1986). The IBI time series within a 10 s window (with a 9 s overlap) was averaged to create the IBI feature. An example of the output of this algorithm, as well as the resulting IBI feature time series, is shown in **Figure 7**.
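The IBI feature can be sketched in the same way. The example assumes R-wave peak times are already available (the Pan and Tompkins detection step is not shown) and that each interval is assigned to the time of its later R-wave; both the function name and that timestamp convention are assumptions for illustration.

```python
def ibi_feature_series(r_peak_times_s, duration_s, window_s=10):
    """Mean inter-beat interval (seconds) per 10 s window, updated at 1 Hz."""
    # Pair each interval with the time of the R-wave that closes it (assumed convention).
    ibis = [(t1, t1 - t0) for t0, t1 in zip(r_peak_times_s, r_peak_times_s[1:])]
    features = []
    for t_end in range(window_s, int(duration_s) + 1):
        in_window = [ibi for t, ibi in ibis if t_end - window_s <= t < t_end]
        # A window with no completed interval yields NaN rather than a guess.
        features.append(sum(in_window) / len(in_window) if in_window else float("nan"))
    return features
```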

#### **MACHINE LEARNING APPROACHES**

The pBCI framework in this study consisted of using data from the first session on the test day (S1) to train a machine learning algorithm that could then be used as a fixed pattern classifier to assess the participants' cognitive workload in the second session (S2). This general architecture supports workload assessment in real-time by providing the feature vectors, updated at 1 Hz, as the input layer of the learning algorithm. A number of learning approaches were compared in investigating whether the between-subjects factor of electrode removal and replacement affects learning algorithm accuracy in the simulated real-time assessment of workload during S2. Each learning algorithm was structured to solve a binary classification problem of low vs. high workload (a priori hypothesized to be driven by low vs. high task difficulty, but which can be tested *post-hoc* with a combination of task performance and subjective workload assessment measures). All learning approaches were implemented in *post-hoc* analyses using the feature vectors that were generated by the real-time acquisition and processing software (although the feed-forward implementation of each learning algorithm could be integrated into the real-time software). All *post-hoc* implementations of the learning approaches used in this study were developed in MATLAB R2010b (The Mathworks, Inc., Natick, MA, USA) using custom-written code and available toolboxes where noted.


**FIGURE 5 |** Using the implementation of He et al. (2004, 2007), VEOG and HEOG bipolar time series are used as reference noise input channels to the adaptive noise canceling structure. Large amplitude artifact from blink activity (early in the time series) and saccadic activity (later in the time series) are absent in the noise-corrected time series. Y-axis units of all time series are given in [uV].

#### *Accuracy vs. sensitivity*

While the overall accuracy of the learning approach (represented as proportion of 1 s epochs correctly classified as either low or high workload) is a useful measure to help understand algorithm performance, other measures such as d′ (d-prime; Green and Swets, 1966) may be better suited for quantified performance comparisons. The use of d′ as a signal detection sensitivity measure may be preferred as it is free of bias that may occur in using the proportion of epochs correct as an algorithm performance measure (such as would be the case if algorithm performance was biased toward one class in the binary problem). The calculation of d′ is given in Equation (1), where z() represents the inverse of a unit normal Gaussian cumulative distribution function (the "norminv" function in MATLAB, with μ = 0 and σ<sup>2</sup> = 1), and True Positive Rate (TPR) and False Positive Rate (FPR) are calculated from the test set confusion matrix and have a range of (0,1) (not inclusive).

$$d' = z(TPR) - z(FPR) \tag{1}$$

In this work, the correct detection of a high workload state epoch is considered to be a true positive (TP), while any low workload state epoch incorrectly classified as being from a high workload state is a false positive (FP). TPR and FPR are then calculated from Equations 2 and 3 using the confusion matrix structure that is shown in **Table 1**.

$$TPR = \begin{cases} \frac{1}{(TP + FN)}, & TP = 0\\ \frac{TP - 1}{TP + FN}, & FN = 0\\ \frac{TP}{TP + FN}, & \text{otherwise} \end{cases} \tag{2}$$

$$FPR = \begin{cases} \frac{1}{(FP + TN)}, & FP = 0\\ \frac{FP - 1}{FP + TN}, & TN = 0\\ \frac{FP}{FP + TN}, & \text{otherwise} \end{cases} \tag{3}$$
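A short sketch of Equations 1 to 3 may help make the edge-case handling concrete. The function names are hypothetical, and the second FPR case is taken here to apply when TN = 0, by analogy with the FN = 0 case of Equation 2; that reading is an assumption. `statistics.NormalDist().inv_cdf` plays the role that "norminv" plays in MATLAB.

```python
from statistics import NormalDist

def corrected_rate(hits, misses):
    """hits / (hits + misses), kept strictly inside (0, 1) as in Equations 2 and 3."""
    total = hits + misses
    if hits == 0:
        return 1 / total            # avoids a rate of exactly 0
    if misses == 0:
        return (hits - 1) / total   # avoids a rate of exactly 1
    return hits / total

def d_prime(tp, fn, fp, tn):
    """d' = z(TPR) - z(FPR), with high workload treated as the positive class."""
    z = NormalDist(mu=0.0, sigma=1.0).inv_cdf  # inverse unit normal CDF
    tpr = corrected_rate(tp, fn)   # Equation 2
    fpr = corrected_rate(fp, tn)   # Equation 3 (TN = 0 edge case assumed)
    return z(tpr) - z(fpr)
```

For a classifier at chance, for example `d_prime(450, 450, 450, 450)`, the two rates are equal and d′ is 0; a perfect confusion matrix gives a large but finite value because of the corrections.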

#### *Definition of learning set, test set, and k-fold procedures*

All learning algorithms were trained using data from S1 (the learning set) and then feed-forward tested on S2 (the between-session test set) to create an unbiased estimate of learning algorithm performance. To guard against spurious learning results, a k-fold (*k* = 10) cross-validation procedure was used on the learning set. Each set (learning and test) contained approximately 1800 balanced samples (900 from each of the 15-min low and high task difficulty trials in the session) as a product of the 1 Hz feature vector update rate. The 10 folds were created by randomly subsampling 90% of the available data from S1, resulting in approximately 1620 learning samples in each fold. Approximately 162 samples, or 10%, of the learning set were reserved as a nested test set to observe the within-session performance of the learning approach. Learning algorithm output for the within-session and between-session test sets was unweighted (not explicitly biased) given balanced classes in the learning set. This folding process, per the Central Limit Theorem, will result in a normal distribution for classifier performance (expressed as sensitivity, or d′) for a sufficient number of folds, thus ensuring equality of the mean and median of each performance distribution. Due to the number of folds used in this analysis (*k* = 10), the median of each set of folds is used to represent that learning approach's performance in all subsequent analyses of variance. The choice of median in this analysis is sufficient to reduce any distribution skew resulting from the *k* = 10 folds that would otherwise bias the distribution mean.

**FIGURE 6 |** The scale for the VEOG time series is [uV], and the scale for the blink rate time series is [blinks/min]. The time scale is in standard HH:MM:SS format.
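The folding procedure can be sketched as repeated random subsampling followed by a median across folds. This is an illustration under stated assumptions: the function names and the fixed seed are hypothetical, and the nested within-session test set is not modeled because the text does not fully specify how it was drawn.

```python
import random
from statistics import median

def subsample_folds(n_samples, k=10, learn_fraction=0.9, seed=0):
    """Return k lists of learning-set indices, each a random 90% subsample."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    n_learn = round(learn_fraction * n_samples)
    folds = []
    for _ in range(k):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds.append(sorted(order[:n_learn]))
    return folds

def summarize_folds(per_fold_scores):
    """Median across folds, the per-participant summary used in later analyses."""
    return median(per_fold_scores)
```

With roughly 1800 samples per session (2 trials × 15 min × 60 s at 1 Hz), `subsample_folds(1800)` yields ten folds of 1620 learning samples each, matching the counts given above.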

The learning set was normalized to itself by converting (within-feature) to z-scores using the mean and standard deviation of each of the 37 features separately for each participant. These mean and standard deviation vectors were then used to z-score the within-session nested test sets from S1 and the between-session independent test sets from S2 in order to simulate a real-time implementation of the feed-forward algorithm architecture.
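The key point of this normalization is that the test sets are scaled with statistics estimated from the learning set only. A minimal sketch follows; the function names are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def fit_zscore(learning_rows):
    """Per-feature mean and standard deviation estimated from the learning set."""
    columns = list(zip(*learning_rows))
    mu = [mean(c) for c in columns]
    sd = [stdev(c) for c in columns]  # sample SD (n - 1), an assumption
    return mu, sd

def apply_zscore(rows, mu, sd):
    """Scale any set (learning, nested test, or S2 test) with the learning-set stats."""
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in rows]
```

In use, `fit_zscore` would be called once per participant on the S1 learning samples, and the returned vectors passed unchanged to `apply_zscore` for the S2 feature vectors, as would be required in a real-time feed-forward setting.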

#### *Linear discriminant analysis*

LDA was implemented via the MATLAB Statistics Toolbox v7.4 (R2010b) using an implementation of the "classify" function. LDA defines a linear decision boundary based on linear combinations of the input feature vectors to separate the learning set according to categorical class labels. Classification is then achieved by assigning the estimated class of each tested sample according to its location as referenced to the linear decision boundary.

**Table 1 | Classifier output structure (confusion matrix) used to determine d′.**

**FIGURE 7 |** The scale for the ECG time series is [uV], and the IBI time series is visualized as heart rate in beats per minute, or [BPM]. The time scale is in standard HH:MM:SS format.

#### *Support vector machines*

The implementation of the SVM in this study utilized the kernel approach to mapping the learning set to a non-linear feature space. Lacking any a priori decision information to choose an appropriate kernel for this particular dataset, two popular approaches were tested: a linear kernel (LIN) and a (Gaussian) radial basis function (RBF) kernel. Kernel parameters for both the LIN and RBF kernels were optimized via the "tunelssvm" function using the multidimensional unconstrained non-linear optimization approach ("simplex") contained within the LS-SVMLab v1.8 Toolbox (De Brabanter et al., 2010). Both the linear kernel SVM (LIN-SVM) and the radial basis function SVM (RBF-SVM) algorithms were implemented via the exact incremental learning and adaptation approach (Cauwenberghs and Poggio, 2001) with the Incremental SVM Learning in MATLAB package (Diehl and Cauwenberghs, 2003). Following the decision boundary rule for the LDA, classification using both the LIN-SVM and RBF-SVM was achieved by assigning the estimated class of each tested sample according to its location relative to the resulting decision boundary.

#### *Artificial neural networks*

The particular ANN implementation used for this work follows that in Christensen et al. (2012). The input layer of the ANN was matched to the 37 features; a single hidden layer utilized a fully-connected structure. Training was accomplished via the backpropagation algorithm (Lippmann, 1987; Widrow and Lehr, 1990). A nested validation set (33% of the learning set) was used to implement an early stopping rule (del R Millan et al., 2002) at the learning iteration at which root mean-squared error, or RMSE, was minimized for the validation set. This early stopping rule was intended to guard against overfitting to the learning set (Wilson and Russell, 2003a; Bishop, 2006). The ANN used a 2-node output layer for the binary classification problem addressed here. Binary classification was implemented by assigning each test case to the higher weight between the 2-node outputs. This ANN structure was implemented in MATLAB R2010b using custom-developed code and functions from the Neural Network Toolbox v7.0 (R2010b).

# **RESULTS**

While the main factor being investigated in this work is the (between-subjects) effect of electrode removal and replacement on learning algorithm accuracy in a pBCI framework for cognitive workload assessment, a number of analyses must first be accomplished given factors of task difficulty (two levels, low and high) and session (two levels, S1 and S2). These two within-subjects factors, when combined with the between-subjects factor of electrode replacement (two levels, Remained and Replaced), were the basis for analyses of both task performance and subjective workload data. In addition, impedance data were analyzed for any significant changes across time (pre- and post-session) and with respect to the between-subjects factor of electrode replacement. Unless noted otherwise, all statistical tests were performed using IBM SPSS Statistics Standard 21. All analyses of variance were evaluated using α = 0.01.

#### **AF-MATB PERFORMANCE**

The four primary subtasks in AF-MATB all have associated outcome measures related to task performance. While there exist a number of performance measures for each subtask that could be investigated, a single measure related to "hit rate" appropriate for each subtask was chosen. These measures were: (1) proportion of stimuli (including both lights and gages) with correct responses for the System Monitoring subtask, (2) proportion of stimuli for the participants' active callsign for which the participant responded with a comm channel/frequency change for the Communications subtask, (3) RMS tracking error (from center, in pixels) for the Tracking subtask, and (4) deviation from the nominal fuel level, averaged between Tanks A and B, for the Resource Management subtask. To investigate these multiple task performance measures for this study design, a 2 (task difficulty, within) × 2 (session, within) × 2 (electrode replacement, between) mixed-model multivariate analysis of variance (MANOVA) was performed using the four subtask performance measures as dependent variables.

There was no significant effect of the between-subjects factor of electrode replacement, *F*(4, 15) = 0.890, *p* = 0.494, η<sub>p</sub><sup>2</sup> = 0.192. For the within-subjects factors, there was no significant main effect of session, *F*(4, 15) = 1.513, *p* = 0.248, η<sub>p</sub><sup>2</sup> = 0.288, but the main effect for task difficulty was significant, *F*(4, 15) = 104.693, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.965. Two-way interactions for (task difficulty × session), *F*(4, 15) = 2.321, *p* = 0.104, η<sub>p</sub><sup>2</sup> = 0.382, (task difficulty × electrode replacement), *F*(4, 15) = 1.999, *p* = 0.146, η<sub>p</sub><sup>2</sup> = 0.348, and (session × electrode replacement), *F*(4, 15) = 1.035, *p* = 0.421, η<sub>p</sub><sup>2</sup> = 0.216, were all nonsignificant. The three-way interaction, (task difficulty × session × electrode replacement), was not significant, *F*(4, 15) = 1.030, *p* = 0.424, η<sub>p</sub><sup>2</sup> = 0.215. Since the exact subtask factors responsible for the main effect of task difficulty are not of importance to the goals of this work, further analysis of subtask effects is omitted in favor of the individual subtask boxplots shown in **Figure 8**. The significant main effect of task difficulty on performance indicates that the task manipulation was successful at inducing significantly different task performance states; the lack of a significant effect of session suggests that performance was consistent across sessions. Similarly, the electrode replacement factor had no significant effect, indicating no detectable difference in task performance between the two groups.

#### **NASA-TLX**

Subjective workload ratings obtained via the NASA-TLX were analyzed in a similar manner to the performance data with very similar results. The factor-weighting procedure per the original work of Hart and Staveland (1988) was used to calculate the overall subjective workload rating for each trial. A 2 (task difficulty, within) × 2 (session, within) × 2 (electrode replacement, between) mixed-model ANOVA was performed to investigate effects of subjective workload.
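For readers unfamiliar with the factor-weighting procedure, the overall NASA-TLX score is a weighted mean of the six subscale ratings, where each weight is the number of times that subscale was chosen across the 15 pairwise Sources of Workload comparisons. The sketch below illustrates that calculation; the function name and the short subscale keys are hypothetical.

```python
SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")

def tlx_overall(ratings, weights):
    """Weighted NASA-TLX workload score.

    ratings: subscale -> rating (0 to 100).
    weights: subscale -> tally from the 15 pairwise comparisons (each 0 to 5).
    """
    if sum(weights[s] for s in SUBSCALES) != 15:
        raise ValueError("pairwise-comparison tallies must sum to 15")
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15
```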

Results from the ANOVA test showed no significant main effect of the between-subjects electrode replacement factor, *F*(1, 18) = 0.729, *p* = 0.405, η<sub>p</sub><sup>2</sup> = 0.039. The within-subject main effect of session was not significant, *F*(1, 18) = 0.269, *p* = 0.611, η<sub>p</sub><sup>2</sup> = 0.015, but there was a significant main effect of task difficulty, *F*(1, 18) = 66.272, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.786. Two-way interactions for (task difficulty × session), *F*(1, 18) = 3.143, *p* = 0.093, η<sub>p</sub><sup>2</sup> = 0.149, (task difficulty × electrode replacement), *F*(1, 18) = 2.902, *p* = 0.106, η<sub>p</sub><sup>2</sup> = 0.139, and (session × electrode replacement), *F*(1, 18) = 0.857, *p* = 0.367, η<sub>p</sub><sup>2</sup> = 0.062, were all nonsignificant. The three-way interaction, (task difficulty × session × electrode replacement), was also non-significant, *F*(1, 18) = 2.447, *p* = 0.135, η<sub>p</sub><sup>2</sup> = 0.120. A boxplot showing the NASA-TLX data is shown in **Figure 9**. As with the analysis of the performance data, the subjective workload data provides additional evidence for the validity of the task difficulty manipulation as a strategy for creating varying workload states that were constant between sessions and groups.

#### **ELECTRODE IMPEDANCE (Z)**

Four measurements of individual electrode impedance (pre- and post-session for both S1 and S2) were made during this study to account for any change in group-level impedance with respect to the between-subjects factor of electrode replacement. A 4 (measurement time point, within) × 2 (electrode replacement, between) mixed model analysis of variance (ANOVA) was performed to assess any possible impedance changes due to these two factors. Lacking any a priori evidence for investigating each electrode independently, an omnibus measure of impedance was created for each measurement time point by averaging impedance across all electrodes.


Mauchly's test revealed a significant deviation from the assumption of sphericity, χ<sup>2</sup>(5) = 38.397, *p* < 0.001, thus necessitating adjustments to the degrees of freedom. Following the guidance of Huynh and Feldt (1976), the Greenhouse-Geisser estimate of sphericity (Greenhouse and Geisser, 1959) was used (ε = 0.461) in lieu of the Huynh-Feldt estimate (ε = 0.514) given ε < 0.75. There was not a significant main effect of electrode replacement, *F*(1, 18) = 0.056, *p* = 0.815, η<sub>p</sub><sup>2</sup> = 0.03, but the main effect of measurement time point approached significance, *F*(1.382, 24.877) = 3.224, *p* = 0.073, η<sub>p</sub><sup>2</sup> = 0.152. The two-way interaction, (measurement time point × electrode replacement), was not significant, *F*(1.382, 24.877) = 2.318, *p* = 0.151, η<sub>p</sub><sup>2</sup> = 0.106. A boxplot depicting the omnibus impedance data is shown in **Figure 10**. Results of the analysis of the impedance data suggest that impedance for all participants, regardless of electrode replacement group assignment, was constant over the duration of the data collection.
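As a quick check on the adjusted degrees of freedom reported for the within-subjects effects, the Greenhouse-Geisser correction multiplies both the numerator and denominator degrees of freedom by ε. Assuming the usual mixed-design values for *k* = 4 measurement time points, *N* = 20 participants, and *g* = 2 groups (these symbols are introduced here for illustration only):

$$df_1 = \varepsilon\,(k - 1) = 0.461 \times 3 \approx 1.38, \qquad df_2 = \varepsilon\,(k - 1)(N - g) = 0.461 \times 3 \times 18 \approx 24.9$$

These agree with the reported *F*(1.382, 24.877) up to rounding of ε.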


#### **EXAMPLE FEATURE DATA FROM REPLACED GROUP**

An example dataset from the Replaced group is shown in **Figure 11**. The data from this participant is represented as a single time series for both the Blink Rate and IBI features as well as time-frequency plots for both Fz and Pz. All four AF-MATB trials are shown individually in a representation of the 2 (session) × 2 (task difficulty) study design. Like the feature vectors, the data in this figure are averaged using a 10 s window with a 9 s overlap. All corresponding data series are shown on the same scale (e.g., all of the time-frequency plots use the same scale for mapping log power [dB/Hz] to the colormap shown in the colorbar). Individual band ranges for theta, alpha, beta, and gamma are annotated on the time-frequency plots (delta is omitted). Examining the time series, we observe workload differences consistent with results in similar previous studies (Wilson and Fisher, 1995; Gevins et al., 1998), but no obvious differences as a function of session or having the electrodes replaced between sessions.

#### **LEARNING ALGORITHM PERFORMANCE**

Implementing the k-fold (*k* = 10) procedure (using S1 as the learning set and S2 as the test set) for each of the *N* = 20 participants resulted in 200 trained/tested classifiers for each of the four learning approaches (LDA, LIN-SVM, RBF-SVM, and ANN). A modified analysis design from that used for the AF-MATB performance and NASA-TLX subjective workload data is necessary given that (1) the within-subject factor of workload is collapsed into a single algorithm performance metric, either proportion of epochs correctly classified ("accuracy") or d′, and (2) the within-subjects factor of session is eliminated given the desire to only investigate the simulated real-time implementation of the pBCI architecture performance on S2. Classifier performance on the nested test set (random 10% of S1) was at ceiling for all of the learning approaches (**Figure 12**) and is omitted from all further analyses. For all statistical tests the learning algorithm performance measure used was d′; however, to aid in ease of interpretation, all figures will present overall classifier accuracy as proportion of all epochs that were correctly classified.

In order to evaluate learning algorithm performance, observed classifier performance was compared to the null distribution for each approach. Given the binary classification problem presented here, the theoretical null accuracy should be 0.50 (or 50% accuracy, with a theoretical null d′ of 0). An empirical comparison requires that the empirical null distribution for classifier performance be available. Following the methods of Hughes et al. (2013), empirical null distributions were calculated for each of the learning approaches by randomizing class label assignments (while keeping the sets balanced) for both the learning and test sets. These empirical null distributions were determined via the same k-fold procedure used for the actual accuracy results. The learning data included in each of the empirical null k-folds was identical to that included in a corresponding accuracy k-fold (that is, the exact same feature input matrices used for the accuracy k-folds were also used for the empirical null k-folds). Accuracy distributions from both k-fold procedures are shown in **Figure 13**.
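The label-randomization idea can be sketched as follows. This is not the authors' pipeline: the synthetic one-dimensional feature, the nearest-class-mean rule standing in for LDA/SVM/ANN, and the omission of the k-fold loop are all simplifying assumptions, and every function name is hypothetical.

```python
import random
from statistics import mean

def fit_nearest_mean(features, labels):
    """Fit a 1-D nearest-class-mean rule (a simple stand-in classifier)."""
    mean_low = mean(x for x, y in zip(features, labels) if y == 0)
    mean_high = mean(x for x, y in zip(features, labels) if y == 1)
    return lambda x: 1 if abs(x - mean_high) < abs(x - mean_low) else 0

def accuracy(classifier, features, labels):
    """Proportion of epochs whose predicted label matches the given label."""
    return mean(1.0 if classifier(x) == y else 0.0 for x, y in zip(features, labels))

def shuffled(labels, rng):
    """Permute labels; a permutation keeps the two classes balanced."""
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted

def make_session(rng, n_per_class):
    """Synthetic session: low workload ~ N(0, 1), high workload ~ N(2, 1)."""
    features = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    features += [rng.gauss(2.0, 1.0) for _ in range(n_per_class)]
    labels = [0] * n_per_class + [1] * n_per_class
    return features, labels

rng = random.Random(0)
s1_x, s1_y = make_session(rng, 200)  # learning set
s2_x, s2_y = make_session(rng, 200)  # between-session test set

# Observed performance: learn on S1, test on S2 with the true labels.
observed = accuracy(fit_nearest_mean(s1_x, s1_y), s2_x, s2_y)

# Empirical null: same feature matrices, labels randomized in both sets.
null_accuracies = []
for _ in range(200):
    classifier = fit_nearest_mean(s1_x, shuffled(s1_y, rng))
    null_accuracies.append(accuracy(classifier, s2_x, shuffled(s2_y, rng)))

print(f"observed accuracy: {observed:.2f}")
print(f"empirical null mean: {mean(null_accuracies):.2f}")
```

Because the features are untouched and only the labels are permuted, any structure in the data unrelated to workload is preserved in the null distribution.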

The accuracy distribution (using d′) for each individual learning approach was compared to its corresponding empirical null distribution using a paired *t*-test (two-tailed, α = 0.01) and the Bonferroni correction for multiple comparisons. In this series of analyses, each participant contributed a single median calculated across the folds for each approach. This choice was made for two reasons: first, so that the sample size for learning algorithm performance is not inflated from that used in other analyses; and second, so that spurious algorithm performance cases (if present) would not appear in the dataset as extreme values or outliers. All four learning approaches generated performance results that were significantly greater than their corresponding empirical nulls. These test results are summarized in **Table 2**.
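A minimal sketch of the reduction described above: each participant's folds are collapsed to a median, and the medians are compared to the matching null medians with a paired *t* statistic. The d′ values below are invented for illustration (three participants, four folds), and only the statistic and its degrees of freedom are computed; a *p*-value would additionally require a *t*-distribution CDF, for example from `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, median, stdev

def paired_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for paired observations."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical d' values per participant and fold (not the study's data).
observed_folds = [
    [2.1, 2.4, 2.2, 2.6],
    [1.8, 1.9, 2.0, 1.7],
    [2.9, 3.1, 3.0, 2.8],
]
null_folds = [
    [0.1, -0.2, 0.0, 0.2],
    [0.0, 0.1, -0.1, 0.3],
    [-0.1, 0.2, 0.1, 0.0],
]

# One median per participant keeps the sample size at N, not N x k.
observed_medians = [median(folds) for folds in observed_folds]
null_medians = [median(folds) for folds in null_folds]

t_stat, dof = paired_t(observed_medians, null_medians)
print(f"t({dof}) = {t_stat:.2f}")
```

Using the median rather than every fold means a single aberrant fold cannot dominate a participant's contribution to the test.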

A 2 (electrode replacement, between) × 4 (learning approach, within) mixed model ANOVA was performed to test for significant effects of these factors on learning algorithm performance. As with the empirical null comparisons, and with the same justifications, median classifier performance values were used in this analysis.


Mauchly's test revealed no violations of sphericity, χ<sup>2</sup>(5) = 6.542, *p* = 0.258; therefore, exact degrees of freedom were used in the following analyses. The main effect of electrode replacement was not significant, *F*(1, 18) = 0.086, *p* = 0.773, η<sub>p</sub><sup>2</sup> = 0.005. There was, however, a significant main effect of learning approach, *F*(3, 54) = 4.489, *p* = 0.007, η<sub>p</sub><sup>2</sup> = 0.200. The two-way interaction of (electrode replacement × learning approach) was not significant, *F*(1, 18) = 2.593, *p* = 0.125, η<sub>p</sub><sup>2</sup> = 0.126. Boxplots for learning algorithm performance are shown in **Figure 14**.
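The reported effect sizes can be checked against the *F* statistics using the standard identity for partial eta squared, η<sub>p</sub><sup>2</sup> = (*F* · df<sub>effect</sub>) / (*F* · df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies it to the three tests above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

reported_tests = [
    ("electrode replacement", 0.086, 1, 18),
    ("learning approach", 4.489, 3, 54),
    ("interaction", 2.593, 1, 18),
]

effect_sizes = {
    name: round(partial_eta_squared(f_value, df_effect, df_error), 3)
    for name, f_value, df_effect, df_error in reported_tests
}
for name, value in effect_sizes.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values agree with the effect sizes reported in the text to three decimal places.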

To further probe the main effect of learning approach, a *post-hoc* pairwise comparison employing Bonferroni correction for multiple comparisons was performed on the learning approach factor (collapsed across electrode replacement), α = 0.01. The pairwise comparison tests revealed a difference of 0.282, 99% CI [−0.21, 0.586], that approached significance, *p* = 0.018, between ANN (*M* = 2.406, *SD* = 1.248) and RBF-SVM (*M* = 2.124, *SD* = 1.146) classifier performance. There was also a difference of 0.343, 99% CI [−0.085, 0.769], that approached significance, *p* = 0.05, between LDA (*M* = 2.467, *SD* = 1.178) and RBF-SVM (*M* = 2.124, *SD* = 1.146) classifier performance. All other comparisons were non-significant (*p* > 0.13). A boxplot outlining the *post-hoc* tests (collapsed across electrode replacement) is shown in **Figure 15**.


#### **ADDITIONAL LEARNING ALGORITHM PERFORMANCE ANALYSIS**


To test for cross-session generalization, all learning algorithms were trained on data from S2 and tested on S1 for between-session accuracy; all subsequent preparation of the learning algorithm performance results, expressed as d′, was consistent with the previous analysis, where the learning set was extracted from S1 and the between-session test set consisted of all data from S2. A paired-samples *t*-test (two-tailed) was independently performed for each learning algorithm using the median learning algorithm performance from the k-fold distributions. The paired-samples *t*-tests, α = 0.01, were all non-significant, with a mean difference of 0.123, 99% CI [−0.200, 0.445], *p* = 0.291 for the ANN, 0.120, 99% CI [−0.174, 0.141], *p* = 0.257 for the LDA, 0.128, 99% CI [−0.214, 0.471], *p* = 0.297 for the LIN-SVM, and 0.030, 99% CI [−0.357, 0.418], *p* = 0.825 for the RBF-SVM. With all results being non-significant, a correction for multiple comparisons was not necessary. These results demonstrate very good cross-session generalization when using either session, S1 or S2, for the learning set.

While not the focus of this study, the *post-hoc* nature by which the dataset may be examined provides the opportunity for a number of additional, and informative, analyses. In particular, new learning approaches can be simulated using features that are derived from individual data sources, such as separating feature sets into those originating from EEG channels and those originating from non-EEG (or, peripheral) channels. Each of these new feature sets can also be tested for cross-session generalization (learning on S1 as compared to learning on S2). To this point, a number of different learning trials were performed by considering a variety of situations under which only certain classes (or sources) of features would be available. Each of these new feature sets was also tested for cross-session generalization. Tables containing learning algorithm performance metrics, separated for each participant and each group, can be found in the online Supplementary Material for this manuscript. Researchers interested in obtaining a copy of this dataset for additional analysis should contact the corresponding author.

As an example of an additional analysis that could be performed using these data tables, learning algorithm performance using the complete feature set (**Figure 14**) was compared to using only those features derived from EEG data channels. A paired-samples *t*-test (two-tailed) was independently performed for each learning algorithm using the median learning algorithm performance from the k-fold distributions. The paired-samples *t*-tests, α = 0.01, with a mean difference of 0.565, 99% CI [0.017, 1.112], *p* = 0.008 for the ANN, 0.437, 99% CI [0.151, 0.722], *p* = 0.0003 for the LDA, 0.488, 99% CI [0.083, 0.892], *p* = 0.003 for the LIN-SVM, and 0.372, 99% CI [−0.111, 0.857], *p* = 0.040 for the RBF-SVM, all revealed significant differences or approached significance between the two approaches without any correction for multiple comparisons (noting that only the LDA result remained significant, *p* = 0.0012, after Bonferroni correction, with the ANN, *p* = 0.032, and the LIN-SVM, *p* = 0.012, approaching significance; the RBF-SVM, *p* = 0.16, would be considered not significant). In each case, overall mean classifier performance for the group (expressed as sensitivity, or d′) was higher using the complete feature set than using only EEG-derived features.
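The Bonferroni-adjusted values quoted in this paragraph follow from multiplying each uncorrected *p*-value by the number of tests (four, one per learning approach) and capping the result at 1. A small sketch reproducing them from the reported uncorrected values:

```python
def bonferroni(p_values):
    """Bonferroni-adjust a mapping of test name -> uncorrected p-value."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}

# Uncorrected p-values as reported for the full vs. EEG-only comparison.
uncorrected = {"ANN": 0.008, "LDA": 0.0003, "LIN-SVM": 0.003, "RBF-SVM": 0.040}
adjusted = bonferroni(uncorrected)
significant = [name for name, p in adjusted.items() if p < 0.01]

for name, p in adjusted.items():
    print(f"{name}: adjusted p = {p:.4f}")
print("significant at alpha = 0.01:", significant)
```

Only the LDA comparison stays below α = 0.01 after adjustment, consistent with the result stated in the text.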

#### **FEATURE SALIENCY AND RANKINGS FOR ANN LEARNING SET**

Feature saliency was calculated from the ANN learning procedure using the Ruck saliency method (Ruck et al., 1990). Saliency values were converted to proportion of summed saliency (across all 37 features) for each of the 200 k-fold iterations and then averaged together to form an omnibus saliency ranking. The top five features, ranked by mean saliency, are shown in **Table 3** (the full ranking table can be found in Supplementary Materials).

A second approach to examining feature saliency is to compare the ordinal rank of the features, regardless of their relative saliency, within a single learning iteration. This results in a very simple measure, the average rank (range of 1–37) for each of the features used in the learning set. This average rank measure was also computed across the 200 k-fold iterations (*N* = 20, *k* = 10) for the ANN learning approach. Results for the top five features, ranked by mean ordinal position, are presented in **Table 4** (the full ranking table can be found in Supplementary Materials).
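The two ranking schemes can disagree, as the following sketch illustrates. It starts from raw saliency values that are assumed to be already available (the Ruck saliency computation itself is not reproduced here); the three feature names and all numbers are hypothetical.

```python
from statistics import mean

def to_proportions(saliency):
    """Convert raw saliency values to proportions of the summed saliency."""
    total = sum(saliency.values())
    return {name: value / total for name, value in saliency.items()}

def to_ranks(saliency):
    """Ordinal rank within one iteration (1 = most salient)."""
    ordered = sorted(saliency, key=saliency.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Hypothetical raw saliency for three features over three learning iterations.
iterations = [
    {"gamma_Pz": 30.0, "theta_Fz": 3.0, "IBI": 2.0},
    {"gamma_Pz": 0.5, "theta_Fz": 3.0, "IBI": 2.5},
    {"gamma_Pz": 2.0, "theta_Fz": 3.5, "IBI": 2.5},
]
features = list(iterations[0])

mean_proportion = {
    f: mean(to_proportions(it)[f] for it in iterations) for f in features
}
mean_rank = {f: mean(to_ranks(it)[f] for it in iterations) for f in features}

print("top by mean saliency:", max(mean_proportion, key=mean_proportion.get))
print("top by mean ordinal rank:", min(mean_rank, key=mean_rank.get))
```

In this toy case one feature dominates a single iteration and therefore leads the mean saliency ranking, while a feature that is moderately salient in every iteration leads the mean ordinal ranking.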

#### **DISCUSSION**

The study presented here aimed to investigate the impact of methodological variability due to same-day, between-session electrode replacement on learning algorithm performance in the context of a pBCI system for assessing cognitive workload. The importance of these effects is clear when considering that (1) real-world pBCI systems will almost necessarily be implemented for multi-session and multi-day use, and (2) sensor systems for monitoring neurophysiological and neurobehavioral measures in these architectures will almost necessarily require removal and replacement between sessions. Decoupling this effect from previously observed declines in classifier performance over time courses as short as hours (Christensen et al., 2012) necessitated a between-subjects design to probe electrode replacement as a factor. This prior observation, when considered in tandem with the logistical difficulty of maintaining electrode montage preparation in a non-clinical setting for extended time periods, made the selection of a time course of minutes to hours the logical choice over which to investigate these effects.

A critical first-step analysis was to observe electrode impedance for the duration of the data collection. Neither the between-subjects factor of electrode replacement nor the within-subjects factor of time of measurement significantly impacted the omnibus measure of electrode impedance. This result strongly suggests that any electromechanical variability introduced by both a second electrode preparation and the use of different electrode sets between sessions did not influence data quality (either negatively or positively) for the Replaced group. A second conclusion that can be drawn is that electrode impedances were also maintained at acceptable levels for the Remained group.

There were two participants who, at the beginning of S2 (impedance measurement point Z3 in **Figure 2**), had one or more electrodes with impedances above set tolerance maximums (one electrode for the first participant, and four electrodes for the second participant), all of which were on electrodes at EEG scalp sites. In each instance, adding more Electro-Gel to the site reduced the impedance to within tolerance; additional skin preparation was not required. No effect on data quality was observed from any impedance changes that may have occurred during S1 (all raw and real-time processed time series were viewed online during data collection). The effect of the addition of Electro-Gel, when coupled with observations of researchers during data collection, indicates that the conductive gel leaked from underneath the plastic housing on the electrode cap during the between-session break. Out-of-tolerance impedances were never reported for either the single-lead (VEOG, HEOG, and mastoid) or disposable electrodes (ECG). Both of these electrode types seal to the skin via temporary adhesive and necessarily prevent gel leakage, whereas the plastic housings on the elastic electrode cap are more easily separated from the skin surface under some conditions (e.g., inadvertent displacement, less-than-perfect conformity of the cap to the participant's head, or varying hair styles).

Also worth noting is the potential for variability in electrode location due to replacement between sessions. All electrode preparations for this study were completed by experienced EEG researchers; as such, it is reasonable to assume that electrode location was consistent for the Replaced group. Even small variations in physical electrode locations are likely to be negligible given the volume conduction phenomenon in skin surface electropotential measurement, which can further be interpreted as the cause of the often reported poor (native) spatial resolution of EEG (Gevins, 1987).

**Table 2 | Results of pairwise comparison tests for individual learning approaches compared to their respective empirical null distribution.**

The dual-session approach is a potential confound in this study design. Workload may vary between sessions due to participant task learning or fatigue, resulting in degraded pBCI performance. However, results obtained with subjective workload and task performance measures suggest that workload was highly consistent across sessions. The only significant effect observed for both task performance and subjective workload was that of task difficulty. These complementary results show that the manipulation of task difficulty was successful in influencing cognitive workload, or more precisely, increased workload between low and high task difficulty (as evidenced by the subjective measures) such that task performance decreased (as evidenced by the task performance measures). The consistency of these measures across sessions confirms that workload state, respective of task difficulty, can be considered constant for both S1 and S2. It is also evident that both the Remained and Replaced groups experienced the same relative workload levels.

With all of the aforementioned variables being equal with respect to electrode replacement and session, and a meaningful difference in workload evidenced between task difficulty conditions, it is thus appropriate to make comparisons in learning algorithm performance given a pBCI system approach for assessing cognitive workload. Accuracy distributions for all learning approaches, compared to their respective empirical null distributions, showed significant performance above the chance accuracy level. While *post-hoc* comparisons between learning algorithms did not reach significance, there is evidence to suggest some differences in performance between the algorithms used. Considering the nested test set (reserved from S1) accuracies in **Figure 12** together with the between-session (tested on S2) accuracies in **Figure 15**, it appears that both the LDA and LIN-SVM learning techniques produced slightly better generalization to S2 (between-sessions) at the cost of lower overall nested test set accuracy, suggesting the possibility of over-fitting in the non-linear approaches. Indeed, the RBF-SVM exhibited very high nested test set accuracies only to perform worst, overall, when fixed as a pattern classifier for testing on S2. The ANN, with its early stopping rule based on learning error from a withheld validation set, appears to strike a balance between robust learning and over-fitting. It is worth noting that the validation set is not strictly independent from the learning set since they were both sampled from S1. A more thorough methodology would be to use a validation set with greater independence from the learning set, such as data from a third session, or even perhaps a different day. Given this consideration the ANN still showed robust generalization to the between-session test set while maintaining nearly perfect nested test set accuracy.

Overall learning algorithm accuracies presented here, as related to temporal distance between learning and test sets, largely replicate those obtained using a very similar cognitive workload task in previous work (Christensen et al., 2012). Namely, we observed workload state classification for data temporally separated from the learning set by only seconds to perform at or near ceiling (**Figure 12**). Further, classification accuracy for data temporally separated from the learning set (S1) by minutes to hours (S2) suffers from a decrement in accuracy relative to the nested test set from the same session (**Figure 15**). It is noteworthy to state, here, that the temporal delay between S1 and S2 was 45 min, which is comparable to the "minutes" of separation category in Christensen et al. (2012). At this level of separation from learning to test, both studies produced classification accuracies of 85–90%, on average.

The most important result of this work, however, is that learning accuracy was not impacted by the replacement of the electrode montage between sessions. The impact of this finding is perhaps even greater considering that a new set of electrodes was applied in between sessions for the Replaced group. Eliminating this methodological variability as a potential factor in learning algorithm performance is a key step forward in developing strategies for implementing multi-session, multi-day paradigms for pBCI usage. While only one feature set was tested here, it is reasonable to believe that similar results would also be obtained for evolving signal processing methodologies that are being actively developed and used elsewhere (see Makeig et al., 2012 for a recent review). It is also reasonable to hypothesize that this result would transfer to other task protocols given that the electrode preparation is independent from the underlying cognitive task protocol; however, it is important for future work to extend these considerations to other protocols as well, such as steady-state conditions of shorter duration than those used here (15 min task states), dynamic, and concurrent task states. Additional analyses of learning algorithm performance showed good generalization of these results when using S2 as the learning set and S1 as the test set. Also of interest is that the addition of the peripheral physiological measures to the feature set increased overall classifier performance for all four learning approaches, with only the RBF-SVM not approaching or obtaining significance as compared to using the EEG-only feature set.

**FIGURE 15 | Boxplots showing between-session classifier accuracy, collapsed across group, data representative of the** *post-hoc* **pairwise comparison testing given the significant main effect of learning approach shown in Figure 14.** Despite the significant main effect shown in **Figure 14**, all *post-hoc* pairwise comparison tests were non-significant (using the Bonferroni correction for multiple comparisons), although the two comparisons that approached significance were ANN vs. RBF-SVM and LDA vs. RBF-SVM. The boxplots shown represent the median (line inside the box), first and third quartiles (bottom and top of the box, or the lower and upper hinges, respectively), and minimum and maximum values (lower and upper whiskers, respectively, or inner fences). Outliers exceeding 1.5 times the box height are shown as individual sample points (circles). Extreme outliers, or those samples exceeding 3 times the box height, are indicated by asterisks.

#### **Table 3 | Mean saliency rank of features (top 5 of 37).**


#### **Table 4 | Mean ordinal rank of features (top 5 of 37).**


The feature saliency analysis revealed similarities and differences between this study and previously-published results regarding EEG signals associated with workload. For example, Wilson and Fisher (1995) reported significant contributions from higher frequency bands including gamma, while Gevins et al. (1998) reported increased frontal theta and decreased parietal alpha with increasing workload. Both patterns of results were found in this study, depending on which approach to determine feature saliency rank was used. Saliency-based assessment showed three of the top five features in the gamma band, while ordinal rank assessment showed three different bands (theta, alpha, and beta) from four different sites as being most highly ranked. Both included IBI, and Blink Rate was the most salient overall (on average). One reasonable interpretation for the difference in saliency vs. ordinal rankings is that features such as gamma band activity are very highly separable but may not appear frequently for all participants, while other features that are less separable (weaker learners) may be more consistently present across a group of participants. Additional evidence in favor of this interpretation is found in the analysis approaches taken in prior studies; Wilson and Fisher (1995) obtained their results implicating gamma activity via individually trained classifiers, while Gevins et al. (1998) analyzed data at the group level and found frontal theta and parietal alpha to be significant indicators of workload. This result suggests that the use of a diverse sensor suite and continued investigation of new sensor types and feature extraction techniques are worthwhile endeavors for those interested in pBCI system research. As an example, Whitham et al. (2007, 2008) have provided convincing evidence that beta and gamma bands are heavily influenced by tonic electromyographic artifact (EMG); given the large amplitude of EMG activity, even when projected to scalp EEG sites, it is reasonable to infer that high-amplitude EMG differences associated with workload state changes could be responsible for highly-separable beta and gamma band features. If it is the case that EMG activity happens to be a useful "feature" for some pBCI applications, a systematic investigation of this effect in the context of cognitive and affective state assessment that leverages relevant feature separation and extraction approaches (e.g., McMenamin et al., 2010, 2011) would be a worthwhile effort.

There are a number of reports from researchers suggesting less-than-perfect success rates, or the so called "BCI-illiterate" effect, in traditional BCI applications (e.g., Guger et al., 2009, 2011; Allison et al., 2010), so it is not at all surprising that pBCI architectures can produce low-performing state classification accuracies for some participants. Of the 20 participants in this study, one (de-identified with an identifier of P24) was consistently at or below chance accuracy on between-session test set accuracy; this below chance accuracy persisted when S2 was used as the learning set, as well. This participant's low-performing workload state classification impacted the sample distributions shown in **Figure 14** by negatively skewing the learning algorithm performance of the electrodes replaced group. Given a lack of any a priori basis on which to exclude these results, P24's data was included in all prior analysis; however, given the significant skew, it is worthwhile to investigate learning algorithm results without this participant's data. In order to examine the learning approach data in such a way, the skewed data from P24 was replaced with the sample mean (by factor) and the distributions were reexamined. These data, along with corresponding time series data similar to that shown in **Figure 11**, are included in the online Supplementary Material (Figures 2–4). Unsurprisingly, replacing P24's data with the sample mean (by factor) all but eliminates the skew from the data distributions (Supplementary Material, Figure 2). As a comparison, this same data is also expressed as d′, the learning algorithm performance measure that was used for all statistical analysis (Supplementary Material, Figure 3).
As with the classifier accuracy representation, there is no noticeable skew represented in the learning algorithm performance distributions when expressed as d′; note, also, that the normality and equality of variance across factors is greatly improved in the d′ distributions, thus further justifying the use of d′ as a suitable metric for all analyses of variance. Repeating the previously reported 2 (electrode replacement, between) × 4 (learning approach, within) mixed model ANOVA to test for significant effects of these factors on learning algorithm performance after correcting for P24 as an outlier produces nearly identical results: using exact degrees of freedom (no violation of sphericity via Mauchly's test, χ<sup>2</sup>(5) = 6.560, *p* = 0.256), the main effect for electrode replacement was not significant, *F*(1, 18) = 0.084, *p* = 0.775, η<sub>p</sub><sup>2</sup> = 0.005. There was, however, a significant main effect of learning approach, *F*(3, 54) = 5.131, *p* = 0.003, η<sub>p</sub><sup>2</sup> = 0.222. The two-way interaction of (electrode replacement × learning approach) was not significant, *F*(1, 18) = 0.789, *p* = 0.505, η<sub>p</sub><sup>2</sup> = 0.042. That is to say that correcting for P24 as an outlier participant does not affect the outcome of the test for learning approach performance. When compared to **Figure 11**, the time series data for this participant (Supplementary Material, Figure 4) does not exhibit any easily identifiable features that appear to be separable with respect to changes in workload.
An understanding of why pBCI systems may work for some persons but not others (or at least may not be as accurate) could be tremendously helpful, enabling adaptations in sensor choice, feature selection, training procedures, and other such interventions to mitigate those differences.
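The outlier handling described earlier in this paragraph, replacing P24's values with the sample mean by factor, can be sketched as below. The text does not state whether the mean was taken with or without P24's own value; this sketch assumes the mean of the remaining participants in the same design cell, and all values and participant identifiers other than P24 are hypothetical.

```python
from statistics import mean

def replace_with_cell_mean(cell_values, participant):
    """Replace one participant's value with the mean of the other participants in the cell."""
    others = [value for pid, value in cell_values.items() if pid != participant]
    patched = dict(cell_values)
    patched[participant] = mean(others)
    return patched

# Hypothetical d' values for one design cell (Replaced group, one learning approach).
replaced_group_cell = {"P21": 2.6, "P22": 2.2, "P23": 2.9, "P24": -0.1}
patched_cell = replace_with_cell_mean(replaced_group_cell, "P24")

print(round(mean(replaced_group_cell.values()), 3))  # cell mean with the outlier
print(round(mean(patched_cell.values()), 3))         # cell mean after replacement
```

Mean replacement keeps the sample size and design balanced for the ANOVA while removing the outlier's influence on the cell mean, at the cost of slightly understating the variance.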

In summary, this work set out to determine what, if any, impact electrode removal and replacement has on learning algorithm performance in dual-session, same-day use of pBCI systems. Testing was conducted over a time course of minutes to hours, known from prior work to result in observable declines in algorithm accuracy comparable with those observed over multi-day testing. Results showed that, after successfully implementing a paradigm for increasing cognitive workload in a multitask environment, the accuracy for a group of participants whose electrodes were replaced in a between-session test did not significantly differ from a control group whose electrodes remained in place for the entire data collection. Having reduced concern for this potential source of methodological variability as a confound to learning accuracy decline in dual-session paradigms, it is recommended that future work in this area focus on nonstationarity and reduced classifier performance due to intrinsic factors not related to the removal and replacement of electrodes. However, it is also pertinent that this type of study be repeated and replicated in other paradigms for increased validity of the results presented here. Future pBCI research should also strongly consider novel sensor and feature development in an effort to improve the long-term stability of these systems, particularly for real-world applications (e.g., McDowell et al., 2013).

#### **ACKNOWLEDGMENTS**

The authors would like to thank William D. Miller, Jr. (Science, Mathematics and Research for Transformation Fellow, George Mason University) for his ongoing commitment to AF-MATB software development, Iris E. Davis, Margaret A. Bowers, and Samantha L. Klosterman (Ball Aerospace and Technologies Corp.) for their substantial efforts during data collection for this study and contributions in preparing this manuscript, and Dr. Glenn F. Wilson (Air Force Research Laboratory Emeritus, and Physiometrex, Inc.) for his helpful discussions about and contributions to study design. This work was supported by the Air Force Office of Scientific Research.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnins.2015.00054/abstract

#### **REFERENCES**


assistive technologies: state-of-the-art and challenges. *Front. Neurosci.* 4:161. doi: 10.3389/fnins.2010.00161


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 22 February 2014; accepted: 06 February 2015; published online: 09 March 2015.*

*Citation: Estepp JR and Christensen JC (2015) Electrode replacement does not affect classification accuracy in dual-session use of a passive brain-computer interface for assessing cognitive workload. Front. Neurosci. 9:54. doi: 10.3389/fnins.2015.00054*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2015 Estepp and Christensen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Artificial Neural Network classification of operator workload with an assessment of time variation and noise-enhancement to increase performance

# *Alexander J. Casson<sup>1,2</sup>\**

*<sup>1</sup> Sensing, Imaging and Signal Processing Group, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, UK*

*<sup>2</sup> This Work was Carried Out While the Author was with the Optical and Semiconductor Devices Group, Department of Electrical and Electronic Engineering, Imperial College London, London, UK*

#### *Edited by:*

*Thorsten O. Zander, Technical University of Berlin, Germany*

#### *Reviewed by:*

*Michal Lavidor, Bar Ilan University, Israel*

*Jonas Brönstrup, Institute of Technology Berlin, Germany*

# *\*Correspondence:*

*Alexander J. Casson, Sensing, Imaging and Signal Processing Group, School of Electrical and Electronic Engineering, The University of Manchester, Sackville Street Building, Manchester, M13 9PL, UK e-mail: alex.casson@manchester.ac.uk*

Workload classification—the determination of whether a human operator is in a high or low workload state to allow their working environment to be optimized—is an emerging application of passive Brain-Computer Interface (BCI) systems. Practical systems must not only accurately detect the current workload state, but also have good temporal performance: requiring little time to set up and train the classifier, and ensuring that the reported performance level is consistent and predictable over time. This paper investigates the temporal performance of an Artificial Neural Network based classification system. For networks trained on little EEG data good classification accuracies (86%) are achieved over very short time frames, but substantial decreases in accuracy are found as the time gap between the network training and the actual use is increased. Noise-enhanced processing, where artificially generated noise is deliberately added to the testing signals, is investigated as a potential technique to mitigate this degradation without requiring the network to be re-trained using more data. Small stochastic resonance effects are demonstrated whereby the classification process gets better in the presence of more noise. The effect is small and does not eliminate the need for re-training, but it is consistent, and this is the first demonstration of such effects for non-evoked/free-running EEG signals suitable for passive BCI.

**Keywords: Artificial Neural Network, Augmented Cognition, EEG, noise-enhanced processing, passive BCI, stochastic resonance, workload classification**

# **1. INTRODUCTION**

Augmented Cognition is a recent research concept focusing on creating the next generation of Human-Computer Interaction devices (Schmorrow and Stanney, 2008). Closed-loop Brain Computer Interfaces (BCIs) are a classic example of such systems. In these, a human uses a computer whilst simultaneously the computer monitors the human and changes its operation based upon the results. Changes might be made to the outputs presented, to the input streams which are used, or to the levels of automation and assistance that are provided, amongst others. For example, workload monitoring systems aim to detect when an operator is in a high or a low workload state to potentially change the speed at which information is presented. As such the work flow and operating environment can be optimized in a real-time and time-varying manner (Wilson and Russell, 2007). Alternatively, workload monitoring could be used to enhance human training: the mental load of a new task can be objectively measured and training times increased or decreased to terminate the training only when the task involves a low level of effort (Ayaz et al., 2012).

Emerging passive BCIs use the spontaneously produced *free-running* EEG (electroencephalogram) signals that are naturally present on the scalp due to the normal functioning of the brain without any specific stimuli present or the user consciously controlling their brain activity (Zander and Kothe, 2011). Workload classification is quickly emerging as a *killer app* for passive BCI as it is particularly suited to being based upon free-running EEG as opposed to evoked responses. At the extreme fatigue end: sleep onset is characterized in free-running EEG by a reduction in alpha (7–12 Hz) activity, with this being replaced by theta (4–7 Hz) activity (Rechtschaffen and Kales, 1968). This has long been used clinically for sleep staging and there are numerous recent papers (see for example Christensen et al., 2012; Dijksterhuis et al., 2013) demonstrating that frequency band changes can be used to identify less extreme changes in vigilance level.

As with many BCI systems the challenge for workload monitors now is in moving out-of-the-lab and into uncontrolled environments which are significantly corrupted by noise, and in creating systems that are reliable, robust and re-usable. For machine learning based systems an essential parameter in this is the training required to set up the classification process. During training, EEG data which is known to arise during a particular workload state is presented to the classifier, which it then uses to *learn* classification boundaries which can be applied to new EEG data where the workload state is not known. Ideally this classifier generation and training process:


Traditionally much focus has been attached to maximizing condition 1, ensuring that the classification system has good generalisability and can be correctly applied to previously unseen data. However, for practical systems this is not the only objective, and conditions 2 and 3, which determine the *temporal performance* of the system, are of critical importance. Recent work (Estepp et al., 2011; Christensen et al., 2012) has suggested that the performance of workload monitors is not constant, instead degrading as the time gap from the training session is increased. Christensen et al. (2012) showed that some of this performance loss could be recovered by increasing the amount of training data used and including data from multiple EEG sessions: each new EEG session would start with a short new training period to supplement the existing training data that was collected in previous sessions, potentially some time ago.

However, each time the classifier has to be re-trained the user must be placed into a known high or low workload state and new EEG data collected. This both requires a considerable amount of effort and decreases the time during which the system can be practically used to perform useful classifications. It is thus essential to devise and investigate new techniques that can potentially be used to improve temporal performance without requiring more training sessions to be carried out, and this paper begins this investigation.

This paper presents an Artificial Neural Network based workload classifier for determining operator state from the EEG (Section 2) with its performance evaluated in two novel ways. Firstly, focus is given to the temporal performance of the classification process, quantifying how long a system developed using very little training data can be used before re-training is required (as opposed to maximizing the general performance, condition 1 above). Whilst very good short term performance is obtained, it is verified that the performance does degrade with time.

Secondly, *noise-enhanced processing* is investigated as a technique to mitigate the change in performance. In this, artificially generated corrupting noise is deliberately added to the otherwise raw EEG input signal to evoke *stochastic resonance* from the classification process. This is an effect in non-linear processes whereby performance actually gets better in the presence of small amounts of noise (Kay, 2000; Chen et al., 2007). The method is particularly interesting to explore as noise is widespread in ultra-portable out-of-the-lab EEG recordings, and creating algorithms that are firstly robust to the presence of noise and then even enhanced by the presence of noise can completely change how ultra-portable EEG systems are designed.

As an illustration, nearly all modern dry EEG electrodes are based upon having *fingers*, rather than discs, for easier penetration through the hair, as illustrated in **Figure 1**. However, electrode contact noise is a function of the electrode area (Huigen et al., 2002), and electrode fingering decreases this area. In-depth measurements of dry electrode performance have been presented (Chi et al., 2010; Gandhi et al., 2011; Slater et al., 2012) but most studies report only a correlation coefficient between EEG recorded at nearby locations with wet and dry electrodes. Typical values reported are: >0.93 (Xu et al., 2011); 0.89 (Matthews et al., 2007); 0.83 (Gargiulo et al., 2008); 0.81–0.98 (IMEC, 2012); 0.68–0.90 (Patki et al., 2012); 0.39–0.85 (Estepp et al., 2009). New results in this paper show that even with up to 15 *µ*Vrms of artificial noise added to the raw EEG traces, correlations in-line with those reported for dry electrodes are found.

**FIGURE 1 | Two state-of-the-art commercially available dry EEG electrodes show the common** *fingered* **configuration.** Photographs by the author.

Further, in general electronics there is often a tradeoff between power consumption and the effective noise present (Harrison and Charles, 2003; Xu et al., 2011; Cuadras et al., 2012). A number of high quality, highly miniaturized wearable EEG units have become commercially available in recent years, but with battery lives typically in the 8 h range (Casson, 2013). This power performance still falls far short of *pick up and use* devices and substantial improvements in system power consumption are still required to realize units that can be trusted to be re-usable session after session. If BCI applications can be enhanced by noise the design of ultra-portable EEG systems is completely changed: more noise is now desirable and this potentially allows the use of even smaller electrodes and lower power consumption processing electronics.

A preliminary investigation of using noise-enhancement in workload monitoring was presented in Casson (2013). However, this was restricted to the investigation of two subjects and the use of a single Artificial Neural Network based algorithm. The results here have been extended to eight subjects and include the use of an array of neural networks to enable collective decision making.

# **2. MATERIALS AND METHODS**

#### **2.1. WORKLOAD TASK AND DATA**

In this work scalp EEG data is used to classify an operator as being in either a high or low workload state whilst they are performing a flight simulator task (Comstock and Arnegard, 1992; Miller Jr., 2010). This task was designed to represent aircraft operations, and particularly those of remote piloting. The data was recorded as part of the 2011 Cognitive State Assessment Competition (Estepp et al., 2011; Christensen et al., 2012) and is publicly available from the competition organizers.<sup>1</sup> It consisted of two EOG channels (vertical and horizontal) and 19 EEG channels in standard 10–20 locations (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, T5, C3, Cz, C4, T4, T6, P3, Pz, P4, O1, O2). All channels used a mastoid reference and ground and a 256 Hz sampling rate.

<sup>1</sup>Competition contact for data: Justin Estepp, Air Force Research Laboratory, cognitive.state@gmail.com

A total of 118 EEG recordings were performed using eight subjects (15 tests in six subjects, 13 tests in two) on five separate days spread over a month. Each recording day consisted of three 15 min EEG sessions, allowing the temporal performance of the workload monitor to be evaluated on a number of scales. Firstly, within each 15 min session the EEG data can be divided into training and testing periods which are separated in time by seconds or minutes. Recordings on the same day are separated by minutes or hours, while recordings on different days are separated by days and weeks, and up to a month.

The simulator task was set up such that the difficulty varied dynamically in order to induce known high and low workload states in the operator. No measure of task error was present in the publicly available data and instead the user workload is inferred from the set simulator difficulty. A total of 5 min was spent in each state, with at least a 1 min transition present between task segments classed as high and low workload. Here, only the high and low workload monitoring data segments are analyzed, with the transition segments being discarded. All subjects were trained in the operation of the simulator before the workload monitoring experiment was carried out, eliminating learning effects in the operators themselves.

#### **2.2. CLASSIFICATION ENGINE**

The workload classifier is a feedforward backpropagation Artificial Neural Network (ANN) with five hidden layers (Duda et al., 2001). The network is trained on EEG frequency domain information from seven frequency bands: 0–4 Hz, 4–7 Hz, 7–12 Hz, 12–30 Hz, 30–42 Hz, 42–84 Hz, 84–128 Hz; calculated using a 1024 point FFT. With seven bands for each of the 21 recorded channels (19 EEG and two EOG) this gives a total of 147 input features, and these are generated in 30 s epochs with 25 s overlap such that the assessment of operator state is updated every 5 s. All features are zero mean and unit standard deviation normalized before being passed to the network for classification. The network training used the scaled conjugate gradient backpropagation method and incorporated an early stopping training/validation/testing data split to avoid overfitting. Only 50% of the training vectors in each block of training data were used directly for network training. During network development 10 different networks with random starting weights/biases were used with different random selections of the feature vectors placed in the training sets.
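As an illustration only, the feature generation described above could be sketched as follows in Python with NumPy. The sampling rate, FFT length, band edges, epoch length and step are taken from the text. How the 1024 point FFT is applied within a 30 s epoch is not stated, so averaging the periodograms of non-overlapping 1024 sample segments is an assumption here, as are the half-open band intervals, the use of training-set statistics for the normalization, and all function names.

```python
import numpy as np

FS = 256      # sampling rate in Hz (from the text)
NFFT = 1024   # FFT length (from the text)
BANDS = [(0, 4), (4, 7), (7, 12), (12, 30), (30, 42), (42, 84), (84, 128)]

def band_powers(epoch):
    """epoch: (n_channels, n_samples) array -> flat vector of n_channels * 7 band powers."""
    n_ch, n_samp = epoch.shape
    n_seg = n_samp // NFFT
    # Assumption: average the periodograms of non-overlapping 1024 sample segments.
    segs = epoch[:, :n_seg * NFFT].reshape(n_ch, n_seg, NFFT)
    psd = (np.abs(np.fft.rfft(segs, axis=-1)) ** 2).mean(axis=1)
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    # Assumption: each band is the half-open interval [lo, hi).
    feats = [psd[:, (freqs >= lo) & (freqs < hi)].sum(axis=1) for lo, hi in BANDS]
    return np.stack(feats, axis=1).ravel()  # channel-major ordering

def epoch_features(recording, epoch_s=30, step_s=5):
    """Slide a 30 s window in 5 s steps (25 s overlap) over a (channels, samples) recording."""
    win, step = epoch_s * FS, step_s * FS
    starts = range(0, recording.shape[1] - win + 1, step)
    return np.array([band_powers(recording[:, s:s + win]) for s in starts])

def zscore(train_feats, feats):
    """Zero mean, unit standard deviation scaling using training-set statistics (assumption)."""
    mu = train_feats.mean(axis=0)
    sd = train_feats.std(axis=0)
    return (feats - mu) / np.where(sd > 0, sd, 1.0)
```

With 21 channels each row of `epoch_features` has 147 entries, matching the feature count quoted above.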

After the classification has been performed, the output of the ANN is a binary state placing the subject into one of the workload categories, with this assigned as the class with the maximum output from the ANN. The classification performance is reported as the percentage of epochs in each testing set which are correctly matched with the known high or low workload in that period in the simulation.

#### **2.3. AVERAGE AND TEMPORAL PERFORMANCE ASSESSMENT PROCESS**

Multiple trained ANNs are generated and evaluated here to assess the performance of the classification process in three ways: the time independent average performance, the temporal performance, and the temporal performance when noise enhancement techniques are applied.

The time independent average performance is found by using a leave-one-out cross validation procedure where all but one of the EEG recordings are used to train the ANN, and the remaining EEG record is used to test the ANN. This process is done on a per-subject basis, and is repeated using all of the different EEG records as the test data. The output is multiple out of sample test performances which are averaged to obtain an overall figure. This is a standard ANN development procedure (Duda et al., 2001), but it uses large amounts of training data (approximately 150 min per subject) and it does not maintain the temporal ordering of the EEG recordings. In many instances the training data will come from time points after the test data (the procedure is non-causal) and the overall average figure does not reflect changes in ANN performance over time.
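The per-subject leave-one-out procedure can be summarized with a short Python sketch. The names `train_fn` and `score_fn` are hypothetical placeholders for the network training and the percentage-correct scoring, which are not reproduced here.

```python
def leave_one_out(records):
    """Yield (training_records, test_record) pairs for one subject's recordings."""
    for i in range(len(records)):
        yield records[:i] + records[i + 1:], records[i]

def average_performance(records, train_fn, score_fn):
    """Mean out-of-sample score over all leave-one-out splits.

    train_fn(training_records) -> model and score_fn(model, test_record) -> score
    are hypothetical callables supplied by the caller.
    """
    scores = [score_fn(train_fn(train), test)
              for train, test in leave_one_out(records)]
    return sum(scores) / len(scores)
```

Note that the splits ignore recording order, which is exactly the non-causal property discussed above.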

To assess the temporal performance a modified procedure is used, which maintains causality and uses much less training data. Firstly, a unique ANN is generated for each of the 118 EEG sessions using training data from the start of an EEG session. This network is used to monitor the operator workload in the remainder of this session and demonstrates the *same session* performance when the training and test data segments are very close together in time. The procedure for sub-dividing each EEG session into training/validation data and prospective testing data is illustrated in **Figure 2**. Here, the first 20 epochs (125 s duration) from each of the high and low workload periods are used for generating the network. The remainder of the data is used for testing. For the performance evaluation 11 epochs (80 s) of test data are assessed at a time: this vector is then stepped through all of the available test data epochs to show the achieved classification performance as a function of time. These values can then be averaged to produce an overall figure which does reflect changes in ANN performance over time.
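The durations quoted above follow from the 30 s epoch length and 5 s update step: *n* consecutive epochs span 30 + 5(*n* − 1) seconds. A minimal Python sketch of this arithmetic, and of stepping an 11 epoch test window through a sequence of classifications, is given below. The function names are illustrative and the window is assumed to advance one epoch at a time.

```python
def span_seconds(n_epochs, epoch_s=30, step_s=5):
    """Time covered by n consecutive overlapping epochs."""
    return epoch_s + (n_epochs - 1) * step_s

def sliding_accuracy(predicted, truth, window=11):
    """Percentage of correctly classified epochs in each window of test epochs.

    Assumption: the window advances by one epoch at a time.
    """
    hits = [int(p == t) for p, t in zip(predicted, truth)]
    return [100.0 * sum(hits[i:i + window]) / window
            for i in range(len(hits) - window + 1)]
```

Here `span_seconds(20)` gives the 125 s training block and `span_seconds(11)` gives the 80 s test window; two training blocks (one per workload state) give 250 s of training data in total.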

The *cross session* temporal performance is then evaluated by reusing the ANNs trained as above (from a part of one session only) and testing them using EEG data collected in different recording sessions. In this case the test EEG is not split into training/test segments, and is instead all used as test data. Multiple time scales are investigated using different configurations of the available data:


All of these networks are kept subject specific, and it can be seen that the testing is purely prospective: data from different sessions is not mixed, testing data can only be from the future compared to training data, and the objective is to use little training data to accurately classify operator states which are significantly removed in time from the training data.

#### **2.4. NOISE-ENHANCED PERFORMANCE ASSESSMENT PROCESS**

To investigate noise-enhanced processing two analyses are carried out here. Firstly, to determine suitable noise levels to inject, a raw recorded EEG trace is compared to the same EEG trace after it has had artificial white Gaussian noise deliberately added to it. The correlation coefficient is then found, allowing a comparison with the correlations found in typical wet vs. dry EEG electrode studies (see discussion in Section 1). The white Gaussian noise is generated in MATLAB using the wgn function with independent noise streams added to each of the EEG channels. For this analysis a single complete 12.5 h EEG recording is used: the previously publicly available data from De Clercq et al. (2006) and Vergult et al. (2007). This long EEG record is split into multiple shorter duration EEG sections, and the correlation in each section plotted against the duration of the section. This allows the maximum, minimum and median correlation coefficients over time to be found.
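The correlation analysis can be sketched in Python as below, with NumPy's random generator standing in for the MATLAB wgn function. For a signal with standard deviation σ<sub>s</sub> and independent additive noise with standard deviation σ<sub>n</sub>, the expected correlation coefficient is σ<sub>s</sub>/√(σ<sub>s</sub><sup>2</sup> + σ<sub>n</sub><sup>2</sup>), which provides a check on the estimates. The function names are illustrative, and any amplitude passed in for the signal is an assumption of the caller rather than a value from the recording used in this paper.

```python
import numpy as np

def add_white_noise(x, noise_rms, rng):
    """Add zero-mean white Gaussian noise with the given rms to a trace."""
    return x + rng.normal(0.0, noise_rms, size=x.shape)

def section_correlation_stats(raw, noisy, section_len):
    """Correlate raw and noisy traces in consecutive sections of section_len samples."""
    n = len(raw) // section_len
    r = np.array([
        np.corrcoef(raw[i * section_len:(i + 1) * section_len],
                    noisy[i * section_len:(i + 1) * section_len])[0, 1]
        for i in range(n)
    ])
    return {"min": r.min(), "median": float(np.median(r)), "max": r.max()}

def expected_correlation(signal_rms, noise_rms):
    """Theoretical correlation for independent additive noise."""
    return signal_rms / np.sqrt(signal_rms ** 2 + noise_rms ** 2)
```

Repeating `section_correlation_stats` for a range of `section_len` values gives the maximum, minimum and median curves against section duration.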

Secondly, to assess the noise-enhancement, there are numerous different ways in which noise corrupted EEG data can be passed to the ANN in order to optimally evoke stochastic resonance from the classification procedure, and only one approach is evaluated here. In this, illustrated in **Figure 3**, 10 identical ANNs are used in parallel each driven by EEG traces which have been corrupted by independent noise sources. The classification process is thus repeated 10 times with the output class being decided by a majority vote (in the case of a tie the output is put into the high workload state). The use of artificial noise thus allows multiple attempts of the classification for each individual EEG epoch, which would normally only be possible once.
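A minimal Python sketch of this parallel *testing with noise* arrangement is given below. The ten votes, the independent noise streams and the tie-break toward the high workload state follow the description above. The callables `features_fn` and `classifier` are hypothetical stand-ins for the feature extraction and the trained ANN, and the class encoding is an assumption.

```python
import numpy as np

HIGH, LOW = 1, 0  # assumed encoding of the two workload classes

def majority_vote(votes):
    """Majority decision over the votes; a tie is assigned to the high workload state."""
    highs = sum(1 for v in votes if v == HIGH)
    lows = len(votes) - highs
    return HIGH if highs >= lows else LOW

def classify_with_noise(epoch, features_fn, classifier, noise_rms, rng, n_copies=10):
    """Classify one epoch n_copies times, each with an independent noise realization.

    features_fn and classifier are hypothetical callables supplied by the caller;
    the same trained classifier is reused for every noisy copy.
    """
    votes = []
    for _ in range(n_copies):
        noisy = epoch + rng.normal(0.0, noise_rms, size=epoch.shape)
        votes.append(classifier(features_fn(noisy)))
    return majority_vote(votes)
```

Because the classifier itself is unchanged, setting `noise_rms` to zero reduces the array to ten identical votes and so recovers the no-noise decision.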

This novel *testing with noise* is employed here to assess the cross session performance. *Training with noise* is a common technique used to increase the accuracy of Artificial Neural Networks by adding small levels of noise to the training data before training the network (Duda et al., 2001). (The aim is to do this multiple times and make the available training data more variable and more representative of future unknown data.) However, this is not employed here. The ANNs are created and trained as detailed above for assessing temporal performance, using the raw recorded EEG signals to generate the input features. Artificial noise is only added in during the testing process. To demonstrate that the results are repeatable, multiple runs of this 10 ANN configuration using independently generated noise cases have been carried out.

# **3. RESULTS**

#### **3.1. AVERAGE AND TEMPORAL PERFORMANCE**

Across all subjects the average performance from the leave-one-out cross validation procedure is 73%. The per subject performances are given in **Table 1**. This is a satisfactory level showing that the networks can be used for determining the operator state, and this information potentially fed back in order to optimize operating procedures.

The *same session* temporal performance is shown in **Figure 4**. For illustrating the spread of results, **Figure 4** breaks down the performance per subject, and plots the mean performance as the time distance between the training and test data is increased. Vertical lines illustrate the distribution of results across the 15 EEG records from each subject (13 in subjects C and D). Combining all of these results together the overall average performance, where all the performances in **Figure 4** are averaged to produce a single value for each person and then averaged again across the 8 subjects, is 86%.

This is higher than the 73% from the cross-validation results because the ANNs have been trained on only a small amount of data and the ANNs generalize well to new data close in time to this, at the cost of worse performance as the time gap increases. From **Figure 4** it is apparent that even over the time span of seconds to minutes the performance is not constant, with noticeable variations present. In subject B the average classification performance gets better over time, but for all of the other subjects there is a decrease in the average performance as the training data becomes increasingly distant in time, with a mean correlation of −0.77.

**Table 1 | Classification performance when using the leave-one-out cross validation to assess average performance.**


The impact of extending the time gap to days is shown in the *cross session* results in **Figure 5** which illustrates how well the ANN generated in the very first EEG session can be directly re-used over a long time span. The overall average performance is 57%. To compare this to chance a re-sampling approach has been used where the state classification from the ANN is replaced with that from a random number generator. Random values are drawn from a uniform distribution with it being equally likely to mark an EEG epoch as high or low workload. This artificially generated classification output is then analyzed in exactly the same way as the true ANN output. Simulations over 1000 runs show that the mean performance of this random classifier is 50.0%, and the best chance performance is 52.1%. This indicates that the 57% performance of the ANN is above the chance level, although it is unlikely to be of practical use.
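The chance-level comparison can be illustrated with a simplified Python sketch. It replaces the classifier output with uniformly random high/low labels and scores them against the known labels over repeated runs. The per-session averaging applied to the real ANN output is not reproduced, and the function name and seed are illustrative.

```python
import numpy as np

def chance_performance(truth, n_runs=1000, seed=0):
    """Mean and best percentage correct of a random classifier over n_runs runs."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth)
    # Each epoch is marked high (1) or low (0) with equal probability.
    guesses = rng.integers(0, 2, size=(n_runs, truth.size))
    scores = 100.0 * (guesses == truth).mean(axis=1)
    return scores.mean(), scores.max()
```

The mean settles close to 50%, while the best run depends on the number of test epochs: the more epochs that are scored, the closer the best chance result sits to 50%.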

Nevertheless, within this 57% wide per-session performance variances are seen. Many of the records achieve very good classification performances above 70%, even many days after the training session. Similarly, however, many perform at the chance level, and indeed some perform substantially worse than chance such that better classification would be actually obtained by inverting the output of the classification process. From this, it is clear that in some cases it is possible to directly re-use the workload classifier across multiple days, but this must be coupled with a method for assessing whether good classification performance is likely to be obtained.

#### **3.2. NOISE-ENHANCED PERFORMANCE**

**Figure 5** also shows the *cross session* performance of the ANN array when used with 10 *µ*Vrms of artificially generated noise. This result is noise-robust: in the 10 *µ*Vrms case the average performance is decreased only to 56%, while with 5 *µ*Vrms of added noise it remained at 57%.

To put this noise level in context, **Figure 6** shows the correlation coefficients between the 12.5 h raw recorded EEG trace and noise corrupted versions of it, as the coefficient is calculated over different time spans. Coefficients in excess of 0.9 are readily achieved, even in the presence of up to 15 *µ*Vrms noise. Partly this is because the underlying correlation present is not accurately estimated when very short sections of data are analyzed. There is a consistent tendency for the median correlation to be overestimated at the cost of much larger variances.

Given this, the array of ANNs was tested using noise levels of 0, 5, 10, and 20 *µ*Vrms. The resulting temporal performance is summarized in **Figure 7** which demonstrates the average classification performance of the *cross session* ANNs when all of the possible training/test configurations are used to investigate each time gap. In the no noise case this finds a very similar performance degradation to that in Christensen et al. (2012, their Figure 6). The achieved performance stabilizes over the time frame of hours to 56%, substantially below the starting performance level (86%), with the temporally independent cross validation results (73%) being between the two.

Christensen et al. (2012) demonstrated that some of this performance loss could be recovered by re-training the network with a small amount of known workload state data from each new EEG session, but this is undesirable due to the time and effort required. To avoid this process **Figure 7** also tests the hypothesis of using noise-enhanced processing as a potential mitigation approach which does not require network re-training. A single noise case run of the parallel ANN configuration is presented in **Figure 7**. It is apparent that the 5 and 10 *µ*Vrms cases both show small improvements in classification performance over the time span of minutes to hours. Further, all of the noise levels show a reduction in the mean deviation of the performances, showing that the performance is less variable if additional noise is introduced. However, neither of these effects is substantial: the performance improvement in **Figure 7** is approximately 0.3%, and the improvement in the results spread is less than 1% in most cases.

A repeated measures ANOVA, with null hypothesis that there is no difference between the mean performances in the noise and no noise cases, does not reject the null hypothesis (*p* = 0.13 for the 5 *µ*Vrms, minutes case). This indicates that the mean performance change is not significant. Nevertheless, to demonstrate that the noise-enhancement in **Figure 7** (which used a single run of the parallel ANN) is repeatable, 10 independent runs of the parallel ANN configuration have been performed with different input noise signals generated for each run. The performance values in the 5 *µ*Vrms of added noise case, when the training and test EEG data sections are minutes apart, are given in **Table 2**. This shows that no performance decreases are present. Instead there is a clear and consistent effect of improved performance with the small amount of noise added. Modeling the probability of the noise-enhanced algorithm being better than the no noise one as a Binomial distribution *B*(10, 0.5) (that is, there is a 50% chance of the noise added case outperforming the no noise case) the probability of all 10 noise added runs having better performance than the no noise one is less than 0.001. Assuming the noise added case is always better, *B*(10, 1), for the 10 runs performed the 95% confidence interval extends down to *p* = 0.69, suggesting as a lower bound that the performance is enhanced in approximately 70% of noise cases.
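The two Binomial figures quoted above can be checked with a few lines of Python. The interval method is not named in the text; the sketch assumes an exact (Clopper-Pearson) two-sided interval, whose lower limit for *n* successes in *n* trials is (α/2)<sup>1/*n*</sup>, and this reproduces the quoted value of 0.69. The function names are illustrative.

```python
def prob_all_better(n_runs=10, p=0.5):
    """Probability that all n_runs independent runs favour the noise-added case."""
    return p ** n_runs

def lower_bound_all_successes(n_runs=10, confidence=0.95):
    """Exact two-sided lower confidence limit when every run is a success.

    Assumption: a Clopper-Pearson interval, which for n successes in n trials
    has the closed-form lower limit (alpha / 2) ** (1 / n).
    """
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n_runs)
```

For 10 runs, `prob_all_better()` evaluates to 0.5<sup>10</sup> ≈ 0.00098, which is below 0.001, and `lower_bound_all_successes()` evaluates to approximately 0.69.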

# **4. DISCUSSION**

This paper has presented a passive BCI based workload classification system investigating two of the most important factors for practical out-of-the-lab systems: the performance variation over time; and the noise robustness of the classification process. In line with previous work, very good classification performance is achieved using little training data when the time gap between the training data and testing data is small.

#### **4.1. AVERAGE AND TEMPORAL PERFORMANCE**

The cross validation ANNs, which used all of the possible training data regardless of when it was recorded compared to the test data, achieved an average performance level of 73%. This procedure is a standard approach for obtaining the best generalization performance when applied to new EEG data (maximizing condition 1 from the introduction). However, it uses large amounts of training data from many different recordings to ensure the training data is as representative as possible. This is because when the training features are all temporally close together there is the potential for specific features, such as eye blinks or changes in muscle tone, to be present in the training sections which are not present in subsequent EEG recording sessions. This affects how Independent, Identically Distributed (IID) the feature vectors are, and places a lot of weight on a single training session to be representative in terms of equipment set up, user familiarity and user neurophysiology. Changes in performance in subsequent test sessions could then be due to a change in the mental status, or due to changes in the feature IID distributions.

It is for this reason that the *same session* tests, using ANNs trained using just 250 s of data (maximizing condition 2 from the introduction), achieved better performance (86%), but only over the short time frame where the EEG, feature IID distribution, and user mental state in the test period are very similar to that in the training period. As the time gap between the training data and the testing data increased there was a consistent decrease in the classification performance (**Figure 4**) as the used training EEG becomes less representative of the current testing EEG. This decrease in performance is apparent even over the relatively small time scale of minutes. Over larger time scales, substantial decreases in classification performance are observed with, as would be expected, the performance becoming worse than the 73% achieved when using more temporally spread training data. The average performance leveled off at approximately 57% for prolonged multi-day testing (**Figure 7**) with this degradation occurring over the time frame of hours. This performance level is above chance, but is unlikely to be meaningful for practical use. Moreover, a substantial variance in day-to-day performance was observed (**Figure 5**). On some days the existing neural network could be directly re-applied without further training of the network and very acceptable performances obtained. However, on many days the network achieved poor classification and would not be re-usable.

Only one point in the trade-off between the amount/spread of training data used and the amount/speed of the performance drop off has been investigated in this paper. Classically this trade-off is altered by using more training data: either in advance as in the cross validation approach; or by periodically re-training the network (periodically putting the user in a known workload state to generate new training data). Christensen et al. (2012) estimated that adding in 2.5 min of known workload state data per class was sufficient for re-training. However, generating this data requires time and effort, which places an emphasis on maximizing the time before re-training is required (maximizing condition 3 from the introduction).

#### **4.2. NOISE-ENHANCED PROCESSING**

Noise-enhanced signal processing is proposed here as an algorithmic method for recovering some of the ANN performance when there is a large time gap between the training and test data sections, and so increasing the time before re-training is required. It has the benefits of not requiring any additional training data to be collected and being transparent to the end user.

*Stochastic resonance* is a well known but counter-intuitive effect where the performance of a non-linear system actually increases as more noise is present in the system (McDonnell and Abbott, 2009; McDonnell and Ward, 2011). Noise-resonance is commonly seen in biological systems, including neurons and the brain itself (Wiesenfeld and Moss, 1995; Gluckman, 1996; McDonnell and Abbott, 2009) and can be used in signal detection (Kay, 2000; Chen et al., 2007). For example the technique has been used for improving the performance of algorithms for detecting microcalcifications (a key early sign of cancer) in breast mammograms (Peng et al., 2009), in radar target classification (Jouny, 2010), and is widely used in chromatography (Zhang et al., 2014). It has recently been applied to detecting transient signals in the EEG, both evoked and natural (Casson and Rodriguez-Villegas, 2011; Sampanna and Mitaim, 2013). Given the natural associations between noise and the EEG it is a very relevant technique to attempt to utilize, although it has not previously been applied to free-running EEG, suitable for passive BCI applications.

**time between the training and testing data set.** This is evaluated using the array of Artificial Neural Networks and injecting different levels of artificially generated noise into the input EEG signals.

By using an array of ANNs to perform *testing with noise*, as a complement to the more common *training with noise*, **Figure 5** shows that the workload classification process over the time span of days is robust to 5 and 10 *µ*Vrms of noise being added into the EEG traces. **Figure 7** and **Table 2** then demonstrate small stochastic resonance effects whereby the classification performance over the time span of minutes was consistently enhanced by the presence of small amounts of noise. To this end, the application of noise-enhancement has been successful, and this paper is the first demonstration of these effects in free-running EEG. However, while consistent, the achieved improvements are less than 1% and far too small to meaningfully eliminate any re-training required to get good multi-day use of a workload classifier trained with 250 s of data. It is clear that for free-running EEG (unlike evoked and transient EEG, Casson and Rodriguez-Villegas, 2011; Sampanna and Mitaim, 2013) the potential stochastic resonance effects are either very small or are not yet being exploited optimally.

**Table 2 | Classification performance when training and test data are taken from EEG sessions performed on the same day, minutes apart. Results are for 10 independent runs with 5** *µ***Vrms of added white Gaussian noise.**


Only one arrangement for introducing noise has been investigated here, but the process introduces many new degrees of freedom in terms of how much noise is added, its spectral composition and the parallel processing and multiple-runs options that it enables. Noise-enhancement and stochastic resonance effects are only just starting to be exploited in BCI applications, and if further established these new degrees of freedom are potentially highly interesting for re-visiting in other BCI problems where they can be exploited to improve performance.

# **5. CONCLUSIONS**

This paper has investigated the temporal performance of Artificial Neural Networks for operator workload classification. Networks trained using 150 min of EEG data in a leave-one-out cross validation procedure obtained an average classification performance of 73% (**Table 1**). In contrast networks trained on only 250 s of EEG data, and so being much quicker to set up, achieved 86% average performance (**Figure 4**) over a short time frame when the test EEG/user state was very similar to the training data. However, these networks do not generalize well (due to changes in the IID distribution of the features, changes in user state, and similar) which leads to a drop off in classification performance as the time gap between the training and testing data increases (**Figure 7**). This shows that short training periods can be used with the ANN classifiers, but the classifiers will only work for a short amount of time.

To overcome this, noise-enhanced processing was explored as an algorithmic technique for recovering some of the lost performance and so increasing the amount of time for which the classifier can work. Noise-enhanced processing was chosen because it is potentially transparent to the end user and has interesting links to the design of EEG units/electrodes. The results (**Table 2**) show that noise-enhanced processing has an effect: consistently better performances are obtained, although the current improvements are very small. Nevertheless, the consistent improvement is an interesting result and the first evidence that stochastic resonance effects could be exploited in free-running EEG and passive BCI applications.

# **FUNDING**

This work was carried out while the author was at Imperial College London and has been supported by the Junior Research Fellowship of Imperial College London and the Imperial College Open Access fund.

# **REFERENCES**


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 14 February 2014; accepted: 29 October 2014; published online: 01 December 2014.*

*Citation: Casson AJ (2014) Artificial Neural Network classification of operator workload with an assessment of time variation and noise-enhancement to increase performance. Front. Neurosci. 8:372. doi: 10.3389/fnins.2014.00372*

*This article was submitted to Neuroprosthetics, a section of the journal Frontiers in Neuroscience.*

*Copyright © 2014 Casson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
