# METHODOLOGICAL QUALITY OF INTERVENTIONS IN PSYCHOLOGY

EDITED BY: Salvador Chacón-Moscoso, Susana Sanduvete-Chaves and Jason C. Immekus

PUBLISHED IN: Frontiers in Psychology

#### *Frontiers Copyright Statement*

*© Copyright 2007-2017 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88945-249-1 DOI 10.3389/978-2-88945-249-1

## About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

## Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

## Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

## What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# **METHODOLOGICAL QUALITY OF INTERVENTIONS IN PSYCHOLOGY**

Topic Editors:

**Salvador Chacón-Moscoso,** Universidad de Sevilla, Spain and Universidad Autónoma de Chile, Chile

**Susana Sanduvete-Chaves,** Universidad de Sevilla, Spain

**Jason C. Immekus,** University of Louisville, United States

Glacier Perito Moreno (Argentina). Photo by Salvador Chacón-Moscoso

Evaluations of intervention programs seek to present high-quality design, measures and data to assess their merit and worth. While evaluations differ in their purpose, theoretical framework and methodology, their collective aim is to obtain relevant and meaningful information to inform practice, research, and policy. As such, evaluation findings serve to build a body of knowledge on effective approaches to promote designated psychological outcomes, critical to an individual's overall health and well-being. However, as examined in this e-book, methodological weaknesses directly limit the potential of evaluations of intervention programs. As discussed by Chacón-Moscoso and Sanduvete-Chaves, methodological weaknesses can be attributed to how to define and measure methodological quality and the contextual dependency of instruments designed to measure this quality.

In response, this e-book provides a collection of studies on methodological approaches to promote the quality of psychological interventions. Specifically, 10 original works published in the Research Topic Methodological Quality of Interventions in Psychology are included. The papers are organized into two chapters. Chapter 1 includes studies on context-independent methodological approaches to enhance the quality of psychological interventions. Chapter 2 presents original work in different areas (health, education, sport, and social welfare) where methodological quality has been better assessed. Collectively, the papers in this e-book serve to expand the awareness of practitioners and researchers interested in psychological interventions of the critical role of methodological quality in this work.

This research was funded by the projects 1150096 (Chilean National Fund of Scientific and Technological Development, FONDECYT); and PSI2015-71947-REDT (Spain's Ministry of Economy and Competitiveness).

**Citation:** Chacón-Moscoso, S., Sanduvete-Chaves, S., Immekus, J. C., eds. (2017). Methodological Quality of Interventions in Psychology. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-249-1

# Table of Contents

*Editorial: Methodological Quality of Interventions in Psychology*

Salvador Chacón-Moscoso and Susana Sanduvete-Chaves

## **Chapter 1: Methodological approaches to enhance the quality of psychological intervention**

*The Development of a Checklist to Enhance Methodological Quality in Intervention Programs*

Salvador Chacón-Moscoso, Susana Sanduvete-Chaves and Milagrosa Sánchez-Martín

*A Simulation Study of Threats to Validity in Quasi-Experimental Designs: Interrelationship between Design, Measurement, and Analysis*

Fco. P. Holgado-Tello, Salvador Chacón-Moscoso, Susana Sanduvete-Chaves and José A. Pérez-Gil

*Analyzing Two-Phase Single-Case Data with Non-overlap and Mean Difference Indices: Illustration, Software Tools, and Alternatives*

Rumen Manolov, José L. Losada, Salvador Chacón-Moscoso and Susana Sanduvete-Chaves

## **Chapter 2: Methodological quality in specific areas**

## **Health**

*Evaluation of a Psychological Intervention for Patients with Chronic Pain in Primary Care*

Francisco J. Cano-García, María del Carmen González-Ortega, Susana Sanduvete-Chaves, Salvador Chacón-Moscoso and Roberto Moreno-Borrego

*Characterization of Vulnerable and Resilient Spanish Adolescents in Their Developmental Contexts*

Carmen Moreno, Irene García-Moya, Francisco Rivera and Pilar Ramos

*Animal Models of Maladaptive Traits: Disorders in Sensorimotor Gating and Attentional Quantifiable Responses as Possible Endophenotypes*

Juan P. Vargas, Estrella Díaz, Manuel Portavella and Juan C. López

## **Education**

*Reading Ability Development from Kindergarten to Junior Secondary: Latent Transition Analyses with Growth Mixture Modeling*

Yuan Liu, Hongyun Liu and Kit-tai Hau

## **Sport**

*Does Exercise Improve Cognitive Performance? A Conservative Message from Lord's Paradox*

Sicong Liu, Jean-Charles Lebeau and Gershon Tenenbaum

## **Social welfare**

*Incremental Validity and Informant Effect from a Multi-Method Perspective: Assessing Relations between Parental Acceptance and Children's Behavioral Problems*

Eva Izquierdo-Sotorrío, Francisco P. Holgado-Tello and Miguel Á. Carrasco

# Editorial: Methodological Quality of Interventions in Psychology

#### Salvador Chacón-Moscoso<sup>1,2</sup>\* and Susana Sanduvete-Chaves<sup>1</sup>

<sup>1</sup> HUM 649—Innovaciones Metodológicas en Evaluación de Programas, Psicología Experimental, Universidad de Sevilla, Sevilla, Spain, <sup>2</sup> Departamento de Psicología, Universidad Autónoma de Chile, Santiago, Chile

#### Keywords: methodological quality, interventions, psychology, design, measurement, analysis

**Editorial on the Research Topic**

#### **Methodological Quality of Interventions in Psychology**

The need to evaluate intervention programs rigorously in different areas of psychology (e.g., health, education, sports, or social welfare) is widespread. However, we find clear methodological weaknesses in professional practice when it comes to evaluating intervention programs.

In many cases, fundamental details remain unknown, such as how an intervention is framed, how it was implemented, which aspects of it are responsible for the effects, and how effective it is relative to the alternatives. These gaps hinder the replication of interventions, the identification of program aspects that could be improved, and the integration of knowledge from a single intervention with other findings. All this prevents the growth of cumulative knowledge, the ability to use research to inform policy, and even the advancement of science.

According to previous research, much of this methodological weakness can be attributed to two factors: disagreement about how to conceptualize and measure methodological quality in evaluation, and the context dependency of existing instruments that claim to measure such quality.

The concept of quality is complex and multidimensional. It has been defined from different theoretical perspectives that variously emphasize individual concepts or sets of concepts dealing with, for example, internal, external, and construct validity. This theoretical diversity leads to different approaches to measuring research quality, such as scales (tools for which at least content, construct, and criterion validity evidence has been tested), checklists (tools that have not been tested through an extensive validation process), and general recommendations (taking the form of advice).

The second methodological weakness stems from the context dependency of the instruments used, which limits the generalizability of the information they generate. Indeed, many tools are used on just one occasion, so dependable knowledge about their psychometric properties, including reliability and validity, is rarely available.

In this Research Topic, some works present context-independent methodological approaches to enhance the quality of psychological interventions. Thus, Chacón-Moscoso et al. (a) systematize and summarize the available literature about methodological quality in primary studies to describe the state of the art in assessing the methodological quality of interventions; (b) propose a specific, parsimonious, context-independent, 12-item checklist to empirically define the methodological quality of primary studies based on a content validity study; and (c) present an inter-coder reliability study for the resulting 12 items.

Edited and reviewed by: Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy

> \*Correspondence: Salvador Chacón-Moscoso schacon@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 07 April 2017 Accepted: 26 May 2017 Published: 12 June 2017

#### Citation:

Chacón-Moscoso S and Sanduvete-Chaves S (2017) Editorial: Methodological Quality of Interventions in Psychology. Front. Psychol. 8:975. doi: 10.3389/fpsyg.2017.00975

Holgado-Tello et al. use Structural Equation Modeling (SEM) as a first approximation to operationalize the analytical implications of threats to validity in quasi-experimental designs. The study presents this empirical solution to the existing weak link between design features, measurement issues, and concrete impact estimation analyses. Finally, Manolov et al. make practitioners and applied researchers aware of the available appropriate options for extracting maximum information from the data. Concretely, they suggest that the evaluation of behavioral change should include visual and quantitative analyses, complementing the substantive criteria regarding the practical importance of the behavioral change.
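As a rough illustration of the quantitative side of such an analysis, the sketch below computes one non-overlap index and a simple mean difference for a two-phase (baseline and treatment) single-case data set. The choice of Nonoverlap of All Pairs (NAP), the function names, and the data are assumptions made here for illustration; the editorial does not specify which indices or tools Manolov et al. use.

```python
from statistics import mean


def nap(baseline, treatment):
    """Nonoverlap of All Pairs, assuming higher scores indicate improvement.

    Every baseline value is compared with every treatment value:
    an improvement counts 1, a tie counts 0.5, anything else counts 0.
    """
    if not baseline or not treatment:
        raise ValueError("both phases need at least one observation")
    score = 0.0
    for a in baseline:
        for b in treatment:
            if b > a:
                score += 1.0
            elif b == a:
                score += 0.5
    return score / (len(baseline) * len(treatment))


def mean_difference(baseline, treatment):
    """Treatment-phase mean minus baseline-phase mean."""
    return mean(treatment) - mean(baseline)


# Hypothetical observations, not data from any study in this e-book.
phase_a = [3, 4, 3, 5, 4]  # baseline phase
phase_b = [6, 7, 5, 8, 7]  # treatment phase

nap_value = nap(phase_a, phase_b)
mean_diff = mean_difference(phase_a, phase_b)
print(f"NAP = {nap_value:.2f}, mean difference = {mean_diff:.1f}")
```

Neither number replaces visual inspection of the series or a judgment about practical importance; they only summarize how far the two phases are separated.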

In a complementary way, this Research Topic also presents original work in different areas where methodological quality has been better assessed in order to estimate unbiased effect sizes and study possible moderator variables influencing the results obtained.

In the health area, Cano-García et al. formatively evaluate (before, during, and after the intervention) a multicomponent psychological intervention program for patients with chronic pain, implemented: (a) based on techniques with empirical evidence, but developed in Spain; (b) at a public primary care center; (c) among patients with limited financial resources and lower education; (d) by a novice psychologist; and (e) taking measures of all domains of painful experience using the instruments recommended by the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT).

Additionally, Moreno et al. use the adversity level associated with family functioning and the positive adaptation level, as measures of a global health score, to distinguish four groups within adolescents: maladaptive, resilient, competent, and vulnerable. Such groups are compared in a number of demographic, school context, peer context, lifestyles, psychological, and socioeconomic variables, which can facilitate or inhibit positive adaptation in each context. In this way, they offer very valuable information for optimizing design and assessment of interventions and policies aimed at fostering adolescent health.

Furthermore, Vargas et al. use animal models of mental illness as a useful tool to characterize indicators of possible cognitive dysfunctions in humans. In this way, the subjectivity of the classical psychological evaluation processes where the patient must calibrate the magnitude of his/her symptoms and therefore the severity of his/her disorder, is overcome.

In education, Liu et al. extend the measurement part of latent transition analysis to the growth mixture model to examine the reading ability development of children. They found that the new model fitted the data well. Results also revealed that most of the children stayed in the same ability group with few cross-level changes in their classes. Finally, after adding the environmental factors as predictors, analyses showed that children who received higher teacher ratings, had higher socioeconomic status, and were of above-average poverty status had a higher probability of transitioning into the higher ability group.

In the sports area, Liu et al. examine relevant randomized controlled trials (RCTs) published in the past 20 years (1996–2015) for methodological concerns arising from Lord's paradox. Their analysis revealed that RCTs supporting the positive effect of exercise on cognition are likely to include Type I error(s). This result can be attributed to the use of gain score analysis on pretest-posttest data as well as the presence of control group superiority over the exercise group on baseline cognitive measures. To improve the accuracy of causal inferences in this area, analysis of covariance on pretest-posttest data is recommended under the assumption of group equivalence.
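The contrast between the two analyses can be made concrete with a small numerical sketch. The data below are hypothetical and deliberately constructed so that there is no treatment effect, the control group starts higher at baseline, and the within-group slope of posttest on pretest is 0.5; none of these values come from Liu et al. Gain score analysis implicitly fixes that slope at 1, whereas a common-slope ANCOVA estimates it from the data.

```python
from statistics import mean


def gain_score_effect(pre_t, post_t, pre_c, post_c):
    """Between-group difference in mean gain (posttest minus pretest)."""
    return (mean(post_t) - mean(pre_t)) - (mean(post_c) - mean(pre_c))


def pooled_within_slope(groups):
    """Pooled within-group slope of posttest on pretest."""
    sxy = sxx = 0.0
    for pre, post in groups:
        m_pre, m_post = mean(pre), mean(post)
        sxy += sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post))
        sxx += sum((x - m_pre) ** 2 for x in pre)
    return sxy / sxx


def ancova_effect(pre_t, post_t, pre_c, post_c):
    """Group difference in posttest means adjusted for pretest.

    Equals the group coefficient of a least-squares model with a group
    dummy and the pretest as covariate (common slope for both groups).
    """
    slope = pooled_within_slope([(pre_t, post_t), (pre_c, post_c)])
    return (mean(post_t) - mean(post_c)) - slope * (mean(pre_t) - mean(pre_c))


# Hypothetical data: posttest = 0.5 * pretest + 10 + noise, no group effect.
noise = [1, -2, 2, -2, 1]        # sums to zero and is unrelated to pretest
pre_control = [4, 5, 6, 7, 8]    # control group starts higher
pre_exercise = [2, 3, 4, 5, 6]
post_control = [0.5 * x + 10 + e for x, e in zip(pre_control, noise)]
post_exercise = [0.5 * x + 10 + e for x, e in zip(pre_exercise, noise)]

gain = gain_score_effect(pre_exercise, post_exercise, pre_control, post_control)
ancova = ancova_effect(pre_exercise, post_exercise, pre_control, post_control)
print(f"gain score effect = {gain:.2f}, ANCOVA effect = {ancova:.2f}")
```

In this constructed case the gain score analysis attributes a positive effect to exercise, while the covariance-adjusted estimate is zero. The gap between the two equals (1 − slope) times the baseline difference, which is why control group superiority at baseline favors the exercise group under gain scores.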

Finally, referring to social welfare, Izquierdo-Sotorrío et al. explore the informant effect and incremental validity to examine the relationships between perceived parental acceptance and children's behavioral problems (externalizing and internalizing) from a multi-informant perspective.

## AUTHOR CONTRIBUTIONS

The two authors contributed to documenting, designing, drafting, and writing the manuscript, and revised it for important theoretical and intellectual content. Additionally, both authors provided final approval of the version to be published and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

## FUNDING

This research was funded by the projects 1150096 (Chilean National Fund of Scientific and Technological Development, FONDECYT); and PSI2015-71947-REDT (Spain's Ministry of Economy and Competitiveness).

## ACKNOWLEDGMENTS

The editors greatly appreciate the contributions received from the authors in this research topic.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Chacón-Moscoso and Sanduvete-Chaves. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Development of a Checklist to Enhance Methodological Quality in Intervention Programs

Salvador Chacón-Moscoso<sup>1,2</sup>\*, Susana Sanduvete-Chaves<sup>1</sup> and Milagrosa Sánchez-Martín<sup>3</sup>

<sup>1</sup> HUM-649 Innovaciones Metodológicas en Evaluación de Programas, Departamento de Psicología Experimental, Facultad de Psicología, Universidad de Sevilla, Sevilla, Spain, <sup>2</sup> Universidad Autónoma de Chile, Santiago de Chile, Chile, <sup>3</sup> Department of Psychology, Universidad Loyola Andalucia, Sevilla, Spain

The methodological quality of primary studies is an important issue when performing meta-analyses or systematic reviews. Nevertheless, there are no clear criteria for how methodological quality should be analyzed. Controversies emerge when considering the various theoretical and empirical definitions, especially in relation to three interrelated problems: the lack of representativeness, utility, and feasibility. In this article, we (a) systematize and summarize the available literature about methodological quality in primary studies; (b) propose a specific, parsimonious, 12-item checklist to empirically define the methodological quality of primary studies based on a content validity study; and (c) present an inter-coder reliability study for the resulting 12 items. This paper provides a precise and rigorous description of the development of this checklist, highlighting the clearly specified criteria for the inclusion of items and a substantial inter-coder agreement in the different items. Rather than simply proposing another checklist, however, it then argues that the list constitutes an assessment tool with respect to the representativeness, utility, and feasibility of the most frequent methodological quality items in the literature, one that provides practitioners and researchers with clear criteria for choosing items that may be adequate to their needs. We propose individual methodological features as indicators of quality, arguing that these need to be taken into account when designing, implementing, or evaluating an intervention program. This enhances the methodological quality of intervention programs and fosters the cumulative knowledge based on meta-analyses of these interventions. Future development of the checklist is discussed.

Keywords: checklist, methodological quality, content validity, inter-coder reliability, primary studies

## INTRODUCTION

#### Edited by:

Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy

#### Reviewed by:

Michael Bosnjak, Free University of Bozen-Bolzano, Italy

Vassilis Barkoukis, Aristotle University of Thessaloniki, Greece

> \*Correspondence: Salvador Chacón-Moscoso schacon@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 08 October 2015 Accepted: 02 November 2016 Published: 18 November 2016

#### Citation:

Chacón-Moscoso S, Sanduvete-Chaves S and Sánchez-Martín M (2016) The Development of a Checklist to Enhance Methodological Quality in Intervention Programs. Front. Psychol. 7:1811. doi: 10.3389/fpsyg.2016.01811

Meta-analyses and systematic reviews aim to summarize the literature and generalize the results from a series of different studies about a given area of interest (Cheung, 2015). To avoid biased or erroneous conclusions, this requires clear criteria regarding the methodological quality of the primary studies and how to combine or analyze studies of different methodological quality (Jüni et al., 2001). Although there is a general consensus about this need (Moher et al., 1996; Altman et al., 2001), a number of controversies arise when studying methodological quality in practice. For example, is it possible to give a one-dimensional answer to what is probably a multidimensional problem? Do we have clear criteria for deciding which specific and differently weighted methodological quality items should be considered? Which criteria should be used to decide between methodological quality indexes based on scores obtained from just one item or from a global assessment of several weighted items? Is it worthwhile trying to study a general construct that might not be equally applicable to all the contexts in which it might be used?

Despite this complexity, the extensive literature on these issues is testament to the importance of considering the methodological quality of primary studies. The present paper reviews the work in this area until July 2015. We begin by summarizing the relevant literature and then introduce the main problems derived from the state of the art.

## Theoretical and Empirical Definition of Methodological Quality

The concept of methodological quality is complex and multidimensional. It has been defined theoretically from different perspectives, such as (a) internal validity (Moher et al., 1996); (b) external validity (Rubinstein et al., 2007); (c) both internal and external validity (Jüni et al., 2001); (d) internal, external, statistical, and construct validity (Valentine and Cooper, 2008); (e) precision of the study report (Moher et al., 1998; Altman et al., 2001; Efficace et al., 2006; Hopewell et al., 2006; Rutjes et al., 2006; Cornelius et al., 2009; Li et al., 2009); (f) appropriate statistical analysis (Minelli et al., 2007); (g) ethical implications (Jüni et al., 1999); (h) relevance for the intervention area (Sargeant et al., 2006; Jefferson et al., 2009; Jiménez-Requena et al., 2009); or (i) publication status (Moher et al., 2009).

This theoretical diversity of the concept of methodological quality leads to different approaches to measuring it empirically. The main approaches described in the literature are:


• Scales. Tools for which at least content, construct, and criterion validity evidence has been tested.

• Checklists. Tools that have not been tested through an extensive validation process.

• General recommendations. These take the form of advice, including general aspects to consider when assessing methodological quality. They may sometimes describe just a few examples of possible items, without specifying a whole list of proposed items. In sum, recommendations refer to those approaches that do not fulfill the criteria required by the previous two categories (Ford and Moayyedi, 2009; Linde, 2009; Wilson, 2009).

At this point, it is worth noting the difference between quality in primary studies and quality of the report of primary studies (Leonardi, 2006). It is very important to study the quality of the report of primary studies because the study of quality in primary studies is mostly based on reports given by authors. Indeed, this is usually the only source of information about primary studies (Altman et al., 2001; Grimshaw et al., 2006; Cornelius et al., 2009). Nevertheless, we base our study on the quality of primary studies (instead of the report) to (a) give researchers guidelines to check the methodological quality of studies included in a meta-analysis, to support judgments about possible risk of bias in the conclusions; (b) provide practitioners with a checklist to enhance methodological quality when designing, implementing, and evaluating their interventions; and (c) make explicit the criteria for why we included some concrete items and excluded others from an available extensive list. This information can be useful in case researchers or practitioners are interested in including different items from the extensive list based on their aims and specific contexts.

## Problems Derived from the Dispersion in the Definition of Methodological Quality

The abovementioned characteristics of the concept of methodological quality, that is, the diversity in its theoretical and empirical definition (Linde, 2009), imply three interrelated and specific problems:

**Lack of representativeness** (R), the extent to which the specific item represents the methodological quality domain to which it is assigned. There are no clear criteria for choosing the optimal tool to measure methodological quality. This occurs especially since it is common to use non-randomized studies in social sciences (Shadish et al., 2005). This is due to a shortage of instruments that (a) are rigorously developed and (b) have reliability and/or validity evidence with tested R (Crowe and Sheppard, 2011). Their use is based on criteria that have no empirical support (Valentine and Cooper, 2008). For example, some authors opt to use individual components (Field et al., 2014; Eken, 2015). Other authors apply scales that provide a global value, even when they are strongly criticized for the lack of a bias estimation (Crowe and Sheppard, 2011). In spite of this, many scales are available and used nowadays (Dechartres et al., 2011). As a consequence, different scales applied to the same group of studies may indicate different levels of methodological quality (Greenland and O'Rourke, 2001; Jüni et al., 2001). Furthermore, some tools might be labeled as scales but without providing information about their construction process (Taji et al., 2006; Jefferson et al., 2009).

**Lack of utility** (U), the extent to which the specific item is useful for assessing the methodological quality of the study with respect to the assigned domain. In practice, scales usually include many items susceptible to omission because they are not relevant or essential for measuring the construct. Therefore, they could be shortened (Jüni et al., 2001; Conn and Rantz, 2003).

**Lack of feasibility** (F), the extent to which data codification is viable because data are available and can be gathered. Tools to measure methodological quality are usually complex and their items lack operational specificity. As a consequence, they are hard to understand and require previous training for coders. Additionally, the information needed is in most cases unavailable (Classen et al., 2008; Valentine and Cooper, 2008).

## Objectives

To resolve the aforementioned problems when measuring methodological quality, the objectives of this paper are (a) to systematize and summarize the available literature about methodological quality in primary studies published until July 2015 (Study 1: systematic review); (b) to propose a specific, parsimonious checklist to empirically define the methodological quality of primary studies in meta-analyses and systematic reviews. This tool offers evidence of good R, U, and F based on expert judges (Study 2: content validity); and (c) to present evidence of adequate inter-coder reliability in the items that form the checklist (Study 3).
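To make the notion of inter-coder reliability in objective (c) concrete, the sketch below computes Cohen's kappa for two coders rating the same set of studies on a single checklist item. This is only an illustration under stated assumptions: this section does not say which agreement coefficient the authors used, and the function name, categories, and ratings are hypothetical.

```python
import math
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of studies")
    n = len(coder_a)
    # Proportion of studies on which the two coders gave the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both coders use one identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for one item across ten primary studies.
coder_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
coder_2 = ["yes", "yes", "no", "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")  # approximately 0.80 for these ratings
```

Here the coders agree on nine of ten studies (0.90), chance agreement is 0.50, and kappa is therefore about 0.80. In practice, such a coefficient would be computed separately for each item of the checklist.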

## Contributions of this Study Compared to Other Studies Available in the Literature

The most popular tools to measure methodological quality present some of these problems. For example, the Study Design and Implementation Assessment Device (DIAD) (Valentine and Cooper, 2008) was systematically developed. Nevertheless, it did not present reliability and validity evidence (weak R), and its application was complex (weak F).

Another example is the Cochrane Collaboration's tool for assessing risk of bias in randomized trials. It focuses on individual biases (Higgins et al., 2011). In this case, we did not find reliability and validity evidence (weak R). Furthermore, there was a lack of U in social sciences because it is only applicable to randomized control trials (Shadish et al., 2005). Finally, at least two of the items (incomplete outcome data and selective reporting) are difficult to assess (weak F).

The Physiotherapy Evidence Database quality scale for randomized control trials —the PEDro scale— (Sherrington et al., 2000) presents reliability (Maher et al., 2003) and validity (Macedo et al., 2010) evidence (good in R). A website<sup>1</sup> offers access to the tool and a training program for raters (good in F). Nevertheless, it lacks U for our proposal because it is an adequate tool only for randomized control trials and only in the context of physiotherapy.

<sup>1</sup>www.pedro.org.au

The checklist for the assessment of methodological quality presented by Downs and Black (1998) is good in U because it can be applied to randomized and non-randomized studies. Nevertheless, it partially presents weaknesses in R because, although it presents validity evidence, it attains poor reliability in a subscale and some specific items. Furthermore, practitioners who are not experts in methodology might experience some problems in its application (weak F).

The Newcastle–Ottawa Scale (NOS) for assessing the quality of non-randomized studies in a meta-analysis (Wells et al., 2009) presents good F: the tool and its manual are freely accessible through the Internet. Nevertheless, its R is medium because it presents intra-rater reliability and content and criterion validity but its construct validity has not been established yet. In addition, its U can be considered medium because it has been tested exclusively to be applied to non-randomized studies, but we do not know how it works for randomized studies.

There are quite well-developed tools that measure the quality of the report of primary studies, indicating the aspects to be made explicit when reporting a study, but without valuing the actions to improve the methodological quality of a study or intervention. Some of them are (Portell et al., 2015) (a) the Consolidated Standards of Reporting Trials (CONSORT) statement (Schulz et al., 2010) for randomized control trials; (b) the STrengthening the Reporting of OBservational Studies in Epidemiology (STROBE) statement (von Elm et al., 2007); (c) Guidelines for Reporting Momentary Studies (Stone and Shiffman, 2002) for intensive repeated measurements in naturalistic settings; (d) Guidelines for Qualitative Research Methodologies (Blignault and Ritchie, 2009); (e) Guidelines for Conducting and Reporting Mixed Research for Counselor Researchers (Leech and Onwuegbuzie, 2010); and (f) Guidelines for Reporting Evaluations Based on Observational Methodology (Portell et al., 2015) for low intervention designs. Our proposal is to measure the methodological quality of primary studies instead of the report of these studies. Consequently, our aim and the aim of the previously mentioned tools are clearly different. They both can be considered complementary because the methodological quality of a study cannot be valued when the aspects to evaluate are not reported.

Literature reviews about methodological quality have already been done (e.g., Donegan et al., 2010). Furthermore, tools to measure methodological quality with good results in inter-rater reliability and content validity already exist (e.g., Wells et al., 2009). This paper integrates both contributions: it exhaustively updates the literature reviews up to July 2015, providing a list of the most frequent quality items, and, based on the results, proposes a tool to enhance methodological quality with evidence of content validity (R, U, and F of items) and inter-rater reliability.

In sum, our proposed 12-item checklist addresses the limitations that the other proposals present totally or partially. First, it presents R, U, and F evidence for each of its items based on a systematic literature review and content validity study. Second, appropriate reliability results can be considered additional evidence of R and F. In that case, we can describe our items as operationally specified, easy to apply, and understandable. Third, additional U evidence of the tool is its applicability in different designs (randomized and nonrandomized) and different contexts (it can be applied in the design, intervention, and/or evaluation of any program). Fourth, additional F evidence is the transparency in procedure and results (presented objectively, thoughtfully, and in detail). We made explicit (a) the inclusion and exclusion criteria applied in each stage of the development of the tool; (b) the papers, tools, and items found in the literature; (c) the values obtained in the content validity study in R, U, and F for the most frequently used items to measure quality; and (d) the reliability coefficients. Finally, the proposed tool measures methodological quality instead of the quality of the report in methodological aspects.

## STUDY 1. SYSTEMATIC REVIEW TO SEARCH FOR METHODOLOGICAL QUALITY INDICATORS

## Method

### Inclusion and Exclusion Criteria

We searched for papers published up to July 2015. Four inclusion criteria were applied: (a) methodological quality in primary studies was measured, (b) the full text was available, (c) it was written in English or Spanish, and (d) the instrument used to measure methodological quality was not previously included (was original, not repeated).

### Information to Code

Tools to measure methodological quality in primary studies were identified. After that, they were assigned to the previously defined categories regarding the empirical definition of methodological quality: scales, checklists, and general recommendations.

Subsequently, the most frequently used items in the previously identified tools were compiled by two independent researchers. This item gathering was exhaustive but not necessarily mutually exclusive; that is, different items could refer to the same methodological quality content but define it with different degrees of detail/accuracy. Any redundancies in this regard would be removed in the content validity study (Study 2).

Finally, items were assigned to different dimensions and subdimensions based on a categorization of moderator variables in meta-analyses (Lipsey, 1994; Sánchez-Meca, 1997; Sánchez-Meca et al., 1998; Merrett et al., 2013): (a) substantive characteristics, pertinent to characterizing the phenomenon under study and referring to three aspects: subject characteristics (description of participants such as gender, age, or cultural status), the setting in which the intervention was implemented (e.g., geographical, cultural, temporal, or political context), and the nature of the intervention provided (e.g., modality, underlying theory, duration or number of sessions); (b) methodological or procedural aspects, referring to the manner in which the study was conducted (i.e., variations in the design, research procedures, quality of measures, and forms of data analysis); and (c) characteristics extrinsic to both the substantive phenomenon and the research methods. This includes characteristics of the researcher(s) (e.g., gender or affiliation), research circumstances (e.g., sponsorship), or reporting (e.g., form of publication or accuracy of the reporting). It has been reported that these variables are correlated with the magnitude of the effect in many meta-analyses (Lipsey, 1994).

#### Search Strategies

The search was carried out in 12 databases that were of interest due to their content. Specifically, these were Web of Science, Scopus, Springer, EBSCO Online, Medline, CINAHL, Econlit, MathSci Net, Current Contents, Humanities Index, ERIC, and PsycINFO.

The keywords were "methodological quality" AND "metaanalysis" AND "primary studies." Title, abstract, keywords, and full text were examined. In addition, the reference lists of studies found were checked to identify other studies of interest. This procedure was repeated until no further relevant studies were discovered.

#### Coding Procedures

Inter-coder reliability (Nimon et al., 2012; Stolarova et al., 2014) was studied. The degree of agreement between two researchers (two of the authors, CM and SC) was calculated using Cohen's κ coefficient. Any disagreements were resolved by consensus.
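As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's κ for two coders assigning nominal codes to the same set of items. The function name `cohens_kappa` and the example codes are hypothetical and are not taken from the study data; the original analysis is not reproduced here.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who rated the same items with nominal codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    n = len(coder_a)
    # Observed proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical classification of six tools by two coders (illustrative only).
coder_1 = ["checklist", "checklist", "scale", "recommendation", "checklist", "recommendation"]
coder_2 = ["checklist", "checklist", "scale", "checklist", "checklist", "recommendation"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

In this made-up example the coders agree on five of six tools, but part of that agreement is expected by chance, so κ is lower than the raw agreement proportion.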

## Results

**Figure 1** presents the flow chart based on the PRISMA statement (Moher et al., 2009). A total of 930 abstracts were initially screened. Considering full-text availability and exclusion criteria, the final sample comprised 548 full texts that referred to the measurement of methodological quality in primary studies, using different procedures (Supplementary Data 1). Four were scales, 425 checklists, and 119 sets of general recommendations (Supplementary Table S1). The interrater reliability gave a κ = 0.874 (p < 0.001), 95% CI [0.827, 0.921].

We gathered a list of the 43 most frequent items used to measure methodological quality. Supplementary Tables S2 and S3 list these items, along with the corresponding original references from Supplementary Data 1. The inter-rater reliability coefficient was κ = 0.924 (p < 0.001), 95% CI [0.918, 0.93]. This was considered an adequate level of agreement between the two researchers.

Finally, the 43 items identified were assigned to the previously defined dimensions and sub-dimensions according to their content (see Supplementary Table S4). Specifically, six items were assigned to extrinsic characteristics, 14 to substantive characteristics (five referred to the sample, three to the setting, and six to the intervention), and 23 to methodological characteristics. The degree of consensus across items assigned to different dimensions yielded a good agreement with a κ = 0.842 (p < 0.001), 95% CI [0.695, 0.989].

## STUDY 2. CONTENT VALIDITY STUDY

## Method

#### Sample

Thirty judges participated in the content validity study. They were experts in design, systematic reviews, quality measurement, program evaluation, and/or applied psychology (social, educational, developmental, or clinical). They were all members of the Methods Group of the Campbell Collaboration and/or European Association of Methodology. Specifically, they consisted of 12 women and 18 men, 20 from Europe and 10 from the USA. Their mean age was 42 years. They had an average of 14 years of experience on these issues.

#### Instruments

The 43 items previously obtained and structured by the dimensions were presented as a questionnaire (see Supplementary Table S4). Experts had to score each item by taking into account the three previously mentioned problems: R, U, and F (Chacón-Moscoso et al., 2001; Martínez-Arias et al., 2006). This was done using a three-point rating scale (Osterlind, 1998): −1 was the lowest, 0 the medium, and +1 the highest score. The experts could also offer suggestions (such as including another item not currently considered, modifying or eliminating existing items, or changing the dimension to which an item was assigned).

#### Procedure

#### **Tool distribution and gathering**

The questionnaire was sent by e-mail to 52 experts. After the third request, a total of 30 questionnaires were completed and returned. Anonymity was assured in all cases.

### **Data analysis**

The Osterlind index of congruence (1998) was used to quantify the consensus between experts in their judgments of each item and issue (Glück et al., 2015). The formula used was

$$I_{ik} = \frac{(N-1)\sum_{j=1}^{n} X_{ijk} + N\sum_{j=1}^{n} X_{ijk} - \sum_{j=1}^{n} X_{ijk}}{2(N-1)n}$$

where $N$ = number of dimensions; $X_{ijk}$ = score given by each expert to each item (between −1 and +1); and $n$ = number of experts.

The results could range from −1 to +1. A score of −1 meant that all the experts awarded the most negative rating to the item in question. A score of +1 indicated that they all considered that the item in question merited the highest rating.

#### **Inclusion criterion**

Items that obtained a score of 0.5 or more on at least two of the three issues studied (R, U, and F) were included as important indicators to take into account when studying methodological quality in primary studies (Osterlind, 1998).
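The index and the inclusion rule can be sketched as follows. This is only an illustration that implements the formula exactly as printed above; the function names, the value N = 3, and the ratings are assumptions made for the example. Because the three terms of the printed numerator share the same sum, the printed expression equals the mean rating across experts, which is consistent with the stated range of −1 to +1.

```python
def osterlind_index(ratings, n_dimensions):
    """Index of congruence, following the formula as printed in the text.

    ratings: one score per expert (-1, 0, or +1) for one item on one issue.
    n_dimensions: N in the formula (assumed value supplied by the caller).
    """
    if n_dimensions < 2:
        raise ValueError("N must be at least 2 so the denominator is not zero.")
    if not ratings or any(r not in (-1, 0, 1) for r in ratings):
        raise ValueError("Ratings must be a non-empty sequence of -1, 0, or +1.")
    n = len(ratings)
    total = sum(ratings)
    numerator = (n_dimensions - 1) * total + n_dimensions * total - total
    return numerator / (2 * (n_dimensions - 1) * n)


def meets_inclusion_criterion(index_r, index_u, index_f, threshold=0.5):
    """True when at least two of the three issues reach the threshold."""
    return sum(value >= threshold for value in (index_r, index_u, index_f)) >= 2


# Hypothetical ratings from 30 experts for one item on one issue.
example_ratings = [1] * 20 + [0] * 8 + [-1] * 2
example_index = osterlind_index(example_ratings, n_dimensions=3)  # N = 3 is an assumption
print(example_index)
print(meets_inclusion_criterion(example_index, 0.55, 0.20))
```

With these invented ratings the item reaches the threshold on two of the three issues, so it would be retained under the stated criterion.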

## Results

**Table 1** shows the Osterlind index obtained for each item on the three issues studied: R, U, and F. Fourteen methodological items fulfilled the inclusion criterion. A total of 18 items obtained scores equal to or higher than 0.5 on R, whereas 15 items obtained this score on U and 16 on F.

Item 22 was omitted because of its redundant content and suggestions by the experts (it shared redundant information with items 21 and 36). Furthermore, items 26 and 27 were combined into a single item. Consequently, the final proposed checklist contained 12 items focused on methodological characteristics. Definitions of items and their coding criteria can be found in the Appendix.

#### TABLE 1 | Osterlind indexes of representativeness (R), utility (U), and feasibility (F) obtained for the 43 items.


Items appear in abbreviated form; the whole version can be consulted in Supplemental Material 4. Scores of 0.5 or higher are printed in bold.

## STUDY 3. INTER-CODER RELIABILITY STUDY

## Method

### Sample

Four coders participated in the study. Two of them (C1 and C2) were coauthors of this study (SC and SM) and two others (C3 and C4) were not. Each coder had a high level of understanding of written English and received prior training on the coding task by an expert in the topic, also a coauthor of this article (CM).

### Instruments

The 12-item checklist resulting from the previous Studies 1 and 2 was applied. The Appendix presents the final version of the coding scheme after including the changes derived from the pilot study described in this Study 3.

Papers were found by searching 11 computerized databases to locate training programs: EBSCO Online, Medline, Serfile, CABHealth, CINAHL, PsycINFO, Econlit, ERIC, MathSci, Current Contents, and Humanities Index. Finally, we used SPSS 17.0 to calculate Cohen's κ coefficient.

### Procedure

First, we conducted a bibliographic search to collect articles published in the training program field. The issue was chosen by research interest. The keywords used were "evaluation," "training programs," and "work." From the resulting 1,399 published journal articles, we obtained 124 after discarding (a) the duplicates (n = 223); (b) those that were not written in English or Spanish (n = 46); (c) those for which the complete text was not available (n = 421); or (d) where the training program was not aimed at employees to improve their professional skills (n = 585). Twenty-five studies (20% of the total) were randomly selected to be used in the pilot study.
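The selection arithmetic reported above can be checked with a short sketch. The counts come from the text; the random seed, the use of sequential study identifiers, and the variable names are assumptions for illustration, since the source does not describe how the random selection was implemented.

```python
import random

retrieved = 1399
excluded = {
    "duplicates": 223,
    "not written in English or Spanish": 46,
    "complete text not available": 421,
    "not aimed at employees' professional skills": 585,
}

# Articles remaining after applying the four exclusion reasons.
eligible = retrieved - sum(excluded.values())

# Pilot sample: 20% of the eligible articles, rounded to a whole study.
pilot_size = round(0.20 * eligible)

# Hypothetical identifiers and seed, used only to make the draw repeatable.
rng = random.Random(2016)
study_ids = list(range(1, eligible + 1))
pilot_ids = sorted(rng.sample(study_ids, pilot_size))

print(eligible, pilot_size)
```

Sampling without replacement, as `random.Random.sample` does, ensures that no study appears twice in the pilot set.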

C1 and C2 were trained under the supervision of one of the authors of this article (CM), an expert on the topic. The three researchers revised the coding scheme to be sure that they understood each item in the same way (Bennett et al., 1991). CM answered the questions that C1 and C2 asked. Later, as a test, C1 and C2 jointly coded one study that was not included in this research. This task was useful to clarify some discrepancies between the coders about the items and their meaning and the way to locate the information in the papers. Then, independently, they applied the checklist to the 25 studies selected. Each study was coded in an average of 15 min.

To analyze the degree of agreement on each item, Cohen's κ (Cohen, 1960; Bechger et al., 2003; Engelhard, 2006; Nimon et al., 2012) was used for categorical items. For quantitative items (items 3–6), a correlation coefficient was calculated. When assumptions were accepted (normality Kolmogorov–Smirnov z with p > 0.05 and independence of errors Durbin–Watson d between 1.5 and 2.5), the Pearson correlation coefficient (r) was calculated; when at least one of the assumptions was violated, the Spearman correlation coefficient (ρ) was calculated.
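The decision rule for quantitative items can be sketched as below. This is an illustrative outline, not the SPSS procedure used in the study: the Kolmogorov–Smirnov p value is taken as an input rather than computed, the choice of residual series for the Durbin–Watson statistic is left to the caller (an assumption), and all function names and data are hypothetical.

```python
from math import sqrt


def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("Residuals must not all be zero.")
    return numerator / denominator


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation computed as Pearson's r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def choose_coefficient(ks_p_value, dw_d):
    """Pearson when both assumptions are accepted, otherwise Spearman."""
    normality_ok = ks_p_value > 0.05
    independence_ok = 1.5 < dw_d < 2.5
    return "pearson" if normality_ok and independence_ok else "spearman"


# Hypothetical values coded by two coders for one quantitative item.
coder_1 = [10, 12, 15, 20, 22]
coder_2 = [11, 12, 16, 19, 25]
choice = choose_coefficient(ks_p_value=0.20, dw_d=2.1)  # made-up test results
value = pearson_r(coder_1, coder_2) if choice == "pearson" else spearman_rho(coder_1, coder_2)
print(choice, round(value, 3))
```

Keeping the rule in one small function makes the thresholds explicit, so the same criterion is applied to every item and every pair of coders.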

This reliability study was replicated twice: (a) C1 and C2 applied the scale to 20 new studies (20% of the total, randomly chosen after excluding the 25 papers previously analyzed). After analyzing the results, the wording of some definitions and alternatives of the items that might have caused coding discrepancies were modified to achieve greater clarity and simplicity in the instrument; (b) C3 and C4 applied the scale to the same 20 studies. C3 and C4 received information about the research, its main characteristics, the topic it covered, the task to do, and guidelines to codify the studies. In both replications, reliability was analyzed using the same coefficients that were used in the pilot study. In addition, the reliability among the four coders in the replication phase was analyzed. For that, we calculated Cohen's κ for categorical items and Krippendorff's α coefficient for quantitative items 3–6 (Hayes and Krippendorff, 2007).
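For the agreement among several coders on quantitative items, a minimal sketch of Krippendorff's α is given below. It assumes the interval metric (squared differences) and no missing codes; the source does not state which metric was used, and the original computation relied on the procedure of Hayes and Krippendorff (2007), which is not reproduced here. The function name and the data are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric and no missing values.

    units: list of lists; each inner list holds the values that all coders
    assigned to one study for the same quantitative item.
    """
    units = [u for u in units if len(u) >= 2]  # only pairable values count
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("At least one unit with two or more codes is required.")
    # Observed disagreement: squared differences within each unit.
    d_observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences among all pooled values.
    d_expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("Alpha is undefined when all values are identical.")
    return 1 - d_observed / d_expected


# Hypothetical codes from four coders for three studies (illustrative only).
codes = [
    [12, 12, 13, 12],
    [30, 29, 30, 30],
    [5, 5, 5, 6],
]
alpha = krippendorff_alpha_interval(codes)
print(round(alpha, 3))
```

Values close to 1 indicate that disagreement within studies is small relative to the variation between studies.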

## Results

### Testing Assumptions for Quantitative Items 3–6

**Table 2** presents the results obtained on the normality (Kolmogorov–Smirnov z) and independence of errors (Durbin–Watson d) assumptions for the quantitative items 3–6.

Normality and independence of errors assumptions were accepted for item 4 in the pilot study and items 3 and 4 in the replication carried out by C3 and C4. In these cases, Pearson's r coefficient was calculated as inter-coder agreement value. For the rest of the situations (when at least one assumption was violated), Spearman's ρ coefficient was obtained.

### Inter-coder Reliability

**Table 3** shows the results obtained for each item individually. In the pilot study, we obtained a significant agreement value for seven items; only items 4 and 10 obtained an agreement value higher than 0.7; and, in general, the 95% CI amplitudes were wide, ranging from 0.376 in item 4, [0.994, −0.618] to 1.422 in item 5, [0.551, −0.871].

In the replication of the reliability study carried out by C1 and C2, we obtained a significant κ value for nine items. Four of them obtained an agreement value higher than 0.8, seven of them an agreement value higher than 0.7. The highest agreement value was 1 for item 5, Exclusions after randomization. The lowest agreement value was 0.5 for item 12, Statistical methods for imputing missing data. Compared to the results in the pilot study, the level of agreement improved substantially for most of the items except for items 4, 9, 10, and 12, where it fell slightly; 95% CIs were, in general, narrower than in the pilot study but still wide, ranging in amplitude from 0.045 (item 6, [0.994, −0.949]) to 1.168 (items 2 and 11, both [1.445, −0.277]).

In the second reliability study replication, performed by C3 and C4, the agreement value was significant for all the items. Ten items obtained an agreement value higher than 0.8. The lowest value was equal to 0.744, obtained for item 10 (Control techniques). Five items obtained the highest agreement value (1). Compared to the results in the replication study carried out by C1 and C2, the level of agreement was higher for C3 and C4 in all the items except for item 11, where it fell slightly, although it maintained significance and had an agreement value close to 0.8. The 95% CIs were in general narrower than in the pilot study but still wide on some occasions, ranging in amplitude from 0 (items 2, 7, 8, and 12, in all cases [1, 1]) to 0.998 (item 11, [1.269, −0.271]).

#### TABLE 2 | Testing assumptions for quantitative items.


Items appear in abbreviated form; the whole and final version (after including improvements derived from the pilot study) can be consulted in the Appendix. C1–C4 = coder 1–4, respectively. z = Kolmogorov–Smirnov z to study normality assumption (accepted when p > 0.05). d = Durbin–Watson d to study independence of errors assumption (accepted when 1.5 < d < 2.5). Results that imply an assumption violation are in boldface.

<sup>∗</sup>p < 0.05; ∗∗p < 0.01.

The results obtained in reliability across the four coders were positive, with significant values in all the items, ranging in agreement values between 0.73 and 0.931; whereas some 95% CIs remained too wide, ranging in amplitude from 0.248 (item 8, [0.854, −0.606]) to 1.15 (item 10, [1.342, −0.192]).

## DISCUSSION

In this paper, we propose a simple 12-item checklist that, when used, can contribute to enhancing the methodological quality of interventions. This checklist is formed by individual methodological features that serve as indicators of quality to be taken into account when designing, implementing, or evaluating an intervention. Thus, its use does not imply obtaining a single methodological quality measure by summing the evaluation of several indicators, which is a highly criticized approach due to the inconsistent results when measuring the same studies with different methodological quality scales (Greenland and O'Rourke, 2001).

It must be asked what this checklist adds to the state of the art. Why and how is our measurement tool any different from other proposed measures that are routinely used? The first advantage is its clear, careful, and explicit process of development. First, we carried out an extensive, updated review of all available papers referring to the measurement of methodological quality in primary studies. Second, we carried out a content validity study with expert judges. Thus, we obtained results about the congruence between checklist items with respect to their R, U, and F in relation to the dimensions to which they were assigned (Osterlind, 1998). Third, we carried out an inter-coder reliability pilot study and multiple replication studies. As a result, we obtained appropriate coefficients in all the items, comparing the degree of agreement both in pairs and across the four coders together.

In this sense, lack of R can be considered solved. In contrast to existing publications, we have clarified to the reader how and why the checklist was developed, setting up the criteria for the inclusion of items. In this regard, the appraisal of each item on the complete checklist can be consulted with respect to its R, U, and F, as well as in relation to the categorization of the moderator variables usually used in a meta-analysis, that is, substantive (about subjects, setting, and intervention), methodological, and extrinsic characteristics (Lipsey, 1994). The following information has also been made available as supplementary material: the complete list of 548 reviewed papers referring to the measurement of methodological quality in primary studies and published until July 2015 (Supplementary Data 1); the list of references classified according to different and specific approaches to the empirical definition of methodological quality (Supplementary Table S1); the 43 items chosen and the original references in which they were found (Supplementary Tables S2 and S3); and the content validity questionnaire given to experts (Supplementary Table S4).

Referring to the lack of U, some issues have been solved. The proposed 12-item checklist can be useful, not just for improving the reporting of studies. First, it can assess the methodological quality of studies that have already been carried out. It gives researchers guidelines regarding inclusion–exclusion criteria in a systematic review or meta-analysis. It also checks the methodological quality of included studies to support judgments about possible risk of bias in the conclusions. Additionally, the checklist items can be used as potential moderator variables in a meta-analysis (Conn and Rantz, 2003). Second, the checklist can enhance the methodological quality in ongoing interventions that are being planned, designed, or implemented. It is extensively useful because it can be applied to experimental and non-experimental studies (interventions with random assignment of participants to the different groups or without random assignment). This is a critical issue for practitioners and in practical systematic reviews and meta-analyses because the latter type of design is frequently used in the social sciences (Shadish et al., 2005; Mayer et al., 2014).

One advantage of focusing on methodological characteristics is that it enables the tool to be extrapolated and generalized to different areas of intervention rather than being linked to one specific context. It is therefore interesting to use a common methodological framework through which one can obtain and analyze differences and communalities both within and between different intervention contexts. Logically, conclusions obtained with the same checklist would be modulated, depending on the area of intervention.

In a parallel way, we made explicit the criteria by which we included some concrete items and excluded others. Thus, we provided practitioners and researchers with clear criteria for choosing items that may be adequate to their needs. As a consequence, some of the 43 items categorized in the extrinsic, substantive, and methodological characteristics (available in Supplementary Table S4), which were obtained from the search described in Study 1, can be selected in case researchers and practitioners are interested in including different characteristics based on their aims and specific contexts.

Referring to the lack of F, we also made advances due to the acceptable results yielded in the inter-coder reliability study (Study 3), that is, few discrepancies when different professionals coded the same studies, and because the average time needed to apply the checklist was 15 min per primary study. These facts suggest that the checklist is relatively easy to apply when the definitions of the 12 items and their coding criteria for the final proposed checklist are available (Appendix).

Although this is not particularly relevant for reliability studies, the fact that Study 3 was conducted in only one intervention area is another possible limitation. Nevertheless, we are certain that the results can be generalized to other areas. We applied previous versions of the final proposed checklist in a number of pilot studies, systematic reviews, and meta-analyses. The topics were varied: psychological interventions in general, for elderly people, and for children with attention deficit hyperactivity disorder (e.g., see Supplementary Table S5). In all these cases, results obtained in inter-coder reliability were adequate.

Some of the research is ongoing or being planned. We will carry out another inter-coder reliability study enlarging the sample size to improve the accuracy of the results found in Study 3. Furthermore, we will conduct pilot studies to analyze the psychometric properties of the 12 previously obtained items. Thus, for example, we will calculate their capacity for discrimination by using the mean discrimination index and item reliability according to classical test theory (Holgado-Tello et al., 2006). Finally, the inter-coder reliability obtained was adequate but could be improved. This is why we will constantly review the definition of the 12 items of the checklist based on comments obtained from different professionals who use this tool.

## CONCLUSION

There is no single approach for the issue of methodological quality, and this paper was not intended to give a definitive answer. However, we do offer a justified response to the question. For that, we summarized our continuous and collaborative research over the past 15 years, which began with our first pilot applications in Baltimore in 2002 (Methods Campbell Collaboration Meeting). Furthermore, we do not merely argue the case for our own 12-item approach but also encourage other possible answers by researchers and practitioners, based on the R, U, and F assessment of the 43 most used methodological quality items in a meta-analysis.

In sum, this paper describes the rigorous process of methodological quality index selection for meta-analyses and systematic reviews and for designing, implementing, and evaluating interventions. To achieve this, we carry out an updated review on an ongoing basis. Instead of partial reviews, with poorly specified criteria for the inclusion of items, we present a checklist that has been and is being reviewed periodically. This checklist is based on the literature, experts' opinion, applications, and feedback from related professional meetings, mainly from the Campbell Collaboration group (C2), the Society for Research Synthesis Methodology (SRSM), the European Association of Methodology (EAM) and the Spanish Association of Methodology in Behavioral Sciences (AEMCCO). The most recent comments on this work were received from the last editions of some of these meetings: the VI European Congress of Methodology in Utrecht, Netherlands (July 2014), and the XIV Congress of Methodology in Health and Social Sciences in Palma de Mallorca, Spain (July 2015).

#### TABLE 3 | Results of inter-coder reliability.

Finally, we would like to invite any interested readers who design, implement, and/or evaluate interventions to collaborate with this project, so that we can share comments or results regarding the application of the proposed checklist. We also invite collaborations from those who are able and willing to assess the methodological quality of primary studies in meta-analyses and systematic reviews.

## AUTHOR CONTRIBUTIONS

SC-M developed the initial idea and design of the work and performed the analysis. SS-C and MS-M performed the analyses and interpreted the data. SC-M and SS-C were in charge of drafting the manuscript. MS-M revised the manuscript critically for important intellectual content. All three authors (SC-M, SS-C, and MS-M) provided final approval of the version to be published and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work were appropriately investigated and resolved.

## FUNDING

This research was funded by the Projects with Reference PSI2011-29587 (Spanish Ministry of Science and Innovation); 1150096 (Chilean National Fund of Scientific and Technological Development, FONDECYT); and PSI2015-71947-REDT (Spanish Ministry of Economy and Competitiveness).

## ACKNOWLEDGMENTS

The authors greatly appreciate all the comments received from William R. Shadish (recently passed away), the reviewers, and the English language editor. We believe that the quality of this paper has been substantially enhanced as a result.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01811/full#supplementary-material

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Chacón-Moscoso, Sanduvete-Chaves and Sánchez-Martín. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Simulation Study of Threats to Validity in Quasi-Experimental Designs: Interrelationship between Design, Measurement, and Analysis

Fco. P. Holgado-Tello<sup>1</sup>, Salvador Chacón-Moscoso<sup>2,3</sup>\*, Susana Sanduvete-Chaves<sup>2</sup> and José A. Pérez-Gil<sup>2</sup>

<sup>1</sup> Metodología de las Ciencias del Comportamiento, Universidad Nacional de Educación a Distancia, Madrid, Spain, <sup>2</sup> HUM-649, Innoevalua, Psicología Experimental, Universidad de Sevilla, Sevilla, Spain, <sup>3</sup> Universidad Autónoma de Chile, Santiago, Chile

#### Edited by:

Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy

#### Reviewed by:

Timothy R. Brick, Pennsylvania State University, USA Adam R. Ferguson, University of California, San Francisco, USA

> \*Correspondence: Salvador Chacón-Moscoso schacon@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 13 October 2015 Accepted: 31 May 2016 Published: 16 June 2016

#### Citation:

Holgado-Tello FP, Chacón-Moscoso S, Sanduvete-Chaves S and Pérez-Gil JA (2016) A Simulation Study of Threats to Validity in Quasi-Experimental Designs: Interrelationship between Design, Measurement, and Analysis. Front. Psychol. 7:897. doi: 10.3389/fpsyg.2016.00897

The Campbellian tradition provides a conceptual framework to assess threats to validity. On the other hand, different models of causal analysis have been developed to control estimation biases in different research designs. However, the link between design features, measurement issues, and concrete impact estimation analyses is weak. In order to provide an empirical solution to this problem, we use Structural Equation Modeling (SEM) as a first approximation to operationalize the analytical implications of threats to validity in quasi-experimental designs. Based on the analogies established between the Classical Test Theory (CTT) and causal analysis, we describe an empirical study based on SEM in which range restriction and statistical power have been simulated in two different models: (1) A multistate model in the control condition (pretest); and (2) A single-trait-multistate model in the control condition (post-test), adding a new mediator latent exogenous (independent) variable that represents a threat to validity. Results show, empirically, how the differences between both the models could be partially or totally attributed to these threats. Therefore, SEM provides a useful tool to analyze the influence of potential threats to validity.

Keywords: threats to validity, quasi-experimental designs, Structural Equation Modeling, causal analysis, Classical Test Theory

## THREATS TO VALIDITY: THEORETICAL AND ANALYTICAL PERSPECTIVES

The unstable social and political conditions of most contexts to which evaluation programs are applied (Gorard and Cook, 2007) imply that, a priori, there are no standardized evaluation design structures (Chacón-Moscoso et al., 2013). Due to this fact and because random assignment of participants to different groups is not always possible (Colli et al., 2014), quasi-experimental designs are more commonly used in social sciences than experimental ones (Shadish et al., 2005). Quasi-experiments lack the control over extraneous variables that random allocation provides; therefore, it is extremely important that the evaluation process is conducted in a manner that provides reliable and valid results based on the analysis of the influence of potential threats to validity (Reichardt and Coleman, 1995).

There are conditions other than the intervention program itself that could be responsible for the outcomes. These conditions are called threats to validity which, unless controlled, limit the confidence of causal findings (Weiss, 1998).

This evaluation research context presents two main problems. On the one hand, as a conceptual-theoretical framework, the Campbellian tradition presents a series of threats to validity that can affect four different kinds of validity (Campbell, 1957; Campbell and Stanley, 1963; Cook and Campbell, 1979; Shadish et al., 2002): (a) statistical conclusion validity (García-Pérez, 2012) can be affected by low statistical power (Tressoldi and Giofré, 2015) and a restricted range (Vaci et al., 2014); (b) internal validity can be affected by selection, history, maturation, and regression; (c) construct validity can be affected by construct confounding, treatment-sensitive factorial structure, and inadequate explication of constructs; and (d) external validity can be affected by interaction of the causal relationship with units or outcomes. Although Campbell's approach provides a conceptual framework for evaluating the main threats to the four types of validity (Shadish et al., 2002), and some guidelines (design features) to enhance validity have been presented, there is no empirical, systematic approach to check and control the influence of threats to validity on treatment effect estimates in program evaluation practice (e.g., Stocké, 2007; Krause, 2009; Johnson et al., 2015).

On the other hand, from an analytical point of view, procedures have been developed to assess some construct validity threats, such as inadequate explication of constructs, confounding of construct operations, mono-operationalization, and mono-method bias. Some of these procedures include the multimethod-multitrait approach (Eid et al., 2008) and factor retention analysis, through the study of systematic patterns in the error covariances (Brown, 2015). These analytical proposals, although apparently useful, are weakened because they are not grounded in any theoretical framework.

In sum, the connection between design features, measurement issues, and concrete impact estimation analyses is weak. In order to provide an empirical solution to this problem, we use Structural Equation Modeling (SEM) as a first approximation to operationalize the analytical implications of threats to validity in quasi-experimental designs.

## STRUCTURAL EQUATION MODELING (SEM): AN INTEGRATED APPROACH

Steyer (2005) draws analogies between the Classical Test Theory (CTT), which represents the measurement point of view (Trafimow, 2014), and causal analysis and Rubin's Causal Model, which represent the design and analysis points of view and determine the concepts of statistical inference for causal effects in experiments and observational studies (West, 2011). Based on these analogies, we assume that SEM can be used to systematize the model assumptions, to test the likelihood of the model fit statistically, and to check empirically how threats to validity affect the data and how different threats to validity influence each other.

If we focus on the participant-level scores in each experimental condition, we can establish an analogy between causal analysis and CTT, so that the measurement, design, and analysis aspects are linked. The participants' expected value in causal analysis is similar to the true score defined by CTT; that is, the expected score obtained after an infinite number of independent administrations of a measurement, under some assumptions (Lord and Novick, 1968). Based on the concept of parallel tests, we can assume that, across a set of scores, the variance of the observed scores is the sum of the true-score variance and the error variance. If we consider an experimental or quasi-experimental design with two conditions (control and experimental), then we can expect two true scores for each unit, one for the control condition and another for the experimental condition (Steyer, 2005), and, therefore, two observed variances (one for the control condition and another for the experimental condition) and one covariance (control-experimental).
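The variance decomposition assumed by CTT can be checked numerically. The following is a minimal sketch using only the Python standard library; it assumes normally distributed true scores and errors, and the function name `simulate_ctt` and the illustrative standard deviations (1.0 and 0.5) are hypothetical choices that do not come from the original study.

```python
import random
import statistics


def simulate_ctt(n_units=20000, true_sd=1.0, error_sd=0.5, seed=1):
    """Simulate observed scores as true score plus independent error (CTT)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_units)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_units)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


true_scores, errors, observed = simulate_ctt()
var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_observed = statistics.pvariance(observed)

# The two quantities agree up to sampling error, because the sample
# covariance between true scores and errors is close to (not exactly) zero.
print(f"Var(observed)          = {var_observed:.3f}")
print(f"Var(true) + Var(error) = {var_true + var_error:.3f}")
# Reliability is the share of observed variance due to true scores.
print(f"Reliability estimate   = {var_true / var_observed:.3f}")
```

In a two-condition design, the same decomposition would apply separately to the control and experimental scores of each unit.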

Taking into account the number of groups, and the number of measurement occasions, this theoretical framework could be translated into any experimental or quasi-experimental design. For example, if we combine the measurement occasion [pre-test (f0) and post-test (f1)] and the expected value or true score of each group (control: X = 0 and experimental: X = 1), **Table 1** presents the variance/covariance matrix in a non-equivalent control group design.

Variances are on the diagonal, in boldface; e.g., S<sup>2</sup>[f0/X = 0] is the control group variance for the pre-test measurement and S<sup>2</sup>[f1/X = 1] is the experimental group variance for the post-test measurement. Covariances are off the diagonal; e.g., S[f0,f0/X = 1, X = 0] is the covariance between the control and experimental groups at pre-test, and S[f1,f0/X = 0] is the pre-/post-test covariance for the control group.

From the implied variance-covariance matrix, we can establish the model derivations for the non-equivalent control group design: (a) the control-experimental group covariance in the pre-test (S[f0,f0/X = 1, X = 0]) should be equal to the control and experimental variances in the pre-test (S<sup>2</sup>[f0/X = 0] and S<sup>2</sup>[f0/X = 1]), equal to the control variance in the post-test (S<sup>2</sup>[f1/X = 0]), and equal to the pre-/post-test control group covariance (S[f1,f0/X = 0]); and (b) the pre-/post-test experimental group covariance (S[f1,f0/X = 1]) should be equal to the control-experimental group covariance in the post-test (S[f1,f1/X = 0, X = 1]), and equal to the covariance between the control pre-test and the experimental post-test (S[f1,f0/X = 1, X = 0]).

These assumptions are shown in **Figure 1** over a non-equivalent control group design scheme.
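Written compactly, the two sets of derivations read as follows. This is only a restatement of the text in LaTeX, with the conditioning slash of the original notation typeset as a vertical bar; it adds no new constraint.

```latex
\begin{align}
\text{(a)}\quad
  & S[f_0, f_0 \mid X=1, X=0]
    = S^2[f_0 \mid X=0]
    = S^2[f_0 \mid X=1] \nonumber \\
  & \qquad = S^2[f_1 \mid X=0]
    = S[f_1, f_0 \mid X=0] \\
\text{(b)}\quad
  & S[f_1, f_0 \mid X=1]
    = S[f_1, f_1 \mid X=0, X=1]
    = S[f_1, f_0 \mid X=1, X=0]
\end{align}
```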

#### TABLE 1 | Implied variance/covariance matrix in a non-equivalent control group design.

Pre (f0), pre-treatment time point; Post (f1), post-treatment time point; Control (X0), Control group; Exptal. (X1), Experimental group; S<sup>2</sup>, variance (in boldface); and S, covariance.

The variance-covariance derivations can be operationalized via SEM by restricting models, including more latent variables, or testing the error terms, and can therefore be tested statistically. For example, the true scores (expected values or µ, the population means) of the control and experimental groups should not differ significantly in the pre-test (f0) and should differ significantly in the post-test (f1); in the control group, true scores should not differ significantly between the pre-test (f0) and the post-test (f1); and in the experimental group, these expected values should differ significantly between the pre-test (f0) and the post-test (f1). We can design these restrictions via SEM, whether working with one or more groups or measurement occasions (Bollen and Curran, 2006). If the above conditions are not satisfied, this may be due to validity threats that could be tested in an SEM framework. At this point, it is important to emphasize that this approach is only applicable when the intervention aims to change the level of the dependent variable/s. When the aim is to maintain that level, the logic would be different (the expected changes would be in the control group across the pre-test and the post-test, rather than in the experimental group).
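The expected pattern of mean differences can be illustrated on observed scores before moving to latent variables. The sketch below is not the SEM procedure described here: it assumes simulated data with a hypothetical treatment effect of 0.5 standard deviations, and it uses plain Welch and paired t statistics as a stand-in for the restricted models. All function names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two independent means."""
    mean_diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    se = math.sqrt(statistics.variance(sample_a) / len(sample_a)
                   + statistics.variance(sample_b) / len(sample_b))
    return mean_diff / se


def paired_t(pre, post):
    """Paired t statistic for the mean pre-to-post change within one group."""
    diffs = [b - a for a, b in zip(pre, post)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def simulate_group(n, effect, rng):
    """Pre/post scores; `effect` is a hypothetical shift in true score at post-test."""
    pre, post = [], []
    for _ in range(n):
        true_score = rng.gauss(0.0, 1.0)
        pre.append(true_score + rng.gauss(0.0, 0.5))
        post.append(true_score + effect + rng.gauss(0.0, 0.5))
    return pre, post


rng = random.Random(2016)
control_pre, control_post = simulate_group(500, effect=0.0, rng=rng)
exptal_pre, exptal_post = simulate_group(500, effect=0.5, rng=rng)

checks = {
    "pre-test, experimental vs. control": welch_t(exptal_pre, control_pre),
    "post-test, experimental vs. control": welch_t(exptal_post, control_post),
    "control, pre vs. post": paired_t(control_pre, control_post),
    "experimental, pre vs. post": paired_t(exptal_pre, exptal_post),
}
for label, t_value in checks.items():
    print(f"{label}: t = {t_value:.2f}")
```

Under these assumptions, the first and third statistics should stay close to zero and the second and fourth should be large; a sizeable pre-to-post statistic in the control group is the kind of pattern that would point to a threat to validity.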

Once we have established the theoretical assumptions, the next step is to translate a non-equivalent control group design into the SEM framework. As an example, we opt to use only one group (control group) in a simple design (pre-test and post-test) because the conditions are more easily simulated (a more complex design and model with control and experimental groups would require two pre-tests and two post-tests). In this sense, **Figure 2** presents a multistate model where four different endogenous latent variables are measured at the same time (in the pre-test).

Credible inferences are based on the assumption that all changes between the pre-test and post-test are caused by the treatment, and this assumption can only be true if we do not find systematic changes between the pre-test and post-test in the control group.

Then, we can suppose that X = 1 in the variance-covariance matrix is not an experimental group (i.e., a group that participated in a treatment), but a group affected by a threat to validity that can modify the data and generate systematic changes. In a parallel way, the latent variable T represented in **Figure 3** is not a treatment, but a threat to validity, so this figure would represent the control group at a concrete time point (the post-test), where an extraneous element, such as history, could be affecting the results in two of the four endogenous latent variables, i.e., η3 and η4. Let us suppose that, to measure the effectiveness of an intervention program to improve attitudes toward immigration in a developed country with an aging population, participants from an experimental group and a control group filled in a questionnaire at an early stage (pre-test). A year later, after the implementation of the intervention program, the post-test was completed. This questionnaire comprised four dimensions: public safety (η1), education (η2), economy (η3), and public health (η4). No significant differences between pre- and post-test measures were expected in the control group. However, a wave of young immigrants (T) occurred concurrently with the study and, according to research, promoted an increase in economic activity and an improvement in public health by increasing the number of taxpayers. Thus, in the control group, there was a significant improvement in attitudes toward immigration in economy (η3) and public health (η4), while attitudes in public safety (η1) and education (η2) did not vary significantly.

At this point, it is important to clarify that this is just a hypothetical situation; the model could have been defined with the possible influence of T over three endogenous latent variables instead of two, over only one, over η2 and η3 instead of over η3 and η4, and so on.

In this case, we expect the same results in all the variances and covariances presented in **Table 1**. As **Figures 2** and **3** represent, respectively, pre- and post-test in the control group, any systematic change found could be attributed to the influence of a threat to validity (T).
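One way to see why a model without T can fail where a model with T fits is to look at the covariance structure that a threat leaves behind. The following sketch assumes a simplified data-generating process in which four latent states share a common trait f0 and a threat T acts only on η3 and η4. The loadings (0.6 and 0.5), the residual standard deviation (0.6), and the function names are hypothetical and are not the values used in the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_states(n=5000, trait_loading=0.6, threat_loading=0.5, seed=3):
    """Four latent states driven by a common trait f0; T affects only eta3 and eta4."""
    rng = random.Random(seed)
    threat_paths = [0.0, 0.0, threat_loading, threat_loading]  # eta1..eta4
    states = [[] for _ in range(4)]
    for _ in range(n):
        f0 = rng.gauss(0.0, 1.0)      # expected outcome under the control condition
        threat = rng.gauss(0.0, 1.0)  # T, e.g., a history effect
        for j in range(4):
            residual = rng.gauss(0.0, 0.6)
            states[j].append(trait_loading * f0 + threat_paths[j] * threat + residual)
    return states


eta1, eta2, eta3, eta4 = simulate_states()
print(f"r(eta1, eta2) = {pearson(eta1, eta2):.2f}")  # shared trait only
print(f"r(eta3, eta4) = {pearson(eta3, eta4):.2f}")  # shared trait plus shared threat T
```

Under these assumptions, the two states affected by T are more strongly related to each other than the two unaffected states, which is excess covariance that a model omitting T has no parameter to absorb.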

## ADVANTAGES OF SEM OVER OTHER METHODS

Little research in psychology has aimed to detect empirically the influence of threats to validity in interventions based on a theoretical framework. In this regard, Crutzen et al. (2015) used meta-analysis to study the relationship between differential attrition and several moderator variables; nevertheless, they could not study the relationship between differential attrition and effect size owing to technical problems. Furthermore, Damen et al. (2015) carried out a longitudinal study and concluded that a possible attrition bias occurred in a percutaneous coronary intervention, as drop-outs and completers differed systematically on some sociodemographic, clinical, and psychological baseline characteristics; nevertheless, as drop-outs did not receive the complete intervention, the authors could not study the difference between both groups (drop-outs and completers) in the post-test results. Mixed-effects regression is useful to study the difference between the pre- and post-test across experimental and control groups. Nevertheless, this option is based on a purely analytical perspective and is restricted to directly observed variables, whereas SEM is not based on analysis alone, but on the integration of design, measurement, and analysis. Thus, SEM makes it possible to obtain concrete data about the relationship between latent variables and the observed variables used to measure them, together with the associated measurement error for each one (measurement model), and about the relationship between different latent variables (structural model), as shown in Duncan et al. (2006). Additionally, when the design includes two groups, the degree of equivalence between them can be defined depending on the restrictions imposed: we can assume equivalence between experimental and control groups in both the measurement and the structural model, or only in one of them. As a consequence of these differences, regression tends to obtain less sensitive results than SEM (Nusair and Hua, 2010).

Therefore, the SEM framework presents several advantages compared with other procedures. This approach includes a measurement-design-analysis point of view, so it is more complete than the traditional approaches based on a single aspect. Moreover, it allows researchers to (a) draw conclusions about the relationships among multiple latent variables (structural model) and between each latent variable and the observed variables that measure it (measurement model); (b) define the degree of equivalence between the experimental and the control group; (c) obtain inferences about the influence of threats to validity on the results; (d) generalize the approach to any threat to any kind of validity; and (e) study the concrete conditions under which different threats to validity could be influencing the results.

A similar methodology has already proved useful to study the influence of threats to validity in other fields; e.g., (a) in forest research, Ficko and Boncina (2014) operationalized the influence of response style bias and the robustness of statistical methods on the results using simulations and including latent variables in the models representing those threats to validity; and (b) in medical research, Mickenautsch et al. (2014) studied the inflation of effect size owing to selection bias using simulations. In the current study, we show the application and usefulness of simulations and the SEM framework in social sciences, specifically in psychology, to detect the influence of other threats to validity.

## OBJECTIVE

The objective of this study is to illustrate conceptual problems of threats to validity through causal analyses using SEM, under the framework of design. Concretely, based on the multistate and single-trait-multistate models, we carry out a simulation study where two threats to statistical conclusion validity are manipulated (restriction of range and low statistical power) in order to analyze the way in which a third threat to validity named T (unspecified, it could be potentially any of them) could be affecting the measures in the post-test of a non-equivalent control group in a quasi-experimental design.

## MATERIALS AND METHODS

Data were generated using two different models, in both cases considering only the control condition: (a) **Figure 2** represents the multistate model in the control condition (Model 1), where four latent endogenous (dependent) variables (η) are each measured through three observed endogenous variables (Y) at a concrete time point (pre-test); (b) **Figure 3** represents the single-trait-multistate model in the control condition (Model 2), where the same four latent endogenous variables are measured through the same three observed endogenous variables at another concrete time point (post-test) (Steyer, 2005; Dumenci and Windle, 2010; Pohl and Steyer, 2010). In this case, a new mediator latent exogenous (independent) variable that represents a threat to validity (T) was added in order to detect its possible influence on the model fit; f0 is another latent exogenous variable that represents the expected outcome under the control condition (Steyer, 2005). Both Models 1 and 2 assume that all effects are linear (Kline, 2012).

When the multistate model (Model 1, pre-test in control group; **Figure 2**, which does not include T) is rejected and the single-trait-multistate model (Model 2, post-test in control group; **Figure 3**, which includes T) is accepted, we can conclude that the T variable could be affecting the data in the post-test; thus, differences found between the pre-test and the post-test could be partially or totally attributed to threats to validity. Under these circumstances, further analysis would not provide valid inferences about the effectiveness of treatment. In that case, the T variable could be operationalized in an SEM (Ryu, 2014).

Additionally, two previously mentioned threats to statistical conclusion validity were manipulated in order to study their possible interaction with the threat to validity named T in **Figure 3**: (a) low statistical power implies obtaining a non-significant relationship between the treatment and the outcome because the experiment has insufficient power (power being the probability of finding an effect when the effect exists). This threat to validity was manipulated by varying the sample size, with five conditions: 100, 500, 750, 1000, and 5000 participants; and (b) restricted range implies that a reduced range on a variable usually weakens the relationship between this variable and another (Coenders and Saris, 1995; DiStefano, 2002; Holgado-Tello et al., 2010; Yang-Wallentin et al., 2010; Williams and Vogt, 2011; Bollen, 2014). This threat to validity was manipulated by varying the number of levels in the dependent observed variables (Y), with four conditions: 3, 5, or 7 discrete categories, or continuous variables.

For each latent endogenous variable, three observed variables were simulated. The factor loadings of the observed variables were the same in all factors (0.8). The simulated factor loadings were high in order to avoid doubts about the specification in the parameter estimation stage. Observed variables were generated according to a normal distribution N(0,1). Then, these answers either were categorized into 3, 5, or 7 discrete categories (i.e., Likert scales with different numbers of possible responses, to restrict the range of variation) or remained as continuous variables. The responses to all observed variables remained symmetrical in order to avoid the influence of skewness. To categorize the Likert scales, as stated by Bollen and Barb (1981), the continuum was divided into equal intervals from z = −3 to z = 3 in order to calculate the thresholds of the condition in which the response distribution to all items is symmetrical (skewness = 0). Finally, the sample size had five experimental values (100, 500, 750, 1000, and 5000).
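The categorization step can be sketched as follows. This is a minimal illustration in Python rather than the R/PRELIS code used in the study; the function names are hypothetical, and the error variance of 1 − 0.8² is an assumption made here so that each indicator keeps unit variance.

```python
import bisect
import math
import random


def equal_interval_thresholds(n_categories, z_min=-3.0, z_max=3.0):
    """Cut points that split [z_min, z_max] into equally wide intervals."""
    width = (z_max - z_min) / n_categories
    return [z_min + width * i for i in range(1, n_categories)]


def categorize(value, thresholds):
    """Map a continuous value to a category 1..k given sorted thresholds."""
    return bisect.bisect_right(thresholds, value) + 1


def simulate_indicator(eta, loading=0.8, rng=random):
    """Standardized indicator: loading * eta + error, Var(error) = 1 - loading**2."""
    return loading * eta + rng.gauss(0.0, math.sqrt(1.0 - loading ** 2))


rng = random.Random(42)
thresholds = equal_interval_thresholds(5)
for _ in range(3):
    eta = rng.gauss(0.0, 1.0)
    y = simulate_indicator(eta, rng=rng)
    print(f"y = {y:+.2f} -> category {categorize(y, thresholds)}")
```

Because the cut points are symmetric around zero, a symmetric continuous distribution yields a symmetric categorical one, which is the property the design relies on to keep skewness at zero.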

The combination of the two experimental factors produced 20 experimental conditions (4 × 5), each of which was replicated 1000 times. These replications were performed using R version 2.0.0, which invoked PRELIS successively (Jöreskog and Sörbom, 1996b) to generate the corresponding data matrices according to the specifications resulting from the combination of the experimental conditions. Thus, for each generated data matrix, a correlation matrix was obtained. After obtaining the correlation matrix for each replication, the corresponding Confirmatory Factor Analysis was performed. The instrumental problem of underidentification in Model 2 (**Figure 3**) was solved by constraining four model components as equal to one: two

As in the previous case, these replications were performed using R version 2.0.0, which invoked LISREL 8.8 successively (Jöreskog and Sörbom, 1996a).
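The fully crossed design described in this section can be enumerated in a few lines. The sketch below only lists the conditions and replications; it does not call PRELIS or LISREL, and the constant and function names are illustrative.

```python
from itertools import product

CATEGORY_CONDITIONS = (3, 5, 7, "continuous")  # response scale of Y (range restriction)
SAMPLE_SIZES = (100, 500, 750, 1000, 5000)     # sample size (statistical power)
N_REPLICATIONS = 1000


def design_cells():
    """All combinations of the two manipulated factors (4 x 5 = 20 cells)."""
    return list(product(CATEGORY_CONDITIONS, SAMPLE_SIZES))


def replication_plan():
    """Yield (categories, n, replication index) for every data set to generate."""
    for categories, n in design_cells():
        for replication in range(1, N_REPLICATIONS + 1):
            yield categories, n, replication


cells = design_cells()
print(len(cells))                          # 20 experimental conditions
print(sum(1 for _ in replication_plan()))  # 20000 simulated data sets
```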

## RESULTS

**Table 2** presents the results obtained in the multistate model in the control condition, pre-test (Model 1) and the single-trait-multistate model in the control condition, post-test (Model 2) in the different experimental conditions.

We found, in general, that: (a) on no occasion did Model 1 fit better than Model 2; (b) the increase in chi-square (Δχ<sup>2</sup>) from Model 1 to Model 2 was significant; therefore, Model 2 fitted significantly better than Model 1 in all the experimental conditions in most of the replications (in 100% of replications when there were 500 participants or more); and (c) with 100 participants, both models were rejected, regardless of the categorization of the observed dependent variables (Y).

Taking into account the percentage of replications where RMSEA was lower than 0.08, we found the following results: (a) with 500 participants or more, both Models 1 and 2 fitted when the observed dependent variables (Y) were continuous; and (b) Model 2 fitted better than Model 1 when the observed dependent variables (Y) had 5 or 7 categories; with 750 participants or more, the same result was also found when the observed dependent variables (Y) had 3 categories.

#### TABLE 2 | Results obtained in Models 1 and 2 in different conditions.

Model 1, the pre-test in control group, which does not include a possible threat to validity influence (T); Model 2, the post-test in control group, which includes a possible threat to validity influence (T); n, sample size; % accepted H0, percentage of null hypothesis accepted (i.e., the model fits) in Models 1 and 2 using χ<sup>2</sup>; % Δχ<sup>2</sup> is significant, percentage of significant increase in χ<sup>2</sup> between Models 1 and 2; Δdf, increase in the degrees of freedom between Models 1 and 2; % RMSEA < 0.08, percentage of occasions in which the Root Mean Square Error of Approximation is under 0.08 (i.e., the model fits); Con., the dependent variables (Y) are continuous. Values marked in bold are the results that suggest a better fit in Model 2 than in Model 1.

Taking into account the percentage of accepted null hypotheses considering χ<sup>2</sup>, an index more sensitive than RMSEA, the only model that fitted was Model 2 in the case where there were 500 participants or more and the observed dependent variables (Y) were continuous.
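The contrast between the two indices follows from how RMSEA rescales the χ<sup>2</sup> statistic by the degrees of freedom and the sample size. The sketch below uses the common point-estimate formula, RMSEA = sqrt(max(χ<sup>2</sup> − df, 0) / (df (N − 1))); some programs use N instead of N − 1, and the χ<sup>2</sup>, df, and n values are hypothetical numbers chosen for illustration, not results from Table 2.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square, its df, and sample size n."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Hypothetical values: the same chi-square is held fixed across sample sizes
# only to show how n enters the index (real data would not behave this way).
for n in (100, 500, 5000):
    value = rmsea(chi_square=120.0, df=50, n=n)
    verdict = "below" if value < 0.08 else "not below"
    print(f"n = {n}: RMSEA = {value:.3f} ({verdict} the 0.08 cut-off)")
```

Because the excess of χ<sup>2</sup> over its degrees of freedom is divided by the sample size, a model can be rejected by the χ<sup>2</sup> test and still fall under the 0.08 cut-off when n is large.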

If the multistate model (Model 1) is rejected and the single-trait-multistate model (Model 2) is accepted, then, following the results obtained, T could be affecting the results: (a) in all the experimental conditions, if we consider the percentage of significant increases in chi-square values (% Δχ<sup>2</sup>); (b) when there were 500 participants or more and the observed dependent variables (Y) had 5 or 7 categories, if we consider the percentage of RMSEA lower than 0.08 (% RMSEA < 0.08); and (c) when there were 500 participants or more and the observed dependent variables (Y) were continuous, if we consider the percentage of accepted null hypotheses considering χ<sup>2</sup>.

Following the same logic, we can conclude that the possible effect of the threat to validity (T) was only neutralized when we had at least 500 participants and the observed dependent variables (Y) were continuous, if we consider the percentage of RMSEA lower than 0.08 (% RMSEA < 0.08).

## DISCUSSION

We would like to remark that the current study is a very preliminary approximation to the study of threats to validity in quasi-experimental designs under the Campbellian tradition. We have attempted to present the basic conceptual foundations needed to approach the study of threats to validity from an empirical perspective. The combination of design, CTT, and SEM has enabled us to obtain the design-model derivations expressed in a variance-covariance matrix whose likelihood could be tested via SEM. Finally, from a pragmatic perspective, we have attempted to illustrate the proposals empirically via a simulation study. This study is an attempt to open the door to broader empirical research on the problem of threats to validity. From this perspective, we suggest that future research analyze other types of validity, taking into account the many possible designs.

Overall, we conclude that the single-trait-multistate model in the control condition, post-test (Model 2, including T), presented a better fit than the multistate model in the control condition, pre-test (Model 1, without T), across the experimental conditions. As the number of categories and sample size increase, the results showed that Model 1 was rejected in favor of Model 2.

The key findings obtained based on the simulation study of threats to validity using SEM applied to causal analysis are as follows: (a) a general view including measurement, design, and analysis aspects can be provided, bridging design issues and analytical implications, by analytically studying the consequences of threats to validity; this would give practitioners a necessary insight when considering the consequences of design features on impact analysis; (b) it is useful to include in the SEM analysis several variables representing any threat to any kind of validity in order to explain the interindividual differences in the individual causal effects of the treatment variable on the response variable; with SEM, a test of measurement invariance using a concrete set of data could be carried out in a complementary way to study the possible differences between models, obtaining conclusions for specific situations in specific conditions (Muthén and Asparouhov, 2013); the advantage of using simulations is that conclusions about possible threats to validity can be more easily generalized to other situations and conditions owing to the high number of replications obtained (1000 in this study) and the possibility of manipulating different variables (e.g., number of possible responses and sample size in this study); thus, based on this study, we can conclude that (c) it is recommended not to categorize the dependent variables and, when this is done, to use as many categories as possible; with continuous dependent variables, the possible negative effect of the threat to validity included in Model 2 (named T) tended to be neutralized (Model 1, without T, also obtained a good fit considering RMSEA); and (d) using small sample sizes (100 participants or fewer) is not adequate (the models, whether including T or not, did not present an acceptable fit).

For future research: (a) we shall apply the procedure presented in the current study using real data obtained from a real situation in order to show practitioners how this proposal can increase the control over extraneous variables and, consequently, the quality of the interventions; (b) it will be necessary to work under the logic of multigroup analysis. This perspective would enable us to consider the control and experimental groups at the same time, and the pre- and post-test measures; then, it would be possible to impose the restrictions of the variance-covariance matrix of **Table 1**. In this way, some weaknesses of the present proposal would be solved, such as the fact that extraneous variables do not necessarily imply a threat to validity because, when they provoke the same effect in the treatment and control groups, this effect is neutralized (for example, the effect of maturation in children); in this sense, we would find a positive change in the control group when comparing pre- and post-test (instead of the same true score), but the change would be significantly lower than in the treatment group (if the treatment were effective). In sum, the control group does not need to have an identical level of X in pre- and post-test, but this possible level difference does not need to be statistically significant compared to the treatment group. These differences can only be detected when comparing both groups (control and experimental) and both measurement occasions (pre- and post-test); (c) when working with control and experimental groups, we shall manipulate the degree of equivalence between both to study the changes in the model fit when control and experimental groups are strictly equivalent (strong equivalence; i.e., equal structural and measurement model), or only the structural model is equal between control and experimental groups, or only the measurement model is equal between control and experimental groups (weak equivalence).
Thus, it has completely different consequences on possible inferences to be made from the quasi-experimental designs. If a strong equivalence is achieved, then we would have empirical evidence of a "high degree of validity" in the results obtained. However, when the equivalence found is only weak, we could suspect that some threats to construct validity could be affecting the results when the equivalence is found only in the measurement model, and it is possible that some threats to internal validity could be working if the equivalence is achieved only in the structural model; and (d) we shall manipulate other threats to validity (Shadish et al., 2002) using the same approach; e.g., violated assumptions of statistical tests (a threat to statistical conclusion validity) can be studied by simulating data with and without normal distribution and checking if the same model fits under both conditions; regression (a threat to internal validity) can be studied by simulating data sets with and without extreme values and checking if the same model fits under both conditions; treatment-sensitive factorial structure (a threat to construct validity) can be studied by simulating possible changes in data when comparing pre- and post-test results and checking if the factorial structure obtained in the pre-test is maintained equal in the post-test; inadequate explication of constructs (another threat to construct validity) can be studied by taking real data obtained from questionnaires and, before checking the possible relationships between constructs (structural model), confirming that items measure adequately each construct (measurement model); or interaction of the causal relationship with units (a threat to external validity) can be studied by checking the measurement invariance of a model across different groups, such as male and female.

## AUTHOR CONTRIBUTIONS

FH-T and SC-M came up with the initial idea and design of the work and interpreted the results. FH-T, SC-M, and JP-G made critical revisions to the manuscript for important intellectual content. FH-T and JP-G performed the analyses. SS-C helped in the interpretation of data and was in charge of drafting the manuscript. All four authors (FH-T, SC-M, SS-C, and JP-G) provided the final approval of the version to be published, and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

## FUNDING

This study forms part of the results obtained in research project PSI2011-29587, funded by Spain's Ministry of Science and Innovation; and in research project number 1150096, funded by the Chilean National Fund of Scientific and Technological Development (FONDECYT).

## ACKNOWLEDGMENTS

We would like to dedicate this article to our fellow researchers and very good friends José A. Pérez-Gil, who passed away while the article was being written, and William R. Shadish, an example of honesty and hard work and a leading reference in quasi-experimental designs. May they rest in peace. The authors highly appreciate all the comments received from the reviewers and the English language editor. We believe that the quality of this paper has been substantially enhanced as a result.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Holgado-Tello, Chacón-Moscoso, Sanduvete-Chaves and Pérez-Gil. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Analyzing Two-Phase Single-Case Data with Non-overlap and Mean Difference Indices: Illustration, Software Tools, and Alternatives

*Rumen Manolov<sup>1\*</sup>, José L. Losada<sup>1</sup>, Salvador Chacón-Moscoso<sup>2,3\*</sup> and Susana Sanduvete-Chaves<sup>2</sup>*

*<sup>1</sup> Departamento de Metodología de las Ciencias del Comportamiento, Facultad de Psicología, Universidad de Barcelona, Barcelona, Spain, <sup>2</sup> Psicología Experimental, Universidad de Sevilla, Seville, Spain, <sup>3</sup> Universidad Autónoma de Chile, Santiago, Chile*

#### *Edited by:*

*Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy*

#### *Reviewed by:*

*John G. Holden, University of Cincinnati, USA; Shana Cornelis, Ghent University, Belgium*

#### *\*Correspondence:*

*Rumen Manolov rrumenov13@ub.edu; Salvador Chacón-Moscoso schacon@us.es*

#### *Specialty section:*

*This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology*

*Received: 22 October 2015; Accepted: 08 January 2016; Published: 21 January 2016*

#### *Citation:*

*Manolov R, Losada JL, Chacón-Moscoso S and Sanduvete-Chaves S (2016) Analyzing Two-Phase Single-Case Data with Non-overlap and Mean Difference Indices: Illustration, Software Tools, and Alternatives. Front. Psychol. 7:32. doi: 10.3389/fpsyg.2016.00032*

Two-phase single-case designs, including baseline evaluation followed by an intervention, represent the most clinically straightforward option for combining professional practice and research. However, unless they are part of a multiple-baseline schedule, such designs do not allow demonstrating a causal relation between the intervention and the behavior. Although the statistical options reviewed here cannot help overcoming this methodological limitation, we aim to make practitioners and applied researchers aware of the available appropriate options for extracting maximum information from the data. In the current paper, we suggest that the evaluation of behavioral change should include visual and quantitative analyses, complementing the substantive criteria regarding the practical importance of the behavioral change. Specifically, we emphasize the need to use structured criteria for visual analysis, such as the ones summarized in the What Works Clearinghouse *Standards*, especially if such criteria are complemented by visual aids, as illustrated here. For quantitative analysis, we focus on the non-overlap of all pairs and the slope and level change procedure, as they offer straightforward information and have shown reasonable performance. An illustration is provided of the use of these three pieces of information: visual, quantitative, and substantive. To make the use of visual and quantitative analysis feasible, open source software is referred to and demonstrated. In order to provide practitioners and applied researchers with a more complete guide, several analytical alternatives are commented on pointing out the situations (aims, data patterns) for which these are potentially useful.

Keywords: non-experimental, single-case, data analysis, guidelines, methodological quality

## INTRODUCTION

The evidence-based practices movement aims to provide guidelines for carrying out methodologically sound research in fields such as psychology (APA Presidential Task Force on Evidence-Based Practice, 2006) and special education (Odom et al., 2005). According to this movement, the studies providing solid evidence need to meet a series of criteria related to how an experimental effect is documented and how generality can be established (Maggin et al., 2014). The first of these aspects refers, among other features of the study, to its design and analysis. In the current work, we focus on two-phase designs that do not meet the criteria established by the What Works Clearinghouse *Standards* (Kratochwill et al., 2010), unless they are part of a within-study replication, as in a multiple-baseline design. Two-phase designs may be weaker, from the perspective of internal validity, but they are still used (e.g., Cordery et al., 2010; O'Neill et al., 2013; Finn and McDonald, 2014; Winkens et al., 2014) and can be useful as pilot studies and also due to the fact that establishing the evidence basis of interventions is related to the replication of results and their integration via systematic reviews and meta-analyses (Jenson et al., 2007). Such reviews can offer a comprehensive summary of findings while trying to avoid publication bias, which would take place when excluding studies on the basis of the design. In that sense, it is potentially useful to report the results of all studies and, afterward, consider whether some studies show no differences or negative results (Kratochwill et al., 2001) or whether there are differences according to the design used or the methodological quality of the study. Actually, Gage and Lewis (2014) suggest that experimental control can be used as a moderator variable in meta-analyses.

In this context, the present paper arises from our conviction that practitioners' professional practice, aimed mainly at helping individual clients, can also contribute to informing fellow professionals about the results of applying certain interventions. In order to make this contribution possible and in order to be able to translate practice into research, certain design and analysis considerations are necessary. The current paper mainly aims to answer two specific questions: "What can be done to improve the data analysis in my practice so that its results are more useful to the discipline, despite using a sub-optimal design?" and "How can I easily implement some appropriate analytical techniques?" However, design and data analysis should be considered jointly (Brossart et al., 2014) and this is why we first review some aspects related to how the study is conducted.

Regarding the ways in which a study can be considered as providing evidence, a design implemented as a randomized controlled trial is one option, but it is not always feasible. Another alternative is single-case designs, also referred to as N-of-1 trials (Howick et al., 2011). For this latter option, there are several guidelines on how the studies should be carried out (see Smith, 2012; Maggin et al., 2014, for a review). Two of these guidelines are the What Works Clearinghouse *Standards* (Kratochwill et al., 2010) and the Risk of Bias in N-of-1 Trials (RoBiNT) scale by Tate et al. (2013). In brief, the optimal features of a single-case study contributing solid evidence are: to use a design allowing for at least three comparisons between conditions (as in multiple baseline, alternating treatments, and ABAB designs; Barlow et al., 2009); to include randomization in the design when assigning measurement times to conditions (Kratochwill and Levin, 2010); to include blinding of the patient, therapist, and assessor; to show high inter-rater reliability when recording the data (especially useful when the data are gathered by means of observation; Cohen, 1960); to apply the intervention as planned (see also Ledford and Gast, 2014, for a discussion on procedural fidelity); to use a repeatable measure for the target behavior; to use an appropriate data analysis procedure; to assess generalization across other behaviors and settings; and to replicate the results.

These requirements reflect the aspects of a study or a professional practice that moderate the extent to which its findings are "solid evidence" and also affect the practitioner's confidence in the conclusions regarding intervention effectiveness. Accordingly, using a sub-optimal two-phase design such as AB (referred to as "pre-experimental," Kazdin, 1982, or "quasi-experimental," Campbell and Stanley, 1966) is a drawback, but it does not necessarily preclude a study from being useful<sup>1</sup>, as there are other characteristics that can increase the credibility of the obtained results. In the present work, we focus on one of these aspects, data analysis, showing how to meet the condition for an appropriate data analysis.

The structure of this article is as follows. First, we comment on the characteristics of non-experimental studies in order to frame a context, where improvements are required (Institute of Education Sciences, 2013). Second, we present an analytical method meeting the criterion for appropriate data analysis; we refer to its strengths, limitations, and alternatives. Third, we apply the analytical method to a real data set. Fourth, we point out several analytically challenging situations and present our own advice to practitioners and applied researchers. With the justification and illustration of the analytical method and the software, we aim to offer practitioners and applied researchers a useful tool, and indications about its alternatives.

## NON-EXPERIMENTAL STUDIES

Demonstration of causal relations via experimental designs is considered optimal for building the evidence basis of interventions (Kratochwill et al., 2010; Tate et al., 2013), but everyday practice cannot always meet this requirement (e.g., due to time pressure or to the unethical withholding or removal of a potentially beneficial intervention). However, non-experimental studies can still contribute via in-depth assessment of effects, taking into consideration different sources of information (e.g., visual and numerical analyses of the data gathered, the interpretation of the client, his/her significant ones, and the practitioner) and relying on replication.

Non-experimental studies consisting only of a pre-intervention and post-intervention condition resemble "natural experiments," such as disasters or legislation changes, and they also resemble observational studies in which continuous recording of a single individual is taking place (see **Figure 1** representing the taxonomy of observation studies by Anguera et al., 2001, used in Jonsson et al., 2006). Moreover, an experimental multiple-baseline design across behaviors is similar to an observational plan in which several behaviors of the same participant are recorded each time that a video-taped situation is seen by the observers (i.e., a multidimensional observational recording according to Anguera et al., 2001). Another similarity can be seen between a multiple-baseline design across subjects and a multiple-case one-dimensional continuous recording observational plan. However, observational (or non-experimental, in general) and experimental methodology allow reaching different conclusions. Regarding experimental control, the main differences are in: (a) the use of randomization to decide when to introduce and withdraw an intervention, (b) the staggered introduction of the intervention and (c) the replication of effects. Accordingly, in the absence of staggered introduction of the intervention, in an observational study there is less control over alternative explanations of potential behavioral change and the demonstration of intervention effectiveness is not so strong (Kazdin, 1984). Thus, multidimensional single-case continuous observation is not equivalent to multiple-baseline design across behaviors. Moreover, in a natural setting it is usually not possible to choose *at random* when to intervene in order to support internal and conclusion validity (Kratochwill and Levin, 2010). Thus, the conclusions made need to refer to the existence and amount of change in the behavior, but not to the cause for such a change.

<sup>1</sup>Actually, even pre–post designs with a single measurement before and after an intervention can provide useful evidence (e.g., Pazzagli et al., 2014), especially if clinical significance is assessed, for instance using the Reliable Change Index (Jacobson and Truax, 1991).

## THE ANALYTICAL METHOD EXPLAINED

The analytical method is grounded on the "data analysis" item of the RoBiNT scale: controversy remains about whether the appropriate method of analysis in single-case reports is visual or statistical. Nonetheless, two points are awarded if systematic visual analysis is used according to steps specified by Kratochwill et al. (2010, 2013), or visual analysis is aided by quasi-statistical techniques, or statistical methods are used where a rationale is provided for their suitability (Tate et al., 2013, p. 629).

Our proposal is to use the option of "visual analysis aided by quasi-statistical techniques," where the latter are understood as descriptive measures that do not intend to yield statistical significance values. We base this proposal on several reasons. First, visual analysis is not only frequently used, but it is apparently the only kind of single-case data analysis that researchers seem to agree is necessary (e.g., Parker et al., 2006; Gast and Spriggs, 2010; Kratochwill et al., 2010; Davis et al., 2013; Fisher and Lerman, 2014). Second, the evidence on visual analysis suggests that its exclusive use is potentially problematic (i.e., visual analysis is not sufficient) and techniques increasing the reliability of visual analysis are necessary (Maggin et al., 2013). Third, we consider that certain quasi-statistical techniques with favorable evidence for their performance can be used as natural complements of the commonly used visual analysis, as they share the emphasis on the same main data features (overlap, level, and trend), whereas the visual aids also take data variability into account and allow comparing projected and actual data. Fourth, applied researchers may not be willing to use the more complex statistical techniques, whose results are more easily misinterpreted when there is an incomplete understanding of what exactly is being done with the data. Fifth, the use of inferential statistical procedures may not be fully justified in the absence of random sampling (Edgington and Onghena, 2007). Moreover, an inference to a population is not necessarily an aim of idiographic research (Johnston and Pennypacker, 2008), which focuses on the needs and the improvement of the individual clients. Sixth, easy-to-use software is available for the descriptive statistical procedures recommended here.

## SYSTEMATIC VISUAL ANALYSIS

## Rationale

Visual analysis has been and still is popular among professionals in their everyday psychological practice (Robey et al., 1999; Parker and Brossart, 2003) and is still advocated for (Lane and Gast, 2014) and used as a gold standard for assessing quantitative procedures (Wolery et al., 2010). Visual analysis has been considered both appropriate and sufficient for data gathered longitudinally (Michael, 1974). However, this sufficiency has been defended only for experimental studies (Sidman, 1960), which points at the need for complementing it with a quantitative procedure.

Tate et al. (2013) advise systematic visual analysis, which necessarily starts with assessing the baseline, specifically, whether the intervention can be introduced or whether it should be postponed until stability is reached (Barlow et al., 2009). Alternatively, deterioration in the behavior of interest would suggest even more clearly the need for intervention. In that sense, deterioration is not expected to interfere with subsequent conclusions about intervention effectiveness (Kazdin, 1978), given that it allows exploring whether an intervention reverts the situation. Nonetheless, it is possible to assess intervention effectiveness even when the behavior is already improving before the intervention itself, as will be shown later.

The specific data aspects, which are foci of attention, are the amount of overlap between data in the different conditions, within- and between-phase variability, slope and level change (SLC; Kratochwill et al., 2010; Lane and Gast, 2014). A more objective assessment of the degree to which data share the same values (i.e., overlap), whether levels and trends are similar across conditions, and whether data become more stable or more variable after the intervention can be done using visual aids instead of relying on naked-eye impressions. Finally, visual analysis focuses on the whole data pattern (Parker et al., 2006) in order to assess whether it resembles the expected one, that is, a consistent improvement only during intervention. Kratochwill et al. (2010) summarize the overall assessment as a comparison between projected and actually obtained measurements. Specifically, in two-phase designs, it is relevant to project the baseline (in case it is stable or presents trend stability) into the intervention phase and compare this projection with the real treatment phase data.

## Potentially Useful Tools

The assessment of overlap can be done using visual aids, such as range lines, as provided by the SCDA plug-in (Bulté and Onghena, 2012)<sup>2</sup> for R-Commander. The upper left panel of **Figure 2** shows an example with the data reported by Taylor and Weems (2011) for a participant called Elizabeth. This graph suggests a minor overlap between the observations. Regarding the assessment of changes in level, the same software can be used to superimpose, for instance, the median of the behavioral observations in the pre-intervention and post-intervention conditions. The upper right panel of **Figure 2** shows an example with the same data and suggests that there has been a reduction in the level of target behavior. However, the median is not very useful for the post-intervention observations in which there is a clear downward trend.

Regarding the assessment of changes in slope, two situations should be considered: when pre-intervention data are stable and when baseline data show an upward or downward trend. In case of stability, it is possible to use the stability envelope (Lane and Gast, 2014) or the two-standard deviations band used in statistical process control (Callahan and Barisa, 2005). The two-standard deviations band implies computing the average of the data for a specific condition and representing it with a solid line. The standard deviation of the same data is also computed and two dashed lines are represented: one located two standard deviations below the mean and the other two standard deviations above. The basis of this procedure is that, for a normally distributed variable, few points (less than 5%) are expected to be out of these limits in case there is no change in the behavior with the introduction of the intervention. However, we suggest using it only as a visual aid and not as a formal statistical procedure, as the data cannot be reasonably assumed to be normal, continuous, or independent. This visual aid is implemented in R (R Core Team, 2013) code<sup>3</sup> that only requires inputting the data and specifying the number of pre-intervention observations. As an example see the lower left panel of **Figure 2**, indicating that the reduction in behavior is beyond what is expected only by random variability, as there are multiple observations with values smaller than the lower limit.
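The logic of the two-standard deviations band can be sketched in a few lines. The following Python snippet is only an illustration and is not the R code referred to above: the function name `two_sd_band`, the use of the sample standard deviation (n − 1 denominator), and the data values are assumptions made for the example, and the data are hypothetical rather than Elizabeth's.

```python
from statistics import mean, stdev


def two_sd_band(scores, n_pre):
    """Build a two-SD band from the baseline and list the later points outside it.

    scores: all measurements in session order; n_pre: number of baseline sessions.
    Assumption: the sample SD (n - 1 denominator) is used for the band.
    """
    baseline = scores[:n_pre]
    post = scores[n_pre:]
    center = mean(baseline)           # solid line
    spread = stdev(baseline)          # sample standard deviation of the baseline
    lower = center - 2 * spread       # lower dashed line
    upper = center + 2 * spread       # upper dashed line
    outside = [x for x in post if x < lower or x > upper]
    return center, lower, upper, outside


# Hypothetical data: 4 baseline sessions followed by 5 intervention sessions.
scores = [9, 8, 10, 9, 4, 3, 5, 2, 3]
center, lower, upper, outside = two_sd_band(scores, n_pre=4)
print(center, round(lower, 2), round(upper, 2), outside)
```

In line with the caution expressed above, the list of points outside the band is meant as a descriptive aid for visual inspection, not as a significance test.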

In case the pre-intervention data show a trend, it is necessary to compare the projection of this trend and the actually obtained measurements (Kratochwill et al., 2010). For that purpose, there is another potentially useful R code<sup>4</sup> which allows applying the stability envelope to the trend line: (a) estimating split-middle trend (Miller, 1985), (b) projecting it into the next phase, and (c) constructing an envelope around it. The envelope can be constructed on the basis of the baseline median<sup>5</sup>, so that the lower limit is located 25% of the median below the estimated split-middle trend and the upper limit at the same distance above it (Lane and Gast, 2014). In case 80% of the data are within those limits, this would indicate trend stability, that is, it would suggest that no change in slope has been produced with the introduction of the intervention. Using this code only requires entering the data before copy-pasting it into R. The lower right panel of **Figure 2** shows an example with Elizabeth's data. Given that the projected trend and its stability envelope are lower than the actual observations, this is the only piece of graphical information that does not suggest improvement in the behavior, but practitioners should be cautious when trend is estimated from as few as four observations and when it is projected farther away in time into values that are out of the range of possible measurements (Parker et al., 2011b).
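Steps (a) to (c) can be sketched as follows. This Python snippet is a simplified illustration under stated assumptions, not the R code referred to above: the split-middle line is taken here as the line joining the (median time, median score) points of the two baseline halves, with the middle observation left out when the baseline length is odd and without any further adjustment of the line; the function names and the data are hypothetical.

```python
from statistics import median


def split_middle_line(baseline):
    """Return (slope, intercept) of a simplified split-middle trend line.

    Assumption: the line joins the median point of each baseline half;
    with an odd number of points the middle one is not used.
    Requires at least two baseline measurements.
    """
    n = len(baseline)
    half = n // 2
    times = list(range(1, n + 1))
    x1, y1 = median(times[:half]), median(baseline[:half])
    x2, y2 = median(times[n - half:]), median(baseline[n - half:])
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def trend_envelope(scores, n_pre, width=0.25):
    """Project the baseline trend and compute the share of later points inside
    an envelope of +/- width * baseline median around the projected line."""
    baseline = scores[:n_pre]
    post = scores[n_pre:]
    slope, intercept = split_middle_line(baseline)
    half_width = width * median(baseline)
    inside = 0
    for offset, value in enumerate(post):
        session = n_pre + 1 + offset
        projected = intercept + slope * session
        if projected - half_width <= value <= projected + half_width:
            inside += 1
    return slope, intercept, half_width, inside / len(post)


# Hypothetical data: 4 baseline sessions followed by 5 intervention sessions.
scores = [10, 9, 9, 8, 7, 7, 3, 2, 2]
slope, intercept, half_width, share_inside = trend_envelope(scores, n_pre=4)
print(slope, intercept, half_width, share_inside)
```

A share of points inside the envelope below the 80% reference value mentioned above would suggest that the intervention phase data depart from the projected baseline trend.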

Another aspect assessed is whether the introduction of the intervention has led to an immediate change in the behavior. Moreover, the duration of the change (maintained or transitory) is also taken into account in order to evaluate the strength of the intervention. A structured guide on visual analysis is offered by the What Works Clearinghouse *Standards* (Kratochwill et al., 2010; see also the application and a scoring procedure by Maggin et al., 2013) and by Lane and Gast (2014).

## Limitations

Despite these guidelines on visual analysis, there are still no soundly based formal decision rules for all data aspects that are visually assessed (Kazdin, 1982) and objective and replicable outcomes are also missing (Robey et al., 1999). These two drawbacks might be among the reasons for the frequently reported inadequate performance of visual analysts (Gibson and Ottenbacher, 1988; Ottenbacher, 1990; Danov and Symons, 2008; Ximenes et al., 2009; see also Ninci et al., 2015, for a recent meta-analysis reporting insufficient interrater agreement, especially among single-case experts). Moreover, the visual analysts' decisions are not directly useful for documentation or for meta-analysis (Busse et al., 1995), which would allow establishing the evidence basis for interventions (Jenson et al., 2007), especially as generalization in single-case studies depends on replication<sup>6</sup> rather than on random sampling and statistical inference. As a result of these limitations, there is a consensus that visual and quantitative analyses should be used jointly (Franklin et al., 1996; Fisch, 2001; Houle, 2009; Harrington and Velicer, 2015).

<sup>2</sup>http://cran.r-project.org/web/packages/RcmdrPlugin.SCDA/index.html

<sup>3</sup>https://dl.dropboxusercontent.com/s/elhy454ldf8pij6/SD\_band.R

<sup>4</sup>https://dl.dropboxusercontent.com/s/5z9p5362bwlbj7d/ProjectTrend.R

<sup>5</sup>Another option is to take into account the baseline data variability, operationally defined as the interquartile range, when constructing the trend stability envelope (Manolov et al., 2014).

<sup>6</sup>Kratochwill et al. (2013) recommend that the findings be replicated in at least five different studies, conducted by at least three different research teams on a total of 20 participants or more (i.e., the 5-3-20 rule).

## QUANTITATIVE ANALYSES RECOMMENDED

Our choice of procedures [non-overlap of all pairs (NAP), Parker and Vannest, 2009; and SLC, Solanas et al., 2010a] is based on the six criteria detailed below, although alternative quantifications are provided later in this article.

## Criterion 1: Simple to Compute

The techniques are relatively simple to compute and offer straightforward interpretations for practitioners who are not experts in statistics (as the Institute of Education Sciences, 2013, suggests). The calculation does not entail statistical decisions about the likelihood of obtaining such a large difference under the null hypothesis. This criterion also relates to the need for easily trainable procedures (Fisher et al., 2003).

## Criterion 2: Complementary to Visual Analysis

This criterion is related to the popularity of visual analysis among practitioners (Parker and Brossart, 2003), which makes it necessary to develop and promote suitable complements to it. NAP and SLC are actually based on relevant visual criteria (i.e., data overlap, change in slope and in level) and are thus potentially useful as complements<sup>7</sup>. Specifically, visual inspection can be used to assess the adequacy of the baseline as a reference for comparison. The change identified visually can then be quantified in an objective manner. The numerical values also offer information that can be communicated among researchers and professionals and used for further analyses with different analytical techniques or as part of research synthesis (e.g., NAP was used in the meta-analysis by Jamieson et al., 2014, whereas the new developments on SLC make possible its comparability across studies; Manolov and Rochat, 2015).

## Criterion 3: Synergic Application

Wolery et al. (2010) criticized non-overlap methods for omitting relevant data aspects such as level, trend, and stability or variability: SLC partially addresses this issue and it also responds to Beretvas and Chung's (2008) suggestion for quantifying level and slope change separately. Moreover, SLC yields unstandardized results, which help in assessing the practical importance of the behavior change when using meaningful measures (Grissom and Kim, 2012) such as the number of tantrums or the number of self-injurious behaviors. In contrast, NAP is bounded, which allows comparisons and quantitative integrations. Thus, NAP and SLC can be used jointly as they provide different information. Specifically, NAP is an ordinal measure (Solomon et al., 2015) that does not distinguish between conditions once complete non-overlap is achieved. In contrast, SLC can be used even in the absence of overlap to quantify how different the measurements belonging to different phases are.

<sup>7</sup>Wolery et al. (2010) found that no overlap technique had the highest agreement with visual analysts for both data with and without a change. However, they did not include NAP or Tau-U (Brossart et al., 2014) in their study, and these two non-overlap indices are considered to be superior, given their more solid statistical basis and greater statistical power according to the review performed by Parker et al. (2011a).

## Criterion 4: Absence of Assumptions and Restrictions of Use

The procedures used here do not make explicit *a priori* assumptions about independence or homoscedasticity of the data, as serial dependence is likely to be present in data obtained from the same individual (Matyas and Greenwood, 1996). There are also no specific design requirements.

## Criterion 5: Appropriate Performance

In relation to the previous point, there is evidence that their performance is appropriate for a variety of single-case data patterns (Manolov et al., 2011). NAP is a suitable indicator when data are stable and even when data are variable. In contrast, in such situations visual analysis is more difficult to perform, means and medians are not informative, and trends are not estimated with precision. On the other hand, NAP is not suitable when the data show an improving trend, but SLC can be applied in such a situation; this complementarity relates to Criterion 3 "Synergic application." SLC is useful for separately quantifying the change in level and the change in slope in potentially meaningful terms. In relation to this criterion, it is important to discourage the use of methods for comparing conditions that have been shown not to perform appropriately, such as the binomial test applied after the split-middle method (Crosbie, 1987), which does not control for Type I error rates, ITSACORR, which presents modeling flaws (Huitema et al., 2007), or the C-statistic (Young, 1941; Tryon, 1982; used by Fabio et al., 2013), which is actually an estimator of autocorrelation (DeCarlo and Tryon, 1993).

## Criterion 6: Reduced Likelihood of Misinterpretation

Using descriptive measures like the ones provided by NAP and SLC makes it less likely for applied researchers to make inferences, which would be statistically incorrect in absence of random sampling of the participant or of the behavior of interest (Barlow et al., 2009). We consider that inferential statistical techniques are more susceptible to being misunderstood and to prompt researchers to make dichotomous decisions (Cohen, 1994) about intervention effectiveness or behavioral change. In case inference is desired, we recommend causal inference, instead of population inference, in line with the recommendations by Heyvaert et al. (2015).

## Non-overlap of All Pairs

Non-overlap of all pairs is an improvement of the Percent of non-overlapping data commonly used for quantifying the degree to which the measurements pertaining to each phase share the same values (Scruggs and Mastropieri, 2013). It represents the number of non-overlapping data relative to all possible comparisons and it is actually identical to the non-parametric version of the probability of superiority (Grissom, 1994), which is related to the common language effect size (McGraw and Wong, 1992). When a decrease in the behavior is expected, as in the example provided later, the formula for this indicator can be written as

$$\mathrm{NAP} = \frac{\#\left(X_{pre(i)} > X_{post(j)}\right) + 0.5\,\#\left(X_{pre(i)} = X_{post(j)}\right)}{n_{pre}\,n_{post}},$$

where $X_{pre(i)}$ and $X_{post(j)}$ represent the values of the pre-intervention and post-intervention phases, respectively, with $i = 1, 2, \ldots, n_{pre}$ and $j = 1, 2, \ldots, n_{post}$, and # denotes the number of times that the inequality or the equality is true. Given that each data point of the pre-intervention phase is compared to each data point from the post-intervention phase, there is a total of $n_{pre} n_{post}$ comparisons, where $n_{pre}$ and $n_{post}$ denote the number of measurements in the first and second phase, respectively. In each of these comparisons, a non-overlap occurs when a post-intervention measurement represents an improvement over a pre-intervention measurement, with ties counting as half a non-overlap. To obtain the index value, the number of non-overlapping pairs is divided by the number of comparisons. This value can be interpreted in two different ways. On the one hand, it represents the proportion of comparisons for which intervention phase data improve baseline data. On the other hand, it can be conceptualized as the probability that a randomly selected post-intervention data point will improve (here, be smaller than) a randomly selected pre-intervention data point.

The NAP can be computed via the online calculator http://www.singlecaseresearch.org/calculators/nap by Vannest et al. (2011), where it is only necessary to enter the data from the different conditions in separate columns. It is also part of the output ("A vs. B" comparison) of the R code for Tau-U https://dl.dropboxusercontent.com/u/2842869/Tau\_U.R (Brossart et al., 2014), which requires loading a data file with a single comma-separated column including "Time" (1, 2, ..., $n_{pre} + n_{post}$), "Score" (denoting the measurements) and "Phase" denoting the condition ($n_{pre}$ times the value of 0 followed by $n_{post}$ times the value of 1).
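For readers who prefer to see the pairwise counting spelled out, the definition of NAP given above can be sketched directly. The following Python snippet is only an illustration of that definition; it is neither the online calculator nor the Tau-U R code, and the function name `nap`, the `decrease_expected` argument, and the data are hypothetical.

```python
def nap(pre, post, decrease_expected=True):
    """Non-overlap of all pairs for two phases.

    Every baseline value is compared with every intervention value;
    an improvement counts 1 and a tie counts 0.5.
    """
    improved = 0.0
    for a in pre:
        for b in post:
            if a == b:
                improved += 0.5          # tie: half a non-overlap
            elif (a > b) == decrease_expected:
                improved += 1.0          # post value improves on the pre value
    return improved / (len(pre) * len(post))


# Hypothetical data in which a decrease in the behavior is the desired outcome.
pre = [5, 6, 7, 6]
post = [6, 4, 3, 4, 2]
print(nap(pre, post))  # 18 improving-pair units out of 20 comparisons
```

With these values, 17 of the 20 pairs show improvement and 2 pairs are tied, so the index is (17 + 0.5 × 2)/20 = 0.90, which can be read as the proportion of comparisons in which an intervention phase value improves on a baseline value.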

## Slope and Level Change

Slope and level change quantifies two aspects of behavior's evolution after a change in the conditions: change in slope and change in level. Actually, this procedure first estimates pre-intervention linear trend ($\hat{\beta}_A$) as the average of the differenced first phase measurements, that is, $\hat{\beta}_A = \sum_{i=1}^{n_{pre}-1} (X_{i+1} - X_i)/(n_{pre} - 1)$. Baseline trend is thus the average increase (or, if negative, decrease) from one baseline measurement occasion to the next one. This estimation can inform about the characteristics of the data before an intervention is introduced. Moreover, baseline trend is removed from the whole data series so that it does not affect the quantification of the effects of the intervention. Technically, each data point is corrected according to its position in the series of observational sessions. This initial step allows for applying an intervention even when the theoretically undesirable linear improvement is present already during the assessment period. Thus, SLC would show whether there is an effect of the intervention beyond the initial improvement. After the correction it is assumed that the pre-intervention phase shows zero trend (i.e., stable data) and thus the trend present in the post-intervention phase actually represents an effect (i.e., a change in slope). This effect is estimated in the same manner as in the initial step, that is, as the average of the differenced (and already detrended) post-intervention measurements: $SC = \sum_{j=1}^{n_{post}-1} (\tilde{X}_{j+1} - \tilde{X}_j)/(n_{post} - 1)$, where $\tilde{X}$ represents detrended values (i.e., after eliminating pre-intervention trend), instead of the original measurements.
Therefore, the intervention phase estimate of trend presents the average increase (or, if negative, decrease) from one intervention phase measurement occasion to the next one, after controlling for baseline linear trend. For instance, the slope change estimate reflects the average decrease in the number of tantrums in a child with each successive post-intervention measurement, that is, a progressive change.

Once slope change is estimated, post-intervention trend is removed in order to obtain a net estimate of the change in level. This way of proceeding is similar to what is done in ARIMA models, before obtaining a quantification of change in level (see Harrington and Velicer, 2015). Net change in level is estimated as the difference between the average of the corrected post-intervention measurements and the average of the corrected pre-intervention measurements. The expression for this step is $LC = \sum_{j=1}^{n_{post}} \tilde{X}_j / n_{post} - \sum_{i=1}^{n_{pre}} \tilde{X}_i / n_{pre}$, where $\tilde{X}_j$ represents post-intervention measurements with both pre-intervention trend and post-intervention trend (i.e., slope change) removed and $\tilde{X}_i$ represents pre-intervention measurements with pre-intervention trend removed. The net level change estimate quantifies, for instance, the average decrease of tantrums in a child after the intervention, once slope change has been taken into account. Thus, it can be conceptualized as a quantification of an abrupt and maintained effect. The SLC can be computed using R code https://dl.dropboxusercontent.com/s/ltlyowy2ds5h3oi/SLC.R or via the R-Commander Plug-in offering point-and-click menus, available at http://cran.r-project.org/web/packages/RcmdrPlugin.SLC/index.html. For obtaining the numerical results and a graphical representation of the original and detrended data, both options only require inputting the values of the observations and specifying the pre-intervention phase length.
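The three SLC steps described above (baseline trend, slope change, net level change) can be sketched in a few lines of Python. This is only an illustrative sketch, not the R code referenced in the text: the function name `slc` is hypothetical, and the indexing used for detrending (session number 1, 2, ... for the baseline correction; 0, 1, ... within the second phase for the slope correction) is an assumption of this sketch.

```python
from statistics import mean


def slc(pre, post):
    """Return (baseline_trend, slope_change, level_change) for a two-phase series.

    Hypothetical helper; the detrending indices are assumptions of this sketch.
    """
    n_pre, n_post = len(pre), len(post)
    if n_pre < 2 or n_post < 2:
        raise ValueError("each phase needs at least two measurements")

    # Step 1: baseline trend = average of the differenced first-phase data.
    baseline_trend = mean([pre[i + 1] - pre[i] for i in range(n_pre - 1)])

    # Remove baseline trend from the whole series, according to each
    # data point's position (1, 2, ..., n_pre + n_post).
    series = list(pre) + list(post)
    detrended = [x - baseline_trend * t for t, x in enumerate(series, start=1)]
    pre_dt, post_dt = detrended[:n_pre], detrended[n_pre:]

    # Step 2: slope change = average of the differenced, detrended
    # second-phase data.
    slope_change = mean([post_dt[j + 1] - post_dt[j] for j in range(n_post - 1)])

    # Step 3: remove the second-phase trend (first point left unchanged),
    # then compare the phase means to obtain the net level change.
    post_ddt = [x - slope_change * j for j, x in enumerate(post_dt)]
    level_change = mean(post_ddt) - mean(pre_dt)

    return baseline_trend, slope_change, level_change
```

With made-up data such as `slc([10, 12, 14], [20, 23, 26])`, the baseline rises by 2 units per session, the detrended second phase rises by 1 unit per session, and the remaining difference between phase means is 4 units.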

## ALTERNATIVES FOR QUANTITATIVE ANALYSIS

There is currently no consensus on which quantitative procedure is optimal for single-case designs (Kratochwill et al., 2010; Smith, 2012), as the RoBiNT scale also reflects (Tate et al., 2013). For a comprehensive review of most currently available techniques the interested reader should consult the state-of-the-art information provided in the Special Issues of the *Journal of School Psychology* in 2014, volume 52, issue 2 (e.g., Shadish et al., 2014; Swaminathan et al., 2014) and of *Neuropsychological Rehabilitation* also in 2014, volume 24, issues 3-4 (e.g., Borckardt and Nash, 2014; Brossart et al., 2014; Heyvaert and Onghena, 2014). Here, we provide brief comments on the strengths and limitations of several analytical alternatives, which in some cases may be more appropriate than the NAP and SLC included in the analytical method suggested.

Considering specifically observational studies in which data are recorded continuously within a session, it is possible to follow an analytical approach different from the one used in single-case designs, namely, to apply sequential analysis to explore whether the occurrence of some behaviors makes it more or less probable that other behaviors take place (Bakeman and Quera, 2011). Additionally, longer series of data gathered across time can be analyzed using Markov chains or analyses of rhythm, according to the aims of the study (Suen and Ary, 1989).

Starting our discussion from procedures similar to the ones included in the analytical method, Tau-U (Parker et al., 2011b) is closely related to NAP and it is preferable when pre-intervention trend is present in the data. For both Tau-U and NAP *p*-values have been offered, although their basis has not clearly been explained in the presence of autocorrelation. However, Tau-U is interpretatively and computationally less straightforward than NAP (i.e., Criterion 2 "Complementary to visual analysis" is met to a lesser extent). For instance, even in case a baseline trend is generally deteriorating, if there is a single improving value in the baseline phase, as compared to a previous baseline data point, this would reduce the value of the non-overlap index. Thus, in case trend is not reasonably clear, Tau-U can be an excessively conservative procedure (i.e., it would overcorrect). Furthermore, more evidence is required on its performance (thus the abovementioned Criterion 5 "Appropriate performance" is not fully met, as Parker et al., 2011a,b, offer only applications to real data, but no simulation study).

Regarding procedures quantifying average differences, similar to the SLC, the *d*-statistic (Shadish et al., 2014) has to be mentioned. We highlight here the *d*-statistic developed by Shadish et al. (2014), which has been created specifically for single-case designs, rather than the *d*-statistic described by Busk and Serlin (1992; approach one<sup>8</sup>), recommended by Beeson and Robey (2006), for two reasons: (a) the latter is an adaptation of the group designs indicator and does not take into account autocorrelation, while it has been shown to be somewhat affected by autocorrelation (Manolov and Solanas, 2008); and (b) its sampling distribution in single-case studies is unknown (Beretvas and Chung, 2008). In contrast, the *d*-statistic developed by Shadish et al. (2014) offers a standardized measure of the mean difference with a solid statistical basis, offering the possibility to estimate the index variance for future meta-analyses. So far, it has been developed for AB, reversal (e.g., ABAB) and multiple-baseline designs, assuming that pre-intervention data are stable and that within-case residuals and between-case variation do not change over time. Thus, this procedure fails in terms of Criterion 4 "Absence of assumptions and restrictions of use." Some potential drawbacks include: (a) its computation requires several cases per study; and (b) the calculations are potentially difficult to understand by applied researchers with less statistical knowledge and require the use of software, such as the R code provided in the appendix of the Shadish et al. (2014) paper. Hence, the *d*-statistic is preferable to SLC when there is more than one participant per study and the aim is to obtain a standardized measure, but it is not suitable when pre-intervention trend is present and when the focus is on a specific client.

<sup>8</sup>This indicator is equivalent to Glass' Δ (Glass et al., 1981), as it divides the mean difference by the standard deviation of the pre-intervention phase data.
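For orientation, the simpler indicator described in footnote 8 (mean difference divided by the standard deviation of the pre-intervention data) can be sketched as follows. This is not the Shadish et al. (2014) *d*-statistic, which requires several cases and dedicated software; the function name is hypothetical and the use of the sample standard deviation (denominator *n* − 1) is an assumption of this sketch.

```python
from statistics import mean, stdev


def baseline_standardized_difference(pre, post):
    """Mean difference (post minus pre) divided by the pre-intervention SD.

    Hypothetical helper illustrating the Glass-type indicator of footnote 8;
    it ignores autocorrelation, which is one of the criticisms in the text.
    """
    sd_pre = stdev(pre)  # sample SD (n - 1 denominator): an assumption here
    if sd_pre == 0:
        raise ValueError("pre-intervention data have zero variability")
    return (mean(post) - mean(pre)) / sd_pre
```

For the made-up data `pre = [10, 12, 10, 12, 11]` and `post = [8, 7, 9]`, the baseline mean is 11 with a standard deviation of 1, so the function returns a difference of three baseline standard deviations in the direction of a reduction.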

Generalized least squares regression analysis (Swaminathan et al., 2014) also enables computing an effect size index. Its strengths include the fact that it can take into account changes in level and in slope (although they are quantified as part of the same overall indicator, unlike SLC), the versatility in modeling (e.g., controlling for linear and non-linear trends), and that it deals explicitly with autocorrelation. However, autocorrelation estimation has been shown to be problematic (Solanas et al., 2010b) and the analytical procedure requires several steps, some of them taking place iteratively (i.e., Criterion 1 "Simple to compute" is not met). This procedure is applicable to longer data series for which autocorrelation can be estimated with greater precision. Moreover, we recommend that practitioners work together with a statistician, so that the analysis can be properly run. Brossart et al. (2006) compared the agreement between visual analysis and several regression-based approaches and the best performer in these terms (related to Criterion 2 "Complementary to visual analysis") was Allison and Gorman's (1993) method, which is, however, affected by autocorrelation (Manolov and Solanas, 2008). The generalized least squares approach had not yet been proposed by the time Brossart et al. (2006) conducted their study and more evidence is necessary to assess its performance.

Multilevel models are an extension of piecewise regression and can be used to model several data aspects (e.g., trend, autocorrelation, heterogeneous data variability across phases) and they yield estimates of the change in the same measurement units as the target behavior and their statistical significance (Moeyaert et al., 2014a). The main drawbacks of multilevel models are the problematic estimation of variance (Ferron et al., 2009), their relative complexity for applied researchers with less statistical knowledge and the fact that they require the replication of the intervention in several participants. Actually, such a complex procedure is more suitable for more complex design structures than the two-phase AB design (Moeyaert et al., 2014b). Finally, most implementations of this analytical procedure have been done in commercial software (e.g., Moeyaert et al., 2014a include SAS code in their article).

An effect size index can also be computed from interrupted time series analysis via ARIMA (autoregressive integrated moving average) models, which allow controlling for trend and autocorrelation (Simonton, 1977). The main difficulties of this option are the need for long data series and the problematic initial model identification step. However, there have been suggestions for using some general models that make model identification unnecessary (Harrop and Velicer, 1985). A recent application of ARIMA models has shown that these can be applied to two-phase data, but there might be convergence problems and, more importantly, the agreement with visual analysis is low (Harrington and Velicer, 2015). We consider that this latter drawback and the relative complexity of the technique make it less attractive to applied researchers with no statistical expertise.

Statistical significance (i.e., *p*-values) can be estimated for *d* and the generalized least squares procedure on the basis of the comparison between the test statistic and a theoretical reference (the sampling distribution), which allows making inferences about the population from which the individual was drawn. In contrast, randomization tests (Heyvaert and Onghena, 2014) yield a *p*-value on the basis of a comparison between the test statistic and an empirical reference: the randomization distribution. In the current context of two-phase studies, this reference is the distribution of the test statistic values quantifying the difference between the two conditions for each possible intervention start point (i.e., for each possible way in which the data series can be split into two; Edgington, 1980). For this analytical option the inference is restricted to the case studied, referring to the likelihood of obtaining such a large difference in case the intervention was ineffective. Randomization tests are versatile in terms of the test statistic used (e.g., it can be an effect size such as a non-overlap index) and offer flexible options for dealing with different situations (e.g., Levin et al., 2012). However, the necessary randomization as part of the data collection process is both a strength (Kratochwill and Levin, 2010) and a limiting characteristic (Fisher and Lerman, 2014) in a clinical setting (i.e., Criterion 4 "Absence of assumptions and restrictions of use" is not met). Moreover, in certain conditions Type I error rates are not controlled (Manolov et al., 2010). Randomization tests can be recommended when the aim is to obtain statistical significance and the point(s) of change in the conditions can be chosen at random. Randomization tests are also accompanied by freely available software (Bulté and Onghena, 2013; Levin et al., 2014).
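The logic of a randomization test for a two-phase design with a randomly chosen intervention start point can be sketched as follows. This is a minimal illustration, not the software cited above: the function names are hypothetical, the test statistic (first-phase mean minus second-phase mean, suitable when a reduction is expected) is just one possible choice, and the set of admissible start points is assumed to have been fixed before data collection.

```python
from statistics import mean


def mean_difference(series, start):
    """First-condition mean minus second-condition mean, splitting at `start`.

    `start` is the 0-based index of the first measurement in the second phase.
    """
    return mean(series[:start]) - mean(series[start:])


def randomization_p_value(series, actual_start, admissible_starts,
                          statistic=mean_difference):
    """One-sided p-value from the randomization distribution.

    The reference distribution contains the statistic for every admissible
    start point; the p-value is the proportion of values at least as large
    as the one obtained for the start point actually used.
    """
    if actual_start not in admissible_starts:
        raise ValueError("the actual start point must be admissible")
    observed = statistic(series, actual_start)
    reference = [statistic(series, s) for s in admissible_starts]
    as_extreme = sum(1 for value in reference if value >= observed)
    return as_extreme / len(reference)
```

With only five admissible start points the smallest attainable *p*-value is 1/5 = 0.20, which illustrates why the number of possible randomizations has to be planned in advance if conventional significance levels are to be reachable.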

Another procedure using an empirical reference distribution is simulation modeling analysis (SMA; Borckardt and Nash, 2014). In SMA, data are generated with the same autocorrelation as estimated from the data, but with no difference between the conditions, thus representing the null hypothesis of identical behavioral level across conditions. The *p*-value represents the likelihood of the outcome, computed as a point biserial correlation between the measurements and a dummy variable representing the condition (0 = without intervention, 1 = with intervention). This approach is intuitive, takes autocorrelation into account, and can be implemented via the software freely available at http://clinicalresearcher.org/software.htm. However, so far the evidence on its performance (i.e., Criterion 5 "Appropriate performance") is not sufficient. Finally, the focus is put on the *p*-value, which may conflict with Criterion 6 "Reduced likelihood of misinterpretation."
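The general idea of SMA, as described in the paragraph above, can be approximated with the following sketch. It is not the SMA software: all function names are hypothetical, and several modeling choices are assumptions of this sketch, namely a lag-1 autocorrelation estimate, a first-order autoregressive process with standard normal innovations for the simulated series, and a two-sided comparison based on the absolute correlation.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable it is the point biserial r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def lag1_autocorrelation(series):
    """Conventional lag-1 autocorrelation estimate of the whole series."""
    m = mean(series)
    denominator = sum((v - m) ** 2 for v in series)
    if denominator == 0:
        return 0.0
    numerator = sum((series[t] - m) * (series[t + 1] - m)
                    for t in range(len(series) - 1))
    return numerator / denominator


def sma_p_value(pre, post, n_sim=2000, seed=0):
    """Proportion of simulated null series with |r| at least the observed |r|."""
    rng = random.Random(seed)
    series = list(pre) + list(post)
    phase = [0] * len(pre) + [1] * len(post)
    observed = abs(pearson_r(series, phase))
    rho = max(-0.99, min(0.99, lag1_autocorrelation(series)))
    count = 0
    for _ in range(n_sim):
        # AR(1) series with no difference between conditions (assumption).
        simulated = [rng.gauss(0, 1)]
        for _step in range(len(series) - 1):
            simulated.append(rho * simulated[-1] + rng.gauss(0, 1))
        if abs(pearson_r(simulated, phase)) >= observed:
            count += 1
    return count / n_sim
```

Fixing the seed makes the result reproducible; a different seed or number of simulations will change the estimated *p*-value slightly, which is inherent to simulation-based reference distributions.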

Whereas SMA uses Monte Carlo methods or bootstrap for generating samples and estimating the likelihood of the value of the test statistic in case there is no difference between conditions, bootstrap has also been suggested for single-case data as a way of reducing bias and estimating standard errors (McKnight et al., 2000) and specifically for estimating confidence intervals of regression-based *R*-squared values (Parker, 2006). This option has not received much attention lately and it is unclear whether applied researchers would be willing to use it.

Another computer-intensive option could be the Monte Carlo based method for modeling non-linearity proposed by Theiler et al. (1992). However, modeling non-linear patterns can also be achieved without prior knowledge and without the need to specify a model, by using local regression (LOESS; Jacoby, 2000; Solmi et al., 2014). We consider LOESS to be more practical for applied researchers than the Theiler et al. proposal. Moreover, randomization tests are also more parsimonious as they require no assumptions about the process generating the data or about random sampling. Actually, Theiler et al. (1992) mention this option as a rank statistic approach for obtaining *p*-values. Randomization tests offer the advantage of not only mimicking the preserved data features (such as mean and standard deviation), as expressed by Theiler et al. (1992), but actually preserving the whole data series and its order, taking advantage of the different possible moments of change in phase, when such moments are determined at random.

A simplified summary of these general recommendations regarding the use of the analytical techniques can be found in **Figure 3**.

## INTERVENTION EFFECTIVENESS IS NOT ONLY DATA ANALYSIS

Assessing the relevance of an intervention cannot be constrained solely to visual and descriptive or inferential statistical analyses. It is important to assess aspects such as quality of life (Kendall, 1999) and whether the behavior has moved from dysfunctional to functional ranges (Kazdin, 1999), without forgetting subjective evaluation (Hugdahl and Ost, 1981). Regarding the latter, Kratochwill and Levin (2010) highlight the need to get to know the perceptions of the client and of significant others. According to the specific context being studied, these significant others would be the family members (parents, siblings, marital partner), the teacher, the coach, or the boss (as a figure with a higher hierarchical role), and friends, classmates, or colleagues (at the level of "peers"). Kazdin (1984) has referred to these groups of people as "paraprofessionals," as they help to detect the behavior that requires intervention and they can also be the agents reinforcing the behavior of interest (e.g., a mother reinforcing a child's disruptive behavior by paying attention to it) or producing stimuli for discriminating conditions in which certain types of behavior are desirable (e.g., a boss may encourage jokes with one type of clients and more distant behavior with others).

## THE ANALYTICAL METHOD APPLIED

In the present section, we will illustrate the application of the analytical method and the information that can be obtained via visual and quantitative analyses, while also considering substantive criteria. This application focuses on the family context, where it is common to gather data before and after an intervention (Crane, 1985). One of the empirically supported interventions in this context is the Parent Child Interaction Therapy (PCIT; Eyberg et al., 2008), which has been reported to increase positive parent behavior and reduce child behavior problems (Borrego et al., 2006). For the current example, the data gathered by Bagner et al. (2009) will be used. The participants are a 23-month-old premature-born child displaying difficult behaviors and his mother. The application of the PCIT focuses on teaching parenting skills in order to improve the interaction with the child and to decrease his externalizing behavior. Teaching takes place in two phases. First, child-directed intervention (CDI) takes place. It is similar to play therapy: the child is the leader and the parent has to learn how to act positively (e.g., praising the child, imitating the child's play). Second, the parent-directed intervention (PDI) phase occurs. It is similar to clinical behavior therapy: the parent is more directive and has to improve her way of disciplining so that greater compliance is achieved. In order to assess intervention effectiveness, several sources of information are used: parent reports provided via inventories, observation of the parent–child interaction, and physiological measurements. In the running example, we focus on the parent weekly reports obtained via the Intensity scale of the Eyberg Child Behavior Inventory (ECBI; Eyberg and Pincus, 1999) on disruptive behavior, although a complete assessment entails exploring whether all available information converges to the same conclusion. The Bagner et al. (2009) ECBI data were chosen here given that there is a cut-off point at a T-score of 60 which indicates clinically significant results and eases the interpretation in substantive terms. The data gathered<sup>9</sup> on the ECBI scale are represented in **Figure 4**. The upper panel contains ordinary least squares trend lines provided by the SCDA plug-in for R, the middle panel contains the split-middle trend for the first phase, and the lower panel represents the application of the two-standard deviations band fit to the first condition's data and projected into the second one.

Firstly, when visually inspecting the data, it has to be kept in mind that both phases are treatment phases and thus in both some reduction in the child's disruptive behavior is expected and desired. Moreover, it has to be taken into account that the pre-treatment (i.e., actual baseline) value is 82, equal to the first CDI phase measurement. At the beginning of the first phase there is actually a reduction, but then a new increment starts. Considering this alternating pattern, the CDI does not seem especially effective. Given the amount of variability in the first phase, neither the central tendency measure (mean represented on the lower panel of **Figure 4**), nor the different types of trend fitted (upper and middle panel) seem to represent the data well enough. This can hamper the comparison between this condition and the subsequent one.

Once the intervention is introduced, there is apparently a decrease in the ECBI score on disruptive behavior. The downward trend is stable, as shown by the good fit of the ordinary least squares regression line to the data (upper panel of **Figure 4**). For such data it is not meaningful to discuss level or variability around a mean or a median level; actually variability is only assessed looking at the (small) distance of the measurements from the fitted trend line.

<sup>9</sup>We would like to thank Dr. Daniel Bagner for kindly offering the raw data for re-constructing their original figure.

Comparing the two phases in terms of overlap, the values in the beginning of the PDI-phase are similar to the ones in the CDI-phase, but not so in the end. Comparing levels is not meaningful. Comparing trends is hindered by the lack of fit of the trend lines to the CDI data, but if we focus on the last four (out of five) CDI measurements, there is a deterioration that is reversed with the introduction of the PDI: thus a change in slope has taken place. The comparison between projected and actual data is done in two ways, projecting the baseline mean with limits based on the baseline standard deviation and projecting the split-middle trend line with limits based on 25% of the baseline median. In this case, both approaches lead to a very similar graphical representation, which is well aligned with the conclusion that the last PDI data points are clearly lower than what would be expected (i.e., values within the limits) if there were no difference between the two interventions. Additionally, we should consider that Bagner et al. (2009) collected a posttreatment measurement equal to 38, a value even lower than the last PDI-phase measurement, so the downward trend seems to continue, which could be interpreted as maintenance of the effect.
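The first of the two projections mentioned above (first-phase mean with limits based on the first-phase standard deviation) can be sketched as follows. This is an illustrative sketch rather than the SCDA plug-in implementation: the function names are hypothetical, and the use of the sample standard deviation is an assumption. The example data are invented and are not the Bagner et al. (2009) values.

```python
from statistics import mean, stdev


def two_sd_band(first_phase, k=2.0):
    """Lower and upper limits: first-phase mean minus/plus k sample SDs."""
    center = mean(first_phase)
    spread = stdev(first_phase)  # sample SD: an assumption of this sketch
    return center - k * spread, center + k * spread


def points_outside_band(first_phase, second_phase, k=2.0):
    """1-based positions of second-phase measurements outside the band."""
    lower, upper = two_sd_band(first_phase, k)
    return [position
            for position, value in enumerate(second_phase, start=1)
            if value < lower or value > upper]


# Invented data: first-phase mean 11 and SD 1 give a band from 9 to 13.
example_first = [10, 12, 10, 12, 11]
example_second = [11, 9.5, 8, 14]
print(points_outside_band(example_first, example_second))  # [3, 4]
```

When the first phase is short and variable, as in the example analyzed here, the band is wide, so only pronounced changes in the second phase fall outside it.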

Secondly, regarding quantitative analyses, the NAP performs 50 comparisons, given that *npre* = 5 and *npost* = 10, in which there are 19 full overlaps, that is, 19 cases in which a CDI datum is better (here, lower) than a PDI measurement, 0 ties, and 31 cases in which a PDI measurement is better than a CDI data point. (Lower rather than greater values are considered as overlaps, given that the aim is to reduce the disruptive behavior and thus also the ECBI T-score.) The value yielded by NAP is 62.00%, which can be interpreted as the percentage of pairwise comparisons in which a PDI measurement improves on a CDI measurement. Therefore, the index does not suggest that the change is especially salient, given that the value is only slightly higher than the one expected by chance (50%) and it is within the range of values (0–65%) denoting a small effect according to Parker and Vannest (2009). However, it has to be considered that this may be due to the fact that the effect is delayed. The data pattern is not particularly easy to analyze with the SLC either. The procedure estimates the CDI-phase trend as −2.25, which represents an average reduction of approximately two T-score units for each CDI measurement time. However, this value does not reflect the visual impression, given that this phase shows a specific kind of variability (i.e., an alternating pattern). Correcting for this initial phase trend, the slope change estimate is −1.64, that is, an average reduction of nearly two T-score points for each PDI measurement time. This quantification reflects to some extent the visual impression of slope change. SLC's estimate of the net change in level is positive, 18.15, which contrasts with the visual impression of the graphed data.
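The NAP arithmetic reported above can be reproduced from the counts alone, and the pairwise comparison itself can be sketched in a few lines. The function names below are hypothetical, ties are weighted by 0.5 as is customary for this index, and the short data set used for the pairwise version is invented for illustration (it is not the Bagner et al. data).

```python
def nap_from_counts(improvements, ties, total_comparisons):
    """NAP as a proportion: improving pairs plus half the ties, over all pairs."""
    return (improvements + 0.5 * ties) / total_comparisons


def nap(first_phase, second_phase, lower_is_better=True):
    """Compare every first-phase value with every second-phase value."""
    improvements = 0
    ties = 0
    for a in first_phase:
        for b in second_phase:
            if a == b:
                ties += 1
            elif (b < a) == lower_is_better:
                improvements += 1
    total = len(first_phase) * len(second_phase)
    return nap_from_counts(improvements, ties, total)


# Counts reported in the text: 31 improvements, 0 ties, 5 x 10 = 50 pairs.
print(nap_from_counts(31, 0, 50))  # 0.62
```

Setting `lower_is_better=True` matches the running example, in which a reduction in the ECBI T-score is the desired outcome.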

Thirdly, focusing on substantive criteria, Bagner et al. (2009) summarize their results in terms of improved parent practice and increased child compliance. In fact, while the former result stems from observation and evaluation by the authors, the latter is based on reports from the parents (i.e., the paraprofessionals). Regarding the ECBI scores, the last three scores during the PDI phase fall outside the clinical range, indicating that a practically significant change in the behavior of the child has taken place. Interestingly, these same three scores also fall outside the two-standard deviations band and the split-middle trend stability envelope represented in the middle and lower panels of **Figure 4**. To complement this assessment, the authors report that at a 4-month follow-up the results of the ECBI remained in the normal range (the value was 47), which increases the confidence in the importance of the behavioral change. Finally, it should be noted that Bagner et al. (2009) comment explicitly on the "inability to conduct statistical analyses" (p. 475), which suggests that informing applied researchers about analytical options for two-phase single-case designs, as we intend with the current paper, is a timely endeavor.

The main conclusion of this application of the analytical method is that visual analysis is necessary for focusing on different aspects of the data, such as an unstable baseline which is not well represented by mean or trend lines, a somewhat delayed slope change, and a considerable amount of overlap only in the beginning of the second condition but not at the end. The variability and relative shortness of the first phase (although it meets the current standards of five measurements; Kratochwill et al., 2010) have to be kept in mind when comparing it to the measurements obtained in the subsequent condition. In the current case, the visual aids reflected this variability and suggested a similar conclusion to the one based on the substantive criterion expressed as a cut-off point. All this information is critical for correctly interpreting the numerical values yielded by descriptive statistical procedures. Actually, we preferred to use a data set that is challenging for the quantitative analyses in order to alert applied researchers to the need to interpret numerical values with caution and to use all information available; we also wanted to avoid doubts about the data being picked only to show the quantification in a positive way (Fisher and Lerman, 2014). Finally, the follow-up measures, the parent-report and the physiological measures recorded by Bagner et al. (2009) also contribute to building solid conclusions. The two-phase design may not be sufficient for establishing a causal effect in a scientifically sound way, but there is enough information pointing to the clinically important reduction of problematic behavior.

FIGURE 4 | Graphical representations of the Bagner et al. (2009) data gathered through observation in the family context: upper panel – trend lines; middle panel – split middle and trend envelope; lower panel – standard deviation bands.

## DISCUSSION

The present work focused on the question of what can be done to improve the data analysis in studies/practices using sub-optimal designs in such a way that results are more useful to the discipline. We recommended an analytical method consisting of structured visual analysis complemented with descriptive statistical procedures, while also keeping in mind substantive criteria (i.e., the opinion of the individuals involved in the process: family members, teachers, peers, coworkers, or supervisors). On the one hand, quantifications are useful for summarizing different aspects of the data and making the results available for subsequent meta-analysis. On the other hand, visual analysis is required for gaining an in-depth knowledge of the data and for assessing the adequacy of any specific quantitative procedures, due to the lack of consensus regarding the most appropriate technique (Tate et al., 2013).

A second question concerned the availability of tools for implementing the procedures proposed as part of the analytical method. We have mentioned, referenced, and illustrated the output of several tools implemented in the freeware R. Some of them are based on clickable menus, whereas others only require inputting the data before copying and pasting the code. The availability of software is crucial for eliminating errors in obtaining the numerical and graphical results and in terms of time efficiency, both for short and relatively straightforward data series (e.g., Bunn et al., 2005) and for longer series with less visually clear data patterns (e.g., Abney et al., 2014).

One potential issue with the analytical method is that it is possible that, in some instances, the three components do not coincide. A cautious approach would be to gather follow-up data after a certain period of time in order to check whether the initial ambiguous result of the assessment still holds. In case the unclear change is maintained and perceived as a change by the participants, then there would be evidence in favor of its practical importance. If there is disagreement between the substantive criterion and the other two components, we think that if the clients' well-being, quality of life, functionality, performance, etc. are improved according to their own opinion, then the substantive criterion should prevail, regardless of its numerical expression. In any case, the general effectiveness of an intervention depends on replications (Pashler and Wagenmakers, 2012) and not on the numerical result in a single study. Finally, if there is a divergence between the visual and quantitative information, it is important to know: (a) whether there is any data feature (e.g., pre-intervention trend, outliers) that might affect the performance of the quantitative analysis, in which case visual inspection should prevail; or (b) whether the data pattern prevents getting a clear visual impression (e.g., due to highly variable data and/or a complex design structure), in which case the quantitative summary is potentially more useful.

Another issue with the analytical method is that it might fail in certain situations such as the ones described in this paragraph (the list is not necessarily comprehensive). First, it is possible that the pre-intervention phase is too short or the measurements too variable for estimating trend with precision: the SLC quantifications would be less useful, but if there is no clear evidence of trend, then the NAP can be used as main quantification. Second, if there is complete non-overlap between the observations of the two conditions, the NAP will not be very informative, but the SLC can be used as an unstandardized quantification of the amount of difference and the *d*-statistic as a standardized quantification if more than one participant is being studied. Third, there might be a non-linear trend present in data, which is not an optimal situation for applying the SLC. In such case running medians (Tukey, 1977) can be used as a visual aid via the SCDA plug-in for R, while data modeling via the generalized least squares approach and LOESS is also possible. Fourth, there might be a delayed change in the behavior, not occurring simultaneously with the change in conditions (an issue that has remained practically unstudied except for Lieberman et al., 2010). In such case, the descriptive statistics will reflect the delay with lower quantifications of the effect, but it would be crucial to explore the cause of the change among the external uncontrolled factors (i.e., the solution is not an analytical one), given that the immediacy of the effect is one of the cornerstones for demonstrating causality (Kratochwill et al., 2010).

We hope that the discussion presented here will help practitioners and applied researchers to apply a systematic approach to data analysis and take a step toward partially improving the methodological quality of the studies. However, this would only be *one* step and studies would also need to meet the recommendations about the assessment and measurement of the target behavior, the implementation of the intervention, and the use of blinding to ensure objectivity, and also about reporting the results of the study (Tate et al., in press). Finally, it should always be considered whether what is assessed can be considered an "intervention effect" (in causal terms) or only a "behavioral change," which after several replications might point at the possible effectiveness of the intervention. In that sense, the analytical method was described in the context of studies with less-than-optimal designs in which causal relations cannot be readily established. Nonetheless, it is possible to extrapolate the method to experimental situations (e.g., multiple-baseline designs in which it is crucial to assess whether the behavioral change coincides with the staggered introduction of the intervention).

As a limitation of the quasi-statistical component of the analytical method, it is debatable whether the numerical results can be presented confidently in the absence of a conventionally accepted optimal procedure, i.e., when all analytical techniques can be criticized. Considering the analytical method as a whole, further discussion is necessary on how to proceed when practitioners are faced with data that cannot be easily analyzed visually or quantitatively (e.g., short series, great data variability). One option would be to use the substantive criteria as the basis for the conclusions and label the study as "practice" but not as "research." In contrast, when all three pieces of information (visual, quantitative, and substantive) coincide, it still has to be kept in mind that not meeting current *Standards* (Kratochwill et al., 2010) could give two-phase studies only a "pilot" status and, when included in meta-analysis, they are likely to be assigned lower weights and have less influence on the summary measures obtained.

## AUTHOR CONTRIBUTIONS

The initial idea was due to JL and it was subsequently complemented and further developed by RM. The manuscript was written by JL (observational, non-experimental conceptual part in the Introduction) and RM (analytical part in the Analytical Method Explained, Analytical Method Applied, and Discussion). SC-M and SS-C made substantial contributions to the design of the work. All four authors (RM, JL, SC-M, and SS-C) participated in several revisions during the process of creating, discussing, and improving the manuscript, with RM leading all revisions and guiding the continuous improvement of the manuscript; gave their consent that this final version be submitted for publication; and agreed on their co-responsibility regarding all aspects of the work, such as the accuracy of the data and the integrity of the research.

## ACKNOWLEDGMENTS

This study forms part of the results obtained in research project PSI2011-29587, funded by Spain's Ministry of Science and Innovation; and in research project number 1150096, funded by the Chilean National Fund of Scientific and Technological Development (FONDECYT).

## REFERENCES

Young, L. C. (1941). On the randomness in ordered sequences. *Ann. Math. Stat.* 12, 293–300. doi: 10.1214/aoms/1177731711

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2016 Manolov, Losada, Chacón-Moscoso and Sanduvete-Chaves. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Evaluation of a Psychological Intervention for Patients with Chronic Pain in Primary Care

Francisco J. Cano-García<sup>1</sup>\*, María del Carmen González-Ortega<sup>1</sup>, Susana Sanduvete-Chaves<sup>2</sup>, Salvador Chacón-Moscoso<sup>2,3</sup> and Roberto Moreno-Borrego<sup>4</sup>

<sup>1</sup> Departamento de Personalidad, Evaluación y Tratamiento Psicológicos, Universidad de Sevilla, Seville, Spain, <sup>2</sup> Departamento de Psicología Experimental, Facultad de Psicología, Universidad de Sevilla, Seville, Spain, <sup>3</sup> Departamento de Psicología, Universidad Autónoma de Chile, Santiago, Chile, <sup>4</sup> Centro de Atención Primaria Príncipe de Asturias, Servicio Andaluz de Salud, Utrera, Spain

#### Edited by:

Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy

#### Reviewed by:

Daniel Saverio John Costa, University of Sydney, Australia Bárbara Oliván Blázquez, University of Zaragoza, Spain Deepak Kumar, Institute of Human Behaviour and Allied Sciences, India

\*Correspondence:

Francisco J. Cano-García fjcano@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 10 December 2016 Accepted: 08 March 2017 Published: 23 March 2017

#### Citation:

Cano-García FJ, González-Ortega MC, Sanduvete-Chaves S, Chacón-Moscoso S and Moreno-Borrego R (2017) Evaluation of a Psychological Intervention for Patients with Chronic Pain in Primary Care. Front. Psychol. 8:435. doi: 10.3389/fpsyg.2017.00435

According to evidence from recent decades, multicomponent programs of psychological intervention in people with chronic pain have reached the highest levels of efficacy. However, there are still many questions left to answer since efficacy has mainly been shown among upper-middle class patients in English-speaking countries and in controlled studies, with expert professionals guiding the intervention and with a limited number of domains of painful experience evaluated. For this study, a program of multicomponent psychological intervention was implemented: (a) based on techniques with empirical evidence, but developed in Spain; (b) at a public primary care center; (c) among patients with limited financial resources and lower education; (d) by a novice psychologist; and (e) evaluating all domains of painful experience using the instruments recommended by the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT). The aim of this study was to evaluate this program. We selected a consecutive sample of 40 patients treated for chronic non-cancer pain at a primary care center in Utrera (Seville, Spain), adults who were not in any employment dispute, not suffering from psychopathology, and not receiving psychological treatment. The patients participated in 10 psychological intervention sessions, one per week, in groups of 13–14 people, which addressed psychoeducation for pain; breathing and relaxation; attention management; cognitive restructuring; problem-solving; emotional management; social skills; life values and goal setting; time organization and behavioral activation; physical exercise promotion; postural and sleep hygiene; and relapse prevention. In addition to the initial assessment, measures were taken after the intervention and at a 6-month follow-up. We assessed the program throughout the process: before, during and after the implementation. Results were analyzed statistically (significance and effect size) and from a clinical perspective (clinical significance according to IMMPACT standards). According to this analysis, the intervention was successful, although improvement tended to decline at follow-up, and the detailed design gave the program assessment a high degree of standardization and specification. Finally, suggestions for improvement are presented for upcoming applications of the program.

Keywords: formative evaluation, clinical effectiveness, chronic pain, Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT), methodological quality, primary care

## INTRODUCTION


Pain is an unpleasant sensory and emotional experience associated with actual or potential tissue damage, or described in terms of such damage (Merskey, 1994). Pain becomes chronic when it loses its adaptive function, lasts longer than expected (3–6 months), and does not respond to the prescribed medical treatments. Pain and chronic pain are global, complex experiences for human beings, and interdisciplinary theoretical models have been developed to study them. One such model is the gate control theory (Melzack and Wall, 1967) and its more recent version, the neuromatrix theory (Melzack, 1999). Essentially, painful experience is defined at different levels here, including the sensory, behavioral, emotional and cognitive level, all of which are integrated in a more comprehensive framework of stress processes (for a more detailed description, see Gatchel et al., 2007). For this reason, psychology's contribution to the study and treatment of chronic pain has been critically important for the past few decades.

Chronic pain is a public health issue in the developed world. In an aging population like that of Europe, 19% of the population suffers from chronic pain; in Spain, where this study was conducted, the prevalence of chronic pain stands at 11%. A recent study by Andrew et al. (2014) estimated the costs associated with chronic pain. In the workplace, for every dollar lost by the average person, the costs associated with a person suffering from chronic pain are between \$3.60 and \$12.50 for absenteeism, between \$2.50 and \$3.00 for loss of productivity, and between \$1.90 and \$2.60 in paid unemployment. In terms of healthcare costs, for every dollar spent on other patients, the costs associated with a person suffering from chronic pain are between \$2.50 and \$3.00 in visits to primary care centers, between \$3.30 and \$7.60 in hospital stays, \$4.00 in medicine and \$3.00 in emergency care.

The gateway to healthcare systems for patients with chronic pain is usually the primary care center, as seen in Europe, where 70% of these patients saw a general practitioner (Breivik et al., 2006). Patients with chronic pain are seen as challenging but low-priority, similar to those suffering from mental health disorders and in contrast to high-priority patients like those suffering from cardiovascular disease (Johnson et al., 2013). Although professionals who see such patients usually have clinical practice guidelines, they tend not to use them to either evaluate or treat such patients because they are overwhelmed by the quantity and complexity of the demand. In most cases, such physicians limit themselves to prescribing drugs or referring the patient to a specialist.

There is unquestionable evidence on the efficacy of psychological intervention in chronic pain. According to the Society of Clinical Psychology (APA, 2016), evidence is particularly strong for two types of psychological intervention: cognitive-behavioral therapy (Morley et al., 1999; Huguet et al., 2014; Cherkin et al., 2016; Kroner et al., 2016) and Acceptance and Commitment Therapy (Veehof et al., 2011, 2016; Hann and McCracken, 2014). Other treatment options like relaxation therapy (Meeus et al., 2014), guided meditation and hypnosis have yielded moderate efficacy levels. Finally, evidence of efficacy has been growing for more recent treatment options such as eye movement desensitization and reprocessing (EMDR) (Tesarz et al., 2014) and particularly, mindfulness (Lauche et al., 2013). Given the current state of knowledge, multicomponent psychological treatments could be considered more efficacious than others and represent a viable alternative for healthcare when applied in small groups (APA, 2016). However, identifying efficacious treatment is one thing and getting the general population to benefit from such treatment is quite another. A good example of this is an epidemiological study conducted among 2,596 fibromyalgia patients in the USA: only 8% had received cognitive-behavioral therapy (Bennett et al., 2007).

In the scientific study of pain, the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) began in 2002 to improve the quality of assessments in clinical trials, bringing together scholars, regulatory bodies and public healthcare institutions, consumer and patient associations, and representatives from the pharmaceutical industry. Various scientific disciplines within healthcare like anesthesiology, clinical pharmacology, internal medicine, law, neurology, nursing, oncology, psychology, rheumatology and surgery are part of IMMPACT. The initiative has yielded three main results: the identification of the basic and complementary areas of the pain experience that must be evaluated (Turk et al., 2003; McGrath et al., 2008); the identification, development and validation of instruments to assess them (Dworkin et al., 2005; Turk et al., 2006; McGrath et al., 2008); and the determination of clinical importance standards to assess treatment outcomes (Dworkin et al., 2008, 2009; Turk et al., 2008).

The evidence presented above regarding both psychological treatment and the IMMPACT initiative has generally been produced by studies conducted in ideal conditions, with the funding necessary for an adequate selection of participants: expert psychologists, patients with middle-high educational levels who are motivated to participate and do not leave the study, etc. Studies conducted in such conditions leave many questions about the efficacy of psychological intervention unanswered. In particular, little information is available on clinical efficacy in real healthcare contexts: what if the studies focused on patients from a rural area in the south of Spain with different educational levels and a different sociodemographic background? What happens when they visit a primary care facility and are seen by a novice psychologist?

Ehde et al. (2014) addressed these challenges in an interesting review on cognitive-behavioral therapy for patients with chronic pain. The authors found only one study with rural and low literacy samples (Thorn et al., 2011). Worse still, they found no study that considered the level of experience of the therapist, but indicated that this variable might be relevant, since cognitive-behavioral therapy is more effective when performed by psychologists than other care providers (Nicholas et al., 2011).

These questions are what motivated us to assess a multicomponent cognitive-behavioral program specifically designed for patients with chronic pain, applied in a public primary care center located in the south of Spain, with participants from different socioeconomic and educational backgrounds and implemented by an inexperienced psychologist.

## MATERIALS AND METHODS


## Participants

Patients at the Príncipe de Asturias primary care center participated in the study. The primary care center is located in Utrera, a small rural town in the province of Seville, Spain.

The inclusion criteria were the following: (a) being at least 18 years old; (b) having visited primary care due to difficulties handling chronic pain during the recruitment period (presenting maladaptive adjustment to pain); (c) not being in the middle of an employment dispute or waiting for approval of a disability pension; (d) not having a primary psychopathological disorder; (e) not being in psychiatric or psychological treatment (taking psychotropic drugs was permitted); (f) being able to follow group sessions, thus excluding conditions such as deafness, blindness, or dementia; (g) being willing to sign an agreement to attend the sessions (group and/or individual); and (h) not being hospitalized.

## Design

This study presents a quasi-experimental one-group pre-test – post-test – follow-up design (Shadish et al., 2002; Chacón-Moscoso et al., 2008). This means that there are three measurement instances: one before the intervention and two after the intervention (specifically, one immediately after the intervention and another 6 months later). Additionally, this design lacks a control group. As we are interested in studying the change over time in only one group, this is a within-subject design (APA, 2010).

## Variables and Measures

The IMMPACT recommendations were used to assess the pain experience in terms of both procedures and instruments (Turk et al., 2003; Dworkin et al., 2005). The assessment covered pain, physical functioning, emotional functioning, and the patient's rating of change. Although IMMPACT recommendations do not establish the main assessment variables, pain, specifically pain intensity, is usually considered a primary outcome. As a result, the remaining areas and variables would be considered secondary in this study, but also extremely important as indicators of possible improvements in the patients' quality of life. To evaluate pain, the patient was asked to describe the intensity of perceived pain in the 24-h period preceding the interview and at the time of the interview, using a numerical scale with 0 meaning "No pain" and 10 meaning "Pain as bad as you can imagine" (Dworkin et al., 2005).

Physical functioning was evaluated through (1) the items "How much has pain interfered in your daily life during the last 24 h?" and "How much is pain interfering right now?", with a four-point rating scale where 0 is "nothing" and 3 is "totally"; and (2) the Spanish language version of the pain interference subscale (Ferrer et al., 1993) of the West Haven-Yale Multidimensional Pain Inventory (WHYMPI) (Kerns et al., 1985). The WHYMPI is the first psychometric instrument for multidimensional pain evaluation. The 11-item interference subscale uses a seven-point Likert scale (0–6) to rate pain interference in daily life; the total score is then divided by the number of items. The psychometric properties of the original scale have been clearly demonstrated internationally (Haythornthwaite, 2003). Cronbach's α was 0.68 for the Spanish language version of the interference scale (Ferrer et al., 1993).
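As an illustration of the scoring rule just described (sum of the 11 item ratings divided by the number of items), the following minimal Python sketch may be useful. The function name and the example ratings are hypothetical and are not taken from the study data.

```python
def whympi_interference(item_ratings):
    """Mean of the 11 interference items, each rated from 0 to 6."""
    if len(item_ratings) != 11:
        raise ValueError("the interference subscale has 11 items")
    if any(not 0 <= rating <= 6 for rating in item_ratings):
        raise ValueError("each item is rated on a 0-6 scale")
    # Total points divided by the number of items, as described in the text.
    return sum(item_ratings) / len(item_ratings)


# Hypothetical ratings for one patient (illustration only).
example_ratings = [4, 5, 3, 4, 6, 2, 4, 5, 3, 4, 4]
example_score = whympi_interference(example_ratings)
```

Because the score is an item mean, it stays on the same 0–6 metric as the individual items, which is what makes a fixed change criterion on this subscale interpretable.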

Two instruments were used to evaluate emotional functioning. The first was the Profile of Mood States (POMS) (Haythornthwaite, 2004). Using 58 adjectives rated from 0 (not at all) to 4 (extremely) on a five-point Likert scale, this psychometric instrument assesses six mood states: Fatigue (0–28), Depression (0–60), Tension (0–36), Hostility (0–48), Confusion (0–28), and Vigor (0–32). In addition to the six partial scores, it provides a global score on Total Mood Disturbance that ranges from −32 to 200, obtained by adding the scores for Fatigue, Depression, Tension, Hostility and Confusion, and subtracting the score for Vigor. The POMS properties have also been demonstrated within the framework of IMMPACT, with an internal consistency of the different scales between 0.63 (Confusion) and 0.96 (Depression) (Dworkin et al., 2005). The second was the Beck Depression Inventory (BDI) (Beck et al., 1961). This instrument comprises 21 items that are answered on a four-point Likert scale (0–3). A total score ranging from 0 to 63 is obtained by adding the values given for the 21 items. Higher values mean higher levels of depression. Specifically, 0–9 indicates none or minimal depression; 10–18 indicates mild to moderate depression; 19–29 indicates moderate to severe depression; and 30–63 indicates severe depression. This tool presents evidence of reliability and validity in the assessment of symptoms of depression and emotional distress (Dworkin et al., 2005).
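The two scoring rules in this paragraph can be sketched in a few lines of Python. This is only an illustration under the score ranges and cut-offs stated above; the function names and dictionary layout are hypothetical.

```python
# Score ranges of the six POMS subscales as given in the text.
POMS_RANGES = {
    "fatigue": (0, 28),
    "depression": (0, 60),
    "tension": (0, 36),
    "hostility": (0, 48),
    "confusion": (0, 28),
    "vigor": (0, 32),
}


def total_mood_disturbance(scores):
    """Sum of the five negative subscales minus Vigor (range -32 to 200)."""
    for name, (low, high) in POMS_RANGES.items():
        if not low <= scores[name] <= high:
            raise ValueError(f"{name} score out of range")
    negative = sum(scores[name] for name in POMS_RANGES if name != "vigor")
    return negative - scores["vigor"]


def bdi_category(total):
    """Map a BDI total (0-63) to the severity bands listed in the text."""
    if not 0 <= total <= 63:
        raise ValueError("BDI total must lie between 0 and 63")
    if total <= 9:
        return "none or minimal"
    if total <= 18:
        return "mild to moderate"
    if total <= 29:
        return "moderate to severe"
    return "severe"


# Boundary checks: the extreme profiles reproduce the stated range.
worst = {name: high for name, (low, high) in POMS_RANGES.items()}
worst["vigor"] = 0
best = {name: 0 for name in POMS_RANGES}
best["vigor"] = 32
```

With these extreme profiles, `total_mood_disturbance(worst)` gives 200 and `total_mood_disturbance(best)` gives −32, which matches the range reported for the global score.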

The expected rating of change (pre-test) and the rating of change (post-test and follow-up) were evaluated using the Patient Global Impression of Change Scale (PGIC) (Guy, 1976). This measure is a single-item rating of the patient's perceived improvement as a result of treatment, on a seven-point scale that ranges from 1 "very much worse" to 7 "very much improved," with "no change" at the middle of the scale. Due to its simplicity, validity and reliability, the PGIC was included as a scale recommended by IMMPACT (Farrar, 2003).

Patient willingness was evaluated using the CONSORT (Consolidated Standards of Reporting Trials) guideline (Altman et al., 2001; Moher et al., 2001), which provides information on recruitment processes; the number of candidates excluded and the reasons for exclusion; the number of candidates who did not start treatment and the reasons; and the number of participants who abandoned treatment and the reasons.

## Psychological Intervention

The psychological intervention consisted of a multicomponent protocol developed and published in Spain by a group of professionals and scholars, including one of the authors of this work (FJC). This protocol incorporates the principal cognitive-behavioral techniques with evidence of efficacy in pain treatment and combines them with a few others inspired by Acceptance and Commitment Therapy. A description of the program can be found in Moix and Casado (2011) and the full program is available at Kovacs and Moix (2011).

The program is structured in 10 weekly sessions, each lasting an hour and a half, that approach the following topics sequentially: (1) introduction to cognitive-behavioral intervention; (2) breathing and relaxation; (3) attention management; (4) cognitive restructuring I; (5) cognitive restructuring II; (6) problem-solving; (7) emotional management and assertiveness; (8) life values and goal setting; (9) time management and reinforcement activities; and (10) exercise, postural and sleep hygiene and relapse prevention.

Each session consists of three parts: first, a review of doubts and the tasks presented in the previous session; second, a discussion of the contents corresponding to the current session; and third, an overview of the tasks for the following session.

In addition to providing a handbook for the therapist, the program provides each patient with a dossier that includes a summary of the sessions and the tasks to accomplish as well as a CD audio guide on the breathing and relaxation exercises done in session 2.

The 40 patients assigned to the intervention were divided into three groups based on age and gender variables that will be detailed in Section "Results." The first consisted of 14 women ages 33–55, the second of 13 men ages 33–55, and the last of 13 patients (eight women and five men) between ages 55 and 69. The total compliance rates for the full sessions were 78% in group 1 and 69% in groups 2 and 3. In groups 2 and 3, the intervention was not applied to two patients and in group 1, it was not applied to one patient; one patient from group 1, three from group 2 and two from group 3 discontinued.

## Procedure

This study was carried out in accordance with the recommendations of the Ethics Committee of the Southern Seville Health District (Andalusian Health Service). All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Southern Seville Health District (Andalusian Health Service).

This study was carried out as part of a scientific-technical agreement with Southern Seville Primary Care. As part of this agreement, the second author of this study, MCG, then a post-graduate student, was selected through Ícaro<sup>1</sup> , a blog to manage practices in business and employment, as the psychologist who would carry out the study. She was selected because of an impressive academic record and after receiving a positive evaluation in a personal interview. The first author, FJC, informed her of the aim of the intervention and the task she was going to carry out; gave her all the materials (slides, handbook, dossier for the patients and CDs to be used during relaxation techniques); and provided her with training in a 4-h session.

The first step was to get the healthcare personnel, doctors and nurses involved in patient information and recruitment. This task was handled by the last author of this study, RM. Recruitment relied on the inclusion criteria specified in Section "Participants."

<sup>1</sup>https://icaro.ual.es/

Following patient recruitment by the healthcare personnel, MCG informed the patients what the study entailed. The patients then signed the informed consent form and their first appointment was scheduled. During that appointment, each patient had a one-on-one interview with an undergraduate psychology student instructed in the application of the measures to be used in the study. Next, they participated in the group intervention with MCG. The sessions were held in a meeting room in the center with audiovisual equipment and mats for the participants to do the breathing and relaxation exercises.

Formative assessments (Chacón-Moscoso et al., 2013) were done throughout the process (before, during and after the implementation of the program). Immediately after the program ended and 6 months later, another assessment session with a one-on-one interview like the one described above was held.

All the data collected before the intervention, immediately afterward and 6 months later were anonymously added to a database by interning students from the authors' departments, supervised by two of the authors, SS and SC, who also did the statistical analysis using SPSS 22.

## Statistical Analyses

Cronbach's (1951) α was used to test the reliability of the measures gauged with psychological tests and comprising more than one item, specifically the pain interference subscale of WHYMPI, POMS (subscales and global score), and BDI. Additionally, given the small sample size, in order to obtain a more precise reliability coefficient the unbiased estimator of Cronbach's α was calculated (Feldt et al., 1987); and the significance of each unbiased estimator was calculated using the procedure of Kristof (1963) and Feldt et al. (1987). Following the criteria established by George and Mallery (2003), values above 0.9 were considered excellent; between 0.8 (excluded) and 0.9 (included), good; between 0.7 (excluded) and 0.8 (included), acceptable; and between 0.6 (excluded) and 0.7 (included), questionable. Following the criteria of Huh et al. (2006), values equal to or higher than 0.7 were considered appropriate.
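To make the reliability coefficient and the interpretation bands concrete, a minimal Python sketch follows. It computes the standard (uncorrected) Cronbach's α from a respondents-by-items table and applies the bands quoted above; it does not implement the unbiased estimator or its significance test, and the function names and toy data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Standard Cronbach's alpha; rows are respondents, columns are items."""
    k = len(rows[0])
    item_columns = list(zip(*rows))
    sum_item_variances = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


def reliability_label(alpha):
    """Bands as quoted in the text; lower bounds excluded, upper included."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    if alpha > 0.7:
        return "acceptable"
    if alpha > 0.6:
        return "questionable"
    return "below the ranges listed"


# Toy data (illustration only): three perfectly consistent items.
toy_rows = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
toy_alpha = cronbach_alpha(toy_rows)
```

For the toy data the coefficient equals 1 (up to floating-point rounding), as expected when all items rank respondents identically.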

To study the changes to the different dependent variables across the three measurement instances (pre-test, post-test and follow-up), we first checked the normality assumption using Shapiro–Wilk's test –W– (Shapiro and Wilk, 1965), adequate for small samples (N ≤ 50). When normal distribution was rejected (p ≤ 0.05), we used a non-parametric test (Friedman test); when this assumption was accepted (p > 0.05), we calculated a parametric test (ANOVA for repeated measures). In the case of ANOVA, Mauchly's test of sphericity was calculated. When sphericity was assumed (p > 0.05), no correction of degrees of freedom (df) of F distribution was made; when it was rejected (p ≤ 0.05), df were multiplied by Greenhouse–Geisser's epsilon to correct them.
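The decision rules in this paragraph can be summarized as a short Python sketch. It only encodes the branching logic, taking p values and the Greenhouse–Geisser epsilon as inputs computed elsewhere (e.g., in SPSS); the function names are hypothetical, and treating a variable as non-normal when normality is rejected in at least one measurement instance is how the rule is applied in the Results.

```python
def choose_test(shapiro_p_values, alpha=0.05):
    """Pick the omnibus test for one variable from its three normality p values."""
    # Normality is rejected when p <= alpha in at least one instance.
    if any(p <= alpha for p in shapiro_p_values):
        return "Friedman"
    return "repeated-measures ANOVA"


def anova_df(n_subjects, n_levels, mauchly_p, gg_epsilon, alpha=0.05):
    """Degrees of freedom of F, corrected only when sphericity is rejected."""
    df_effect = n_levels - 1
    df_error = (n_levels - 1) * (n_subjects - 1)
    if mauchly_p <= alpha:
        # Sphericity rejected: multiply both df by Greenhouse-Geisser epsilon.
        return gg_epsilon * df_effect, gg_epsilon * df_error
    return float(df_effect), float(df_error)
```

For example, `choose_test([0.027, 0.30, 0.50])` selects the Friedman test, whereas with a hypothetical sample of 40 subjects and three instances, `anova_df(40, 3, 0.20, 0.8)` leaves the degrees of freedom at 2 and 78.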

Additionally, linear and quadratic trend contrasts were used to compare the three levels (pre-test, post-test, and follow-up). ANOVA trend analysis was used as a parametric test and showed results to be statistically significant when p < 0.05. As a non-parametric test, post hoc comparisons for trends were used (Marascuilo and McSweeney, 1967); here results were statistically significant when zero was not included in the interval obtained with a confidence level of 0.95. A significant linear trend would be interpreted as an increase, or at least maintenance, of changes detected in post-test during follow-up. In our case, this would be ideal. A significant quadratic trend would be interpreted as a reversal of the change detected in post-test during follow-up.
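The logic of the two trends can be illustrated with the standard orthogonal polynomial weights for three equally spaced levels, (−1, 0, 1) for the linear contrast and (1, −2, 1) for the quadratic one. The Python sketch below only computes the contrast values on hypothetical means; it does not perform the significance tests described above.

```python
# Standard orthogonal polynomial weights for three equally spaced levels.
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)


def contrast(means, weights):
    """Weighted sum of the pre-test, post-test and follow-up means."""
    return sum(w * m for w, m in zip(weights, means))


# Hypothetical mean pain scores (pre-test, post-test, follow-up).
maintained = (6.0, 4.0, 4.0)  # improvement kept at follow-up
reverted = (6.0, 4.0, 6.0)    # improvement lost at follow-up

maintained_linear = contrast(maintained, LINEAR)
maintained_quadratic = contrast(maintained, QUADRATIC)
reverted_linear = contrast(reverted, LINEAR)
reverted_quadratic = contrast(reverted, QUADRATIC)
```

In the reverted pattern the linear contrast is zero and all the change is carried by the quadratic contrast, whereas in the maintained pattern the linear contrast is non-zero, which is the distinction the interpretation above relies on.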

To calculate the effect size in the case of ANOVA, we took into account that the partial eta squared and omega squared indices can be overestimated in repeated-measures designs (Olejnik and Algina, 2003). For this reason, we calculated r<sup>2</sup> by dividing the sum of squares of the within-subject effect by the sum of three terms: the sum of squares of the within-subject effect, the sum of squares of the between-subject error, and the sum of squares of the within-subject error. To calculate the effect size in the case of the Friedman test, we calculated Kendall's W coefficient of concordance, considered a strength-of-relationship index. It ranges from 0 to 1, and higher values indicate a stronger relationship (Green and Salkind, 2010). To interpret the effect size, we followed the conventional levels (Cohen, 1992): small (0.01), medium (0.06), and large (0.16).
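As an illustration of the non-parametric effect size, the following Python sketch computes Kendall's W from a subjects-by-instances table using the textbook formula W = 12S / (n²(k³ − k)), where S is the sum of squared deviations of the rank sums from their mean. It uses average ranks for ties but applies no tie correction to W, which is a simplifying assumption; the function names, the toy data, and the reading of the three conventional levels as lower bounds are also assumptions made for this example.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(rows):
    """Kendall's W without tie correction; rows are subjects, columns instances."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for column, rank in enumerate(average_ranks(row)):
            rank_sums[column] += rank
    mean_rank_sum = n * (k + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (n ** 2 * (k ** 3 - k))


def effect_size_label(value):
    """Levels quoted in the text, read here as lower bounds (an assumption)."""
    if value >= 0.16:
        return "large"
    if value >= 0.06:
        return "medium"
    if value >= 0.01:
        return "small"
    return "below small"


# Toy data: every subject shows the same ordering across the three instances.
agreeing = [[7, 4, 5], [8, 5, 6], [6, 3, 4], [9, 6, 7]]
opposing = [[1, 2, 3], [3, 2, 1]]
```

Complete agreement in the ordering of the three instances gives W = 1, and perfectly opposed orderings give W = 0, which are the two ends of the range mentioned in the text.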

Finally, we used the IMMPACT clinical importance criteria (Dworkin et al., 2008). In terms of pain intensity, score drops (mean differences) between 1 and 2.9 were considered minimally important; 3–4.9, moderately important; and above 5, substantial. In terms of the WHYMPI interference subscale, score drops equal to or higher than 0.6 were considered clinically important. For the POMS subscales, a reduction (or increase in the case of Vigor) of the score equal to or higher than two points was considered clinically important. In the case of the scale total, the required reduction was at least 10 points. Finally, in terms of the patient's perception of improvement (PGIC), minimally improved (category 5) suggests a minor change; much improved (category 6), a moderately important change; and very much improved (category 7), a substantial change. In all cases, we compared the score obtained in pre-test with post-test and pre-test with follow-up.
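The criteria listed above lend themselves to a simple rule-based sketch in Python. The thresholds are those stated in the text; the function names are hypothetical, the lowest pain tier is labeled "minimally important" as in the Results, and treating the boundaries as "greater than or equal to" cut-offs (including a drop of exactly 5 as substantial) is an assumption, since the text leaves small gaps between bands.

```python
def pain_intensity_importance(pre, post):
    """Classify a drop in mean pain intensity (0-10 numerical scale)."""
    drop = pre - post
    if drop >= 5:
        return "substantial"
    if drop >= 3:
        return "moderately important"
    if drop >= 1:
        return "minimally important"
    return "no clinically important change"


def whympi_important(pre, post):
    """Interference subscale: a drop of at least 0.6 points."""
    return pre - post >= 0.6


def poms_subscale_important(pre, post, vigor=False):
    """Subscales: change of at least 2 points (an increase for Vigor)."""
    change = post - pre if vigor else pre - post
    return change >= 2


def poms_total_important(pre, post):
    """Total Mood Disturbance: a reduction of at least 10 points."""
    return pre - post >= 10


# PGIC categories that indicate improvement, as described in the text.
PGIC_IMPORTANCE = {
    5: "minor change",
    6: "moderately important change",
    7: "substantial change",
}
```

Each function would be applied twice per variable, once to the pre-test versus post-test means and once to the pre-test versus follow-up means, mirroring the two comparisons described at the end of the paragraph.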

## RESULTS

Forty patients participated in the study. The age range was 33–69, with an average age of 47.9 and a standard deviation of 8.68. Twenty-two patients (55%) were women and 18 (45%) were men; 32 (80%) were married or lived with a partner; four (10%) were separated or divorced; three (7.5%) were single; and one (2.5%) was widowed. Eighteen (45%) had finished only elementary school and 10 (25%) had not; 11 (27.5%) had received their high school degree; and only one (2.5%) had attended college. In terms of employment, nine (22.5%) were unemployed; nine (22.5%) were housewives; 12 (30%) worked; eight (20%) had received early retirement for illness; one (2.5%) had retired after reaching retirement age; and one (2.5%) was laid off. According to their diagnoses, 22 (55%) were suffering from chronic low back pain, 12 (30%) from fibromyalgia and the remaining six (15%) from chronic headaches. They had been dealing with chronic pain for 2–30 years, with an average of 16.75 years and a standard deviation of 9.14 years. In 22 (55%) of the cases, the patient's support person was their partner or spouse; in 13 (32.5%) of the cases, their father or mother; and in the remaining five (12.5%), other people. In 30 (75%) of the cases, the support person lived with the patient.

One important advantage of this intervention program is its high degree of standardization and specificity, aspects that facilitate its assessment and its replication and, as a consequence, allow its results to be generalized. Next we present the evaluation of the intervention program before, during and after the implementation.

## Before the Intervention: Needs Assessment, and Evaluation of Objectives and Design

In general, as this stage was based on IMMPACT recommendations, the objectives, design and instruments used to measure the aspects that the intervention aims to improve were all based on empirical evidence and a theoretical framework.

In order to facilitate the comparison with the results (measures before and after the intervention), information about the scores obtained by the sample before the intervention and its reliability are presented in Section "After the Intervention: Evaluation of Outcomes."

The study of the internal coherence of the program yielded adequate results: all the needs had an associated objective, and at least one activity was included for each objective. Specifically, sessions 2 (training in breathing and relaxation) and 10 (physical activity, sleep and postural hygiene, and relapse prevention) were developed to reduce perceived pain; sessions 3 (attention management), 6 (problem solving), 8 (life values and goal setting), 9 (time management and reinforcement activities) and 10 were implemented to reduce the degree to which pain interferes in a patient's life; sessions 4 and 5 (cognitive restructuring), 6, and 7 (emotional management and assertiveness) were developed in order to improve mood; and all activities (from session 1, the introduction to cognitive-behavioral intervention, through session 10) had a positive influence on patients' perceived satisfaction. Additionally, the timeframe was realistic and the materials available for each activity were made explicit.
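The coherence check described here (every objective served by at least one session, and no session left without an objective) can be expressed as a small Python sketch. The mapping reproduces the session assignments given in the paragraph; the objective labels, variable names and function names are hypothetical.

```python
# Sessions assigned to each objective, as listed in the text.
SESSIONS_BY_OBJECTIVE = {
    "reduce perceived pain": {2, 10},
    "reduce pain interference": {3, 6, 8, 9, 10},
    "improve mood": {4, 5, 6, 7},
    "perceived satisfaction": set(range(1, 11)),
}


def uncovered_objectives(mapping):
    """Objectives with no associated session."""
    return [objective for objective, sessions in mapping.items() if not sessions]


def unused_sessions(mapping, n_sessions=10):
    """Sessions that serve no objective at all."""
    used = set().union(*mapping.values())
    return sorted(set(range(1, n_sessions + 1)) - used)


coherence_gaps = (
    uncovered_objectives(SESSIONS_BY_OBJECTIVE),
    unused_sessions(SESSIONS_BY_OBJECTIVE),
)
```

Both lists come back empty for this mapping, which corresponds to the adequate internal coherence reported above. Note that session 1 is covered only through the satisfaction objective.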

## During the Intervention: Evaluation of Implementation

As a measure of participant willingness, **Figure 1** presents a participant flow chart in keeping with CONSORT recommendations (Moher et al., 2001).

## After the Intervention: Evaluation of Outcomes

#### Reliability


**Table 1** presents the reliability results. All were significant at the 95% confidence level. Considering the unbiased estimator of Cronbach's α, six (22.2%) were excellent, 10 (37%) were good, nine (33.3%) were acceptable, and two (7.4%) were questionable (the subscale Tension of POMS in the pre-test and the follow-up). Overall, 25 (92.6%) of the results reached at least appropriate values (above 0.7) and the remaining two (7.4%) were close to 0.7 (specifically, 0.684 and 0.669).

#### TABLE 1 | Reliability.

WHYMPI, West Haven-Yale Multidimensional Pain Inventory; POMS, Profile of Mood States; F, fatigue; D, depression; T, tension; H, hostility; C, confusion; V, vigor; M, total mood disturbance; BDI, Beck Depression Inventory. Values of Cronbach's α and of its unbiased estimator lower than 0.7 are marked in bold.

#### Normality

Considering the 14 variables and the three instances separately (14 × 3 = 42 combinations), the normality assumption using Shapiro–Wilk (W) was accepted on all occasions but nine: 24-h intensity, follow-up (W = 0.857, p = 0.027); 24-h interference, pre-test (W = 0.639, p < 0.001) and follow-up (W = 0.798, p = 0.005); present interference, pre-test (W = 0.639, p < 0.001) and follow-up (W = 0.849, p = 0.022); POMS-V, follow-up (W = 0.841, p = 0.017); BDI, pre-test (W = 0.834, p = 0.003); and PGIC, post-test (W = 0.816, p = 0.008) and follow-up (W = 0.851, p = 0.023).

As a result, the calculations for the six variables for which normality was rejected in at least one instance (24-h intensity, 24-h interference, present interference, POMS-V, BDI and PGIC) were done using non-parametric tests.
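The decision rule described above can be sketched in a few lines of Python. The p-values are the Shapiro–Wilk results reported in the text; the variable labels, the 0.05 threshold, and the function name are assumptions of this sketch, and the test names follow the Table 2 notes (Friedman test and repeated-measures ANOVA).

```python
ALPHA = 0.05  # assumed significance threshold

# The 14 outcome variables (labels are this sketch's own shorthand).
VARIABLES = [
    "present intensity", "24-h intensity",
    "24-h interference", "present interference", "WHYMPI interference",
    "POMS-F", "POMS-D", "POMS-T", "POMS-H", "POMS-C", "POMS-V", "POMS-M",
    "BDI", "PGIC",
]

# Shapiro-Wilk p-values for the nine rejections reported in the text.
# Values reported as "p < 0.001" are stored as the bound 0.001.
REJECTIONS = {
    ("24-h intensity", "follow-up"): 0.027,
    ("24-h interference", "pre-test"): 0.001,
    ("24-h interference", "follow-up"): 0.005,
    ("present interference", "pre-test"): 0.001,
    ("present interference", "follow-up"): 0.022,
    ("POMS-V", "follow-up"): 0.017,
    ("BDI", "pre-test"): 0.003,
    ("PGIC", "post-test"): 0.008,
    ("PGIC", "follow-up"): 0.023,
}


def choose_tests(variables, rejections, alpha=ALPHA):
    """Assign a non-parametric test to every variable whose normality
    was rejected in at least one instance, a parametric one otherwise."""
    flagged = {var for (var, _instance), p in rejections.items() if p < alpha}
    return {
        var: "Friedman" if var in flagged else "repeated-measures ANOVA"
        for var in variables
    }


plan = choose_tests(VARIABLES, REJECTIONS)
for name, test in plan.items():
    print(f"{name}: {test}")
```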

### Effectiveness of the Psychological Intervention

#### **Pain**

**Table 2** presents the results. In terms of pain, both the pain intensity present at the time of the interview and the pain experienced in the 24 h beforehand diminished in a statistically significant manner after the intervention, with a large effect size.
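For the variables analyzed non-parametrically, the Friedman statistic and Kendall's coefficient of concordance W (used as an effect size) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the scores are hypothetical, the function names are invented for the example, no tie correction is applied, and the p-value (which would come from a chi-square distribution with k − 1 degrees of freedom) is not computed.

```python
def average_ranks(row):
    """Rank the values of one participant (1 = lowest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def friedman_kendall(data):
    """data: one row per participant, one column per instance
    (e.g., pre-test, post-test, follow-up).
    Returns (Friedman chi-square without tie correction, Kendall's W)."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    chi2 = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    w = chi2 / (n * (k - 1))
    return chi2, w


# Hypothetical pain scores for four participants at three instances.
scores = [[7, 4, 5], [8, 5, 6], [6, 3, 4], [9, 6, 7]]
chi2, w = friedman_kendall(scores)
print(chi2, w)
```

In this hypothetical example every participant shows the same ordering of the three instances, so W reaches its maximum of 1; values near 0 would indicate no consistent ordering across participants.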

In present intensity, the clinical significance was minimally important and both linear and quadratic trends were significant. The quadratic trend was stronger, however, with a large effect size, while the effect size for the linear trend was medium. This can be interpreted as a slight maintenance of results obtained in post-test at follow-up.

For pain intensity in the previous 24 h, we found a minimally important change when pre- and post-test results were compared, and no change in the pre-test and follow-up comparison. The quadratic trend was statistically significant. This suggests that after the intervention, there was a decrease in 24-h pain intensity, but an increase 6 months later.
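To make the reading of linear and quadratic trends over the three instances concrete, the sketch below computes per-participant contrast scores with the standard orthogonal polynomial weights for three ordered levels. It is only a descriptive illustration: the data are hypothetical, the names are invented for the example, the weights treat the instances as equally spaced, and the significance tests used in the study are not reproduced.

```python
from statistics import mean

# Orthogonal polynomial weights for three ordered instances
# (pre-test, post-test, follow-up).
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)


def contrast_scores(data, weights):
    """One contrast score per participant: the weighted sum of that
    participant's scores across the three instances."""
    return [sum(w * x for w, x in zip(weights, row)) for row in data]


# Hypothetical 24-h pain intensity: drop at post-test, partial rebound at follow-up.
pain = [[7, 4, 6], [8, 5, 7]]

linear = contrast_scores(pain, LINEAR)
quadratic = contrast_scores(pain, QUADRATIC)
print(mean(linear), mean(quadratic))
```

A mean linear contrast close to zero together with a clearly positive quadratic contrast corresponds to the pattern described above: a decrease after the intervention followed by an increase at follow-up.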

#### **Physical functioning**

The 24-h and present pain interference and the WHYMPI interference score diminished in a statistically significant manner after the intervention with a large effect size.

The clinical significance in WHYMPI was substantial in the pre–post comparison and moderately important when comparing pre-test and follow-up. The significant linear and quadratic trends with medium effect size revealed that, although there was a slight deterioration, the improvement continued in the follow-up period.

There was a statistically significant deterioration with regard to 24-h interference in the follow-up period (significant quadratic trend). Nevertheless, the improvement in present interference continued in the follow-up period (significant linear trend).

#### **Emotional functioning**

In general, we can say that there was a statistically significant improvement in POMS and BDI. The effect size was medium/large in all the variables. In all cases, the clinical significance implied a substantial change when comparing pre-test and post-test. Additionally, the quadratic trend was statistically significant in all cases. This can be interpreted as an important deterioration in a comparison of the post-test and follow-up. Comparing the clinical significance at pre-test and follow-up, we find that the deterioration does not represent a return to the starting point in all the variables studied, because there is a substantial change in POMS-T, POMS-H, and POMS-V (with this last variable also showing a significant linear trend), and a moderately important change in POMS-F, POMS-D and POMS-M. Moreover, BDI also yielded a significant linear trend in favor of a possible maintenance of the results obtained.

#### **Improvement perceived by the patient**

The Patient Global Impression of Change Scale shows that the improvement participants expected before the intervention was significantly lower than the subjective improvement perceived by the participants after the intervention, with a large effect size and a moderately important clinical change. This variable presents a statistically significant trend both linearly and quadratically, so it can be concluded that participants maintain their positive assessment when comparing post-test and follow-up.

V, vigor; M, total mood disturbance; BDI, Beck Depression Inventory; PGIC, Patient Global Impression of Change Scale. +, small effect size (around 0.01); ++, medium effect size (around 0.06); and +++, large effect size (around 0.16). Following Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) recommendations, ♠, minimally important change; ♠♠, moderately important change; and ♠♠♠, substantial change.

<sup>a</sup>Sphericity was not assumed (Mauchly's test of sphericity p < 0.001). <sup>b</sup>Friedman test. <sup>c</sup>ANOVA F for repeated measures. <sup>d</sup>Kendall's coefficient of concordance W. <sup>e</sup>r<sup>2</sup>. <sup>f</sup>Mean difference. <sup>g</sup>More detailed information can be consulted in Table 3. <sup>h</sup>Non-parametric post hoc comparisons for trend, 95% confidence interval (statistically significant trends, i.e., with zero excluded from the confidence interval, are in bold). ∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001.

In more detail, **Table 3** shows that, at post-test, all patients noted improvement, with more than half reporting a moderately important change and around one-third reporting substantial change (the maximum). At the 6-month follow-up, two patients reported that their chronic pain was similar to what it had been before the intervention. However, approximately half noted a moderately important improvement and one-fourth, substantial improvement. Overall, 90% of the patients stated that they had improved 6 months after the intervention.

## DISCUSSION

This study has provided additional evidence on the generalizability of multicomponent interventions whose effectiveness has already been shown in other contexts (Morley et al., 1999; Veehof et al., 2011; Hann and McCracken, 2014; Huguet et al., 2014). While such interventions are usually implemented in English-speaking contexts, this paper presents an implementation in a Spanish rural area. While reported interventions are generally performed in a very controlled context, the sample of this study was selected from among users of a public health center who came in for a consultation. Participants in most studies are usually upper-middle class with a high educational level; 70% of participants in this intervention had a low educational level (complete or incomplete elementary) and 45% had no paid work and, as a result, low income. Finally, interventions are usually evaluated on a limited number of domains of the pain experience; in this case, we evaluated all the domains of chronic pain using instruments recommended by IMMPACT. To measure pain, we used the intensity of perceived pain in the previous 24 h and at the time of the interview (Dworkin et al., 2005). To measure physical functioning, we utilized the items referring to pain interference in daily life in the previous 24 h and at the time of the interview, and WHYMPI (Kerns et al., 1985). POMS and BDI were used to gauge emotional functioning. To measure perceived improvement after the treatment, PGIC was used.

Patient flow data were similar to those of other studies. Wetherell et al. (2011) carried out a randomized controlled trial comparing acceptance and commitment therapy with cognitive behavioral therapy in patients with chronic pain. They reported that 66% of patients were excluded from the recruitment, 12% of patients did not receive the intervention, and 16% of patients dropped out. Our percentages were 51, 12.5, and 21%, respectively. The principal reasons for exclusion and dropout in our study were similar to those reported by Wetherell et al. (2011): schedule incompatibilities, adverse life events and noncompliance.

First, in spite of the variants our study introduced to the standard intervention, the program assessment showed a high degree of standardization and specification owing to its highly detailed design (Kovacs and Moix, 2011; Moix and Casado, 2011). Moreover, the evaluation followed the IMMPACT recommendations, using instruments with tested psychometric properties. This facilitates the replication of the intervention and reinforces the results obtained. Second, there was a high degree of internal coherence. The same measures taken before the intervention were repeated immediately after and again 6 months later, using the same instruments. This comparison of the three instances facilitated the analysis of the change and provided evidence not only of the program's effectiveness but also of the duration of the effects for a longer period of time. Each assessed need had at least one associated objective, and each objective had at least one activity designed to achieve it, with a suitable timeframe and resources. Third, explicit selection criteria for participants were applied to all potential participants (Chacón-Moscoso et al., 2016). Fourth, the measures presented sufficient reliability coefficients. Fifth, we found evidence of effectiveness, as there was a statistically significant improvement after the intervention or at least a medium effect size in all the variables measured and all the domains taken into account; and substantial clinical change in 75% of the variables measured.

From our point of view, the main contribution of the study is to demonstrate that cognitive-behavioral therapy can be effective even if delivered by an inexperienced therapist to groups of low-literacy patients with a low socioeconomic status. As for therapist experience, although common sense suggests that it should improve the effectiveness of therapy, the first longitudinal study that addresses this question, with data from 170 psychotherapists and 6,591 patients (Goldberg et al., 2016), did not support this. In our opinion, the highly structured intervention program and the wealth of resources and material available to the therapist minimize the possible impact of their inexperience. In terms of the second aspect, literacy and socioeconomic resources are considered a barrier to the efficacy of cognitive behavioral treatment of chronic pain (Campbell, 2011) and this led to the creation of personalization initiatives for these patients (Thorn et al., 2011; Eyer and Thorn, 2016). Even so, in the first study with personalized treatment (Thorn et al., 2011), 26.5% of patients did not complete the intervention, which is 5.5 percentage points more than in our study. This could be explained by the therapist's familiarity with the patients and by her efforts to make the program contents understandable to them.

The improvement observed just after the intervention had partly receded at follow-up in approximately two-thirds of the variables measured (only the quadratic trend was statistically significant), though the measures did not return to their starting points. The ostensibly mild deterioration is still strong enough to be statistically significant. Maintaining the long-term effects of these programs is another major challenge, considering the high chronicity of these patients (in our study, patients had been suffering from chronic pain for over 16 years on average). A possible moderating factor could be the quantity and quality of homework, a neglected aspect of cognitive behavioral therapy research, the importance of which has been revealed in a recent meta-analysis (Kazantzis et al., 2016). In any case, it would be highly advisable to add some sessions after the intervention, one every 4 months, to maintain the improvements patients have obtained.

On the other hand, the principal limitation was the absence of a control group that would have enhanced the design and increased evidence of the intervention's effectiveness. Nevertheless, a control group would not have been feasible in this study, because we were ethically obliged to offer the intervention program to every patient in a public primary care setting. In any case, we were less interested in the program efficacy than in identifying who could benefit from the intervention.

Further research is going to take two directions. First, we are going to adapt the intervention to a broader potential population. People with a disability such as deafness, blindness or dementia were excluded from the initial intervention, but we trust that it is possible to adapt the intervention to cases such as these. Second, in order to increase the evidence of the efficacy of the intervention applied in this study (Moix and Casado, 2011), a meta-analysis will be developed. This will assist us in obtaining a global effect size after a statistical synthesis of the results obtained in the different interventions while also allowing us to detect possible moderator variables that influence the effectiveness of these interventions. From this study, we would be able to establish practical recommendations for psychologists to increase the likelihood of success of this kind of program.

## AUTHOR CONTRIBUTIONS

FC-G came up with the initial idea and design. RM-B recruited the sample. MG-O carried out the intervention. SS-C and SC-M performed the analyses and interpreted the data. FC-G, SS-C, and SC-M were entrusted with drafting the manuscript. All the authors reviewed the manuscript, approved the final version to be published, and agree to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

## FUNDING

This research was funded by the following project grants: PSI2011-29587 (the Spanish Ministry of Science and Innovation); 1150096 (the Scientific and Technological Development Fund of Chile -FONDECYT-); and PSI2015-71947-REDT (the Spanish Ministry of Economy and Competitiveness).

## ACKNOWLEDGMENTS

This work would not have been possible without the collaboration of Southern Seville Primary Care, which opened its doors to us, and of the professionals at the Príncipe de Asturias Health Center (Utrera, Spain), led by Dr. José Sánchez-Blanco.

The authors greatly appreciate all the comments received from the reviewers and the English language editor Wendy Gosselin. We believe that the quality of this paper has been substantially enhanced as a result.

## REFERENCES

Anguera, S. Chacón-Moscoso, and A. Blanco-Villaseñor (Madrid: Síntesis), 185–218.

review and meta-analysis. PAIN 152, 533–542. doi: 10.1016/j.pain.2010.11.002


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Cano-García, González-Ortega, Sanduvete-Chaves, Chacón-Moscoso and Moreno-Borrego. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Characterization of Vulnerable and Resilient Spanish Adolescents in Their Developmental Contexts

Carmen Moreno<sup>1</sup> , Irene García-Moya<sup>1</sup> , Francisco Rivera<sup>2</sup> and Pilar Ramos <sup>1</sup> \*

<sup>1</sup> Developmental and Educational Psychology, University of Seville, Sevilla, Spain, <sup>2</sup> Department of Behavioral Sciences, University of Huelva, Huelva, Spain

Research on resilience and vulnerability can offer very valuable information for optimizing design and assessment of interventions and policies aimed at fostering adolescent health. This paper used the adversity level associated with family functioning and the positive adaptation level, as measured by means of a global health score, to distinguish four groups within a representative sample of Spanish adolescents aged 13–16 years: maladaptive, resilient, competent and vulnerable. The aforementioned groups were compared in a number of demographic, school context, peer context, lifestyles, psychological and socioeconomic variables, which can facilitate or inhibit positive adaptation in each context. In addition, the degree to which each factor tended to associate with resilience and vulnerability was examined. The majority of the factors operated by increasing the likelihood of good adaptation in resilient adolescents and diminishing it in vulnerable ones. Overall, more similarities than differences were found in the factors contributing to explaining resilience or vulnerability. However, results also revealed some differential aspects: psychological variables showed a larger explicative capacity in vulnerable adolescents, whereas factors related to school and peer contexts, especially the second, showed a stronger association with resilience. In addition, perceived family wealth, satisfaction with friendships and breakfast frequency only made a significant contribution to the explanation of resilience. The current study provides a highly useful characterization of resilience and vulnerability phenomena in adolescence.

#### Edited by:

Jason C. Immekus, University of Louisville, USA

#### Reviewed by:

Evgueni Borokhovski, Concordia University, Canada

Eleonora Riva, University of Milan, Italy

> \*Correspondence: Pilar Ramos pilarramos@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 21 December 2015 Accepted: 14 June 2016 Published: 04 July 2016

#### Citation:

Moreno C, García-Moya I, Rivera F and Ramos P (2016) Characterization of Vulnerable and Resilient Spanish Adolescents in Their Developmental Contexts. Front. Psychol. 7:983. doi: 10.3389/fpsyg.2016.00983

Keywords: adolescence, resilience, vulnerability, family functioning, global health score

## INTRODUCTION

Fostering wellbeing is one of the current priorities of international agendas in health promotion (WHO, 2012, 2014), and adolescence has been considered to be a key developmental stage for this objective (WHO, 2014). Scientific evidence on factors that help mitigate risk or promote good adjustment despite adversity is crucial to governments and international agencies, which need to efficiently and effectively invest their resources. Positive and negative factors for wellbeing accumulate throughout life, and health promotion interventions that maximize protective factors while minimizing risks can be successful in achieving wellbeing gains (Marmot, 2010). Resilience research, which analyses risk and protective factors to understand positive development under adverse circumstances, therefore presents itself as a particularly valuable approach that can provide the foundations for the design of effective health promotion and preventive interventions (Roosa, 2000).

More specifically, the value of resilience studies for the design and evaluation of health promotion interventions is apparent for the following reasons. First, resilience research provides critical information about key factors that help reduce potential harm and encourage positive adaptation (Masten, 2014). Each identified protective or vulnerability factor offers a possible focus of intervention (Olsson et al., 2003). Furthermore, the advantage of these studies is that they not only provide a list of intervention targets, but also emphasize the most relevant factors for different population groups and adversity levels (Luthar and Cicchetti, 2000).

Additionally, in highlighting an individual's positive adaptation, resilience studies facilitate a change of approach (Luthar and Cicchetti, 2000; Olsson et al., 2003; Fergus and Zimmerman, 2005). Thus, resilience is in line with the perspective shift which has gradually taken place in different disciplines, including psychology, in recent decades: from the reduction of existing problems and exclusive emphasis on deficit and risk, to a focus on the development and promotion of health resources and assets (Morgan et al., 2010).

Lastly, it is important to bear in mind that the utility of resilience research goes further than merely understanding the processes linked to adversity. According to existing evidence, protective factors (like vulnerability factors) are not specific to situations of adversity; rather, they are the manifestation of basic adaptational systems that come into play in a variety of situations (Masten and Coatsworth, 1998; Masten, 2001). Therefore, increasing our knowledge about resilience and vulnerability phenomena provides useful evidence for intervention and evaluation in adversity contexts and helps to better understand and promote positive development in the general population.

In order for scientific research to make a significant contribution to the design and evaluation of interventions and policies, it is fundamental that studies on resilience (as well as those on vulnerability) include a clear definition and operationalization of the terminology involved (Luthar and Cicchetti, 2000; Masten, 2014; Luthar et al., 2015). In this regard, resilience is defined as "a dynamic process encompassing positive adaptation within the context of significant adversity" (Luthar et al., 2000, p. 543). There is a wide consensus that the two criteria implicit in this definition must be met in order to identify resilience. Indeed, exposure to adversity and some evidence of positive adaptation have been referred to as the two "judgements," "dimensions," "sides" or "coexisting conditions" of resilience (Masten and Coatsworth, 1998; Luthar et al., 2000, 2015; Luthar and Cicchetti, 2000; Masten, 2001; Rutter, 2006).

The adversity element has been defined by characteristics as diverse as: an experience of war or catastrophe (Masten and Narayan, 2012), low economic status (Buckner et al., 2003), belonging to minority groups (Sandín-Esteban and Sánchez-Martín, 2015), living in disadvantaged neighborhoods (Tiêt and Huizinga, 2002) and an individual's or caregiver's disorders or illnesses (Werner and Smith, 1982). Nevertheless, the key defining characteristic of adversity is that a significant threat to development or demonstrable risk must be present (Luthar and Cicchetti, 2000; Masten, 2001). More specifically, adversity is defined by "current or past hazards judged to have the potential to derail normative development" (Masten, 2001, p. 228) and it "typically encompasses negative life circumstances that are known to be statistically associated with adjustment difficulties" (Luthar and Cicchetti, 2000, p. 858).

In this regard, putting key adaptational systems in danger, including the relationship with loving and competent adult caregivers in a family context, is amongst the principal hazards to human development (Masten, 2001). Extant evidence has documented the fundamental links between the quality of parent-child relationships and adolescent development and adjustment (Steinberg and Silk, 2002; Clarke-Stewart and Dunn, 2006). In this sense, family context has a very strong influence on the person from the beginning of life and through multiple channels. It is no wonder, therefore, that the family is the center of many adaptation and human development studies in this field (Masten and Shaffer, 2006). Hence, low-quality parent-child relationships (García-Moya et al., 2013b) or the existence of problems in the family (Fergusson and Linskey, 1996) have been considered to be key elements in defining an adverse situation. Accordingly, low scores in a composite factorial measure of the quality of parent-child relationships (García-Moya et al., 2013a) will be used as the indicator of adversity in the present study.

In defining positive adaptation, resilience research is especially varied. Luthar et al. (2000, 2015) concluded that a single criterion to establish the best adaptation indicator for any given study does not exist. External criteria such as behavioral adjustment and social competence have tended to predominate (Olsson et al., 2003) but internal criteria including emotional health, life satisfaction or absence of emotional distress are increasingly seen as similarly important indicators of positive adaptation (Masten and Reed, 2005). Furthermore, some revealing studies show that individuals showing positive adaptation according to external competence criteria can still experience internalizing symptoms and health problems (e.g., Luthar et al., 1993). Drawing on this evidence, we selected a global health score, which encompasses self-rated health, psychosomatic complaints, health-related quality of life and life satisfaction, as the indicator of positive adaptation in the present study. This is not to say that positive adaptation is synonymous with health or wellbeing, but we made the conceptually-informed decision to give priority to the aforementioned internal dimensions of health to define positive adaptation. More specifically, the global health score (Ramos et al., 2010) was selected because of its relevance for the kind of adversity examined (Karademas et al., 2008; Jiménez-Iglesias et al., 2015), as well as being a sound composite factorial score that encompasses multiple domains of health and has shown good psychometric properties in adolescents (Ramos et al., 2012). Specifically, using the global health score as the criterion for positive adaptation fits with one of the approaches mentioned in a seminal chapter about measurement issues in the empirical study of resilience, underlining that the assessment of positive adaptation "must be tied in to the particular risk domain being studied" and "rests on multiple-item instruments, typically with well-documented psychometric properties, that provide assessments on the continuum between adjustment and maladjustment" (Luthar and Cushing, 1999, pp. 139–140).

Furthering the definition of the constructs related to resilience and adaptation, some authors (Tiêt and Huizinga, 2002) have proposed an interesting classification of individuals based on their level of exposure to adversity and the resulting adaptation shown, which divides the population into four large groups. Two of the groups show expected results in accordance with their level of exposure to adversity: low risk, good adaptation (competent or unchallenged) and high risk, bad adaptation (maladaptive). The paradox occurs in the remaining two groups: those that are exposed to high risk but show good adaptation and those that, despite being exposed to low levels of risk, exhibit low competence levels. The first of these latter two groups constitutes the sample of interest in resilience studies whereas the second group, although rarely studied, could offer interesting information about vulnerability factors in the normative population.

After establishing the group or groups of interest, the next step is to identify which factors facilitate (protective factors) or inhibit (vulnerability factors) positive adaptation in the given context. Research has tended to classify these factors using a theoretical framework which distinguishes three fundamental levels: individual-level, family-level, and extrafamily-level factors (Masten and Coatsworth, 1998; Luthar and Cicchetti, 2000; Olsson et al., 2003).

On the individual level, self-esteem, self-efficacy, and intellectual capacity have been extensively studied as determinant factors in the classic literature on resilience (Masten and Coatsworth, 1998; Dumont and Provost, 1999; Hamill, 2003). Nevertheless, the claim that positive self-perception along with confidence in one's efficacy and motivation to engage in the environment are fundamental for successful adaptation (Masten, 2001) justifies the need to explore the role of other constructs with clear links to the aforementioned description. Regarding positive self-perception, satisfaction with body image is one aspect that has been considered especially influential in adolescence (Tiggemann, 2005). Confidence in one's efficacy and motivation to engage in the environment are linked to some novel constructs in positive psychology that are likely to play a significant role in explaining positive adaptation, such as sense of coherence (Antonovsky, 1987) and curiosity and exploration (Kashdan et al., 2009). Finally, another fundamental factor is emotional regulation. This skill, which is closely related to intellectual functioning, is currently receiving special scientific attention since it seems to be fundamental for successful coping and good behavioral, emotional and social adjustment (Lengua, 2002; Buckner et al., 2003). The present study will try to further the understanding of individual-level factors by exploring the aforementioned constructs that, despite having connections with well-known individual factors in resilience studies, have not usually been included in previous resilience research.

Along with them, we will analyse the role of lifestyles, which, despite their significant contribution to health and wellbeing, have also received little attention to date (Elliot, 1993; Ramos, 2010). Regarding tobacco, alcohol and cannabis use, the absence of these risk behaviors has been predominantly used as a criterion for defining adaptation (for a review, see Fergus and Zimmerman, 2005) or their presence has been analyzed as a risk indicator (Anteghini et al., 2001). The associations between resilience and healthy lifestyles, such as eating habits, dental hygiene and physical activity, have also been rarely explored in classic studies. Nonetheless, physical activity, for example, has been highlighted as a relevant factor when explaining resilience due to its protective effects on health in stress situations (Gerber and Pühse, 2009; Silverman and Deuster, 2014) or the fact that it tends to be incompatible with some health-threatening activities or risk behaviors, such as alcohol and other substance abuse (Pate et al., 1996). Consequently, examining the associations between lifestyles and resilience is of unquestionable interest.

On the family level, besides aspects related to the aforementioned quality of relationships and processes in the family context (which will be used to define adversity in the present study), it is worth exploring the contribution of the families' socioeconomic status (Masten and Coatsworth, 1998). A good socioeconomic position is associated with access to material, cultural and educational resources, making it a significant source of social capital (Bornstein and Bradley, 2003), whereas low family affluence limits access to the aforesaid resources and could become a significant source of stress, having negative consequences on children's development (Conger et al., 2000). Unlike objective indicators, subjective measures of socioeconomic status have not generally been analyzed in resilience studies. However, the study of socioeconomic inequalities in health indicates that subjective perceptions of wealth have a strong predictive capacity regarding adolescent health (Goodman et al., 2001) and their significant effects on health remain even after controlling for objective measures such as educational level, parents' occupation and family affluence (Elgar et al., 2016).

Lastly, on the extrafamily level, experiences of belonging and efficacy, such as a positive school climate and experiences of academic achievement, can significantly contribute to positive adaptation outcomes (Masten and Coatsworth, 1998), whereas bullying episodes can hamper them (McVie, 2014). Significant extrafamily relationships with important adults, including teachers (DuBois et al., 1992; Masten and Coatsworth, 1998), as well as the contribution of peer support and the degree to which peers provide positive or adjusted models of behavior (e.g., Jain et al., 2012) have also been emphasized. The present study will consider all the aforementioned aspects.

Therefore, the selection of variables in the present study is supported by an ample consensus on the need to analyse factors from individual, family and extrafamily levels in order to obtain a detailed view of the factors associated with resilience and vulnerability (Masten and Coatsworth, 1998; Luthar and Cicchetti, 2000; Olsson et al., 2003). In addition, the selection of variables is guided by an explicit effort to explore relevant content from those levels that have not been sufficiently examined in resilience research so far. Thus, the present study will try to further the understanding of individual-level factors by exploring emotional regulation along with other constructs such as satisfaction with body image, sense of coherence and curiosity and exploration that, despite having connections with well-known individual factors in resilience studies, have not usually been included in previous resilience research. Similarly, because lifestyles contribute significantly to wellbeing, the selection of variables included breakfast frequency, physical activity and substance use, which have also received little attention in the study of resilience. On the family level, a similar rationale motivated the selection of perceived family wealth as the measure of socioeconomic status, instead of the objective indicators which have dominated previous resilience research. Finally, on the extrafamily level, the selected variables (including academic achievement, classmate and teacher support, bullying victimization, peer support, and models of behavior in the peer group) ensure simultaneous consideration of the most frequently mentioned factors on this level.

Accordingly, this paper starts by using the criteria on adversity and positive adaptation described above to identify two reference groups within a representative sample of adolescents: those that showed good global health despite having a low-quality family environment (resilient), and those that showed poor health even with high-quality parent-child relationships (vulnerable). Afterwards, drawing on the classification by Tiêt and Huizinga (2002), the phenomena of resilience and vulnerability are characterized by comparing them to groups of maladaptive (high risk, poor adaptation) and competent (low risk, good adaptation) adolescents, respectively.

The aim of the paper is to characterize resilience and vulnerability in adolescents, considering an ample number of potential protective and vulnerability factors that were selected from the three main levels described in scientific literature (individual, family, and extrafamily). The selection of the specific factors used in this study is also intended to initiate a new direction by exploring relevant constructs for positive adaptation in adolescence which had not received sufficient attention in classic resilience research, amongst others, satisfaction with body image, sense of coherence, curiosity and exploration, and diverse lifestyles.

In short, after conducting preliminary analyses on the differences among resilient, vulnerable, competent, and maladaptative adolescents in individual factors (including psychological variables and lifestyles), family socioeconomic status and extrafamily factors (including those from the school context and the peer context), the ability of those factors (as independent variables) to explain the dependent variables resilience (vs. maladaptation) and vulnerability (vs. competence) is examined. A detailed list of research questions is presented in **Table 1**.

This approach is designed to identify important factors for adaptation in adverse and non-adverse contexts respectively, but it may also provide valuable findings that contribute to informing the debate on whether some factors contribute to positive development in the face of adversity but have little impact in the absence of it or whether there are some common protective and risk factors associated with positive adaptation irrespective of the level of adversity exposure (Roosa, 2000). Also, on the potential implications and contributions offered by the present study, it has been stated that although "this kind of epidemiological research does not unpack the processes by which each individual is impacted by contextual experience, it does document the multiple factors in the environment that are candidates for more specific analyses" (Sameroff, 2010, p. 14). The aforementioned factors and levels do not operate independently; rather, they relate amongst themselves in people's lives (Fergus and Zimmerman, 2005). For this reason, approaches which provide an ample characterization of resilience and vulnerability phenomena while taking into account a significant number of the aforementioned factors (usually referred to as person-focused approaches) provide a very valuable complementary approach (Masten, 2001).

## METHOD

## Participants

Data were obtained from the Health Behavior in School-aged Children (HBSC) cross-sectional survey. The HBSC study is an international network supported by the World Health Organization that collects data in more than 40 countries in Europe and North America. The survey is conducted every 4 years with the aim of increasing knowledge about health-related behaviors, lifestyles and developmental contexts of young people.

Participants of the present study come from a representative sample of school-aged children aged 13–16 years residing in Spain, who were selected for the 2014 edition of the HBSC study using a random multi-stage sampling stratified by conglomerates, representative by age, area of residence (rural or urban), type of school (public or private) and region (Spain has 19 regions) (Moreno et al., 2016). Participants were recruited from a database of schools published by the Spanish Ministry of Education. Those centers that refused to participate in the study were replaced by other centers, also selected randomly within the same stratum. The final student participation rate was 87%.

For the purpose of this article, terciles were used to identify adolescents scoring high (upper tercile) and low (lower tercile) in the scales for Global Health Score (GHS) and the Quality of the Parent-Child Relationship (QPCR) (described later in the section on instruments).

Despite the limitations of categorizing quantitative variables (Preacher et al., 2005), dividing them into three groups in order to identify their extremes is supported by three reasons: firstly, by the essence of the construct itself, since "resilience is never directly measured, but instead is indirectly inferred based on evidence of the two subsumed constructs" ("adversity" and "positive adaptation"; Luthar et al., 2015, p. 248); secondly, it is consistent with literature that identifies both resilient and vulnerable subjects as extreme groups in unfavorable and favorable circumstances, respectively, but whose results in adjustment indicators are not consistent with their circumstances (Luthar et al., 2000; Masten, 2014); and lastly, from a purely methodological perspective, because, as DeCoster et al. (2009) argue, categorization is advised when focusing on the extreme groups since it allows for identification of groups of subjects based on conceptual definitions.

Based on the four combinations resulting from this division, 1753 adolescents were selected from the total of 3845 studied (see **Table 2**). In the selected sample, 45.8% are boys and 54.2% are girls, with a mean age of 14.62 years (SD = 1.11). Additionally, 62.7% attended public schools and 37.3% private, with 54.1% living in urban areas and 45.9% in rural areas.

#### TABLE 1 | Research questions in the present study.

#### Research question 1

How do the four groups of adolescents analyzed in this paper (maladaptative, resilient, competent, and vulnerable) characterize and differentiate from each other in relation to the three sets of variables analyzed: individual factors (including psychological variables and lifestyles), family socioeconomic status and extrafamily factors (including those from the school context and the peer context)?

#### Research question 2

Which factors (individual, familial, and extrafamilial) are useful to understand adaptation in high adversity contexts? In other words: which factors (individual, familial, and extrafamilial) are useful to distinguish between resilient and maladaptative adolescents?

The following specific questions will be answered before addressing research question 2:


#### Research question 3

Which factors (individual, familial, and extrafamilial) are useful to understand adaptation in low adversity contexts? In other words: which factors (individual, familial, and extrafamilial) are useful to distinguish between vulnerable and competent adolescents?

The following specific questions will be answered before addressing research question 3:


TABLE 2 | Sample subgroups according to their tercile position in the global health and the quality of parent–child relationship scores (the four groups examined in the present study are highlighted in bold).


Therefore, following the classification criteria for adaptation status developed by classic research (Tiêt and Huizinga, 2002), the sample was classified in four groups, defined as follows: resilient adolescents (tercile 1 in QPCR and tercile 3 in GHS), maladaptative adolescents (tercile 1 in QPCR and tercile 1 in GHS), vulnerable adolescents (tercile 3 in QPCR and tercile 1 in GHS) and competent adolescents (tercile 3 in QPCR and tercile 3 in GHS).
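The grouping rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually run in SPSS: the function names, the use of `statistics.quantiles` to obtain tercile cut points, and the treatment of values that fall exactly on a cut point are all assumptions made for the example.

```python
from statistics import quantiles


def tercile_cuts(scores):
    """Return the two cut points that split the scores into terciles.

    Assumption: the default ("exclusive") interpolation of
    statistics.quantiles is used; other software may place the
    cut points slightly differently.
    """
    return tuple(quantiles(scores, n=3))


def tercile(value, cuts):
    """Return 1, 2 or 3 for the lower, middle or upper tercile.

    Assumption: a value equal to a cut point is assigned to the
    lower of the two adjacent terciles.
    """
    low, high = cuts
    if value <= low:
        return 1
    if value <= high:
        return 2
    return 3


def classify(ghs, qpcr, ghs_cuts, qpcr_cuts):
    """Map one adolescent's scores to an adaptation group.

    Returns None when either score falls in the middle tercile,
    since those adolescents are not part of the four groups.
    """
    groups = {
        # (QPCR tercile, GHS tercile): group label
        (1, 3): "resilient",
        (1, 1): "maladaptative",
        (3, 1): "vulnerable",
        (3, 3): "competent",
    }
    return groups.get((tercile(qpcr, qpcr_cuts), tercile(ghs, ghs_cuts)))
```

In practice the cut points would be computed once from the full sample (for example `tercile_cuts(all_ghs_scores)`) and then passed to `classify` for each participant.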

## Instruments

The variables were assessed using the 2014 Spanish HBSC Questionnaire, which included questions about lifestyles, positive health and characteristics of the principal developmental contexts (family, peers, and school) in adolescence. The instrument is comprised of an extensive series of mandatory questions, optional packages and questions that cover specific national interests (Roberts et al., 2009). The complete questionnaire is revised and improved for each edition of the study (for the last edition, see Inchley et al., 2016). For the present paper, key measures of quality of parent-child relationship and health, as well as sociodemographic, school and peer contexts, lifestyle, and psychological and socioeconomic variables were selected from the Spanish version of the 2014 HBSC survey.

Firstly, the following two measures were used to derive the classification in groups (maladaptative, resilient, vulnerable, and competent) that acts as the dependent variable.

1. Global Health Score (GHS). This measure is based on 20 items related to the variables: life satisfaction, self-rated health, health-related quality of life and psychosomatic complaints. The details of these instruments can be consulted in **Table 3**. The GHS is a score with a mean of 50 and standard deviation of 10 that has shown good fit indices (NNFI = 0.98, CFI = 0.99, RMSEA = 0.03), as well as good reliability and validity (Ramos et al., 2010). This measure assesses the adolescent's physical, psychological and social wellbeing, following the most widely used and currently accepted definition of health, i.e., the definition proposed by the World Health Organization (WHO, 1948). As previously described, terciles were used in the present study to classify the adolescents in three groups according to this measure of global health.

2. Factorial score on the Quality of Parent-Child Relationship (QPCR), with a mean of 5 and a standard deviation of 2. This score is an adaptation of the measure developed by García-Moya et al. (2013a), which consists of the following three indicators: perceived affection, ease of communication with parents and satisfaction with family relations. The details of these instruments can be consulted in **Table 3**. The factorial score on the QPCR showed good fit indices (NNFI = 0.99, CFI = 0.99, RMSEA = 0.02) and has been considered a useful tool in global assessments of the relationships between parents and children according to the adolescents' perception (García-Moya et al., 2013a). As previously mentioned, terciles were used in the present study to classify adolescents in three groups according to the quality of their parent-child relationship.

In addition, the independent variables were selected in line with the aims of this study and assessed by means of several instruments that were part of the 2014 HBSC Questionnaire, explained above. The details of these instruments are presented in **Table 4**.

## Procedure

New information and communication technologies (ICT), based on a CAWI (Computer-Assisted Web Interviewing) model, were used in the data collection process. The data was always collected in the school setting, under the supervision of teachers. In those schools with internet-connection problems or problems with the condition or number of computers, members of the research team traveled personally to those schools to collect data using tablets. Ultimately, the guided computerized procedure has the advantage of immediately receiving and incorporating the students' responses in the database, thus reducing the possible errors from the data entry process, as well as helping to maintain the anonymity of the responses.

In all of the schools, after contacting the head teacher, deputy head teacher or school counselor by telephone, instructions were given to the teachers who would be supervising the classroom when the adolescents responded to the questionnaires. In addition, instructions for the students were included at the beginning of the questionnaire to guarantee homogeneity amongst all the participants.

Ultimately, data collection complied with the three requirements dictated by the HBSC international protocol (Roberts et al., 2009): students themselves answered the questionnaires; anonymity was guaranteed; and the questionnaires were completed at school under the supervision of instructed staff.

## Statistical Analysis

Firstly, bivariate analyses including chi-square and ANOVA (with Bonferroni test for multiple comparisons) were used to compare the four groups of adolescents (maladaptative, resilient, competent, and vulnerable) in each one of the independent variables (sociodemographic, school context, peer context, lifestyle, psychological, and socioeconomic variables). This analysis corresponds to research question 1. Also, Cramér's V and Cohen's d were calculated to determine the effect size, with the following cut-off points: 0–0.19 = negligible, 0.20–0.49 = small, 0.50–0.79 = medium, 0.80 and above = high (Cohen, 1988).
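As an illustration of the effect-size step, the following sketch computes Cramér's V from a contingency table of counts and applies the cut-off points listed above. It is a hedged example, not the SPSS output used in the study: the function names are hypothetical, the table is assumed to have no empty rows or columns, and values between 0.19 and 0.20 are assumed to count as negligible.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))), where chi2 is the
    Pearson chi-square statistic without continuity correction.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def effect_size_label(value):
    """Label an effect size using the cut-off points stated in the text."""
    magnitude = abs(value)
    if magnitude < 0.20:
        return "negligible"
    if magnitude < 0.50:
        return "small"
    if magnitude < 0.80:
        return "medium"
    return "high"
```

The same labelling function can be applied to Cohen's d values from the pairwise mean comparisons, since the text uses one set of cut-off points for both statistics.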

Secondly, separate binary logistic regression analyses were carried out for resilience and vulnerability, with adaptation status (resilient vs. maladaptative -research question 2- and vulnerable vs. competent -research question 3-, respectively) as the dependent variables, and the different sets of variables analyzed (demographic, school context, peer context, lifestyle, psychological, and socioeconomic variables) as predictor variables. The predictive capacity of each set of variables (controlling for significant demographic variables) was calculated using the Nagelkerke R<sup>2</sup>. Afterwards, a final model including only significant variables in previous analysis was estimated. The odds ratio (OR) and its confidence interval at the 95% level (95% CI) were calculated for each examined predictor, establishing the statistical significance as p < 0.05 for each variable.
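The two quantities reported for each model can be related to the underlying logistic regression output as sketched below. This is an illustrative sketch under stated assumptions, not the study's SPSS syntax: the function names are hypothetical, the confidence interval is assumed to be a Wald-type interval built from the coefficient and its standard error, and the Nagelkerke R<sup>2</sup> is computed from the log-likelihoods of the null and fitted models.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a logit coefficient and its standard error.

    Assumption: a Wald-type interval, exp(beta +/- z * se), with
    z = 1.96 for the 95% level.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2 from the null and fitted model log-likelihoods.

    The Cox and Snell R^2, 1 - exp(2 * (ll_null - ll_model) / n), is
    divided by its maximum attainable value, 1 - exp(2 * ll_null / n).
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell
```

A predictor is significant at p < 0.05 under this approach when its 95% CI for the OR excludes 1.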

Statistical analyses were conducted using the IBM SPSS Statistics 22.0 software.

## RESULTS

## Research Question 1. Comparisons Among the Four Adaptation Groups: Maladaptative, Resilient, Competent, and Vulnerable Adolescents

This first subsection focuses on the comparisons among maladaptative, resilient, competent and vulnerable adolescents in all variables of this study. The comparisons of these groups show significant differences (p < 0.001, V = 0.231, small effect size) in the distribution of gender. **Table 5** shows that the maladaptative and vulnerable groups have a greater proportion of girls. However, comparisons between the four groups are significant neither for educational center (p = 0.067, V = 0.087, negligible effect size) nor for habitat (p = 0.145, V = 0.051, negligible effect size).

**Table 6** shows the distribution of the maladaptative, resilient, competent and vulnerable groups in the age, school context, peer context, lifestyle, psychological, and socioeconomic variables. The mean comparisons of the contrasts between pairs of groups can be consulted in **Table 7**.

Regarding age, older adolescents fell into the maladaptive and vulnerable categories, followed by the resilient adolescents and finally, the youngest fell into the category of competent adolescents.

With respect to school, the competent adolescents show higher perception of academic achievement than the resilient adolescents, who in turn have a higher perception than the maladaptative and vulnerable adolescents. In relation to feeling toward school, the competent adolescents have the most positive feelings toward school and the highest perception of teacher support, followed by the resilient and vulnerable adolescents and, finally, the maladaptative adolescents.

In their peer relationships, the competent and resilient adolescents show the highest perception of social support, followed by the vulnerable adolescents and, finally, the maladaptative adolescents. The resilient and competent adolescents have a higher rate of positive models of behavior in their peer group than the maladaptative adolescents, with the vulnerable adolescents falling in the middle. Likewise, resilient and competent adolescents have higher satisfaction with their friendships than the vulnerable adolescents, and this group shows more satisfaction than maladaptative adolescents. In relation to bullying, the maladaptative adolescents show a higher likelihood to have been bullied and to have bullied others than the other groups (resilient, competent, and vulnerable adolescents).

Regarding lifestyles, the competent and resilient adolescents eat breakfast more days a week, followed by the vulnerable adolescents and, finally, the maladaptative adolescents. The resilient and competent adolescents eat fruit more frequently than the maladaptative adolescents do (the vulnerable adolescents show an intermediate score between the maladaptive and resilient adolescents). Likewise, resilient and competent adolescents do more physical activity (moderate to vigorous and vigorous) than the maladaptative and vulnerable adolescents. The competent adolescents brush their teeth more frequently than the maladaptative and resilient adolescents (the vulnerable adolescents show an intermediate score between the competent and resilient adolescents). In relation to tobacco, the maladaptative adolescents show higher use than the other three groups. However, the competent adolescents show lower alcohol use than all the others.

#### TABLE 4 | Independent variables and instruments used for their assessment in the present study.

The analyses of psychological variables show differences in sense of coherence among the four groups of adolescents. Ordered from the highest to lowest score they are: competent, resilient, vulnerable, and maladaptative adolescents. In relation to emotional regulation, the competent adolescents have the highest score, followed by the resilient and vulnerable adolescents and, finally, the maladaptative adolescents. The resilient and competent adolescents present more curiosity and exploration and they see themselves as less obese than the maladaptative and vulnerable adolescents. In addition, there are differences among the four groups regarding satisfaction with body image. Ordered from highest to lowest they are: competent, resilient, vulnerable and maladaptative adolescents. Lastly, significant differences are found in parents' education, showing that the educational level of the competent and vulnerable adolescents' fathers is higher than that of the fathers of maladaptative adolescents. However, the educational level of the competent adolescents' mothers is higher than that of the mothers of maladaptative and vulnerable adolescents. The resilient adolescents show an intermediate score between maladaptive and vulnerable adolescents for both the father and mother's education. There are significant differences in perceived family wealth, being higher in the resilient and competent adolescents than it is in the maladaptative ones (in this case the vulnerable adolescents are situated between the competent and maladaptive adolescents).

## Research Question 2. The Study of the Resilient Adolescents

This second subsection focuses on those adolescents who, despite having low-quality parent-child relationships, have a high global health score, that is to say, the resilient group (4.5% of the global sample and 13.4% of the participants classified as low-quality in parent-child relationship). This group of adolescents is compared with those who, having a low-quality parent-child relationship, have a low global health score, that is to say, the maladaptative group (18.9% of the global sample and 56.5% of the sample with low-quality parent-child relationships).

The results of the logistic regression analyses using the group of resilient adolescents as the reference value are shown below. Specifically, six models have been estimated, one for each group of independent variables (although sex and age have been included in all of them to prevent them from becoming confounding variables). Additionally, a global model is shown at the end, including only those variables that were found to be significant in previous models.

As can be seen in the first row of **Table 8**, although model 1 explained 10.8% of the total variability, with the variables sex and age being significant (specifically, boys and younger adolescents have a higher probability of being resilient), the percentage of correctly classified adolescents using these demographic variables only was 0%.

In model 2, concerning school context, the explained variance is 22.8%, and 22.1% of the resilient adolescents are correctly classified. In this case, those adolescents with a higher perception of academic achievement, with an OR of 1.83 (95% CI = 1.44–2.33), and those with higher teacher support (OR = 1.19, 95% CI = 1.11–1.29), have a higher likelihood of being resilient.

In model 3, which includes the variables of peer context, the predictive capacity is 23.8%, with 19.2% of the adolescents in the resilient group being correctly classified. Significant variables in this model are models of behavior, satisfaction with friendships and being a victim of bullying. Adolescents who are more satisfied with their friendships are 1.5 times more likely to be resilient (95% CI = 1.26–1.79), whereas those that were victims of bullying more often have a lower likelihood of being resilient (OR = 0.53, 95% CI = 0.33–0.84). Likewise, those adolescents with a group of friends providing better models of behavior also show a higher likelihood of being resilient (OR = 1.07, 95% CI = 1.01–1.14).

Model 4 is devoted to variables related to lifestyles and shows a level of explained variance of 24.7%, with 25.6% of the resilient adolescents being correctly classified. Only two variables in this model are significant: breakfast frequency and moderate to vigorous physical activity. Specifically, those adolescents that engage in higher levels of moderate to vigorous physical activity are 1.37 times more likely to be resilient (95% CI = 1.22–1.54). Additionally, those adolescents who have breakfast more regularly are more likely to be resilient (OR = 1.24, 95% CI = 1.12–1.38).

Model 5 includes the psychological variables. Among the six specific models, this model shows the highest level of explained variance, which reaches 37% (30.4% of the adolescents in the resilient group are correctly classified). The significant variables in this model are: sense of coherence, curiosity and exploration and satisfaction with body image. Sense of coherence stands out for its high OR, which is 3.18 (95% CI = 2.14–4.73), meaning that those adolescents with higher scores in this psychological construct have the highest likelihood of being resilient. Next, adolescents with a higher satisfaction with their body image are 1.83 times more likely to be resilient (OR = 1.83, 95% CI = 1.31–2.56). Lastly, those adolescents with higher scores in curiosity and exploration have a higher likelihood of being resilient (OR = 1.07, 95% CI = 1.03–1.11).

Model 6, referring to the socioeconomic variables, shows a lower predictive capacity than the previous models (13.1%), with only 3.5% of resilient adolescents being correctly classified. The only significant variable in this model is perceived family wealth, meaning that those who perceive a higher family wealth are 1.98 times more likely to be resilient (95% CI = 1.37–2.85).

Finally, in model 7 or the global model (the one which includes only the significant variables from previous models), the results show that the variables sex, age, models of behavior in the peer group and being a victim of bullying lose predictive capacity. Therefore, the global model includes the following nine variables: perceived academic achievement, perceived teacher support, satisfaction with friendships, breakfast frequency, moderate to vigorous physical activity, sense of coherence, curiosity and exploration, satisfaction with body image and perceived family wealth. This model stands out for its high predictive capacity, surpassing 50% of the explained variance (specifically, 51.8%). Additionally, there is a notably high proportion of correctly classified resilient adolescents, specifically, 51.5%.


TABLE 6 | Descriptive statistics of the age, school context, peer context, lifestyle, psychological and socioeconomic variables between maladaptative, resilient, competent and vulnerable adolescents.

MVPA, Moderate to Vigorous Physical Activity; VPA, Vigorous Physical Activity.

The most influential independent variables, with ORs higher than 2, are perceived family wealth (OR = 2.83, 95% CI = 1.47–5.44) and sense of coherence (OR = 2.74, 95% CI = 1.84–4.10). Satisfaction with body image (OR = 1.84, 95% CI = 1.31–2.58) and perceived academic achievement (OR = 1.64, 95% CI = 1.13–2.38) also stand out. Lastly, more modest contributions were found for breakfast frequency (OR = 1.33, 95% CI = 1.13–1.56), satisfaction with friendships (OR = 1.31, 95% CI = 1.02–1.68), frequency of moderate to vigorous physical activity (OR = 1.24, 95% CI = 1.05–1.45), teacher support (OR = 1.19, 95% CI = 1.06–1.33) and curiosity and exploration (OR = 1.05, 95% CI = 1.01–1.10).

## Research Question 3. The Study of Vulnerable Adolescents

This third section focuses on those adolescents who, despite having good-quality parent-child relationships, show low global health scores, that is to say, the vulnerable group (3.9% of the global sample and 11.9% of the group of participants that showed high-quality in parent-child relationships). This group of adolescents is compared with those adolescents who, having a good-quality parent-child relationship, show a high global health score, that is to say, the competent group (18.3% of the global sample and 56.1% of the group with good-quality parent-child relationships).

Results from the logistic regression analyses, taking the group of vulnerable adolescents as a reference value, are shown below. As in the analyses of the resilient adolescents, six models have been estimated, one for each set of independent variables (including the variables sex and age in all of them, so that they do not become confounding variables). In addition, a global model is presented at the end in which only the significant variables from previous models are included.

As can be seen in the first row of **Table 9**, although model 1 overall explained 9.7% of total variability, with the variables sex and age being significant (specifically, girls and older adolescents have a higher probability of being vulnerable), the percentage of correctly classified adolescents using these demographic variables only was 0%.

TABLE 7 | Mean comparisons test (ANOVA with Bonferroni correction for multiple comparisons and effect size) of age, school context, peer context, lifestyle, psychological and socioeconomic variables between maladaptative, resilient, competent, and vulnerable adolescents.

MVPA, Moderate to Vigorous Physical Activity; VPA, Vigorous Physical Activity. Effect size interpretation: 0–0.19 = negligible (–), 0.20–0.49 = small (\*), 0.50–0.79 = medium (\*\*), 0.80 and above = high (\*\*\*). Bold values indicate (small, medium, or high) effect size values.

In model 2, regarding school context, the explained variance is 20.4% and the model correctly classifies 16.8% of the vulnerable adolescents. Specifically, those adolescents who perceive lower teacher support, with an OR of 0.87 (95% CI = 0.83–0.91), have less positive feelings toward school (OR = 0.77, 95% CI = 0.65–0.91) and worse academic achievement (OR = 0.60, 95% CI = 0.50–0.72) have a higher likelihood of being vulnerable.

Model 3, which includes the variables of peer context, shows a predictive capacity of 16.8%, with 10% of the vulnerable adolescents being correctly classified. Those adolescents who report lower perceived social support (OR = 0.97, 95% CI = 0.94–0.99) and less satisfaction with friendships (OR = 0.76, 95% CI = 0.69–0.83) are more likely to be vulnerable.

Model 4 is devoted to variables related to lifestyles and its explained variance level is 21%, with 17.5% of the vulnerable adolescents being correctly classified. In this model, alcohol use stands out, showing that those adolescents who show a higher frequency of alcohol use in the last 30 days are 1.45 times more likely to be vulnerable (95% CI = 1.20–1.76). Likewise, the adolescents who do less moderate to vigorous physical activity (OR = 0.74, 95% CI = 0.67–0.80) and less vigorous physical activity (OR = 0.89, 95% CI = 0.80–0.99) have a higher likelihood of being vulnerable.

The group of psychological variables, analyzed in model 5, has the highest level of explained variance among the six specific models. Specifically, the level of explained variance in model 5 is 44.7% and 52% of the vulnerable adolescents are correctly classified. As in the previous section regarding resilient adolescents, the significant variables in this model are: sense of coherence, curiosity and exploration and satisfaction with body image. The likelihood of being vulnerable is higher in those adolescents with a lower score in sense of coherence (OR = 0.27, 95% CI = 0.18–0.40), less satisfaction with their body image (OR = 0.46, 95% CI = 0.35–0.69) and a lower score in curiosity and exploration (OR = 0.95, 95% CI = 0.92–0.98).

TABLE 8 | Logistic regression models on resilience by demographic, school context, peer context, lifestyle, psychological and socioeconomic variables.

Model 6, examining the socioeconomic variables, again shows a lower predictive capacity than previous models (11.2%), with only 2.7% of the vulnerable adolescents being correctly classified. The mother's educational level is the only significant variable in this model, revealing that those adolescents whose mothers have a lower educational level exhibit a higher probability of being vulnerable (OR = 0.66, 95% CI = 0.51–0.87).

Lastly, in model 7 or the global model (in which only the significant variables from previous models have been included), the following six variables were significant: perceived academic achievement, perceived teacher support, moderate to vigorous physical activity, sense of coherence, curiosity and exploration and satisfaction with body image. The predictive capacity of this model is very high, with an explained variance level of 56.5%. This model was also able to correctly classify a high proportion of vulnerable adolescents, specifically 62.9%.

The independent variables which stand out in this model due to their ORs being closer to zero, and therefore their higher predictive capacity, are sense of coherence (OR = 0.30, 95% CI = 0.19–0.45), academic achievement (OR = 0.49, 95% CI = 0.32–0.76) and satisfaction with body image (OR = 0.50, 95% CI = 0.35–0.71). Moderate to vigorous physical activity (OR = 0.70, 95% CI = 0.58–0.86) and teacher support (OR = 0.85, 95% CI = 0.75–0.95) appear on an intermediate level. Lastly, the level of curiosity and exploration (OR = 0.96, 95% CI = 0.93–0.99) made the most modest contribution. Higher levels of the aforementioned variables are associated with a lower likelihood of belonging to the group of vulnerable adolescents.
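To make the ranking of these odds ratios easier to read, the sketch below converts each OR into the implied logistic regression coefficient and the percentage change in the odds of being vulnerable per one-unit increase in the predictor. It is an illustration, not the authors' code: only the OR values come from the text, and the function names `pct_change_in_odds` and `log_odds_coefficient` are assumptions introduced here.

```python
import math

# Odds ratios for the global vulnerability model (model 7), as reported in the text.
odds_ratios = {
    "sense of coherence": 0.30,
    "academic achievement": 0.49,
    "satisfaction with body image": 0.50,
    "moderate to vigorous physical activity": 0.70,
    "teacher support": 0.85,
    "curiosity and exploration": 0.96,
}

def pct_change_in_odds(odds_ratio):
    """Percentage change in the odds per one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0

def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient (log-odds scale) implied by an OR."""
    return math.log(odds_ratio)

for name, or_value in odds_ratios.items():
    print(f"{name}: b = {log_odds_coefficient(or_value):.2f}, "
          f"odds change = {pct_change_in_odds(or_value):.0f}%")
```

For example, the OR of 0.30 for sense of coherence corresponds to 70% lower odds of being vulnerable per one-unit increase. Because the predictors are measured on different scales, a per-unit comparison of this kind is only a rough guide to their relative importance.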

## DISCUSSION

The aim of this study was to characterize resilience and vulnerability in a large and representative sample of adolescents. This objective was first addressed separately on a number of potential levels of influence (demographic, school, peer, lifestyle, psychological, and socioeconomic variables) and later, in a more holistic approach, by integrating the factors in all the aforementioned levels.

A separate analysis of each of the two phenomena showed, first of all, that although there was a higher representation of boys and younger adolescents in the resilient group, and of girls and older adolescents in the vulnerable group, the variables sex and age were not sufficient to accurately predict adolescent adaptation. Previous research has found differences in wellbeing and adjustment between boys and girls, as well as according to age (Cavallo et al., 2006; Ramos et al., 2010), but at the same time there is notable diversity amongst adolescents of the same sex and age. This diversity tends to be related to the combination of life experiences and psychological characteristics of these adolescents. Hence the demographic variables (which were included in all the regression models) were insufficient to characterize phenomena as complex as resilience and vulnerability, and their significant effects disappeared when they were entered along with the rest of the variables in the final model. In fact, sex and age had already lost their significant effects in previous models, specifically in those evaluating the contributions of psychological and socioeconomic variables. This is probably because those models incorporated variables such as satisfaction with body image, which tends to be lower in girls and more strongly associated with their wellbeing (Knauss et al., 2007; Mond et al., 2011), or family wealth, which tends to be assessed more negatively by older adolescents (Goodman et al., 2001). Therefore, it could be understood that these predictor variables (such as body image or family wealth) account for the predictive capacity of sex and age with respect to resilience and vulnerability.

Beyond demographic variables, a look at the separate models for each set of predictors shows that a hierarchy based on the predictive capacity of each set of variables would be very similar for resilience and vulnerability: psychological variables in the first place, along with contextual and lifestyle variables, and more modest contributions of demographic and socioeconomic variables.

In addition, the final models for resilience and vulnerability revealed a number of common factors that contributed significantly to explaining both phenomena.

First of all, sense of coherence was one of the most important factors for both resilience and vulnerability. This construct, coming from the salutogenic model in the field of public health, has to do with a person's ability to interpret their social environments as predictable and ordered, their confidence that any life demand can be successfully dealt with, and a motivational-emotional component that helps one to see difficult situations as challenges and facilitates an active engagement in problem-solving (Antonovsky, 1987). Therefore, the important contribution of sense of coherence to resilience and vulnerability should come as no surprise. First, its links to some factors associated with successful adaptation in classic resilience studies, such as analytical skills, motivation to engage in the environment, self-efficacy and self-esteem (Masten, 2001; Hamill, 2003), are apparent in the prior description. In addition, research on sense of coherence indicates that its relationship with health and wellbeing is rooted in helping people mobilize other useful coping resources in stressful situations (Lindström and Eriksson, 2010), which has led to its inclusion in the health assets model as a supra-order asset for wellbeing (Morgan and Hernán, 2013). In this sense, one line of research that arises from the results obtained in the current study concerns the processes that explain why a high sense of coherence would help resilient adolescents take full advantage of available resources, whereas low levels of the same would hamper the effective use of the apparently more abundant resources in the case of vulnerable adolescents.

Satisfaction with body image and perceived academic achievement also appeared as important explanatory variables in the analysis of both resilience and vulnerability. The significant contribution of satisfaction with body image is probably related to the importance of physical appearance for adolescents' positive self-perception. In this regard, numerous studies have found a significant relation between satisfaction with body image and self-esteem in adolescence (Tiggemann, 2005), this latter being a factor traditionally connected to successful adaptation (e.g., Dumont and Provost, 1999). Something similar can be said of the relationship between perceived academic achievement and self-efficacy (Danielsen et al., 2009), another fundamental protective factor in resilience research (Hamill, 2003). Additionally, previous research indicates that feeling competent in daily life is very important for the adaptation of individuals suffering adversity (Masten and Coatsworth, 1998). Therefore, it is likely that behind an adolescent who thinks that their teachers consider their academic achievement as good, there are various underlying beneficial elements for adaptation and wellbeing, such as experiences of competence in the school context, higher school connectedness or even a higher intellectual capacity (Masten et al., 1999; Blum, 2005).

In addition to perceived academic achievement, higher levels of teacher support increased the likelihood of showing resilience and diminished that of being part of the group of vulnerable adolescents. Studies about teachers' contribution to adolescent wellbeing also suggest that, regardless of the level of academic achievement, teacher support acts as an asset associated with wellbeing for all adolescents (e.g., García-Moya et al., 2015), which makes it fundamental to favor close and supportive teacher-student relationships.

Moderate to vigorous physical activity was also amongst the significant factors associated with resilience and vulnerability. Physical activity has been found to have protective effects in stressful situations (Gerber and Pühse, 2009; Silverman and Deuster, 2014), and it also tends to reduce the likelihood of engaging in risk behaviors (Pate et al., 1996); it therefore serves as a clear example of the importance of taking into account lifestyles' contributions to explaining resilience and vulnerability.

Finally, higher levels of curiosity and exploration increased the likelihood of being resilient and diminished that of being part of the vulnerable group. The curiosity and exploration construct reflects openness and interest in learning and good management of the uncertainty associated with new or unknown situations (Kashdan et al., 2009), and is associated with psychological and contextual variables significantly linked to adaptation and resilience. Specifically, high levels of curiosity and exploration are related to an active response in unfamiliar and challenging environments (Kashdan and Roberts, 2004) and have been linked to psychological variables such as intrinsic motivation and self-efficacy (Kashdan et al., 2004). Additionally, curiosity was also significantly associated with more positive social interactions (Kashdan and Roberts, 2004). Specifically, people with higher levels of curiosity and exploration generated more positive responses from strangers, who tended to be more responsive, participative, and interested in social exchanges with people with high curiosity. Despite this, the contribution of curiosity and exploration to the final model was relatively modest, probably due to its connections with other constructs, such as sense of coherence. The conceptual delimitation of curiosity and exploration is still under study (Kashdan et al., 2009), and with regard to sense of coherence one focus of analysis and debate is precisely its connection to other constructs in positive psychology (Lindström and Eriksson, 2010). Consequently, advancing in the conceptual delimitation of these constructs, identifying common elements and differences between them, is an important line of research (García-Moya and Morgan, 2016) that could contribute to a better understanding of resilience and vulnerability and, in general, of their role in promoting adolescent wellbeing and adjustment.

As explained in the previous lines, the vast majority of the examined factors operated by increasing the likelihood of good adaptation in resilient adolescents and diminishing it in vulnerable ones. Overall, this suggests more similarities than differences in the factors contributing to explaining resilience and vulnerability. These findings coincide with previous research pointing out that factors associated with resilience are not specific to this phenomenon, but that they are the manifestation of basic systems of human adaptation and, therefore, are influential in both adversity and non-adversity situations (Masten and Coatsworth, 1998; Masten, 2001). Additionally, some scholars have noted that protective factors identified in resilience and vulnerability studies frequently correspond to the positive pole of risk factors for maladaptation or, in other words, that in this type of research it is possible to identify factors in which one of their extremes facilitates successful adaptation while the opposite hampers it (Sameroff and Fiese, 2000; Fergus and Zimmerman, 2005; Luthar et al., 2015), which seems to coincide with findings in the present study.

Despite the predominant similarities described so far, results also revealed some differential aspects between the resilience and vulnerability phenomena. First, the psychological variables showed a larger explanatory capacity in vulnerable adolescents than in resilient ones (R<sup>2</sup> = 0.447 and 0.370, respectively), whereas factors related to school and peer contexts, especially the second, showed a stronger association with resilience than with vulnerability (R<sup>2</sup> = 0.228 and 0.238 respectively in the models on resilience vs. 0.204 and 0.168 for vulnerability). Some research suggests that certain protective factors such as temperament (e.g., Werner and Smith, 1982) or intellectual capacity (Masten and Coatsworth, 1998), to name some classic examples, have a multiplier effect, i.e., they can contribute to a higher likelihood of encountering other positive events in life, giving rise to chain reactions favoring positive adaptation or, conversely, they can initiate cascading effects in which new risk factors are more probable. Applying a similar logic, it can be hypothesized that certain psychological variables, such as a low sense of coherence, a lower tendency toward curiosity and exploration, or a higher dissatisfaction with body image, could be preventing vulnerable adolescents from taking advantage of potential resources in extrafamily environments (school and peer contexts), whereas resilient adolescents, despite their more unfavorable family context (which was the indicator used for the definition of adversity in the current study), would be more likely to find and benefit from resources available in extrafamily environments thanks to their more positive profile in psychological variables.

Along these lines, prior research has documented the existence of compensatory effects from other contexts in only part of the adolescents exposed to low-quality family contexts (e.g., García-Moya et al., 2013b).

Second, three of the examined factors, specifically perceived family wealth, satisfaction with friendships and breakfast frequency, were only significant in the analysis of resilience. This means that these variables made a difference for adolescents exposed to adversity in the family context (resilient vs. maladaptative adolescents) but did not contribute to explaining differences in adaptation between vulnerable and competent adolescents. The scientific literature has extensively documented that resilience has among its defining attributes an ability, despite adversity, to find and take advantage of any resources and opportunities in proximal environments.

In this sense, it is not surprising that being raised in a family environment with good socioeconomic resources opens a horizon of possibilities to resilient adolescents that they seem to know how to take advantage of. What is interesting in the findings of the present study is that although vulnerable and resilient adolescents reported similar levels of perceived family wealth, this factor made one of the most noticeable contributions in analyzing resilience (but not vulnerability). A reflection on the nature of the indicator used may help understand this finding. On the one hand, research suggests that perceived family wealth includes some of the elements which are common to objective indicators such as family affluence, and therefore, it can arguably be interpreted as indicative to some extent of the wider access to external resources and opportunities for development that families' socioeconomic level relates to (Bornstein and Bradley, 2003). However, research also indicates that subjective and objective measures are not assessing exactly the same content (Hartley et al., 2015; Elgar et al., 2016), since unlike objective indicators perceived family wealth may also incorporate a comparative assessment of the socioeconomic position of the adolescent's family in comparison with that of others they relate to (Moreno-Maldonado et al., under review). The levels of wealth perceived by resilient adolescents may therefore represent a relative socioeconomic advantage for these adolescents compared to their peers also exposed to adversity in family relationships (the maladaptative group).

Results on satisfaction with friendships can also be interpreted in a similar sense, i.e., that resilient individuals are able to take advantage of the potential resources they find. Peer support tends to be considered a protective factor in adversity situations (Olsson et al., 2003) and resilient adolescents in the present study probably illustrate very well the compensatory effects which are frequently mentioned in this field (e.g., Luthar et al., 2015): they belong to a group who, despite coming from families in which parent-child relationships are not good, is able to build positive relationships with their peer group and benefit from them (Lansford et al., 2003; Rubin et al., 2004). In a similar vein, Luthar et al. (2015) state that relationships with peers can become a "remedial" socializing context for children who grow up exposed to family adversity. In addition, positive peer relationships are indicative of good social competence, a fundamental skill in which resilient adolescents usually show positive results, which are comparable to those of competent adolescents and clearly more favorable than the social competence levels exhibited by maladaptative adolescents (Masten et al., 1999).

Finally, the fact that breakfast frequency was significant only in the analysis of resilience may be related to the fact that, as children gain more independence during adolescence, the importance of parental supervision in this behavior decreases while internalization of the habit and other personal characteristics, such as constancy and self-regulation, gain prominence (Kalavana et al., 2010). Given that breakfast frequency is also believed to act as a proxy for diverse socioeconomic and family aspects, this is an issue which, in particular, would benefit from further research.

In any case, the comments that have been made throughout this discussion about a higher ability of resilient individuals to take advantage of potential resources in proximal contexts or the important role of psychological factors for explaining the resilience and vulnerability phenomena should not be interpreted as evidence that they are characteristics unrelated to the contextual experiences associated with resilience and vulnerability. As rightly pointed out by Luthar et al. (2015), contextual experiences indeed give shape, from the beginning and in a continuous transactional dynamic, to said skills or psychological resources.

This study has some limitations that should be taken into consideration in the interpretation of its findings. Firstly, its cross-sectional design means that the results must be interpreted on an associative level, and it is not possible to draw conclusions about the directionality of the relationships found. Secondly, adversity was defined using quality of parent-child relationships as a sole criterion. Although, as explained in the introduction, this is a well-informed decision, which draws on scientific literature that highlights the role of family as a basic system for human adaptation (Masten, 2001; Fergus and Zimmerman, 2005), previous research also shows the wide variety of life circumstances that can constitute adversity in childhood and adolescence (Luthar et al., 2000); consequently, it would be inappropriate to generalize these findings to other adverse circumstances. Finally, this study used a factorial health score as its measure of adaptation. Although this measure is a sound and validated global health indicator (Ramos et al., 2010) whose characteristics fit well with key measurement issues in the empirical definition of positive adaptation (Luthar and Cushing, 1999), there is substantial evidence on the multi-dimensional nature of human adaptation, which makes individuals show dissimilar levels of adaptation in different areas (Luthar et al., 1993). Therefore, future research should complement the present study by conducting separate analyses of the contributions of the factors analyzed here to distinct areas of adaptation, mainly the following: academic, behavioral, social and emotional (Masten et al., 1999; Luthar et al., 2000).

Despite the aforementioned limitations, this study also has important strengths. In line with recommendations from some of the seminal reviews in this research field (Luthar et al., 2000, 2015; Masten, 2014), the elements of adversity and adaptation were clearly operationalized for the definition of resilience and vulnerability in the present study, which is fundamental for an adequate interpretation of its findings and its comparability with other studies. The criteria used for making the distinction and comparisons among the four adaptation groups (competent, vulnerable, resilient, and maladaptative) were also based on previous research (Tiêt and Huizinga, 2002). Additionally, this research adheres to the methodological rigor characteristic of the HBSC survey (Roberts et al., 2009) and stands out for its large sample size, which allowed for a characterization of resilience and vulnerability phenomena in a representative and notably large sample. Although the four groups may appear unbalanced in size, the representativeness of the initial sample is a guarantee that this is a relatively realistic reflection of the population. In addition, using effect size tests in all of the analyses minimizes the potential bias that such differences in the subgroups' size could cause from a methodological point of view. The high predictive capacity of the models of resilience and vulnerability obtained, which reached levels of explained variance higher than 50%, is also outstanding. These values are considered high in the field of behavioral science (Cohen, 1988), being notably above the 10–20% that is usual for associations between protective factors and adaptation outcomes in resilience studies (Luthar et al., 2000). Finally, this study has three elements that are, to some extent, innovative.
Firstly, factors traditionally receiving little attention as referred to in the introduction, such as lifestyles, satisfaction with body image, sense of coherence, curiosity and exploration and perceived family wealth, were analyzed in the present study of resilience and vulnerability. Secondly, this study included vulnerable adolescents, a population subgroup that had rarely been studied in previous research due to its limited sample size (Masten et al., 1999). Additionally, this work makes a valuable contribution regarding the prevalence of vulnerability and resilience in the general population. Given the difficulties associated with defining resilience and vulnerability and the limited methodological consensus with regard to the measures to use and how to apply them to a representative sample, it is understandable that prevalence studies are not available. In this regard, the present study found that vulnerable adolescents made up 3.9% of the global sample, representing 11.9% of the group that reported high-quality parent-child relationships. The resilience group represented 4.5% of the global sample, corresponding to 13.4% of the participants with low-quality parent-child relationships, which is in line with the findings of some longitudinal studies that have found a very low prevalence and stability in resilient coping (Cicchetti and Rogosch, 1997). Specifically, Bolger and Patterson (2003) found that between 6 and 21% of abused children were functioning competently during at least one of the temporal points examined in their longitudinal follow-up from middle childhood to early adolescence, but less than 5% consistently maintained that competent functioning over time.
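The two percentages reported for each group also imply the approximate share of the whole sample that falls in each parent-child relationship category. The sketch below works through that arithmetic. The function name `implied_category_share` is a hypothetical helper, the inputs are the rounded figures quoted above, and the results are therefore approximations derived here rather than values reported by the study.

```python
def implied_category_share(pct_of_sample, pct_of_category):
    """Share of the whole sample (in %) in a relationship-quality category,
    given a subgroup's share of the sample and its share of that category."""
    return pct_of_sample / pct_of_category * 100.0

# Vulnerable: 3.9% of the sample, 11.9% of the high-quality-relationship group.
high_quality_share = implied_category_share(3.9, 11.9)
# Resilient: 4.5% of the sample, 13.4% of the low-quality-relationship group.
low_quality_share = implied_category_share(4.5, 13.4)

print(f"High-quality relationships: about {high_quality_share:.1f}% of the sample")
print(f"Low-quality relationships: about {low_quality_share:.1f}% of the sample")
```

Under these rounded inputs, both categories come out at roughly one third of the sample.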

In addition to its strengths from a research perspective, which have just been highlighted, the fact that the present study provides valuable implications for the improvement of the methodological quality of interventions with resilient and vulnerable populations, which was one of its guiding principles, should also be noted amongst its strong points.

Throughout these pages a number of important factors for adolescents' successful adaptation have been underlined. These include certain personal characteristics (such as sense of coherence, satisfaction with body image and curiosity and exploration), as well as some that characterize their lifestyles (regularity in healthy eating habits and physical activity) or that refer to the quality of their developmental contexts (such as satisfaction with peer relationships, academic achievement and teacher support). Therefore, all of these are elements to bear in mind in interventions aimed at promoting successful adaptation and wellbeing in adolescence (Olsson et al., 2003). Likewise, this study highlights the need to conduct further research devoted to developing reliable and valid indicators for the assessment of all these factors, both those that characterize the individual person and the ones that characterize their developmental contexts. These indicators will serve the double function of detecting subjects with different profiles of adaptation as well as of monitoring their evolution and evaluating the implemented interventions.

On a separate issue, it should be noted that some studies have advocated that interventions should be adjusted to the distinct developmental needs of adolescence (Kim et al., 2015). What the present research adds is that detecting different adaptation profiles would also serve to adjust interventions to every person's specific needs. On the one hand, some could argue that allocating powerful and costly resources to detect and intervene in vulnerable individuals, who represent 3.9% of adolescents, would not be an efficient strategy. However, it must be noted that this study used very demanding criteria to define the categories of vulnerability/resilience, and hence may have underestimated the prevalence figures. Additionally, it is well known from the accumulated evidence in previous research that life paths of vulnerable people will be full of difficulties in very different areas (this paper has provided some good examples of this). From an economic perspective, those adverse life paths will lead to a lot of public spending in the education, health, legal and judicial, and labor systems, amongst others (see Khan et al., 2015), if the direct and indirect costs to which these difficulties will give rise are taken into consideration; consequently, they should be detected as soon as possible. On the other hand, one should not forget that amongst the adaptation profiles considered in this paper, 18.88% of adolescents were maladaptative, with clear intervention needs, and in the remaining 77.3% of adolescents there will most likely always be areas of improvement and optimization in need of reinforcement. Similarly, it could also be thought that the resilient adolescents, for whom our study shows a prevalence only slightly higher than 4%, would not need any intervention because they seem to resist adversity without help. It is true that these adolescents seem to have an admirable capacity to deal with adversity, but their resilience is not without limits.
As can be seen when comparing them to the competent adolescents (please compare values of the resilient column with values of the competent column in **Tables 4**, **5**), resilient adolescents scored lower in an important number of variables. In other words, even though adolescents in the resilient group showed very high levels of adjustment despite coming from adverse family environments, their adjustment levels could still be higher if they were aided in taking full advantage of their skills and if interventions were implemented at the source of adversity. Needless to say, reducing adversity in their family environment should also be a top priority. Additionally, certain studies have already warned of the risks of underestimating resilient adolescents' needs for support, since some of these adolescents, despite being classified as resilient for showing excellent competence levels according to external and behavioral indicators, can nonetheless suffer from elevated levels of emotional distress (Luthar et al., 1993).

A final, more general consideration should probably be added. In the dynamic relationship between research and intervention underlined in this paper, it should be emphasized that interventions should not work with models that explain development and change from a linear or even interactive perspective, since empirical evidence shows data in favor of transactional models that involve much more complex multilevel dynamic systems (Sameroff, 2010). Therefore, even though all recent school intervention efforts aimed at strengthening life skills to optimize development and along the way prevent risk behaviors deserve our most sincere recognition and applause (Springer et al., 2004), the intervention that we defend here should go further. This guiding conceptual framework leads to the claim that intervention in adolescence should be preceded by an ambitious systematic and multi-sector intervention starting at the beginning of life. In this vein, as already noted by Luthar and Cicchetti (2000), interventions should take into consideration and simultaneously work on different levels of influence (individual, family and extrafamily) and should begin as early as possible. Community work with families and current steps toward promoting positive parenting very early in the baby's life are good points of reference in this direction (Rodrigo et al., 2012).

In summary, this study emphasizes the enormous potential of research on resilient and vulnerable individuals, both for creating scientific knowledge and for designing intervention guidelines. For a long time psychology overlooked both phenomena (vulnerability and resilience), due to the predominant scientific interest in central trends, i.e., toward what happened to the majority of people. Research was focused, on the one hand, on those individuals that succumbed to adversity, and on the other, on those that showed strength as the result of having grown up surrounded by quality relationships. However, psychology must acknowledge the great deal that has been learnt since then by studying the limited percentage of people whose developmental trajectories apparently challenged the central-tendency hypotheses of that time: individuals who appeared to be strong and healthy despite adversity, as well as those who, despite growing up surrounded by strengths, seemed to be weak. Analysing the life trajectories of the first helps us to clarify what is desirable that all people have in their lives, and the analysis of the life paths of the second teaches us what is necessary to eradicate in all of them.

## ETHICS STATEMENT

The study was approved by the ethics committee of Comité Ético de Experimentación de la Universidad de Sevilla. Parents of adolescents participating in the study received a letter with information about the study and informed consent model.

## AUTHOR CONTRIBUTIONS

All authors conceived of the study, participated in its design and helped to draft the manuscript. The introduction was drafted by IG, the methods and results by PR and FR, and the discussion by IG and CM. All authors made suggestions and critical reviews to the initial draft and contributed to its improvement until reaching the final manuscript, which was read and approved by all authors.

## ACKNOWLEDGMENTS

This study has been funded by the Ministerio de Sanidad, Servicios Sociales e Igualdad [Ministry of Health, Social Services and Equality] of Spain.

## REFERENCES

processes. Dev. Psychopathol. 15, 139–162. doi: 10.1017.S0954579403000087

Crockett, and R. K. Silbereisen (Cambridge: Cambridge University Press), 201–223.

Young People's Health and Well-being, Vol. 7. Health behaviour in school-aged children: International report from the 2013/2014 survey. Health policy for children and adolescents, World Health Organization Regional Office for Europe, Copenhagen.

of 13- and 15-year old adolescents. Sch. Psychol. Int. 21, 195–212. doi: 10.1177/0143034300212006


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Moreno, García-Moya, Rivera and Ramos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Animal Models of Maladaptive Traits: Disorders in Sensorimotor Gating and Attentional Quantifiable Responses as Possible Endophenotypes

Juan P. Vargas, Estrella Díaz, Manuel Portavella and Juan C. López\*

Animal Behavior and Neuroscience Lab, Department of Experimental Psychology, Universidad de Sevilla, Seville, Spain

#### Edited by:

Pietro Cipresso, IRCCS Istituto Auxologico Italiano, Italy

#### Reviewed by:

Eddy A. Van Der Zee, University of Groningen, Netherlands Kenn Konstabel, National Institute for Health Development, Estonia

\*Correspondence:

Juan C. López jclopez@us.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 07 September 2015 Accepted: 03 February 2016 Published: 19 February 2016

#### Citation:

Vargas JP, Díaz E, Portavella M and López JC (2016) Animal Models of Maladaptive Traits: Disorders in Sensorimotor Gating and Attentional Quantifiable Responses as Possible Endophenotypes. Front. Psychol. 7:206. doi: 10.3389/fpsyg.2016.00206

Traditional diagnostic scales are based on a number of symptoms to evaluate and classify mental diseases. In many cases, this process becomes subjective, since the patient must calibrate the magnitude of his/her symptoms and therefore the severity of his/her disorder. A completely different approach is based on the study of the more vulnerable traits of cognitive disorders. In this regard, animal models of mental illness could be a useful tool to characterize indicators of possible cognitive dysfunctions in humans. Specifically, several cognitive disorders such as schizophrenia involve a dysfunction in the mesocorticolimbic dopaminergic system during development. These variations in dopamine levels or dopamine receptor sensitivity correlate with many behavioral disturbances. These behaviors may be included in a specific phenotype and may be analyzed under controlled conditions in the laboratory. The present study provides an introductory overview of different quantitative traits that could be used as possible risk indicators for different mental disorders, helping to define a specific endophenotype. Specifically, we examine different experimental procedures to measure impaired attentional responses linked to sensorimotor gating as a possible personality trait involved in maladaptive behaviors.

Keywords: dopamine, endophenotype, latent inhibition, mental disorder, prepulse inhibition

## INTRODUCTION

The criteria used by current diagnostic scales are based on the analysis of external symptoms of the patient. Disorders such as attention deficit with hyperactivity or mental disorders such as schizophrenia are diagnosed based on symptoms that, in many cases, require the patient to evaluate their intensity. This situation creates a serious problem for the diagnosis, given the large amount of subjective information handled by the psychologist or the psychiatrist (Robbins et al., 2012).

The problems of subjectivity and comorbidity in diagnosis are, in part, a consequence of the absence of biological markers to facilitate proper classification of the disorder. Diagnostic manuals such as the DSM or ICD change the criteria for inclusion or exclusion of a disorder with relative ease, due largely to the heterogeneity and complexity of the symptoms that define that disorder. These are so complex that patients with different symptoms might have the same diagnosis, a fact that significantly increases the difficulty of providing proper treatment. This high comorbidity between various diseases indicates a clear deficiency in the classification system of mental disorders, preventing the identification of valid pathologies (Hyman, 2010). It is possible that psychotherapeutic and pharmacological failures are largely due to this fact. Note, for example, that the therapeutic effectiveness of pharmacological treatments reaches approximately 50% (Wong et al., 2010).

The use of a diverse group of pharmacological treatments to relieve disorders such as depression is also an indicator of the heterogeneity of the diagnosis. For example, serotonin reuptake inhibitors are used for a specific type of depressive symptoms, which differs from those treated with MAO inhibitors or tricyclics. The differential response of each patient to treatment indicates that disorders included in the same category should be treated with different principles. Alternatively, this phenomenon could indirectly indicate that different types of disorders within a category have a different biological basis.

An alternative to this traditional view is the characterisation of endophenotypes. An endophenotype is a quantitative measurable trait associated with a genetic predisposition (Gottesman and Shields, 1972, 1973). In contrast to the symptomatic view of psychopathology, the endophenotype analyses the characteristics that show possible brain vulnerability to suffer a specific type of disorder. The objective is the study and quantification of specific features that reflect a mental disorder associated with a biochemical sign (Hasler et al., 2006; Turetsky et al., 2007). Throughout its long history, the functional study of behavior in the laboratory has provided a number of indicators that could serve as markers for selective expression of the maladaptive behaviors. Applying this model to the field of psychopathology, mental disorders could be considered as extremes at one or both tails of these normal distributions (Miller and Rockstroh, 2013). From this point of view, psychopathology would view disorders as dimensional notions, and not as categories under a binary diagnosis (Hyman, 2010; Frances and Widiger, 2012; Morris and Cuthbert, 2012).
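The dimensional view described above can be illustrated with a minimal sketch. It assumes a single quantitative trait score per subject and flags subjects whose standardized score lies in either tail of the sample distribution. The function name `tail_subjects`, the sample scores and the cutoff of 2 standard deviations are hypothetical choices made for illustration only; they are not diagnostic thresholds and do not come from the studies cited here.

```python
from statistics import mean, stdev


def tail_subjects(scores, cutoff=2.0):
    """Return {subject: z-score} for subjects in either tail of the sample.

    `scores` maps a subject label to a quantitative trait score.
    `cutoff` is an arbitrary illustrative threshold in standard deviations.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return {}  # no variability, so no subject can be in a tail
    z_scores = {subject: (value - mu) / sigma for subject, value in scores.items()}
    return {s: z for s, z in z_scores.items() if abs(z) >= cutoff}


# Hypothetical trait scores (arbitrary units) for ten subjects.
scores = {
    "s01": 52.0, "s02": 48.0, "s03": 50.0, "s04": 51.0, "s05": 49.0,
    "s06": 50.0, "s07": 47.0, "s08": 53.0, "s09": 50.0, "s10": 20.0,
}
flagged = tail_subjects(scores)
print(flagged)
```

In this sketch the trait remains a continuous score for every subject; the cutoff only marks where a given sample's tails begin, which is the sense in which a disorder would be treated as a dimension rather than as a binary category.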

Here, we provide a set of measurable procedures sensitive enough to be used to identify possible endophenotypes developed from animal models. These endophenotypes are based on the correlation between brain processes and measurable responses of a subject that enable us to discriminate between different sets of symptoms, and facilitate new specific therapies. In addition, the evaluation of these traits could facilitate a more objective classification system of psychopathologies.

## HOW DOES THE USE OF AN ANIMAL MODEL CONTRIBUTE TO PSYCHOPATHOLOGY CLASSIFICATION?

Recent developments in genetics and epigenetics improve our understanding of behavior and of mental disorders. The fact that some behaviors have a Mendelian basis suggests the possibility of finding simple mutations that affect behavior in a relatively specific manner. However, only a small group of features are known to be Mendelian traits (or 1:1 traits) in relation to genotype. Mental disorders such as depression or schizophrenia are clearly polygenic, or may also be generated by various mutant alleles of the same gene and specific environmental conditions, making the analysis of their causes a complex procedure (Zahn-Waxler et al., 1988; Winokur and Kadrmas, 1989; Kidd, 1997; Moldin, 1997; Owen, 2000; Torrey and Yolken, 2000; Goldman, 2012). Moreover, these illnesses are the result of the interaction of both genetic and epigenetic factors. Although we now have suitable tools for genotype analysis, the fact that these etiological factors (genes and environment) interact to produce similar phenotypes significantly increases the difficulty of precisely defining the specific weight of each one in the generation of behavior (Plomin and Rende, 1991). Identifying which groups of genes may contribute to the expression of a disorder is a long process of molecular genetics. However, relating groups of genes to specific traits is currently a more achievable goal.

The use of animal models for the study of personality traits, vulnerability to certain disorders or substance dependence is an interesting strategy for developing behavioral protocols in the laboratory. Although in some cases these models could show poor face and predictive validity, the construct validity associated with the etiology or mechanism of the underlying disorder is usually high (O'Donnell, 2011). For example, animal models of schizophrenia have been successful in evaluating risk factors (see **Table 1**). This fact is crucial in order to develop new pharmacological treatments or genetic therapies. However, the reduced face validity is often a problem when the results are applied to humans.

The development of endophenotypes is one alternative to try to improve this model. Taking advantage of high construct validity, we can develop sensitive tests for quantifying specific traits. Measures such as latent inhibition (LI) or prepulse inhibition (PPI) are, among others, easily quantifiable under controlled conditions in the laboratory. In addition, these procedures can be used in a similar way in both animals and humans, and the results are easily extrapolated from an animal model to a human model (Le Pen et al., 2011). While PPI is a very simple procedure seeking to analyze early attentional gating mechanisms, LI is a learning process related to selective attention and habituation to irrelevant information (Lubow and Gewirtz, 1995; Swerdlow et al., 1996; Braff and Swerdlow, 1997). Animal models indicate that problems in the expression of PPI or LI correlate with cognitive deficits such as those in working memory or alternation behavior, with locomotor alterations such as hyperactivity induced by a dopamine receptor agonist, and with some negative symptoms also described in pathologies such as schizophrenia (Flagstad et al., 2004; Le Pen et al., 2006; Moore et al., 2006; Hazane et al., 2009). For example, patients with schizophrenia show these symptoms associated with a dysfunctional prefrontal cortex (PfC; Manoach, 2003; Silver et al., 2003; Godsil et al., 2013); therefore, a behavioral test aimed at evaluating PfC function is a useful tool for an accurate differential diagnosis. The knowledge acquired in recent years on the use of quantifiable measurements of these traits is boosting the development of unified models of diagnosis that include data from all levels, that is, the genetic, biochemical, and behavioral levels.

#### TABLE 1 | Several animal models have studied schizophrenia.

Pharmacological models have used amphetamine, PCP or NMDA to simulate some of the symptoms. However, only two models have shown the illness as a developmental process. The neonatal ventral hippocampal lesion (NVHL) and MAM models showed marked maladaptive behavior when animals reached adulthood. Le Pen et al. (2011) and O'Donnell (2011) have described several behavioral procedures where we can find similar results with different techniques aimed at developing a dysfunctional PfC.

But the question is, how can we contribute to this proposal? Consider, for instance, one of the most complex disorders, schizophrenia. Currently, schizophrenia is an umbrella term for a diverse group of disorders with possibly different etiologies. Focusing on PfC dysfunction, animal research has provided explanatory models to understand the possible development of this mental disturbance (O'Donnell, 2011; Godsil et al., 2013). Procedures aimed at altering gestation and fetal development, such as the methylazoxymethanol (MAM) model, or techniques affecting the maturation process of the PfC, such as ventral hippocampus lesion in neonates, allow us to analyze this disorder experimentally (Waddington et al., 1999; Bramon et al., 2005; Chambers and Lipska, 2011). Both procedures show a clear PfC dysfunction (Tseng and O'Donnell, 2004, 2007). Cell unit recording studies indicate the possibility of a deficit in inhibitory GABAergic cells. This could be the cause of an excessive release of dopamine in the mesocortical system (Tseng and O'Donnell, 2004, 2007; O'Donnell, 2011; Godsil et al., 2013), and could be the reason that PPI or LI are affected in these animal models and in schizophrenia. Both behavioral processes require an operative PfC for normal expression, and both are very sensitive to disturbances in this structure. In this regard, a deficit in one or both processes could be a risk factor. On the whole, the characteristics of a dysfunctional PfC and the impairment in LI and PPI expression could be signs of a specific type of mental disorder, in contrast to the current model of mental illness, where the disorder and its severity are expressed in terms of a scale filled out by the patient or a close family member.

## DOPAMINERGIC SYSTEM AS A SIGN OF A POSSIBLE RISK FACTOR

The function of the neurotransmitter dopamine has attracted great interest because of its relationship with the processes of learning and with several mental disorders such as schizophrenia, depression, ADHD or addiction to substances of abuse (Robbins, 1992; Feldman et al., 1997; Weiner, 2003; Grace and Sesack, 2010; Simpson et al., 2010; Wise, 2010; Milad and Rauch, 2012; Díaz et al., 2015). Dopaminergic neurons are abundantly distributed in the central nervous system. The midbrain neurons and their efferent projections to the ventral striatum and PfC play a special role in the learning process (Robbins and Everitt, 1996). Dopaminergic pathways from the ventral tegmental area (VTA) to the nucleus accumbens (NAc) are closely linked to the motivational processes of learning (Berridge and Robinson, 1998; Berridge, 2007). Many stimulant drugs, such as cocaine or amphetamine, act at this site, where they significantly increase the release or reduce the reuptake of dopamine in the system.

Dopamine receptors belong to the G-protein coupled receptor family. All of these receptors possess seven transmembrane domains, and five subtypes have been identified according to their molecular characteristics. These have been grouped into two pharmacological families according to the effect produced by agonists and antagonists. The D1 family includes the D1 and D5 receptor subtypes. Both stimulate adenylyl cyclase, producing cAMP. The D2 receptor family, on the other hand, includes the D2, D3, and D4 subtypes. These receptors inhibit the formation of cAMP. The D1 receptor is the most abundant in the central nervous system (Missale et al., 1998). The greatest concentration of this receptor is found in the neostriatum, NAc, amygdala, and substantia nigra. However, its affinity for dopamine is relatively low. The D2 receptor is found in high concentrations in the neostriatum (GABAergic neurons), in the NAc and hippocampus, and with a moderate density in the substantia nigra, cerebral cortex, globus pallidus, thalamus, and hypothalamus. These data make D1 and D2 receptors specific targets for the study of cognitive, emotional, and motivational disorders. Electrophysiological studies have made important contributions concerning their functional activity in the mesolimbic system (O'Donnell and Grace, 1998; Moore et al., 1999; Grace, 2000; Floresco et al., 2001; Goto and Grace, 2008). These studies are of great relevance given the importance of these receptors in the processes of associative learning. Studies have shown that the D2 receptor is located in the projections of both the PfC and the amygdala in the form of autoreceptors (O'Donnell and Grace, 1995; Groenewegen et al., 1999; Goto and Grace, 2008). Specifically, D2 receptors are located in presynaptic areas, where they modulate the dopaminergic activity of the VTA over the NAc through excitatory projections. That is why this receptor has been linked to goal-directed or controlled processes that require high attentional activity (O'Donnell and Grace, 1995; Goto and Grace, 2008). In contrast, the activity of D1 receptors in the mesolimbic system differs from that described for D2. These receptors are located in the post-synaptic cells of the NAc that receive glutamatergic afferents from the hippocampus and dopaminergic afferents from the VTA (O'Donnell and Grace, 1995; Groenewegen et al., 1999; Goto and Grace, 2008).

Disturbances in this system increase the risk of developing serious mental illness (O'Donnell, 2011; Godsil et al., 2013). Disorders such as schizophrenia have been linked directly to disturbances during brain development associated with the second trimester of pregnancy (Waddington et al., 1999; Bramon et al., 2005). Changes in dopaminergic sensitivity, in the levels of dopamine or in the number of dopamine receptors could be the result of this process. Specifically, the family of D2 receptors seems to be more closely related to the disease process (Grace and Sesack, 2010; Simpson et al., 2010; Wise, 2010; Milad and Rauch, 2012), since antagonists of these receptors such as haloperidol are effective in reducing symptoms (Lubow and Weiner, 2010). For this reason, it has been suggested that a substantial increase in this type of receptor underlies the disorder, as shown, for instance, in studies of post-mortem tissue (Seeman and Niznik, 1990).

Currently, the drug treatments for disorders such as schizophrenia or ADHD act directly on dopaminergic modulation in the brain. The changes caused by the blockade or stimulation of the receptors of this neurotransmitter can be studied in the laboratory. Changes in sensorimotor gating or in the selection of relevant stimuli from the environment can be considered possible quantifiable traits directly related to the level of dopamine or dopaminergic receptors in the mesocortical and mesolimbic systems. It is important to emphasize that injuries to the PfC produce dopamine dysregulation and deficits in PPI and LI expression.

## PPI OF STARTLE RESPONSE. A SENSORIMOTOR GATING MEASURE

The startle response to an intense stimulus is a reflex behavior that has been described in all mammals studied. It is a fast twitch of the skeletal muscles that leads to the processing of environmental stimuli and guides the attention of the subject to a possible threat. This type of response is interesting because it has been associated with specific genes that appear in schizophrenia and has been proposed as a possible trait with endophenotypical characteristics. For example, Vaidyanathan et al. (2014) studied the startle blink reflex using a very large human sample. Analyzing the startle response, they found a specific heritable pattern of behavior in the sample. In addition, this trait was associated with candidate genes in the endophenotype of schizophrenia. Although the startle response is an automatic reaction, its outcome can be modulated by the prior presentation of a stimulus of lower intensity. PPI is therefore defined as the attenuation of the startle response to an intense pulse when it is preceded by a lower-intensity prepulse stimulus. When the prepulse is perceived, the startle mechanism is inhibited and the animal displays a lower response (Graham, 1975; Lüthy et al., 2003; Larrauri and Schmajuk, 2006).
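The definition of PPI given above can be made concrete with a short sketch. It assumes the widely used convention of expressing PPI as the percentage reduction in mean startle amplitude on prepulse-plus-pulse trials relative to pulse-alone trials; the text itself does not specify a formula, so this convention, the function name `percent_ppi` and the sample amplitudes are assumptions made for illustration.

```python
from statistics import mean


def percent_ppi(pulse_alone, prepulse_pulse):
    """Percent prepulse inhibition from two lists of startle amplitudes.

    pulse_alone:    amplitudes on trials with the intense pulse only.
    prepulse_pulse: amplitudes on trials where a weaker prepulse precedes the pulse.
    """
    baseline = mean(pulse_alone)
    if baseline <= 0:
        raise ValueError("mean pulse-alone startle amplitude must be positive")
    return 100.0 * (baseline - mean(prepulse_pulse)) / baseline


# Hypothetical startle amplitudes (arbitrary units) for one subject.
pulse_alone_trials = [400.0, 420.0, 380.0]
prepulse_pulse_trials = [200.0, 220.0, 180.0]
ppi = percent_ppi(pulse_alone_trials, prepulse_pulse_trials)
print(f"PPI = {ppi:.1f}%")
```

Under this convention a value near 0% means the prepulse did not attenuate the startle response, so a reduced PPI corresponds to a lower percentage.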

The problems with sensorimotor gating have been linked with the levels of dopamine in the NAc. The NAc integrates information from different structures, and even though dopamine modulation in NAc is dependent on mesocortical and mesolimbic systems (Ellenbroek et al., 1996; Larrauri and Schmajuk, 2006), the selective modulation of PfC afferent transmission is especially relevant. PfC afferences could facilitate behaviors oriented to specific goals, and a dopamine deficit could be involved in the incapacity to control the behavior (Goto and Grace, 2008).

It should be noted that the dopaminergic innervation of the PfC increases progressively through adolescence until adulthood. In this period, we can find modifications in density, shape and organization of the circuits (Kalsbeek et al., 1988; Benes et al., 2000; Seamans and Yang, 2004; Segalowitz and Davies, 2004; Manitt et al., 2011; Naneix et al., 2012). A mature circuit allows the dopaminergic neurons to fit their responses in an adaptive way, modulating their response in correlation with environmental changes (Spear, 2000; Tseng and O'Donnell, 2004, 2007; O'Donnell, 2011; Cass et al., 2013; Godsil et al., 2013). Currently, it is estimated that delays or alterations in the maturation process of the PfC dopamine system could be the cause of a large number of mental disorders (O'Donnell, 2011; Godsil et al., 2013). Specifically, a poor inhibitory capacity of the PfC over the NAc may be the major etiological factor in severe disorders such as schizophrenia. In fact, a deficit in the response to the pulse has been observed in different types of cognitive disorders, and it is specifically relevant in patients with schizophrenia (Braff et al., 1992, 2001a,b). Thus, a reduced PPI could be used as a trait for attentional deficit, besides being included as a schizotypy personality trait or a possible endophenotype of schizophrenia (Cadenhead et al., 2000; Braff, 2010; O'Donnell, 2011).

However, this trait is not specific to patients with schizophrenia; rather, it indicates a vulnerability trait that is particularly clear in these patients. In this regard, a PPI deficit could be a necessary condition as a risk factor for schizophrenia, but it would not be sufficient by itself. A PPI deficit might be found in several disorders, and a pathological process such as schizophrenia requires other indicators.

## PPI, DOPAMINE AND IMPULSIVITY: A TRAIT, A NEUROTRANSMITTER AND A QUANTIFIABLE MEASURE NOT ASSOCIATED EXCLUSIVELY WITH SCHIZOPHRENIA

PPI is easy to measure in animals, including humans. It has been used in animal models of schizophrenia, although in several studies this procedure has also been correlated with impulsivity traits. López et al. (2015) analyzed PPI in rats classified as impulsive by an autoshaping procedure. Animals designated as sign trackers showed approach behavior to a conditional stimulus before delivery of the unconditional stimulus. Specifically, for sign tracker animals (STa) the conditional stimulus could be a surrogate of the unconditional stimulus (Flagel et al., 2007; Robinson and Flagel, 2009). These animals showed high levels of dopamine in the NAc, but only in the presence of a conditional stimulus (Flagel et al., 2011). These data were consistent with the results of López et al. (2015) using a PPI procedure. In fact, the STa showed a lower PPI response to stimuli of low intensity. This reduced inhibitory ability distinguished the behavioral pattern of the STa from that of normal animals. Furthermore, these data may indicate that sign-tracking subjects are more vulnerable to cognitive disorders in which dopamine is involved.

An important question about the vulnerability of STa to impulsive behavior concerns the specific activity of the D2 subtype of dopamine receptor. This receptor is located presynaptically on PfC terminals, and has been related to a selective modulation of the NAc that facilitates goal-directed behaviors (Goto and Grace, 2008). In addition, several psychopathologies associated with the PfC have shown a deficit between this structure and its projections to the NAc (O'Donnell, 2011). López et al. (2015) found a possible vulnerability in STa, since these animals showed a high sensitivity of the D2 receptor to the administration of an agonist such as quinpirole. This drug affected only STa performance, indicating that this type of trait differs from that observed in schizophrenia. It is appropriate at this stage to point out the difference between an animal model of impulsivity and an animal model of schizophrenia regarding a dysfunction in the PfC. These models have developed several protocols to evaluate attentional processes, and LI is a strong candidate to discriminate between impulsivity and schizophrenia, because it allows attention and executive functions, both specific to PfC function, to be evaluated. Animal models of impulsivity have found differences in the incentive salience of the conditional stimulus (Berridge and Robinson, 1998; Berridge, 2007), but not the attentional problems found in schizophrenia models.

## LI, DOPAMINE AND ATTENTIONAL DEFICITS

LI is a learning process observed when the acquisition of a conditional response to a conditioned stimulus paired with a reinforcer is retarded if the same stimulus has previously been pre-exposed in the absence of the reinforcer. LI pharmacology has been associated almost exclusively with the use of an animal model of schizophrenia, and is therefore largely consistent with the pharmacology of schizophrenia (Lubow and Kaplan, 2010; Lubow and Weiner, 2010; Díaz et al., 2015). Specifically, because some of the symptoms of schizophrenia are characterized by an inability to filter, or ignore irrelevant or unimportant stimuli, an anomalous LI was proposed as a tool for the study of possible deficits of attention (Lubow and Weiner, 2010).
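As a minimal sketch of how the retardation described above might be quantified, the example below compares conditioned responding in a group pre-exposed to the stimulus with a non-pre-exposed group. Taking the difference between group means is only one simple possibility and is an assumption here, as are the function name `latent_inhibition_score` and the sample values; actual LI protocols use a variety of response measures.

```python
from statistics import mean


def latent_inhibition_score(non_preexposed, preexposed):
    """Difference in mean conditioned responding between the two groups.

    A positive score means the pre-exposed group responds less to the
    conditioned stimulus, i.e. acquisition was retarded (LI is present).
    A score near zero means pre-exposure had no effect (LI is absent).
    """
    return mean(non_preexposed) - mean(preexposed)


# Hypothetical conditioned-response measures (proportion of trials with a response).
non_preexposed_group = [0.80, 0.75, 0.85]
preexposed_group = [0.40, 0.45, 0.35]
li = latent_inhibition_score(non_preexposed_group, preexposed_group)
print(f"LI score = {li:.2f}")
```

With a score of this kind, a disrupted LI would appear as a value close to zero and a potentiated LI as a larger positive value than in a comparison sample.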

Again, dopaminergic activity of the NAc is the essential neural substrate for its expression. Animal models have shown that the primary role of the NAc is to restrict the expression of LI under certain conditions, and thus ensure that LI is flexible and sensitive to environmental demands. It is important to highlight that, in the absence of modulatory mechanisms responsible for restricting the expression of LI to specific conditions, the effects of an irrelevant stimulus would be extremely robust and maladaptive. In this regard, LI might reflect the psychological processes that are impaired in schizophrenia, since most patients show a reduced expression of this phenomenon. The identification of brain regions whose damage disrupts LI, together with studies of different parameters of its expression in animal models, can provide important information on the dysfunctional brain circuits in schizophrenia. In previous decades it was suggested that some kind of hyperactivity of the dopaminergic systems represents a primary biochemical alteration in schizophrenia, which apparently constituted at least a plausible justification for biochemical alteration in this disorder (Iversen, 1976).

To gain insight into quantifiable attentional processes in LI, Díaz et al. (2014) analyzed the effect of various types of pre-exposure to a stimulus. The results indicated that there is a transfer from the ventral to the dorsal striatum in the processing of environmental information. In addition, the dorsomedial striatum is key to encode stimuli when these become irrelevant due to the lack of consequences after their presentation. A deficit in PfC could be the cause of a loss of transfer from ventral to dorsal striatum. Currently there are some laboratories working on this possibility. The inability to modulate dopamine in NAc does not allow for attentional disengagement, showing a persistent state of continuous attention.

The inability to encode information as irrelevant is one of the clearest deficits observed in patients with schizophrenia. Many modern learning theories assume that the amount of attention paid to a signal depends on how well the signal has predicted significant events in the past. Schizophrenia is associated with attention deficits, and recent theories of psychosis have argued that positive symptoms such as delusions and hallucinations are related to a lack of selective attention. Patients with schizophrenia who had severe positive symptoms showed a clear difficulty in discriminating between predictive and non-predictive cues when compared to healthy adults. In addition, the rate of learning about non-predictive signals correlated with more severe positive symptoms in schizophrenia. These results suggest that the positive symptoms of schizophrenia were associated with increased attention, both to signals that are likely to be predictive and to those that are not predictive for causal learning. This selective attention deficit was the result of learning irrelevant causal associations (Morris et al., 2013). In this regard, specific protocols that differentiate the expression of LI could be developed and used to assess a possible risk factor in the population.

However, the complexity of this disorder suggests that different etiological factors may underlie the disease. At present there are many contradictory results regarding whether LI is affected in schizophrenia. Lubow and Kaplan (2010) addressed this issue in a review. They emphasize the difference between positive and negative symptoms in relation to the expression of LI. For instance, patients with high levels of negative symptoms and low levels of positive symptoms showed a potentiated LI. These data are relevant because they may reflect different symptoms of the same illness or different illnesses.

## CONCLUDING REMARKS

The mesocortical input of dopamine and the PfC play a critical role in normal cognitive processes and in several neuropsychiatric diseases. This dopamine input regulates aspects of working memory, planning and attention, among others. Similarly, some disturbances may be the basis for a variety of positive and negative symptoms, and therefore of many of the cognitive deficits associated with mental illness. Despite intensive research, we still lack an understanding of the basic principles of dopamine activity in the PfC and the mesolimbic system as a whole. In recent years, there has been considerable effort to understand the cellular mechanisms of modulation of dopamine neurons in the PfC and its relationship with behavior. However, the results of these efforts have often led to contradictions and disputes (Nieoullon, 2002). Given the complexity of the function of the mesolimbic and the dopaminergic systems, the development of new tools will be necessary to facilitate diagnostic discrimination and to provide a more objective assessment of the current classification systems. Namely, we suggest a reconsideration of diagnostic scales by adding other indicators. Clinical psychology has many tools to evaluate PfC dysfunction (for a review see Gruszka et al., 2010). We propose that PPI and LI could help to develop a new classification system, in which we could distinguish a psychotic illness such as schizophrenia caused by a dysfunction in PfC dopamine from other types of schizophrenia included in current scales. As we indicated above, current classification systems could be considering a diverse group of disorders under the same term of schizophrenia, and the different combinations of positive and negative symptoms could indicate the severity of the disorder. The in-depth analysis of these mechanisms, combined with genetic factors, is a new view that could facilitate the development of more specific diagnostic categories and, therefore, a new therapeutic perspective in the future.

## AUTHOR CONTRIBUTIONS

All authors contributed similarly to the theoretical development of the manuscript.

## FUNDING

This research was supported by Ministerio de Economía y Competitividad (PSI2012-32445 grant).

## REFERENCES

latent inhibition. Psychopharmacology 232, 4337–4346. doi: 10.1007/s00213-015-4063-2

medial prefrontal cortex of juvenile rats. Neurobiol. Learn. Mem. 90, 339–346. doi: 10.1016/j.nlm.2008.04.005

Zahn-Waxler, C., Mayfield, A., Radke-Yarrow, M., McKnew, D. H., Cytryn, L., and Davenport, Y. B. (1988). A follow-up investigation of offspring of parents with bipolar disorder. Am. J. Psychiatry 145, 506–509. doi: 10.1176/ajp.145.4.506

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Vargas, Díaz, Portavella and López. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Reading Ability Development from Kindergarten to Junior Secondary: Latent Transition Analyses with Growth Mixture Modeling

Yuan Liu<sup>1,2</sup>, Hongyun Liu<sup>3,4</sup>\* and Kit-tai Hau<sup>5</sup>

<sup>1</sup> Faculty of Psychology, Southwest University, Chongqing, China, <sup>2</sup> Key Laboratory of Cognition and Personality, Southwest University, Ministry of Education, Chongqing, China, <sup>3</sup> Beijing Key Laboratory of Applied Experimental Psychology, Beijing Normal University, Beijing, China, <sup>4</sup> School of Psychology, Beijing Normal University, Beijing, China, <sup>5</sup> Department of Psychology, Chinese University of Hong Kong, Hong Kong, Hong Kong

The present study examined the reading ability development of children in the large-scale Early Childhood Longitudinal Study (Kindergarten Class of 1998-99 data; Tourangeau et al., 2009) from a dynamic systems perspective. To depict children's growth patterns, we extended the measurement part of latent transition analysis to the growth mixture model and found that the new model fitted the data well. Results also revealed that most children stayed in the same ability group, with few cross-level changes in their classes. After environmental factors were added as predictors, analyses showed that children who received higher teachers' ratings, had higher socioeconomic status, and had above-average poverty status had a higher probability of transitioning into the higher ability group.

#### Edited by:

Salvador Chacón-Moscoso, University of Seville, Spain

#### Reviewed by:

Ratna Nandakumar, University of Delaware, USA Jocelyn Holden Bolin, Ball State University, USA

> \*Correspondence: Hongyun Liu hyliu@bnu.edu.cn

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 09 April 2016 Accepted: 11 October 2016 Published: 25 October 2016

#### Citation:

Liu Y, Liu H and Hau K-t (2016) Reading Ability Development from Kindergarten to Junior Secondary: Latent Transition Analyses with Growth Mixture Modeling. Front. Psychol. 7:1659. doi: 10.3389/fpsyg.2016.01659

Keywords: reading development, latent transition analysis, growth mixture model, dynamical systems, social rating

## INTRODUCTION

Reading is an important activity composed of various sub-skills that grow at different speeds. In reality, students are nurtured in a dynamic system in which they are not only self-organizing but also interacting with, and being substantially affected by, the psychosocial environment (Votruba-Drzal et al., 2008; Ding et al., 2013; Iruka et al., 2014). In such a system, one under-researched area is the effect of young students' social environment at home and at school on their learning-to-read behavior. The purpose of the present study, therefore, was to explain the pattern of reading development and to depict the relations between the developmental pattern and children's behavior as perceived by their parents and teachers. We applied a recently developed statistical method, latent transition analysis with a growth mixture model, to a large-scale longitudinal survey (Early Childhood Longitudinal Study, Kindergarten Class of 1998-99, ECLS-K; Tourangeau et al., 2009).

## READING DEVELOPMENT: NON-CONTINUOUS PATTERN AND GROUPING

Reading can be seen as a process of extracting meaning that requires different subskills to work on the text (Stahl, 1997; Clay, 2001; Rodgers, 2004). Recent research has highlighted the need to look more closely at the different skills. Word reading, therefore, might have to be separated from reading comprehension because the former includes some of the basic phonological abilities, letter knowledge, and short-term memory (Muter et al., 2004; Kendeou et al., 2009), whereas the latter may need inference, monitoring, and knowledge of the story structure (Vellutino et al., 2007; Kendeou et al., 2009).

The mastery process of the language is, however, quite different for different subskills, such as for the constrained and unconstrained skills (e.g., Paris, 2005, 2009; Paris et al., 2005). Children's reading ability grows irregularly with spurts and stops (de Weerth et al., 1999). For example, with substantial individual differences, children's language competence may grow extremely rapidly before Spring Grade 1 but may decline thereafter (Palardy, 2010; Kieffer, 2012). Verhoeven et al. (2011) showed that the different patterns of the reading development were distinct from those around Grade 2.

Since the reading development pattern may differ from phase to phase, researchers are very interested in tracing and examining the growth trajectories. Paris (2005) suggested that when the unconstrained skills are calibrated to the constrained skills, reading development follows a non-continuous growth pattern. This may not be easily detected when simple linear growth modeling is used. Thus, for example, Quinn et al. (2015) had to use a two-part model to depict separately the developmental trajectories of vocabulary knowledge and reading comprehension from Grade 1 to Grade 4. Their bivariate model showed that vocabulary knowledge acted as a causal indicator of subsequent reading comprehension growth. In summary, if researchers intend to depict the full picture of the reading developmental trajectory, a continuous growth model may not be suitable. Students stay at different "stages," adapting to each new context by using different reading skills.

A more sophisticated issue is that not all students share the same growth pattern across stages (Kaplan, 2002; Pianta et al., 2008). Empirically, these differential patterns in growth can be analyzed by (i) differentiating children into language ability groups and (ii) tracing their changes in groups as they progress through school. For example, while most students develop rapidly before Spring Grade 1 and then slow down afterwards, some children may have a consistently slow growth rate (Kaplan, 2005; Kaplan, 2008; Palardy, 2010).

The variation in growth rate is more likely to occur in the lower grades, as early as first grade (Ferrer et al., 2015) or around the age of eight (Stanovich, 1986). Studies have also shown that dyslexic readers would probably grow at a slow pace that hardly enables them to catch up with typical readers (Grimm et al., 2010; Ferrer et al., 2015). The grouping phenomenon among slow developers is potentially harmful to them, since these low-ability-group students may have lower self-efficacy or motivation to learn, in addition to their ability shortage. Thus, it is important to find the conducive factors that help these low ability students to "transit" into the higher competence group.

To address these challenging questions, we need a combined model to depict the various developmental patterns with spurts and stops. Furthermore, as students' growth is determined by their current pre-existing ability as well as by other influential factors in the environment, a dynamic systems model was adopted to analyze the interplay of these factors.

## READING DEVELOPMENT IN DYNAMIC SYSTEMS

To depict and explore reading development, two issues should be noted. Firstly, the sub-skills are correlated with one another. For example, Verhoeven et al. (2011) showed that vocabulary at the beginning phase could predict word decoding and reading comprehension at the early stages of development. From Grade 2 onwards, word decoding competence in turn predicted later vocabulary development. Reading ability develops under the effects of the former skills (Oakhill and Cain, 2012). Secondly, children live in a complicated environment where many external factors may influence reading development. Thus, a dynamic systems view should be introduced when describing such development.

The dynamic systems theory originates from natural science studies (for a review, see van Geert, 2003). According to this perspective, individual development is a consequence of the dynamic interactions within an individual and between an individual and the environment. In the last two decades, the dynamic systems view has been intensively discussed and widely applied, especially in language development research (Robinson and Mervis, 1999; van Geert and Steenbeek, 2005; Hollenstein, 2011; van Geert, 2011).

According to dynamic systems theory, reading development can be described in terms of the change, interactions, and conjoint analysis of the individual and environment systems (Clay, 1977, 2001). For example, Clay (2001) believed that individuals would be able to construct and self-organize with their potential ability. They push through the boundaries and improve their knowledge with the skills they have already mastered. Proficient readers are thus able to mobilize the processing systems to fit the challenges of different texts by using environmental cues such as visual and motor stimulants. Kainz and Vernon-Feagans (2007) showed that the acquisition of reading ability was not isolated from the outside world. Kainz and Vernon-Feagans worked with their colleagues and developed a system of dynamic circles involving the individuals, families, classrooms, and school systems. This would be helpful to children's reading development and could possibly help their transitions into higher ability groups (Kainz and Vernon-Feagans, 2007; Vernon-Feagans et al., 2008).

Among various factors in the social environment, teachers' and parents' perceptions of and attitudes toward students' study behaviors play important roles. These factors and their interplay vary from one individual to another and crucially affect students' academic outcomes. Ladd et al.'s (1999) Child × Environment model provides further explanation of how the quality of children's relationships can directly and indirectly influence school achievement from a dynamic systems perspective. In the model, they show that children's initial behavior or background factors influence their relationships with peers and teachers. Peer and teacher relationships in the school environment enhance or sometimes adversely affect students' achievement. For example, students from lower socioeconomic backgrounds are likely to benefit more from teachers who employ a more interpersonal approach to instruction, such as incorporating mixed group work, using peer tutoring, and solving problems with partners (Jung, 2014). Other studies have also consistently shown that a high-quality teacher-child relationship is conducive to high achievement (Davis, 2003; Pianta and Stuhlman, 2004; Hughes and Kwok, 2007; O'Connor and McCartney, 2007; Hughes et al., 2008). This relationship is also influenced by children's social behavior, such as their classroom engagement, which in turn affects children's achievement and academic outcomes (Cohen, 1997; Hughes and Kwok, 2007; O'Connor and McCartney, 2007).

From a dynamic systems perspective, teachers and parents could offer help to speed up children's transition into higher ability groups (Cho et al., 2013; Eyden et al., 2014). For example, teachers' and parents' perceptions of students' ability and effort are closely related to children's academic achievement (Rytkönen et al., 2007; Natale et al., 2009; Longobardi et al., 2011). In particular, since highly motivated children are perceived as talented and effortful (Upadyaya et al., 2012), parents' and teachers' positive perceptions of children would be conducive to children's development. Upadyaya and Eccles (2015) showed in a longitudinal study that teachers' perceptions of ability and effort could predict subsequent reading ability. How teachers and parents perceive students, and how they show their positive evaluations to them, is thus quite important. This is because in the early elementary school years, children often assimilate teachers' perceptions when forming judgments of their own ability (Rosenholtz and Simpson, 1984; Tiedemann, 2000). From another perspective, children's educational aspirations also partially reflect their parents' and teachers' expectations of them, which highlights the importance of setting an appropriate but sufficiently high educational aspiration (Kuklinski and Weinstein, 2001; Herbert and Stipek, 2005).

## THE PRESENT STUDY

Two important issues were addressed in the present study. Firstly, we were interested in the transition that shows students' potential to develop their abilities. Children in the same group share patterns in that they improve in their mastery of different reading skills and thus grow together from one stage (lower ability groups) to the next (higher ability groups). Secondly, and more importantly, we were interested in the environmental variables, especially the parents' and teachers' perceptions of the children, that might facilitate such a transition.

Guided by these issues, we examined several research questions under the dynamic systems theory. First, according to the integrated view of dynamic systems theory, a self-organizing process reflects an autoregressive development. We examined whether and to what extent the subsequent ability status was determined by the previous status. Second, we examined how much individual difference existed in students' growth trajectories. Finally, the contribution of parents' and teachers' perceptions to students' growth was examined.

## METHODS

## Participants

We used the publicly available data in the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) (Tourangeau et al., 2009)<sup>1</sup> to examine our research questions. This data set was developed under the National Center for Education Statistics (NCES). We chose the ECLS-K because it focused on children's early school experiences from kindergarten to Grade 8, and the longitudinal data displayed students' long-term developmental trajectories. Furthermore, ECLS-K adopted a multi-source, multi-method approach, which included interviews with parents, data from principals and teachers, information from student records, and direct assessment of children (including reading, mathematics, and science cognitive items). This design was in alignment with the dynamic systems theory, in which various environmental variables are considered.

In total, seven waves of reading assessment measures were available in the data set (C1R4RSCL–C7R4RSCL). Because the Fall Grade 1 wave (C3R4RSCL) contained only 30% of the total sample, it was excluded from our study, which should not jeopardize the generalizability of our conclusions. The remaining data points were from Fall Kindergarten, Spring Kindergarten, Spring Grade 1, Spring Grade 3, Spring Grade 5 and Spring Grade 8 (y<sub>1</sub> − y<sub>6</sub> in **Figure 1**).

<sup>1</sup>Retrieved from http://nces.ed.gov/ecls/kindergarten.asp.

Together with the parents' and teachers' questionnaires, 7803 children's questionnaires were available for our analyses.

There were 456 individuals with missing covariate values, and a total of 1033 individuals with missing values on one or more of the covariates or indicators. As for the missing rate of each variable, apart from the slightly higher rate at y<sub>1</sub> (7.0%), all others ranged from 0.5 to 4.8% only, with an overall average missing rate of 2.6%. Generally, the missing pattern of the present dataset could be treated as missing at random, so the multiple imputation method in Mplus 7.0 (Muthén and Muthén, 2012) could be appropriately used. We generated 10 datasets, and the sample size of 7803 was applied to the analyses both with the null model and with covariates included. Basic information on the variables is shown in **Table 1**.

## Measures

### Reading Ability

The reading items were drawn from assessments used in other large-scale studies of similar-aged youth, including the National Assessment of Educational Progress (NAEP), the National Education Longitudinal Study of 1988 (NELS:88), the Education Longitudinal Study of 2002 (ELS:2002), the Texas Assessment of Knowledge and Skills (TAKS), and previous rounds of the ECLS-K. The reading items in ECLS-K were repeatedly measured with ten levels of the reading ability (see **Figure 2**). Each new wave was recalibrated to the former one and tests at each wave included some identical items so that the instruments at different waves could be linked on the same IRT scales (represented on the same unit of measurement). Specifically, in the collection of the Grade 8 data which was used in the present analysis, all the proficiency scores for the former levels were re-estimated to be pooled with the latest wave (see Tourangeau et al., 2009 for details).

### Social Rating

The social rating was the evaluation of the children's behavior by parents and teachers. The items were obtained from the Social Rating Scale (SRS) Approaches to Learning scales of the ECLS-K Parent and Teacher questionnaires. The SRS survey items comprised parents' and teachers' ratings of whether and how frequently students showed study-related behaviors. The scale contained items on intrinsic motivation, persistence/attention, and study habits. These ratings by teachers and parents, rather than self-reports by children, reflected students' social behaviors as perceived by others and thus showed the interaction between students and their guardians.

A four-point scale was used, with "1 = never" and "4 = very often." Parents' SRS was collected annually except in the third, fifth, and eighth grades, while teachers' SRS was not collected at the eighth grade. In the study, the SRS in Fall Kindergarten was used to predict the transition of latent class. All these items were used as continuous variables in the present analyses (see Tourangeau et al., 2009).

### Background Information

Many studies have investigated the relationships among socioeconomic status (SES), poverty, race, minority status, and achievement, and these characteristics are generally used as background variables (e.g., Hattie, 2009; OECD, 2013). Specifically, SES refers to students' relative position in the social hierarchy, directly reflects the resources at home, and is often used as an important control variable. Both SES and poverty status measure important characteristics of the family background and were thus chosen for our analysis (see Tourangeau et al., 2009).

## Analyses Procedure

### Model Definition

The latent transition analysis (LTA) (Prochaska and Velicer, 1997) was used to analyze the longitudinal transitions. The autoregression part of the LTA model described appropriately the self-organizing process under the dynamic systems. LTA also allowed us to add environmental covariates to moderate the autoregression process. With the LTA model, the measurement part could be further replaced according to different contexts and situations.
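The autoregressive part with covariates can be sketched as a multinomial logistic regression of later group membership on earlier group membership. The notation below is assumed for illustration and is not taken from the source: G1 and G2 denote group membership at the earlier and later stage, K is the number of groups, x_i is the covariate vector of child i, and a and b are intercepts and slopes.

```latex
P(G2_i = k \mid G1_i = j, \mathbf{x}_i)
  = \frac{\exp\left(a_{kj} + \mathbf{b}_{kj}^{\top}\mathbf{x}_i\right)}
         {\sum_{m=1}^{K}\exp\left(a_{mj} + \mathbf{b}_{mj}^{\top}\mathbf{x}_i\right)},
\qquad a_{Kj} = 0,\quad \mathbf{b}_{Kj} = \mathbf{0}.
```

Here the last group serves as the reference category. Letting the slopes depend on the earlier group j is what would allow a covariate to moderate the transition out of one particular group.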


y<sub>1</sub> − y<sub>6</sub> = reading ability indicators from Wave 1 to Wave 6.

identifying upper- and lower-case letters of the alphabet; (2) Beginning Sounds, associating letters with sounds at the beginning of words; (3) Ending Sounds, associating letters with sounds at the end of words; (4) Sight Words, recognizing common "sight" words; (5) Words in Context, reading words in context; (6) Literal Inference, making inferences using cues that were directly stated with key words in text; (7) Extrapolation, identifying clues used to make inferences; (8) Evaluation, demonstrating understanding of author's craft and making connections between a problem in the narrative and similar life problems; (9) Evaluating Nonfiction, comprehension of biographical and expository text; and (10) Evaluating Complex Syntax, evaluating complex syntax and understanding high-level vocabulary (Tourangeau et al., 2009).

As an extension, we took advantage of the growth mixture model (GMM) to replace the measurement part of the original LTA (see Muthén et al., 2012). The GMM could detect the growth of reading skills by allowing individual differences in growth rate within each group, in contrast to the more stringent requirement of the original measurement part, which allows little individual difference at each time point.

According to earlier studies (Votruba-Drzal et al., 2008; Kieffer, 2012), Spring Grade 1 (y<sub>3</sub>) was chosen as the cut point between the two stages. Thus, y<sub>1</sub> − y<sub>3</sub> were the indicators of Stage 1 (kindergarten stage), with latent growth factors INT1 and SLP1 (intercept/initial state and slope) classified into latent groups (G1), whereas y<sub>3</sub> − y<sub>6</sub> were the indicators of Stage 2 (primary to junior high school), with latent growth factors INT2 and SLP2 classified into groups (G2, see **Figure 1**).
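The two-stage specification can be written out as a pair of group-specific linear growth models. This is a sketch rather than the authors' exact parameterization: the time scores λ, the residuals ε, the group-specific factor means α, and the within-group deviations ζ are assumed notation, because the source does not report how time was coded.

```latex
\begin{aligned}
\text{Stage 1: } y_{ti} &= \mathrm{INT1}_i + \lambda^{(1)}_{t}\,\mathrm{SLP1}_i + \varepsilon_{ti}, & t &= 1, 2, 3,\\
\text{Stage 2: } y_{ti} &= \mathrm{INT2}_i + \lambda^{(2)}_{t}\,\mathrm{SLP2}_i + \varepsilon_{ti}, & t &= 3, 4, 5, 6,\\
\mathrm{INT}s_i &= \alpha^{(s)}_{0g} + \zeta^{(s)}_{0i}, &
\mathrm{SLP}s_i &= \alpha^{(s)}_{1g} + \zeta^{(s)}_{1i}
\quad \text{for } Gs_i = g,\ s \in \{1, 2\}.
\end{aligned}
```

The deviations ζ carry the within-group individual differences in initial status and growth rate that distinguish a growth mixture model from a model with no variation inside a group. The indicator y<sub>3</sub> appears in both stages because it is the cut point.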

### Implementing the 3-Step Analysis

Specifically, to test the effects of environmental facilitating factors, covariates have to be introduced into the LTA. When adding these covariates, it is necessary to find appropriate ways to control for the characteristics that predict membership in the different latent classes. Therefore, a 3-step maximum likelihood method (referred to as the 3-step approach in the subsequent discussion) was used (see Collins and Lanza, 2010; Vermunt, 2010; Asparouhov and Muthén, 2014; see also Liu and Liu, 2015 for details).

In the first step of the 3-step LTA, GMM was used to obtain the latent class classification for each stage, using only the indicators at the respective stage. For example, when estimating GMM at Stage 1, y<sub>1</sub> to y<sub>3</sub> were used as indicators, with y<sub>4</sub> to y<sub>6</sub> and the covariates serving as auxiliary variables; the proportions of each latent class were recorded. Similarly, GMM was conducted at Stage 2. In the second step, using the classification outcomes and the proportions given by Mplus, the classification error was computed for each latent class. In the third step, with the odds ratios computed in the second step as the starting values for each latent class, LTA (G2 regressed on G1) with the covariates (G1, G2, and the transition from G1 to G2, each regressed on the covariates) was applied (for detailed syntax, see Asparouhov and Muthén, 2014).
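The second step can be illustrated with a short sketch. It assumes that step 1 yields, for every child, a vector of posterior class probabilities, and it follows the general logic of the 3-step approach: children are assigned to their most likely class, and the probability of each assignment given the true class is turned into logits. The function name and the toy posterior values are hypothetical; this is not the Mplus implementation.

```python
import math


def classification_error_logits(posteriors):
    """Return (q, logits) for a list of per-person posterior probabilities.

    q[c][a] is the estimated probability of being ASSIGNED to class a
    (most likely class) given TRUE class c. logits[c][a] is
    log(q[c][a] / q[c][last]), with the last class as reference.
    Assumes every q entry is positive (no empty assigned class).
    """
    n_classes = len(posteriors[0])
    # joint[a][c]: summed posterior mass of true class c among people assigned to a
    joint = [[0.0] * n_classes for _ in range(n_classes)]
    for probs in posteriors:
        assigned = max(range(n_classes), key=lambda c: probs[c])
        for true_class, p in enumerate(probs):
            joint[assigned][true_class] += p
    # Normalise within each true class so that each row of q sums to 1
    col_totals = [sum(joint[a][c] for a in range(n_classes)) for c in range(n_classes)]
    q = [[joint[a][c] / col_totals[c] for a in range(n_classes)] for c in range(n_classes)]
    logits = [[math.log(row[a] / row[-1]) for a in range(n_classes)] for row in q]
    return q, logits


# Hypothetical posteriors for four children and two classes (illustration only)
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
q_matrix, logit_matrix = classification_error_logits(toy_posteriors)
print(q_matrix)
print(logit_matrix)
```

In the third step, logits of this kind describe how reliably the assigned class reflects the true class, so that classification error is taken into account when the covariates are added.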

### Model Selection Indices

The selection of the number of latent classes has been a topic of much discussion (e.g., Nylund et al., 2007; Tofighi and Enders, 2008; Peugh and Fan, 2012). Most studies suggested that the Bayesian information criterion (BIC) was the best choice because it is a sample-size-based index that also penalizes model complexity. Tofighi and Enders (2008) showed in their simulation study that the sample-size-adjusted BIC (aBIC) was an even better index, so it was used in our study. A smaller BIC/aBIC value indicated better fit among the competing models. In addition, the entropy value measured how well a mixture model separated the classes. An entropy value close to 1 indicated good classification certainty. Asparouhov and Muthén (2014) suggested that an entropy level of 0.6 or higher might provide sufficiently good classification for the 3-step method.
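For readers who want to reproduce these indices, the following sketch uses their commonly cited definitions: BIC = −2LL + p·ln(n), aBIC with n replaced by (n + 2)/24, and relative entropy rescaled to the 0 to 1 range. The log-likelihood and parameter count in the usage lines are hypothetical values chosen for illustration; only the sample size of 7803 comes from the text.

```python
import math


def bic(loglik, n_params, n):
    """Bayesian information criterion: -2LL + p * ln(n)."""
    return -2.0 * loglik + n_params * math.log(n)


def abic(loglik, n_params, n):
    """Sample-size-adjusted BIC: n is replaced by (n + 2) / 24."""
    return -2.0 * loglik + n_params * math.log((n + 2) / 24)


def relative_entropy(posteriors):
    """1 - sum_i sum_k(-p_ik * ln p_ik) / (n * ln K); 1 means perfect separation."""
    n = len(posteriors)
    k = len(posteriors[0])
    uncertainty = 0.0
    for probs in posteriors:
        for p in probs:
            if p > 0:  # p * ln(p) tends to 0 as p tends to 0
                uncertainty -= p * math.log(p)
    return 1.0 - uncertainty / (n * math.log(k))


# Hypothetical log-likelihood and parameter count (not values from the study)
example_loglik, example_params, sample_size = -4000.0, 12, 7803
print(bic(example_loglik, example_params, sample_size))
print(abic(example_loglik, example_params, sample_size))
print(relative_entropy([[0.95, 0.05], [0.10, 0.90], [0.99, 0.01]]))
```

Because (n + 2)/24 is smaller than n for any realistic sample, aBIC applies a lighter complexity penalty than BIC, which is why the two indices can favor different numbers of classes.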

## RESULTS

## Selection of the Proper Model

As LTA was used in combination with GMM, the original GMM analyses were examined first. The piecewise GMM (y<sub>1</sub> − y<sub>3</sub> as Piece 1 and y<sub>3</sub> − y<sub>6</sub> as Piece 2) null model was chosen. We conducted exploratory analyses with 2 to 4 classes (see **Table 2**). The model fit indices, −2LL, BIC, and aBIC, consistently supported a 3-class model. We then checked the class proportions to ensure empirical significance. For the 2-class model, the proportions were 0.95 and 0.05; for the 3-class model, they were 0.93, 0.05, and 0.02; for the 4-class model, two groups contained no individuals. The third group in the 3-class model was so small (less than 5%) that it would not contribute substantially or empirically to the model, so the 2-class model was retained.

TABLE 2 | Model comparison and selection.

2c, 2 classes; 3c, 3 classes; 4c, 4 classes; 3-step, 3 step method.

We then conducted the GMM-LTA null model, using two stages of growth but without any covariate. Results showed that the 2-class model was the best according to the selection criteria (BIC and aBIC), with slightly worse but acceptable entropy value (see **Table 2**).

Finally, we conducted the 3-step GMM-LTA. BIC was 8099 with an entropy value of 0.92. The information criteria and entropy value indicated that the 3-step model was the best. The final model consisted of two groups at each of the two stages (**Figure 3**).

## Grouping Membership

The classification results are shown in **Table 3**, and the parameter estimates for the growth factors are shown in **Table 4**. At Stage 1, most of the students were classified into the high ability group (90.9%, with an initial ability of −1.10). The other 9.1% were in the low ability group, with a lower initial status (−1.56). The growth rate (slope) of the high ability group (0.93) was slightly faster than that of the low ability group (0.89), but the two groups followed quite similar patterns, as seen from the trajectories in **Figure 3**. For Stage 2, from Spring Grade 1 to Grade 8, children in different classes had different growth rates. The high ability group contained 95.2% of the children, with an initial ability of 0.41 and a growth rate of 0.12, while the 4.8% in the low ability group had an initial ability of −0.48 and a growth rate of 0.19.

High, High Ability Group; Low, Low Ability Group.

TABLE 4 | Parameter estimates of the growth factors for the final model solution.

After grouping, there were two groups in each stage, so four possible combinations of sub-groups were formed (**Table 3**). Combination 1 (90.9%), which contained individuals classified in the high ability groups at both Stages 1 and 2, had the largest proportion. Combination 4, which referred to individuals classified as low ability at both stages, contained 4.8% of the population. This showed that most students' growth was stable (95.7% of the population in total). About 4.3% of students were classified as Combination 3; they moved from the low ability group to the high ability group over time. No individual was in Combination 2, indicating that there was no reversed pattern (a change from the high ability group to the low ability group).

The transition probabilities showed that, once classified into the high ability group, students had a 100% probability of staying in that group thereafter. In contrast, children starting in the low ability group were likely to remain in the low ability group at Stage 2, but still had a considerable probability of transitioning into the high ability group at Stage 2.
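The link between the reported combination percentages and the transition probabilities can be checked with a few lines of arithmetic. The sketch below divides each combination percentage by the size of its Stage 1 group; the percentages are the rounded values quoted in the text, so the results are approximations of the model-based estimates, and the function name is hypothetical.

```python
def transition_probabilities(joint_percent):
    """Convert joint (Stage 1, Stage 2) percentages into P(Stage 2 | Stage 1)."""
    stage1_totals = {}
    for (stage1, _stage2), pct in joint_percent.items():
        stage1_totals[stage1] = stage1_totals.get(stage1, 0.0) + pct
    return {
        (stage1, stage2): pct / stage1_totals[stage1]
        for (stage1, stage2), pct in joint_percent.items()
    }


# Rounded combination percentages as reported in the text
reported = {
    ("high", "high"): 90.9,  # Combination 1
    ("high", "low"): 0.0,    # Combination 2
    ("low", "high"): 4.3,    # Combination 3
    ("low", "low"): 4.8,     # Combination 4
}

for (stage1, stage2), prob in transition_probabilities(reported).items():
    print(f"P(Stage 2 = {stage2} | Stage 1 = {stage1}) = {prob:.2f}")
```

With these inputs, the low-to-high transition works out to 4.3/9.1, or about 0.47, and the high-to-high probability is 1.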

## Effect of the Environmental Factors

We set the significance level at p < .001 for this study because of the large sample size. Results (see **Table 5**) showed that the covariates could predict the Stage 1 classification. Besides the background variables, both parents' and teachers' higher ratings were associated with children's higher reading ability (with the lower ability group as the reference) at Stage 1. The Stage 2 classification could be predicted positively only by the parents' rating and SES level, with higher parents' ratings and SES related to better performance by children (i.e., classification in the higher ability group). In contrast, higher teachers' ratings were related to lower student performance (classification in the lower ability group; β = −6.02, odds ratio = 0.00).

Interactive effects with the group transition were examined. When teachers' ratings were more positive (β = 5.66, odds ratio = 288), children had a higher chance of transitioning from the lower to the higher ability group. Specifically, when the teachers' ratings were one unit higher, the odds that low ability children at Stage 1 would transition to the high ability group at Stage 2 were 288 times higher. However, the effects due to parents' ratings (β = −2.77, odds ratio = 0.06) and SES (β = −4.14, odds ratio = 0.02) were negligible.
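The odds ratios quoted above are the exponentiated logistic coefficients, which the following sketch reproduces from the rounded β values in the text. It also illustrates, with a hypothetical baseline probability of 0.47, that an odds ratio multiplies the odds rather than the probability. The function names are hypothetical, and small discrepancies from the reported values (for example, about 287 here versus the reported 288) are presumably due to rounding of β.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def shifted_probability(baseline_prob, beta):
    """Probability after a one-unit covariate increase, starting from baseline_prob."""
    odds = baseline_prob / (1.0 - baseline_prob) * math.exp(beta)
    return odds / (1.0 + odds)


# Rounded coefficients as reported in the text
coefficients = {
    "teachers' rating -> transition": 5.66,
    "parents' rating -> transition": -2.77,
    "SES -> transition": -4.14,
    "teachers' rating -> Stage 2 classification": -6.02,
}

for label, beta in coefficients.items():
    print(f"{label}: beta = {beta}, odds ratio = {odds_ratio(beta):.2f}")

# Hypothetical baseline: a 0.47 probability of moving from the low to the high group
print(shifted_probability(0.47, 5.66))
```

Under this hypothetical baseline, a one-unit increase in teachers' ratings moves the transition probability close to 1; the probability itself cannot increase 288-fold, which is why the effect is best read on the odds scale.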


<sup>a</sup>Classification was regressed on the covariates.

<sup>b</sup>Stage 2 High ability group (cf. low ability group) was regressed on the covariates in Stage 1 low ability group.

## DISCUSSION

## Developing Patterns

The present study showed that the advanced 3-step GMM-LTA model described the complex longitudinal ECLS-K data set well within the dynamic systems framework. The developmental trend showed fast growth from kindergarten to Spring Grade 1, followed by a slowdown to a plateau thereafter. A closer examination of the reading ability scores (**Figure 2**) showed that the first five levels of reading proficiency were more related to Paris's constrained skills, which were close to perfection after Spring Grade 1. After this time point, students continued to learn unconstrained skills. From **Table 4**, statistical evidence showed that the variances were much smaller at Stage 2 than at Stage 1, especially for the growth rates, which had little variance at Stage 2. This indicates a non-normal distribution across development from kindergarten to Grade 8. It is necessary, therefore, to analyze the reading skills separately at different stages, in which sub-skills develop with quite different speeds and patterns.

The grouping results were consistent with the literature (Grimm et al., 2010; Ferrer et al., 2015) in that two groups with different ability levels could be differentiated. The classification indicates that most of the students were classified in the high ability group at both Stage 1 and Stage 2. We can thus treat the high ability group as the reference "normal" developmental pattern, since it contained more than 90% of the population. Students classified in the lower ability group were therefore those likely to have reading problems. According to Ferrer et al. (2015), the group differentiation could emerge as early as Grade 1; our study indicates that the grouping may emerge even earlier. However, students still had a considerable chance of transitioning into the higher ability group through the self-organizing process (conditional probability was 0.47). Educators should attend to children's reading problems as early as possible, before they develop into more serious language learning problems.

## Environmental Facilitators

We found that all the factors examined had substantial effects on the grouping at Stage 1. Contradictory results were found, however, in the prediction of the Stage 2 grouping and transition. Parents' ratings and SES positively predicted the Stage 2 grouping, whereas they negatively predicted the transition. Conversely, teachers' ratings negatively predicted the classification but positively predicted the transition. These contradictions may reflect problems in long-term prediction efficiency. When we took the transition prediction terms out of our model, all predictions of the Stage 2 grouping showed negative estimates (ranging from −0.48 to −0.10), but with quite small or non-significant effects (odds ratios ranging from 0.70 to 0.91). The social environmental variables collected at Wave 1 may therefore have less predictive power for subsequent ability, especially for long-term growth (8.5 years). This is somewhat similar to a previous study (Upadyaya and Eccles, 2015), which showed that teachers' perceptions of students' effort could predict subsequent reading ability over a small interval (1 year) only. Further investigations of predictive power in long-term studies would be useful.

As for the transition, the results showed that teachers' ratings had larger effects in predicting the transition probability than parents' ratings. This suggests that the teachers' ratings are probably more accurate than those of the parents, which might be explained by the Child × Environment model (Ladd et al., 1999; Pianta and Stuhlman, 2004). To illustrate, the teacher-student relationship mediates the effect of school behavior and other background or cognitive variables on children's achievement. With accurate perceptions, teachers may adopt more efficient approaches to students' learning. Teachers' interaction with students thus acts as a proximal factor that influences academic achievement more directly, while school entry characteristics (family variables) are distal factors. On the other hand, longitudinal studies show that teachers' perceptions of students (of either ability or effort) can predict children's subsequent self-concept (Natale et al., 2009); teachers are significant socialization agents whose perceptions greatly impact children's self-concept formation (Madon et al., 2001) and thus have a great impact on students' ability. To summarize, we are reminded again of the important role of teacher-student relationships, since students spend more time in school with their teachers as they progress through school. In contrast, their after-class activities with parents may decrease, so that the parents' evaluations become less accurate and less predictive of children's reading performance.

As for background variables, SES is a potentially useful predictor of children's reading performance, particularly for grouping but not for transition. Meta-analytic evidence (e.g., Hattie, 2009) showed that SES has a moderate impact (d = 0.57) on academic achievement. In the present research, we took SES as one of the important home background variables, used it as a control covariate, and showed that it influenced grouping. It is logical, therefore, to pay greater attention to the reading development of students from lower SES backgrounds (e.g., Ladd et al., 1999; Jung, 2014).

## LIMITATIONS AND FUTURE DIRECTIONS

One possible limitation is that we used a two-stage model to analyze the data. This choice was based mainly on the general trajectory of reading growth in the data and on findings from earlier studies (Kaplan, 2005; Kaplan, 2008; Palardy, 2010; Kieffer, 2012). However, the interval within each stage (especially Stage 2) is quite large, with the time points of data collection being several years apart. There is a possibility, therefore, that students pass through several discernible stages within such a long period of time. If the intervals between data collections had been much smaller, we would have been more confident in using growth modeling within each stage. An alternative is to build the GMM with a non-linear model (e.g., Grimm et al., 2010), but this requires demanding measures. Future studies could further explore the possible trajectories of reading development, identify the proper cutoff for each stage, and describe the most suitable trend within each stage.

We also note that long-term effects and growth patterns are less well predicted by the social environmental covariates. These covariates may include home and teacher social environmental factors, which generally have smaller effects than more direct variables such as teaching and school (for a meta-analysis, see Hattie, 2009). One possible direction for future study is, therefore, to focus on short-term prediction from a more comprehensive set of social environmental factors from schools (teachers, peers, etc.) and families (parents, etc.). Another possibility is to treat the covariate as a time-varying variable in a multilevel structure (Vermunt et al., 1999; Bartolucci et al., 2011). That is, in our analyses, the social ratings recorded at Kindergarten, Spring Grade 1 and Fall Grade 5 could all be treated as multiple indicators affecting the transition at different time points. Especially when the measurement interval is large, time-varying measures would then produce more accurate predictions.

## CONCLUSIONS

In summary, the study contributes by showing that: (i) the LTA-GMM fitted the data well; (ii) most of the children stayed in the same ability group, with few cross-level class changes in the transition; (iii) children receiving higher teachers' ratings, with higher SES, and of above-average poverty status would have a higher chance of transitioning into the higher ability group. The findings support the importance of the moderating effects of these social environmental facilitators on the patterns of children's reading development.

## AUTHOR CONTRIBUTIONS

YL contributed the most to the article. HL is the corresponding author, who organized and helped conduct the analysis. KH provided useful suggestions on modeling and helped revise the article.

## FUNDING

The manuscript is supported by the National Natural Science Foundation of China (No. 31571152).

## ACKNOWLEDGMENTS

We acknowledge the financial support from the Collaborative Innovation Center of Assessment toward Basic Education Quality at Beijing Normal University.

## REFERENCES


of ability in math and reading? Educ. Psychol. 35, 110–127. doi: 10.1080/01443410.2014.915927


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Liu, Liu and Hau. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Does Exercise Improve Cognitive Performance? A Conservative Message from Lord's Paradox

Sicong Liu\*, Jean-Charles Lebeau and Gershon Tenenbaum

*Department of Educational Psychology and Learning System, Florida State University, Tallahassee, FL, USA*

Although extant meta-analyses support the notion that exercise results in cognitive performance enhancement, methodological shortcomings are noted among the primary evidence. The present study examined relevant randomized controlled trials (RCTs) published in the past 20 years (1996–2015) for methodological concerns arising from Lord's paradox. Our analysis revealed that RCTs supporting the positive effect of exercise on cognition are likely to include Type I Error(s). This result can be attributed to the use of gain score analysis on pretest-posttest data as well as the presence of control group superiority over the exercise group on baseline cognitive measures. To improve the accuracy of causal inferences in this area, analysis of covariance on pretest-posttest data is recommended under the assumption of group equivalence. Important experimental procedures are discussed to maintain group equivalence.

#### Edited by:

*Jason C. Immekus, University of Louisville, USA*

#### Reviewed by:

*Evgueni Borokhovski, Concordia University, Canada; Daniel Saverio John Costa, University of Sydney, Australia*

> \*Correspondence: *Sicong Liu 64zone@gmail.com*

#### Specialty section:

*This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology*

Received: *26 October 2015* Accepted: *06 July 2016* Published: *21 July 2016*

#### Citation:

*Liu S, Lebeau J-C and Tenenbaum G (2016) Does Exercise Improve Cognitive Performance? A Conservative Message from Lord's Paradox. Front. Psychol. 7:1092. doi: 10.3389/fpsyg.2016.01092*

Keywords: exercise intervention, cognition, gain score analysis, ANCOVA, experimental group equivalence, false positive error, review

## INTRODUCTION

Does exercise enhance cognitive functioning in human beings? Meta-analyses have provided support for the beneficial effect of exercise on cognitive performance with effect sizes (g) ranging from 0.097 for acute exercise (Chang et al., 2012) to 0.158 for chronic exercise (Smith et al., 2010). Additionally, some authors have reported on several underlying mechanisms by considering evidence from behavioral and psychophysiological studies (for a review, see Hillman et al., 2008). These arguments seem to offer convincing evidence that exercise results in cognitive performance enhancement. The present study takes a critical perspective on this conclusion by assessing methodological characteristics of relevant evidence.

The most relevant evidence comes from exercise-cognition randomized controlled trials (RCTs). First, these RCTs are considered clinical trials. According to the World Health Organization (2015, para. 3) and the International Committee of Medical Journal Editors (Laine et al., 2007, p. 275), a clinical trial "is any research study that prospectively assigns human participants or groups of humans to one or more health-related interventions to evaluate the effects on health outcomes." Second, the RCT is generally regarded as the best design for testing causal relationships because it makes group equivalence likely on all covariates (Freedman et al., 2007; Torgerson, 2009).

Findings from several exercise-cognition RCTs support the causal relationship between exercise and cognition. For example, Chang et al. (2012) reported a larger effect size from RCTs (d = 0.19) compared to those from either quasi-experimental or observational designs (d = −0.02 and d = −0.14, respectively). These results have led some authors to conclude that exercise benefits cognition in a population ranging from children to older adults. Although such a message is exciting, as Rubin (1974) cautioned, the relevance of evidence to answering research questions is determined not solely by the choice of research design but also by many other factors. Guided by this caution, we examined exercise-cognition RCTs published in the past 20 years for potential methodological shortcomings.

## Why are Errors Possible

When analyzing pretest-posttest data from RCTs, researchers typically apply two group-comparison strategies to draw causal inferences: analysis of covariance and gain score analysis (Vickers and Altman, 2001; Van Breukelen, 2006). Analysis of Covariance (ANCOVA)<sup>1</sup> refers to the approach where posttest scores are compared between groups, adjusting for baseline scores (as covariates in the linear model). Assuming baseline group equivalence, Analysis of Partial Variance is a parallel of this strategy (Cohen et al., 2013). The alternative approach, Gain Score Analysis (GSA), considers the gain score (i.e., posttest minus pretest) as the criterion for group comparison. Forms of GSA include repeated-measures analysis of variance (RM ANOVA), gain score t-test, and ANOVA of gain score, among others. Researchers' choice between ANCOVA and GSA often leads to disparate conclusions, an inconsistency historically termed "Lord's Paradox" (Lord, 1967).

Lord's paradox generated a lasting research effort, and a consensus was eventually reached among methodologists. The consensus is that, as long as baseline group equivalence is likely by randomization (such as in an RCT design), investigators should choose ANCOVA in drawing causal conclusions, because ANCOVA has higher power and yields an unbiased effect estimate compared to GSA (Cronbach and Furby, 1970; Huck and McLean, 1975; Holland and Rubin, 1983; Miller and Chapman, 2001; Senn, 2006; Van Breukelen, 2006). However, when baseline group equivalence is unlikely (such as in a quasi-experimental design), no statistical procedure can "control for" such a flaw, and thus no causal inferences should be attempted (Campbell and Stanley, 1963; Lord, 1967; Cronbach and Furby, 1970; Meehl, 1970; Senn, 2006; Van Breukelen, 2006). To reiterate these points with an analogy, perfect dishes ("causal inferences") come from fresh raw food ("baseline group equivalence") and skillful cooking ("ANCOVA"), whereas no perfect dishes can be made from non-fresh food ("baseline group non-equivalence") irrespective of how skillful the cook is.

Given the conclusion drawn from Lord's paradox, strong evidence for causal inferences can be obtained only if (a) baseline group equivalence is likely, and (b) pretest-posttest data are analyzed using ANCOVA. In practice, researchers never know with certainty that a given RCT has baseline group equivalence, but they can ascertain baseline group non-equivalence when group baseline measures show statistical differences. Assuming that baseline group equivalence is achieved by identifying no baseline group differences on any baseline measures (which is a likely portrait of a given RCT, at least on baseline measures statistically tested), researchers should choose ANCOVA over GSA when comparing groups.

One advantage of ANCOVA over GSA is increased power. Originally, ANCOVA was not developed to "control" for anything but to enhance the power of tests on independent variables (Miller and Chapman, 2001). For instance, assuming identical within-group variance between pretest and posttest, Van Breukelen (2006) quantified that ANCOVA requires only 75% of the sample size of ANOVA of gain score (i.e., one form of GSA) to detect the same effect when the pretest-posttest correlation is 0.50. The other advantage of ANCOVA over GSA has to do with effect estimate accuracy. Specifically, ANCOVA produces an unbiased effect estimate, whereas GSA can generate an under- or overestimated effect size depending on the direction of baseline group imbalance (Vickers and Altman, 2001).
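The 75% figure can be checked with a short calculation. The sketch below assumes, as in the passage above, equal variance σ² at pretest and posttest and a common pretest-posttest correlation ρ. Under those assumptions the variance of a gain score is 2σ²(1 − ρ) and the residual variance after adjusting for the pretest is σ²(1 − ρ²), so the sample size ratio is (1 + ρ)/2. The function name is hypothetical and chosen for illustration only.

```python
def ancova_to_gain_sample_ratio(rho: float) -> float:
    """Approximate sample size ANCOVA needs relative to ANOVA of gain scores.

    Assumes equal pretest and posttest variance (sigma^2) and a common
    pretest-posttest correlation rho in both groups.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    gain_variance = 2.0 * (1.0 - rho)       # Var(post - pre) / sigma^2
    ancova_variance = 1.0 - rho ** 2        # residual Var(post | pre) / sigma^2
    return ancova_variance / gain_variance  # simplifies to (1 + rho) / 2


if __name__ == "__main__":
    for rho in (0.2, 0.5, 0.8):
        print(f"rho = {rho:.1f}: ratio = {ancova_to_gain_sample_ratio(rho):.2f}")
```

At ρ = 0.50 the ratio is 0.75, matching the value quoted from Van Breukelen (2006); the advantage shrinks as ρ approaches 1.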

Baseline group imbalance is the descriptive difference between groups on baseline measures. If an exercise-cognition RCT has only two groups (i.e., one control and one exercise group), the control group and the exercise group have an equal chance to perform better than the other descriptively on a cognitive task at baseline. The interpretation of "better" is task specific. For instance, a shorter reaction time (RT) is better in simple reaction time tasks (e.g., Stroop Color), whereas a larger value is better in time-limited memory tasks (e.g., Digit Symbol). If the control group has baseline superiority (control-BS) by having, for instance, a shorter RT than that of the exercise group on the Stroop Color task, the adoption of GSA will lead to an overestimate of exercise's benefits on cognition. Conversely, baseline exercise group superiority (exercise-BS) will generate an underestimated effect with the GSA method (Vickers and Altman, 2001).
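The direction of this bias can be made concrete with a small calculation. The sketch assumes no true exercise effect, equal pretest and posttest variance, and a pretest-posttest correlation ρ below 1. Under these assumptions a baseline difference d (exercise minus control) is expected to shrink to ρ·d at posttest, so the gain-score contrast has expected value (ρ − 1)·d, whereas the baseline-adjusted contrast has expected value zero. The function names and the numbers in the example are hypothetical.

```python
def expected_gain_score_contrast(baseline_diff: float, rho: float,
                                 true_effect: float = 0.0) -> float:
    """Expected (exercise - control) difference in gain scores.

    baseline_diff: observed pretest mean difference, exercise minus control.
    rho: pretest-posttest correlation (equal variances assumed).
    """
    expected_post_diff = true_effect + rho * baseline_diff
    return expected_post_diff - baseline_diff  # = true_effect + (rho - 1) * d


def expected_ancova_contrast(baseline_diff: float, rho: float,
                             true_effect: float = 0.0) -> float:
    """Expected baseline-adjusted posttest difference (slope equals rho here)."""
    expected_post_diff = true_effect + rho * baseline_diff
    return expected_post_diff - rho * baseline_diff  # = true_effect


if __name__ == "__main__":
    # Hypothetical reaction-time task (lower is better): the exercise group
    # starts 20 ms slower, i.e., the control group has baseline superiority.
    d, rho = 20.0, 0.5
    print(expected_gain_score_contrast(d, rho))  # -10.0
    print(expected_ancova_contrast(d, rho))      # 0.0
```

In this hypothetical case the gain-score contrast of −10 ms would be read as an exercise benefit even though the true effect is zero; reversing the sign of d (exercise-BS) produces an apparent disadvantage of the same size.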

Baseline measures are usually negatively correlated with gain scores (Cronbach and Furby, 1970; Knapp and Schafer, 2009), a phenomenon known as "regression to the mean" (Galton, 1886; Bland and Altman, 1994). In such instances, the bias due to GSA's failure to account for baseline group imbalance can be larger. As a consequence, Type I errors (i.e., false positives) from control-BS and Type II errors (i.e., false negatives) from exercise-BS are likely to occur when using GSA. For example, Bland and Altman (2011) reported that comparing baseline with follow-up separately in each group using a t-test (i.e., one form of GSA) could raise the actual alpha level to as high as 0.50 when comparing two groups and 0.75 when comparing three groups, depending on the power of a specific test. To make things worse, Bland and Altman's results were based on one outcome measure. When an exercise-cognition RCT assesses the effect of exercise on multiple cognitive measures (which is often the case), a presumed false positive threshold (e.g., α = 0.05) could become meaningless.
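The 0.50 and 0.75 figures follow from a simple probability argument, sketched here under assumptions made for illustration: each group's within-group test is independent and reaches significance with the same probability (its power), and a "group difference" is declared whenever the groups disagree in significance status. The second function shows how quickly a nominal α erodes across several outcome measures if the tests were independent, which measures within one cognitive battery generally are not. Both function names are hypothetical.

```python
def prob_discordant_significance(power: float, n_groups: int) -> float:
    """P(the groups do not all agree on significant vs. non-significant).

    Assumes independent within-group tests with identical power.
    """
    all_significant = power ** n_groups
    none_significant = (1.0 - power) ** n_groups
    return 1.0 - all_significant - none_significant


def familywise_error(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    print(prob_discordant_significance(0.5, 2))  # 0.5
    print(prob_discordant_significance(0.5, 3))  # 0.75
    print(round(familywise_error(0.05, 10), 3))
```

The discordance probability peaks when power is 0.5, which is why the inflation depends on the power of the specific test.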

## How to Test for Possible Errors

<sup>1</sup> In this paper, the key distinction between ANCOVA and GSA is how researchers use the baseline measure. Although researchers can choose variables (e.g., age) as covariates in testing group difference on gain scores, these analyses are not what we mean by ANCOVA here.

Rather than assessing the effect of exercise on cognition by considering potential moderators, a procedure common to meta-analytic studies, the focus of the present study was to determine whether exercise-cognition RCTs published in the past 20 years (1996–2015) involve false positives or false negatives due to GSA application in pretest-posttest data analysis. We provided a simple test to achieve this goal. Because group assignment was random, one would expect an equal chance for control-BS and exercise-BS on a certain cognitive measure. In other words, across all RCTs in our review, we expect half of the RCTs to show control-BS and the other half to show exercise-BS. In terms of a probability distribution, if we assume that X represents the number of RCTs showing control-BS, we would expect the probability of observing X, P(X), to follow a binomial distribution:

$$\mathcal{P}(X) \sim \text{Binomial}(n, k)$$

where n represents the total number of RCTs examined and k symbolizes the expected probability (k = 0.5) of getting control-BS in a given exercise-cognition RCT<sup>2</sup>. Similarly, if researchers select randomly between GSA and ANCOVA, we should expect the group comparison strategy to follow the same binomial distribution, the only difference being that X represents the number of RCTs employing GSA.

In order to detect possible false positive and/or false negative errors among exercise-cognition RCTs using GSA, we must check for independence between baseline group imbalance (i.e., control-BS vs. exercise-BS) and statistical significance test result (i.e., significant vs. non-significant). If baseline group imbalance were independent of the statistical significance test result, we would expect X, representing the number of RCTs using GSA that showed control-BS, to continue following the binomial distribution when conditioned on the statistical test result. Assuming that Y stands for the statistical test result that has two possible outcomes (i.e., significant or non-significant), we will have the following conditional binomial distribution:

$$\mathcal{P}(X|Y) \sim \text{Binomial}(n|Y,k).$$

where n is the total number of RCTs using GSA method and k still takes the value of 0.5.

To summarize, we had three hypotheses in the present study. First, we hypothesized that, among all the RCTs, half of them should demonstrate control-BS and the other half should show exercise-BS due to randomization. Second, we hypothesized that researchers, as a group, selected between GSA and ANCOVA without preference, and therefore half of the RCTs should employ GSA and the other half should use ANCOVA as a group-comparison strategy. Lastly, we hypothesized that, when GSA-RCTs are counted separately based on whether they are positive (i.e., include at least one significant finding) or negative (i.e., include no significant findings), more control-BS (than exercise-BS) GSA-RCTs should be found in positive GSA-RCTs, whereas more exercise-BS (than control-BS) GSA-RCTs should be found in negative GSA-RCTs.

## METHODS

## Literature Search and Inclusion Criteria

The second author (J.-C. L.) conducted a literature search in April and May 2015 using SPORTDiscus, Web of Science, and Google Scholar databases. The search strategy utilized the following key words within full documents: (exercise OR physical activity) AND (cognition OR cognitive performance) AND randomized controlled trial. A manual search of reference lists from key studies (e.g., meta-analyses) was also performed. The first author (S. L.) screened studies by title and abstract, then by full documentation. Trial authors were contacted when required information was missing. In total, 38 RCTs were considered for coding. However, five articles were excluded because they were missing information and corresponding authors were unable to respond to our request by July 1, 2015. The final set of studies consisted of 33 exercise-cognition RCTs.

<sup>2</sup> We chose k instead of p to avoid confusion later when reporting the probability of our hypothesis testing.

The following inclusion criteria were applied to the exercise-cognition RCTs: (a) studies were published between January 1996 and May 2015, (b) randomization was evident at the individual level, (c) the design included pre- and post-intervention measures on cognitive tasks such as perception, intelligence, academic achievement, memory, executive function, and cognitive impairment, (d) the exercise intervention focused on aerobic training, resistance training, or a combination of both, (e) studies included a passive control (e.g., waiting list), an active control (that can have a cognitive, physical, or social focus), or a combination of both (see Scherder et al., 2005), and (f) group differences were tested on cognitive measures. If multiple exercise intensities were used within an RCT, we regarded the group receiving the highest intensity as the exercise group and compared it to the control group. For example, if an RCT had two exercise groups (e.g., participants exercising at 60 and 70% of their VO2max) and a reading control group, the group exercising at 70% VO2max was selected as the treatment group and was compared to the control group. In addition, if the two exercise groups differed in exercise modality (i.e., aerobic training and resistance training), we compared each of these exercise groups to the control group, respectively, and the results were coded under a given RCT. Furthermore, if multiple interventions were included and at least one of the groups received an intervention focusing on elements other than exercise (e.g., cognitive training), only the exercise group was considered as a treatment group and was compared to the control group. Finally, if multiple follow-up measurements were available after the intervention period, we chose the immediate post-intervention measurement as the post-test measure. Details of the literature search and study selection are shown in a flowchart (**Figure 1**).

## Coding and Reliability

The first two authors discussed and settled the coding variables to be included in the coding sheet. One author (S. L.) independently coded all the studies. The coded variables focused on the information relevant to the focus of the study, which is to check potential Type I and Type II errors in exercise-cognition RCTs. Therefore, for every cognitive task, we coded the targeted cognitive process (e.g., executive functioning), baseline group imbalance (control-BS vs. exercise-BS), and statistical test result (significant vs. non-significant). Other key methodological information was also coded, including (a) group-comparison strategy in pretest-posttest data analysis (ANCOVA vs. GSA), (b) the form of control (passive vs. active), (c) the presence or absence of randomization procedure, (d) testing baseline group equivalence on cognitive measure(s), (e) the use of blinding procedures (i.e., single-, double-, or triple-blind), (f) explicit inclusion of intention-to-treat (ITT) analysis, (g) presence of a priori power analysis, (h) total participant number and number of groups (enabling participant number per group to be calculated), and (i) the presence or absence of pre-registering the trial. **Table 1** displays the coded information for each study included.

Eleven articles (33.3% of total) were randomly selected and separately coded to produce inter-coder reliability. A research assistant blinded to the study purposes completed the coding. Inter-rater reliability was calculated using Cohen's Kappa coefficient for each coding variable (**Table 2**). Following Landis and Koch's (1977) recommendations, we considered Kappa values between 0.61 and 0.80 as substantial and above 0.80 as very good. All the coded variables in the present study showed very good reliability. Coding discrepancies were resolved by re-visiting studies and discussion.
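For readers unfamiliar with the statistic, Cohen's Kappa compares the observed agreement between two coders with the agreement expected by chance from their marginal category frequencies. The following is a minimal sketch; the function name and the example codes are hypothetical and are not the review's actual coding data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical codes for 11 studies (not the review's actual data).
    coder_1 = ["GSA"] * 8 + ["ANCOVA"] * 3
    coder_2 = ["GSA"] * 7 + ["ANCOVA"] * 4
    print(round(cohen_kappa(coder_1, coder_2), 2))  # 0.79
```

In this hypothetical example the coders disagree on one of 11 studies, which gives a Kappa of about 0.79, just below the 0.80 threshold used above.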

## RCT Count and Statistical Analysis

We categorized and counted all the RCTs regarding their group-comparison strategy and baseline group imbalance. For group-comparison strategy, we categorized a given RCT into GSA-RCT if it used gain scores as the criterion in comparing groups. We classified an RCT as ANCOVA-RCT if the outcome variable was the post-test score while controlling for baseline score as covariate, or if analysis of partial variance was used.

Although we coded baseline group imbalance for every cognitive task within an RCT, we later counted the number of RCTs regarding their baseline group imbalance favorableness (control-BS vs. exercise-BS). This ensured an equal weight for every RCT given their varying number of cognitive measures. For example, one RCT reported 42 cognitive measures but several RCTs reported only one cognitive measure. In this case, the 42-task RCT would be over-weighted if the count were made at the task level. We applied the "dominance rule" in judging whether a given RCT favors control-BS or exercise-BS. For example, if an RCT used four cognitive measures, we coded it as favoring control-BS if three of the four measures showed a better-performing control group at baseline. Due to within-study measurement dependence, multiple cognitive measures tended to show homogeneous results with respect to baseline group imbalance. Among 33 RCTs, we applied the dominance rule to 14 RCTs. Two RCTs showed an equal number of cognitive measures between control-BS and exercise-BS, and thus were dropped from the final count on baseline group imbalance.
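The dominance rule amounts to a majority vote over an RCT's cognitive measures. A minimal sketch follows; the codes "C", "E", and "=" and the function name are hypothetical labels introduced here for illustration, not the coding sheet's actual values.

```python
from collections import Counter
from typing import Optional, Sequence


def dominant_imbalance(measures: Sequence[str]) -> Optional[str]:
    """Classify one RCT from its per-measure baseline codes.

    Each code is "C" (control-BS), "E" (exercise-BS), or anything else
    (e.g., "=" for descriptively equal baselines, which is ignored).
    Returns "C" or "E" for the majority, or None when the counts tie.
    """
    counts = Counter(code for code in measures if code in ("C", "E"))
    if counts["C"] > counts["E"]:
        return "C"
    if counts["E"] > counts["C"]:
        return "E"
    return None  # no clear favorableness; dropped from the count


if __name__ == "__main__":
    print(dominant_imbalance(["C", "C", "C", "E"]))  # C
    print(dominant_imbalance(["C", "E"]))            # None
```

Counting at the study level in this way gives an RCT with 42 measures the same weight as an RCT with one.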

TABLE 1 | Study coding sequenced by group comparison strategy and study positivity.

*Year, Year of publication; Grp (T/C), Baseline group imbalance (total count/conditional count); Sig., Study positivity (at least one significant test result identified by corresponding RCT); Anal., Group comparison strategy in pretest-posttest data analysis; Control, Form of control group; Random, Described random allocation procedures; Test Base, Tested baseline group equivalence on cognitive measures; Blind, Blinding procedures reported; ITT, Explicitly mentioned following intention-to-treat principle; Power, Performed a priori power analysis; N (Grp.), Total sample size (number of groups); Prereg., Pre-registered the trial. Liu-Ambrose et al. (2012) reported data dependence with Liu-Ambrose et al. (2010); E, Exercise-BS; C, Control-BS; Y, Yes; N, No; GSA, Gain score analysis; ANCOVA, Analysis of covariance; A-Cog., Active control with a cognitive focus; A-Phy., Active control with a physical focus; A-Soc., Active control with a social focus; A-Mix, Active control with more than one focus (e.g., cognitive and social); P, Passive control; Both, A control group consisting of both actively and passively controlled participants; Single, Single blinding procedure (i.e., cognitive task assessors).*

We also made a "conditional count" among GSA-RCTs. First, all the RCTs were screened for GSA employment. Then, GSA-RCTs were categorized as either positive (i.e., having at least one significant finding) or negative (i.e., having no significant findings). The "conditional count" process was very similar to the previous count except that an RCT's baseline group imbalance was decided only on those cognitive measures fitting the positive/negative category. Specifically, if a GSA-RCT had at least one significant result (i.e., positive study), its baseline group imbalance was determined on all significant cognitive measures. If a GSA-RCT had no significant results (i.e., negative study), all its cognitive measures were included to determine its baseline group imbalance. These decisions were made for two reasons. First, some positive RCTs employed only one cognitive task (which reached statistical significance). Second, we could bias the negative RCT count regarding baseline group imbalance if we retained the non-significant measures from positive RCTs and recycled them in the negative RCT count.

During the "conditional count," we applied the dominance rule to only one GSA-RCT because it included one cognitive measure supporting control-BS and one cognitive measure with description-wise equal baseline between the control and exercise group; and thus it was counted as control-BS. In addition, one positive GSA-RCT reported a control-BS on one cognitive measure and exercise-BS on the other cognitive measure. This RCT was subsequently classified as neutral and was dropped from the final conditional count. We used R version 3.2.0 (R Core Team, 2015) to estimate the probability of obtaining those counts based on continuity-corrected binomial distributions. Whereas the first two hypotheses had two-sided tests, the third hypothesis had a one-sided test. The alpha level was set at 0.05.
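The conditional count described above can be written as a short procedure. The sketch below is an illustration under assumed data structures: each cognitive measure is a hypothetical pair of a baseline code ("C", "E", or "=" for a descriptively equal baseline) and a flag for statistical significance. It is not the coding sheet or script used in the review.

```python
from typing import List, Optional, Tuple

Measure = Tuple[str, bool]  # (baseline code "C"/"E"/"=", significant?)


def conditional_imbalance(measures: List[Measure]) -> Tuple[str, Optional[str]]:
    """Return (study positivity, baseline imbalance) for one GSA-RCT."""
    significant = [code for code, sig in measures if sig]
    positive = bool(significant)
    # Positive studies are judged on their significant measures only;
    # negative studies are judged on all of their measures.
    pool = significant if positive else [code for code, _ in measures]
    n_control, n_exercise = pool.count("C"), pool.count("E")
    if n_control > n_exercise:
        imbalance = "C"
    elif n_exercise > n_control:
        imbalance = "E"
    else:
        imbalance = None  # neutral; dropped from the conditional count
    return ("positive" if positive else "negative", imbalance)


if __name__ == "__main__":
    print(conditional_imbalance([("C", True), ("E", False), ("E", False)]))
    print(conditional_imbalance([("C", True), ("E", True)]))
```

The first hypothetical study is classified as positive with control-BS because only its significant measure is considered; the second is positive but neutral and would be dropped.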

## RESULTS

**Table 3** summarizes results pertaining to the first two hypotheses. The first hypothesis assumed that the occurrence of control-BS and exercise-BS are equally likely. Among all the RCTs (n = 31), we observed that 16 RCTs resulted in a control-BS and 15 RCTs in an exercise-BS (two RCTs were dropped in the count because they showed no clear favorableness between control-BS and exercise-BS). The probability of detecting this result met our expectation, $\hat{k} = 0.52$, $p = 0.99$, with a 95% CI of (0.33, 0.69). The second hypothesis assumed that the incidence of GSA and ANCOVA as a group comparison strategy are equal among RCTs. The count revealed 27 GSA-RCTs and 6 ANCOVA-RCTs. The test of such occurrence reached significance, $\hat{k} = 0.82$, $p < 0.001$, with a 95% CI of (0.64, 0.92). Therefore, we rejected the second hypothesis and concluded that researchers predominantly used GSA over ANCOVA in analyzing pretest-posttest data.
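The two-sided intervals above can be approximated with a Wilson score interval with continuity correction. This is an assumption about how such intervals may be obtained, not a record of the exact R call used; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci_continuity(successes: int, n: int, level: float = 0.95):
    """Two-sided Wilson score interval with continuity correction."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    p = successes / n
    q = 1.0 - p
    denom = 2.0 * (n + z ** 2)
    lower = (2 * n * p + z ** 2 - 1
             - z * sqrt(z ** 2 - 2 - 1 / n + 4 * p * (n * q + 1))) / denom
    upper = (2 * n * p + z ** 2 + 1
             + z * sqrt(z ** 2 + 2 - 1 / n + 4 * p * (n * q - 1))) / denom
    # The bounds are set to 0 and 1 at the extremes.
    lower = 0.0 if successes == 0 else max(0.0, lower)
    upper = 1.0 if successes == n else min(1.0, upper)
    return lower, upper


if __name__ == "__main__":
    for x, n in ((16, 31), (27, 33)):
        lo, hi = wilson_ci_continuity(x, n)
        print(f"{x}/{n}: ({lo:.2f}, {hi:.2f})")
```

Under this assumption, 16 of 31 gives roughly (0.33, 0.69) and 27 of 33 gives roughly (0.64, 0.92), in line with the intervals reported above.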

TABLE 2 | Kappa coefficients for coding variables.

TABLE 3 | The probability of observed RCT counts regarding baseline group imbalance and group comparison strategy.

*Group, Baseline group imbalance; Control, Control-BS; Exercise, Exercise-BS; Strategy, Group-comparison strategy used in pretest-posttest data analysis; GSA, Gain score analysis; ANCOVA, Analysis of covariance.*

*Positive, GSA-RCTs identifying at least one significant finding; Negative, GSA-RCTs identifying no significant findings; Control, Control-BS; Exercise, Exercise-BS.*

**Table 4** displays results for the third hypothesis, which tested independence between baseline group imbalance and statistical significance test result among GSA-RCTs. Among positive GSA-RCTs (n = 17), 14 resulted in a control-BS and three in exercise-BS. This pattern reached significance, $\hat{k} = 0.82$, $p = 0.006$, with a 95% CI of (0.60, 1.00). Among the negative GSA-RCTs (n = 9), two studies had a control-BS and seven had exercise-BS. This observation was not significant, $\hat{k} = 0.22$, $p = 0.09$, with a 95% CI of (0.00, 0.55). Thus, baseline group imbalance was related to statistical test result in that more control-BS GSA-RCTs (which had over-estimated effect sizes) than exercise-BS GSA-RCTs resulted in significant results.
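The one-sided probabilities for the third hypothesis can be illustrated with exact binomial tail sums under k = 0.5. Note that the Methods describe continuity-corrected binomial distributions; using exact tails here is an assumption of this sketch, and the function name is hypothetical.

```python
from math import comb


def binomial_tail(successes: int, n: int, k: float = 0.5,
                  alternative: str = "greater") -> float:
    """Exact one-sided binomial p-value for X ~ Binomial(n, k)."""
    def pmf(x: int) -> float:
        return comb(n, x) * k ** x * (1.0 - k) ** (n - x)

    if alternative == "greater":   # P(X >= successes)
        return sum(pmf(x) for x in range(successes, n + 1))
    if alternative == "less":      # P(X <= successes)
        return sum(pmf(x) for x in range(0, successes + 1))
    raise ValueError("alternative must be 'greater' or 'less'")


if __name__ == "__main__":
    # Counts of control-BS studies taken from the text above.
    print(round(binomial_tail(14, 17, alternative="greater"), 3))  # 0.006
    print(round(binomial_tail(2, 9, alternative="less"), 2))       # 0.09
```

With 14 control-BS studies among 17 positive GSA-RCTs the exact upper tail is 834/131072, and with 2 among 9 negative GSA-RCTs the exact lower tail is 46/512; after rounding, both agree with the p values reported above.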

## DISCUSSION

The objective of the present study was to determine whether exercise-cognition RCTs published in the past 20 years (1996–2015) include false positives or false negatives due to neglect of Lord's paradox (i.e., performing GSA in analyzing pretest-posttest data). Several findings emerged from this study. First, baseline group superiority was found to be randomly determined among all the RCTs, with an equal probability of control-BS and exercise-BS. Second, GSA was the more popular group comparison strategy (27 RCTs) compared to ANCOVA (6 RCTs). Lastly, evidence suggested that positive GSA-RCTs were likely to include false positive errors because 82% (14 out of 17 studies) of them were based on over-estimated effect sizes. However, no clear evidence supported false negative errors among negative GSA-RCTs, although a descriptive consistency was revealed.

Given findings that GSA is prevalent and misleading, it is necessary to re-emphasize the adoption of ANCOVA in pretest-posttest data analysis. The employment of ANCOVA could eliminate the biased effect estimate due to baseline group imbalance and increase testing power, thus reducing inferential errors. However, choosing ANCOVA as group comparison strategy is only half the story because ANCOVA enhances causal inferences only when group equivalence is likely. The other half, baseline group equivalence, depends on multiple factors during the experimental process. Some important factors are discussed next.

## Randomization Procedures

One factor influencing group equivalence is the randomization procedure. According to Schulz (1996), randomization consists of two stages: generation of an unpredictable assignment sequence and concealment of that sequence until group allocation occurs. The first stage is related to the reliability of the randomizing tool (e.g., computer algorithm), and is often mistakenly identified as randomization itself. Consequently, sequence concealment often receives insufficient attention, which introduces bias that emerges from the predictability of participant allocation. Ideally, the information on participant allocation should be revealed "as late as possible." As an example, Newell (1992) reported an anecdotal story of a surgeon who tossed a sterilized coin after a patient's abdomen was opened to decide which "treatment" he should perform. Although a little extreme, the story highlights the importance of concealing participants' allocation information from experimenters. **Table 1** shows that only 7 out of 33 RCTs described randomization tools, and even fewer RCTs described sequence-concealment procedures. On a couple of occasions, the randomization was done with an imbalanced assignment ratio (e.g., 2:1 in assigning participants to exercise and control group, respectively) and no justifications were offered. Therefore, researchers are encouraged to report the randomization tool and to describe procedures for concealing the randomization sequence. In cases of imbalanced group assignment ratios, justifications are required.
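As an illustration of the first stage only (sequence generation), the sketch below produces a 1:1 allocation sequence by shuffling a balanced list. It is one simple possibility, not a procedure taken from any of the reviewed trials, and the function name, group labels, and seed handling are hypothetical. Concealment, the second stage, is a matter of who can see the sequence and when, and cannot be demonstrated in code.

```python
import random
from typing import List, Optional


def allocation_sequence(n_participants: int, seed: Optional[int] = None) -> List[str]:
    """Generate a 1:1 allocation sequence by shuffling a balanced list."""
    if n_participants % 2:
        raise ValueError("a 1:1 ratio needs an even number of participants")
    rng = random.Random(seed)  # seeded only so the sequence can be audited later
    sequence = ["exercise", "control"] * (n_participants // 2)
    rng.shuffle(sequence)
    return sequence


if __name__ == "__main__":
    sequence = allocation_sequence(20, seed=2015)
    print(sequence.count("exercise"), sequence.count("control"))  # 10 10
```

Reporting the tool and the assignment ratio in this explicit form makes it straightforward for readers to verify that a 1:1 ratio was intended.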

## Baseline Check

Prior to intervention, researchers must examine group equivalence on baseline measures. To foster such an examination, the CONSORT (Consolidated Standards of Reporting Trials) statement (Schulz et al., 2010) suggests reporting baseline demographic and clinical characteristics for each group. Considering the CONSORT statement and the difficulty of conducting double-blind trials in the exercise-cognition area, we recommend that researchers examine baseline group equivalence using both significance tests and subjective judgments. Baseline significance tests can alert researchers to factors interfering with randomization (e.g., no double-blinding); even when no significant group differences are identified at baseline, researchers must still review descriptive group imbalance in terms of its size and prognostic strength (Altman, 1985). If meaningful group differences are found on any of the baseline measures (regardless of test significance), researchers could take different approaches to solving the problem, depending on how many baseline measures showed group differences. For instance, researchers can block participants when only a few baseline measures (i.e., one or two) showed group differences in the baseline check, or can re-randomize participants when more baseline variables exhibited group differences (Rubin, 2008).
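One way to express the size of a descriptive baseline imbalance, independently of its statistical significance, is a standardized mean difference. The sketch below is a minimal illustration under that assumption; the text does not prescribe this index or any cutoff, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(group_a, group_b):
    """Baseline mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Made-up baseline scores for two small groups.
smd = standardized_mean_difference([1, 2, 3], [2, 3, 4])
```

The prognostic strength of the imbalanced variable still has to be judged substantively, as Altman (1985) points out.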

## Single-Blinding and Differential Expectation

The blinding procedure also affects group equivalence. When participants are assigned to either an exercise or a control group, it is challenging (if not impossible) to blind them to their respective interventions. In the present review, 18 out of the 33 RCTs reported blinding procedures and all of them were "single-blinded" (i.e., cognitive task assessors were blinded to participants' group assignment). No RCTs reported blinding participants to their group assignments. This raises the concern that participants may show differential expectations due to open group assignment. Such a possibility is consistent with the idea of an "unmatched task" for the control group in the literature dealing with the effect of exercise on cognition (Brisswalter et al., 2002). The concern of differential expectation is also evidenced by the diversity of control conditions in **Table 1**. This diversity reveals little agreement among researchers in specifying an active control for exercise intervention. To help select and/or design a good control, we recommend an empirical solution: researchers should measure differential expectation. Although preliminary efforts have been made to survey differential group expectations prior to intervention (e.g., Stothart et al., 2014), we echo Boot et al. (2013) in suggesting that future research consider testing differential expectation during or after the intervention period as well. The optimal active control for an exercise intervention must equate expectations across all these periods.

## Intention-to-Treat Principle

Intention-to-Treat (ITT) is a widely accepted principle in analyzing clinical trials. ITT prevents group non-equivalence due to participant dropout (e.g., differential attrition) by including all the randomized participants in data analysis based on their intended treatment assignment (Gillings and Koch, 1991). The ideal situation for ITT would be having complete data for all the randomized participants (Hollis and Campbell, 1999). However, attrition is typically inevitable for clinical trials. In order to include participants with incomplete data into the analysis, missing values need to be handled. Some missing value imputation methods are available. For example, methods based on multiple imputation or maximum likelihood are generally recommended, but special considerations must be given to specific situations (Enders, 2010). However, no statistical methods can perfectly fix experimental flaws. When applying ITT, it is necessary to develop protocols (e.g., excluding likely exercise-intolerant participants before randomization) to ensure that participant adherence rate is roughly 80% or higher (Gillings and Koch, 1991; Montori and Guyatt, 2001). Regardless of adherence rate for a given RCT, a sensitivity test should always be performed to compare the ITT analysis results (as primary outcome) with the complete-case analysis results (Gillings and Koch, 1991). Compatible result of the sensitivity test precludes the concern of differential attrition, whereas incompatibility suggests this threat to internal validity. In short, future investigations are advised to include protocols that maximize adherence rate, to follow ITT principle, and to perform sensitivity analysis. Two other important elements of clinical trials are discussed next, although they do not affect group equivalence directly.

## Power

Although no clear evidence of false negative errors was observed in the present study, it is still important to ensure that each RCT has sufficient power so that false negative errors are minimized. Among the RCTs included, only eight of 33 reported performing an a priori power analysis. Depending on the input parameters, the sample sizes varied among these RCTs. However, the average group size among the RCTs with an a priori power analysis was about 65 participants, whereas the average group size for those not performing an a priori power analysis was about 32 participants<sup>3</sup>. It seems that a substantial proportion of exercise-cognition RCTs were underpowered, and thus could lead to false negative errors. It might be argued that 23 out of 33 included RCTs had at least one significant result, and thus false negative errors should not be a concern. However, 23 out of 33 RCTs having at least one positive result is not evidence of sufficient power. First, we showed that false positive errors are likely to be included in those 17 positive GSA-RCTs, and by extension in the 23 positive RCTs. Second, as highlighted by Rubin (1974), a poorly implemented experiment can contain many errors and ultimately be irrelevant to testing the research question. An experiment should follow optimal procedures (including a priori power analysis) for its conclusions to appropriately address research questions.
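To give a sense of how an a priori power analysis translates into group sizes, the sketch below uses the standard normal approximation for a two-sided comparison of two independent group means. It is an assumption-laden illustration, not the procedure used by any of the reviewed RCTs: the effect size, alpha, and power values are placeholders, the function name is hypothetical, and dedicated power software based on the t distribution (or on an ANCOVA model) will give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-group mean comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile corresponding to the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

medium = n_per_group(0.5)  # 63 per group under this approximation
large = n_per_group(0.8)   # 25 per group under this approximation
```

Under these placeholder inputs, a medium standardized effect requires roughly twice the group size of about 32 observed among the RCTs without a power analysis.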

## Researcher Degrees of Freedom and Trial Pre-registration

Although researchers follow the best paradigm, including a fixed set of practices, they still make decisions in many circumstances. These decision points are regarded as researcher degrees of freedom (Simmons et al., 2011). They include, among others, the types of measure used in data collection, the group-comparison strategies employed for data analysis, and the type of data reported. When researcher degrees of freedom combine with publication bias, an increased likelihood of Type I error follows. For example, Gelman and Loken (2013) argued that data analysis strategies could be unwittingly conditioned on data patterns, which allows for false positive findings. To restrict researcher degrees of freedom by increasing clinical trial transparency, the International Committee of Medical Journal Editors (ICMJE) declared in 2004 that trial pre-registration would be a condition for publishing in its 11 member journals (De Angelis et al., 2004). ICMJE only recognizes registries meeting several criteria, including being free to public access, electronically searchable, open to all registrants, run by a not-for-profit organization, and able to ensure the validity of registration data by offering a mechanism to do so. For example, www.clinicaltrials.gov, maintained by the U.S. National Institutes of Health, is a qualified registry, although many other registries have become available since 2004 (Humphreys et al., 2013). It is by revealing critical trial information before participant enrollment that trial pre-registration combats researcher degrees of freedom. After pre-registering a trial, researchers can still make changes as long as they offer good justifications. Although pre-registration has been the rule in clinical trial publication for almost 10 years (Laine et al., 2007), this is not the case among exercise-cognition RCTs because only 8 out of 27 studies published in 2005 and later had trial pre-registration (**Table 1**). Therefore, we recommend that future exercise-cognition RCTs follow ICMJE's guidelines and pre-register trials before enrolling participants.

## Limitations

Several limitations in the present study are worth pointing out. First, we only focused on group comparison strategies in analyzing pretest-posttest data in exercise-cognition RCTs because it generates good evidence to evaluate the claim that exercise benefits cognition, and it is a design shared by all the exercise-cognition RCTs. Second, although ANCOVA should be used in analyzing pretest-posttest data in RCTs given group equivalence, it should be noted that ANCOVA was developed under several statistical assumptions, among which the assumption of homogeneity of regression slopes should receive particular attention (Miller and Chapman, 2001). However, these assumptions should not be used as an excuse to choose GSA against ANCOVA because GSA shares the same set of assumptions and because of ANCOVA's robustness and flexibility under assumption violation (Huck and McLean, 1975). Lastly, the counting process may have introduced bias in our conclusions, especially for the conditional count. We made the counts at trial level rather than at task level, and thus applied the "dominance rule" in order to maintain equal weight among exercise-cognition RCTs. Even though a better approach may be possible, evidence supported our decision. For example, we applied the "dominance rule" only to a minority of collected RCTs and the marginal count met the exact expectation from a probability point of view. Among the 33 RCTs, only two RCTs switched the group regarding baseline superiority between the marginal count and the conditional count.

## CONCLUSION

Although exercise-cognition RCTs showed randomness of baseline group imbalance, RCTs adopting GSA as group comparison strategy were likely to have false positive errors and thus weakened the overall exercise-benefit-cognition claim. Future research will benefit from employing ANCOVA in analyzing pretest-posttest data while maintaining baseline group equivalence. Several suggestions have been offered to maintain baseline group equivalence in future research. It is likely that the results of current study are not limited to the effect of exercise on cognition and could potentially be extended to RCTs in other domains.

## AUTHOR CONTRIBUTIONS

Conceived and designed the study: SL, JL. Searched publications: JL. Screened publications, coded data, and analyzed results: SL. Calculated inter-rater reliability: JL. Contributed to the writing of this manuscript: SL, JL, GT.

## ACKNOWLEDGMENTS

The authors would like to thank Dr. Yu-Kai Chang and Dr. Walter R. Boot for their review of the initial draft of this paper.

<sup>3</sup>This information was calculated based on the "N (Grp.)" column of **Table 1**.

## REFERENCES


adults at risk for Alzheimer disease: a randomized trial. JAMA 300, 1027–1037. doi: 10.1001/jama.300.9.1027


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Liu, Lebeau and Tenenbaum. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

<sup>∗</sup>References marked with an asterisk indicate studies included in **Table 1**.

# Incremental Validity and Informant Effect from a Multi-Method Perspective: Assessing Relations between Parental Acceptance and Children's Behavioral Problems

Eva Izquierdo-Sotorrío<sup>1</sup> \*, Francisco P. Holgado-Tello1,2 and Miguel Á. Carrasco<sup>1</sup>

<sup>1</sup> Department of Personality, Assessment and Psychological Treatments, Faculty of Psychology, National University of Distance Education, Madrid, Spain, <sup>2</sup> Department of Behavioral Science Methodology, Faculty of Psychology, National University of Distance Education, Madrid, Spain

#### Edited by:

Jason C. Immekus, University of Louisville, USA

## Reviewed by:

Christian Wandeler, California State University, Fresno, USA Lourdes Ezpeleta, Universitat Autònoma de Barcelona, Spain

\*Correspondence:

Eva Izquierdo-Sotorrío eva.izq@cop.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 13 November 2015 Accepted: 21 April 2016 Published: 10 May 2016

#### Citation:

Izquierdo-Sotorrío E, Holgado-Tello FP and Carrasco MÁ (2016) Incremental Validity and Informant Effect from a Multi-Method Perspective: Assessing Relations between Parental Acceptance and Children's Behavioral Problems. Front. Psychol. 7:664. doi: 10.3389/fpsyg.2016.00664

This study examines the relationships between perceived parental acceptance and children's behavioral problems (externalizing and internalizing) from a multi-informant perspective. Using mothers, fathers, and children as sources of information, we explore the informant effect and incremental validity. The sample was composed of 681 participants (227 children, 227 fathers, and 227 mothers). Children's (40% boys) ages ranged from 9 to 17 years (M = 12.52, SD = 1.81). Parents and children completed both the Parental Acceptance Rejection/Control Questionnaire (PARQ/Control) and the check list of the Achenbach System of Empirically Based Assessment (ASEBA). Statistical analyses were based on the correlated uniqueness multitrait-multimethod matrix (MTMM) model by structural equations and different hierarchical regression analyses. Results showed a significant informant effect and a different incremental validity related to which combination of sources was considered. A multi-informant perspective rather than a single one increased the predictive value. Our results suggest that mother–father or child–father combinations seem to be the best way to optimize the multi-informant method in order to predict children's behavioral problems based on perceived parental acceptance.

Keywords: incremental validity, multiple informants, parental acceptance-rejection, behavioral problems, children, hierarchical regression, structural equations models, informant effect

## INTRODUCTION

The progress of psychology is inextricably linked to the development of new and more refined methods and strategies for measuring psychological concepts, models, and intervention programs (Eid and Diener, 2006). A multi-informant approach offers insights into scientific phenomena and can contribute to confirming psychological theories in a way that a single-informant approach cannot. Due to the complexity of the constructs evaluated and the developmental factors involved in children's psychological adjustment, their assessment is mainly multimodal (e.g., rating scales, interviews, and observations), multi-informant (e.g., child, parents, teachers, and peers), and/or multi-trait (Eyde et al., 1993; Ollendick and Hersen, 1993; Mash and Terdal, 1997; Duhig et al., 2000; Meyer et al., 2001; Johnston and Murray, 2003; Achenbach, 2006; Hunsley and Mash, 2007). Specifically for informant assessment, the most reliable source of information on a target's psychological characteristics is not to be found in his or her self-ratings, nor is it guaranteed by single informant ratings; rather, it is found in the combination of the judgments from the community of the target's knowledgeable informants. Accordingly, multi-informant assessment is widely accepted by the psychological assessment community as an adequate and useful procedure, since a single measure is rarely sufficient to provide all the information needed to form an accurate judgment (Meyer and Archer, 2001; Garb, 2003; De Los Reyes and Kazdin, 2004; Carrasco et al., 2008; Hughes and Gullone, 2010). However, informant effects represent bias that can derive from the use of the same source of information in the assessment of different traits, the knowledge of informants, the observability of assessed traits, the judgment of informants, or social desirability, among other factors (Cheng and Furnham, 2004; Neyer, 2006). For these reasons, determining the extent to which an informant effect is affecting the assessment of constructs and their relations is an important goal in determining the real construct validity. Individual reports often yield inconsistent data and discrepancies that can create considerable uncertainties in designing interventions and drawing conclusions from research (Klein, 1991; Epkins, 1993; Jané et al., 2000; De Los Reyes and Kazdin, 2004, 2005, 2006; Achenbach, 2006; Goodman et al., 2010; De Los Reyes et al., 2015). For instance, associations between constructs tend to be largest: (a) when a single informant is used, because of shared method variance (Neyer, 2006); (b) when the assessment of interventions has a large effect on parent reports vs. observed child behaviors of children's externalizing problems (Tarver et al., 2014); or (c) when family members experience their interaction differently and therefore have dissimilar views on parenting and parent-child relations (e.g., Lanz et al., 2001; Hoeve et al., 2009). A key reason for these uncertainties is the near-exclusive focus in mental health research on whether informant discrepancies reflect measurement error or reporting biases (e.g., Richters, 1992; De Los Reyes, 2011). Consequently, what remains unclear is whether a multi-informant approach to assessment validly captures contextual variations displayed in children's behavioral problems or whether it instead reflects different perceptions or beliefs about what a symptom is, and, finally, which informants ought to be included in assessments of children and adolescents.

Regarding this last point, another important issue from a multi-informant approach is the differential contribution of a particular source of information in relation to others. That is, the incremental validity or degree to which adding a new informant to the assessment consistently increases the predictive power and decision making (Garb, 2003; Hunsley, 2003; Hunsley and Mash, 2005). Unfortunately, the incremental validity inherent in using and combining multiple assessment methods has not undergone wide empirical testing in the literature on either adult or child assessment (Mash and Terdal, 1997; Hunsley, 2002). Thus, strong psychometric properties of the individual measures are necessary but do not provide sufficient conditions to ensure the incremental validity of incorporating these measures into the assessment process. Furthermore, not only is the research that deals directly with incremental validity in child assessment relatively small, the incremental validity of mothers' vs. fathers' reports has seldom been tested (Johnston and Murray, 2003).

With regard to cross-informant use, some studies support the incremental value of adults' over children's information when externalizing problems are measured (Loeber et al., 1991; Carrasco et al., 2008). However, the use of adults' information in children's assessment does not always augment the value of using only one source of information (Biederman et al., 1990). On the other hand, for older children, when assessing internalizing problems or covert behaviors, there is some evidence for the incremental value of youth self-reports over parents reports (Langhinrichsen et al., 1990; Cantwell et al., 1997; Johnston and Murray, 2003).

One of the most consistent observations in the field of child assessment is the correspondence levels between informants' reports, which range from low to moderate in magnitude (Achenbach et al., 1987; Duhig et al., 2000; Achenbach, 2011; Markon et al., 2011; De Los Reyes et al., 2015). The evidence usually shows that pairs of informants who observed children in the same context (e.g., pairs of parents or pairs of teachers) tend to show greater levels of correspondence than pairs of informants who observed children in different contexts (e.g., parent and teacher). Accordingly, some studies have found that the cross-informant agreement was moderate to high between mother and father, and moderate to low between father–child and mother–child pairs (Grigorenko et al., 2010; Weitkamp et al., 2013). Correspondence between mothers and children tends to be higher than correspondence between fathers and children (Grigorenko et al., 2010) and mother–child reports tend to find a greater endorsement than father–child reports (Lapouse and Monk, 1958; Achenbach et al., 1987; Stanger and Lewis, 1993; De Los Reyes et al., 2015). Also, the confluence of informants' reports about children's externalizing problems (e.g., aggression and hyperactivity concerns) tends to be higher than that concerning internalizing problems (e.g., anxiety and depression). In this regard, maternal and paternal reports show moderate correspondence when rating internalizing behavior problems in children and a larger correspondence in ratings of externalizing behavior problems in children (Achenbach et al., 1987; Duhig et al., 2000; Grigorenko et al., 2010). This evidence may reflect the greater correspondence between reports of directly observable behaviors than internalized behaviors. There is also evidence supporting claims that the degree of acquaintance between parents and children is a factor that leads to different parental ratings (Hughes and Gullone, 2010). The variability of correspondence found between the different pairs of informants probably reflects both the potential informant effect and the differential contribution of each source of information to the assessment's target. Furthermore, variation in responses is assumed to reflect real differences among individual subjects: subjects are not distributed uniformly along the variable; rather, each subject's favorable or unfavorable position toward the studied object depends on his or her perception (Likert, 1932).

This study explores, from a multi-informant approach, the relations between parental acceptance and children's internalizing and externalizing problems. Perceived parental acceptance is one of the main factors involved in children's psychological adjustment, as shown by the interpersonal acceptance-rejection theory (IPARTheory; Rohner, 1986; Rohner et al., 2012). Parental rejection (the opposite of parental acceptance) implies the absence or a significant withdrawal of parental warmth, affection, care, comfort, concern, nurturance, support, or love, and the presence of a variety of physically and psychologically hurtful behaviors and effects (Rohner and Khaleque, 2005; Rohner et al., 2012). Meta-analytic studies on this subject have found that rejection has consistently negative effects on the psychological adjustment and behavioral functioning of both children and adults worldwide (Khaleque and Rohner, 2002; Rohner and Khaleque, 2005; Rohner et al., 2012). The same body of research also shows that children who perceive their parents as being rejecting tend to experience distress, and in turn develop a specific cluster of internalizing (i.e., emotional instability, depression) and externalizing (i.e., aggression, delinquency) problems (McLeod et al., 2007; Hoeve et al., 2009; Rohner and Khaleque, 2010; Khaleque and Rohner, 2012; Khaleque, 2015; Ramírez-Lucas et al., 2015). However, to our knowledge, no studies from this perspective have explored either the informant effect or the incremental validity of parents' and children's perceived parental acceptance on externalizing and internalizing behavioral problems. Accordingly, no specific results are expected and no particular hypotheses are tested. The first aim of this study is to test for evidence of informant effects related to the links between parental acceptance and children's behavioral problems as measured by children, fathers, and mothers through a round-robin design, in which all informants rate all targets. The second aim is to explore the incremental validity of the informants. Specifically, we deal with two questions: (1) Are there significant informant effects in predicting children's behavioral problems based on perceived parental acceptance? (2) What is the incremental validity of the children's perceived parental acceptance over the parents' perceived parental acceptance in predicting the children's behavioral problems?

## MATERIALS AND METHODS

## Sample

The sample was composed of 681 participants (227 children, 227 fathers, and 227 mothers). Children's (40% boys; n = 90) ages ranged from 9 to 17 years (M = 12.52, SD = 1.81): 27% (n = 61) were between 9 and 11 years, 47% (n = 107) were between 12 and 13 years, 20% (n = 46) were between 14 and 15 years, and 6% (n = 13) were between 16 and 17 years.

All of the children attended school, the majority lived in two-parent households (91%), and the mean number of siblings was three. Of the parents, 88% of fathers and 70% of mothers were employed. Occupational titles for mothers and fathers (respectively) were: major professionals (17 and 17%), lesser professionals (40 and 33%), semi-skilled workers (18 and 26%), and unskilled workers (25 and 24%). The mothers' and fathers' education levels were: university studies (40 and 35%), high school studies (40 and 57%), and primary studies (20 and 8%).

This sample is part of a larger sample from a general study about parental acceptance and children's psychological adjustment in the Spanish population. Children were selected according to mother–father–child matched participation. This sample represents 22% of the total sample (N = 1036). The total sample was randomly selected from public schools and publicly funded private schools in different cities and communities of Spain. The participation rate of the total families was 91.5%.

No significant differences were found between participant and non-participant families in the demographic variables (i.e., child's sex, age, and socioeconomic level).

## Measures

All measures were filled in by children, mothers, and fathers using the appropriate versions of the instruments described below.

#### Parental Acceptance

Four versions of the Parental Acceptance-Rejection/Control Questionnaire were used to report on perceived parental acceptance, two for children (mother and father versions, one to report about each parent) and two for parents (one version for mothers and another version for fathers). Children filled in both mother and father versions (Parental Acceptance-Rejection/Control Questionnaire, Child PARQ/Control: mother-short version for children and Child PARQ/Control: father-short version for children). Mothers filled in mother versions and fathers filled in father versions (Parental Acceptance-Rejection/Control Questionnaire, PARQ/Control: mother-short version for parents and PARQ/Control: father-short version for parents; Rohner, 1990; Rohner and Khaleque, 2005; Spanish adaptation by Del Barrio et al., 2014). The short versions of the PARQ/Control for children and for parents consist of 29 items. The PARQ/Control for children is a self-report questionnaire with four scales measuring warmth/affection [e.g., "My mother (father) says nice things about me"], hostility/aggression [e.g., "My mother (father) gets angry at me easily"], indifference/neglect [e.g., "My mother (father) pays no attention to me"], and undifferentiated rejection [e.g., "My mother (father) does not really love me"], plus a parental control (permissive-strictness) scale built into it. The PARQ/Control versions for mothers and fathers are self-reports with the same scales as the version for children; the difference from the children's version is that items ask about the mother or father her/himself (e.g., "I get angry at my son easily"). The mother and father versions of the PARQ/Control (short forms) are identical, with the exception of the title changing according to which parent is being assessed. In all versions items are scored on a 4-point Likert-type scale ranging from 4 (almost always true) through 1 (almost never true).
The sum of the first four scales (24 items) constitutes a measure of overall perceived maternal and paternal acceptance/rejection (with the entire warmth scale reverse scored). A greater score indicates a perception of greater parental rejection. Evidence regarding the validity and reliability of the PARQ/Control has been very well supported (Khaleque and Rohner, 2002; Rohner and Khaleque, 2005). Coefficient alphas for the total score in this sample are 0.88 for fathers and 0.97 for mothers in the children versions; and 0.88 for fathers and 0.88 for mothers in the parent version.
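The scoring rule described above (four scales summed over 24 items, warmth reverse scored on the 1 to 4 response format) can be sketched as follows. This is a hedged illustration, not the official scoring key: the function name is hypothetical, and the split of the 24 items into 8, 6, 6, and 4 items per scale in the example is an assumption made only to produce a runnable case.

```python
def parq_total(warmth, hostility, indifference, undifferentiated):
    """Overall acceptance-rejection score; higher values mean more perceived rejection."""
    scales = (warmth, hostility, indifference, undifferentiated)
    for scale in scales:
        for response in scale:
            if response not in (1, 2, 3, 4):
                raise ValueError("responses must be integers from 1 to 4")
    # Reverse score the warmth items so that 4 becomes 1 and 1 becomes 4.
    reversed_warmth = [5 - response for response in warmth]
    return (sum(reversed_warmth) + sum(hostility) +
            sum(indifference) + sum(undifferentiated))

# Hypothetical respondents at the two extremes (item counts per scale are assumed).
most_accepting = parq_total([4] * 8, [1] * 6, [1] * 6, [1] * 4)
most_rejecting = parq_total([1] * 8, [4] * 6, [4] * 6, [4] * 4)
```

With 24 items on a 1 to 4 format, totals necessarily fall between 24 and 96 whatever the per-scale split.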

#### Children's Behavioral Problems

Two versions from the Achenbach System of Empirically Based Assessment (ASEBA; Achenbach and Rescorla, 2007) were used to report on the children's behavioral problems: one for children (YSR) and one for parents (CBCL). Fathers and mothers reported separately on the children's behavioral problems using the CBCL version. The Youth Self-Report (YSR) is composed of two parts, the first assessing various psychosocial skills and competences, and the second consisting of a check-list of 112 items assessing a large number of behavioral problems, which are aggregated into two broad dimensions: internalizing (anxiety/depression, withdrawal, somatic complaints) and externalizing (breaking rules, aggressive behavior) problems. The items are scored on a 3-point Likert-type scale with anchors of 0 (not true), 1 (somewhat or sometimes true), and 2 (very true or often true). The Child Behavior Checklist (CBCL) is similar to the YSR, with the exception of having one more item (113, "Other problems"). For the purpose of this study, we only use the check-lists and the two broad dimensions: externalizing and internalizing behavioral problems.

For this sample, the Cronbach's alphas were 0.75 for the internalizing scale, and 0.73 for the externalizing scale on the YSR version; 0.79 and 0.78 for the internalizing scale, and 0.80 and 0.77 for the externalizing scale on the father-CBCL and mother-CBCL, respectively.
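For readers unfamiliar with how such reliability coefficients are obtained, the sketch below computes Cronbach's alpha from a respondents-by-items matrix using the usual formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The data are made up and the function name is hypothetical; the coefficients reported above come from the authors' own analysis, not from this code.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list of item responses per respondent, all of equal length."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # transpose to one tuple per item
    sum_item_variances = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)

# Made-up responses on the 0-2 format: three items that rise and fall together.
consistent = cronbach_alpha([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
# Two items that are unrelated across four respondents.
unrelated = cronbach_alpha([[0, 1], [1, 0], [1, 1], [0, 0]])
```

Perfectly consistent items give an alpha of 1, and unrelated items give an alpha of 0, which brackets the values of roughly 0.73 to 0.80 reported for this sample.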

## Procedure

Once the cluster sample of schools was selected, an authorization from the school board and an informed consent form from each child's responsible guardian were collected. Participation was voluntary. The instruments were administered collectively to each school class group in their own classrooms by research personnel trained for this task.

To explore the potential informant effect, we started with the correlated uniqueness MTMM (multitrait-multimethod matrix) model (Byrne, 1998). According to the correlated uniqueness model, if the different sources add systematic variability to the model, we should find significant correlations between the errors of the dependent variables reported by the same informant. At the same time, regardless of the global fit of the model, a significant increase in model fit should be noted. Second, we used different hierarchical regression analyses to determine the magnitude of the incremental validity.

Data were analyzed using LISREL 8.9 and SPSS version 20.0 for Windows (SPSS WIN).

## Design and Variables

A round-robin design was employed, in which fathers, mothers, and children separately completed all the instruments used. The independent variables were parental acceptance levels as perceived by children, mothers, and fathers. The dependent variables were children's externalizing and internalizing problems, reported separately by fathers, mothers, and children.

## Results

**Table 1** shows the correlation matrix among the variables used. According to the multitrait-multimethod matrix logic, if an informant effect exists, the monosource-multitrait correlations should be higher than the multisource-multitrait ones. Focusing on the dependent variables, the correlation between the internalizing and the externalizing problems reported by children (r<sub>int−ext</sub>) is 0.54 (monosource-multitrait). This value is higher than multisource-multitrait correlations such as r<sub>int−pext</sub> = 0.15, r<sub>int−mext</sub> = 0.19, r<sub>pint−pext</sub> = 0.12, or r<sub>mint−ext</sub> = 0.02. These results point to a possible informant effect. The same pattern is found for the other variables: the intra-informant correlation for the same two variables is higher than the inter-informant correlation.

To obtain more evidence about the informant effect, we tested two models. In the first one (model 1), all the PARQ measures (PARQP, PARQM, MPARQ, and PPARQ) were predictors of all the criterion variables (INT, EXT, MINT, MEXT, PINT, and PEXT; **Figure 1**). The second model (model 2) was essentially the same, but included the correlations between the errors of the criterion variables reported by each informant (children, mothers, and fathers; **Figure 2**). We established that if the second model showed significant correlations between these errors and an improved fit, it would be reasonable to infer an informant effect.

The fit indexes obtained for the first model were: χ<sup>2</sup> = 482.66, df = 21; p = 0.00; CFI = 0.57; RMSEA = 0.30; GFI = 0.35; AGFI = 0.35; GFI = 0.75; RMR = 0.15. For model 2, we obtained: χ<sup>2</sup> = 236.01, df = 18; p = 0.00; CFI = 0.79; RMSEA = 0.22; GFI = 0.86; AGFI = 0.56; RMR = 0.12.

Logically, in terms of fit indexes, neither model needs to be accepted, because we are not looking for a predictive model to explain the relationship between the variables. According to our premise, we should test whether the errors of the various criterion measures from the same informant are correlated. In this sense, model 2 improves the fit of model 1 (Δχ<sup>2</sup> = 246.65; Δdf = 3), and the correlations between the errors of the criterion variables reported by the same source of information are significant [eint\_ext = 0.43, Critical Ratio (CR) = 9.74; emint\_mext = 0.28, CR = 5.31; epint\_pext = 0.43, CR = 10.69].
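The nested-model comparison can be reproduced from the reported fit statistics. The sketch below is an illustration, not the LISREL procedure used in the study: it takes the difference between the two chi-square values and their degrees of freedom and evaluates it against a chi-square distribution with 3 degrees of freedom, using the closed-form upper-tail probability for that case. Treating each critical ratio as a z statistic with a two-tailed cutoff of 1.96 is an assumption made for the example; the helper name `chi2_sf_df3` is hypothetical.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Fit statistics reported in the text for model 1 and model 2
chi2_model1, df_model1 = 482.66, 21
chi2_model2, df_model2 = 236.01, 18

delta_chi2 = chi2_model1 - chi2_model2   # improvement in chi-square
delta_df = df_model1 - df_model2         # number of added error correlations
p_value = chi2_sf_df3(delta_chi2)        # valid only because delta_df == 3

# Critical ratios reported for the three same-informant error correlations
critical_ratios = {"eint_ext": 9.74, "emint_mext": 5.31, "epint_pext": 10.69}
# Assumed cutoff: |CR| > 1.96 (two-tailed, alpha = 0.05)
significant = {name: abs(cr) > 1.96 for name, cr in critical_ratios.items()}

print(f"delta chi2 = {delta_chi2:.2f}, delta df = {delta_df}, p = {p_value:.3g}")
print(significant)
```

The three degrees of freedom correspond to the three error correlations (one per informant) that model 2 frees relative to model 1.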

These results show a significant informant effect. As **Figure 2** shows, children and fathers are the informants that add the most variability to the model; that is, the covariance between the errors of children's internalizing and externalizing problems is higher when the problems are reported by fathers or by children than when they are reported by mothers. To quantify the magnitude of the contributions of the various informants, and their incremental validity, we conducted six hierarchical regression analyses.

The results from the hierarchical regression analyses are shown in **Table 2**. The contribution of perceived parental acceptance on behavioral problems is organized by the three informants (mothers, fathers, and children) and by the children's externalizing and internalizing problems.
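The logic of incremental validity in a hierarchical regression can be sketched as follows: predictor blocks are entered cumulatively, and the increase in R<sup>2</sup> at each step is attributed to the block just added. The example below is a minimal pure-Python illustration on synthetic data; the variable names, the order of entry, and the simulated coefficients are assumptions made for the example and do not reproduce the study's steps or results.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of an OLS regression of y on the predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_steps(blocks, y):
    """Enter predictor blocks cumulatively; return (name, R2, delta R2) per step."""
    results, entered, previous = [], [], 0.0
    for name, block in blocks:
        entered = entered + block
        r2 = r_squared(entered, y)
        results.append((name, r2, r2 - previous))
        previous = r2
    return results


# Synthetic, hypothetical data: acceptance scores from three informants
rng = random.Random(0)
n = 200
father = [rng.gauss(0, 1) for _ in range(n)]
mother = [rng.gauss(0, 1) for _ in range(n)]
child = [rng.gauss(0, 1) for _ in range(n)]
problems = [0.4 * f + 0.3 * m + 0.2 * c + rng.gauss(0, 1)
            for f, m, c in zip(father, mother, child)]

steps = hierarchical_steps(
    [("father report", [father]), ("mother report", [mother]), ("child report", [child])],
    problems,
)
for name, r2, delta in steps:
    print(f"{name}: R2 = {r2:.3f}, delta R2 = {delta:.3f}")
```

Because the models are nested, R<sup>2</sup> cannot decrease from one step to the next, and the increments sum to the R<sup>2</sup> of the full model; the size of each increment depends on the order in which the blocks are entered.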

#### TABLE 1 | Correlation matrix.



Ext. Prob., Externalizing problems; Int. Prob., Internalizing problems; Pac, paternal acceptance; Mac, maternal acceptance. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

When the informant reporting on the child's behavioral problems is the mother, the maternal acceptance reported by mothers shows the largest increment in R<sup>2</sup>, especially for externalizing problems. Paternal acceptance reported by fathers made a significant contribution to externalizing problems (not internalizing), and maternal acceptance reported by mothers made a significant contribution to both internalizing and externalizing behavioral problems. Parental acceptance (maternal or paternal) perceived by children does not make any significant contribution to behavioral problems. Paternal acceptance reported by fathers and maternal acceptance reported by mothers, considered together, explain 19% of the variance in externalizing problems.

When the informant reporting on the child's behavioral problems is the father, the same pattern was found, except for externalizing problems in step 4, where children make a significant contribution. Parental acceptance reported by fathers, mothers, and children, considered together, explains 40% of the variance in externalizing problems.

Finally, when the informant reporting on the child's behavioral problems is the child, the largest increase occurs in step 4, when children report on parental acceptance. Nevertheless, both paternal and maternal acceptance were significant predictors of externalizing problems (not internalizing problems), while only paternal acceptance was significant for internalizing problems. The increase in the variance explained by the parental acceptance perceived by children is 13% for externalizing problems and 4% for internalizing problems. Parental acceptance reported by fathers and children (the significant sources of information), considered together, explains 11% of the variance in externalizing problems and 14% in internalizing problems.

## DISCUSSION

Method effects and incremental validity are two important issues for construct validity. The analysis of empirical similarities and differences between self and others as informants contributes to the knowledge of the consistency of measures, their reliability


In relation to the first question, our findings confirm a significant informant effect: the predictive values differ from one informant to another when predicting behavioral problems in children from perceived parental acceptance. Consequently, the magnitude of the predictive relations between parental acceptance and children's externalizing and internalizing problems depends on the source of information used (i.e., children, mothers, or fathers). When the informant on the child's behavioral problems is the mother, maternal acceptance perceived by mothers and paternal acceptance perceived by fathers are the best predictors of children's externalizing problems, while the best predictor of internalizing problems is maternal acceptance reported by mothers alone. The information provided by children about parental acceptance does not make any contribution to the behavioral problems reported by mothers. The same pattern emerges when the informant on the child's behavioral problems is the father, except that children make a significant contribution to predicting externalizing problems (not internalizing). However, when children act as informants on their own behavioral problems, the pattern is completely different: maternal acceptance as assessed by mothers does not make any contribution to the children's behavioral problems. Only paternal acceptance reported by fathers or children predicts the externalizing and internalizing problems; additionally, maternal acceptance reported by children predicts internalizing (not externalizing) problems.

The significant predictive value of perceived parental acceptance for children's psychological adjustment is very well supported in family research (Khaleque and Rohner, 2012; Rohner et al., 2012), but no studies have been conducted to explore the informant effect of parental acceptance on children's behavioral problems. Our results support this significant relation regardless of the source of information. Furthermore, our findings are consistent with previous studies that have found an informant effect reflected in the low or moderate agreement between children and parents on the information given by each of them (Achenbach et al., 1987; Rescorla et al., 2013; De Los Reyes et al., 2015). There are numerous possible reasons for these results, such as the potentially biased perception of informants (e.g., parents tending to perceive and report fewer or more problems than children), the information that informants use to rate the scales (e.g., from the family or school context), conceptions of what constitutes abnormal behavior (Richters, 1992), the informants' own emotional state (Chilcoat and Breslau, 1997; Najman et al., 2000; Berg-Nielsen et al., 2003), the closeness of parent–child relationships (Hughes and Gullone, 2010), or the observability of behaviors (De Los Reyes and Kazdin, 2005).


In line with previous studies (Stanger et al., 1992; Duhig et al., 2000), our results support the different predictive utility that a multiaxial assessment approach may have for children's outcomes, specifically in predicting children's externalizing and internalizing behavioral problems from the parental acceptance construct. In this regard, when parents report on the children's behavioral problems, both fathers (paternal acceptance) and mothers (maternal acceptance) tend to be the best informants for predicting externalizing problems, while mothers (maternal acceptance) excel at predicting internalizing ones. However, when children report on their own behavioral problems, children (paternal acceptance for externalizing and internalizing problems, and maternal acceptance for internalizing ones) and fathers (paternal acceptance) tend to be the best informants for predicting all kinds of children's behavioral problems.

Research does not yet allow us to conclude to what extent maternal or paternal acceptance makes a higher or lower contribution to children's psychological problems. Some studies suggest that maternal parenting is more strongly associated with children's emotional and behavioral problems than paternal parenting (Rosnati et al., 2007; Meunier et al., 2012), while other studies find the opposite (Flouri and Buchanan, 2002; Khaleque and Rohner, 2011). These differences in contribution probably stem from the externalizing or internalizing nature of the behavioral problems, as well as from the informant effect. Accordingly, the greater contribution of maternal acceptance to the children's problems could be explained by the closeness of the mother–child relationship and by the fact that mothers tend to have more knowledge about the children's behavioral problems (mainly about the internalizing ones), possibly because mothers generally spend more time with their children than fathers (Renk et al., 2003; De Los Reyes and Kazdin, 2005), or because mothers could be perceived by their offspring to have higher interpersonal power and prestige than fathers (Carrasco et al., 2014). Paternal acceptance may become more relevant to externalizing than to internalizing problems because of the nature of father–child relationships, which tend to be more focused on leisure activities (Torres et al., 2014) and goal-oriented behaviors (Leaper et al., 1998; Tenenbaum and Leaper, 2003). The informant effect that our study shows is consistent with the studies that found a higher contribution of paternal than maternal acceptance when the informants are children (Flouri and Buchanan, 2002; Bosco et al., 2003; Khaleque and Rohner, 2011) or teachers (Mattanah, 2001). Maternal parenting tends to be a stronger predictor of children's behavioral problems when parents are the source of information (Gryczkowski et al., 2010), but this is not always confirmed (Hoeve et al., 2009).

Regarding the second question, concerning how incremental validity is affected by the source of information on the children's behavioral problems, our results suggest that the sources of information contribute differentially and that incremental validity depends on which combination of sources is considered. More specifically, when the informant on the child's behavioral problems is the mother, both the father's and the mother's information about parental acceptance increases the predictive validity for externalizing problems, but only the mother's information (maternal acceptance) does so for internalizing problems. However, when the informant on the child's behavioral problems is the father, mothers, fathers, and children all increase the predictive validity for externalizing problems. Nevertheless, only the mother's information about maternal acceptance has significant predictive value for internalizing problems. Finally, when the informant on the child's behavioral problems is the child, mothers, children, and fathers increase the predictive validity for externalizing problems, but only fathers (not mothers) and children do so for internalizing problems. It is important to highlight that mothers have the highest incremental validity when parents (mothers or fathers) report on children's problems, but that children make the largest contribution to incremental validity when they self-report on their own behavioral problems. These results support the children's ability to be introspective and to assess their own thoughts and feelings even better than adults (Bidaut-Russell et al., 1995; Johnston and Murray, 2003). These results are also consistent with the studies that support the incremental value of adult informants compared with the child's reports on externalizing problems (Loeber et al., 1990, 1991).

Furthermore, our results indicate that single informants (parents or children) produced significantly stronger effects than multiple informants (parents and children). That is, when the same informant provides information about parental acceptance (predictor) and the children's outcomes (dependent variable), this single informant tends to reach the highest incremental validity. This is probably due to shared method variance (Campbell and Fiske, 1959). This effect may be particularly prominent when children are the source of information. Although asking children to report on both parenting and their own behavioral problems can lead to inflated effect size estimates, children could provide the best information about themselves and the perceived parent–child relationships. The higher incremental validity of mothers on children's internalizing problems is consistent with the higher predictive value of maternal acceptance on internalizing behaviors, as previously discussed.

When fathers are the source of information, the rest of the informants (children and mothers) add significant incremental validity. This could be because fathers sometimes have less knowledge of children's day-to-day lives, meaning that more information is needed from mothers and children to predict children's behavioral problems. However, when children are the source of information, the incremental validity is mainly added by fathers. This may be because the information from mothers and children overlaps, as they would share more information about the emotional lives of the children. This is consistent with the higher agreement between mothers and children than between fathers and children (Schneewind and Ruppert, 2013; Leung and Shek, 2014). The closer mother–child relationship can account for a higher concurrence in the information provided by these informants; therefore, the parent with the closer relationship will give largely redundant information when it is added to that given by the child. In cultures like that of Spain, where gender and parental roles are still quite differentiated, it is common for mothers to spend more time than fathers with the children, which could be a reason why the mother does not add significant information when the child is used as the primary informant. Similarly, when the mother is the primary informant, the child does not add significant information.

Considering all these results as a whole, it can be concluded that the child is the best source of information about parental acceptance when we are trying to predict the children's behavioral problems (both externalizing and internalizing) as reported by the children themselves. However, when the behavioral problems are reported by the parents, the parental acceptance information provided by the parents has the better predictive value for children's externalizing problems. This changes for children's internalizing problems reported by the parents, in which case the mother's information is the most predictive.

A few limitations should be considered for future lines of research. First, this study focused on the general population instead of a clinical sample, meaning that generalization of the current findings to clinical populations should be made with caution, and future research should consider how these two samples may differ both quantitatively and qualitatively. Second, the lack of analysis of sex and age as moderators may be particularly relevant (Crick and Grotpeter, 1995; Johnston and Murray, 2003; Hughes and Gullone, 2010) in terms of informant effect and incremental validity. Sex and age differences in the perception of parental acceptance and in the expression of internalizing or externalizing symptoms may lead to variations in informant agreement and in the relationships between parental acceptance and children's symptoms. Third, parents' social desirability could lead them to minimize their reports of any adverse parenting experiences (e.g., rejection), affecting the level of parent–child agreement. Fourth, different methods of evaluation such as observations, rating scales, and self-reports should be explored in addition to the informant method. Future studies conducted from a developmental and gender perspective with a multi-measure approach and using clinical samples are advised in order to shed more light on the informant effect and incremental validity.

Despite the above limitations, the findings of the present study have important practical implications. Given the preceding analyses, a multi-informant rather than a single-informant perspective should be considered in order to increase the predictive value and the incremental validity when predicting children's internalizing and externalizing problems. Our results suggest that mother–father or child–father informant pairs are the best combinations of sources of information for predicting children's behavioral problems from parental acceptance. Nevertheless, a child may give enough information to make future decisions, and if only one informant is to be added to the assessment, it should be the father. There is a clear need for more research from a multi-method perspective in the child assessment field, rather than blind faith in a "more are better" approach to recruiting informants (Johnston and Murray, 2003); such research will lead to an optimization of empirically based child assessment (Carrasco et al., 2008).

## AUTHOR CONTRIBUTIONS

The tasks of each individual author are described in the following lines. EI-S: Bibliographic review, preparation of data matrices, drafting the theoretical contents, drafting the discussion, writing and preparing the manuscript for submission. FH-T: Collection of data, statistical analysis, drafting the methodology and results. MC: Collection of data, statistical analysis, drafting the methodology and results, theoretical contents review, team coordination.

## ACKNOWLEDGMENTS

This research is part of Project PSI2011-28925 and is supported by grants from the Spanish Government, Ministerio de Ciencia e Innovación.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Izquierdo-Sotorrío, Holgado-Tello and Carrasco. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.