ORIGINAL RESEARCH article

Front. Educ., 08 January 2020
Sec. Assessment, Testing and Applied Measurement
Volume 4 - 2019 | https://doi.org/10.3389/feduc.2019.00153

Developing a Short Form of the Self-Assessment Practices Scale: Psychometric Evidence

  • Department of Curriculum and Instruction, The Education University of Hong Kong, Tai Po, Hong Kong

This research aimed to develop a short form of the Self-assessment Practices Scale (SaPS). Guided by a process model of self-assessment, the SaPS was designed to assess the actions students engage in during the self-assessment process. The data used for developing the original 20-item SaPS (SaPS-20), i.e., responses from 1,416 Hong Kong students ranging from Primary 4 to Secondary 3, were reanalyzed, and a 12-item short form (SaPS-SF) was developed. Factor analysis and Rasch analysis were applied in complementary ways to examine the psychometric properties of the SaPS-SF. The results showed that the factor structure of the original scale held in the SaPS-SF, and that all items fitted the Rasch model sufficiently and measured the constructs as theorized. The findings presented in this study facilitate the measurement of self-assessment practice in a parsimonious and effective way.

Introduction

Self-assessment is a fundamental skill required at each phase of self-regulated learning (Yan, 2019) and is crucial for life-long learning (Boud, 1995; Tan, 2012). By self-assessing their own performance, students can identify their strengths and weaknesses and adjust their learning strategies accordingly to improve their learning (Boud, 1995; Yan and Brown, 2017). Recent review studies (e.g., Brown and Harris, 2013; Panadero et al., 2017) revealed a general consensus in the literature regarding the positive impact of self-assessment on academic achievement, self-regulation, and motivational aspects of learning (e.g., self-efficacy), although the effect sizes varied across studies.

Despite the important role of self-assessment in education, the understanding of the exact nature of "standard self-assessment" varies in the literature (Panadero et al., 2016). In many educational studies, self-assessment is reduced to a mere self-rating or self-grading exercise with little cognitive reflection involved. However, self-assessment appears to be a far more complex activity in real learning contexts. Panadero et al. (2016) argued that "student self-assessment most generally involves a wide variety of mechanisms and techniques through which students describe (i.e., assess) and possibly assign merit or worth to (i.e., evaluate) the qualities of their own learning processes and products." Yan (2016, 2018) summarized conceptualizations of self-assessment into three categories: (1) self-assessment is treated as a personal ability/trait that enables an accurate evaluation of one's own performance; (2) self-assessment is used as a supplementary assessment method; and (3) self-assessment is regarded as a learning strategy or process aimed at enhancing learning effectiveness.

From a pedagogical perspective, it makes more sense to conceptualize self-assessment as a learning strategy that supports student learning, given the long-standing concerns about the accuracy of self-assessment for summative purposes (Brown et al., 2015; Yan and Brown, 2017). Yan and Brown (2017) conceptualized self-assessment as "a process during which students collect information about their own performance, evaluate and reflect on the quality of their learning process and outcomes according to selected criteria, to identify their own strengths and weaknesses" (p. 2). Accordingly, they proposed a "cyclical self-assessment process" that covers three sequenced actions: determining the performance criteria, self-directed feedback seeking, and self-reflection (see Figure 1). The first step of student self-assessment is to determine the assessment criteria that are to be applied in the subsequent actions. The second step is to seek feedback on the quality of one's own performance from external and/or internal sources. External feedback comes either from explicit learning processes (e.g., reviewing past test papers or doing extra exercises) or from inquiry with people (e.g., teachers, peers). Internal feedback comes from internally generated reactions (e.g., internal states, physical sensations, and emotions) to one's own performance. However, neither external nor internal feedback in itself necessarily leads to a meaningful self-assessment judgment without the third step, i.e., reflection. In the third step, the task is to reflect on the quality of the process and product of learning with the support of feedback and to arrive at an initial self-assessment judgment. This judgment can be continuously calibrated against different assessment criteria or new sources of feedback.

Figure 1. Theoretical framework of the cyclical self-assessment process (Yan and Brown, 2017).

Building on the Yan and Brown (2017) model, Yan (2018) developed a Self-assessment Practices Scale (SaPS) that contains 20 items (hereafter SaPS-20) assessing four self-assessment actions, namely, seeking external feedback through monitoring (SEFM), seeking external feedback through inquiry (SEFI), seeking internal feedback (SIF), and self-reflection (SR).

This Study

The SaPS-20 is not a long questionnaire in itself. However, in the many situations where the SaPS is likely to be used in conjunction with other instruments, a shorter version would be preferred to reduce respondent load as far as possible. Survey administration is more efficient and less disruptive if a questionnaire can obtain quality psychometric information using fewer items (Meriac et al., 2013). Moreover, the number of items differs across the four subscales of the SaPS-20. It might be beneficial to have a balanced weighting among the subscales, as there is no convincing justification for uneven weightings among the different self-assessment actions in the Yan and Brown (2017) process model. This study reanalyzed the responses of 1,416 students to the SaPS-20 (Yan, 2018) with the aim of developing a short form of the SaPS (hereafter SaPS-SF) and investigating its psychometric properties. The SaPS-SF was expected to be a more parsimonious measure with a balanced number of items within each subscale.

Method

Participants

The SaPS-20 had been administered to a convenience sample of 1,416 Hong Kong students from 18 primary schools and 11 secondary schools (49.6% female, n = 703). Participating students ranged from Primary 4 (P4) to Secondary 3 (S3) and were aged approximately 9 to 14 years (P4 = 185, P5 = 211, P6 = 232, S1 = 254, S2 = 237, S3 = 297).

Measures

The SaPS-20 was developed based on the Yan and Brown (2017) cyclical model of the self-assessment process. The scale contains four subscales that assess four actions students engage in during the self-assessment process: seeking external feedback through monitoring (SEFM; 5 items), seeking external feedback through inquiry (SEFI; 4 items), seeking internal feedback (SIF; 4 items), and self-reflection (SR; 7 items). A six-point Likert-type response scale ranging from Strongly Disagree (1) to Strongly Agree (6) was used. Yan (2018) reported satisfactory reliability for the SaPS-20: the Cronbach's α values for the four subscales were 0.85 (SEFM), 0.84 (SEFI), 0.79 (SIF), and 0.90 (SR), and the corresponding Rasch reliabilities were 0.88, 0.88, 0.80, and 0.90.

Data Analysis

To provide complementary information about the psychometric properties of the SaPS-SF, both confirmatory factor analysis (CFA) and Rasch analysis (Rasch, 1960) were employed. This approach has been used in many empirical studies (e.g., Deneen et al., 2013; Hart et al., 2013; Primi et al., 2014; Yan, 2016; West et al., 2018; Testa et al., 2019) for the benefit of providing comprehensive scrutiny of the psychometric qualities of instruments.

Since the data have a hierarchical structure, i.e., students are nested within schools, a reasonable concern is whether multilevel modeling is necessary. Maas and Hox (2005) suggested that multilevel modeling is preferred if the design effect is >2 and the number of groups is large. In this case, fourteen items had a design effect lower than 2 and six items had a design effect between 2 and 3. Since the majority of items had a low design effect and the number of schools (29) was relatively small, single-level analyses were adopted in this study.
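For illustration only, the following is a minimal sketch (not the author's code) of how a design effect of the form DEFF = 1 + (average cluster size − 1) × ICC could be computed per item, assuming a pandas DataFrame with a hypothetical "school" column and one column per item; the ICC here is a simple one-way ANOVA estimate.

```python
import pandas as pd

def design_effect(scores: pd.Series, clusters: pd.Series) -> float:
    """DEFF = 1 + (average cluster size - 1) * ICC, with a one-way ANOVA ICC."""
    d = pd.DataFrame({"y": scores, "g": clusters}).dropna()
    groups = d.groupby("g")["y"]
    n_bar = groups.size().mean()                       # average cluster size
    grand_mean = d["y"].mean()
    ms_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum() / (groups.ngroups - 1)
    ms_within = (groups.var(ddof=1) * (groups.size() - 1)).sum() / (len(d) - groups.ngroups)
    icc = max((ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within), 0.0)
    return 1 + (n_bar - 1) * icc

# Usage with hypothetical variable names: flag items whose design effect exceeds 2
# (Maas and Hox, 2005).
# deffs = {item: design_effect(df[item], df["school"]) for item in item_columns}
# flagged = [item for item, deff in deffs.items() if deff > 2]
```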

For the selection of items for inclusion in the SaPS-SF, four criteria were considered. The items retained should (1) represent important content in terms of self-assessment practice; (2) have the largest structure coefficients within each of the four subscales; (3) have good fit to the Rasch model; and (4) cover as wide a difficulty range as possible along the latent trait scale.

The psychometric properties of the resultant SaPS-SF were then subjected to the scrutiny of CFA, followed by Rasch analysis. CFA was conducted using AMOS 24.0 (Arbuckle, 2015) to examine the global model-data fit. Multiple fit indices were checked, including the comparative fit index (CFI), the goodness-of-fit index (GFI), the standardized root mean square residual (SRMR), and the root mean square error of approximation (RMSEA). As a general rule, values of GFI and CFI over 0.90, values of RMSEA lower than 0.08 (McDonald and Ho, 2002), and values of SRMR lower than 0.08 (Hu and Bentler, 1999) indicate an acceptable model-data fit.

Rasch analysis was applied following CFA to further check the psychometric properties of the SaPS-SF. In Rasch analysis, the ordinal rating scale is transformed into a continuous interval scale, which enables subsequent parametric analysis. For the purpose of examining the psychometric quality of an instrument, Rasch analysis checks the degree to which items in a scale reflect an underlying unidimensional latent construct. Rasch analysis adopts a "data fit the model" approach that requires the empirical data to satisfy a priori requirements essential for achieving fundamental measurement (Bond and Fox, 2015). As self-assessment practice was classified into four different but inter-related actions, a multidimensional Rasch-based model (Adams et al., 1997) was fitted to these data using ConQuest 2.0 (Wu et al., 2007). The indicators used for checking the scale quality included response category functioning and item fit statistics (i.e., Infit MNSQ and Outfit MNSQ). As suggested by Wilson (2005), Infit/Outfit MNSQs in the range between 0.75 and 1.33 indicate sufficient fit to the Rasch model.
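As a minimal illustration (not the ConQuest implementation), the fit statistics referred to above can be sketched as follows, assuming numpy arrays X (observed responses), E (Rasch-model expected scores), and W (model variances), each of shape persons × items.

```python
import numpy as np

def item_fit(X: np.ndarray, E: np.ndarray, W: np.ndarray):
    """Return (infit, outfit) mean-square statistics per item."""
    z2 = (X - E) ** 2 / W                                # squared standardized residuals
    outfit = z2.mean(axis=0)                             # unweighted mean square
    infit = ((X - E) ** 2).sum(axis=0) / W.sum(axis=0)   # information-weighted mean square
    return infit, outfit

# Items would be flagged when a statistic falls outside Wilson's (2005) 0.75-1.33 range.
# infit, outfit = item_fit(X, E, W)
# misfit = (infit < 0.75) | (infit > 1.33) | (outfit < 0.75) | (outfit > 1.33)
```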

In addition, internal consistency (i.e., Cronbach's α estimates) and Rasch reliabilities for each subscale were computed.
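For reference, a minimal sketch of the Cronbach's α computation is shown below; the DataFrame and column names are hypothetical, and the Rasch (EAP/PV) reliabilities were obtained from ConQuest rather than from code like this.

```python
import pandas as pd

def cronbach_alpha(subscale: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    items = subscale.dropna()
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Usage (hypothetical item names): cronbach_alpha(df[["sefm1", "sefm2", "sefm3"]])
```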

Results

CFA with maximum likelihood (ML) estimation was applied. The skewness and kurtosis values were computed to check the normality of each item. The skewness indices ranged from −1.02 to −0.39 and the kurtosis indices from −0.55 to 1.17, indicating approximately normal distributions (Kline, 2015). Yan's (2018) study compared alternative models (e.g., the higher-order factor model and the first-order factor model) and concluded that the higher-order model was a better choice because (1) it had better fit statistics, and (2) it was in line with the theoretical model specified by Yan and Brown (2017). Hence, this study adopted the higher-order factor model. The composite reliabilities for the four factors were 0.86 for SEFM, 0.84 for SEFI, 0.80 for SIF, and 0.90 for SR. In Table 1, the items in the SaPS-20 are ranked according to their standardized CFA factor loadings within each of the four subscales. Rasch item difficulties with associated standard errors and item fit statistics are also provided for each item. To produce a more parsimonious scale and, at the same time, to maintain adequate coverage of content, the target was set at a 12-item scale (rather than 20 items) with 3 items in each of the four subscales.
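A minimal sketch of this kind of item-level normality screening is given below, using scipy; the DataFrame df and the list item_columns are assumed placeholders rather than the author's actual variable names.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

def normality_screen(df: pd.DataFrame, item_columns: list) -> pd.DataFrame:
    """Return skewness and (Fisher) kurtosis per item; 0 indicates normality."""
    rows = []
    for item in item_columns:
        x = df[item].dropna()
        rows.append({"item": item, "skewness": skew(x), "kurtosis": kurtosis(x)})
    return pd.DataFrame(rows)

# Values in the ranges reported above (skewness -1.02 to -0.39, kurtosis -0.55 to 1.17)
# are well within commonly cited thresholds for approximate normality.
```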

Table 1. Psychometric indicators for the SaPS-20 from CFA and Rasch analysis.

Item selection was guided by the four criteria described in the Method section. In subscale SEFM, items #2 and #1 were retained as they had the largest standardized factor loadings (0.78 and 0.77). However, item #3 was preferred over item #4, despite item #4's marginally higher factor loading (0.75 vs. 0.73). This was because item #3 had a higher item difficulty (0.16 logits) than item #4 (−0.05 logits), whose difficulty was similar to that of item #1 (−0.04 logits). Including item #3 (rather than #4) helped to cover a wider range of the underlying latent trait (0.28 vs. 0.08 logits).

In the SEFI subscale, items #9 and #8, with the largest standardized factor loadings (0.77 and 0.76), were included, but item #6 (0.74) was retained in preference to item #7 (0.75) because teachers (item #6) are more likely to be an important source of feedback on students' performance than family members (item #7). Furthermore, Yan (2018) reported that item #7 demonstrated differential item functioning across year levels: students from different year levels interpreted this item differently, as it was unexpectedly more difficult for older students to endorse.

Items #12, #13, and #10 were retained for subscale SIF according to the four criteria. They had the largest standardized factor loadings, good fit to the Rasch model, and covered an appropriate range of difficulty.

For subscale SR, items #18 and #17 were kept. However, item #16 (0.79) was excluded in favor of item #19. Three considerations contributed to this decision: item #19 represented an essential aspect of self-reflection based on assessment results; its difficulty of −0.43 logits helped to cover a wider range of the latent trait; and its standardized factor loading (0.70) was deemed adequate.

Confirmatory Factor Analysis

The 12 items included in the SaPS-SF were then subjected to a CFA with maximum likelihood (ML) estimation. The specified model (Model 1) was identical to the higher-order factor model tested in Yan's (2018) study and in line with the Yan and Brown (2017) theoretical specification. In this model, the four actions of self-assessment formed a hierarchical structure: SEFM and SEFI belonged to a second-order factor, i.e., seeking external feedback (SEF); SEF and SIF contributed to a higher-order factor, namely seeking feedback (SF); and SF and SR were at the same level and constituted self-assessment (see Figure 2).

Figure 2. Model 1 for the SaPS-SF with standardized factor loadings.
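As an illustration of this hierarchical structure, the sketch below expresses Model 1 in lavaan-style syntax as accepted by, e.g., the Python package semopy; the item names (sefm1 … sr3) and the data file are hypothetical placeholders, and the analyses reported in this article were run in AMOS 24.0, not with this code.

```python
import pandas as pd
from semopy import Model

# Measurement part: each first-order factor has three items. Higher-order part:
# SEF (seeking external feedback) is formed by SEFM and SEFI, SF (seeking feedback)
# by SEF and SIF, and overall self-assessment (SA) by SF and SR.
MODEL_1 = """
SEFM =~ sefm1 + sefm2 + sefm3
SEFI =~ sefi1 + sefi2 + sefi3
SIF =~ sif1 + sif2 + sif3
SR =~ sr1 + sr2 + sr3
SEF =~ SEFM + SEFI
SF =~ SEF + SIF
SA =~ SF + SR
"""

# df = pd.read_csv("saps_sf_responses.csv")   # assumed item-level data file
# model = Model(MODEL_1)
# model.fit(df)
# print(model.inspect())                      # parameter estimates, including loadings
```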

It was found that the loading of SEF on SF was 0.94, with a 95% confidence interval ranging from 0.88 to 1.01, indicating that this loading did not deviate significantly from unity. This suggests that seeking external feedback (SEF) might be redundant as a separate factor. Hence, a revised model (Model 2) was tested, in which SEF was removed and SEFM, SEFI, and SIF contributed directly to SF (see Figure 3).

Figure 3. Model 2 for the SaPS-SF with standardized factor loadings.

The results in Table 2 showed that the SaPS-SF (both Model 1 and Model 2) had satisfactory fit statistics that were slightly better than those of the SaPS-20. The standardized factor loadings of the items in the SaPS-SF are presented in Figure 3 and Table 3. The factor loadings ranged from 0.70 to 0.82 for SEFM, 0.72 to 0.79 for SEFI, 0.64 to 0.81 for SIF, and 0.70 to 0.83 for SR.

Table 2. CFA goodness-of-fit indices for the SaPS-20 and SaPS-SF.

Table 3. Psychometric indicators for the SaPS-SF from CFA and Rasch analysis.

Rasch Analysis

Student responses to the 12 items selected for the SaPS-SF were also subjected to a multidimensional Rasch analysis. The Rating Scale Model was applied as the same response scale was used across all items. The step calibrations (the transition points from one response category to the next) increased monotonically (−1.47, −1.27, −0.76, 0.93, and 2.57 logits). This implied that the response scale functioned well in general, although the distances between the first three step calibrations could be larger according to Linacre's (2002) guidelines. This result is similar to that of the SaPS-20. The correlations among the four latent traits (see Table 4) ranged from 0.57 to 0.85 for the SaPS-20 and from 0.56 to 0.82 for the SaPS-SF.

Table 4. Correlations between the four latent traits.

The item difficulties, standard errors, and item fit statistics (i.e., Infit and Outfit MNSQ) for the SaPS-SF are presented in Table 3. All 12 items showed satisfactory fit to the Rasch model, indicating that all items within the same subscale were assessing the same construct as theorized.

The Wright map, shown in Figure 4, presents person measures and item difficulties calibrated on the same metric. The four continua on the left side indicate the students' measures on each of the four subscales. The items with their thresholds, organized into the four subscales, are placed on the right side. The notation x.y is used to indicate items and thresholds; for example, 3.5 refers to the 5th threshold of item #3. Although the range of student ability was much larger than the range of item difficulty for each of the four subscales, the SaPS-SF still provided a targeted measurement of student self-assessment practice because the items, together with their thresholds, covered the major range of students' ability on the latent trait.

Figure 4. The Wright map of the SaPS-SF.

Both conventional reliability (i.e., Cronbach's α) and Rasch reliability (i.e., EAP/PV reliabilities generated by ConQuest) were calculated for the SaPS-SF. For easy comparison, the reliabilities of each subscale of the SaPS-20 and the SaPS-SF are presented in Table 5. All four subscales in the SaPS-SF maintained satisfactory reliability after the exclusion of 40% of the items (from 20 to 12 items): the Cronbach's α values ranged from 0.76 to 0.82, and the Rasch reliabilities ranged from 0.79 to 0.86. The person separation indices of the two versions of the SaPS were quite similar for SEFM, SEFI, and SIF. The separation for SR dropped from 3 to 2.29, which is quite acceptable considering that the number of items decreased from 7 to 3.
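As background, a short sketch of the standard Rasch relation between a reliability coefficient R and the person separation index G (G = sqrt(R / (1 − R)), so R = G² / (1 + G²)) is given below; the specific value used is only a worked example, not output from the study data.

```python
import math

def separation_from_reliability(r: float) -> float:
    """Person separation index implied by a reliability coefficient."""
    return math.sqrt(r / (1 - r))

def reliability_from_separation(g: float) -> float:
    """Reliability coefficient implied by a person separation index."""
    return g ** 2 / (1 + g ** 2)

# Worked example: a separation of 2.29 corresponds to a reliability of about 0.84.
print(round(reliability_from_separation(2.29), 2))   # 0.84
```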

Table 5. Comparison of reliabilities of the SaPS-20 and SaPS-SF.

The correlations between students' Rasch person measures on the SaPS-SF and SaPS-20 were calculated. The coefficients were 0.94, 0.97, 0.95, and 0.92 for SEFM, SEFI, SIF, and SR, respectively. These high correlations indicated that the person measurement was stable across the short form and the original scale.

To further examine the invariance of estimates across the SaPS-20 and the SaPS-SF, the person measures and associated standard errors obtained from the two versions of the scale were imported into an Excel spreadsheet provided by Bond and Fox (2015), and an invariance plot was generated for each subscale (see Figure 5). The person measures from the SaPS-SF were plotted on the y-axis and the measures from the SaPS-20 on the x-axis. The 95% control lines were generated based on the standard errors of the person measures. The person measures for all four subscales fell within the 95% control lines with very few exceptions, indicating that person measures remained invariant (within error) across the short form and the original scale.
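A simplified sketch of the underlying logic (not a reproduction of the Bond and Fox spreadsheet) is shown below: after centring the two sets of measures, a person falls outside the 95% control band when the difference between the two measures exceeds 1.96 times the joint standard error; the array names are hypothetical.

```python
import numpy as np

def invariance_flags(m_sf, se_sf, m_20, se_20, z: float = 1.96):
    """Flag persons whose SaPS-SF vs. SaPS-20 measures differ beyond the joint error."""
    d = (m_sf - m_sf.mean()) - (m_20 - m_20.mean())   # centred difference in measures
    joint_se = np.sqrt(se_sf ** 2 + se_20 ** 2)       # joint standard error per person
    return np.abs(d) > z * joint_se                   # True = outside the control lines

# Usage with assumed arrays of person measures and standard errors:
# share_outside = invariance_flags(m_sf, se_sf, m_20, se_20).mean()
```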

Figure 5. Person measure invariance (SaPS-SF vs. SaPS-20).

Discussion

The lack of a valid instrument for assessing self-assessment practice significantly hinders the development of a detailed understanding of self-assessment. The SaPS-20 is the most recently developed tool (Yan, 2019; Yan et al., 2019) that is theory-driven and specifically designed for assessing different actions in the self-assessment process (Yan, 2018). The present study set out to extend this work by providing a valid and parsimonious measurement of self-assessment practice. The four-factor model found in the original SaPS-20 (Yan, 2018) held very well in the 12-item SaPS-SF. The SaPS-SF reflected all actions of self-assessment practice (SEFM, SEFI, SIF, and SR) in a more balanced fashion, with 3 items in each of the four subscales. All items in the SaPS-SF subscales fitted the Rasch model sufficiently and measured unidimensional constructs as theorized. The SaPS-SF is much more parsimonious (a 40% decrease in the number of items) but almost as effective as the original SaPS-20 in terms of differentiating person measures, as shown by the person separation indices. The invariance of person measures demonstrated in Figure 5, together with the high correlations between the Rasch person measures obtained from the SaPS-SF and the SaPS-20, provided strong evidence of concurrent validity.

As the SaPS (both the original scale and the short form) is a relatively new instrument, more studies are needed to provide further utility and validity evidence. First, as the sample used in this study was drawn solely from a Confucian-heritage culture, examining the reliability and validity of the SaPS-SF with samples from other cultures would be an interesting topic. Second, studies of the psychometric properties of the SaPS-SF with students of age groups not covered in this study (e.g., lower primary students, upper secondary students, and university students) are warranted. Third, further studies could investigate the external (e.g., correlations with relevant constructs) and consequential (e.g., prediction of outcome measures such as academic performance) aspects of the validity of the SaPS-SF (see Messick, 1995).

In conclusion, the SaPS-SF is a more economical measure of self-assessment practices which maintains the good psychometric properties of the original SaPS-20. The findings presented in this study facilitate the measurement of self-assessment practice in a parsimonious and effective way and, therefore, can contribute to future research in self-assessment.

Data Availability Statement

The datasets generated for this study are available on request to the corresponding author.

Ethics Statement

The studies involving human participants were reviewed and approved by The Education University of Hong Kong. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Funding

This work was supported by a General Research Fund (GRF) (Project No: EDUHK 18600019) from the Research Grants Council of Hong Kong.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

I would like to thank Prof. Trevor Bond for providing insightful comments on an early draft of this paper.

References

Adams, R. J., Wilson, M., and Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Appl. Psychol. Meas. 21, 1–23. doi: 10.1177/0146621697211001

Arbuckle, J. L. (2015). Amos (Version 24.0) [Computer Program]. Chicago, IL: IBM SPSS.

Bond, T. G., and Fox, C. M. (2015). Applying the Rasch Model: Fundamental Measurement in the Human Sciences, 3rd Edn. New York, NY: Routledge. doi: 10.4324/9781315814698

Boud, D. (1995). Enhancing Learning Through Self-Assessment. London: Kogan Page.

Brown, G., Andrade, H., and Chen, F. (2015). Accuracy in student self-assessment: directions and cautions for research. Assess. Educ. Princip. Policy Pract. 22, 444–457. doi: 10.1080/0969594X.2014.996523

Brown, G. T. L., and Harris, L. R. (2013). “Student self-assessment,” in The SAGE Handbook of Research on Classroom Assessment, ed J. H. McMillan (Thousand Oaks, CA: Sage), 367–393. doi: 10.4135/9781452218649.n21

Deneen, C., Brown, G. T. L., Bond, T. G., and Shroff, R. (2013). Understanding outcome-based education changes in teacher education: evaluation of a new instrument with preliminary findings. Asia Pac. J. Teach. Educ. 41, 441–456. doi: 10.1080/1359866X.2013.787392

Hart, C. O., Mueller, C. E., Royal, K. D., and Jones, M. H. (2013). Achievement goal validation among African American high school students: CFA and rasch results. J. Psychoeduc. Assess. 31, 284–299. doi: 10.1177/0734282912466726

Hu, L., and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct. Equat. Mod. 6, 1–55. doi: 10.1080/10705519909540118

Kline, R. B. (2015). Principles and Practice of Structural Equation Modeling, 4th Edn. New York, NY: The Guilford Press.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. J. Appl. Meas. 3, 85–106.

Maas, C. J. M., and Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology 1, 86–92. doi: 10.1027/1614-2241.1.3.86

McDonald, R. P., and Ho, M. R. (2002). Principles and practice in reporting structural equation analyses. Psychol. Methods 7, 64–82. doi: 10.1037/1082-989X.7.1.64

Meriac, J. P., Woehr, D. J., Gorman, C. A., and Thomas, A. L. (2013). Development and validation of a short form for the multidimensional work ethic profile. J. Vocat. Behav. 82, 101–107. doi: 10.1016/j.jvb.2013.01.007

Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am. Psychol. 50, 741–749. doi: 10.1037/0003-066X.50.9.741

Panadero, E., Brown, G. T., and Strijbos, J. W. (2016). The future of student self-assessment: a review of known unknowns and potential directions. Educ. Psychol. Rev. 28, 803–830. doi: 10.1007/s10648-015-9350-2

Panadero, E., Jonsson, A., and Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-efficacy: four meta-analyses. Educ. Res. Rev. 22, 74–98. doi: 10.1016/j.edurev.2017.08.004

Primi, R., Wechsler, S. M., Nakano, T. C., Oakland, T., and Guzzo, R. S. L. (2014). Using item response theory methods with the Brazilian Temperament Scale for students. J. Psychoeduc. Assess. 32, 651–662. doi: 10.1177/0734282914528613

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research. Expanded ed. Chicago, IL: The University of Chicago Press (1980).

Tan, K. H. K. (2012). Student self-Assessment. Assessment, Learning and Empowerment. Singapore: Research Publishing.

Testa, I., Capasso, G., Colantonio, A., Galano, S., Marzoli, I., di Uccio, U. S., et al. (2019). Development and validation of a university students' progression in learning quantum mechanics through exploratory factor analysis and Rasch analysis. Int. J. Sci. Educ. 41, 388–417. doi: 10.1080/09500693.2018.1556414

West, C., Baker, A., Ehrich, J. F., Woodcock, S., Bokosmaty, S., Howard, S. J., et al. (2018). Teacher disposition scale (TDS): construction and psychometric validation. J. Further Higher Educ. doi: 10.1080/0309877X.2018.1527022

Wilson, M. (2005). Constructing Measures: An Item Response Modeling Approach. Mahwah, NJ: Erlbaum Associates. doi: 10.4324/9781410611697

Wu, M. L., Adams, R. J., Wilson, M. R., and Haldane, S. A. (2007). ACER ConQuest, Version 2.0: Generalized Item Response Modelling Software [Computer Program]. Camberwell, VIC: Australian Council for Educational Research.

Yan, Z. (2016). The self-assessment practices of Hong Kong secondary students: findings with a new instrument. J. Appl. Meas. 17, 335–353.

Yan, Z. (2018). The Self-assessment Practice Scale (SaPS) for students: development and psychometric studies. Asia Pac. Educ. Res. 27, 123–135. doi: 10.1007/s40299-018-0371-8

Yan, Z. (2019). Self-assessment in the process of self-regulated learning and its relationship with academic achievement. Assess. Eval. Higher Educ. doi: 10.1080/02602938.2019.1629390

Yan, Z., and Brown, G. T. L. (2017). A cyclical self-assessment process: Towards a model of how students engage in self-assessment. Assess. Eval. Higher Educ. 42, 1247–1262. doi: 10.1080/02602938.2016.1260091

Yan, Z., Brown, G. T. L., Lee, C. K. J., and Qiu, X. L. (2019). Student self-assessment: Why do they do it? Educ. Psychol. doi: 10.1080/01443410.2019.1672038

Keywords: self-assessment, self-assessment practices scale, scale development, short form, Rasch measurement model

Citation: Yan Z (2020) Developing a Short Form of the Self-Assessment Practices Scale: Psychometric Evidence. Front. Educ. 4:153. doi: 10.3389/feduc.2019.00153

Received: 23 September 2019; Accepted: 11 December 2019;
Published: 08 January 2020.

Edited by:

Maria Assunção Flores, University of Minho, Portugal

Reviewed by:

Nikolaos Tsigilis, Aristotle University of Thessaloniki, Greece
Anthony Joseph Nitko, University of Pittsburgh, United States

Copyright © 2020 Yan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zi Yan, zyan@eduhk.hk
