The Future of Assessment as a Human and Social Endeavor: Addressing the Inconvenient Truth of Error

Brown, Gavin T. L.

doi:10.3389/feduc.2017.00003

SPECIALTY GRAND CHALLENGE article

Front. Educ., 13 February 2017
Sec. Assessment, Testing and Applied Measurement
Volume 2 - 2017 | https://doi.org/10.3389/feduc.2017.00003

Gavin T. L. Brown*

Faculty of Education and Social Work, School of Learning, Development, and Professional Practice, The University of Auckland, Auckland, New Zealand

Assessment faces continuing challenges. These challenges arise predominantly due to the inherent errors we make when designing, administering, analyzing, and interpreting assessments. A widely held assumption is that our psychometric methods lead to reliable and valid scores; however, this premise depends on students exercising 100% effort throughout a test event, with no cheating, and having had sufficient personal environmental support to produce best possible results (Dorans, 2012).

Inconveniently, research makes clear that cheating and lack of effort contaminate scores (Murdock et al., 2016; Wise and Smith, 2016). This is especially the case in low-stakes testing situations, such as institutional evaluations (Wise and Cotten, 2009), leading to inappropriate conclusions about the state of an organization or jurisdiction. Hence, while it is convenient to presume that statistical advances will account for such systematic sources of error, the reality is that much assessment takes place both “in vivo” and “in situ” during classroom activities (Zumbo, 2015). Thus, while psychometric methods work reasonably well in high-stakes examination or standardized testing contexts (i.e., “in vitro”), there is little guarantee that their assumptions hold true for what happens in classroom contexts. Consequently, the psychometric and testing industry has much to do to develop methods of describing and accounting for the myriad complexities of classroom- or school-based dynamics.

This matters because a widespread policy framework of using assessment to guide or inform improvement (i.e., “assessment for learning” or “formative assessment”) requires teachers to assess students so as to identify the quality of student learning and appropriate changes to classroom practices. UK experts tend to argue that this can only be done through teacher–student interaction in the classroom or by involving students in the process of considering the merits of their own or peers’ work (Black et al., 2003; Harlen, 2007; Swaffield, 2011). Others consider that tests can contribute information about changes to teaching that lead to better learning outcomes, provided the tests go beyond rank order or total score reporting (Brown and Hattie, 2012) or if teachers spend time analyzing strengths and weaknesses (Carless, 2011).

Regardless of the type of assessment method, it is very difficult for pre-service teachers to learn how to assess formatively (Hill and Eyers, 2016). Indeed, even practicing teachers need expertise in curriculum and pedagogy to exercise command of multiple methods of assessment in such a way that all learners are helped to overcome the sometimes idiosyncratic challenges they face (Cowie and Harrison, 2016; Moon, 2016). Teachers in New Zealand and the Netherlands have learned to use achievement data to guide school-wide improvements, provided experts give them help (Lai and Schildkamp, 2016). However, such efforts often take 2–3 years before changes can be seen in student performance. Thus, despite multiple studies which show that teachers believe in using assessment formatively (Barnes et al., 2015; Bonner, 2016), putting in place policy and resources to support formative assessment is difficult, meaning formative assessment is not a quick fix for improving outcomes for all learners.

The formative assessment policy agenda challenges the dominance of formal testing and teacher-centric methods of assessment, with expectations that effective learning takes place as students engage with learning targets, outcomes, or objectives, take ownership of their work, cooperate with peers, understand more deeply what quality is, and receive and generate appropriate feedback (Leahy et al., 2005). Inconveniently, involving students in assessment presents a considerable challenge due to psychological and social factors that interfere with the student’s ability to accurately self-evaluate (Andrade and Brown, 2016) or to constructively peer evaluate and collaborate (Panadero, 2016; Strijbos, 2016). Indeed, evidence that student involvement in assessment develops self-regulatory abilities is weak (Dinsmore and Wilson, 2016). Feedback processes are complex, belying the simple notion that student “horses” will automatically learn once they are led to the “water” of feedback (Lipnevich et al., 2016). While novelty in assessment methods is being developed, especially through the introduction of ICT (Katz and Gorin, 2016), students are not necessarily fans of new ways of being assessed, for fear their performance will be impacted (Struyven and Devesa, 2016).

A second widespread policy initiative is to use assessments, especially standardized tests, to evaluate teachers, schools, and systems (Lingard and Lewis, 2016; Teltemann and Klieme, 2016). It is clear that such policies tend to have a largely negative impact on the quality of teaching (Hamilton, 2003; Nichols and Harris, 2016), and perhaps more so among minority and lower socio-economic communities. Nonetheless, public acceptance of the legitimacy of using assessment scores to ascertain quality in schooling is reasonably high (Buckendahl, 2016). Using tests to evaluate schools and teaching is a relatively quick and low-cost political process (Linn, 2000). However, summative accountability use of assessments creates tensions for teachers (Bonner, 2016), with many teachers in high-stakes accountability environments having very negative views of such uses (Deneen and Brown, 2016). Using assessments formatively requires discovery of what students have “failed” to be good at, so as to inform further instruction (Hattie and Brown, 2008). This implies that a formative assessment ought to reveal lack of success, a problematic event if external accountability consequences are attached to the same result. Thus, if consequences for low scores are seen as unfair, then it is not surprising if teachers use multiple methods to ensure that scores increase. If accountability assessment scores are inflated through construct-irrelevant processes, then the meaning of an accountability assessment is problematic.

The choice of policy priorities within different jurisdictions strongly shapes the nature and power of assessment practices. For example, both Arabic and Chinese language societies strongly prioritize memorization of content as the dominant model of schooling and attach substantial social and economic benefits for successful performance on formal examinations (Hargreaves, 1997; OECD, 2011; Gebril, 2016; Kennedy, 2016). Anglo-Commonwealth countries strongly prioritize a child-centered, student-involved approach (Stobart, 2006), in which interactive teacher assessment practices have been prioritized as means of improving learning outcomes (Black and Wiliam, 1998). The United States has strong legal protection for special needs students (IDEA, 1997) who are entitled to differentiated assessment and evaluation practices (Tomlinson, 1999). These differences in social uses and styles of assessment complicate the meaning of a grade or score and create challenges for psychometric models that attempt to create universal explanations of performance.

Within societies that are highly homogeneous in terms of ethnic and linguistic make-up (e.g., Finland, Japan, China), it may be reasonable to expect that common psychological and social factors influence assessment. This simplifies predicting and modeling those factors. However, when comparisons are made among culturally distinct groups in multicultural societies, which is more the case in economically developed societies and nations (Van de Vijver, 2016), the psychological factors influencing student response, teacher judgments, or test performance can vary significantly. For example, tendencies to self-effacement or self-enhancement are not equal across cultural groups (Suzuki et al., 2008), so the meaning of self-assessment has to be carefully evaluated (e.g., among collectivist groups modest self-reporting enhances group belongingness). In multicultural contexts, assessments that depend on classroom interactions between and among students and teachers are likely to be impacted by these different cultural standards as to the best way to communicate an evaluation of work. The capacity of teachers to appropriately collect, analyze, and plan in response to both formal and informal assessment data is generally weak (Xu and Brown, 2016). Quite prolonged and intensive professional development is needed to generate “assessment capable” teachers (Smith et al., 2014). Thus, assessors and assessments are challenged by the varying and subtle differences created by cultural difference.

Even the introduction of technological solutions that increase the authenticity, diversity, and efficiency of formal testing (Csapó et al., 2012; Katz and Gorin, 2016) does not necessarily improve student performance or solve problems in scoring. Students’ enthusiasm for a computerized activity does not automatically lead to valid conclusions about their proficiency. Students are often concerned that novel assessment practices (including peer assessment, self-assessment, portfolio, performance, or computer-based assessments) will have negative impacts on their performance simply because they are unsure as to how well they will do on a new method of evaluation (Struyven and Devesa, 2016). Consequently, students tend to retreat into a strong preference for conventional assessment practices (e.g., essays or multiple-choice questions). Furthermore, technology now permits data sharing and long-term tracking of student performance, which ought to improve our understanding of how students are improving in which areas. However, the existence of these electronic data raises concerns about privacy and protection; imagine possible negative implications if early poor performance is kept on record and used in evaluative decision-making, despite substantial subsequent progress (Tierney and Koch, 2016).

Thus, inconveniently, the field of testing, applied psychometrics, measurement, and assessment is faced with complex problems, which are not restricted to any one form of assessment or any one society in which assessment is deployed. The inconveniences outlined here are especially the case if we accept that the goal of assessment is to inform improvement and make valid decisions about learners and teachers. The need for accurate diagnostic prescriptions that teachers, students, and/or parents could use to inform improvement is paramount. These prescriptions need to occur close to and responsive to the real-time processes of classroom learning and teaching, which is a substantial problem. The great contribution of psychometrics to the field of education has been an explicit attention to the problem of error in all testing, measurement, and assessment processes. However, few tools are currently available to robustly estimate and account for the kinds of error that occur in real-time classroom observations, interactions, and interpretations. The inconvenient challenge for educators who would minimize the role assessment plays in curriculum is that high-quality tests and measurements are necessary for justice, fairness, and the well-being of individuals and society. The inconvenient challenge for policy makers is that many assessment processes are not reliable or dependable (e.g., essay examinations; Brown, 2010), nor do they account well for the many factors outlined here. Thus, many policy decisions based on inadequate tools or processes are invalid.

The future of assessment requires that we no longer ignore these inconvenient problems facing assessment, testing, and applied measurement. Rather, assessment has to turn constructively to deeply insightful investigations into these perennial problems. Teachers and students need to know where learning is and what is next. Policy makers and parents have a right to know what is working, who is learning, who needs help, what needs to change, and so on. Assessment and testing are how we as humans discover the answers to these questions. Hence, good schooling and good education need good testing or assessment, good in the sense of being both high quality and rightly done.

Leaning heavily on validity theory (Messick, 1989; Kane, 2006), good assessment leads to defensible interpretations and actions. These uses depend on robust arguments based on relevant theories of curriculum, teaching, learning, and measurement and on trustworthy empirical evidence that has been subjected to scrutiny (i.e., statistical and/or social moderation). The need to bring greater skill and insight into assessments that inform classroom practice is essential. The success of the whole superstructure of schooling relies on the quality of judgments and evaluations carried out in the millions of classrooms of the world on an everyday basis. If this work is not done well, and if we do not know that it is not done well, we fail.

Hence, work that engages with the difficult challenge of how assessment can help education, while also making a credible case for the scores or judgments generated by assessments, needs to be reported. Leaving this only to educational statisticians would be a mistake. Testing and measurement need to integrate with classroom teaching, learning, and curriculum if they are to support schooling and prevent politicians from making simplistic but wrong interpretations and uses of assessment. This is the Grand Challenge for this Section of the journal Frontiers in Education. How can assessment be made flexible enough to support real learning in vivo, while fulfilling all the diverse expectations society has for it? As Section Editor, I look forward to your contributions.

Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This paper draws heavily on Brown and Harris (2016). An earlier version of this paper, presented as an inaugural professorial lecture at the Faculty of Education and Social Work, The University of Auckland, can be seen at doi: 10.17608/k6.auckland.4238792.v1.

References

Andrade, H. L., and Brown, G. T. L. (2016). “Student self-assessment in the classroom,” in Handbook of Human and Social Conditions in Assessment, eds G. T. L. Brown and L. R. Harris (New York: Routledge), 319–334.
