Edited by: Bronwen Cowie, University of Waikato, New Zealand
Reviewed by: Alison Margaret Gilmore, University of Otago, New Zealand; Robbert Smit, University of Teacher Education St. Gallen, Switzerland
Specialty section: This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The purpose of this study is to investigate the validity of using multiple-choice (MC) items as a complement to constructed-response (CR) items when making decisions about student performance on reasoning tasks. CR items from a national test in physics have been reformulated into MC items and students’ reasoning skills have been analyzed in two substudies. In the first study, 12 students answered the MC items and were asked to explain their answers orally. In the second study, 102 students from five randomly chosen schools answered the same items. Their answers were scored, and the frequency of correct answers was calculated for each of the items. The scores were then compared to a sample of student performance on the original CR items from the national test. Findings suggest that results from MC items might be misleading when making decisions about student performance on reasoning tasks, since students may use skills other than those intended when answering the items. Results from MC items may also contribute to an overestimation of students’ knowledge in science.
This study investigates the validity of using multiple-choice (MC) items for making decisions about student performance on complex tasks. It has been performed as a reaction to the common practice in Sweden, where most of the national tests include a combination of MC and constructed-response (CR) items.
The national standards in the current Swedish curriculum represent complex skills such as “reasoning skills” (Christenson and Chang Rundgren).
In order to support teachers in making informed decisions about student proficiency, items included in a test must elicit evidence of the knowledge and skills sought. Such items can be of either MC or CR format. The reasons for including MC items are, for example, objective scoring and internal consistency on the test. Furthermore, since the Swedish teachers assess the tests themselves, MC items may also facilitate scoring and reduce teachers’ workload.
To decide whether MC items can contribute to the interpretation of student proficiency in relation to complex reasoning skills, it needs to be known whether (and how) students utilize their reasoning skills when solving the MC items. The purpose of this study is therefore to investigate the validity of using MC items for making decisions about student performance on complex tasks by investigating student reasoning when solving such items.
In the following section, the relationship between validity and reliability is outlined. Then previous research on students’ use of reasoning skills in MC items, as well as the potential interchangeability between MC and CR items, is presented and discussed.
According to the definition adopted here, whether a test is to be considered valid depends on whether the use and interpretations of the scores are reasonable (Messick).
The question of how to value different contributions to the assessment is sometimes described as a trade-off (Dunbar et al.).
Another way of handling the validity versus reliability trade-off is to attempt to maximize both. One way to accomplish this is to use both MC and CR items on a test. Some items are more closely aligned to the curriculum, but presumably have lower reliability, whereas the opposite is true for other items. Overall, this battery of items might give conditions for both validity and reasonably high reliability. Another way to prioritize both validity and reliability could be to strengthen the reliability for the items more closely aligned to the curriculum, for instance by detailed rubrics, training, and/or applying some kind of moderation procedure.
As shown, the relationship between validity and reliability is not a simple dichotomy. Rather, it concerns balancing demands for alignment with the curriculum and levels of certainty in the assessment. The particular question asked here is whether, when attempting to balance validity and reliability, it is meaningful to complement CR items with MC items when assessing complex skills such as reasoning.
It is not uncommon to assert that MC items can be used to assess students’ reasoning skills or their skills in drawing conclusions (Gustafsson et al.).
In a national evaluation of compulsory schools in Sweden (i.e., not the national tests referred to above), knowledge in science was tested for approximately 3,000 Swedish students in year 9 (the last year of compulsory school in Sweden). A number of the items were MC (see the figure below), and the proportions of correct answers for two of these items are shown in the table that follows.
Figure: Multiple-choice items from a national evaluation in Sweden (translated by the authors).
Table: Proportions of correct answers to the same items.

|        | Written performance on multiple-choice item (%) | Oral performance (%) |
|--------|--------------------------------------------------|----------------------|
| Item 1 | 26                                               | 80                   |
| Item 2 | 19                                               | 90                   |
However, in the study by Schoultz, the students answered the same items orally during interviews, and the proportions of correct answers were considerably higher.
The difference in frequency between the two studies is striking (see the table above).
In other studies, students’ reasoning in relation to MC items has been investigated through “think aloud protocols” (TAPs). This means that students explicated their thoughts while performing the tasks (Hamilton et al.).
Reich investigated students’ reasoning when answering MC items in history and distinguished between reasoning based on subject knowledge/skills, general knowledge/skills, and test-wiseness, showing that students could arrive at the correct answer without using the intended subject knowledge.
In yet another study, by Dufresne et al., students’ correct answers to MC items in physics did not necessarily reflect the intended understanding, since students could identify the correct alternative without reasoning soundly about the underlying physics.
The studies described above are only a small number of selected examples. However, the results from these studies indicate that students may use completely different skills when answering MC items than intended. Representative or not, this means that there is a risk that student performance on MC items might be misleading when attempting to assess complex skills. Since these studies are few, and their respective samples are small due to the qualitative nature of the research, there is a need for further research in this area.
Assuming that a student has a certain knowledge or ability which is relatively constant, he/she should answer all items addressing the same skill at approximately the same level. This is, however, hardly ever the case. Instead, a student with high ability may fail an easy item, and a student with lower ability may succeed on a more difficult one.
A number of studies have empirically investigated the differences between MC and CR items. Hogan, for instance, reviewed a large number of such studies and concluded that the two formats correlate highly and therefore largely measure the same construct.
Serious critique has been raised against the type of studies upon which the abovementioned conclusion is based (see Bennett).
Traub reviewed studies comparing MC and CR items in different domains and concluded that, in several of these domains, the two formats appeared to measure the same construct.
Generalizing from Traub’s review is difficult, partly because it includes so few studies and partly because those studies indicating that different item formats may test the same construct are very limited. For instance, in studies assessing programming, the CR items asked the students either to make a list of advantages and disadvantages with a certain method or to write down a specific procedure for programming. Similarly, in the studies assessing lexical knowledge, students were asked to “fill in the blanks” with single words. This means that the items in Traub’s review differ markedly from the examples discussed in the introduction of this article, where the students are supposed to respond to different views and reason about sources. In fact, the correlation between item formats in Traub’s review can be explained by the similarities between “recall” and “recognition,” which are both examples of memory knowledge (Wainer and Thissen).
Results from more recent research in this area also place the conclusion from Hogan and Traub in a somewhat different light. Becker and Johnston, for example, compared student performance on MC and CR items in economics and found only a weak relationship between the two formats, suggesting that they capture different dimensions of knowledge.
In sum, the results from previous research indicate that different item formats are not easily interchangeable, particularly when the aim is to assess complex skills.
This study aims to further investigate the relationship between MC and CR items, since current research indicates that: (1) students may use other skills than intended when answering MC items, and (2) MC and CR items do not necessarily assess the same construct.
Specifically, this study will analyze students’ reasoning skills in physics, where CR items from a national test in science have been reformulated into MC items. The study aims to answer the following questions:
1. When answering MC items constructed to measure students’ reasoning skills, which kinds of knowledge or skills is students’ reasoning based upon?
2. Is there an agreement between students’ answers to MC and CR items designed to address the same reasoning skills?
The overall design of this study is to reformulate CR items from a national test, which has been designed to test complex reasoning skills, into MC items and then compare students’ answers to the different item formats. By reformulating CR items into MC equivalents, one can assume that the MC items will more likely address the same construct as the CR originals, compared to the other way around. This transformation is described in detail below.
Two samples of students, one for each of the research questions, were asked to answer the MC items. First, a small sample of students answered the MC items during interview conditions, so they were given an opportunity to explain the rationale for their answers. Second, a larger number of students answered only the MC questions (no interviews), so their performance as a group could be compared to a national sample of student performance on the CR versions of the same items.
The CR items used in this study were taken from the Swedish National Assessment in physics for 12-year-olds. This test typically consists of three parts, all of which are allocated 1 h of testing time and which focus on: communication skills (Part A), investigations (Part B), and content knowledge (Part C). Part A is of primary interest, since it includes the assessment of students’ reasoning skills. This particular subtest consists of three CR items, each focusing on a particular subskill (Appendix A in Supplementary Material).
First, one task addresses students’ skills in using scientific knowledge in discussions about socioscientific issues. For instance, in the 2014 physics test, the one used in this study, the context was the transportation of fruit. The task required the students to make a decision about which mode of transportation to choose for transporting fruit from Italy to Sweden, taking into account environmental concerns.
The second task involves choosing and reasoning about information and sources. In the 2014 physics test, the context was how one’s view of the solar system is influenced by technology, religion, and culture. In the task, the students were presented with a number of different sources about the solar system. They were then expected to identify which of these sources were about how (a) technology and (b) religion/culture influence one’s view of the solar system. They were also expected to justify their choices (i.e., reason about the information and sources).
The third and final task focuses on using scientific knowledge in order to produce texts, figures, and tables for different audiences and purposes. In the 2014 physics test, the context was the shape of clouds. In the task, the students were presented with a short text about how clouds change during a summer day. The students were then expected to draw a series of pictures, showing how the clouds changed during a summer day and including explanatory captions.
It is important to note that these subtests of communication skills differ from conventional tests in several respects. They are composed entirely of CR items, and they are designed to assess both divergent and convergent thinking. On the one hand, students are free to develop their own line of reasoning. On the other hand, they are also asked to—and rewarded for—using their scientific knowledge to support their reasoning. This means that students can formulate almost any argument they want, as long as they support it (i.e., divergent thinking). However, their support has to be sound. This means that their answers are considered to be higher quality if they use correct and relevant scientific knowledge (i.e., convergent thinking). The more rule-bound and quantitative facet of the physics subject therefore only constitutes one aspect of the assessment. This may also differ from conventional conceptions of tests in physics.
When transforming the CR items into MC equivalents, the first part of each item, which presented the context and any additional information (such as the sources in the second task and the text about clouds in the third), was left unchanged. Therefore, the prompts were identical for both formats. Next, each CR item was divided into two or three MC items in order to reflect the entire breadth of the original item. For instance, in the example above about the solar system, the students were asked to identify relevant sources and justify their choices. When transforming this task into MC items, one item targeted whether the students could identify relevant sources, while a second item addressed whether students could distinguish between appropriate and less appropriate justifications (Appendix B in Supplementary Material). When formulating the stem of the items, care was taken to create similar demands to those in the original item. For example, in the CR version of the cloud task, students were asked to draw a series of pictures, showing how the clouds changed during a summer day. In the equivalent MC items, the students were asked to identify all the kinds of clouds described in the text. In a second item, they were asked to indicate the sequence of changes during the day. Measures were also taken to avoid dependence among items, in the sense that one item cued another.
In order to formulate different alternatives, the assessment manual for the test was used. The assessment manual, which the teachers use when assessing student performance, contains both criteria and authentic examples of student performance for each of the four levels (all items are scored from 0 to 3). Examples of student performance on lower levels were used to create the distractors. Examples from high-scoring students were used to formulate the correct alternative. All alternatives were modified, so that they were similar in length and expression. In total, the three CR items resulted in seven MC equivalents, which were scored as either correct or incorrect.
The table below presents the item characteristics (difficulty, discrimination, and reliability) for the MC-equivalent items.
Table: Item characteristics for the multiple-choice-equivalent items.

| Item           | 1a   | 1b   | 1c   | 2a   | 2b   | 3a   | 3b   |
|----------------|------|------|------|------|------|------|------|
| Difficulty     | 0.55 | 0.91 | 0.51 | 0.68 | 0.24 | 0.76 | 0.33 |
| Discrimination | 0.34 | 0.30 | 0.54 | 0.57 | 0.36 | 0.41 | 0.43 |

Reliability: 0.27 (items 1a–c), 0.44 (items 2a–b), and 0.19 (items 3a–b).
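As a concrete illustration of how item characteristics of this kind can be computed, the following Python sketch derives difficulty (proportion correct), discrimination (corrected item-total correlation), and internal consistency (KR-20) from a dichotomously scored response matrix. This is a minimal sketch with invented example data, not the authors’ analysis code, and the exact estimators used in the study are not specified in the article.

```python
# Minimal sketch (not the study's code): common classical test theory statistics
# for dichotomously scored items. The response matrix below is invented example
# data; rows are students, columns are items, 1 = correct.
import numpy as np

responses = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
])
n_students, n_items = responses.shape
total_score = responses.sum(axis=1)

# Difficulty: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Discrimination: corrected item-total (point-biserial) correlation, i.e., the
# correlation between each item and the total score with that item removed.
discrimination = np.array([
    np.corrcoef(responses[:, j], total_score - responses[:, j])[0, 1]
    for j in range(n_items)
])

# Internal consistency: KR-20, the dichotomous special case of Cronbach's alpha.
item_variance = difficulty * (1 - difficulty)
kr20 = (n_items / (n_items - 1)) * (1 - item_variance.sum() / total_score.var())

print("Difficulty:    ", np.round(difficulty, 2))
print("Discrimination:", np.round(discrimination, 2))
print("KR-20:         ", round(float(kr20), 2))
```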
In the first study, schools in the vicinity of the university were contacted to be part of the investigation. Three teachers at different schools answered this request. From each class, the respective teachers selected four students. Teachers were asked to select two boys and two girls, who also represented a combination of high- and low-performing students. Therefore, a total of six boys and six girls with different ability levels participated in the study.
Data consist of students’ answers to the MC equivalents. The students took the test individually, and they were asked to explain why they had chosen the particular alternative for each answer (see Hamilton et al.).
Students’ oral reasoning was first analyzed by categorizing their explanations as either correct or incorrect and then, independently of whether the explanation was correct or incorrect, according to the kind of knowledge or skills their reasoning was based upon. In the latter case, the ambition was to distinguish between subject knowledge/skills, general knowledge/skills, and test-wiseness, according to Reich. This resulted in the following eight categories:

1. Correct answer; no reasoning.
2. Correct answer; reasoning based on correct subject knowledge/skills.
3. Correct answer; reasoning based on incorrect subject knowledge/skills.
4. Correct answer; reasoning based on general knowledge/skills and/or test-wiseness.
5. Incorrect answer; no reasoning.
6. Incorrect answer; reasoning based on correct subject knowledge/skills.
7. Incorrect answer; reasoning based on incorrect subject knowledge/skills.
8. Incorrect answer; reasoning based on general knowledge/skills and/or test-wiseness.
The coding was completed by one researcher in relation to explicit and simple criteria. For instance, the students had to refer to scientific concepts in order to be categorized as “based on subject knowledge/skills.” The same researcher categorized all student answers on different occasions 2 weeks apart. Any deviations from the initial coding were checked against the criteria.
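To make the coding scheme concrete, the sketch below shows one way the categories could be represented and tallied. It is an illustration only: the category labels follow the list above, but the data structure, names, and the coded example answers are hypothetical and not part of the study.

```python
# Illustrative sketch only: representing the coding categories described above
# and tallying coded oral explanations. The coded answers listed here are
# hypothetical examples, not data from the study.
from collections import Counter
from dataclasses import dataclass

REASONING_BASES = (
    "no reasoning",
    "correct subject knowledge/skills",
    "incorrect subject knowledge/skills",
    "general knowledge/skills and/or test-wiseness",
)

@dataclass(frozen=True)
class CodedAnswer:
    student: str
    item: str
    correct: bool            # whether the chosen alternative was correct
    reasoning_basis: str     # one of REASONING_BASES

coded_answers = [
    CodedAnswer("S01", "1a", True, "general knowledge/skills and/or test-wiseness"),
    CodedAnswer("S01", "2a", True, "correct subject knowledge/skills"),
    CodedAnswer("S02", "2b", False, "general knowledge/skills and/or test-wiseness"),
    CodedAnswer("S02", "3b", True, "correct subject knowledge/skills"),
]

# Cross-tabulate correctness against the basis of reasoning, mirroring the
# kind of summary tables reported in the Results section.
tally = Counter(
    ("correct" if a.correct else "incorrect", a.reasoning_basis)
    for a in coded_answers
)
for (answer, basis), n in sorted(tally.items()):
    print(f"{answer:9s} | {basis:45s} | {n}")
```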
In the second study, five schools were randomly chosen from a national database containing all Swedish compulsory schools and contacted to be part of the investigation. In total, these schools had 102 12-year-old students who could take the test. The tests were sent by ordinary (i.e., not electronic) mail to the teachers. They distributed the tests to the students, collected them again, and sent them back to the researchers.
Data for this study consist of students’ answers to the MC equivalents, which were scored, and the frequency of correct answers was calculated for each item. The scores were then compared to a sample of student performance on the original CR items from the national test.
The national sample comes from teachers voluntarily reporting student performance on the national test through a website, so that they may compare their own students’ performance with the performance of all other students reported through the website. This is a service provided by the test developers. The sample in this study corresponds to approximately one-third of all students in the country who took the test.
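As an illustration of this comparison, the sketch below aggregates the proportion of correct answers on the MC equivalents per original task and contrasts it with the percentage correct on the corresponding CR item. The response matrix and the task-to-column mapping are hypothetical; the CR percentages are the values reported in the Results section, and the study itself reports only simple descriptive comparisons of this kind.

```python
# Minimal sketch (not the authors' analysis script): per-task percentage correct
# on the MC equivalents compared with the national CR results. The response
# matrix is invented example data; the CR percentages are those reported in the
# Results section of this article.
import numpy as np

# rows = students, columns = MC items 1a, 1b, 1c, 2a, 2b, 3a, 3b
responses = np.array([
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1],
])

# Each original CR task corresponds to two or three MC equivalents (columns).
task_columns = {"Item 1": [0, 1, 2], "Item 2": [3, 4], "Item 3": [5, 6]}
cr_percent = {"Item 1": 54.2, "Item 2": 58.2, "Item 3": 61.0}  # national CR sample

for task, cols in task_columns.items():
    mc_pct = 100 * responses[:, cols].mean()
    diff = mc_pct - cr_percent[task]
    print(f"{task}: MC {mc_pct:.1f}% vs CR {cr_percent[task]:.1f}% ({diff:+.1f} points)")
```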
This study was carried out in accordance with the ethical guidelines for the Humanities and Social Sciences set out by the Swedish Research Council. The study has not been subjected to review by an ethical committee since, according to Swedish legislation regarding research on human subjects (2003:460), research needs approval from an ethical committee only in cases where personal and sensitive information is handled, when physical interventions are made, or when the subjects may be harmed. In line with this, approval from an ethical committee is not required by the university where the research was conducted. All subjects, as well as their legal guardians, have been informed about the purpose of the research, that their participation is voluntary, and that they can interrupt their participation at any time. Written informed consents have been given by all subjects, as well as their legal guardians, in accordance with the Declaration of Helsinki.
Analyzing the answers from the 12 students to the MC equivalents reveals that the students as a group provided correct answers in 69 out of 84 instances (82.1%). Of these 69 correct answers, a total of 31 were based on reasoning using correct subject knowledge. The remaining 38 correct answers were mainly based on a combination of general knowledge/skills and test-wiseness strategies (29), although some were based on incorrect subject knowledge/skills (6), and a few students did not provide any reasoning at all (3).
The 15 incorrect answers were almost entirely based on a combination of general knowledge/skills and test-wiseness strategies. However, in a couple of instances, they were also based on subject knowledge. These results are summarized in the table below.
Table: Overview of students’ strategies in answering the multiple-choice-equivalent items.

|                                     | Correct answers | Incorrect answers | In total |
|-------------------------------------|-----------------|-------------------|----------|
| Subject knowledge                   | 31              | 2                 | 33       |
| General knowledge and test-wiseness | 29              | 13                | 42       |
| Incorrect knowledge                 | 6               | –                 | 6        |
| No reasoning                        | 3               | –                 | 3        |
| In total                            | 69              | 15                | 84       |
Some interesting observations can be made from the table above and from the item-level breakdown in the table below. One is that the extent to which correct answers were based on subject knowledge varied considerably between items: for items 1a and 1b, correct answers were almost exclusively based on general knowledge/skills and test-wiseness, whereas for items 2a and 3b, most correct answers were based on subject knowledge.
Table: Students’ strategies in answering each multiple-choice-equivalent item.

|                                     | 1a | 1b | 1c | 2a | 2b | 3a | 3b |
|-------------------------------------|----|----|----|----|----|----|----|
| Correct answers                     |    |    |    |    |    |    |    |
| Subject knowledge                   | 1  | 3  | 5  | 8  | –  | 6  | 8  |
| General knowledge and test-wiseness | 8  | 7  | 4  | 1  | 2  | 5  | 4  |
| Incorrect knowledge                 | –  | –  | 2  | 2  | –  | –  | –  |
| No reasoning                        | 1  | 2  | –  | –  | –  | –  | –  |
| Incorrect answers                   |    |    |    |    |    |    |    |
| Subject knowledge                   | –  | –  | –  | –  | 1  | –  | –  |
| General knowledge and test-wiseness | 2  | –  | 1  | 1  | 9  | 1  | –  |
A second observation from the table is that item 2b stands out: a clear majority of the students chose an incorrect alternative, basing their choice on general knowledge/skills and test-wiseness rather than on subject knowledge.
What cannot be seen in the tables, but in the recordings of students’ reasoning, is that students basically apply one major strategy for general knowledge/skills and test-wiseness. They compare the different options and reason about the formulations in order to identify the correct alternative. For example, the context for item 1 was the transportation of fruit. The students were asked to make a decision about which mode of transportation to choose for transporting fruit from Italy to Sweden, taking into account environmental concerns. In the following excerpt, a student is explaining his/her answer to item 1c, in which the specific focus was on identifying concerns other than pollution and costs, which were topics covered in items 1a and 1b. As shown, the student reasons by comparing different alternatives, first numbers 1 and 5 and then numbers 3 and 6 (the numbers refer to questions posed by fictional children in the task). This individual finally arrives at number 9 by the process of elimination.
Yes, I chose number 9, because number 1, which mode of transportation produces the least amount of dangerous emissions, is sort of the same question as here in number 5, what amount of dangerous emissions does each mode of transportation produce per box of oranges. And question 3, which mode of transportation is the most expensive, is also somewhat like number 6, because it is about what it costs to transport a box of oranges with the different modes of transportation. Yes, and then question number 9 is about how much of the oranges it is possible to sell after having transported them with the different modes of transportation. (School 3, Student 1)
This example shows a pattern that recurs across almost all student responses, regardless of whether the students use subject knowledge or not: their focus is on the task and its wording. Similarly, in items 2a and 3b, most students base their reasoning on correct subject knowledge. However, this knowledge is most often used to distinguish between the options, not to reason about the phenomenon addressed by the item. This means that although students reason, they do not necessarily provide the reasoning that was intended.
When comparing the options and deciding which one to choose, some notable differences emerged between the items. For instance, in relation to items 1a–c, students compare the formulations of the different options and then cross-check with similar wordings or synonyms in the information (remember that information is provided for all items in this test):
I chose that one because it had the best agreement with the text. (School 3, Student 2)
This means that the students use reading-comprehension skills and word knowledge as general knowledge/skills to find the correct option.
Other items are treated differently. In item 2b, for example, where students are supposed to choose the best justification for a choice of sources, two of the alternatives include the words that students may look for. Nonetheless, the majority of students choose the third (incorrect) option. According to the recordings, this alternative appeals to the students because it is easier to understand, whereas the correct option is more difficult to comprehend:
/…/Nicklas, I don’t even get his justification, but Love, I understand how he’s thinking. (School 2, Student 1)

You understand it well anyway. You know why he chose it… [pause] from the requirements, or what you should call them, up here. (School 1, Student 1)
A similar strategy is used in item 3a, where the students choose the most elaborate answer (in this case student drawings), either because it is more elaborate or because it “explains a little bit better”:
Because here things are explained too. On the other hand, there they have only drawn the clouds and not as much is explained, [pause] but this one explains more with the arrows and things like that. (School 3, Student 2)

Well, it looks like someone has put more effort into this. [pause] It looks like she’s spent more time doing this. [pause] Showing how she’s thinking and… you know. (School 2, Student 3)
The table below compares the mean percentage of correct answers on the MC-equivalent items (aggregated per original task) with the national results for the corresponding CR items.
Table: Mean score for multiple-choice (MC)-equivalent items and the national constructed-response (CR) items.

|        | MC equivalents (%) (n = 102) | CR items (%) (national sample) |
|--------|------------------------------|--------------------------------|
| Item 1 | 65.7                         | 54.2                           |
| Item 2 | 45.5                         | 58.2                           |
| Item 3 | 55.0                         | 61.0                           |
As the table shows, the students performed better on the MC equivalents of item 1 than the national sample did on the CR original, whereas the opposite holds for items 2 and 3, where performance on the MC equivalents was lower than on the CR originals.
The aim of this study was to investigate the relationship between MC and CR items. MC items were therefore constructed from CR items originally designed to assess students’ reasoning skills in physics. The MC items were then answered by 12 students, who also explained the reasons for their answers, and their explanations were recorded. Furthermore, the MC items were sent to five randomly selected schools, so that 102 student answers could be collected. These answers were compared, in terms of difficulty, to student performance on the CR items from the national test.
In relation to the first research question, which kinds of knowledge or skills students’ reasoning is based on when answering MC items, the findings suggest that students use general knowledge/skills (such as reading comprehension and word knowledge) and test-wiseness strategies (such as comparing the wording of the different alternatives) in the majority of cases. Even when correct subject knowledge is used, it is used to distinguish between the different alternatives, rather than being directed toward the scientific context of the task. These findings resonate with previous research in this area, which has shown that students may use different skills than intended when answering MC items. In particular, this study substantiates the research by Hamilton et al. described above.
The findings from the second study, addressing the question of whether there is an agreement between students’ answers to MC and CR items designed to address the same complex skills, also support the interpretation that students may use different skills when answering MC items than when answering CR items. The findings indicate that the difficulty for the MC and the CR items differs for two of the three items, despite the effort to make them as similar as possible. Taken together, the results provide indications of MC and CR items not being easily exchangeable. Again, these findings corroborate previous research (Miller).
From the current data, it is not possible to draw any conclusion about the reasons for the observed differences between CR and MC items. However, as a recent study about item difficulty in the Swedish national science test suggests, the difference in difficulty for the CR items can be partially explained by how much of their own knowledge in science the students need to draw upon when answering the items (Jönsson).
For instance, the students find item 1 much easier when not having to formulate their own arguments, but instead choosing among different alternatives. A possible reason is that the students do not have to rely on their own knowledge to the same extent. On the other hand, item 2 becomes more difficult in the MC format. As the findings above reveal, the students tend to choose the wrong alternative because they find it easier to read and understand. The explanation for the increased difficulty of item 2 may therefore lie in the fact that students are less familiar with the words and concepts used in some of the alternatives. When answering the same item in their own words, they do not have to rely on word knowledge to the same extent, and item difficulty drops. Similarly, for item 3, the difference is smaller than for item 2. Whereas the CR version of the item requires students to use the text to draw pictures, the MC version relies not only on reviewing others’ drawings but also on the written text in the alternatives.
The inclusion of MC items when assessing reasoning skills in physics may improve reliability estimates, facilitate scoring, and reduce teachers’ workload. Nonetheless, the findings from this study suggest that the addition of MC items may provide misleading information about student performance on reasoning tasks and may contribute to an overestimation of students’ knowledge in science.
According to these findings, the best way to handle the validity versus reliability trade-off is therefore not to complement the CR items with MC items, but rather to strengthen the reliability of the assessment of the CR items, for instance through detailed rubrics, training, and/or moderation procedures.
Several limitations of this study affect the possibility to generalize the findings. Most important are the items used, since they likely have a significant impact on the results. Specific strengths of this study are that the CR items are designed to address complex skills and that they have been thoroughly evaluated with a large number of students, since they are part of a national test. Furthermore, all of the MC items were systematically constructed from the CR originals. All information concerning the task was identical for both MC and CR items, making comparisons more valid. It is, however, not possible to make the MC items perfectly equal to the CR originals, for instance because the CR items were multidimensional (they were to be assessed with more than one criterion), whereas the MC items have to be unidimensional. This means that the MC items could have been designed differently, and other items could possibly produce different results. The number of items (7) used in this study was also small, as an adaptation to the age of the students (12-year-olds). A similar investigation, but with other and a greater number of MC items, is therefore a natural recommendation for subsequent research.
Another important limitation to this study is that—like much research in this area—it is small-scale. The first substudy included a convenience sample of 12 students from 3 different schools. Furthermore, the students had to explain the reasons for their answers. This procedure may, on the one hand, have provided more focused data material, compared to, for instance, TAPs. However, it may also have produced the task-oriented answers observed in the recordings as a methodological artifact. Future research investigating students’ explanations in relation to CR items or comparisons with TAPs is therefore imperative.
Finally, although a random selection of schools is included, the second substudy is also based on a limited sample of student performance. The sample from the national test is much larger, but is based on teachers’ voluntary reporting. Of great importance is also the fact that it is assessed by the teachers themselves. Due to the uncertain nature of these data, quite simple statistical tools were used for analyzing the data. More sophisticated methods may have provided more nuance to the findings, for instance regarding interactions between the students and item characteristics. The final recommendation for future research is therefore to further investigate the statistical relationship between MC and CR items. This should include data that (a) are based on a systematic transformation of CR to MC items, so that the items are designed to address complex skills, but (b) do not depend on teachers’ voluntary reporting and potentially unreliable assessment.
This study was carried out in accordance with the ethical guidelines for the Humanities and Social Sciences set out by the Swedish Research Council. According to the national guidelines, as well as Swedish law regarding research on human subjects (2003:460), research needs approval from an ethical committee only in cases where personal and sensitive information is handled, when physical interventions are made, or when the subjects may be harmed. All subjects, as well as their legal guardians, have given written informed consent in accordance with the Declaration of Helsinki.
AJ was the principal investigator, who led the design of the study, literature review, data collection, analyses, interpretation, and writing and revising of the manuscript. DR and FA both contributed to the conceptualization and planning of the study, to analyses, and to the writing and revising of the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at
1In Sweden, the teachers are responsible for assessing students’ performance on the national tests, as well as reporting the results. An “assessment manual” is therefore delivered with the test. This manual specifies the correct answers to MC and other selected-response items, whereas criteria and examples of student responses are provided for CR items. The assessment manual also includes an algorithm for calculating a “test grade” (A–F), either for the entire test or for different parts of the test.
2The stem in an MC item is the question or statement that precedes the options.
3In Sweden, schools are randomly assigned one of the science tests. This means that approximately one-third of the students take the test in physics, one-third the test in biology, and one-third the test in chemistry.
4Nicklas and Love are names of fictional characters who provide the justifications the students are supposed to choose from for the item.