Monday, May 26, 2008

Cargo cult psychometrics? Setting standards on standards-based standardized tests

This past week I had the opportunity to work for the Maine Department of Education (DOE) with a small, diverse group of educators on the task of setting achievement standards for the Science component of the Maine High School Assessment (MHSA). The intent of the MHSA is to measure student learning relative to the Maine Learning Results (MLR), a body of learning objectives / outcomes / standards that all students in Maine are expected to be proficient in as a result of their high school educational experience. A few years back, the Maine DOE decided to adopt the College Board's SAT as the MHSA instead of continuing with their own Maine Educational Assessment (MEA), which had been developed with the help of Measured Progress, a non-profit firm based in New Hampshire. Aside from offsetting some development costs, the switch to using the SAT as the primary component of the MHSA was undoubtedly influenced by the nice side-effect that all students in Maine would be one step closer to college application readiness. However, unlike the MEA, the SAT does not correlate with all standards in the MLR. Up until this year the federal DOE did not require states to report student learning in Science (at least in grades 9 - 12 ... I'm not sure about earlier grade levels), so Maine had not included any Science questions since switching to the SAT. But because Maine needed to report student learning in Science this year, the state DOE worked with Measured Progress to develop a multiple choice and free response augmentation for the MHSA, measuring student learning relative to the Science standards from the MLR in order to comply with federal reporting rules.

Because it had been a few years since Maine students' learning in Science had been assessed, the panel I worked on was tasked with setting achievement standards - expectations - for categorizing overall student performance on the Science augmentation as "Does Not Meet", "Partially Meets", "Meets", and "Exceeds". We began our work by actually taking the assessment; while I can't discuss any of the test items specifically, I can say that the questions seemed generally well-written and represented a broad and balanced sampling of the Science standards. We were then given a binder containing all the questions in order of difficulty, as determined by Item Response Theory (IRT) analysis. The first step in that analysis is to generate an Item Characteristic Curve (ICC) for each question; we were told the ICCs were of the logistic or "S curve" type, although we never saw the actual graphs.

IRT is a complicated psychometric analytical framework that I heard of for the first time during this panel - I am still learning about it using the following resources: UIUC tutorial; USF summary; National Cancer Institute's Applied Research pages. We were not taught any of the following specifics on IRT during the panel session. From what I've learned subsequently by reading through the above-linked resources, it appears that the purpose of the ICC is to relate P, the probability of a particular response to a question (in this case, the correct answer), to Theta, the strength of the underlying trait of interest (in this case, knowledge of the MLR's Science standards). In a logistic ICC, the shape of the S curve is governed by three parameters: "a", the discrimination parameter (how steeply the curve rises); "b", the difficulty parameter (the value of Theta at the curve's inflection point); and "c", the guessing parameter (the curve's lower asymptote - the probability that even a very low-Theta student answers correctly). What I'm not yet sure of, and perhaps might never be, is whether the rank-order of the questions in our binders was based on some type of integration of the P v. Theta curve for each question, or on the "b" value of each item's ICC - from the way it was described to us, I suspect the latter to be the case.
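Since we never saw the actual graphs, here is a small sketch of what a three-parameter logistic ICC looks like in code. The parameter values are made up for illustration; they are not from any real MHSA item.

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: the probability of a correct
    response given ability theta, discrimination a, difficulty b,
    and guessing parameter c (the lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative (made-up) parameters for one multiple-choice item:
# a = 1.2, b = 0.5, c = 0.2 (a plausible guessing floor for a
# five-option item).
p_low  = icc_3pl(-2.0, 1.2, 0.5, 0.2)  # weak student: barely above c
p_mid  = icc_3pl(0.5, 1.2, 0.5, 0.2)   # theta == b: halfway between c and 1
p_high = icc_3pl(3.0, 1.2, 0.5, 0.2)   # strong student: approaching 1
```

Note that at theta = b the probability is c + (1 - c)/2, not 1/2 - the guessing parameter shifts the whole curve up, which is part of why these curves are harder to reason about than a simple pass/fail threshold.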

Once we had the binders with ordered questions, we were asked to go through each question and to determine, individually, what the question measured and why it was more difficult than the previous question. A multiple choice or free response question can measure a lot of different factors - we were instructed to concentrate on determining which standard(s) from the MLR's Science section were being assessed. So, our analysis of what was measured by each question left out the important factors of wording, inductive reasoning, and deductive reasoning, just to name a few. After finishing our question-by-question analysis and discussing our individual findings as a group, we moved on to another task: describing the Knowledge, Skills, and Abilities (KSAs) that we associated with students at the four different achievement levels (Does Not Meet, Partially Meets, Meets, and Exceeds). Completing this task took quite a while, as there were many different opinions about what kind of KSAs different educators had observed in and/or expected from students at the different achievement levels.

With achievement level KSAs in mind, we then moved into the "bookmarking" task, the results of which would be sent on to the Maine DOE as the panel's recommendation for the cut scores to categorize students within one of the four achievement levels. In the bookmark standard setting procedure, each of us was given three bookmarks - one to place at the cut between Does Not Meet and Partially Meets, another to place at the cut between Partially Meets and Meets, and a final one to place at the cut between Meets and Exceeds. We were instructed to go through the binders with ordered questions and, starting from the easiest question at the very beginning, to place the bookmarks at the transition point where we felt that only the students with the KSAs characteristic of the next-higher achievement level would be able to get the question correct 2/3 of the time. Again, as with IRT, the underpinnings of the bookmark standard setting procedure weren't explained to us in detail, so I've been reading the following sources to learn more about it: Wisconsin DPI - 1; Wisconsin DPI - 2; Lin's work at U. Alberta's CRAME [PDF].
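As I understand it (and this is my reconstruction, not something explained to us), the "correct 2/3 of the time" rule ties each bookmark back to the IRT scale: placing a bookmark just before an item implies a cut at the Theta where that item's ICC crosses 2/3. A sketch, using made-up item parameters:

```python
import math

def theta_at_rp(a, b, c, p=2.0/3.0):
    """Invert the 3PL ICC: the ability (Theta) at which a student
    answers this item correctly with probability p. Here p = 2/3,
    matching the "correct 2/3 of the time" bookmark criterion.
    Requires p > c."""
    return b + (1.0 / a) * math.log((p - c) / (1.0 - p))

# Hypothetical parameters for the first item past a bookmark; the
# implied cut score is the Theta at which that item is answered
# correctly 2/3 of the time.
cut = theta_at_rp(a=1.0, b=0.8, c=0.2)

# Sanity check: plugging the cut back into the 3PL curve recovers 2/3.
p_check = 0.2 + (1.0 - 0.2) / (1.0 + math.exp(-1.0 * (cut - 0.8)))
```

One consequence worth noticing: because 2/3 sits above the curve's midpoint (which is c + (1 - c)/2), the implied cut lands above the item's "b" difficulty, so the criterion is stricter than "more likely than not to answer correctly".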

And so, we each went through our binders, set our bookmarks, and gave the results to the psychometrician from Measured Progress. Our bookmarks were averaged, and the average placement of the bookmarks was presented back to us for group discussion. We talked about where we placed our bookmarks individually and why we placed them there - some people were much above the average, some much below, and others very near or at the group's average placement. Conversation revealed that some panelists did not fully understand the instructions on where to place the bookmarks (if I recall correctly, most of the confusion was due to the instructions about only the students at the next-higher achievement level being able to get the question correct 2/3 of the time). Conversation also helped many panelists to re-evaluate the placement of their bookmarks based on question characteristics that had not been considered in the first round. We were then given the opportunity to place our bookmarks a second time, and were told that these results (which were not shared with us) would be passed on to the Maine DOE as the panel's recommendation for cut scores for categorizing student achievement.
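The round-one aggregation step is simple enough to sketch. The page numbers below are invented, not the panel's actual placements; the point is just how individual bookmarks become the average presented back for discussion.

```python
# Round-one bookmark placements (page numbers in the ordered item
# booklet) from a hypothetical six-person panel, one list per cut.
bookmarks = {
    "Partially Meets": [12, 14, 13, 15, 12, 14],
    "Meets":           [27, 30, 26, 29, 31, 27],
    "Exceeds":         [41, 44, 40, 43, 45, 42],
}

# The average placement presented back to the panel for discussion;
# the spread within each list is what drove the "much above / much
# below the average" conversations.
averages = {cut: sum(pages) / len(pages) for cut, pages in bookmarks.items()}
```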

During one of our breaks on the second day, when we were working on the bookmarking task, another panelist I was talking with asked if I had ever read any Richard Feynman, particularly his essay called "Cargo Cult Science". Although I'd heard of Feynman before, I replied that I hadn't read any of his work - the panelist described it to me as pertaining to the distinction between science and pseudoscience, and shared with me his feeling that our attempts to measure and set standards for student knowledge felt a lot like what Feynman was describing in that essay. At the time, I felt a bit of disagreement - although I know that measuring knowledge of standards via any assessment is bound to have flaws, I don't think it's pseudoscience. I've since read Feynman's essay, and understand more about his distinction between science and pseudoscience, which helps me better understand my fellow panelist's remark -- I think it is captured by this quote: "In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another."

Feynman's essay goes on to discuss the obligation of scientists to other scientists, as well as to non-scientists. I particularly agree with the responsibility of scientists to "bend over backwards" to demonstrate how they could be wrong, particularly when the complexity of the problem and/or the solution make it likely that the non-expert / non-scientist will believe the scientist's conclusion simply because they can't understand or evaluate the work on their own. My experience in working on this standard-setting panel provided me with invaluable insight into a complex process whose results have significant implications. Even our panel of experienced science educators struggled to understand the complexity of the standard setting process that we implemented, and the full underlying complexity of the entire process (i.e., Item Response Theory and the bookmark method) was not explained. Given that there could be significant differences in the KSAs associated with each achievement level depending on the composition of the panel, and given that the underlying complexity of the task is significant, I think it is accurate to label this work as "cargo cult science" because only the results are shared with a broad audience. I don't think that the task of measuring knowledge with "pencil and paper" assessment is inherently pseudoscience - but we ultimately do a disservice to the potential for making education more scientific when the full scope of this type of work is not published.


  1. Is IRT a Cargo Cult? Maybe so. It seems to make a very simple -- and unjustified -- assumption about the nature of knowledge and learning. Explicitly, they seem to assume that there is a 1D variable describing "how well you know this." That fails to take into account the structuring of knowledge -- context dependence, framings (such as "playing the testing game"), and so on. I am very leery of large-scale multiple-choice testing. It's like doing a "gold-standard" medical study on the efficacy of willow bark in preventing headaches. If we haven't identified the relevant variables, we'll have GIGO, no matter how extensive the stats.

  2. Joe - my reply is many months tardy, but I'm hopeful you'll see it. I think your point is excellent that inferring knowledge from the results of large-scale multiple-choice standardized tests rests on a lot of assumptions, many of which are at the very least oversimplified, and at worst, just plain wrong. I suspect that there are a variety of "real world" reasons forcing these assumptions, not the least of which include the costs of administering and grading a different type of examination. I'm not saying that multiple choice exams give us no information about what students have learned, but I do think that the inherent limitations should be much more of a factor in using the results to make choices about, say, funding schools.