PART III: MEASUREMENT VALIDITY AND RELIABILITY
As I wrote yesterday, Marzano and his research team had a “dependent variable problem.” That is, there was no single, comparable measure of “student achievement” (his stated outcome of interest) that they could use as a dependent variable across all participants. I should note that they were forced into this problem by choosing a lazy research design. A tighter, more focused design could have alleviated this problem (more on that in Part V).
Let’s briefly revisit their research methods. Marzano’s research team asked 79 teachers to teach one unit with the IWB and one without it. In secondary schools, that meant two separate classes of students. In elementary schools, it meant teaching two “similar” units, one using the IWB and one without it. Students were to be given a pretest before each unit and a posttest after it. For the elementary teachers, that actually meant giving four tests: a pretest and a posttest for each of the two “similar” units. [NOTE: I’m putting quotation marks around “similar” because this raises a question of research ethics for me. Absent this study, would the elementary teachers really have taught two units on the same topic? One golden rule of educational research for me is that research should never drive pedagogical decisions.]
In the instructions to participating teachers, Marzano and his team wrote:
To be involved in a study you must be willing to do a few things. First you should select a specific unit of instruction, or set of related lessons on a single topic (hereinafter referred to as unit) and design a pretest and posttest for that unit. It is best if the unit is relatively short in nature. For example, if you teach mathematics, you might select a two week unit on linear equations. At the beginning of the unit, you would administer a pretest on linear equations. Then at the end of the unit you would administer a posttest. This test could be identical to the pretest, or it could be different. The important point is that you have a pretest and a posttest score for each student on the topic of linear equations. Ideally the pretest and posttest are comprehensive in nature.
Then, later, they write:
Finally both pretest and posttest scores should be translated to a percentage format. For example, if your pretest involves 20 points and a particular student receives a score of 15, then translate the 15 into a percentage of 75% (i.e., 15/20 = .75 x 100 = 75%) and record that as the pretest score for the student. If your posttest involves 80 points and that same student receives a score of 75, then translate the 75 into a percentage of 94% (75/80 = .94 x 100 = 94%) and record that as the student’s posttest score. The same procedure would be employed if you used a rubric. For example, if a student received a 2 on a 4 point rubric on the pretest, this score would be translated to a percentage of 50% (2/4 = .50 x 100) and this would be recorded as the student’s pretest score. The same translation would be done on the student’s rubric score for the posttest.
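The conversion they describe is trivial to sketch in code. One detail worth noticing: in their own example, 75/80 is actually 93.75%, which they round up to 94%. A minimal illustration (the function name is mine, not theirs):

```python
def to_percentage(raw_score, max_points):
    """Convert a raw test or rubric score to a whole-number percentage,
    as Marzano's instructions to teachers describe."""
    return round(raw_score / max_points * 100)

# The examples from the instructions:
print(to_percentage(15, 20))  # pretest: 15 of 20 points -> 75
print(to_percentage(75, 80))  # posttest: 75 of 80 points -> 94 (93.75, rounded)
print(to_percentage(2, 4))    # rubric: 2 on a 4-point rubric -> 50
```

Note that this conversion makes scores from entirely different instruments *look* comparable without making them actually comparable, which is the heart of the problem discussed below.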
Those posttest scores became the dependent variable in each of the 85 separate studies (the pretest scores were used as covariates). In other words, the measure of student achievement in each of the 85 studies is “percent correct (or percent proficient) on a teacher-created test covering a single unit.” We can quibble about how best to define and operationalize student achievement, but that measure is unlikely to satisfy any legitimate educator’s conception of it. What matters most here, though, is the trustworthiness of the actual measure(s) used.
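Mechanically, using the posttest as the dependent variable with the pretest as a covariate amounts to an ANCOVA-style regression. Here is a bare-bones sketch of that model for one hypothetical teacher’s “study” (all the numbers are invented, purely to show the structure of the analysis, not to reproduce anything Marzano’s team did):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30  # invented class size

# Invented percentage scores for one hypothetical "study":
pretest = rng.uniform(30, 70, n)            # covariate
group = np.repeat([0, 1], n // 2)           # 0 = non-IWB unit, 1 = IWB unit
posttest = 20 + 0.7 * pretest + 5 * group + rng.normal(0, 8, n)

# ANCOVA as a linear regression: posttest ~ intercept + pretest + group
X = np.column_stack([np.ones(n), pretest, group])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
intercept, pretest_coef, group_effect = beta
print(f"adjusted treatment effect: {group_effect:.1f} percentage points")
```

However sophisticated the model, though, its dependent variable is still a teacher-made, unvalidated percentage, and no amount of statistical machinery downstream can fix that.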
In the field of measurement, trustworthiness is operationalized in terms of validity and reliability. In the most general terms, validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?). I won’t write a whole treatise here about measurement validity and reliability; that would be a waste of our time (especially in an era where information is not scarce and you can find credible and accessible descriptions in places like this). Suffice it to say that none of the measures of student achievement used by the participating teachers is either valid or reliable.
To repeat what I wrote yesterday: meta-analysis is a powerful technique for combining the results of many studies, each of which is fully reported and each of which was selected because its full report allowed the analyst to judge its trustworthiness. Because we do not have full descriptions of any of the 85 studies, we do not know what the dependent variable was in any of them; we only know that each teacher made up a test and converted the scores to percentages. In other words, as best we can tell, every one of the 85 studies included in the meta-analysis suffers from a serious lack of measurement validity and reliability. That renders the whole meta-analysis invalid.
NOTA BENE: I know this is more an issue of research design (the topic of Part II) than of measurement (the topic of Part III), but I needed to add this bit of information. At the end of the letter to the participating teachers, Marzano and his team wrote, “Thank you again for considering involvement in an action research project.” Additionally, in a blog post about this study, Sonny Magana wrote, “Over the past academic year, Dr. Robert Marzano conducted a much-anticipated meta-analysis of numerous action research studies on the direct effect of Promethean’s transformational technologies on academic achievement.” I’ll spare you a detailed description of action research (I’m by no means an expert on it), but I know enough to state with total confidence that what the teachers did FOR Marzano and his team was NOT action research. Action research is not simply teachers collecting data in their own classrooms; it is a much more complicated and sophisticated process. For me, the misuse of the phrase “action research” calls into question the credibility of this study as a whole.
THURSDAY – Part IV: Internal validity issues
FRIDAY – Part V: Summary and recommendations