[NOTE: this is the final post in a series of posts about a report recently issued based on a study done by Marzano Research Laboratory. Part I is here, Part II is here, Part III is here, and Part IV is here.]
PART V: SUMMARY AND RECOMMENDATIONS
[NOTE #2: I know, I know...I'm a couple of days late on this one. Sorry.]
Before I sum up and conclude, I should point out one other major flaw in this study. Marzano and his team use percentile ranks incorrectly. On page 18 of the report, they write: “Of particular interest is the column entitled ‘% Gain.’ Again, this column contains the percentile gain (or loss) in achievement associated with the treatment (i.e., use of Promethean technology).” Two problems here. First, percentiles are not the same as percentages (or % as it is written in the report). Second, they then go on to write: “This value [the percentile gain] was determined by consulting a normal curve table for the area for each reported effect size.” This would be fine if the scores on the dependent variables were normally distributed, which they most definitely are not. For Marzano to go around saying that incorporating Promethean IWBs into instruction will improve student achievement by 17 percentiles is wrong on lots of levels.
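To make the normal-curve conversion concrete, here is a minimal Python sketch of how an effect size is typically turned into a "percentile gain." The function name and the effect size of 0.44 are my own assumptions for illustration (0.44 is used only because it maps to roughly 17 percentile points; it is not a figure taken from the report). Note that the whole calculation rests on the normality assumption criticized above.

```python
from statistics import NormalDist


def percentile_gain(effect_size: float) -> float:
    """Percentile-point gain for a student starting at the 50th percentile.

    Assumes scores are normally distributed, which is exactly the
    assumption in question. Hypothetical helper, not from the report.
    """
    # Area under the standard normal curve up to the effect size,
    # minus the 50% the average control student already has.
    return (NormalDist().cdf(effect_size) - 0.5) * 100


# Hypothetical effect size, chosen only for illustration.
gain = percentile_gain(0.44)
print(round(gain, 1))
```

If the underlying scores are skewed or bunched at a ceiling (as percent-correct scores on short classroom tests often are), the number this produces does not describe where students actually fall in the distribution.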
It should be clear by now that if I were reviewing the Marzano IWB study report as a manuscript submitted for publication in a peer-reviewed journal, I would reject it. I would not even mark it as “revise and resubmit.” The problems with the work are too critical and, in most cases, impossible to fix.
In summary, those problems are:
Those last two points are with respect to each of the 85 classroom-based studies that serve as the basis for the meta-analysis. The ultimate problem, then, is that the hallmark of good meta-analysis is the use of strong criteria as decision points for including individual studies.
As a point of comparison, I’m linking to two reviews of research. Each is described as having used “best-evidence synthesis” which very closely resembles meta-analysis. The methods used in the studies reported in the articles below are also consistent with those used by the What Works Clearinghouse.
Effective Programs in Elementary Mathematics: A Best-Evidence Synthesis
Effective Reading Programs for Middle and High Schools: A Best-Evidence Synthesis
In the first article, you’ll notice on the seventh page of the document (p. 432 of the article) a list of criteria for inclusion. The authors of those articles also provide a list of studies that were considered for inclusion but that were ultimately excluded along with the reasons for exclusion. This combined approach is critical; it gives the consumer of the research confidence that the data used in the meta-analysis come from many solid studies.
The impact of sample size for any given study included in a meta-analysis is another important point raised in the articles above. According to the authors of the second article, “[p]revious research (e.g., Rothstein et al., 2005; Slavin, 2008; Sterne, Gavaghan, & Egger, 2000; Taylor & Tweedie, 1998) has shown that studies with small sample sizes report larger effect sizes than studies with large samples.” As a result, in their meta-analysis, the authors weight the individual findings by sample size. In each of the separate sites/studies used by Marzano and his team in their meta-analysis, sample sizes were tiny. Consider, for example, site #34, teacher #57, where there were 9 students in the control group and 5 in the treatment group. There is no way that study gets included in any decent meta-analysis.
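A rough sketch of why weighting by sample size matters, in Python. All of the effect sizes and two of the sample sizes below are hypothetical; only the 14-student study (9 control + 5 treatment) mirrors the site mentioned above, and its effect size here is invented. This shows simple sample-size weighting as described in the quoted article, not necessarily the exact weighting scheme any particular meta-analysis used.

```python
def weighted_mean_effect(studies):
    """Sample-size-weighted mean of effect sizes.

    `studies` is a list of (effect_size, sample_size) pairs.
    """
    total_n = sum(n for _, n in studies)
    return sum(d * n for d, n in studies) / total_n


# Hypothetical data: one tiny study with a large effect,
# two larger studies with small effects.
studies = [(0.90, 14), (0.10, 400), (0.20, 250)]

unweighted = sum(d for d, _ in studies) / len(studies)
weighted = weighted_mean_effect(studies)

print(f"unweighted mean effect: {unweighted:.2f}")
print(f"weighted mean effect:   {weighted:.2f}")
```

With numbers like these, the tiny study pulls the unweighted average up sharply, while the weighted average stays close to what the larger studies found.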
There is a bit of irony in my choice of articles to post as exemplars. The lead author in each of those studies is Dr. Robert Slavin, the developer/founder of Success for All. Slavin has been frequently critiqued for being the lead researcher/analyst/author on many evaluation studies of Success for All, the program that he created. In other words, he has been accused of producing biased research. I don’t know enough to say if his research is biased or not; it’s certainly legitimate though to raise the question of bias where he is involved in the research. What I do know, though, is that each of the articles appears in one of the most well-respected, highly selective peer-reviewed journals. The math study appears in the Review of Educational Research which is dedicated to only publishing exquisite and top-notch reviews, syntheses, and meta-analyses in education. Thus, there is good reason to believe that those two articles present exemplars of how meta-analysis type research should be done.
I wrote earlier that doing good, comprehensive program evaluation in education is difficult and resource-intensive. That said, I believe it would actually be reasonably easy to evaluate the impact of IWBs on student achievement. In this era of standards and accountability, in any given state, we have year-to-year state test scores (at least in math and reading/language arts) from grades 3 to 8. So, Marzano’s team could have focused on one or two grade levels in one or two subject areas in one state.
Let’s say they focused on 8th grade student achievement. All they needed to do was to find about 20 middle schools that were willing to participate. In those 20 schools, there would be one subject-area teacher teaching in a classroom with the IWB and one teacher teaching a comparable class (NOTE: comparable here refers to students who are demographically similar and who are no different with respect to student achievement at baseline) without the IWB. Surely there are at least 20 middle schools in any state where there are two 8th grade teachers teaching comparable classes.
A common way to get schools and teachers to participate in such a study would be to offer an incentive. For Promethean, the promise of a free IWB to the teacher/classroom in the control condition the year after the study would be a wonderful incentive. Given this sampling framework, Marzano’s research team could work with the schools, districts or state departments to get student achievement data on the students in those 40 classrooms (20 treatment + 20 control). This could easily be done without violating any privacy laws. The students’ scores on the 7th grade state exams could serve as the pretest or the covariate. Their scores at the end of the 8th grade year would be the dependent variables. Over 40 classrooms, we’d be talking about a sample size of well over 800, with well over 400 students in each condition. Such a study would have lots of power. Analytic decisions would have to be made with respect to the unit of analysis. Marzano and his team could use the classroom as the level of analysis and conduct a matched-pairs statistical test. Or, they could use the student as the unit of analysis and account for the nesting or lack of independence by using multilevel modeling techniques. Either way, this design would be much more appropriate and powerful for estimating the effects of IWBs on student achievement.
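As an illustration of the classroom-level option, here is a minimal Python sketch of a matched-pairs (paired) t statistic, where each pair is the IWB classroom and the comparison classroom from the same school. The function name and the classroom mean scores are hypothetical; a real analysis would use all 20 pairs, adjust for the 7th grade scores, and look up a p-value from the t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(treatment, control):
    """Paired t statistic and degrees of freedom for matched classrooms.

    Each position in the two lists is one school's pair of classroom
    mean scores (IWB classroom, comparison classroom).
    """
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    # Mean within-school difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1


# Hypothetical 8th grade classroom mean scale scores for five schools.
iwb_classes = [652.0, 640.5, 661.2, 648.9, 655.3]
comparison_classes = [649.1, 641.0, 657.8, 645.2, 653.9]

t_stat, df = paired_t(iwb_classes, comparison_classes)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The student-level alternative would need a multilevel model with students nested in classrooms and schools, which is beyond what a short standard-library sketch can show.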
In the last couple of days, I spoke about this series of posts to two professors whom I respect greatly. Interestingly, each one was very surprised to hear my opinion that Marzano was affiliated with sloppy work. One said, “he’s always been so careful.” That may very well be. I don’t intend for this series to be an indictment of Marzano (or even of IWBs). My hope is that I’ve provided a sensible critique of research that is being widely disseminated.
I often lament that decisions in education are too often made in the absence of empirical evidence. I wish policymakers in education would consult research more often. However, if educational decision makers decide to make an investment in interactive white boards, I would strongly urge them to do so for reasons other than the evidence offered by the Marzano Research Labs.
[NOTE: this is the fourth in a series of posts about a report recently issued based on a study done by Marzano Research Laboratory. Part I is here, Part II is here, and Part III is here.]
PART IV: INTERNAL VALIDITY
“Internal Validity is the approximate truth about inferences regarding cause-effect or causal relationships” (Trochim, 2006).
I consider research questions as one of three types: descriptive, relational and causal. From there, certain research designs lend themselves to best answering the research questions. For example, naturalistic inquiry methods such as those used in ethnographies or case studies are suited only to answer questions of description.
Marzano and his team set out mainly to answer a question of causality: does use of the IWBs cause improvement in student achievement? How they chose to do that is less than ideal, but we know that was their intention based on the use of quasi-experimental designs in each of the 85 classroom-based studies. Experiments and quasi-experiments are designs intended to address questions of causality.
When such designs are employed, the primary consideration of trustworthiness is internal validity. “All that internal validity means is that you have evidence that what you did in the study (i.e., the program) caused what you observed (i.e., the outcome) to happen” (Trochim, 2006). In any effort to prove causation, one key is to be able to rule out alternate explanations. In other words, alternate causes are threats to internal validity.
There are any number of ways to think about possible threats to internal validity, and there are many ways to describe them. There are at least a couple of general threats to internal validity for each of the 85 classroom-based studies that Marzano and his team used for his meta-analysis. The threats result mainly from the fact that, at least in the secondary schools, the same teacher taught the treatment and the control groups. At one level, that might seem like a benefit in that it eliminates teacher-level confounds. However, consider:
Social threats to internal validity – also known as or related to intervention or exposure bias. The students in the control group are presumably in the same classroom (at a different time) as the students in the treatment group. They see the IWBs. They also probably hear about the teachers using the IWBs from their friends in the treatment class. Might there be some compensatory rivalry and/or resentful demoralization? Trochim (2006) defines the latter:
Here, students in the comparison group know what the program group is getting. But here, instead of developing a rivalry, they get discouraged or angry and they give up (sometimes referred to as the “screw you” effect!). Unlike the previous two threats, this one is likely to exaggerate posttest differences between groups, making your program look even more effective than it actually is.
Experimenter’s bias – it is impossible to tell from the report all that the teachers knew and/or were told about the study. We know that in the instructions to the teachers, it says “Thank you for agreeing to participate in an action research [sic.] study regarding the effectiveness and utility of the Promethean technology in your classroom.” So, they knew what Marzano and his team were looking at/for. This knowledge could easily cause the teacher to pay more attention to her/his teaching in the treatment class. Additionally, just the fact that the teacher had to take an existing unit and figure out how to integrate the IWB technology means that the teacher was biased toward that group (i.e. she/he was more planful about that teaching).
Earlier, I wrote that I’d never seen meta-analysis included purposefully as an a priori part of a separate research design. Frankly, I’ve never seen or heard of a quasi-experiment in education where the (non-random) selection is at the classroom level and yet the treatment and control classes still have the same teacher. I could be wrong here, but in summary, this approach raises a number of general threats to internal validity for each of the 85 individual studies upon which the meta-analysis was based.
COMING NEXT:
FRIDAY – Part V: Summary and recommendations
[NOTE: this is the third in a series of posts about a report recently issued based on a study done by Marzano Research Laboratory. Part I is here and Part II is here.]
PART III: MEASUREMENT VALIDITY AND RELIABILITY
As I wrote yesterday, Marzano and his research team had a “dependent variable problem.” That is, there was no single, comparable measure of “student achievement” (his stated outcome of interest) that they could use as a dependent variable across all participants. I should note that they were forced into this problem by choosing a lazy research design. A tighter, more focused design could have alleviated this problem (more on that in Part V).
Let’s visit their research methods briefly. Marzano’s research team asked 79 teachers to teach a unit with the IWB and one without it. In secondary schools, that meant two separate classes of students. In elementary schools, that meant teaching two “similar” units, one using the IWB and one without using it. Students were to be given a pretest before the unit and a posttest after the unit. For the elementary teachers, that actually meant giving four tests: a pretest and a posttest for each of the two “similar” units [NOTE: I'm putting quotation marks around "similar" because this raises a question of research ethics for me. But for this study, would the elementary teachers really teach two units on the same topic? One golden rule of educational research for me is that research should never drive pedagogical decisions.]
In the instructions to participating teachers, Marzano and his team wrote:
To be involved in a study you must be willing to do a few things. First you should select a specific unit of instruction, or set of related lessons on a single topic (hereinafter referred to as unit) and design a pretest and posttest for that unit. It is best if the unit is relatively short in nature. For example, if you teach mathematics, you might select a two week unit on linear equations. At the beginning of the unit, you would administer a pretest on linear equations. Then at the end of the unit you would administer a posttest. This test could be identical to the pretest, or it could be different. The important point is that you have a pretest and a posttest score for each student on the topic of linear equations. Ideally the pretest and posttest are comprehensive in nature.
Then, later, they write:
Finally both pretest and posttest scores should be translated to a percentage format. For example, if your pretest involves 20 points and a particular student receives a score of 15, then translate the 15 into a percentage of 75% (i.e., 15/20 = .75 x 100 = 75%) and record that as the pretest score for the student. If your posttest involves 80 points and that same student receives a score of 75, then translate the 75 into a percentage of 94% (75/80 = .94 x 100 = 94%) and record that as the student’s posttest score. The same procedure would be employed if you used a rubric. For example, if a student received a 2 on a 4 point rubric on the pretest, this score would be translated to a percentage of 50% (2/4 = .50 x 100) and this would be recorded as the student’s pretest score. The same translation would be done on the student’s rubric score for the posttest.
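The conversion the teachers were asked to do is simple arithmetic; a small Python sketch reproduces the three examples from the instructions. The function name is my own, and rounding to the nearest whole percent is an assumption based on the quoted 94% figure (75/80 is 93.75%).

```python
def to_percentage(score: float, max_points: float) -> int:
    """Convert a raw test or rubric score to a whole-number percentage."""
    return round(score / max_points * 100)


print(to_percentage(15, 20))  # pretest example: 15 of 20 points
print(to_percentage(75, 80))  # posttest example: 75 of 80 points
print(to_percentage(2, 4))    # rubric example: 2 on a 4-point rubric
```

Notice that a 20-point test, an 80-point test, and a 4-point rubric all end up on the same 0 to 100 scale, even though nothing in the conversion makes them comparable measures.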
Those posttest scores became the dependent variable in each of the 85 separate studies (the pretest scores were used as covariates). In other words, the measure of student achievement in each of the 85 studies is “% correct or % proficient on a teacher created test of a single unit.” We can quibble about how to best define and operationalize student achievement, but that measure is unlikely to satisfy any legitimate educator’s conception of student achievement. Furthermore, what matters most here is the trustworthiness of the actual measure(s) used.
In the field of measurement, trustworthiness is operationalized in terms of validity and reliability. In the most general terms, validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?). I won’t write a whole treatise here about measurement validity and reliability; that would be a waste of our time (especially in an era where information is not scarce and you can find credible and accessible descriptions in places like this). Suffice it to say that none of the measures of student achievement used by any of the teachers who participated in the study are either valid or reliable.
To repeat what I wrote yesterday about meta-analysis, it is a powerful technique for combining the results of lots of studies, each of which is fully reported and each of which was selected for the meta-analysis because the full report of the study allowed the analyst to determine its trustworthiness. Because we do not have full descriptions of each of the 85 studies, we do not know what the dependent variable was in any of the studies; we just know that the teacher made up a test and converted the score to a percentage. In other words, as best we can tell, each of the 85 studies included in the meta-analysis suffers from a serious lack of measurement validity and reliability. That renders the whole meta-analysis invalid.
[NOTA BENE: I know this is more of an issue of research design (the topic of Part II) than a measurement issue (the topic of Part III), but I needed to add this bit of information. At the end of the letter to the participant teachers, Marzano and his team wrote “Thank you again for considering involvement in an action research project.” Additionally, in a blog post about this study, Sonny Magana wrote, “Over the past academic year, Dr. Robert Marzano conducted a much-anticipated meta-analysis of numerous action research studies on the direct effect of Promethean’s transformational technologies on academic achievement.” Again, I’ll spare you the detailed description of action research, and I’m by no means an expert on action research. But, I know enough to state with total confidence that what the teachers did FOR Marzano and his team was NOT action research. Action research is NOT simply teachers collecting data in their own classrooms; it is a much more complicated and sophisticated process. For me, the misuse of the phrase “action research” calls into question the credibility of this study as a whole.]
COMING NEXT:
THURSDAY – Part IV: Internal validity issues
FRIDAY – Part V: Summary and recommendations
[NOTE: this is the second in a series of posts about a report recently issued based on a study done by Marzano Research Laboratory. Part I is here.]
PART II: RESEARCH DESIGN ISSUES
From a research design perspective, this study (or collection of studies?) is best described as unusual. In fact, what Marzano’s research team tells us is that they conducted 85 separate small studies and then “synthesized” the results of those studies through a meta-analysis (a set of complex statistical analyses used to examine “effects” across multiple studies). Meta-analyses are not unusual, and are quite helpful as a way of combining the results across lots of studies. However, it IS very unusual to use meta-analysis as an a priori technique built into an evaluation. Frankly, I’ve never seen it before. Meta-analysis is more typically used when there is a mature body of research within a topic area composed of studies by multiple researchers across a number of years. Furthermore, meta-analyses are typically done using lots of studies, each of which is fully reported and each of which was selected for the meta-analysis because the full report of the study allowed the analyst to determine its trustworthiness. That, to me, is one of the biggest problems with this study/analysis; we don’t know enough about each of the 85 individual studies.
Is Marzano’s meta-analytic approach “wrong?” Not necessarily, but to me, it’s indicative of a certain laziness. High-quality evaluation research in education is complicated and costly. It requires a ton of coordination and planning, especially to make sure that key data are high-quality and comparable.
Ultimately, Marzano got 79 teachers to agree to “participate” and got data on over 2,700 students. That’s commendable. But those teachers varied by grade level taught and subject taught (other than the elementary teachers; more on them later). They also taught in different states. So, he had a huge “dependent variable” problem. In other words, there was no single, comparable measure of “student achievement” (his stated outcome of interest). He needed a way to account for that and chose to deal with it by way of analytic techniques (i.e. meta-analysis), rather than by focusing the study (perhaps within a single state within one or two grade levels). He also chose a lazy way to get data on student achievement (more on that in Part III).
COMING NEXT:
WEDNESDAY – Part III: Construct validity and reliability issues
THURSDAY – Part IV: Internal validity issues
FRIDAY – Part V: Summary and recommendations
Two of the more well-known brand names in education recently combined forces. Robert J. Marzano (wildly popular consultant/author/speaker) produced a report of a study he conducted of Promethean ActivClassroom (wildly popular interactive white board (IWB) technology).
The report has received lots of publicity; I have seen multiple references to it on Twitter and elsewhere. You can get a copy from Promethean, but only by first providing them with lots of contact information here. I wasn’t willing to make that exchange, but the day after discussing the report on Twitter, Sonny Magana, the Director of Education Strategy at Promethean, Inc. was kind enough to e-mail me a copy of the report.
Marzano’s work has not yet been formally reviewed by any “peers” (at least as far as I can tell). While I am very critical of the way peer-review is typically conceived and carried out in academia, there is real value in the process. Therefore, I’m using this space to do just that. This first post is a bit of an introduction. In subsequent posts, I’ll address methodological and analytical issues.
In this first post, I’ll try to do two things simultaneously: address a key criticism and establish some semblance of credibility as a reviewer. The report states that it was prepared by the Marzano Research Laboratory for Promethean, Ltd. That undoubtedly means that Promethean funded a study of their own product(s). Such an arrangement, which certainly gives the appearance of a lack of objectivity, is not unprecedented and not even unusual. I should know; I’ve done plenty of evaluation research as a “third-party, independent” evaluator funded by vendors. For the better part of ten years, I was part of a research team that conducted evaluation research funded by private vendors such as Lightspan [since purchased by Plato Learning], Scholastic, eChalk, Jostens [sic.], etc., to study their own products/programs. Based on my experiences, I can state confidently that those sorts of arrangements should be viewed with skepticism and examined critically. I stand by much of the work I did and would defend the work against any critique. However, there were certainly instances where the vendor/funding source “influenced” the contents of the final report. More often, the final report was written in a way that would be most palatable to the client.
[NOTE: of the privately funded evaluation research I was a part of, I've only been able to find one report that is publicly available. This report of a large-scale evaluation of Scholastic's READ 180 (funded by Scholastic) happens to be one by which I swear. There are a number of reasons why this study is credible, but the most important factor is that the main stakeholder was the Council of Great City Schools and not Scholastic].
Ultimately, without being present at the initial negotiations between the parties and without being privy to conversations between the researcher(s) and the client, it is hard to know how “objective” or “honest” a research report is when the study is of a product/program and the study is funded by the vendor of said product/program. The best we can do is to (peer-)review these sorts of reports against the standards of educational research. Onward then…