School Psychologists as Consumers of Research: What School Psychologists Need to Know About Analysis of Variance
By Stephen P. Kilgus
Volume 44 Issue 5
One of the most common questions asked within school psychology research concerns the relationship between an intervention (independent variable) and an outcome (dependent variable). For example, say a researcher is interested in the effect of a cognitive–behavioral intervention (independent variable) on student depression (dependent variable). They evaluate this relationship by comparing depression levels across three groups, including those who received the intervention in a one-on-one format, those who received it in a small-group format, and those who received “treatment-as-usual” with no counseling. A second researcher wants to determine whether a Tier 2 small-group reading intervention (independent variable) demonstrates a positive effect on student reading comprehension (dependent variable). They evaluate this relationship with a single group of students by comparing preintervention comprehension levels to postintervention comprehension levels.
For several decades, the most common approach to evaluating these causal questions has been through analysis of variance (ANOVA). ANOVA represents a family of analyses that might be used to determine if mean scores on some dependent variable differ across multiple time points and/or multiple groups. Different ANOVA-related statistics can be calculated in evaluating the extent of this difference in terms of both statistical and practical significance.
When ANOVA Should Be Used
ANOVA is rather flexible in that it can accommodate a variety of research designs and corresponding research questions. Below is a brief description of six common ANOVA models, and Table 1 includes a summary of those models.
Table 1 Review of Analysis of Variance (ANOVA) Models and Corresponding Examples
|DESIGN||INDEPENDENT VARIABLE||DEPENDENT VARIABLE|
|One-way ANOVA||Example: ADHD treatment, with three levels (Control, Medication, and Behavior Therapy)||Example: Postintervention teacher ratings of ADHD symptoms|
|Multi-way ANOVA||Example: Mathematics instruction, with two levels (Direct instruction and student-centered); Dosage, with two levels (Once a week and twice a week)||Example: Postintervention curriculum-based measurement scores|
|Repeated measures ANOVA||Example: Social skills instruction||Example: Pre- and postintervention observations of positive and negative social interactions|
|Mixed factorial ANOVA||Example: Depression treatment, with two levels (group counseling and individual cognitive–behavioral therapy)||Example: Student self-report ratings of mood|
|Multivariate ANOVA (MANOVA)||Example: Reading intervention, with two levels (fluency strategy and acquisition strategy); Instructor type, with two levels (peer and teacher)||Example: Oral reading fluency (ORF) scores and nonsense word fluency (NWF), assessed at both pre- and postintervention|
|Analysis of covariance (ANCOVA)||Example: ADHD treatment, with three levels (Control, Medication, and Behavior Therapy); Covariate = parent and teacher praise rates||Example: Pre- and postintervention observations of student inattentive and impulsive behaviors|
One-way ANOVA. One-way ANOVA is used to compare mean scores on a dependent variable across three or more levels, usually groups, of a single independent variable at a single time point. A one-way ANOVA can also be used to compare only two levels, in which case the results are equivalent to those of an independent samples t-test.
Multi-way ANOVA. A multi-way ANOVA (e.g., two-way ANOVA or three-way ANOVA) is used to evaluate the effect of two or more independent variables, each of which has two or more levels, on a dependent variable measured at a single time point.
Repeated measures ANOVA. Repeated measures ANOVA is used to evaluate the effect of an independent variable with a single level on a dependent variable measured at two or more time points. In this scenario, time (also referred to as a within-group factor) is of particular interest, with results indicating the extent to which scores have changed across time points.
Mixed factorial ANOVA. Mixed factorial ANOVA is used to examine the effect of one or more independent variables, each of which has two or more levels, on a dependent variable measured at two or more time points.
Multivariate ANOVA. A multivariate ANOVA, which is commonly abbreviated as MANOVA, is used when evaluating two or more dependent variables at the same time. MANOVA can include any of the four prior models, thus permitting simultaneous examination of multiple independent and dependent variables.
Analysis of covariance. Analysis of covariance, or ANCOVA, is an extension of any of the above ANOVA models that also includes additional covariates to remove bias in the dependent variable(s). Including covariates in the model increases the accuracy of conclusions regarding the independent variable. Covariates are those variables that predict the dependent variable despite not being related to the independent variable. By accounting for covariates, one is able to reduce within-group variance (also referred to as error variance), thus enhancing the ability to identify meaningful differences between groups on the dependent variable(s). When ANCOVA includes multiple dependent variables, it is normally referred to as MANCOVA.
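Of the models above, the one-way ANOVA is the simplest to compute by hand, and doing so shows what the F statistic actually compares. The following is a minimal sketch using only the Python standard library. The ratings are hypothetical values invented for illustration, and the function names (`one_way_anova_f`, `pooled_t`) are assumptions rather than part of any published procedure. The sketch stops at the F statistic; obtaining a p value requires the F distribution, which statistical software normally supplies.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def pooled_t(a, b):
    """Independent samples t statistic using the pooled variance."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5

# Hypothetical postintervention teacher ratings of ADHD symptoms (made-up data).
control = [22, 25, 24, 27, 23]
medication = [15, 18, 17, 14, 16]
behavior_therapy = [18, 20, 17, 21, 19]

f_stat, df_b, df_w = one_way_anova_f([control, medication, behavior_therapy])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")

# With only two levels, F equals the square of the pooled-variance t statistic.
f_two, _, _ = one_way_anova_f([control, medication])
t_two = pooled_t(control, medication)
print(f"F = {f_two:.4f}, t squared = {t_two ** 2:.4f}")
```

The last two lines illustrate the equivalence noted for the two-level case: the ANOVA F and the squared t statistic agree when the t-test pools variance across the two groups.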
There are several terms commonly used when describing ANOVA procedures. Some of the most important terms are described below, but readers are referred to Thompson (2006) for a more thorough review of ANOVA and the terms below.
Factors and levels. As noted above, the term factor is used to represent both group variables (i.e., between-group factors, independent variables), as well as time conditions (i.e., within-group factors). The various group and time conditions being compared within these factors are referred to as levels. To illustrate, attention deficit hyperactivity disorder (ADHD) treatment could be a between-group factor that might be of interest to a researcher. The three levels that could comprise the ADHD treatment factor include control, medication, and behavioral therapy. In addition, data collection phase may be a within-group factor that might be of interest, with levels of the factor including pretreatment and posttreatment.
Omnibus effects. A common ANOVA-based research question pertains to the extent to which the levels of a factor differ on some dependent variable. Two different sets of statistics are used to evaluate these questions. The first set of statistics corresponds to omnibus effects, which are used to inform general conclusions regarding the extent to which levels differ. Omnibus effects include both main effects and interaction effects. Main effects are calculated for both between-group and within-group factors. Within our example, a main effect could be calculated for the ADHD treatment between-group effect, with results indicating the extent to which the treatment conditions differ (averaged across the two time points). A main effect could also be calculated for time as the within-group factor, with results indicating the extent to which dependent variable scores change over time (averaged across all three groups). Interaction effects can be calculated to examine the interplay between multiple between-group factors or both between- and within-group factors. Within our example, an effect could be calculated for the interaction between ADHD treatment and time, with results representing the extent to which the difference between pre- and postintervention scores differs across treatment groups.
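The distinction between main effects and an interaction effect can be seen directly in a table of cell means. The sketch below uses hypothetical mean ratings for the ADHD treatment example (the numbers are invented, and equal group sizes are assumed so that simple averages of cell means are appropriate). It shows only which means each omnibus effect compares, not the significance test itself.

```python
from statistics import mean

# Hypothetical cell means: mean inattention rating per treatment group and phase.
cell_means = {
    "control": {"pre": 20.0, "post": 19.0},
    "medication": {"pre": 21.0, "post": 13.0},
    "behavior_therapy": {"pre": 20.0, "post": 15.0},
}

# Main effect of treatment compares each group's mean averaged across both time points.
treatment_marginals = {g: mean(p.values()) for g, p in cell_means.items()}

# Main effect of time compares each phase's mean averaged across all three groups.
time_marginals = {
    phase: mean(cell_means[g][phase] for g in cell_means)
    for phase in ("pre", "post")
}

# The interaction asks whether pre-to-post change differs from group to group.
change_by_group = {g: p["post"] - p["pre"] for g, p in cell_means.items()}

print("Treatment marginal means:", treatment_marginals)
print("Time marginal means:", time_marginals)
print("Pre-to-post change by group:", change_by_group)
```

If the change scores were identical across groups, there would be no interaction in these means, even if both main effects were large.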
Multiple comparisons. One limitation associated with the evaluation of omnibus effects is the lack of specificity in their corresponding conclusions. That is, when results suggest a main or interaction effect is statistically significant, the findings do not specify which of the levels of the independent variable differ from each other on the dependent variable. This is not a concern when only two levels of a factor are being compared; however, the result is ambiguous and thus problematic when three or more levels are compared. In this latter scenario, a statistically significant main effect might indicate that all levels of a factor differ from each other, or that only two levels differ while the others remain similar. As such, a second set of ANOVA-related statistics is calculated via multiple comparison analyses to determine which within- and between-group levels differ from each other. A wide range of multiple comparison tests can be conducted, with the more common including the Bonferroni correction, the Tukey honest significant difference (HSD) test, and the Scheffé test, among others.
Significance. Though a range of statistics are calculated as part of ANOVA, two types are of primary interest in evaluating the aforementioned omnibus effects and post hoc tests. Both statistics indicate the significance of findings, but each corresponds to a different form of significance. The first statistic is the p value, which indicates the statistical significance of findings. Specifically, a p value indicates the probability of obtaining sample results at least as extreme as those observed, assuming the sample came from a population wherein the null hypothesis was true (Thompson, 2006). In the case of ANOVA, the null hypothesis specifies that the difference between relevant between- and/or within-group level means is equal to zero. The null hypothesis is typically rejected when a test’s observed p value is less than the critical alpha value (conventionally .05), resulting in the conclusion that the difference between means is not equal to zero (in accordance with the alternative hypothesis). More practically stated, a significant finding means that the results are unlikely to be due to sampling error alone.
Effect sizes are the second set of statistics that indicate the significance of findings within ANOVA, but an effect size represents the practical significance of results. Researchers have developed a wide range of effect size statistics, with the most commonly reported being Cohen’s d and partial eta squared (η²). Effect sizes have gained popularity over the years in recognition of (a) the limitations associated with p values and null hypothesis statistical significance testing and (b) the potential for effect sizes to inform inferences regarding the magnitude of effects (Wilkinson & APA Task Force on Statistical Inference, 1999). Rather than simply indicating that means are different from each other (as is the case with p values), effect sizes reflect the extent to which means actually differ. As such, when used within intervention effectiveness research, effect sizes speak to the actual change in an individual’s behavior that one might anticipate when applying an intervention. Furthermore, effect sizes are commonly expressed in the form of a standardized statistic that removes the influence of sample size (which p values do not), permitting their aggregation across studies in evaluating the consistency of effects, as is commonly done in meta-analytic research. Though no single set of recommended effect size interpretive guidelines is likely to be appropriate across all investigations (McGrath & Meyer, 2006), researchers commonly recommend that Cohen’s d statistics be considered small when greater than .20, medium when greater than .50, and large when greater than .80 (Cohen, 1988). Similarly, it has been suggested that partial η² statistics be considered small when greater than .01, medium when greater than .06, and large when greater than .14 (Richardson, 2011).
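Both effect sizes can be computed from raw scores in a few lines. The following is a minimal sketch, assuming two independent groups and a pooled standard deviation for Cohen’s d, and a one-way design for eta squared (in a one-way design, partial η² and η² are the same quantity). The satisfaction ratings, the function names, and the "below small" label are hypothetical choices made for illustration; the cutoffs are the ones cited above.

```python
from statistics import mean

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_sd = ((ss_a + ss_b) / (len(a) + len(b) - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

def eta_squared(groups):
    """Proportion of total variance attributable to group membership (one-way design)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def label(value, small, medium, large):
    """Attach a conventional verbal label to an effect size magnitude."""
    size = abs(value)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"

# Hypothetical satisfaction ratings (1-10 scale) for two curricula (made-up data).
curriculum_a = [8, 9, 7, 9, 8]
curriculum_b = [4, 5, 3, 5, 4]

d = cohens_d(curriculum_a, curriculum_b)
eta2 = eta_squared([curriculum_a, curriculum_b])
print(f"Cohen's d = {d:.2f} ({label(d, 0.20, 0.50, 0.80)})")
print(f"eta squared = {eta2:.2f} ({label(eta2, 0.01, 0.06, 0.14)})")
```

Taking the absolute value inside `label` means that the sign of d, which depends only on the order in which the two groups are entered, does not affect the verbal label.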
The following example illustrates the interpretation of ANOVA findings. Suppose we are interested in determining with which particular mathematics curriculum students are most satisfied. Furthermore, we are also interested in whether student satisfaction is influenced by the type of instructor providing the curriculum, whether it is a certified classroom teacher or a student teacher. We therefore randomly assign students (n = 40) to receive mathematics curriculum A, B, C, or D. Half of these students are also randomly assigned to instruction provided by a certified classroom teacher, while the others are randomly assigned to instruction delivered by a student teacher. The resulting analysis then represents a two-way ANOVA that includes two between-group factors: (a) a mathematics curriculum factor with four levels (i.e., Curriculum A, B, C, and D); and (b) an instructor type factor, with two levels (i.e., certified classroom teacher and student teacher). Within this analysis, the dependent variable corresponds to student self-report ratings of satisfaction with the curriculum (on a 1–10 scale).
See Table 2 for a summary of example ANOVA omnibus effects findings. A review of the omnibus effects revealed one statistically significant main effect for mathematics curriculum (at the p < .05 level), which corresponded to a large effect size. In contrast, neither the main effect for instructor type nor the interaction between the two factors was statistically significant. This was despite the interaction effect corresponding to a moderate effect size.
Table 2 Example ANOVA Result Including Omnibus Main and Interaction Effects
Taken together, results suggest that although student satisfaction was influenced by mathematics curriculum, it was not influenced by instructor type. Follow-up multiple comparison analyses were then conducted to identify which specific mathematics curriculum groups differed from each other in satisfaction ratings (while collapsing across instructor types; see Table 3). Statistically significant differences and large effects were found between all pairs of curricula, with the exception of a statistically nonsignificant difference (and moderate effect) between Curricula B and C. Further review of these findings suggests that satisfaction with Curricula A and D was higher than satisfaction with Curricula B and C, while satisfaction with Curriculum A was higher than satisfaction with Curriculum D.
Table 3 Example ANOVA Results Including Multiple Comparison Tests
|CURRICULUM COMPARISON||MEAN DIFFERENCE||p||COHEN’S d|
|A vs. B||4.20||<.001||3.99|
|A vs. C||5.00||<.001||4.84|
|A vs. D||1.80||.006||1.94|
|B vs. C||0.80||.393||0.64|
|B vs. D||−2.40||<.001||−2.07|
|C vs. D||−3.20||<.001||−2.80|
When critically evaluating ANOVA-based research, readers should attend to several key issues that bear on the validity of reported findings; that is, the appropriateness of conclusions regarding the relation between independent and dependent variables given the analyses employed (Shadish, Cook, & Campbell, 2002). First, ANOVA is inherently influenced by sample size. One’s ability to identify meaningful effects as statistically significant is enhanced when examining larger samples (Cohen, 1992) and is reduced among smaller samples. ANOVA-based studies should include power analyses that indicate the sample size required to detect an effect of the expected magnitude (American Psychological Association, 2009; Cohen, 1988). A lack of statistically significant findings could be due to a small sample size rather than the absence of an effect.
A second issue of concern is the number of statistical significance tests conducted by a researcher. Each time researchers conduct an ANOVA with a critical alpha of .05, they are accepting a 5% chance of making a Type I error (i.e., rejecting the null hypothesis when it is in fact true). Although this is considered a relatively low probability, when conducting multiple tests at this same critical alpha level, the probability of making a Type I error somewhere across the entire study increases (Thompson, 2006). For instance, if one conducts three independent samples t-tests, each with a critical alpha level of .05, the overall study error rate could be as high as .15 (3 × α = .15, an upper bound; if the three tests are independent of one another, the exact rate is 1 − (1 − .05)^3 ≈ .14). Thus, it is important for researchers to account for this inflation in error rates, such as via the multiple comparison approaches described above (e.g., Bonferroni corrections, Tukey HSD test). Researchers could correct the alpha level by dividing .05 by the number of analyses. For example, a researcher who conducts five ANOVAs for one research question and finds one result significant at the .05 level may well be reporting a spurious finding. Instead, the researcher should use a critical alpha level of .01 (.05 / 5).
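The arithmetic behind error rate inflation and the Bonferroni correction is short enough to check directly. The sketch below assumes the tests are independent of one another, which is what makes the exact formula 1 − (1 − α)^k apply; the function names are hypothetical. The product k × α is shown alongside as the simpler upper bound.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test critical alpha that holds the overall rate at or below alpha."""
    return alpha / n_tests

alpha = 0.05
for k in (1, 3, 5):
    exact = familywise_error_rate(alpha, k)
    upper_bound = k * alpha
    corrected = bonferroni_alpha(alpha, k)
    print(
        f"{k} tests: familywise rate = {exact:.3f} "
        f"(upper bound {upper_bound:.2f}), Bonferroni alpha = {corrected:.4f}"
    )
```

For five tests, the per-test alpha of .01 brings the overall error rate back under .05, which is the purpose of the correction.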
A third issue pertains to the extent to which the data under consideration meet the parametric assumptions upon which ANOVA is founded. Research suggests ANOVA findings are prone to bias when the data being evaluated do not meet certain assumptions, including (a) homogeneity of variance, or similarity in within-group score dispersion across groups; (b) normal distribution of dependent variable scores; and (c) independence of observations, or the lack of relation between dependent variable scores beyond that explained by the independent variable(s). The final assumption is likely to be violated when data are nested, as is the case when collecting scores in relation to multiple students from the same classroom or school. Nested data are best analyzed with multilevel modeling to address ANOVA-relevant research questions while accounting for the dependence of observations (Hox, 2010). Given this potential for bias, it is strongly recommended that investigators report results indicative of whether parametric assumptions are met, thus increasing the confidence a reader might place in the reported findings.
One of the most common research questions within school psychology addresses the effect of some intervention on one or more outcome variables. ANOVA represents one of the most common approaches to evaluating causal research questions that pertain to the relation between independent and dependent variables. ANOVA models are flexible in that they allow researchers to compare multiple groups across one or more points in time, while also accounting for covariates to improve the precision of effect estimates. Consumers of ANOVA-based analyses should consider several key issues (e.g., sample size, parametric assumptions) in determining the appropriateness of ANOVA procedures, and thus the statistical conclusion validity of corresponding results.
Stephen P. Kilgus, PhD, is an assistant professor of school psychology at the University of Missouri.