Multiple Comparisons ("A posteriori" tests)
Bonferroni
Least Significant Difference
Tukey's Method
Scheffe's Test
In ANOVAs, the F test(s) are used to test for real treatment differences. When the null hypothesis is not rejected, it may seem that no further questions need to be asked about the nature of the response. Often, however, this is an oversimplification. For example, if you had six different experimental treatments that did not significantly differ overall, then the hypothesis tested was:
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
Ha: at least one of the means differs from the others
Wouldn't you always wonder (I mean, really, wouldn't it keep you up nights?) whether one of the groups might have differed significantly from the rest, but the difference was lost by being averaged in with all the other comparisons?
One approach would be to make simple t-tests of the comparisons of interest. (A t-test is equivalent to the corresponding F-test when only two groups are compared and no blocking effect is present.) However, it is vital to remember that the greater the number of tests conducted, the greater the probability of making at least one Type I error. Also, t-tests are limited to pairs of samples.
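To see how quickly the error rate inflates, consider a rough sketch (it assumes the tests are independent, which pairwise comparisons on shared data are not, so the figure is only an approximation): the probability of at least one Type I error across m tests, each run at level α, is 1 − (1 − α)^m.

```python
# Familywise Type I error rate if m independent tests are each run at alpha.
# With 6 treatments there are 15 pairwise comparisons; independence is an
# approximation here, since the comparisons share data.
from math import comb

alpha = 0.05
m = comb(6, 2)                       # 15 pairwise comparisons among 6 means
familywise = 1 - (1 - alpha) ** m
print(f"m = {m}, P(at least one false positive) = {familywise:.2f}")  # ~0.54
```

So with six treatments, simply running all fifteen pairwise t-tests at α = 0.05 would give better-than-even odds of at least one spurious "significant" difference.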
Fortunately, a number of tests are available for comparing three or more samples. Many "old-time" scientists do not understand these tests, and if they are applied, the investigator may be accused of "data dredging" (also known as "turd polishing"). Properly applied and interpreted, however, multiple-comparison tests (also known as post-hoc or a posteriori tests) can provide legitimate and important information. As in all statistical analyses, you should report all P-values and allow the reader to draw conclusions as to their validity.
The Bonferroni method controls the overall (familywise) Type I error rate simply by running each of the m planned comparisons at a reduced significance level, α/m. It is easy to compute, but it becomes increasingly conservative as the number of comparisons grows.

The Least Significant Difference (LSD) is the smallest difference between two treatment means that a t-test based on the pooled mean square error would declare significant. It too is easy to compute, but because it makes no adjustment for the number of comparisons, it is often misused, and many statisticians do not recommend it. The most common misuse is to make comparisons suggested by the data, rather than the comparisons originally planned (see "data dredging," above). For example, in the extreme situation where the experimenter compares only the highest and lowest treatment means via a t-test or LSD, the difference is likely to be declared significant even when no treatment effect is present. It can be shown that with 3 treatments, the observed value of t for the greatest difference will exceed the α = 0.05 critical value 13% of the time; with 6 treatments, 40% of the time; with 10 treatments, 60%; and with 20 treatments, 90%.
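As a minimal sketch of the Bonferroni adjustment (the data and group labels here are invented for illustration), each of the m pairwise t-tests is simply judged against α/m instead of α:

```python
# Bonferroni-adjusted pairwise t-tests on hypothetical data.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(1)
groups = [rng.normal(loc, 2.0, size=8) for loc in (10, 10, 13)]  # hypothetical
pairs = list(combinations(range(len(groups)), 2))
m = len(pairs)                       # number of pairwise comparisons
alpha = 0.05

for i, j in pairs:
    t, p = stats.ttest_ind(groups[i], groups[j])
    # Each comparison is tested at alpha / m to hold the familywise rate at alpha.
    flag = "reject H0" if p < alpha / m else "fail to reject"
    print(f"group {i+1} vs {j+1}: p = {p:.4f} (threshold {alpha/m:.4f}) -> {flag}")
```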
Tukey's method is generally recommended when all pairwise comparisons between the population means are of interest and the sample sizes in each group are equal.
To use Tukey's method, the following confidence interval is computed for each pairwise comparison between two population means (call them μi and μj):

(Mean i − Mean j) ± T × sqrt(Mean Square Error)

where T = q / sqrt(n), q is the upper 100(1 − α)% point of the Studentized range distribution with k means and N − k degrees of freedom, n is the number of observations in each group, and N is the number of observations in all groups combined.
In applying Tukey's method to a set of data, the procedure is carried out stepwise: compute the mean square error from the ANOVA, obtain the Studentized range point q, form the allowance T × sqrt(Mean Square Error), and declare any pair of means whose difference exceeds that allowance significantly different, as sketched below.
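Here is one way the stepwise procedure might look in code (a sketch on invented data, using SciPy's studentized_range distribution for q):

```python
# Sketch of Tukey's stepwise procedure for k groups of equal size n.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=2.0, size=5) for m in (10, 11, 14)]  # hypothetical
k = len(groups)                      # number of treatments
n = len(groups[0])                   # observations per group (equal n assumed)
N = k * n                            # total observations

means = [g.mean() for g in groups]
# Pooled mean square error from the one-way ANOVA (within-group SS / (N - k)).
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

alpha = 0.05
q = stats.studentized_range.ppf(1 - alpha, k, N - k)  # upper 100(1-alpha)% point
T = q / np.sqrt(n)
allowance = T * np.sqrt(mse)         # smallest significant pairwise difference

for i, j in combinations(range(k), 2):
    diff = means[i] - means[j]
    verdict = "significant" if abs(diff) > allowance else "not significant"
    print(f"group {i+1} vs {j+1}: diff = {diff:+.2f}, {verdict}")
```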
Scheffe's method is recommended when contrasts other than simple pairwise comparisons are of interest, or when the comparisons of interest were suggested by the data.
Scheffe's method is very general, in that all possible contrasts among the treatment means can be tested for significance, and confidence intervals can be constructed for the corresponding linear functions of the parameters. This means that an infinite number of simultaneous tests could be made (though in practice only a finite number are), with an overall error probability no larger than planned (e.g., α = 0.05). The resulting set of confidence intervals has a joint confidence coefficient at least as large as stated.
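A sketch of a single Scheffe interval (the summary statistics are hypothetical, and the contrast weights c are arbitrary apart from summing to zero):

```python
# Scheffe confidence interval for one contrast among k means.
import numpy as np
from scipy import stats

means = np.array([10.2, 11.5, 14.1])   # hypothetical group means
n = np.array([5, 5, 5])                # group sample sizes
mse = 4.0                              # pooled MSE from the ANOVA (hypothetical)
k, N = len(means), n.sum()
alpha = 0.05

# Contrast comparing group 3 against the average of groups 1 and 2.
c = np.array([-0.5, -0.5, 1.0])
L = c @ means                          # estimated contrast
se = np.sqrt(mse * np.sum(c**2 / n))   # standard error of the contrast

# Scheffe multiplier: S^2 = (k - 1) * F(1 - alpha; k - 1, N - k)
S = np.sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, N - k))
lo, hi = L - S * se, L + S * se
print(f"contrast = {L:.2f}, 95% Scheffe CI = ({lo:.2f}, {hi:.2f})")
# The same multiplier covers every possible contrast simultaneously,
# which is why each individual interval is wide.
```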
Because it protects all possible contrasts simultaneously, Scheffe's test has a large critical value for any single contrast, and its power is consequently low. If only pairwise comparisons are to be made, it is better to use Tukey's test, which produces narrower confidence intervals.
In addition to the above, a large number of other multiple-comparison techniques exist in the literature. They differ with respect to the contrasts they target (i.e., pairwise vs. non-pairwise comparisons), how they handle unequal cell sample sizes, and other properties. The goal of these tests is to minimize the Type II error rate while controlling the Type I error rate.
The reasons for considering other multiple-comparison techniques are to improve both power (the ability to detect true differences, if they exist) and robustness (reduced dependence on meeting assumptions such as homogeneity of variance). Both the Bonferroni and Scheffe methods are reasonably robust, while Tukey's method is less so. Scheffe's method is always the least powerful, since it protects all possible contrasts, not just the pairwise comparisons. The Bonferroni method is commonly presumed to have the lowest power, but this need not be true if the set of comparisons is small and well-planned (one way is to designate one treatment as the control and compare the others only against it).
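One way to see this power ranking is to compare the critical multipliers each method applies to the standard error of a pairwise difference; the design here (6 treatments, 5 replicates) is hypothetical:

```python
# Per-comparison critical multipliers for all pairwise differences among
# k groups of equal size n, illustrating the power ranking discussed above.
import numpy as np
from scipy import stats

k, n = 6, 5                        # hypothetical design: 6 treatments, 5 reps each
N = k * n
m = k * (k - 1) // 2               # number of pairwise comparisons
alpha = 0.05

# Each multiplier is applied to the standard error of a pairwise difference,
# sqrt(MSE * 2 / n), under the corresponding procedure.
bonferroni = stats.t.ppf(1 - alpha / (2 * m), N - k)
tukey = stats.studentized_range.ppf(1 - alpha, k, N - k) / np.sqrt(2)
scheffe = np.sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, N - k))

print(f"Tukey:      {tukey:.3f}")       # smallest: narrowest pairwise intervals
print(f"Bonferroni: {bonferroni:.3f}")
print(f"Scheffe:    {scheffe:.3f}")     # largest: widest intervals, lowest power
```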