Multiple Comparisons ("A posteriori" tests)
Bonferroni
Least Significant Difference
Tukey's Method
Scheffe's Test
In ANOVAs, the F test(s) are used to test for real treatment differences. When the null hypothesis is not rejected, it may seem that no further questions need to be asked about the nature of the response. Often, however, this is an oversimplification. For example, if you had six different experimental treatments that did not significantly differ overall, then the hypothesis tested was:
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
Ha: at least one of the means differs from the others
Wouldn't you always wonder (I mean, really, wouldn't it keep you up nights?) whether one of the groups might have differed significantly from the rest, but the difference was lost by being averaged in with all the other comparisons?
One approach would be to make simple t-tests of the comparisons of interest. (A t-test is equivalent to the corresponding F-test when only two groups are compared and no blocking effect is present.) However, it is vital to remember that the greater the number of tests conducted, the greater the probability of making at least one Type I error. Also, t-tests are limited to pairs of samples.
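To see how quickly the error rate inflates, consider a rough sketch (it assumes the tests are independent, which pairwise comparisons on shared data are not, so the figure is only an approximation): the probability of at least one Type I error across m tests, each run at level α, is 1 − (1 − α)^m.

```python
# Familywise Type I error rate if m independent tests are each run at alpha.
# With 6 treatments there are 15 pairwise comparisons; independence is an
# approximation here, since the comparisons share data.
from math import comb

alpha = 0.05
m = comb(6, 2)                       # 15 pairwise comparisons among 6 means
familywise = 1 - (1 - alpha) ** m
print(f"m = {m}, P(at least one false positive) = {familywise:.2f}")  # ~0.54
```

So with six treatments, simply running all fifteen pairwise t-tests at α = 0.05 would give better-than-even odds of at least one spurious "significant" difference.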
Fortunately, a number of tests are available for comparing three or more samples. Many "old-time" scientists do not understand these tests, and if they are applied, the investigator may be accused of "data dredging" (also known as "turd polishing"). Properly applied and interpreted, however, multiple-comparison tests (also known as post-hoc or a posteriori tests) can provide legitimate and important information. As in all statistical analyses, you should report all P-values and allow the reader to draw conclusions as to their validity.
The Bonferroni method controls the overall (familywise) Type I error rate simply by running each of the m planned comparisons at a reduced significance level, α/m. It is easy to compute, but it becomes increasingly conservative as the number of comparisons grows.

The Least Significant Difference (LSD) is the smallest difference between two treatment means that a t-test based on the pooled mean square error would declare significant. It too is easy to compute, but because it makes no adjustment for the number of comparisons, it is often misused, and many statisticians do not recommend it. The most common misuse is to make comparisons suggested by the data, rather than the comparisons originally planned (see "data dredging," above). For example, in the extreme situation where the experimenter compares only the highest and lowest treatment means via a t-test or LSD, the difference is likely to be declared significant even when no treatment effect is present. It can be shown that with 3 treatments, the observed value of t for the greatest difference will exceed the α = 0.05 critical value 13% of the time; with 6 treatments, 40% of the time; with 10 treatments, 60%; and with 20 treatments, 90%.
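As a minimal sketch of the Bonferroni adjustment (the data and group labels here are invented for illustration), each of the m pairwise t-tests is simply judged against α/m instead of α:

```python
# Bonferroni-adjusted pairwise t-tests on hypothetical data.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(1)
groups = [rng.normal(loc, 2.0, size=8) for loc in (10, 10, 13)]  # hypothetical
pairs = list(combinations(range(len(groups)), 2))
m = len(pairs)                       # number of pairwise comparisons
alpha = 0.05

for i, j in pairs:
    t, p = stats.ttest_ind(groups[i], groups[j])
    # Each comparison is tested at alpha / m to hold the familywise rate at alpha.
    flag = "reject H0" if p < alpha / m else "fail to reject"
    print(f"group {i+1} vs {j+1}: p = {p:.4f} (threshold {alpha/m:.4f}) -> {flag}")
```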
Tukey's method is generally recommended when all pairwise comparisons between the population means are of interest and the sample sizes in each group are equal.
To use Tukey's method, the following confidence interval is computed for each pairwise comparison between two population means (call them μi and μj):

(Mean i − Mean j) ± T × sqrt(Mean Square Error)

where T = q / sqrt(n), q is the upper 100(1 − α)% point of the Studentized range distribution with k means and N − k degrees of freedom, n is the number of observations in each group, and N is the number of observations in all groups combined.
In applying Tukey's method to a set of data, the procedure is carried out stepwise: compute the mean square error from the ANOVA, obtain the Studentized range point q, form the allowance T × sqrt(Mean Square Error), and declare any pair of means whose difference exceeds that allowance significantly different, as sketched below.
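Here is one way the stepwise procedure might look in code (a sketch on invented data, using SciPy's studentized_range distribution for q):

```python
# Sketch of Tukey's stepwise procedure for k groups of equal size n.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=2.0, size=5) for m in (10, 11, 14)]  # hypothetical
k = len(groups)                      # number of treatments
n = len(groups[0])                   # observations per group (equal n assumed)
N = k * n                            # total observations

means = [g.mean() for g in groups]
# Pooled mean square error from the one-way ANOVA (within-group SS / (N - k)).
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

alpha = 0.05
q = stats.studentized_range.ppf(1 - alpha, k, N - k)  # upper 100(1-alpha)% point
T = q / np.sqrt(n)
allowance = T * np.sqrt(mse)         # smallest significant pairwise difference

for i, j in combinations(range(k), 2):
    diff = means[i] - means[j]
    verdict = "significant" if abs(diff) > allowance else "not significant"
    print(f"group {i+1} vs {j+1}: diff = {diff:+.2f}, {verdict}")
```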
Scheffe's method is recommended when contrasts other than simple pairwise comparisons are of interest, or when the comparisons of interest were suggested by the data.
Scheffe's method is very general, in that all possible contrasts among the treatment means can be tested for significance, and confidence intervals can be constructed for the corresponding linear functions of the parameters. This means that an infinite number of simultaneous tests could be made (though in practice only a finite number are), with an overall error probability no larger than planned (e.g., α = 0.05). The resulting set of confidence intervals has a joint confidence coefficient at least as large as stated.
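A sketch of a single Scheffe interval (the summary statistics are hypothetical, and the contrast weights c are arbitrary apart from summing to zero):

```python
# Scheffe confidence interval for one contrast among k means.
import numpy as np
from scipy import stats

means = np.array([10.2, 11.5, 14.1])   # hypothetical group means
n = np.array([5, 5, 5])                # group sample sizes
mse = 4.0                              # pooled MSE from the ANOVA (hypothetical)
k, N = len(means), n.sum()
alpha = 0.05

# Contrast comparing group 3 against the average of groups 1 and 2.
c = np.array([-0.5, -0.5, 1.0])
L = c @ means                          # estimated contrast
se = np.sqrt(mse * np.sum(c**2 / n))   # standard error of the contrast

# Scheffe multiplier: S^2 = (k - 1) * F(1 - alpha; k - 1, N - k)
S = np.sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, N - k))
lo, hi = L - S * se, L + S * se
print(f"contrast = {L:.2f}, 95% Scheffe CI = ({lo:.2f}, {hi:.2f})")
# The same multiplier covers every possible contrast simultaneously,
# which is why each individual interval is wide.
```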
Because it protects all possible contrasts simultaneously, Scheffe's test has a large critical value for any single contrast, and its power is consequently low. If only pairwise comparisons are to be made, it is better to use Tukey's test, which produces narrower confidence intervals.
In addition to the above, a large number of other multiple-comparison techniques exist in the literature. They differ with respect to the contrasts they target (i.e., pairwise vs. non-pairwise comparisons), how they handle unequal cell sample sizes, and other properties. The goal of these tests is to minimize the Type II error rate while controlling the Type I error rate.
The reasons for considering other multiple-comparison techniques are to improve both power (the ability to detect true differences, if they exist) and robustness (reduced dependence on meeting assumptions such as homogeneity of variance). Both the Bonferroni and Scheffe methods are reasonably robust, while Tukey's method is less so. Scheffe's method is always the least powerful, since it protects all possible contrasts, not just the pairwise comparisons. The Bonferroni method is commonly presumed to have the lowest power, but this need not be true if the set of comparisons is small and well-planned (one way is to designate one treatment as the control and compare the others only against it).
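One way to see this power ranking is to compare the critical multipliers each method applies to the standard error of a pairwise difference; the design here (6 treatments, 5 replicates) is hypothetical:

```python
# Per-comparison critical multipliers for all pairwise differences among
# k groups of equal size n, illustrating the power ranking discussed above.
import numpy as np
from scipy import stats

k, n = 6, 5                        # hypothetical design: 6 treatments, 5 reps each
N = k * n
m = k * (k - 1) // 2               # number of pairwise comparisons
alpha = 0.05

# Each multiplier is applied to the standard error of a pairwise difference,
# sqrt(MSE * 2 / n), under the corresponding procedure.
bonferroni = stats.t.ppf(1 - alpha / (2 * m), N - k)
tukey = stats.studentized_range.ppf(1 - alpha, k, N - k) / np.sqrt(2)
scheffe = np.sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, N - k))

print(f"Tukey:      {tukey:.3f}")       # smallest: narrowest pairwise intervals
print(f"Bonferroni: {bonferroni:.3f}")
print(f"Scheffe:    {scheffe:.3f}")     # largest: widest intervals, lowest power
```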