- Means
- Variance
- Standard Deviation
- Standard Deviations of Grouped Data
- Coefficient of Variation
- Normal Distribution
- Standard Scores
- Confidence Intervals
- Skewness
- Kurtosis
The arithmetic mean (or average) is simply the sum of all observed values divided by the number of observations. When looking at a sample we usually use n to represent the number of observations (sample size), and we use N for the size of the population. Using x for an observation, we can write the formula for the sample mean as:
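Written out, with the bar denoting the sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$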
This is the common notation, where the numerator of the fraction can be read literally as "the sum of all x's from i = 1 to n." We can use any letter to denote the observations, but it is best to avoid letters that are commonly assigned specific meanings (like n and N). We use letters with bars over them to indicate the mean of a sample, which distinguishes it from the mean of a population. We always use Roman letters to indicate statistical estimates of the true parameters, while Greek letters are used to indicate the parameters themselves, which are properties of a population. The mean of a population is symbolized by µ (the Greek letter mu), and the formula is nearly identical:
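The only change is that the sum runs over all N members of the population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$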
The mean of a population is often called its expectation, and as we will see later, for every sample statistic there is a corresponding population parameter with a theoretical foundation. The arithmetic mean has several properties that make it a good descriptor of a distribution:
- It always exists: it can be calculated for any numerical data.
- It is unique: a sample or population can have only one mean.
- It can be manipulated: further statistics can be obtained from it.
- It is exhaustive: it uses all the data.
The arithmetic mean is only one type of mean. One can also calculate geometric and harmonic means, and these are indeed good indicators of central tendency for many types of data. The geometric mean is usually used for ratio data and rates of change in biostatistics. The formula for the geometric mean is:
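It is the nth root of the product of the n observations:

$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$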
where the expression under the radical sign is the product of all the observations. A limitation of the geometric mean is that the numbers must be positive. The harmonic mean is often used for highly variable or cyclic data in biostatistics, and its formula is:
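That is, the number of observations divided by the sum of their reciprocals:

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$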
You should use the appropriate mean depending on the type of data, which is sometimes a judgment call. In this class, we will deal primarily with the arithmetic mean because more statistical inferences have been developed for this statistic.
Sometimes observations may not have equal importance in a population. For example, we may be studying the effect of three drugs on a particular disease and get the following numbers of individuals recovering: 34, 49, and 17. The arithmetic mean would be 33.3 people recovered, but what if we find out that the number of individuals used in each study was different, or that drug doses varied? This number would not be a very good indicator of central tendency for the population of patients. Instead we must weight the observations based on any differences in treatment. Let's say that in the first study only 10 mg of the drug was administered, in the second 20 mg was used, and in the third 15 mg. The weighted mean is given by summing the products of the weights and observations and dividing by the sum of the weights:
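With wi denoting the weight given to each group's observation, and using the doses above as weights:

$$\bar{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} = \frac{(10)(34) + (20)(49) + (15)(17)}{10 + 20 + 15} = \frac{1575}{45} = 35$$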
which is slightly higher than the recovery calculated with the arithmetic mean. Here we use k to represent the number of groups. We more commonly use the weighted mean when sample sizes vary. Using the same drug data, the weighted mean with sample sizes of 100, 45, and 15 would be 36.625. The final type of mean that you are likely to run into in biostatistics is the grand mean. This is a form of the weighted mean that we use when sample sizes vary, just as was done in the last example. The formula for the grand mean is:
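One common form, in which each group mean is weighted by its sample size ni:

$$\bar{\bar{x}} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$$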
Although the grand mean and weighted means are rarely used to describe distributions and samples, the grand mean is very important in many statistical tests that we will explore later in the class.
It is intuitive that the more similar values are to each other the less variable they are. If they are all exactly the same there is no variation and each observation equals the mean. One way to estimate dispersion is through deviations from the mean. If we average all the absolute values of the deviations of observations from the mean we obtain a mean deviation:
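In symbols:

$$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$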
but this formulation of dispersion has difficulties because of the absolute values (if one did not take them, the deviations would average to zero). Fortunately, we can square the deviations and still obtain only positive values. The most common measure of dispersion is the mean squared deviation, or variance. For a population this parameter is given by:
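With µ the population mean and N the population size:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$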
and for a sample it is:
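Using the sample mean and dividing by n:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$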
In other words, the variance is equal to the sum of the squared deviations divided by the number of observations. Mean square and sum of squares are terms we will use often when we discuss regression and analysis of variance.
Of course it would be advantageous to examine dispersion in terms of numbers that are the same order as the original observations and the mean. We accomplish this with the standard deviation which is simply the square root of the variance:
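For the population and the sample, respectively:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$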
These formulae for the sample variance and standard deviation are not the ones commonly used, because these estimates are biased. When we say a statistic is biased we mean that on average it doesn't describe the corresponding quantity for the population. For example, if we take 100 samples of a large population with known μ and calculate the sample mean for each, then the average of these 100 sample means will be identical (or nearly so) to the population mean. We call such a statistic an unbiased estimator. The above formulae for the sample variance and standard deviation, however, are biased. Statisticians therefore commonly use the unbiased measures of dispersion given by:
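The only change is division by n − 1 rather than n:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$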
which on average do equal the population values.
Calculation of the standard deviation can be quite tedious using the above formula, but there is a computing formula that is easier:
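One common version, which avoids calculating each deviation explicitly:

$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n-1}}$$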
As with the mean, we can estimate the dispersion of data from the distribution using the frequencies and class marks. This is a rather laborious task and there can be considerable error in the estimates. The formula for estimating the standard deviation from grouped data is:
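One common form, with fi the frequency and mi the class mark of the ith class, and n the total number of observations (the sum of the frequencies):

$$s = \sqrt{\frac{\sum f_i m_i^2 - \frac{\left(\sum f_i m_i\right)^2}{n}}{n-1}}$$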
The standard deviation is a measure of absolute variation that retains the units of measure (i.e., inches, grams, or mph). Such a measure of dispersion will depend on the metric used to measure the observations. Large scales will produce larger variations than small scales (the weights of hummingbird eggs are less variable than those of ostrich eggs). This is not always desirable. Instead, we often would like a measure of relative dispersion, one that is relatively independent of the scale used. The coefficient of variation is one such estimator, and it is given by:
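Usually expressed as a percentage of the mean:

$$CV = \frac{s}{\bar{x}} \times 100\%$$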
which divides out the magnitude effect of the values. The coefficient of variation can be used to compare dispersion of different distributions without concern for the scale used to measure the observations.
What does this estimator of dispersion tell us about the distribution? Chebyshev's theorem says that the proportion of data that must lie within k standard deviations of the mean is at least:
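In symbols:

$$1 - \frac{1}{k^2}$$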
where k is greater than 1. In other words, at least 75% of the data must lie within 2 standard deviations on either side of the mean, and at least 95% must lie within 4.47s on either side of the mean. This applies to any set of data, regardless of the distribution, but it only sets limits (at least that amount of the data). With real samples we often see that much more of the data falls within the limits set by this theorem. If the distribution is normal then we can be more specific: 68% lies within 1s, 95% lies within 2s, and 99.7% lies within 3s on either side of the mean. We will discuss the normal distribution later in the semester.
These observations about dispersion have led to the notorious "bell curve" used in grading. Note that it assumes the distribution is normal, which is rarely the case in real classrooms. To obtain a curved grade in this manner one must calculate a standard score, which transforms the grades into units of standard deviations. The standard scores for a sample or population are given by:
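For a sample and a population, respectively:

$$z = \frac{x - \bar{x}}{s} \qquad z = \frac{x - \mu}{\sigma}$$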
Although standard scores are very useful in statistics, students wouldn't worship them if they were applied strictly; converting grades to standard scores can lower their value as well as raise them!
When we calculate the mean of a sample, we expect it to tell us something about the mean of the population from which the sample was taken. Unfortunately, the sample mean alone contains very little information. One cannot say with much certainty that a sample mean of 15 is close to the true mean of a population. We would like to know how confident we can be in our assertions about the population mean, and for that we need to know more than the mean itself. This is true for any statistic, but for now we will concentrate on the mean.
As we have seen, there is variance in sample means, such that no single statistic will necessarily equal the population parameter. We call the standard deviation of a statistic the standard error of the estimate. For the mean, the size of the error depends on the size of the sample, the size of the population, and the standard deviation of the population. To adequately determine our confidence in asserting that the sample mean equals the population mean, we must use all this information.
When we speak of confidence, we are making a probability statement. For example, we may wish to be 95% sure that the sample mean represents the true population mean in some fashion. We cannot determine exactly what the population mean is, but we may be satisfied with knowing the range of possible values. By recognizing that the sample means are normally distributed, we can determine the area under the normal curve which accounts for 95% of the possible values. With this information, we can say the probability of the true mean falling within this range is 0.95.
When we make such probability statements, we are not interested in whether the real mean is less than or greater than the sample mean when we are wrong, so we want the interval representing the 95% probability to be centered around zero on the standardized normal curve. The probability that we are wrong is 1 − 0.95 = 0.05, and half that amount is found on either tail of the normal distribution. We term this quantity α, and our probability statement is given by 1 − α. The amount on each tail of the distribution is of course α/2 = 0.025. We are interested in determining the z-score that would mark the boundaries for the population mean.
The z-score is the value such that the probability to the right (or left for negative numbers) is α/2 = 0.025. Now we can say that the maximum error associated with the sample mean, in terms of our probability statement, is given by:
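With σ the population standard deviation and n the sample size:

$$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$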
One need only know the value of α/2 to find the appropriate z-score. For the case of α = 0.05 (95% sure), we obtain a value of zα/2 = 1.96. With this information, we can determine the maximum acceptable error associated with our sample mean. Let's say we have a sample mean of 5.1 and standard deviation of 2.3, obtained from 50 observations from a very large population. We calculate the maximum error for a 95% probability statement as:
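With the values above:

$$E = 1.96 \times \frac{2.3}{\sqrt{50}} \approx 0.64$$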
which means that there is a 95% probability that the error will be no larger than 0.64. In general, we make probability statements prior to collecting data, and confidence statements after the data have been collected. The problem with probability statements is that they require one to know (or at least be able to guess) the standard deviation prior to conducting the experiment. This is rarely possible when dealing with real populations, so we must rely upon confidence limits. However, we may find maximum errors useful for another reason. We often go into a study hoping to have some degree of precision regarding our ability to estimate a population parameter. One has little control over the dispersion in the population, but one can control the sample size. We can algebraically rearrange the equation above to obtain the minimum sample size necessary to achieve some error. Let's say we had the same population as described above and we wanted a maximum error of no greater than 1.1. The minimum sample size would be:
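Rearranging the maximum-error formula for n and rounding up to a whole observation:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{1.96 \times 2.3}{1.1}\right)^2 \approx 16.8 \;\Rightarrow\; n = 17$$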
So if you can live with such an error, you can save your time and money collecting the smaller sample!
When we have no idea what the standard deviation is, we must rely on our sample to provide an estimate. In this case, we don't make an a priori (beforehand) statement of probability. Instead, we assert our confidence that the sample statistic (e.g., the mean) is near the population parameter. Again, we must deal with a range of possible values, because the sample mean has an error associated with it. We can use the maximum error to construct such a range, or confidence interval. Going back to our method for standardizing sample means:
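That standardization, with the standard error of the mean in the denominator, is:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$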
we can see that when the sample mean equals the population mean, z = 0. As the sample mean gets further from the population mean, the z-score becomes more positive or more negative. Rearranging this equation, we can get a measure of how far the sample mean is from the population mean:
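Multiplying both sides by the standard error expresses that distance in the original units:

$$\bar{x} - \mu = z \cdot \frac{\sigma}{\sqrt{n}}$$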
Solving for the population mean, one can determine its limits for a particular value of z:
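The limits fall symmetrically on either side of the sample mean:

$$\mu = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$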
We must realize that we are not interested in whether the sample mean is larger or smaller than the population mean: either case would make us less confident in our estimate. For that reason, we use zα/2 to set the boundaries. So, the confidence interval for a sample mean from a large population is:
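In full:

$$\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;<\; \mu \;<\; \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$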
Of course, we still may not know the population standard deviation, but as we have seen the sample standard deviation is usually a good estimator (the error is small). Using the same example above, we can calculate the confidence interval for our mean of 5.1:
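Substituting the sample values:

$$5.1 \pm 1.96 \times \frac{2.3}{\sqrt{50}} = 5.1 \pm 0.64 \;\Rightarrow\; 4.46 < \mu < 5.74$$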
and we can be 95% confident that the population mean is between 4.46 and 5.74. This method is vastly superior to simply talking about the sample mean, because it contains more information. Instead of having a point estimate of the population mean, we have a range of values, or an interval of estimates. One problem that arises from such a method of estimating the population mean is that its strength diminishes with increasing confidence. If we wanted to be 99% sure of our estimate we could get a confidence interval of:
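For 99% confidence, zα/2 = 2.576, so:

$$5.1 \pm 2.576 \times \frac{2.3}{\sqrt{50}} = 5.1 \pm 0.84 \;\Rightarrow\; 4.26 < \mu < 5.94$$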
which is wider than that for 95% confidence by about 0.4. That wider interval means we have more values to choose from as our estimate. One must balance confidence with precision. Typical confidence intervals use probabilities of 95% and 99%.
Sometimes we don't have a large sample. In all the above examples, we assumed that n was greater than or equal to 30. When we have a small sample, we find that the sample means follow a t-distribution better than a normal one. We can standardize our sample means to the t-distribution as follows:
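The form mirrors the z-score, with the sample standard deviation in the standard error:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$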
We can also use a t-distribution to obtain the same confidence limits:
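With tα/2 taking the place of zα/2:

$$\bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}} \;<\; \mu \;<\; \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$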
The t-distribution requires that one know the degrees of freedom (df or ν) for the statistic. In most cases, this is one less than the sample size (n − 1). One can then look up the value of tα/2 in a table (or use a function). Let's assume that we have a sample size of 25 for our example. The df are 25 − 1 = 24, and the t-score associated with 95% confidence is 2.064. The confidence interval would be:
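Using s = 2.3 and n = 25:

$$5.1 \pm 2.064 \times \frac{2.3}{\sqrt{25}} = 5.1 \pm 0.95 \;\Rightarrow\; 4.15 < \mu < 6.05$$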
so, we are 95% sure that the population mean falls between 4.15 and 6.05.
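As a quick check, here is a minimal Python sketch of the z- and t-based intervals above, using scipy to look up the critical values (the numbers assume the running example: mean 5.1, s = 2.3, n = 50 or 25):

```python
from math import sqrt
from scipy import stats

mean, s = 5.1, 2.3

# Large-sample (z-based) 95% interval, n = 50
n = 50
z = stats.norm.ppf(0.975)          # ~1.96
err = z * s / sqrt(n)              # maximum error, ~0.64
print(f"95% z-interval: {mean - err:.2f} to {mean + err:.2f}")

# Small-sample (t-based) 95% interval, n = 25, df = 24
n = 25
t = stats.t.ppf(0.975, df=n - 1)   # ~2.064
err = t * s / sqrt(n)              # ~0.95
print(f"95% t-interval: {mean - err:.2f} to {mean + err:.2f}")
```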
With all these methods, one is assuming a large population. When the population is finite, we use the finite population correction factor:
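The usual form multiplies the standard error of the mean by:

$$\sqrt{\frac{N - n}{N - 1}}$$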
Not all distributions are bell shaped (or normal). In the normal distribution, there are just as many observations to the right of the mean as there are to the left, and the median and mean are equal. When this is not the case, we say the distribution is skewed or asymmetrical. If the tail is drawn out to the left, then the curve is left skewed. If the tail is drawn out to the right, then the curve is right skewed:
[Figure: left-skewed and right-skewed distributions]
The coefficient of skewness is referred to by the symbol γ1. The test for significant skewness is a simple t-test. The test statistic (ts) is computed as:
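In the usual form for such a test, the observed skewness (written here as g1) minus its expected value, divided by its standard error:

$$t_s = \frac{g_1 - \gamma_1}{s_{g_1}}$$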
where γ1 is the expected skewness (γ1 = 0 if you are testing for normality), and the denominator is the standard error of the skewness statistic, which is computed as:
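The exact formula varies among textbooks; a common large-sample approximation is:

$$s_{g_1} \approx \sqrt{\frac{6}{n}}$$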
The significance of the test statistic (ts) is evaluated by comparing it with tα/2 with infinite degrees of freedom (i.e., 1.96 at α = 0.05). There is significant skewness if the absolute value of ts is greater than tα/2 . If ts is negative, then the distribution is skewed to the left. If ts is positive, then the distribution is skewed to the right.
Another type of departure from normality is the kurtosis, or "peakedness" of the distribution. A leptokurtic curve has more values near the mean and at the tails, with fewer observations at the intermediate regions relative to the normal distribution. A platykurtic curve has fewer values at the mean and at the tails than the normal curve, but more values in the intermediate regions. A bimodal ("double-peaked") distribution is an extreme example of a platykurtic distribution.
[Figure: leptokurtic and platykurtic distributions compared with the normal curve]
The coefficient of kurtosis is referred to by the symbol γ2. The test for significant kurtosis is also a simple t-test. The test statistic (ts) is computed as:
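As with skewness, the observed kurtosis (here g2) minus its expected value, divided by its standard error:

$$t_s = \frac{g_2 - \gamma_2}{s_{g_2}}$$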
where γ2 is the expected kurtosis (γ2 = 0 if you are testing for normality), and the denominator is the standard error of the kurtosis statistic, which is computed as:
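Again, the exact formula varies among textbooks; a common large-sample approximation is:

$$s_{g_2} \approx \sqrt{\frac{24}{n}}$$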
The significance of the test statistic (ts) is evaluated by comparing it with tα/2 with infinite degrees of freedom (i.e., 1.96 at α = 0.05). There is significant kurtosis if the absolute value of ts is greater than tα/2. If the value of ts is negative, then the distribution is platykurtic (i.e., has a broad or bimodal peak and short tails). If positive, then the distribution is leptokurtic (i.e., has a narrow peak and long tails).
Note: the standard error of kurtosis (as depicted here and published in most textbooks) assumes a symmetrical distribution of g2. For small sample sizes (<100), g2 is strongly skewed to the right (it can never be less than −2.0). Therefore the above standard error yields a greatly exaggerated type II error when testing for platykurtosis of small samples.
The formulae for both skewness and kurtosis are moment statistics. A central moment in statistics, as in physics, is (1/n)Σ(xi − x̄)^r. The first central moment, (1/n)Σ(xi − x̄), is always equal to zero. The second central moment, (1/n)Σ(xi − x̄)², is the variance. The statistic g1 is the third central moment divided by the cube of the standard deviation: (1/(ns³))Σ(xi − x̄)³. The statistic g2 is 3 less than the fourth central moment divided by the fourth power of the standard deviation: (1/(ns⁴))Σ(xi − x̄)⁴ − 3.
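In display form, with mr denoting the rth central moment and s the standard deviation as defined above:

$$m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r, \qquad g_1 = \frac{m_3}{s^3}, \qquad g_2 = \frac{m_4}{s^4} - 3$$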
The use of skewness and kurtosis as indicators of nonnormality was common in the early 20th century, when much emphasis was placed on the distributions of biological variables. Three factors caused these parameters to fall out of favor:
Nowadays, however, it is easy to calculate skewness and kurtosis on a computer, and their contribution to a priori testing is an important part of ensuring the appropriateness of certain tests and the validity of their results.