11 Nonlinear and Multiple Regression

Topics

Nonlinear Regression Analyses:
    Exponential Functions
    Power Functions
    Limits of Prediction

Multiple Regression Analyses:
    Linear Functions
    Nonlinear Functions
    Limits of Prediction

Nonlinear Regression Analyses

As we said before, linear regression methods apply only to straight-line relationships between variables. Often in biology, we encounter variables that are related in a nonlinear fashion. We can still apply these techniques by transforming the data so that the relationship is linear in some scale. This is accomplished in different ways, depending upon the type of hypothesized relationship. We will explore only a few here.

Exponential Functions

In many situations, the relationship between variables takes the shape of an exponential function. This means that the ys increase or decrease ever more rapidly as the xs increase. A classic example in biology is the exponential growth of populations. If we attempt to fit a straight line through the points, we come up with some useless results, as exemplified in the following figure:

[Figure: population size plotted against time, with a straight line fitted to exponentially growing data]

Although all of the points fall on a smooth curve, the proportion of the variation in population size that can be accounted for by the straight line is only 74%. Worse still, the straight line predicts an intercept of -6204.5, which means that the population started with a negative number of individuals, which is of course impossible.

The equation for exponential growth is:

Nt = N0 e^(rt)

where Nt is the population size at some time (t), N0 is the starting population size (the intercept), and r is the intrinsic rate of increase. How can we make this linear? If we take the natural log of each side, we get the following equation:

ln(Nt) = ln(N0) + rt

Now we can let y = ln(Nt), a = ln(N0), b = r, and x = t to get the typical linear regression equation, y = a + bx. If we plot the log of N against time, we get a straight line.

With this type of transformation, b is the intrinsic rate of population increase (0.25), and a is the log of the starting population size (e^a = e^5.2983 = 200). For this particular population, we can predict future population sizes with the following equation:

Nt = 200 e^(0.25t)

To find the least squares equation for such data, we proceed just as with linear regression, except all the ys are ln(N). A table for calculating the regression coefficients follows:

          Time (x)  N         ln(N) = y   x^2      [ln(N)]^2 = y^2   x*ln(N) = xy
          1         256.81    5.55        1        30.78             5.55
          2         329.74    5.80        4        33.62             11.60
          3         423.40    6.05        9        36.58             18.14
          4         543.66    6.30        16       39.67             25.19
          5         698.07    6.55        25       42.88             32.74
          6         896.34    6.80        36       46.22             40.79
          7         1150.92   7.05        49       49.68             49.34
          8         1477.81   7.30        64       53.27             58.39
          9         1897.55   7.55        81       56.98             67.93
          10        2436.50   7.80        100      60.81             77.98
          11        3128.53   8.05        121      64.78             88.53
          12        4017.11   8.30        144      68.86             99.58
          13        5158.07   8.55        169      73.07             111.13
          14        6623.09   8.80        196      77.41             123.18
          15        8504.22   9.05        225      81.87             135.72
          16        10919.63  9.30        256      86.46             148.77
          17        14021.08  9.55        289      91.17             162.32
          18        18003.43  9.80        324      96.01             176.37
          19        23116.86  10.05       361      100.97            190.92
          20        29682.63  10.30       400      106.06            205.97
Count     20                  20          20       20                20
Sum       210                 158.47      2870.00  1297.14           1830.15
Average   10.50               7.92        143.50   64.86             91.51

Giving us:

SSx  = 2870.00 - (210)^2/20 = 665.00
SSxy = 1830.15 - (210)(158.47)/20 = 166.22

b = SSxy/SSx = 0.25
a = ȳ - b(x̄) = 5.2983

ln(N) = 5.2983 + 0.25x

This is obviously a better fit to the data (r = 1.00), and the intercept yields a positive starting population size instead of a negative one. Of course, we test the significance of these regression coefficients using a and b, not the starting population size and the intrinsic rate of increase.
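
In practice, these calculations are rarely done by hand. Below is a minimal sketch of the same transform-and-fit approach in Python, assuming NumPy is available; the population sizes are regenerated from Nt = 200 e^(0.25t) rather than retyped from the table.

    import numpy as np

    # Times t = 1..20 and the corresponding population sizes; these
    # reproduce the N column of the table above.
    t = np.arange(1, 21)
    N = 200 * np.exp(0.25 * t)

    # Least-squares fit of a straight line to ln(N) against t;
    # np.polyfit returns the slope first, then the intercept.
    b, a = np.polyfit(t, np.log(N), 1)

    r = b           # intrinsic rate of increase
    N0 = np.exp(a)  # starting population size
    print(f"r = {r:.2f}, N0 = {N0:.0f}")  # r = 0.25, N0 = 200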

Power Functions

In some cases, the dependent variable is a power function of the independent variable. An excellent example from biology is the allometric equation. This equation relates the rate of a physiological function to body size, and it looks like:

rate = c(size)^s

where s is the allometric scale and c is the allometric constant. An example is the relationship between body size and metabolic rate in homeotherms.

[Figure: metabolic rate plotted against body size, with a straight line fitted to the curved data]

Obviously, the straight line doesn't fit the points very well. Although the coefficient of determination is very high, all of the points actually fall on a smooth curve, which better explains the variation in the ys. Also note that the intercept is 1.13, which suggests that even animals with no size have a metabolic rate, which we know is impossible. There are actually two ways to approach this problem.

Remembering back to the example we used for fruit set in plants, we found an intercept that was greater than zero. This suggested that plants of age zero still produced fruits: again, this is impossible. However, over the range of ages we sampled, the relationship was fairly linear. As long as we confined our predictions to that range of ages, we could be fairly confident in them, but extrapolating outside that range increased our probability of making an erroneous prediction. This is the general problem with using regression to predict values for which we have no supporting data. Note that in the example for metabolic rate, the relationship is fairly linear for large body sizes. We might be satisfied with limiting our analysis to that range of the data and ignoring the lower body sizes and metabolic rates. In such a case, we would proceed with a normal linear regression model. This simplification, however, ignores an important consideration. We may have a theoretical reason to expect a certain nonlinear relationship between two variables (perhaps because such a functional response has appeared in similar studies). In such cases, it would be a mistake to ignore the underlying biological mechanism merely to simplify our model: we might miss important insights about our data.

We would like to use all the data (usually a much better approach), so we must linearize the relationship in some fashion. Again, we can take the log of our ys to produce a linear relationship. In this case the equation is:

ln(rate) = ln(c) + s ln(size)

Now we proceed just as we did with exponential regression, except that we must also use ln(size) to represent x.

Note that we now have a slope of 0.05, which is the allometric scale. The intercept is very close to zero (the small discrepancy is just rounding error), which is as it should be. To find the allometric constant, we simply exponentiate the intercept: c = e^a = e^0 = 1.00. The following table shows the calculation of the regression coefficients for this problem; a short code sketch of the same fit follows the table.

          Size   Rate   ln(Size) = x   ln(Rate) = y   [ln(Size)]^2 = x^2   [ln(Rate)]^2 = y^2   ln(Size)*ln(Rate) = xy
          10     1.12   2.30           0.12           5.30                 0.01                 0.27
          20     1.16   3.00           0.15           8.97                 0.02                 0.45
          30     1.19   3.40           0.17           11.57                0.03                 0.58
          40     1.20   3.69           0.18           13.61                0.03                 0.68
          50     1.22   3.91           0.20           15.30                0.04                 0.77
          60     1.23   4.09           0.20           16.76                0.04                 0.84
          70     1.24   4.25           0.21           18.05                0.05                 0.90
          80     1.24   4.38           0.22           19.20                0.05                 0.96
          90     1.25   4.50           0.22           20.25                0.05                 1.01
          100    1.26   4.61           0.23           21.21                0.05                 1.06
Count                   10             10             10                   10                   10
Sum                     38.13          1.91           150.23               0.38                 7.51
Average                 3.81           0.19           15.02                0.04                 0.75
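
The same fit is easy to reproduce in software. Here is a minimal sketch in Python, assuming NumPy and using the Size and Rate columns from the table above:

    import numpy as np

    # Body sizes and metabolic rates from the table above
    size = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    rate = np.array([1.12, 1.16, 1.19, 1.20, 1.22, 1.23,
                     1.24, 1.24, 1.25, 1.26])

    # Straight-line fit to the log-log data: the slope is the
    # allometric scale, the intercept is ln(c)
    s, a = np.polyfit(np.log(size), np.log(rate), 1)

    c = np.exp(a)  # allometric constant, e^intercept
    print(f"scale s = {s:.2f}, constant c = {c:.2f}")  # s = 0.05, c = 1.00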

Multiple Regression Analyses

Often, we find that variation in our dependent variable is poorly explained by a single independent variable. For example, we may regress body size against age and get only a moderate fit (e.g., r = 0.32). Intuitively, we realize there are many things that can influence body size, like sex (if the species is sexually size dimorphic, like humans), genetics (e.g., if the parents were large), and environment (e.g., if a variable amount of food was consumed during development). We could do four separate regressions, but as we saw in the discussion of correlations, these independent variables may be correlated with each other. Simply summing the coefficients of determination from each separate regression will not tell us how much variation is explained by all the independent variables together. Further, we would have no way of determining a single intercept and set of slopes. The better approach is to simultaneously fit the dependent variable to all independent variables, again minimizing the squared difference between the observed and predicted values. We call this approach multiple regression.

Linear Functions

If the dependent variable is linearly related to each of the independent variables, then we can extend our linear equation to account for them:

y = a + b1x1 + b2x2 + ... + bkxk

In this case, each independent variable has a different slope associated with it, but there is still only a single intercept. As with our previous linear models, we proceed from a set of normal equations, which look like the following for two independent variables:

Σy   = na   + b1Σx1    + b2Σx2
Σx1y = aΣx1 + b1Σx1^2  + b2Σx1x2
Σx2y = aΣx2 + b1Σx1x2  + b2Σx2^2

From these we can solve, with Gaussian elimination, for each of the regression coefficients. As with simple linear regression, there are formulae for the regression coefficients that involve sums of squares and sums of cross products. Before we examine these formulae, we must introduce some new statistics. They are:

SS1  = Σx1^2 - (Σx1)^2/n
SS2  = Σx2^2 - (Σx2)^2/n
SSy  = Σy^2  - (Σy)^2/n
SP12 = Σx1x2 - (Σx1)(Σx2)/n
SP1y = Σx1y  - (Σx1)(Σy)/n
SP2y = Σx2y  - (Σx2)(Σy)/n

With these, the regression coefficients can be found with:

b1 = [(SP1y)(SS2) - (SP2y)(SP12)] / [(SS1)(SS2) - (SP12)^2]
b2 = [(SP2y)(SS1) - (SP1y)(SP12)] / [(SS1)(SS2) - (SP12)^2]
a  = ȳ - b1x̄1 - b2x̄2

By simply producing a table, as we have with all the regression problems, we can easily find the coefficients. Now let's use our body size problem again. For several ages, we have a lizard's body size and its food intake.

          Age (x1)  Food (x2)  Size (y)  x1^2    x2^2   y^2      x1*x2   x1*y   x2*y
          1         1          1         1       1      1        1       1      1
          2         1.1        4.1       4       1.21   16.81    2.2     8.2    4.51
          3         1.3        7.1       9       1.69   50.41    3.9     21.3   9.23
          4         1.4        12.5      16      1.96   156.25   5.6     50     17.5
          8         1.4        12.5      64      1.96   156.25   11.2    100    17.5
          10        1.3        12.5      100     1.69   156.25   13      125    16.25
          12        1.3        19.5      144     1.69   380.25   15.6    234    25.35
          15        1.2        27.5      225     1.44   756.25   18      412.5  33
Count     8         8          8         8       8      8        8       8      8
Sum       55        10         96.7      563     12.64  1673.47  70.5    952    124.34
Average   6.875     1.25       12.0875   70.375  1.58   209.184  8.8125  119    15.5425

The values necessary for calculation of the regression coefficients are:

SS1  = 563 - (55)^2/8        = 184.88
SS2  = 12.64 - (10)^2/8      = 0.14
SSy  = 1673.47 - (96.7)^2/8  = 504.61
SP12 = 70.5 - (55)(10)/8     = 1.75
SP1y = 952 - (55)(96.7)/8    = 287.19
SP2y = 124.34 - (10)(96.7)/8 = 3.47

These give b1 = 1.50, b2 = 6.05, and a = ȳ - b1x̄1 - b2x̄2 = -5.76, so the fitted equation is ŷ = -5.76 + 1.50x1 + 6.05x2.

In order to determine the amount of variation in size that can be accounted for by these two independent variables, we can proceed as we did for multiple correlation. This can be a tedious undertaking with several independent variables. A better way is to use the ratio of two sums of squares, which requires us to define a few new terms. In the process of performing a regression, we have accounted for a portion of the total variation in the ys with our predicted values. When the fit is not perfect, we have some residual variation that is not accounted for. We can define three sums of squares for these three aspects of variation. The sum of squares for total variation (SStot) in the ys we have already seen as SSy. The sum of squares for the regression (SSreg) compares the predicted values to the mean of the ys. The sum of squares of the residuals (SSres) compares the predicted values to the observed values:

SStot = Σ(y - ȳ)^2
SSreg = Σ(ŷ - ȳ)^2
SSres = Σ(y - ŷ)^2

The multiple correlation coefficient is given by:

R = √(SSreg / SStot)

For this particular regression, SSreg/SStot = 0.89: age and food intake together account for 89% of the variation in size. We will expand on these methods when we explore analysis of variance next week.
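
With more than two independent variables, solving the normal equations by hand becomes impractical, and we would let software do it. Here is a minimal sketch in Python, assuming NumPy and the lizard data from the table above; the coefficients and the ratio SSreg/SStot match the hand calculation.

    import numpy as np

    # Age, food intake, and body size for the eight lizards above
    age  = np.array([1, 2, 3, 4, 8, 10, 12, 15], dtype=float)
    food = np.array([1, 1.1, 1.3, 1.4, 1.4, 1.3, 1.3, 1.2])
    size = np.array([1, 4.1, 7.1, 12.5, 12.5, 12.5, 19.5, 27.5])

    # Design matrix: a column of ones (intercept), then x1 and x2
    X = np.column_stack([np.ones_like(age), age, food])

    # Least-squares solution for [a, b1, b2]
    coef, *_ = np.linalg.lstsq(X, size, rcond=None)
    a, b1, b2 = coef

    # R^2 = SSreg/SStot, computed from the fitted values
    pred = X @ coef
    r2 = 1 - np.sum((size - pred)**2) / np.sum((size - size.mean())**2)
    print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, R^2 = {r2:.2f}")
    # a = -5.76, b1 = 1.50, b2 = 6.05, R^2 = 0.89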

Nonlinear Functions

Curvilinear relationships can also exist that cannot be explained by exponential or power functions. If y changes direction over the range of the xs, then we are probably dealing with a polynomial function. It is a simple matter to fit polynomial relationships between an independent variable and a dependent variable with multiple regression. The polynomial function is given by:

y = a + b1x + b2x^2 + ... + bkx^k

This is still a linear equation in the regression coefficients; we simply raise values of x to different powers and treat each power as a separate independent variable. The simplest form of this relationship is the parabola (k = 2). One might see a parabolic relationship with many types of data. As an example, we will use recovery time in response to doses of a new drug.

          Dose = x1  Dose^2 = x2  Time = y  x1^2   x2^2     y^2     x1*x2    x1*y    x2*y
          1          1            7.2       1      1        51.84   1        7.20    7.20
          2          4            6.7       4      16       44.89   8        13.40   26.80
          3          9            4.7       9      81       22.09   27       14.10   42.30
          4          16           3.7       16     256      13.69   64       14.80   59.20
          5          25           4.7       25     625      22.09   125      23.50   117.50
          6          36           4.2       36     1296     17.64   216      25.20   151.20
          7          49           5.2       49     2401     27.04   343      36.40   254.80
          8          64           5.7       64     4096     32.49   512      45.60   364.80
Count     8          8            8         8      8        8       8        8       8
Sum       36.00      204.00       42.10     204.00 8772.00  231.77  1296.00  180.20  1023.80
Average   4.50       25.50        5.26      25.50  1096.50  28.97   162.00   22.53   127.98

which yields:

SS1  = 204.00 - (36)^2/8       = 42.00
SS2  = 8772.00 - (204)^2/8     = 3570.00
SSy  = 231.77 - (42.1)^2/8     = 10.22
SP12 = 1296.00 - (36)(204)/8   = 378.00
SP1y = 180.20 - (36)(42.1)/8   = -9.25
SP2y = 1023.80 - (204)(42.1)/8 = -49.75

These quantities give us a regression equation of ŷ = 9.24 - 2.01x + 0.20x^2. This equation accounts for 85% of the variation in recovery time (SSreg/SStot = 0.85), and as you can see, it is a pretty good fit to the original data.
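
Because the polynomial fit is just multiple regression on powers of x, software can do it in one call. A minimal sketch in Python, assuming NumPy and the dose and time columns from the table above:

    import numpy as np

    # Drug dose and recovery time from the table above
    dose = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    time = np.array([7.2, 6.7, 4.7, 3.7, 4.7, 4.2, 5.2, 5.7])

    # Second-order (parabolic) least-squares fit; np.polyfit returns
    # the coefficients from the highest power down: [b2, b1, a]
    b2, b1, a = np.polyfit(dose, time, 2)
    print(f"y = {a:.2f} + ({b1:.2f})x + ({b2:.2f})x^2")
    # y = 9.24 + (-2.01)x + (0.20)x^2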

A word of caution: with a long enough expression (i.e., k = n - 1), we can fit a curve through every observation of y. Generally, fits of third order (k = 3) or above should be viewed with skepticism in biological data, as they imply higher-order interactions between variables that only rarely have a good theoretical explanation.
