Regression



Regression statistics expand on correlation to allow us to use relationships between variables to make predictions. They provide us with tools to write linear equations which can be used to predict the value of a dependent or criterion variable from the value of one or more predictor variables.

Linear Functions

Before we start talking about regression we are first going to talk about the general topic of linear functions which some of you may remember from earlier math classes.

Linear Functions Form

The general form of a simple linear function is this equation (Y = a + bX). This equation describes any straight line. The slope of the line is represented by the letter b. The intercept is represented in the equation by the letter a. The value of the intercept, a, is where the line crosses the Y axis. By arbitrarily picking values for X and using this formula, we can determine the corresponding values of Y and therefore draw the line.

Linear Function Example 1

In the blue box at the bottom of the illustration is our equation for a particular line, Y = 2X + 1. From this equation we can determine how the line will appear on a graph. I'll call this graph a "scatterplot" for reasons that we discussed in the lecture on correlation.

Note: In this example, I have reversed the order of the parameters on the right side of the equation, but it is essentially the same. That is, (Y = a + bX) is the same mathematically as (Y = bX + a).

In this example, the slope b = 2 and the intercept a = 1. The next step is to arbitrarily pick at least three X values. For this example I picked the X values of 0, 1, and 2. Using the equation (Y = 2X + 1) I can determine that when X = 0 then Y = 1. When X = 1 then Y = 3, and when X = 2 then Y = 5. Try these out for yourself and perhaps try some other values for X as well. Notice that when you draw a line through these points the line crosses the Y axis right at the value of 1. The value "1" is the intercept.
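If you want to check points like these by machine, here is a minimal sketch in Python (the function name and the X values are my own choices for illustration):

    # Evaluate the linear function Y = 2X + 1 at a few arbitrary X values.
    def line(x, slope=2.0, intercept=1.0):
        return slope * x + intercept

    for x in [0, 1, 2]:
        print(x, line(x))   # prints 0 1.0, then 1 3.0, then 2 5.0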

Notice that high values of X give you high values of Y. Later, we'll see that a positive slope corresponds to the idea of a positive relationship in correlation.

Linear Function Example 2

Here's another example. For this line the equation is Y = .5X - 1. This equation looks a little different because the value of the Y-intercept is -1. That means the line will cross the Y axis at -1. The form of our equation remains the same (Y = bX + a), but if you remember your high school algebra, the equation Y = .5X + (-1) is the same as Y = .5X - 1. This second form is just a simpler way of writing the equation if the value of a is less than 0.

Here I arbitrarily picked the X values of 0, 1, 2, and 3. Then using the equation Y = .5X - 1, I determined that the associated Y values would be -1, -.5, 0, and .5. Why don't you try this and check my calculations?

Compare Example 1 and Example 2. What is the effect of changing the slope from 2 (in Example 1) to .5 (in Example 2)? The line with the lower value of slope (.5) is less steep than the line with the higher slope.

What is the effect of changing the intercept, a, from 1 to -1? Notice that the line cuts the Y axis in each case exactly at the value of "a."

Linear Function Example 3

For this line the equation is Y = -1.5X + 2. Here we have a negative value (-1.5) for the slope and the intercept has a value of 2. I picked the X values of 0, 1, and 2 again and then used the equation to determine the Y values.

Compare: What is the effect of making the slope negative? The line slopes the other way. Notice that high values of X now give you low values of Y. This is true whenever the slope is negative. We'll see later that a negative slope corresponds to the idea of a negative relationship in correlation.


When the Slope (b) = 0 -- No Relationship

What does the line look like if the slope is 0? It will always look like a flat horizontal line. In this example, regardless of what value you put in for X, Y will always equal 3.

Later, we'll see that a slope of 0 corresponds to the idea of no relationship in correlation.

Y = X

Y = 0 + 1X or simply Y = X is the line that starts at the origin (0, 0) and goes up at a 45 degree angle. By picking the X values of 0, 1, and 2 and using the formula you can determine that the Y values are also 0, 1, and 2, respectively.


Regression Line

In statistics, when we want to predict or estimate one variable, Y, from a second variable, X, we use a procedure called "regression." The "regression line" is the linear function we use to make this prediction. If that doesn't make much sense to you at this point, it's OK. We'll spend a lot of time learning this concept.

NOTATION: When we talk about predicted or estimated values of variable Y, we generally use some symbol like a Y with a little caret or a little hat on the top of it (^), or we use Y prime (Y'). In this class, we will use Y' because it's a lot easier to type an apostrophe for prime than it is to draw one of those little hats in HTML at this time. But in statistics books you'll see several different notations.

In words we'll say "Y prime is equal to a plus bX." In symbols, we'll write Y' = a + bX.

Obviously (except for the prime) Y' = a + bX is very similar to the linear function that we just reviewed.

When you are predicting or estimating values of Y from X, Y is called the criterion variable, and X is called the predictor variable. The criterion variable is often called the dependent variable.


Cigarettes and Health Example
For the purposes of our lecture today we are going to use a made-up example which examines the relationship between cigarettes and health. So, Y might be the number of health problems experienced by an individual between the ages of 65 and 70; and X might be the number of cigarettes he or she smoked per day from the age of 20 until the age of 50. We want to predict Y from X; that is, we want to estimate the number of health problems later in life from the number of cigarettes smoked earlier in life. In statistical jargon, we will find the regression line, Y' = a + bX.

Health Problems and Smoking

Measurement Operations: Translating a life into the number of cigarettes smoked per day. There are two general methodologies used in such studies. In a RETROSPECTIVE STUDY, we would ask research participants to review their life and report how many cigarettes they smoked per day between the ages of 20 and 50. In a PROSPECTIVE STUDY, we would track people across their lifetime, asking them to record the number of cigarettes they smoke per day. The retrospective study could be done in a few months. The prospective study would take many years. The data from a prospective study is much higher quality because it doesn't rely on the subjects' memory. Either way, the number of cigarettes smoked per day is X; it will be our predictor variable.

Then we're going to predict the number of health problems a participant has between the ages of 65 and 70 from how much they smoked.

Let's say we do a retrospective study. We examine the medical records of participants when they were between 65 and 70 years old, counting the number of health problems they had. Then we give them a questionnaire on how much they've smoked at different times in their life. We want to predict health problems from smoking rate.


Regression Line

Let's say that we're going to have a tiny little sample. Usually there are thousands of people in such studies, but we're just going to have a few so that our calculations will be simple. The data is made up.

On the illustration the data is ordered in the table on the left from the lowest X value to the highest; that is, it goes from the least number of cigarettes smoked to the most. So the person who smoked one cigarette per day had three health problems; the person who smoked two per day had ten health problems, and so on.

The table contains the data of individual participants and we've measured two things about each of them. In the general scheme of methodology, a regression study is still a correlational study.

Next we're going to draw a scatterplot. Each dot on the scatterplot represents the data of one person. Perhaps by now you've got enough experience with scatterplots from studying correlations to know that this scatterplot shows a positive relationship. The more smoking the more health problems. The scatterplot shows a pretty high correlation.

We have a scatterplot; but the question is how do we find the linear regression line? How can we draw a line that goes as close as possible to all the points on the graph?

Regression Line

Describing the relationship between X and Y. Little r is one descriptive statistic which can summarize the relationship between cigarettes and health problems in this data. You've studied correlation and know how to calculate r.

There's another descriptive statistic called the regression line. The regression line is the best straight line that we can draw through or between the points on the scatterplot. Obviously a straight line can't connect all the dots because then you'd have to bounce up and down and up and down from one dot to the next, and it wouldn't be a straight line. So we want to be able to draw a single line that comes as close to all the dots as possible.

Least Squares Principle. We'll have to have a criterion for what we mean by "close." The criterion is called the least squares principle. Recall our discussion of Variance. We showed how the variance squares the deviations around the mean. In Regression we will square the deviations around the regression line instead of around the mean. The best fit regression line is the line that has the smallest value for the squared deviations around it, the least squared deviations. That's essentially the whole idea of least squares. But we'll talk about it more later after you are more familiar with these ideas.
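To make the least squares criterion concrete, here is a small Python sketch. The data points are made up for illustration (they are not our cigarette data); the sketch just compares the total squared deviation around two candidate lines:

    # Hypothetical points, for illustration only.
    xs = [0, 1, 2, 3]
    ys = [1.2, 2.9, 5.1, 6.8]

    def sum_squared_deviations(a, b):
        # Sum of squared deviations of each actual y from the line a + b*x.
        return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    # The least squares principle prefers the line with the smaller total.
    print(sum_squared_deviations(a=1.0, b=2.0))   # about 0.10 -- fits closely
    print(sum_squared_deviations(a=0.0, b=1.0))   # about 29.1 -- fits poorly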

So how do we find the line that best fits through all these points?


Regression Equation Formula

When we reviewed linear functions, we described equations of the type, Y = .75X + 3. Then we put in values of X, calculated values of Y, and drew the line on a graph. But what if we don't know the values of the parameters a and b?

In Regression analysis we don't know a and b. We have to calculate a value of a and a value of b from the sample data. We're going to use the data to calculate the slope and the intercept of the regression line.

The current graphic shows the formulas for calculating the slope and the intercept from the data. Our estimated value of Y will be found through the equation Y' = a + bX. The yellow box on the left shows the formulas for a and b.

As you can see, the intercept, a, is equal to the Mean of Y minus b times the Mean of X.

The slope, b, is equal to the correlation coefficient, r, times the standard deviation of Y divided by the standard deviation of X.

Write these formulas down, and in the next graphic we will begin calculating a and b.

NOTE: You can, if you want, reverse what you call the predictor variable and the criterion variable. That is, you can reverse X and Y so that you predict X from Y instead of Y from X. In this example, that means you can predict the number of cigarettes smoked (X) from the number of health problems (Y). The pink box in the lower right of the graphic shows the formulas for predicting X from Y. We won't do that sort of reversal in this class, so you don't need those formulas. They are there just for your information.

Find the Slope from the Data

These are pretty easy formulas if you've already got the means, the standard deviations and the correlation coefficient. There is a lot of rounding error with these formulas, so it is best to carry out your calculations to several significant digits. In our example, you'll notice that I carried the numbers out to 5 decimals.

The current graphic shows all the relevant statistics that we need for our example.

The slope, b, is equal to the correlation coefficient times the standard deviation of Y over the standard deviation of X [b = rxy (sy/sx)]. As you can see, substituting into that equation gives you a slope of +1.578.

Find the Intercept from the Data

Let's go ahead and calculate the intercept, a. The intercept is equal to the Mean of Y (11) minus the slope (1.578) times the Mean of X (5), which is equal to +3.109.
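As a quick sketch, here are both formulas in Python, using the summary statistics reported in this lecture. One caution: the lecture's graphic, not reproduced here, contains the standard deviation of X; the value below is back-calculated from the reported slope, so treat it as approximate:

    # Summary statistics from the lecture's example.
    r   = 0.77119    # correlation between smoking (X) and health problems (Y)
    s_y = 6.68331    # standard deviation of Y
    s_x = 3.2663     # standard deviation of X (back-calculated; approximate)
    m_x = 5.0        # mean of X
    m_y = 11.0       # mean of Y

    b = r * (s_y / s_x)    # slope:     b = r * (sy / sx)
    a = m_y - b * m_x      # intercept: a = My - b * Mx
    print(round(b, 3), round(a, 3))   # roughly 1.578 and 3.11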


Substitute the Values into the Regression Equation

Through the formulas we have just calculated, the data has told us that the best fitting regression line between health (Y) and smoking (X) has a = 3.109 and b = 1.578.

The general form for the regression line is Y' = a + bX. Let's substitute the values of a and b we just found into that general equation. The substitution gives us Y' = 3.109 + 1.578X. That's our regression line.

Y' = 3.109 + 1.578X describes the relationship between health problems (Y) and smoking (X). At this point in the course we will consider it another descriptive statistic. The regression line describes very precisely the relationship between two variables.

Let's go on to graph this equation and see how it fits into our data.

Back to the data and scatterplot: In the graphic you can see that the data is listed in the upper table as it was on previous graphics. The data is also drawn on the scatterplot as it was before.

How do we draw the regression line on the scatterplot? What we're going to do is create a second table in the lower right of the graphic. In that table we will put in arbitrary values of X and then calculate predicted values of Y, that is, we will calculate a Y' for every value of X that we arbitrarily choose.

Calculate some Estimated Values (Y')
Now let's put a few values of X into the equation. It doesn't matter what values, so let's use 0, 5, and 10. Now what we've got to do is use the regression equation to calculate a predicted value for each of these values of X. As practice you may want to use the equation to calculate these estimated values of Y before you go on.

Calculation Example 1

Here's a specific example. If X is 0 and we put 0 into the regression equation the predicted value (Y') is 3.109.

One reason for choosing 0 as a value for X is that it makes the calculations easy.


Calculation Example 2

 

Next, we will use X = 5. This means that we're going to have to at least do a little multiplication this time. Y' = 10.99.

Calculation Example 3

Finally we'll put in X = 10, and calculate that Y' = 18.89.
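If you want to verify all three calculations at once, here is a minimal sketch in Python:

    # Our regression line: Y' = 3.109 + 1.578X
    a, b = 3.109, 1.578

    for x in [0, 5, 10]:
        print(x, round(a + b * x, 2))   # 3.11, 11.0, 18.89 -- matching the
                                        # lecture's 3.109, 10.99, 18.89 up to rounding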

Distinguish data from predicted values. Remember that there is a difference between the actual observed value of Y and the Y' that we calculated. The upper table on the graphic a few paragraphs up shows the actual data, both for X and Y. The lower table on the graphic to the left contains predicted values of Y for hypothetical values of X.

Put the Predicted Values on the Scatterplot

 

Using the lower table in the graphic above, we see that if a hypothetical person smoked 0 cigarettes per day we would predict that such a person would have 3.109 health problems between the ages of 65 and 70. This person was not among our research participants; this is not a real person. We are just hypothesizing that, based on our prediction equation, this hypothetical person should have 3.109 health problems. In fact there is no such thing as 3.109 health problems; a real person could have 3 health problems or 4 health problems, but not 3.109. It would be like having 3.109 children.

The upper table shows the actual data for the real research participants. In the lower table are predicted values of Y for hypothetical values of X. It is very important to distinguish between Y (actual data) and Y' (hypothetically predicted values).

OK, let's go on. If someone smokes 5 cigarettes per day, hypothetically, we would predict s/he would have 10.99, or close to 11, health problems. And then if a person smokes 10 cigarettes a day, we would predict 18.89 health problems.

Let's put a point (round yellow dots) on the graph for the hypothetical data. We put a point at X = 0 and Y' = 3.109 for our first hypothetical case. We put a point at X = 5 and Y' = 10.99 for our second hypothetical case. And we put a point at X = 10 and Y' = 18.89. You can see these three round points on the current scatterplot illustration.

Study the scatterplot. Make sure you understand the difference between the six square points representing the actual data and the three round points representing the predictions based on the regression line we just calculated.

Now let's draw the regression line.

The Best Fit Line

We are all set now to draw a line through the three prediction points.

Draw the Best Fit Line

The current graphic shows the best fit regression line drawn through the three predictions made from our regression line. You can see that the line passes through none of the data points (squares). But the line that we've drawn is as close as possible to all the data points taken together.

Now we've added a level of sophistication to our correlational analysis. Up to this point we would draw a scatterplot and calculate a correlation coefficient. That would tell us we have a positive relationship between smoking and health problems. By adding a regression analysis, we can describe this positive relationship much more precisely. We can talk about a specific line (Y' = 3.109 + 1.578X) that relates X to Y. Now we can use the equation of that line to put in any hypothetical value of X (smoking) and predict a hypothetical Y' (health problems). This is a great gain in precision of knowledge.

CAVEAT: If there is a nonlinear relationship, this method will not produce accurate results. We are working with linear regression, which assumes a linear relationship between X and Y. It gives us a good description of a straight line drawn through all the points on the scatterplot. It's not a good description of a curvilinear relationship. So the caveats that applied to the correlation coefficient apply here also.

SUMMARY: Up to this point we have defined and reviewed linear functions. We have defined what we mean by a regression line and proposed formulas for calculating a regression line from data. Then we went through the details of an example using formulas and actually calculated a regression line. It is important that you practice these ideas with homeworks at this point so that your understanding starts to develop.

Least Squared Error. We are going to return to the idea of least squared error which we began to develop in the variance lecture. Recall that in that lecture we focused on the deviation (or difference) between an individual score and the mean. This time we will focus on a very similar deviation--the difference between the actual score and the score predicted by the regression line.

Symbols for Predicted versus Actual Values of Y. Remember that Y is the dependent (or criterion) variable. Y is being predicted from X, the predictor variable.

As a start, we have established symbols for the actual score and the predicted score. Following common conventions in statistics we will symbolize the predicted value of Y by Y' (pronounced "Y prime") or by Y with a caret over it (pronounced "Y hat"). Both "Y hat" and "Y prime" are common symbols for the predicted value of Y. In this web lecture I will use Y' (Y prime) for the predicted score because it is easier to type.

The actual score (the one measured on the research participants) will be symbolized by Y.

These two symbols, Y and Y', look similar but conceptually they're quite different. The predicted value of Y (Y') is an estimation of what Y would be if the regression line perfectly described the relationship between X and Y. The actual value of Y is the data that we obtained by our measurement operations when we went out and did the research.

 

The next graph shows that the actual values of Y are represented by the red squares with the blue outline and the predicted scores, which all fall on the regression line, are the yellow circles with black outlines.

 

Three Columns: X, Y and Y'

We are next going to be very explicit about three columns of numbers, X, Y, and Y'.

X column. The numbers in the X column are the number of cigarettes smoked per day as reported by the research participants as they look back, retrospectively, over their lives. For convenience, the X scores are ordered from lowest to highest. X is the actual data collected on the predictor variable.

Y column. Under the Y column is the number of serious health problems experienced by each participant between the ages of 65 and 70. Y is the actual data collected on the dependent or criterion variable.

Y' column. The numbers in the "Y prime" column are calculated using the regression equation. We have calculated a predicted score (Y') for every participant. For example, for the participant at the top of all three columns, we plugged X = 1 into the regression equation and got a Y' = 4.7. The value 4.7 is an overestimate since that person actually had only 3 health problems between the ages of 65 and 70. The Y' column is what I would guess the number of health problems would be for a particular person using the regression line.

Wolf in sheep's clothing. You've heard of a wolf dressing up in sheep's clothing. Well, Y' is really X dressed up in Y clothing. Y' is a transformation of X using the equation Y' = a + bX. You take an X and turn it into a Y'.

Error: A Difference that makes a Difference. Look at the Y column versus the Y' column. For each participant our guess (Y') is different than what actually happened (Y). We are now going to focus on these differences, which are called prediction errors. They're very interesting to statisticians because great insight is gained about any model through the errors it makes. Error is defined as Y minus "Y prime," and often it's represented by a little e. Error is defined as the deviation of an actual score from a predicted score.

We're back to discussing deviations again, as we did with deviations around the mean. Deviations around the mean are X - M, prediction error deviations are Y - Y'. Error deviation describes how far the actual data is from the prediction. Since the point of regression is to try to predict Y values using X values, any deviation of Y' from Y is, obviously, considered an error.

Let's focus on the research participant at the bottom of the X column. This person had an X value of 10, that is, he or she smoked 10 cigarettes a day across an entire lifetime. The actual number of health problems reported for this person was 23. We predicted 18.89 health problems from our regression equation. So in that specific case the difference between the actual score and the predicted score is 23 minus 18.89, or 4.11. Our error is 4.11. The graph shows the amount of that error visually. The deviation between the actual data and the prediction is what we mean by prediction error.

Thinking strategy. Process the idea of error in two ways. Calculate the exact value of a prediction error (e.g., 23 - 18.89 = 4.11). Then focus on the difference between the predicted score and the actual score on the graph. The former gives you a symbolic, computational understanding of error. The latter gives you a visual understanding of error. As we continue through this material, understand the ideas both ways, until the two ways of understanding are fully integrated and interchangeable in your mind. If you do this, you will learn these ideas well.

The e column

You will notice that we have added a fourth column. We have calculated e = Y - Y' for every participant. The e column lists the errors for each person.

The errors have + and - signs because some of the Y' values are overestimates and some are underestimates of the actual data.

Sum of e = 0. You'll notice that when you sum up all of the errors, they come out to be zero. In that sense the Prediction Line is like the Mean. That is, one of the characteristics of the prediction line is that the amount of error above the line is equal to the amount of error below the line. The positive errors cancel out the negative errors (within rounding error).

This characteristic is similar to the Mean where the amount of deviation above the mean is equal to the amount of deviation below the mean and consequently the sum of the deviations around the mean equals 0. Similarly, the sum of the deviations (errors) around the prediction line is zero.

Summary: Error is just a deviation of the actual score from the predicted score. The sum of these deviations, or the sum of the errors around the regression line, is conceptually zero. I say conceptually zero because with rounding error you might find in your homework data that the sum of the errors is only close to zero--but that's due to rounding errors.
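The sum-to-zero property holds for any line fit with the formulas from this lecture, not just our example. Below is a small Python check; the data arrays are made up for illustration (the lecture's full data table is not reproduced here), and statistics.correlation requires Python 3.10 or later:

    import statistics as st

    # Made-up illustrative data, not the lecture's six participants.
    xs = [1.0, 2.0, 4.0, 5.0, 8.0, 10.0]
    ys = [3.0, 6.0, 8.0, 12.0, 14.0, 23.0]

    m_x, m_y = st.mean(xs), st.mean(ys)
    s_x, s_y = st.pstdev(xs), st.pstdev(ys)   # population SDs (divide by n)
    r = st.correlation(xs, ys)                # Pearson r

    b = r * (s_y / s_x)        # slope
    a = m_y - b * m_x          # intercept

    errors = [y - (a + b * x) for x, y in zip(xs, ys)]
    print(sum(errors))         # essentially 0, up to floating point noise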

Least Squared Error

Preview. In the next section of this lecture we are going to examine the idea of Squared Error. Based on that we will develop the idea of Error Variance.

Least Squared Error. Without using calculus, we cannot show how the formulas you are using to find the intercept, a, and the slope, b, were derived. But it can be proven that those formulas (compared to any other possible formulas) produce the least amount of squared error possible.

The theory behind the mathematical derivation of the formulas you learned for the predicted line (Y' = a + bX) assures us that the line we calculate with those formulas generates the least possible amount of squared error.

Why Least Squared Error is Desirable

Review: In the Variance Lecture we learned that variance is the average squared deviation around the mean. Then we took some time to develop why it is useful for the variance formula to be based on squared deviations. Toward the end of the variance topic, we talked about the idea that all scientific measurement operations produce some degree of measurement error. While we are willing to accept errors in measurement, we prefer that they be small and we are concerned if they are large. So squaring errors (deviations) is useful because the squaring function is sensitive to large errors. Squaring amplifies large errors much more than it does small errors.

If I were building a table and measured the length of one of its legs, I would not think that I could measure it exactly. Some engineer or physicist could measure it with more precision. All measurements produce some kind of error. However, I am not very concerned if my error is small. If I'm one ten thousandth of an inch off, it's not going to make much difference for building a table. But if I'm a long way off, if I'm two inches off, that will make a difference in the final product. A two inch difference in the length of the legs will make the table wobbly.

The same is true when we're measuring people's personality, intelligence, aggressiveness, or memory in psychology. We assume small errors don't really matter but big errors do. Metaphorically, big measurement errors will create wobbly theoretical constructs.

Prediction Error. Prediction works the same way. We accept that our predictions won't be exact, there will be some degree of error. But how much error is the question. We are much less concerned about small errors in prediction than large ones.

Least Squared Error. The basic idea of least squared error is that, based on the wobbly table argument, we would like procedures which, while they may produce small errors, minimize large errors. And large errors are amplified by the squaring function more than are small errors. Therefore any procedure that gives us the least amount of squared error gives us what we want.

The regression formulas for the intercept, a, and the slope, b, give us the line that has the least squared error around it. That is, they produce the least amount of squared error in our predictions.

 

Squared Error Column

The graphic on the right adds a fifth column of numbers, squared errors. To get this column, calculate errors then square them.

Prediction Errors Sum to 0. Notice that the error column sums very close to zero. There is a bit of inaccuracy in the sum of the errors (0.01) due to rounding.

Mean error. The Mean of any set of numbers is just the sum of those numbers divided by the number of numbers. So if the sum of the errors is conceptually 0, then the average error must be conceptually 0. Therefore our regression line has a very nice characteristic--its average error is 0. That means it's not biased toward overestimating nor toward underestimating; the overestimates exactly cancel out the underestimates.

Sum of Squared Errors. Look at the fifth column of numbers, the column of squared errors. In your notes complete the column where it says "etc., etc." Confirm that the sum of the squared errors is 108.5707.

It is useful to keep in mind that errors are deviations of actual scores, Y, from predicted scores, Y'.

Review the Formula for Variance

If we want to calculate error variance, then we will have to be clear about what formula to use to calculate it.

Think about the formulas for variance. Can you remember them just from using them?

Variance Formula when M = 0

On the left is the definitional formula for the variance: Sum of (X - M) squared, divided by n. On the right is the computational formula: [Sum of (X squared) minus (Sum of X) squared over n], divided by n.
It doesn't matter which of the two formulas we use; it comes out the same.

Sum(X squared)/n. Look at both formulas above. Notice that when M = 0, they both reduce to the same thing. (When M = 0, the Sum of X is also 0.) They both reduce to Sum of (X squared) divided by n. Write down in your notes the formula for the variance when M = 0.

Constructing a formula for Error Variance

We want to calculate the variance of the error scores. The Mean error = 0. So let's just change the X's to e's in the last formula we made up. Then we will have our formula for calculating error variance: Sum of (e squared) divided by n.

Prediction Error Variance

The idea. I want to make sure that the concepts behind the calculations on the graphic are really clear. What we did was make predictions about Y from X. Those predictions are inevitably wrong--that's where error comes in. We have already discovered that the Mean error around the regression line is 0. Now we are going to find the variance of the error scores.

The formula. Across the top of the graphic is a formula for prediction error variance. When you put Me = 0 into it, it boils down to the formula we just made up. So to get prediction error variance, we just have to divide the sum of the squared errors by n.

Go ahead and substitute the numbers in the formula and calculate error variance.

Error Variance

The error variance is 108.5707 divided by 6, which equals about 18.1.
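If you'd like to verify this division, here is a one-line check in Python:

    # Error variance = (sum of squared errors) / n
    print(108.5707 / 6)   # 18.0951..., which the lecture rounds to 18.1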

Summary: Prediction Error and Prediction Error Variance

Prediction and error. In general, when humans make predictions about the world, they are not surprised if their predictions are not exactly spot on. Put another way, we generally assume our predictions will have some amount of error in them. But how much? A little error in our predictions may not make much difference but a large error in our predictions could be disastrous.

e = Y - Y'. Regression analysis is a formal prediction procedure. It is a way of predicting from the values of one variable (X) what the values of another variable (Y) will be. We expect the regression equation will generate errors. But we want to have some measure of whether those errors are large or small. Prediction error is a deviation: e = Y - Y'. So an error is a deviation of the actual score from the predicted score. The sum of these deviations is 0. So on the average, the errors around the regression line are 0. But how spread out are they? A little? A lot? Variance measures how spread out things are by giving us the average squared deviation.

Prediction error variance. Error Variance is the average squared deviation of the actual data around the predictions. Error variance measures how spread out the errors are around the regression line. When you think about it, a good predictor would have a low error variance which would indicate that the average squared deviation of the actual data away from the predictions is small. If you have small error variance, then that means that all of the data is clustered pretty tightly around the regression line because all of the e's would be small. In a similar way, a poor predictor would have a relatively large error variance.

In the next section we will examine these ideas in more depth.

Explained and Unexplained Variance

Next, we will discuss the concepts of explained variance and unexplained variance in regression. (Unexplained variance is exactly the same concept we just defined as error variance.)
Example

Remember from our example that the predictor variable was number of cigarettes smoked per day between ages 20 and 50 and the criterion variable (or dependent variable) was the number of health problems between the ages of 65 and 70.

 

Background Statistics

The graphic at the right reviews the statistical results we've calculated for our example so far.

Variability. We found the standard deviation of Y to be 6.68331 and the variance of Y to be 44.66667.

X, Y Correlation. We also found the correlation coefficient between cigarettes and health problems to be r = .77119.

Regression line. Finally, we found the regression line to be Y' = 3.109 + 1.578X.

Total Variance in Y. In the context of what we are now learning, and for emphasis, we will call the variance of Y "the total variance of Y." This is because we are about to break the total variance of Y into two parts--explained variance and unexplained variance. Before we break the total variance of Y up into parts, let's make sure we are clear about what the variance of Y is.

Total Variance of Y

Y is the number of health problems a person has between the ages of 65 and 70. Clearly, the number of health problems in the age range from 65 to 70 varies from person to person. One individual will have a different number of health problems than another person. So in a large group of people there will be a variability, spread-out-ness, in the number of health problems. We measure variability with variance.

To find the variance of Y, you subtract the mean of Y from each Y score (Y - My), then you square each of these deviations, sum them, and average them. The total variance of Y can be broken into two parts, explained variance and unexplained variance.
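Stated as code, the total variance of Y is computed like this (a generic sketch; the Y scores below are placeholders, not the lecture's data):

    # Placeholder Y scores, for illustration only.
    ys  = [2.0, 5.0, 9.0, 11.0, 15.0, 21.0]
    m_y = sum(ys) / len(ys)

    # Total variance: average squared deviation of Y around its mean.
    total_var_y = sum((y - m_y) ** 2 for y in ys) / len(ys)
    print(total_var_y)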

Explained Variance Conceptually. Explained Variance is that part of the variance in health problems that is predictable from, or explained by, how much or how little a person smokes. In other words, some health problems are related to smoking. So we can explain (to a certain degree) why the number of health problems varies from person to person simply by knowing how many cigarettes each person smoked per day.

Unexplained Variance Conceptually. Clearly not all health problems are related to cigarettes. People who never smoke still have health problems. Obviously there are other factors that affect health besides cigarette smoking. So not all health problems (Y) are related to cigarettes smoked (X). Therefore, some of the variance in Y is not predictable from, or is unexplained by, cigarette smoking.

Conceptually, we look at our group of research volunteers and see that they differ a great deal in the number of health problems they have between 65 and 70. Some of these problems we can explain by their cigarette smoking lifestyle, and others of them, we can't explain by their cigarette smoking lifestyle. Other factors, besides smoking, also affect their health.

A visualization. In order to visualize this kind of logic, it's very typical in statistics to combine symbolic logic--formulas-- with graphic representations to help keep track of what's going on. A very typical visualization in this case is to have total variance be represented by a circle. In the graphic above, the total variance is represented by the entire circle; it has a value of 44.67.

Calculating Explained & Unexplained Variance

The explained variance can be calculated very simply by r squared times the total variance of Y.

In this particular example, r squared is .59473. If we multiply that times the total variance, which is 44.66667, we get an explained variance of 26.57.

The formula for unexplained variance is (one minus r squared) times the total variance of Y.

In the example, one minus r squared is .40527. We multiply this times the total variance, which is 44.66667, giving a result of 18.10 for the unexplained variance.
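Here is a quick numeric check of both formulas in Python, using the r and total variance reported above:

    r, total_var_y = 0.77119, 44.66667

    explained   = r**2 * total_var_y         # about 26.57
    unexplained = (1 - r**2) * total_var_y   # about 18.10
    print(explained, unexplained)
    print(explained + unexplained)           # the parts sum back to 44.67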

 

Dividing the Circle

Visually, we represent the total Y variance (44.67) by the full circle. We divide the circle into two parts--explained variance (26.57) and unexplained variance (18.10).

The two parts of the circle add up to the entire area of the circle. Adding the explained and unexplained variance together will give you the total variance.

The circle is drawn poorly. The large part of the circle (26.57) is given a smaller portion of the pie than the smaller part (18.10). In your notes you may want to do a better job of it. (This conceptual inaccuracy will continue in the picture until the end of this lecture.)

 

Proportion of Variance Accounted For

The proportion of the explained variance is called proportion of variance accounted for. It is an important concept in regression. The proportion of variance accounted for is found by dividing the explained variance by the total variance.

It's no more difficult than if you have a bushel basket with 20 pieces of fruit in it, both apples and oranges. By counting you discover that there are 15 oranges and 5 apples. If you divide 15 by 20, you get the proportion of oranges. 15 divided by 20 gives .75, so the proportion of pieces of fruit that are oranges is .75. We are following exactly the same logic here in calculating the proportion of variance accounted for.

%. People often use percentages to describe this concept. If you multiply the proportion of variance accounted for by 100, then you get about 59.5%. People go back and forth between calling this percent of variance accounted for versus proportion of variance accounted for. Both are common ways of phrasing things in regression analysis.

Coefficient of Determination

One interesting fact is that you don't actually have to use the variance to find the proportion of variance accounted for. You simply need to square r. r squared gives you the proportion of variance accounted for in Y by X.

This is simple to calculate. If someone asks you what the proportion of variance accounted for is, all you have to do is square the correlation coefficient.

r squared is sometimes called the coefficient of determination.

Obviously, the two methods of finding the proportion of variance accounted for are mathematically the same. If you care to, you can probably prove this for yourself just by laying out the various formulas we just learned and looking for what cancels.
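If you do lay the formulas out, the cancellation looks like this:

    proportion of variance accounted for = explained variance / total variance
                                         = (r squared x total variance) / total variance
                                         = r squared

In our example, r squared = .77119 squared = .59473, which (within rounding) is the same answer as dividing the explained variance (26.57) by the total variance (44.67).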

Unexplained Variance

You can probably guess that the proportion of unexplained variance will be calculated in a similar way. We simply take the unexplained variance and divide it by the total variance which gives the proportion of the circle that's in the unexplained part.

In this case, the unexplained variance (18.10) divided by total variance (44.67) gives us a proportion of variance not accounted for equal to .405.

There is a simple way to calculate this proportion too--(one minus r squared). In this case (one minus r squared) is (1 - .59473), which gives us .40527, or about .405. The proportion of variance NOT accounted for is .405.

On exams and homeworks you will need to know how to calculate the proportion of variance accounted for (or not accounted for) using both methods--that is, 1) using r squared and 2) using the variance formulas.

Note once again that unexplained variance is the same thing as error variance.

Integration

We will use the idea of deviation to integrate all the ideas we have talked about in the Regression Lecture. A great deal of statistical theory is based on deviations. So looking at deviations is a relatively easy way to understand what is going on. Just read this integration section to get the main ideas; they will help you bring the whole topic together into a well-formed gestalt.

Three Deviations

Y. For this discussion, let's make up an example data point for a single person. That person's data is shown on the graph as a red square with blue trim. Y represents the number of health problems reported by this one person. The Y value for the red square shown on the graph looks to be about 21 health problems. So for this discussion let's say we have a person who has Y = 21. (The X score looks to be about 6.)

Y'. The black line on the graph is the regression line for predicting Y from X. The predicted score (Y') for our example person is shown as a yellow circle on the regression line. Let's say that the predicted number of health problems is 17. That is, Y' = 17. (So for a person who has X = 6, our regression equation predicts 17 health problems.)

M. The red line on the graph represents the Mean of all the Y scores. It is the average number of health problems in our sample of volunteers. In our sample M = 11.

A First Deviation: (Y - M). As you can see from the graph, our person's Y score deviates from the Mean. For the single case we have made up, Y - M = 21 - 11 = 10.

A Second Deviation (Y - Y'). Also notice that the actual score (Y) deviates from the predicted score (Y'). This deviation is what we have called error. e = Y - Y' = 21 - 17 = 4.

A Third Deviation (Y' - M). Also notice that the predicted score (Y') deviates from the Mean of Y. Y' - M = 17 - 11 = 6.

Partitioning deviations. This is a very simple proof, and looking it over will give you some real insight. Notice that the second and third deviations add up to the first deviation. That is, 4 + 6 = 10. Algebraically this is true in general.
Set (Y - M) = (Y - Y') + (Y' - M).
Remove the parentheses ==> Y - M = Y - Y' + Y' - M.
Cancel the +Y' and -Y' ==> Y - M = Y - M.

It is quite generally true that Y - M = (Y - Y') + (Y' - M). In our example: 21 - 11 = (21 - 17) + (17 - 11). We have broken (partitioned) the total deviation of Y from its Mean into two parts. We will now go on to examine how these two parts generate unexplained (error) variance and explained variance.
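You can check the partition for our made-up data point in one line of Python:

    # Total deviation = error deviation + explained deviation
    y, y_prime, m = 21, 17, 11
    print(y - m, (y - y_prime) + (y_prime - m))   # both print 10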

Three Variances

The Concept of Variance. Remember that variance is the average squared deviation. Conceptually then, variance is based on deviation. So we are going to conceptualize the three deviations we just talked about as three variances.

Total Variance. The total variance in Y is generated by the total deviations of the Y scores from their Mean. It is based on the sum of all the squared Y - M deviations. You don't need to know anything about regression or correlation to understand the total variance which is based on the squared deviation of an actual score from the mean of Y. On the final graph, the deviation of Y from the Mean is shown by the thick blue bracket.

Error (Unexplained) Variance. We have broken the total distance between the Y score and the Mean into two parts. The first part is Y - Y' which we've argued is prediction error. That is, Y - Y' = e. When we find the variance of these error scores we get error variance.

Explained Variance (Variance accounted for). The third deviation is Y' - M. Look at the graphic; notice that Y' (circle) is closer to Y (square) than is the Mean (red line). In terms of the example, when a person smokes 6 cigarettes per day we predict 17 health problems. The actual number of health problems for this person is 21, and the Mean number of health problems is 11. So by predicting 17 we are a lot closer to the data (21) than the Mean is. If we simply used the mean to predict Y, we would be farther off than if we use the regression equation.

On the average, our predicted score (Y') is closer to the actual data (Y) than is the Mean. We have gained something by predicting Y from X. What we have gained is that, overall, the predicted scores are closer to the data than the Mean is. So we have explained a part of the total variance in Y.

Note: Explained Variance is conceptually based on the sum of the squared deviations of our predictions around the Mean. A note of caution: We have not actually calculated Explained Variance as the sum of the squared Y' - M deviations divided by n. This is the first I've mentioned it. We have used r squared times the total variance of Y as a way to calculate Explained Variance. That's the way you should do it. What we are doing here is integrating all the ideas.


Summary: Partitioning the Variance. In summary, we have broken up (partitioned) the total variance into two parts--Unexplained (error) Variance and Explained Variance. Any variance of any kind is based on deviation. The sources of these three variances can be seen clearly on the graph as the three deviations shown by brackets. (Y-M) = (Y-Y') + (Y'-M). In the specific data point example we have made up:
(21-11) = (21-17) + (17-11).

Thinking strategy. As you study, keep these last two graphics in mind while you do your calculations and use the formulas. These graphics will act as a road map. You will easily learn the conceptual terrain if you keep track of where you are on the map as you solve problems and calculate answers.

Don't worry about details, nor about the long train of thought involved in developing these concepts. Go and practice and learn by doing homework. As you do the homework, keep the visual map (last two graphics) in mind. The goal is to form a simple whole (gestalt) for yourself that integrates all this new material into a coherent understanding.