
Regression statistics expand on correlation to allow
us to use relationships between variables to make predictions. They
provide us with tools to write linear equations which can be used
to predict the value of a dependent or criterion variable from the
value of one or a set of predictor variables.
Linear Functions
Before we start talking about regression, we are first going to talk about
the general topic of linear functions, which some of you may remember
from earlier math classes.
Linear Functions Form
The general form of a simple linear function
is the equation Y = a + bX. This equation describes any straight
line. The slope of the line is represented by the letter b.
The intercept is represented in the equation by the letter
a. The value of the
intercept, a, is where the line crosses the Y axis. By arbitrarily
picking values for X and using this formula we can determine the
corresponding values of Y and therefore draw the line.
Linear Function Example 1
In the blue box at the bottom of the illustration
is our equation for a particular line, Y = 2X + 1. From this equation
we can determine how the line will appear on a graph. I'll call
this graph a "scatterplot" for reasons that we discussed
in the lecture on correlation.
Note: In this example, I have reversed the order
of the parameters on the right side of the equation, but it is essentially
the same. That is, (Y = a + bX) is the same mathematically as (Y
= bX + a).
In this example, the slope b = 2 and the intercept
a = 1. The next step is to arbitrarily pick at least three X values.
For this example I picked the X values of 0, 1, and 2. Using the
equation (Y = 2X + 1) I can determine that when X = 0 then Y = 1.
When X = 1 then Y = 3 and when X = 2 then Y = 5. Try these out for
yourself and perhaps try some other values for X as well. Notice
that when you draw a line through these points the line crosses
the Y axis right at the value of 1. The value "1" is the
intercept.
Notice that high values
of X give you high values of Y. Later, we'll see that a positive
slope corresponds to the idea of a positive relationship in correlation.
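If you would like to check these values with a few lines of code, here is a minimal sketch in Python. The function name `linear` is made up for this illustration; it simply applies Y = a + bX.

```python
def linear(x, a, b):
    """Return Y for the straight line Y = a + b*X."""
    return a + b * x

# Example 1 from the text: slope b = 2, intercept a = 1.
xs = [0, 1, 2]
ys = [linear(x, a=1, b=2) for x in xs]
print(list(zip(xs, ys)))  # [(0, 1), (1, 3), (2, 5)]
```

Try other values of X, or change a and b, to see how the points move.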
Linear Function Example 2
Here's another example. For this line the
equation is Y = .5X - 1. This equation looks a little different
because the value of the Y-intercept is -1. That means the line
will cross the Y axis at -1. The form of our equation remains the
same (Y = bX + a), but if you remember your high school algebra, the
equation Y = .5X + (-1) is the same as Y = .5X - 1. This second form
is just a simpler way of writing the equation if the value of a
is less than 0.
Here I arbitrarily picked the X values of
0, 1, 2, and 3. Then, using the equation Y = .5X - 1, I determined
that the associated Y values would be -1, -.5, 0, and .5. Why don't
you try this and check my calculations?
Compare Example 1
and Example 2. What is the effect of changing the slope from
2 (in Example 1) to .5 (in Example 2)? The line with the lower
slope (.5) is less steep than the line with the higher slope.
What is the effect of changing the intercept,
a, from 1 to -1? Notice that the line cuts the Y axis in each case
exactly at the value of "a."
Linear Function Example 3

For this line the equation is Y = -1.5x + 2. Here we have a negative
value (-1.5) for the slope and the intercept has a value of 2. I
picked the X values of 0, 1, and 2 again and then used the equation
to determine the Y values.
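The Y values for this example are not listed in the text, so here is a short Python sketch that works them out from the equation. The variable names are only for illustration.

```python
# Example 3 from the text: Y = -1.5X + 2
slope, intercept = -1.5, 2
points = [(x, slope * x + intercept) for x in (0, 1, 2)]
print(points)  # [(0, 2.0), (1, 0.5), (2, -1.0)]
```

As X goes up from 0 to 2, Y goes down from 2 to -1, which is what a negative slope means.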
Compare: What is the effect of making
the slope negative? The line slopes the other way. Notice
that high values of X now give you low values of Y. This
is true whenever the slope is negative. We'll see later that a negative
slope corresponds to the idea of a negative relationship in correlation.
When the Slope (b) = 0 -- No Relationship

What does the line look like if the slope is 0? It will always
be a flat horizontal line. In this example (Y = 3 + 0X), regardless of
what value you put in for X, Y will always equal 3.
Later, we'll see that a slope of 0 corresponds to the idea of no
relationship in correlation.
Y = X
Y = 0 + 1x or simply Y = X is the line
that starts at the origin (0, 0) and goes up at a 45 degree angle.
By picking the X values of 0, 1, and 2 and using the formula you
can determine that the Y values are also 0, 1, and 2, respectively.

Regression Line
In statistics, when we want to predict or estimate
one variable, Y, from a second variable, X, we use a procedure called
"regression." The "regression line" is the linear
function we use to make this prediction. If that doesn't make much
sense to you at this point, it's OK. We'll spend a lot of time learning
this concept.
NOTATION: When we
talk about predicted or estimated values of variable Y, we generally
use some symbol like a Y with a little caret or a little hat on
the top of it (^), or we use Y prime (Y').
In this class, we will use Y' because it's a lot easier to type
an apostrophe for prime than it is to draw one of those little hats
in HTML at this time. But in statistics books you'll see several
different notations.
In words we'll say "Y
prime is equal to a plus bX."
In symbols, we'll write Y' = a + bX.
Obviously (except for the prime) Y' = a + bX is
very similar to the linear function that we just reviewed.
When you are predicting or estimating values
of Y from X, Y is called the criterion variable,
and X is called the predictor variable.
The criterion variable is often called the dependent variable.

Cigarettes and Health Example

For the purposes of our lecture today we are going
to use a made-up example which examines the relationship between
cigarettes and health. So, Y might be the number of health problems
experienced by an individual between the ages of 65 and 70; and
X might be the number of cigarettes he or she smoked per day from
the age of 20 until the age of 50. We want to predict Y from X,
that is we want to estimate the number of health problems later
in life from the number of cigarettes smoked earlier in life. In
statistical jargon, we will find the regression line, Y' = a + bX.
Health Problems and Smoking
Measurement Operations:
Translating a life into the number of cigarettes smoked per day.
There are two general methodologies used in such studies. In a RETROSPECTIVE
STUDY, we would ask research participants to review their life and
report how many cigarettes they smoked per day between the ages
of 20 and 50. In a PROSPECTIVE STUDY, we would track people across
their lifetime, asking them to record the number of cigarettes they
smoke per day. The retrospective study could be done in a few months.
The prospective study would take many years. The data from a prospective
study is generally of higher quality because it doesn't rely on the subjects'
memory. Either way, the number of cigarettes smoked per day is X;
it will be our predictor variable.
Then we're going to predict the number of
health problems a participant has between the ages of 65 and 70
from how much they smoked.
Let's say we do a retrospective study. We
examine the medical records of participants when they were between
65 and 70 years old, counting the number of health problems they
had. Then we give them a questionnaire on how much they've smoked
at different times in their life. We want to predict health problems
from smoking rate.
Regression Line
Let's say that we're going to have a tiny little
sample. Usually there are thousands of people in such studies, but
we're just going to have a few so that our calculations will be
simple. The data is made up.
On the illustration the data is ordered in the
table on the left from the lowest X value to the highest; that is,
it goes from the least number of cigarettes smoked to the highest.
So the person who smoked one per day had three health problems;
the person who smoked two packs had ten health problems, and so
on.
The table contains the data of individual participants
and we've measured two things about each of them. In the general
scheme of methodology, a regression study is still a correlational
study.
Next we're going to draw a scatterplot. Each dot
on the scatterplot represents the data of one person. Perhaps by
now you've got enough experience with scatterplots from studying
correlations to know that this scatterplot shows a positive relationship.
The more smoking the more health problems. The scatterplot shows
a pretty high correlation.
We have a scatterplot; but the question is how
do we find the linear regression line? How can we draw a line that
goes as close as possible to all the points on the graph?
Regression Line
Describing the relationship
between X and Y. Little r is one descriptive statistic which
can summarize the relationship between cigarettes and health problems
in this data. You've studied correlation
and know how to calculate r.
There's another descriptive statistic called the
regression line. The regression line
is the best straight line that we can draw through or between the
points on the scatterplot. Obviously a straight line can't connect
all the dots because then you'd have to bounce up and down and up
and down from one dot to the next, and it wouldn't be a straight
line. So we want to be able to draw a single line that comes as
close to all the dots as possible.
Least Squares Principle.
We'll have to have a criterion for what we mean by "close."
The criterion is called the least squares principle. Recall
our discussion of variance. We showed how the variance squares
the deviations around the mean. In regression we will square the
deviations around the regression line instead of around the mean.
The best fit regression line is the line that has the smallest sum
of squared deviations around it, the least squared deviations.
That's essentially the whole idea of least squares. But we'll talk
about it more later after you are more familiar with these ideas.
So how do we find the line that best fits through
all these points?
Regression Equation Formula
When we reviewed linear functions, we described
equations of the type, Y = .75X + 3. Then we put in values of X,
calculated values of Y, and drew the line on a graph. But what if
we don't know the values of the parameters a and b?
In Regression analysis we don't know a and
b. We have to calculate a value of a and a value of b from the sample
data. We're going to use the data
to calculate the slope and the intercept of the regression line.
The current graphic shows the formulas for
calculating the slope and the intercept from the data. Our estimated
value of Y will be found through the equation Y' = a + bX. The yellow
box on the left shows the formulas for a and b.
As you can see, the intercept, a, is equal
to the Mean of Y minus b times the Mean of X.
The slope, b, is equal to the correlation
coefficient, r, times the standard deviation of Y divided by the
standard deviation of X.
Write these formulas down, and in the next
graphic we will begin calculating a and b.
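Here is a minimal sketch of these two formulas in Python, b = r(sy/sx) and a = My - b(Mx). The function name `fit_line` and the small data set are made up for this illustration; they are not the lecture's sample. The sketch assumes standard deviations that divide by n, and computes r from the average product of deviations.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Return (a, b) for Y' = a + b*X using b = r*(sy/sx) and a = My - b*Mx."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)  # standard deviations that divide by n
    # Pearson r: average product of deviations divided by (sx * sy)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    r = cov / (sx * sy)
    b = r * sy / sx      # slope
    a = my - b * mx      # intercept
    return a, b

# Hypothetical data, NOT the lecture's sample.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.2 0.6
```

Notice that the intercept is computed from the slope, so b has to be found first.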
NOTE: You can, if you want, reverse what
you call the predictor variable and the criterion variable. That
is, you can reverse X and Y so that you predict X from Y instead
of Y from X. In this example, that means you can predict the number
of cigarettes smoked (X) from the number of health problems (Y).
The pink box in the lower right of the graphic shows the formulas
for predicting X from Y. We won't do that sort of reversal in this
class, so you don't need those formulas. They are there just for
your information.
Find the Slope from the Data
These are pretty easy formulas if you've
already got the means, the standard deviations and the correlation
coefficient. There is a lot of rounding error with these formulas,
so it is best to carry out your calculations to several significant
digits. In our example, you'll notice that I carried the numbers
out to 5 decimals.
The current graphic shows all the relevant
statistics that we need for our example.
The slope, b,
is equal to the correlation coefficient times the standard deviation
of Y over the standard deviation of X [b = rxy (sy/sx)].
As you can see, substituting into that equation gives you a slope
of +1.578.
Find the Intercept from the Data

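Let's go ahead and calculate the intercept, a.
The intercept is equal to the mean of Y (11) minus the slope (1.578)
times the mean of X (5). With the slope rounded to 1.578 the arithmetic
gives 3.11; carrying the slope to more decimals gives +3.109.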
Substitute the Values into the Regression Equation
Through the formulas we have just calculated,
the data has told us that the best fitting regression line between
health (Y) and smoking (X) has a = 3.109 and
b = 1.578.
The general form for the regression line is Y' = a + bX.
Let's substitute the values of a and b we just
found into that general equation. The substitution gives us
Y' = 3.109 + 1.578X. That's our regression line.
Y' = 3.109 + 1.578X describes the relationship
between health problems (Y) and smoking (X). At this point in the
course we will consider it another descriptive statistic. The regression
line describes very precisely the relationship between two variables.
Let's go on to graph this equation and see how
it fits into our data.
Back to the data and scatterplot:
In the graphic you can see that the data is listed in the upper
table as it was on previous graphics. The data is also drawn on
the scatterplot as it was before.
How do we draw the regression
line on the scatterplot? What we're going to do is create
a second table in the lower right of the graphic. In that table
we will put in arbitrary values of X and then calculate predicted
values of Y, that is, we will calculate a Y' for every value of
X that we arbitrarily choose.
Calculate some Estimated Values (Y')
Now let's put in a few values of
X into the equation. It doesn't matter what values so let's use
0, 5, and 10. Now what we've got to do is use the regression equation
to calculate a predicted value for each of these values of X. As
practice you may want to use the equation to calculate these estimated
values of Y before you go on.
Calculation Example 1

Here's a specific example. If X is 0 and we put 0 into the regression
equation, the predicted value (Y') is 3.109.
One reason for choosing 0 as a value for X is that it makes the
calculations easy.
Calculation Example 2

Next, we will use X = 5. This means that we're going to have to
at least do a little multiplication this time. Y' = 10.99.
Calculation Example 3

Finally we'll put in X = 10, and calculate that Y' = 18.89.
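If you want to check the three calculation examples, here is a minimal Python sketch that plugs X = 0, 5, and 10 into the regression equation. The function name `predict` is made up for this illustration; the coefficients are the rounded values a = 3.109 and b = 1.578 from the lecture.

```python
A, B = 3.109, 1.578  # intercept and slope from the lecture (rounded)

def predict(x):
    """Predicted number of health problems (Y') for x cigarettes per day."""
    return A + B * x

for x in (0, 5, 10):
    print(x, round(predict(x), 3))
```

With these rounded coefficients, X = 5 comes out at 10.999, which the lecture reports as 10.99 and describes as close to 11. X = 10 comes out at 18.889, or 18.89 to two decimals.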
Distinguish data from predicted values.
Remember that there is a difference between the actual observed
value of Y and the Y' that we calculated. The upper table on the
graphic a few paragraphs up shows the actual data, both for X and
Y. The lower table on the graphic to the
left contains predicted values of Y for hypothetical values of X.
Put the Predicted Values on the Scatterplot
Using the lower table in the graphic above,
we see that if a hypothetical person
smoked 0 cigarettes per day we would predict
that such a person would have 3.109
health problems between the ages of 65 and 70. This person
was not among our research participants; this is not a real person.
We are just hypothesizing that, based on our prediction equation,
this hypothetical person should have 3.109 health problems. In fact
there is no such thing as 3.109 health problems; a real person could
have 3 health problems or 4 health problems, but not 3.109. It would
be like having 3.109 children.
The upper table shows the actual data for
the real research participants. In the lower table are predicted
values of Y for hypothetical values of X. It is very important to
distinguish between Y (actual data) and Y' (hypothetically predicted
values).
OK, let's go on. If
someone smokes 5 cigarettes per day, hypothetically, we would predict
s/he would have 10.99, or close to 11, health problems. And then
if a person smokes 10 cigarettes a day, we would predict 18.89 health
problems.
Let's put points (round yellow dots) on
the graph for the hypothetical data. We put a point at X = 0 and
Y' = 3.109 for our first hypothetical case. We put a point at X
= 5 and Y' = 10.99 for our second hypothetical case. And we put
a point at X = 10 and Y' = 18.89. You can see these three round
points on the current scatterplot illustration.
Study the scatterplot. Make sure you understand
the difference between the six square points representing the actual
data and the three round points representing the predictions based
on the regression line we just calculated.
Now let's draw the regression line.
The Best Fit Line
We are all set now to draw a line through the
three prediction points.
Draw the Best Fit Line
The current graphic shows the best fit regression
line drawn through the three predictions made from our regression
line. You can see that the line passes through none of the data
points (squares). But the line that we've drawn is as close as possible
to all the data points taken together.
Now we've added a level of sophistication to our
correlational analysis. Up to this point we would draw a scatterplot
and calculate a correlation coefficient. That would tell us we
have a positive relationship between smoking and health problems.
By adding a regression analysis, we can describe this positive relationship
much more precisely. We can talk about a specific line (Y' = 3.109
+ 1.578X) that relates X to Y. Now we can use the equation of that
line to put in any hypothetical value of X (smoking) and predict
a hypothetical Y' (health problems). This is a great gain in precision
of knowledge.
CAVEAT: if there is a nonlinear relationship,
this method will not produce accurate results. We are working with
linear regression which assumes a linear relationship between X
and Y. It gives us a good description of a straight line drawn through
all the points on the scatterplot. It's not a good description of
a curvilinear relationship. So the caveats that applied to the correlation
coefficient apply here also.
SUMMARY: Up to this point we have
defined and reviewed linear functions. We have defined what we mean
by a regression line and proposed formulas for calculating a regression
line from data. Then we went through the details of an example using
formulas and actually calculated a regression line. It is important
that you practice these ideas with homework problems at this point so that
your understanding starts to develop.

Least Squared Error.
We are going to return to the idea of least squared error which
we began to develop in the variance lecture. Recall that in that
lecture we focused on the deviation (or difference) between an individual
score and the mean. This time we will focus on a very similar deviation--the
difference between the actual score and the score predicted by the
regression line.
Symbols for Predicted versus Actual Values of Y
Remember that Y is the dependent (or
criterion) variable. Y is being predicted from X, the predictor
variable.
As a start, we have established symbols for the
actual score and the predicted score. Following common conventions
in statistics we will symbolize the predicted value of Y by Y' (pronounced
"Y prime") or by Y with a caret over it (pronounced "Y
hat"). Both "Y hat" and "Y prime" are common
symbols for the predicted value of Y. In this web lecture I will
use Y' (Y prime) for the predicted score because it is easier
to type.
The actual score (the one measured on the research
participants) will be symbolized by Y.
These two symbols, Y and
Y', look similar but conceptually they're quite different. The predicted
value of Y (Y') is an estimation of what Y would be if
the regression line perfectly described the relationship between
X and Y. The actual value of Y is
the data that we obtained by our measurement operations when we
went out and did the research.
The next graph shows that the actual values of
Y are represented by the red squares with the blue outline
and the predicted scores, which all fall on the regression line,
are the yellow circles with black outlines.
Three Columns: X, Y and Y'
We are next going to be very explicit about three columns of numbers,
X, Y, and Y'.
X column. The
numbers in the X column are the number of cigarettes smoked per
day as reported by the research participants as they look back,
retrospectively, over their lives. For convenience, the X scores
are ordered from lowest to highest. X is the actual data
collected on the predictor variable.
Y column. Under
the Y column is the number of serious health problems experienced
by each participant between the ages of 65 and 70. Y is the actual
data collected on the dependent or criterion variable.
Y' column.
The numbers in the "Y prime" column are calculated using
the regression equation. We have calculated a predicted score (Y')
for every participant. For example, for the participant at the top
of all three columns, we plugged X = 1 into the regression equation
and got Y' = 4.7. The value 4.7 is an overestimate since that
person actually had only 3 health problems between the ages of 65
and 70. The Y' column is what I would guess
the number of health problems would be for a particular person using
the regression line.
Wolf in sheep's clothing. You've heard of a wolf dressing
up in sheep's clothing. Well, Y' is really X dressed up in Y clothing.
Y' is a transformation of X using the equation Y' = a + bX.
You take an X and turn it into a Y'.
Error: A Difference that makes a Difference.
Look at the Y column versus the Y' column. For each participant
our guess (Y') is different than what actually happened (Y). We
are now going to focus on these differences, which are called prediction
errors. They're very interesting to statisticians because great
insight is gained about any model through the errors it makes. Error
is defined as Y minus "Y prime," and often it's represented
by a little e. Error is defined as the deviation of
an actual score from a predicted score.
We're back to discussing deviations again, as we did with deviations
around the mean. Deviations around the mean are X - M, prediction
error deviations are Y - Y'. Error deviation describes how
far the actual data is from the prediction. Since the point of regression
is to try to predict Y values using X values, any deviation of Y'
from Y is, obviously, considered an error.
Let's focus on the research participant at the bottom of the X column.
This person had an X value of 10, that is, he or she smoked 10 cigarettes
a day across an entire lifetime. The actual number of health problems
reported for this person was 23. We predicted 18.89 health problems
from our regression equation. So in that specific case the difference
between the actual score and the predicted score is 23 minus 18.89,
or 4.11. Our error is 4.11. The graph shows the amount of that error
visually. The deviation between the actual data and the prediction
is what we mean by prediction error.
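The subtraction for this one participant can be checked with a couple of lines of Python. The variable names are only for illustration.

```python
actual = 23        # observed Y for the participant with X = 10
predicted = 18.89  # Y' from the regression line for X = 10
error = actual - predicted  # e = Y - Y'
print(round(error, 2))  # 4.11
```

The error is positive, which means the regression line underestimated this person's health problems.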
Thinking strategy. Process the idea
of error in two ways. Calculate the exact value of a prediction
error (e.g., 23 - 18.89 = 4.11). Then focus on the difference between
the predicted score and the actual score on the graph. The former gives
you a symbolic, computational understanding of error. The latter
gives you a visual understanding of error. As we continue through
this material, understand the ideas both ways, until the two ways
of understanding are fully integrated and interchangeable in your
mind. If you do this, you will learn these ideas well.
The e column
You will notice that we have added a fourth column.
We have calculated e = Y - Y' for every participant. The
e column lists the errors for each person.
The errors have + and - signs because some of
the Y' values are overestimates and some are underestimates of the
actual data.
Sum of e = 0. You'll notice that when you
sum up all of the errors, they come out to be zero. In that sense
the Prediction Line is like the Mean. That is, one of the characteristics
of the prediction line is that the amount of error above the line
is equal to the amount of error below the line. The positive errors
cancel out the negative errors (within rounding error).
This characteristic is similar to the Mean where
the amount of deviation above the mean is equal to the amount of
deviation below the mean and consequently the sum of the deviations
around the mean equals 0. Similarly, the sum of the deviations (errors)
around the prediction line is zero.
Summary: Error is just a deviation of the
actual score from the predicted score. The sum of these deviations,
or the sum of the errors around the regression line, is conceptually
zero. I say conceptually zero because with rounding error you might
find in your homework data that the sum of the errors is only close
to zero--but that's due to rounding errors.
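Here is a small Python sketch of the e column and its sum. The data set is hypothetical, not the lecture's sample. The slope is computed as the sum of the products of deviations divided by the sum of squared X deviations, which is algebraically the same as r times (sy/sx).

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]   # hypothetical predictor scores
ys = [2, 4, 5, 4, 5]   # hypothetical criterion scores

mx, my = mean(xs), mean(ys)
# Slope and intercept of the least squares line
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# e = Y - Y' for every case
errors = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in errors])
print(abs(sum(errors)) < 1e-9)  # True: the errors cancel out
```

Some errors are positive and some are negative, and they cancel. Any tiny leftover in the sum comes from computer arithmetic, much like the rounding error in your homework.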
Least Squared Error

Preview. In the next section of this lecture we are going
to examine the idea of Squared Error. Based on that we will
develop the idea of Error Variance.
Least Squared Error. Without using calculus, we cannot show
how the formulas you are using to find the intercept, a,
and the slope, b, were derived. But it can be proven that
those formulas (compared to any other possible formulas) produce
the least amount of squared error possible.
The theory behind the mathematical derivation of the formulas you
learned for the predicted line (Y' = a + bX) assures us that
the line we calculate with those formulas generates the least possible
amount of squared error.
Why Least Squared Error is Desirable
Review: In the Variance Lecture we
learned that variance is average squared deviation around the mean.
Then we took some time to develop why it is useful for the variance
formula to be based on squared deviations. Toward the end of the
variance topic, we talked about the idea that all scientific measurement
operations produce some degree of measurement error. While we are willing
to accept errors in measurement, we prefer that they be small and
we are concerned if they are large. So squaring errors (deviations)
is useful because the squaring function is sensitive to large errors.
Squaring magnifies large errors more than it does small errors.
If I were building a table and measured the length of one of its
legs, I would not think that I could measure it exactly. Some engineer
or physicist could measure it with more precision. All measurements
produce some kind of error. However, I am not very concerned if
my error is small. If I'm one ten thousandth of an inch off, it's
not going to make much difference for building a table. But if I'm
a long way off, if I'm two inches off, that will make a difference
in the final product. A two inch difference in the length of the
legs will make the table wobbly.
The same is true when we're measuring people's personality, intelligence,
aggressiveness, or memory in psychology. We assume small errors
don't really matter but big errors do. Metaphorically, big measurement
errors will create wobbly theoretical constructs.
Prediction Error. Prediction works
the same way. We accept that our predictions won't be exact, there
will be some degree of error. But how much error is the question.
We are much less concerned about small errors in prediction than
large ones.
Least Squared Error. The basic idea
of least squared error is that, based on the wobbly table argument,
we would like procedures which, while they may produce small errors,
minimize large errors. And large errors are amplified by the squaring
function more than are small errors. Therefore any procedure that
gives us the least amount of squared error gives us what we
want.
The regression formulas for the intercept, a, and the slope,
b, give us the line that has the least squared error around
it. That is, they produce the least amount of squared error in our
predictions.
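You can see the least squares property without calculus by trying some other lines. The sketch below uses hypothetical data, not the lecture's sample. For this data the regression formulas give a = 2.2 and b = 0.6. The function name `sse` is made up for this illustration. Trying a handful of nearby lines is only an illustration, not a proof.

```python
def sse(a, b, xs, ys):
    """Sum of squared prediction errors for the line Y' = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

best = sse(2.2, 0.6, xs, ys)   # the least squares line for this data

# Nudge the intercept and/or the slope and recompute the squared error.
others = [sse(2.2 + da, 0.6 + db, xs, ys)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

print(round(best, 2), round(min(others), 2))
print(all(o > best for o in others))  # True
```

Every nudged line has more squared error than the line from the formulas.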

Squared Error Column
The graphic on the right adds a fifth column of
numbers, squared errors. To get this column, calculate the errors, then
square them.
Prediction Errors Sum to 0. Notice that the
error column sums very close to zero. There is a bit of inaccuracy
in the sum of the errors (0.01) due to rounding.
Mean error. The Mean of any set of numbers
is just the sum of those numbers divided by the number of numbers.
So if the sum of the errors is conceptually 0, then the average
error must be conceptually 0. Therefore our regression line has
a very nice characteristic--its average error is 0. That means it's
biased neither toward overestimating nor toward underestimating; the
overestimates exactly cancel out the underestimates.
Sum of Squared Errors.
Look at the fifth column of numbers, the column of squared errors.
In your notes complete the column where it says "etc., etc."
Confirm that the sum of the squared errors is 108.5707.
It is useful to keep in mind that errors
are deviations of actual scores, Y, from predicted
scores, Y'.
Review the Formula for Variance
If we want to calculate error variance, then we will have to be
clear about what formula to use to calculate it.
Think about the formulas for variance. Can
you remember them just from using them?
Variance Formula when M = 0
On the left is the definitional formula for the variance.
On the right is the computational formula.
It doesn't matter which of the two formulas we use; it comes out
the same.
 
Sum(X squared)/n. Look at both formulas above. Notice that
when M = 0, they both reduce to the same thing. They both reduce
to Sum of (X squared) divided by n. Write down in your notes the
formula for the variance when M = 0.

Constructing a formula for Error Variance
We want to calculate the variance of the error scores. The Mean
error = 0. So let's just change the X's to e's in the last formula
we made up. Then we will have our formula for calculating error
variance.
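The formulas on the graphics are not reproduced in this text. As a sketch, assuming the standard forms of the variance that divide by n (as the lecture does when it divides by 6), the steps look like this. The symbol S with a subscript e is notation introduced here for the error variance.

```latex
% Definitional and computational formulas for the variance (assumed standard forms)
S^2 = \frac{\sum (X - M)^2}{n} = \frac{\sum X^2}{n} - M^2

% When M = 0, both reduce to the same thing
S^2 = \frac{\sum X^2}{n}

% Change the X's to e's (the mean error is 0) to get the error variance
S_e^2 = \frac{\sum e^2}{n} = \frac{\sum (Y - Y')^2}{n}
```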

Prediction Error Variance
The idea. I want to make sure that
the concepts behind calculations on the graphic are really clear.
What we did was make predictions about Y from X. Those predictions
are inevitably wrong--that's where error comes in. We have already
discovered that the Mean error around the regression line is 0.
Now we are going to find the variance of the error scores.
The formula. Across the top of the graphic
is a formula for prediction error variance. When you put Me = 0
into it, it boils down to the formula we just made up. So to get prediction
error variance, we just have to divide the sum of the squared errors
by n.
Go ahead and substitute the numbers in the formula and calculate
error variance.
Error Variance
The error variance is 108.5707 divided
by 6, which equals approximately 18.1.
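Here is the same division as a short Python check, using the sum of squared errors and the sample size from the example. The variable names are only for illustration.

```python
sum_squared_errors = 108.5707  # sum of the squared error column
n = 6                          # number of participants
error_variance = sum_squared_errors / n
print(round(error_variance, 4))  # 18.0951
```

Rounded to one decimal, that is 18.1.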
Summary: Prediction Error and Prediction Error Variance
Prediction and error. In general,
when humans make predictions about the world, they are not surprised
if their predictions are not exactly spot on. Put another way, we
generally assume our predictions will have some amount of error
in them. But how much? A little error in our predictions may not
make much difference but a large error in our predictions could
be disastrous.
e = Y - Y'. Regression analysis is
a formal prediction procedure. It is a way of predicting from the
values of one variable (X) what the values of another variable (Y)
will be. We expect the regression equation will generate errors.
But we want to have some measure of whether those errors are large
or small. Prediction error is a deviation: e = Y - Y'.
So an error is a deviation of the actual score from the predicted
score. The sum of these deviations is 0. So on the average, the
errors around the regression line are 0. But how spread out are
they? A little? A lot? Variance measures how spread out things are
by giving us the average squared deviation.
Prediction error variance. Error
Variance is the average squared deviation of the actual data
around the predictions. Error variance measures how spread out the
errors are around the regression line. When you think about it,
a good predictor would have a low error variance which would indicate
that the average squared deviation of the actual data away from
the predictions is small. If you have small error variance, then
that means that all of the data is clustered pretty tightly around
the regression line because all of the e's
would be small. In a similar way, a poor predictor would have a
relatively large error variance.
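To see these ideas in action, here is a small Python sketch. The five data points are hypothetical (invented for illustration, not the lecture sample), and Y' = 2.2 + 0.6X is the least-squares line for that particular made-up data set.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares regression line for this made-up data: Y' = 2.2 + 0.6X.
a, b = 2.2, 0.6
preds = [a + b * x for x in xs]

# Prediction error is a deviation: e = Y - Y'.
errors = [y - p for y, p in zip(ys, preds)]

# The errors sum to 0 (up to floating-point rounding).
print(round(abs(sum(errors)), 10))  # 0.0

# Error variance is the average squared error.
error_variance = sum(e ** 2 for e in errors) / len(errors)
print(round(error_variance, 4))  # 0.48
```

Try changing the slope or intercept a little. The errors will no longer sum to 0, and the average squared error will go up.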
In the next section we will examine these ideas in more depth.

Explained and Unexplained Variance

Next, we will discuss the concepts of explained
variance and unexplained variance in regression. (Unexplained
variance is exactly the same concept we just defined as error variance.)
Example

Remember from our example that the predictor
variable was number of cigarettes smoked per day between ages 20
and 50 and the criterion variable (or dependent variable) was the
number of health problems between the ages of 65 and 70.
Background Statistics
The graphic at the right reviews the statistical
results we've calculated for our example so far.
Variability. We found the standard
deviation of Y to be 6.68331 and the variance of Y to be 44.66667.
X, Y Correlation. We also found the
correlation coefficient between cigarettes and health problems to
be r = .77119.
Regression line. Finally, we found
the regression line to be Y' = 3.109 + 1.578X.
Total Variance in Y. In the context
of what we are now learning, and for emphasis, we will call the
variance of Y "the total variance of Y." This is
because we are about to break the total variance of Y into two parts--explained
variance and unexplained variance. Before we break the total variance
of Y up into parts, let's make sure we are clear about what the
variance of Y is.
Total Variance of Y
Y is the number of health problems a person
has between the ages of 65 and 70. Clearly, the number of health
problems in the age range from 65 to 70 varies from person to person.
One individual will have a different number of health problems than
another person. So in a large group of people there will be a variability,
spread-out-ness, in the number of health problems. We measure variability
with variance.
To find the variance of Y, you subtract
the mean of Y from each Y score (Y - My), then you square each of
these deviations, sum them, and average them. The
total variance of Y can be broken into two parts, explained
variance and unexplained variance.
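The recipe for the variance of Y (deviate, square, sum, average) can be sketched in a few lines of Python. The Y scores below are hypothetical, made up for illustration. The standard library function `statistics.pvariance` divides by n in the same way, so it serves as a check.

```python
from statistics import pvariance

# Hypothetical Y scores, invented for illustration only.
ys = [2, 4, 5, 4, 5]
n = len(ys)

mean_y = sum(ys) / n                     # My
deviations = [y - mean_y for y in ys]    # Y - My for each score
total_variance = sum(d ** 2 for d in deviations) / n  # average squared deviation

print(total_variance)   # 1.2
print(pvariance(ys))    # 1.2
```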
Explained Variance Conceptually.
Explained Variance is that part of the variance in health problems
that is predictable from, or explained by, how much or how little
a person smokes. In other words, some health problems are related
to smoking. So we can explain (to a certain degree) why the number
of health problems varies from person to person simply by knowing
how many cigarettes each person smoked per day.
Unexplained Variance Conceptually.
Clearly not all health problems are related to cigarettes. People
who never smoke still have health problems. Obviously there are
other factors that affect health besides cigarette smoking. So not
all health problems (Y) are related to cigarettes smoked (X). Therefore,
some of the variance in Y is not predictable from, or is unexplained
by, cigarette smoking.
Conceptually, we look at our group of research
volunteers and see that they differ a great deal in the number of
health problems they have between 65 and 70. Some of these problems
we can explain by their cigarette smoking lifestyle, and others
of them, we can't explain by their cigarette smoking lifestyle.
Other factors, besides smoking, also affect their health.
A visualization. In order to visualize
this kind of logic, it's very typical in statistics to combine symbolic
logic--formulas-- with graphic representations to help keep track
of what's going on. A very typical visualization in this case is
to have total variance be represented by a circle. In the graphic
above, the total variance is represented by the entire circle; it
has a value of 44.67.
Calculating Explained & Unexplained Variance
The explained variance can be calculated
very simply by r squared times the total variance of Y.
In this particular example, r squared is
.59473. If we multiply that times the total variance, which is 44.66667,
we get an explained variance of 26.57.
The formula for unexplained variance is
(one minus r squared) times the total variance of Y.
In the example, one minus r squared is .40527.
We multiply this times the total variance, which is 44.66667, giving
a result of 18.10 for the unexplained variance.
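Here is a short Python sketch of the two formulas, using r = .77119 and the total variance of 44.66667 from the example. The variable names are my own. Because r is itself rounded to five decimals, the last digit of a product can differ by one from a hand-calculated figure.

```python
# Values from the lecture example.
r = 0.77119
total_variance = 44.66667

r_squared = r ** 2

# Explained variance: r squared times the total variance of Y.
explained_variance = r_squared * total_variance

# Unexplained variance: (one minus r squared) times the total variance of Y.
unexplained_variance = (1 - r_squared) * total_variance

print(round(r_squared, 5))
print(round(explained_variance, 2))
print(round(unexplained_variance, 2))

# The two parts add back up to the total variance.
print(round(explained_variance + unexplained_variance, 5))
```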
Dividing the Circle
Visually, we represent the total Y variance
(44.67) by the full circle. We divide the circle into two parts--explained
variance (26.57) and unexplained variance (18.10).
The two parts of the circle add up to the
entire area of the circle. Adding the explained and unexplained
variance together will give you the total variance.
The circle is drawn poorly. The large part of the circle (26.57)
is given a smaller portion of the pie than the smaller part (18.10).
In your notes you may want to do a better job of it. (This conceptual
inaccuracy will continue in the picture until the end of this lecture.)
Proportion of Variance Accounted For
The proportion of the explained variance
is called proportion of variance accounted for. It is an important
concept in regression. The proportion of variance accounted for
is found by dividing the explained variance by the total variance.
It's no more difficult than if you have
a bushel basket with 20 pieces of fruit in it, both apples and oranges.
By counting you discover that there are 15 oranges and 5 apples.
If you divide 15 by 20, you get the proportion
of oranges. 15 divided by 20 gives .75, so the proportion of pieces
of fruit that are oranges is .75. We are following exactly the same
logic here in calculating the proportion of variance accounted for.
Percent. People often use percentages to
describe this concept. If you multiply the proportion of variance
accounted for by 100, then you get 59.4%. People go back and forth
between calling this percent of variance accounted for versus proportion
of variance accounted for. Both are common ways of phrasing things
in regression analysis.
Coefficient of Determination
One interesting fact is that you don't actually
have to use the variance to find the proportion of variance accounted
for. You simply need to square r. r squared gives you the
proportion of variance accounted for in Y by X.
This is simple to calculate. If someone
asks you what the proportion of variance accounted for is, all you
have to do is square the correlation coefficient.
r squared is sometimes called the
coefficient of determination.
Obviously, the two methods of finding the
proportion of variance accounted for are mathematically the same.
If you care to, you can probably prove this for yourself just by
laying out the various formulas we just learned and looking for
what cancels.
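If you would rather see the cancellation written out, here is a brief sketch. The symbol s_Y^2 for the total variance of Y is my own notation, introduced only for this derivation; it relies on the formula given earlier, explained variance = r squared times the total variance of Y.

```latex
\[
\text{proportion of variance accounted for}
  = \frac{\text{explained variance}}{\text{total variance}}
  = \frac{r^2 \, s_Y^2}{s_Y^2}
  = r^2
\]
```

The total variance appears in both the numerator and the denominator, so it cancels and only r squared is left.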
Unexplained Variance
You can probably guess that the proportion
of unexplained variance will be calculated in a similar way. We
simply take the unexplained variance and divide it by the total
variance which gives the proportion of the circle that's in the
unexplained part.
In this case, the unexplained variance (18.10)
divided by total variance (44.67) gives us a proportion of variance
not accounted for equal to .405.
There is a simple way to calculate this
proportion too--(one minus r squared). In this case (one
minus r squared) is (1 - .59473), which gives us .405. The proportion
of variance NOT accounted for is .405.
On exams and homework you will need to
know how to calculate the proportion of variance accounted for (or
not accounted for) using both methods--that is, 1) using r squared
and 2) using the variance formulas.
Note once again that unexplained variance
is the same thing as error variance.
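As a practice aid, the following Python sketch works both methods side by side with the rounded figures from the example (r = .77119, total variance 44.67, explained 26.57, unexplained 18.10). The variable names are mine. Since the inputs are rounded, expect the two methods to agree only to about three decimal places.

```python
# Rounded values quoted in the lecture example.
r = 0.77119
total_variance = 44.67
explained_variance = 26.57
unexplained_variance = 18.10

# Method 1: use r squared.
prop_explained_r = r ** 2
prop_unexplained_r = 1 - r ** 2

# Method 2: divide each part of the variance by the total variance.
prop_explained_var = explained_variance / total_variance
prop_unexplained_var = unexplained_variance / total_variance

print(round(prop_explained_r, 3), round(prop_explained_var, 3))
print(round(prop_unexplained_r, 3), round(prop_unexplained_var, 3))
```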
Integration
We will use the idea of deviation to integrate all
the ideas we have talked about in the Regression Lecture. A great
deal of statistical theory is based on deviations. So looking
at deviations is a relatively easy way to understand what is going
on. Just read this integration section to get the main ideas
because they will help you bring the whole topic together into
a well-formed gestalt.
Three Deviations
Y. For this discussion,
let's make up an example data point for a single person. That
person's data is shown on the graph as a red square with blue
trim. Y represents the number of health problems reported by this
one person. The Y value for the red square shown on the graph
looks to be about 21 health problems. So for this discussion let's
say we have a person who has Y = 21. (The X score looks
to be about 6.)
Y'. The black line
on the graph is the regression line for predicting Y from X. The
predicted score (Y') for our example person is shown as a yellow
circle on the regression line. Let's say that the predicted number
of health problems is 17. That is, Y' = 17. (So for a person
who has X = 6, our regression equation predicts 17 health problems.)
M. The red line on the graph represents
the Mean of all the Y scores. It is the average number of health
problems in our sample of volunteers. In our sample M = 11.
A First Deviation: (Y - M). As you
can see from the graph, our person's Y score deviates from the Mean.
For the single case we have made up, Y - M = 21 - 11 = 10.
A Second Deviation (Y - Y'). Also
notice that the actual score (Y) deviates from the predicted score
(Y'). This deviation is what we have called error. e =
Y - Y' = 21 - 17 = 4.
A Third Deviation (Y' - M). Also notice
that the predicted score (Y') deviates from the Mean of Y. Y'
- M = 17 - 11 = 6.
Partitioning deviations. This is a very simple proof,
and looking it over will give you some real insight. Notice that
the second and third deviations add up to the first deviation.
That is, 4 + 6 = 10. Algebraically this is true in general.
Set (Y - M) = (Y - Y') + (Y' - M).
Remove the parentheses ==> Y - M = Y - Y' + Y' - M.
Cancel the + and - Y' ==> Y - M = Y - M.
It is quite generally true that Y - M = (Y - Y') + (Y' - M).
In our example: 21 - 11 = (21 - 17) + (17 - 11). We have broken
(partitioned) the total deviation of Y from its Mean into two parts.
We will now go on to examine how these two parts generate unexplained
(error) variance and explained variance.
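The three deviations for our made-up person (Y = 21, Y' = 17, M = 11) can be checked with a few lines of Python. This is only a sketch; the variable names are my own.

```python
# The made-up data point from the lecture.
y = 21        # actual number of health problems (Y)
y_pred = 17   # predicted number of health problems (Y')
mean_y = 11   # mean of all the Y scores (M)

total_deviation = y - mean_y           # first deviation:  Y - M
error_deviation = y - y_pred           # second deviation: Y - Y' (error)
explained_deviation = y_pred - mean_y  # third deviation:  Y' - M

print(total_deviation, error_deviation, explained_deviation)  # 10 4 6

# The second and third deviations add up to the first.
print(total_deviation == error_deviation + explained_deviation)  # True
```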
Three Variances
The Concept of Variance. Remember
that variance is the average squared deviation. Conceptually then,
variance is based on deviation. So we are going to conceptualize
the three deviations we just talked about as three variances.
Total Variance. The total variance
in Y is generated by the total deviations of the Y scores from their
Mean. It is based on the sum of all the squared Y - M deviations.
You don't need to know anything about regression or correlation
to understand the total variance which is based on the squared deviation
of an actual score from the mean of Y. On the final graph, the deviation
of Y from the Mean is shown by the thick blue bracket.
Error (Unexplained) Variance. We have broken the total distance between
the Y score and the Mean into two parts. The first part is Y - Y'
which we've argued is prediction error. That is, Y - Y' = e.
When we find the variance of these error scores we get error variance.
Explained Variance (Variance accounted
for). The third deviation is Y' - M. Look at the graphic; notice
that Y' (circle) is closer to Y (square) than is the Mean (red line).
In terms of the example, when a person smokes 6 cigarettes per day
we predict 17 health problems. The actual number of health problems
for this person is 21, and the Mean number of health problems is
11. So by predicting 17 we are a lot closer to the data (21) than
the Mean is. If we simply used the mean to predict Y, we would be
farther off than if we use the regression equation.
On the average, our predicted score (Y')
is closer to the actual data (Y) than is the Mean. We have gained
something by predicting Y from X. What we have gained is that, overall,
the predicted scores are closer to the actual data than the
Mean is. So we have explained a part of the total variance in
Y.
Note: Explained Variance is conceptually based on the sum
of the squared deviations of our predictions around the Mean. A
note of caution: We have not actually calculated Explained
Variance as the sum of the squared Y' - M deviations divided by
n. This is the first I've mentioned it. We have used r squared
times the total variance of Y as a way to calculate Explained
Variance. That's the way you should do it. What we are doing here
is integrating all the ideas.
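For those who want to see the integration with numbers, here is a Python sketch that calculates all three variances directly from deviations and compares the explained variance with r squared times the total variance. The data set is hypothetical (invented for illustration, not the lecture sample), and the slope and intercept are found with the usual least-squares formulas.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squared deviations and of cross-products.
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares slope and intercept, then the predictions Y'.
b = sp_xy / ss_x
a = mean_y - b * mean_x
preds = [a + b * x for x in xs]

r_squared = sp_xy ** 2 / (ss_x * ss_y)

# Three variances, each an average of squared deviations.
total_variance = ss_y / n                                              # based on Y - M
error_variance = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n      # based on Y - Y'
explained_variance = sum((p - mean_y) ** 2 for p in preds) / n         # based on Y' - M

print(round(total_variance, 4))               # 1.2
print(round(error_variance, 4))               # 0.48
print(round(explained_variance, 4))           # 0.72
print(round(r_squared * total_variance, 4))   # 0.72
```

In this made-up data set the error variance and the explained variance add up to the total variance, and the explained variance matches r squared times the total variance.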
Summary: Partitioning the Variance. In summary, we have
broken up (partitioned) the total variance into two parts--Unexplained
(error) Variance and Explained Variance. Any variance of any kind
is based on deviation. The sources of these three variances can
be seen clearly on the graph as the three deviations shown by brackets.
(Y-M) = (Y-Y') + (Y'-M). In the specific data point example
we have made up:
(21-11) = (21-17) + (17-11).
Thinking strategy. As you study, keep these last two graphics
in mind while you do your calculations and use the formulas. These
graphics will act as a road map. You will easily learn the conceptual
terrain if you keep track of where you are on the map as you
solve problems and calculate answers.
Don't worry about details, nor about the long train of thought
involved in developing these concepts. Go and practice and learn
by doing homework. As you do the homework, keep the visual map (last
two graphics) in mind. The goal is to form a simple whole (gestalt)
for yourself that integrates all this new material into a coherent
understanding.
|