Hypothesis Testing Web Page
This is the text of the
in-class lecture which accompanied the Authorware visual graphics
on this topic. You may print this text out and use it as a textbook.
Or you may read it online. In either case it is coordinated with
the online Authorware teaching program.
Begin Text Explaining Hypothesis Testing
Science
and Statistics. We've gone to some length in previous lectures
to clarify the relationships among natural phenomena, science, and
mathematical models. We have defined what we mean by DV measurement
operations in the realm of science and have shown how those are
modeled by probability concepts such as the normal population and the
binomial population. In this lecture we will further develop explicit
correspondences between scientific ideas and statistical models.
Statistical
Conclusion Validity & Hypothesis Testing. Statistical Conclusion
Validity is an idea developed in classic discussions (e.g., Cook
and Campbell, 1979) of scientific methodology. Hypothesis testing
is a formal model developed in probability and statistics. The two
ideas have direct and formal correspondences with each other. What
I'm making clear is that when I use the term "Statistical Conclusion
Validity" I'm speaking in the realm of science; and when I
use the term "Hypothesis Testing" I am speaking in the
realm of mathematical-statistical models. This lecture will focus
on developing these two ideas and the relationships between them.
We are going to develop
the idea of the plausible competing hypothesis (PCH) of Chance. But before we can explain
what we mean by that we need to discuss several preliminaries.
Scientific Hypotheses.
As you know from general knowledge, scientists have hypotheses and
they do research to evaluate the validity of their hypotheses. A
very general form of scientific hypothesis is that some IV will
cause some change in some DV.
Scientific Skepticism.
A healthy part of science is its value of skepticism. Scientists
don't accept beliefs and hypotheses just because someone says they
are true; rather, scientists have developed extensive common sense,
logical, and cultural traditions for testing beliefs and hypotheses.
The fundamental
attitude of science is to be skeptical, to assume a new hypothesis
is not true until there are compelling reasons to believe
it.
So if one scientist says
that the IV causes a change in the DV, the skeptic says, "No
it doesn't. The IV has no effect on the DV." The skeptic shows
up most clearly as your scientific competitors at a different university
or in a different lab. In parody we could say that the knee-jerk
reaction of any scientist is not to believe anything, especially
if it's proposed by another scientist. Parodies aside, scientific
skepticism is very important. And it is perhaps most important in
stepping back and taking a skeptical attitude toward your own most
cherished hypotheses. It would be difficult to do good research without,
at times, challenging, in your own mind, the hypothesis you are
evaluating.
Skeptical Hypothesis.
I will establish a convention called the skeptical hypothesis. If
the scientific hypothesis is that the IV causes a change in the
DV, then the automatic skeptical hypothesis is that the IV has no
effect on the DV.
Blood Pressure Example.
The discussion to this
point has been rather abstract. To be a little more specific, suppose
that you develop a pill which you think will reduce blood pressure.
So your scientific hypothesis is something like, "The blood
pressure pill will reduce blood pressure."
Skeptical Hypothesis.
"The blood pressure pill has no effect on blood pressure,"
will be the skeptic's reply. That is, implicit in the attitude of
science is a skeptical hypothesis that negates the scientific hypothesis.
Your own internal attitude should include this skepticism. In science
it's healthy to have some amount of skepticism toward your own ideas.
One of the things you
can do to counter the skepticism toward your hypothesis is design
a study and collect data.
Research Design.
The current Authorware graphic outlines the design of a simple study
to evaluate the effectiveness of the blood pressure pill. The Pill
Group takes the new blood pressure pill for 3 months. The Control
Group continues as they have been prior to the study. Specifically,
the volunteers in this control group do not take the new blood pressure
pill nor any other pill administered by the experimenter. At the
end of the three-month period, the blood pressure of both groups
is measured.
Negative Results.
One thing that can happen when you do research is that the results
are not consistent with the scientific hypothesis. The current graphic
shows such results. After three months of taking the pill the average
blood pressure in the Pill Group is no different than the average
blood pressure in the Control Group. The scientific hypothesis proposed
that the Pill Group blood pressure would be lower.
Is the data pattern
consistent with the scientific hypothesis? In the coming lectures
we will focus on some sophisticated inferential statistics like
t, chi-square and F. Often beginning students get caught up in using
these statistics and forget to check the most fundamental question:
Is the pattern of results consistent with the scientific hypothesis?
If not, it means that you must rethink what is going on. Often this
leads to revising the research design or the scientific hypothesis.
You may or may not go on to apply inferential statistics. In any
case, for the overall scientific project, these sophisticated inferential
statistics are much less important than the pattern of results.
If the data pattern is
not consistent with the scientific hypothesis the skeptic basically
says "I told you so."
Positive Results.
If your IV is effective and if you have some amount of skill and
luck, the data pattern will be consistent with the scientific hypothesis.
The current graphic shows such a case. You can see that the mean
blood pressure in the Pill Group (red) is lower than the mean blood
pressure in the Control Group (blue). This is consistent with the
prediction of the scientific hypothesis.
Once you have positive
results, the conversation between the scientist and the skeptic
gets interesting.
The skeptic will grudgingly
admit that it does appear that the data pattern fits with the scientific
hypothesis. But the skeptic will start inventing ways (other than
the scientific hypothesis) to interpret the results. In other words,
the skeptic is not going to give up easily, if at all.
Let's say that the data
have turned out just like the scientific hypothesis predicted. If
this is so, the skeptic thinks of ways the data would have come
out as they did even if the scientific hypothesis was not true.
For example, in our little Pill Study, the skeptic would surely
say that the results are due to the Placebo Effect.
Placebo Effect.
It is well-known that if people believe they are receiving treatment
they tend to get better even if the treatment is bogus. If you give a group
of people a pill with no active ingredient, telling them that there
is a great new ingredient contained in the pill, they tend to get
better. This is a very general and well-researched phenomenon. It
cannot be cavalierly discounted in any research design.
The main idea here is
that the skeptic can plausibly claim that even if the active ingredient
in your pill is ineffective, the Pill group would have come out
with a lower blood pressure because of the Placebo Effect. The participants
in the Pill group were given a reason to believe that they should get
better (the pill) while the participants in the Control group were
not given such a reason.
So the Pill Study has
no control for the Placebo effect. For this reason it is a weak
study. We do not know if the blood pressure is lower in the Pill
group because of the active ingredient or because of the placebo
effect. The scientist argues that it is the ingredient. The skeptic
argues that it is placebo. Neither side can convince the other.
And so the scientist has not gained the desired logical advantage
in the conversation by doing the research. The skeptic remains skeptical.
She or he has a plausible hypothesis which competes with the scientific
hypothesis in explaining the data pattern.
Plausible Competing
Hypotheses. The
placebo effect is one example of what can be called plausible competing
hypotheses. The placebo effect is plausible within the research
conversation. It also competes with the scientific hypothesis in
explaining the results. Some books and researchers use the terms
"rival hypotheses" or "rival conjectures" for
what I'm calling plausible competing hypotheses. Also, to save you
and me both time in writing down ideas about this topic, I will use
the abbreviation PCH for a plausible competing hypothesis.
Redesign the Pill
Study. The Pill study would have to be redesigned to control
for the placebo effect. This could be easily done. The Pill group
would remain the same. The Control group would be given a pill that
is exactly the same in appearance and taste as the Pill group's pill but
without the active ingredient. The
Control group would be told, of course, that they are getting the
active ingredient in their pill. In this new study, the placebo
effect should therefore apply to both the Pill group and the Control
group. So the two groups would be equivalent except for the active
ingredient. When you redo the research, you would hope that the
Pill group would have an even lower blood pressure than the placebo
control group. Then you could argue that both groups benefit from
the placebo effect and the Pill group still produces lower blood
pressures. Therefore the active ingredient must account for the
lower blood pressures in the Pill group.
More PCH's. Scientific
skepticism runs deep. An imaginative critic will continue to invent
competing hypotheses. Many of these are standard and well-known
and you will study them in research methods. For example, the critic
might ask if the study was a double blind study. If you don't know
the issues involved in double blind studies, that's ok, you'll learn
about them in research methods. The point I'm making is that the
skeptic will invent a long string of PCH's which you need to take
into account in designing your research.
These PCH's generally
have much more to do with research methods than they do statistics.
There is one PCH, however, which is the focus of much of statistics.
In statistics, we are
concerned about a special plausible competing hypothesis, and it's
called "chance."
PCH of Chance.
Chance is what the mathematician George Polya (1968) called "the
ever-present rival conjecture." Chance is a universal plausible
competing hypothesis which applies to any data set. The PCH of chance
basically claims that the data pattern happened by chance
alone.
Redesigned Study.
Suppose we have redesigned our Pill study with a placebo control
group. Both groups received a pill, one with the active ingredient,
one without the active ingredient. Consequently, both groups experienced
the placebo effect. Participants in both groups had reason to believe
they're taking a good pill and that their blood pressure should
go down. But only one group actually had the active ingredient.
Positive Results.
Suppose we got positive results when we did this new study. The
mean blood pressure in the Pill group was lower than the mean blood
pressure in the Placebo-control group. We're feeling good about
that. We show the skeptic the results of the new study.
Immediately the skeptic
will say that the data pattern occurred by chance alone. The skeptic
claims we were lucky. One of the two groups had to come out with
a lower blood pressure and we were just lucky that it was the Pill
group. It's not that our hypothesis about the active ingredient
is right. The active ingredient is not effective. We were just lucky
that the results make it appear as if it is effective.
Chance is plausible.
I will elaborate the skeptic's argument, in the following way. Suppose
we divide any classroom right down the center into two sides, right
side versus left side. Then we measure anything we want to measure,
people's height, their weight, their GPA. Choose any dependent variable
you want. Let's say we measure the height of the students, comparing
the heights of those on the right side with the heights of the people
on the left side. We calculate the average height of those two groups.
By Chance Alone.
Suppose also that we have no reason to believe students on the right
are different in height than students on the left. Why should they
be? But by chance alone one of these two groups will have a higher
mean height. The probability is practically nil that the average height of those
two groups would be exactly the same. The
two means would not be identical. It depends on how many
decimals you're willing to go out to, of course, but probably you
won't have to carry many decimals at all to see a difference. By
chance alone the two means of any two randomly created groups will
be different on just about anything you measure. It's not only plausible,
it's almost inevitable.
What if I find that the
average height on the left is higher than on the right? It is just
as plausible, even more plausible, to believe that the result happened
by chance alone than to believe some hypothesis about people on
the left being taller.
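The classroom argument can be checked with a small simulation. This is only a minimal sketch under assumptions I am making up for illustration: a hypothetical class of 40 students whose heights (in cm) all come from one population, split at random into a left and a right side many times. The function name and every number are illustrative, not data from a real class.

```python
import random
from statistics import mean

def split_classroom(heights, rng):
    """Randomly assign the students to a left side and a right side of equal size."""
    shuffled = heights[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumption: 40 heights drawn from a single population (mean 170 cm, SD 10 cm).
heights = [rng.gauss(170, 10) for _ in range(40)]

trials = 5000
identical = 0         # splits where the two means are exactly equal
total_abs_diff = 0.0  # running sum of |left mean - right mean|
for _ in range(trials):
    left, right = split_classroom(heights, rng)
    diff = mean(left) - mean(right)
    if diff == 0:
        identical += 1
    total_abs_diff += abs(diff)

avg_abs_diff = total_abs_diff / trials
print("splits with exactly equal means:", identical)
print(f"average size of the difference: {avg_abs_diff:.2f} cm")
```

Nothing distinguishes the two sides except the random split, yet essentially every split should show one side taller than the other.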
Back to blood pressure.
The skeptic says, "Of course one of the two average blood pressures
was going to be higher than the other. No big deal." The skeptic
claims you were just lucky that the data pattern came out with the group
you predicted to be lower actually lower. For the skeptic it is
a 50-50 chance that either group is lower. It's like flipping a
coin. If you flip a coin and predict heads there is a 50-50 chance
you will be right. Getting a Head as a result of the flip doesn't
mean you are able to predict the outcome of coins. There's a .5
probability that you'd be right even if you have no ability to predict
the outcome.
Same with blood pressure.
You predicted the Pill group would be lower. It was lower. But even
if the active ingredient were ineffective, there would be a 50-50
chance that the Pill group would be lower by chance alone. The
skeptic is not impressed by this. The skeptic looks at your nice
graph showing the blood pressure in the Pill group (red bar) is
lower than the blood pressure in the Placebo Control group (blue
bar). The skeptic plausibly interprets these results as having occurred
by chance alone.
Chance universally applies
to any data set. Any pattern of results might plausibly be argued
to have occurred by chance alone.
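The skeptic's 50-50 claim can also be sketched in code. The simulation below assumes, purely for illustration, that the pill does nothing: both groups are drawn from the same hypothetical population of blood pressures (mean 140, SD 15, 20 people per group). None of these numbers come from a real study.

```python
import random
from statistics import mean

def pill_group_lower_by_chance(n_per_group, trials, seed=1):
    """Fraction of simulated no-effect studies in which the Pill group mean
    comes out lower than the Placebo-control group mean."""
    rng = random.Random(seed)
    lower = 0
    for _ in range(trials):
        # Assumption: the pill is ineffective, so both groups share one population.
        pill = [rng.gauss(140, 15) for _ in range(n_per_group)]
        placebo = [rng.gauss(140, 15) for _ in range(n_per_group)]
        if mean(pill) < mean(placebo):
            lower += 1
    return lower / trials

fraction = pill_group_lower_by_chance(n_per_group=20, trials=5000)
print(f"Pill group lower in {fraction:.1%} of no-effect studies")
```

The fraction should land near one half, which is the skeptic's point: the predicted direction alone is about as informative as calling one coin flip correctly.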
What we are studying
in this lecture is the conditions under which this universal rival
conjecture is no longer plausible. How do we argue against the plausibility
of chance explaining our results? The editors of research journals
will generally want to have this basic issue addressed before they
agree to publish research.
Arguments against
chance. The inferential statistics we will soon be studying
(t, chi-square, F) are designed to address the PCH of Chance.
Arguing against the plausibility that chance alone generated the
research results is all that these statistics are going to accomplish.
This lecture on Hypothesis Testing will develop the logic
by which we can argue against the plausibility of chance generating
the data pattern.
Caveat. Inferential
statistics can help evaluate and possibly even strongly argue against
the PCH of chance. But they won't tell you whether you did good
science, whether you have a well-designed placebo group, or whether
you had all the appropriate control groups. Those are different
issues that we seem to put in research methods courses rather than
statistics.
Therefore, just because
we can validly argue that chance isn't plausible doesn't mean that
we're done. And so there's a caveat: just because you get a "significant
result" (that is, you can strongly argue against chance) doesn't
mean that you've done a good study. The caveat is that you still
have to eliminate many other PCH's, such as the placebo effect.
Other PCH's will be examined in more depth in research methods.
Statistics, even good
statistics, don't guarantee good science.
Jargon. We won't
explain the jargon here, but merely introduce it. When we can validly
argue that it isn't plausible that chance alone accounts for that
data pattern, we say that we have a statistically significant
result. When people say "statistically significant," they
mean that they have a valid argument against the PCH of Chance.
Sometimes people say
a "reliable result" to mean the same idea.
Summary. Up to
this point we've talked very generally about hypothesis testing
and statistical conclusion validity; and we've introduced the PCH
of Chance.
Let's go back to the
Hypothesis Testing menu and move on to the next topic.
We are going to talk
about Ho and H1 (or the null and alternative hypotheses). These
are important ideas which we have to deal with despite their unpopularity
with beginning students.
A journey. To
set up these ideas, we're going to take a long journey. Let's
say that you and a friend go down to New Orleans, down to the
French quarter, down, in fact, to Bourbon Street. And let's say
you go down some stairs into the basement underneath a music club.
Of course there wouldn't be such a place and if there were you
surely wouldn't go there, but let's say that you and your friend
find yourselves in an illegal gambling den. Now this is hypothetical.
I know no one here would ever do that. It's just to provide a
context in which it might be important to understand when things
are happening by chance or not.
You go into a smoke-filled
and tacky room, packed with people. There are big bouncers and
they seem to have bulges in the right place under their suit coats
to indicate they're packing heat. There's a certain electricity
in this environment.
A simple coin game.
Let's say that you decide to observe a certain kind of game. We'll
keep the game as simple as possible because we don't have very
much probability theory to work with. But a great deal of the
logic of the argument applies to all gambling games, and even,
by analogy, to research results. This simple little game involves
a single coin lying on a table. The dealer, or maybe we should
say the flipper, picks up the coin and flips it. Spinning end
over end, it flies through the air and lands back on the table.
Everyone is betting on whether it's a head or a tail.
And, just to help make
my teaching point, we'll put a further restriction on the game.
The house must always bet on heads and the client must always
bet on tails.
Something is going
on. Of course you're not gambling, just observing what's going
on. Maybe your friend is gambling. You watch the game for a while
and notice a string of heads. You say to yourself, "Something
is going on. I wonder what it is?"
Scientific hypothesis.
Since you think something is going on, you come up with a scientific
hypothesis. The scientific hypothesis is that the coin is two-headed.
If so, the players are doomed to lose.
Skeptical hypothesis.
You mention your scientific hypothesis to your friend. Your friend
is skeptical of your hypothesis. Perhaps your friend thinks you're
being a bit paranoid. Your friend thinks that the coin is a
fair coin and that the game's not fixed. In other words, your
friend thinks nothing is going on.
Perhaps your friend
pointedly asks you if you've picked up the coin and examined it.
Of course you haven't. No one is letting players get close to
the coin. Being unable to examine the coin, you can simply say
that the coin is behaving like a two-headed coin.
We have a scientific
hypothesis that the coin is two-headed. The skeptical hypothesis
is that it's a fair coin. That is, the coin has a head on one
side and tail on the other.
Research design.
So you suggest a simple research project. You and your friend
observe the behavior of one flip of the coin. The scientific hypothesis
predicts that it will land as a head. The skeptical hypothesis
predicts that it will be either a head or a tail.
Results. The
result of the flip is a head. That's the data.
Conclusions.
Since the data pattern is consistent with the scientific hypothesis,
you say, "See. It's a two-headed coin. I predicted
the result would be heads and it was." Your
friend replies, "Oh come on, how many times have we flipped
coins? You know it's got to come out a head or a tail, and it
just happened to come out a head once. The chances are 50% that
it will be a head by chance alone."
This is a case where
the PCH of Chance is really clear. It's easy to believe
that it's a fair coin, and just by chance alone it came out heads.
The data fit the prediction of the scientific hypothesis by chance
alone.
Now remember our little
story about the blood pressure pill. In a certain sense, what's
the difference? You flipped a coin, you ran two groups. In either
case one of two outcomes had to happen.
Or suppose a scientist
divided a classroom into right and left sides and measured everybody's
blood pressure. Suppose the scientist predicted that the group
on the right would have higher blood pressures because conservatives
have more Type A personalities. That's pretty loose thinking and
silly. It confuses many things. One thing it confuses is the metaphor
of political views (right vs left) with spatial location (right
vs left) in the classroom. But suppose the data came out that
the group on the right had higher blood pressures. This is consistent
with the scientific hypothesis. I'm suggesting you should feel
a bit skeptical about the scientist concluding his or her hypothesis
is right. The
skeptic replies that the two means will be at least a little different.
One will be higher. And it's just a matter of 50-50 chance whether
the one that's higher is the one on the right. This
is the point of view of the skeptic. One
of the two groups had to be higher, and the scientist was lucky
it came out the way s/he predicted.
The same with the coin
in New Orleans. One of the two sides had to come up, you were
just lucky it came up heads which is in line with the two-headed
coin scientific hypothesis. It is plausible to argue that the
data (a head) happened by chance alone.
The point I'm spending
some time making is that the PCH of Chance is not actually as
absurd as it might seem when you first hear of it.
Model the Realm
of Science with the Realm of Statistics. Now we're going to
translate the hypotheses from the realm of science (scientific
hypothesis and the skeptical PCH of Chance) into statistical hypotheses.
As usual, this will lead to some new jargon.
Null Hypothesis.
The PCH of Chance is modeled in the realm of statistics
by a statistical hypothesis called the null hypothesis,
frequently symbolized by H with a subscripted 0 as in the current
graphic. Right now web text does not easily allow me to use subscripts,
so I'll just write it as Ho. When speaking, people say
"H oh," or "H zero," or "null hypothesis."
If you believe that
the research data happened by chance alone, Ho is for you.
Alternative Hypothesis.
The scientific hypothesis is modeled in the realm of statistics
by a statistical hypothesis called the alternative hypothesis,
frequently symbolized by H with a subscripted "1" or
a subscripted "a." Due to the difficulty of writing
subscripts in web text, I will write it H1. When speaking,
people say "H one" or "H a" or "alternative
hypothesis."
Ho. In the coin
story, the PCH of Chance claims that the probability of a head
for a fair coin is .5. So we can write
Ho: P(Head) = .5
H1.
In the coin story, the scientific hypothesis is that the coin
is two-headed. So the probability of a head is 1. So we can write
H1: P(Head) = 1
Connection from
Science to Statistics. The exact form that Ho and H1 take
will always depend on the scientific context. Obviously Ho is
not always Ho: P(Head) = .5. It takes on that particular form
because of the coin story. The same is true for H1. In this section
we are taking great pains to make this connection between science
and statistics explicit. This will help later when we develop
statistics like t, chi-square and F.
Let's refocus on New
Orleans and the coin game.
Let's say you respect
your friend's skepticism. It is, after all, convincing that a
fair coin has a high probability (.5) of generating the data in
your previous experiment (which was one flip of a coin).
New study: Two flips.
You do a new research project. You observe two flips of the coin.
Suppose the results come out to be two heads. (I'll abbreviate
"two heads" as "HH.") Now you might say to
your friend, "See. Two heads. That's exactly what my scientific
hypothesis predicted."
Independent events.
Here's where you have to remember back to some ideas we learned
in the Basic Probability lecture. We argued that coin flips
are independent events and the joint probability of independent
events is simply the product of their probabilities. For a fair coin, the probability
of a head is one-half on each of the two flips. So, assuming the
two flips are independent, then the probability of a head and
then another head is one-half times one-half which is one-quarter.
That is, P(HH) = (.5)(.5) = .25.
Statistical Hypotheses.
The Null Hypothesis is "Ho: P(Head) = .5." If the null hypothesis
is true, then P(HH) = .25. The Alternative Hypothesis is "H1:
P(Head) = 1." If the alternative hypothesis is true, then P(HH)
= 1.
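The two statistical hypotheses make different predictions about the same data, and that can be written out directly. The sketch below is only an illustration; the helper name prob_all_heads is my own invention, and it assumes the flips are independent, as argued above.

```python
from itertools import product

def prob_all_heads(p_head, n_flips):
    """Probability that n_flips independent flips all land heads when P(Head) = p_head."""
    return p_head ** n_flips

# Under Ho the four two-flip sequences are equally likely, and only one is HH.
sequences = ["".join(flips) for flips in product("HT", repeat=2)]
print(sequences)                                # ['HH', 'HT', 'TH', 'TT']
print(sequences.count("HH") / len(sequences))   # 0.25

print(prob_all_heads(0.5, 2))  # Ho: P(Head) = .5 gives P(HH) = 0.25
print(prob_all_heads(1.0, 2))  # H1: P(Head) = 1  gives P(HH) = 1.0
```

The same helper works for any number of flips, which is all the longer studies in this story require.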
Skeptical reply.
Your friend, though, is still going to argue that the results
(HH) could be due to chance, saying "You know, a lot of people
have flipped coins twice and gotten two heads in a row. That's
a pretty likely thing to happen. In fact, if you flip coins a
lot, it's hard to avoid getting two heads in a row. It happens
about one-quarter of the time."
HHH. Let's say,
in response to this criticism, you redesign your study. Now you observe
the behavior of three flips of the coin. Suppose the results are
HHH. At this point you might say, "Hey that's it. It's a
two-headed coin. Let's go confront one of those bouncers and tell
him that we're going right to the police unless he tells the manager
to get a fair coin on the table." Well, your friend, looking
at the size of the bouncers might reply, "Shhh. If they hear
you we might be nursing broken fingers in the morning. You want
to risk all that pain just because a coin came up with three heads?
That could still happen by chance."
Under the PCH of Chance
and the null hypothesis the probability of a head is one-half.
This means that P(HHH) is one half times one half times one
half which is one eighth. That is, P(HHH) = (.5)(.5)(.5) = .125.
If H1 is true, then P(HHH) = 1.
If you're bored some
time, perhaps standing in line, you can flip a fair coin for a
while. Three heads in a row won't happen very often, but it will
happen if you flip enough times. It happens in about one in every
eight three-flip sequences. The results HHH still can come up by
chance occasionally. But, perhaps, chance is starting to feel
just a little less probable, a little less plausible. Still three
heads by chance is not so improbable as to be completely implausible.
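If you would rather not stand in line flipping coins, a computer can do the flipping. This is a minimal sketch that assumes a fair coin can be imitated by a pseudo-random number generator; the function name, the seed, and the number of trials are arbitrary choices of mine.

```python
import random

def fraction_all_heads(n_flips, trials, seed=2):
    """Simulate `trials` sequences of n_flips fair-coin flips and return
    the fraction of sequences that were all heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # rng.random() < 0.5 plays the role of "this flip landed heads".
        if all(rng.random() < 0.5 for _ in range(n_flips)):
            hits += 1
    return hits / trials

estimate = fraction_all_heads(n_flips=3, trials=100_000)
print(f"simulated P(HHH) = {estimate:.3f}; exact value under Ho = {0.5 ** 3}")
```

The simulated fraction should sit close to one eighth, matching the product rule.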
HHHH. The story's
the same. The new study is to observe the behavior of four flips.
And the results are four straight heads: HHHH. You look at your
friend with that "I told you so" look.
Your friend might
say "Hmm well, four straight. But you know, I've been bored
on airplanes and flipped fair coins a lot, and I can remember
four coming up in a row. That can happen by chance alone. It has
a one-sixteenth probability."
What you should sense
in this example is that chance is slowly becoming a less and less
plausible way to explain the results. Four heads in a row can
happen by chance alone, but it happens with a probability of only
.0625. That's just a little higher than .06 which is just 6 times
in a hundred. That's not very probable. Still, it could
happen.
HHHHH. Let's
carry this logic on a few more times. Let's say the next study
is to observe the behavior of 5 flips of the coin, and let's say
the results are HHHHH. We got five straight heads in five flips.
You would like to conclude that this is, indeed, a two-headed
coin.
Your friend says, "Hmm,
1/2 to the fifth power is 1/32, or pretty close to .03. I must
admit that HHHHH happening by chance alone is starting to be a
little improbable, a little implausible." You remind him
that you want to go to the manager to complain. He looks at his
fingers and says, "Well, really, it's not likely, but 5 heads
in a row could happen with a fair coin."
When is chance implausible?
Everyone has slightly
different criteria for when they think it is implausible to argue
that chance alone is producing the data. Some people start to feel
it is implausible at HHHH, others at HHHHH. Others require even
more data to decide.
HHHHHH. You
observe 6 flips of the coin and it gives you six heads: HHHHHH.
As you can see by the graphic, the chance of 6 heads in 6 flips
of a fair coin is 1/64 or .0156. It's just a little more than
one in a hundred. This is so improbable that it is becoming, for
most people, implausible. They will start looking for some other
way to explain six straight heads. You, the scientist, of course,
have such an explanation: it's a two-headed coin. A two-headed
coin must yield HHHHHH as data.
HHHHHHH. We'll
do this one last time. Suppose seven flips give 7 heads. If Ho
is true then P(HHHHHHH)=.0078. That's about 7 or 8 times in a
thousand. To most people this is so improbable as to make the
Plausible Competing Hypothesis of Chance implausible.
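The whole run of studies, from one flip to seven, can be collected in one small table. The sketch below simply applies the product rule under Ho: P(Head) = .5; the variable name table is my own, and exact fractions are used so the values match the ones quoted in the text.

```python
from fractions import Fraction

# Probability of n straight heads in n flips of a fair coin, for n = 1 to 7.
table = [(n, Fraction(1, 2) ** n) for n in range(1, 8)]

for n, p in table:
    print(f"{n} heads in {n} flips: {p} = {float(p):.4f}")
```

Each extra head halves the probability, which is why chance gets steadily harder to defend as the run of heads grows.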
Reject Ho. Ho
corresponds to the PCH of Chance. At some point (4 heads? 5 heads?
6 heads? 7 heads? 8 heads? ...) the data are such that if Ho is
true the data are extremely improbable. Rather than think the
data are improbable we reject Ho. We say Ho: P(Head) = .5 can't
be true.
Community standard.
It is notable that in this example, most people feel that Ho gets
implausible somewhere around 4 straight heads or 5 straight heads.
P(HHHH) = 1/16 = .06+. P(HHHHH) = 1/32 = .03+. As we will see
later, many in the scientific community have decided that the
convention for rejecting Ho should be at 1/20 = .05. If assuming
Ho is true makes the data less probable than .05, then we generally
reject Ho. What is interesting is that .05 is right between .06
and .03, the points in the coin example where many people feel
that the plausibility of chance starts to be shaky.
Well, that is getting
ahead of ourselves. If the last paragraph is a bit obscure for
you, be sure we will gain considerable experience with these ideas
in the future.
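As a preview, the .05 convention can be turned into a tiny calculation for the coin story. This is just a sketch under the assumption that the data are an unbroken run of heads; the function name heads_needed_to_reject is hypothetical, and the rule it uses is the one stated above (reject Ho when the data are less probable than the chosen level).

```python
ALPHA = 0.05  # conventional cutoff mentioned in the text

def heads_needed_to_reject(alpha, p_head_h0=0.5):
    """Smallest number of straight heads whose probability under Ho is below alpha."""
    n = 1
    while p_head_h0 ** n >= alpha:
        n += 1
    return n

n_needed = heads_needed_to_reject(ALPHA)
print(f"Straight heads needed at the {ALPHA} level: {n_needed}")
print(f"P({n_needed} straight heads | Ho) = {0.5 ** n_needed}")
```

With the cutoff at .05, four heads (.0625) are not quite enough and five heads (.03125) are, which is the "right between" observation made above.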
Rejecting Ho leads
to discarding the PCH of Chance. In the realm of statistics,
based upon a probability argument, we reject Ho. Translating
back into the realm of science, that means we think that the plausible
competing hypothesis of chance is no longer plausible.
Psychologically and intellectually, there is some point for each
person when she or he believes that chance alone cannot account
for the results. We have eliminated one PCH.
George Polya (1968),
in his discussion of the inductive method, has shown how each
time we eliminate a plausible competing hypothesis, we make our
own scientific hypothesis more plausible. For example, we have
argued that if 7 heads come up in 7 flips, it is no longer plausible
that the coin is generating heads by chance alone. That makes
the scientific hypothesis of a two-headed coin more
convincing.
This is the whole essence
of the argument in a very simple case. All the major ideas of
Hypothesis Testing have been introduced. The probability theory
is going to get more complicated unfortunately, but we have laid
out the main ideas here to serve as an overview and road map for
you.
There are a couple
more ideas worth mentioning at this point. The next one we will
discuss is the "alpha level" or the "significance
level."
Can rejecting Ho
be an error? One limitation on the logic we've just summarized
is that we might be wrong when we reject Ho. The data always could
happen by chance alone. You could flip a coin 100 times, and it's
conceptually possible to get 100 heads, incredibly unlikely, but
conceptually possible.
Let's say we reject
Ho after 7 heads in 7 flips. Then P(HHHHHHH) = .0078... Let's
shorten .0078 to .007 (or about seven in a thousand) just to make
discussion more streamlined. If Ho is true, there are only seven
chances in a thousand of getting the data (7 heads). So we reject
Ho. But conceptually it is possible that a fair coin generated
7 heads.
So when we reject Ho,
the probability we are making an error in doing so is the probability
that a fair coin would generate the data that convinced us to
reject Ho. For example, if we reject Ho because we got 7 heads
in 7 flips, we have a .007 probability of being wrong.
Alpha Level. "Alpha"
or the "alpha level" is the probability you are wrong
when you reject Ho. As you can see from the graphic, if Ho is
true, the probability of 4 heads is about .06, the probability
of 5 heads is about .03, probability of 6 heads is about .015,
probability of 7 heads is about .007.
The probability that
you are wrong when you reject Ho depends on when you reject it.
If you reject Ho after
5 heads in 5 flips, the probability you are wrong is about .03
(which is the probability that that string of heads could happen
by chance alone.) If you reject Ho after 6 heads in 6 flips, your
alpha level (probability you incorrectly reject Ho) is about .015.
If you reject Ho after 7 heads in 7 flips, the alpha level is
.007.
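One way to make this concrete is to simulate a fair coin many times and count how often it produces the unbroken run of heads that would lead us to reject Ho. The simulation below is an illustration added here, not part of the original lecture; the function name, the number of simulated experiments, and the seed are all arbitrary assumptions.

```python
import random

def false_rejection_rate(n_flips, n_experiments=100_000, seed=0):
    """Estimate how often a FAIR coin gives n_flips heads in a row.

    If we reject Ho whenever we see n_flips straight heads, this is the
    proportion of experiments in which rejecting Ho would be an error.
    """
    rng = random.Random(seed)
    all_heads = 0
    for _ in range(n_experiments):
        if all(rng.random() < 0.5 for _ in range(n_flips)):
            all_heads += 1
    return all_heads / n_experiments

for n in (5, 6, 7):
    exact = 0.5 ** n
    print(f"reject after {n} heads: exact {exact:.4f}, "
          f"simulated {false_rejection_rate(n):.4f}")
```

The simulated proportions should land close to the exact values, which illustrates that the alpha level is simply the rate at which chance alone produces the data that convinced us.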
Significance level
and p value. While they have slightly different connotations,
the terms "significance level" and "p value"
are essentially synonymous with alpha level. They deal with the
idea that, while it is compelling to reject Ho under certain circumstances,
there is always a small probability that we are wrong in doing
so.
Statistical Conclusion
Validity. Statistical Conclusion Validity is a formal term
from the realm of science. It refers to the validity with which
we can make the argument that the results of our research were
not due to chance alone.
Statistical Conclusion
Validity encompasses the whole argument that we've just summarized
in this lecture to this point.
Back to Science.
The final graphic in this section summarizes visually what
we have discussed up to this point. It also connects us back from
statistics to science. Rejecting Ho on the basis of probability
in the Realm of Statistics is used in the Realm of Science as
an argument that the PCH of Chance is no longer plausible.
What about H1?
The way the data turned out, H1 is still probable. It predicted
all heads and that's what the data turned out to be.
What about the scientific
hypothesis? Well the scientific hypothesis is in a little
stronger position. Chance, the ever-present rival hypothesis,
has been discarded. The scientific hypothesis is a bit more plausible.
But we should make it clear that the scientific hypothesis has
not been proven. It will never be proven by these methods.
For one thing, as we
mentioned, however improbable it might be, the data could have
been due to chance. There is always a small probability we
were wrong in discarding chance. That alone means the scientific
hypothesis is not proven.
Let's look at some
other reasons why the scientific hypothesis is not proven by this
method.
Further Limitations.
Suppose we reject Ho and discard the PCH of Chance. Suppose your
friend in New Orleans is finally convinced. You've just done a
study where you observed 7 heads in 7 flips. Your friend agrees
that the string of heads is so unlikely that s/he's willing to
give up the PCH of Chance. S/he's even willing to say that it's
not a fair coin. But that does not prove that it's a two-headed
coin. There are other hypotheses which could explain the 7 heads
in 7 flips.
Magnetism. The
coin could very well have a head on one side and a tail on the
other, but it might be a fake coin made of steel so that a complex
magnetic machine might always make it land with heads facing up.
That theory also accounts for 7 straight heads.
Holograms. Maybe
the coin is a high-quality hologram which has a head and a tail
but is programmed so that it always lands with the head side facing
up. That also would account for 7 straight heads.
So Rejecting Ho and
eliminating the PCH of Chance certainly gets rid of one annoying
criticism of the study. But it does not prove the scientific hypothesis.
There may be other, equally good, even better, hypotheses to be
considered.
Since you cannot start
a convincing argument with someone until chance has been eliminated
as a PCH, what discarding the PCH of Chance does is make the case
of the scientific hypothesis arguably stronger.
Not Nothing.
What does Rejecting Ho do? In our coin story, first you noticed
that "something is going on" with this coin. Then you
turned this sense of "something going on" into a scientific
hypothesis: The coin has two heads. Your friend said that "nothing
is going on" with the coin. S/he created a skeptical hypothesis: It's
a fair coin.
Rejecting Ho basically
says "It's not nothing." The results are too improbable to
allow us to believe that nothing is going on and the data occurred
by chance alone.
But rejecting Ho does
not help to know which of all the possible somethings
is the one that's actually going on. Perhaps what's going
on is that it's a two-headed coin. Perhaps it could be a magnetic
coin. Perhaps it could be a hologram. What's actually going on
could be many things. Rejecting Ho and eliminating chance as a
PCH doesn't help you to know which of these many possibilities
is happening. Therefore rejecting Ho will not "prove"
any particular theory.
Rejecting Ho always
has this double negative feel. We are negating the idea that
nothing is happening. If you keep in mind this double negative
logic, it will help understand how we use Ho in inferential statistics
later on.
We've
studied this before. I want to remind you that we've already
developed the building block ideas for this current section when
we studied the "Catching Cold" example in the Binomial
Distribution lecture. We will now build on what we previously
learned. You also had a chance to practice the foundation for
the current discussion in the Binomial Homework.
Catching Cold.
Let's say that baseline data show that the probability of catching
a cold in Salt Lake City in January and February is known to be
.5. That is, P(Cold) = .5. (In this example we're keeping the
probability at .5 so that it's like the familiar flip of a coin.)
Suppose also that some researchers develop a cold vaccine and
therefore want to evaluate for themselves and other people whether
their vaccine works.
Research design.
They run a study with 10 volunteers to test whether the vaccine
is effective or not. They administer the vaccine to the 10 volunteers
and then determine if each volunteer gets a cold or not during
January and February. If the vaccine works, the chances of catching
cold among the volunteers should be less than .5.
Scientific Hypothesis.
The scientific hypothesis is that the vaccine will improve health
(reduce the chances of a cold).
Skeptical Hypothesis.
The skeptic will propose that the vaccine is going to have no
effect on the chances of getting a cold. Therefore
the 10 volunteers have the same probability of catching a cold
as unvaccinated people (i.e., P(Cold) = .5). According to the skeptic,
the vaccine is completely worthless.
If you listen carefully
to the debate at the beginning of any flu season, you'll hear
both of these attitudes about flu vaccines. Some scientists believe
in them and recommend them very strongly. Other scientists are
skeptical of flu vaccines and think they're not worth very much.
PCH of Chance.
Suppose the research results show very few colds among the 10
volunteers. Such results would favor the scientific hypothesis.
The skeptic would immediately start inventing plausible competing
hypotheses to explain these favorable results. Since the study
is poorly designed (it doesn't even have a control group) the
skeptic could invent many PCH's. But in statistics we are primarily
concerned with one PCH: the PCH of Chance. The skeptic will say
that you just got lucky. There were relatively few colds among
your 10 vaccinated volunteers by chance alone. After all, among
many different groups of 10 Unvaccinated people, the number
of colds would vary greatly. You were just lucky in sampling a
group who caught few colds by chance alone.
IV
and DV. The independent variable (IV) is the vaccine, and
the dependent variable (DV) is health. The researchers will measure
the DV by categorizing whether or not each volunteer gets a cold.
More formally, they will invent X (called an indicator
variable). For each volunteer, X = 0, if that person does not
catch cold. X = 1 if a volunteer does catch one or more colds.
These are very simple and straightforward measurement operations.
The scientists
expect the IV (vaccine) will affect the DV (the occurrence of
a cold). The skeptic thinks the IV will have no effect on the
DV.
BLUE
DETOUR: We will now take a brief
detour into the issue of how to tell directional from nondirectional
scientific hypotheses. I will make this text blue so you know
it is an aside. Below you will find more blue text where the ideas
we learn here will be used. The two sets of blue text make a full
idea.
Is
the scientific hypothesis directional or nondirectional?
This is a new distinction we've not introduced before. Whether
the scientific hypothesis is directional or not impacts how we
will eventually write the alternative hypothesis, H1. Let's use
the vaccine example to illustrate the difference between a directional
and nondirectional scientific hypothesis.
Directional
hypothesis. Scientific hypothesis: The vaccine will reduce
the chances of a cold. Whenever the scientific hypothesis
predicts that the IV will cause the DV to change in a certain
direction (either increase or decrease), then we say it is directional.
As we've stated it, the scientific hypothesis is directional because
it predicts that the number of colds will be reduced.
For
contrast, let's change our example so that the scientific hypothesis
is nondirectional.
Exploratory
Research. Sometimes in the early stages of vaccine research
scientists don't know a great deal about the organism that causes
a disease; they may not even be sure which of many organisms actually
cause the disease. They also don't know whether the vaccine should
be based on a dead virus or on a weakened, live virus. And if
it is based on a live virus, how much should it be weakened to
trigger an immune response without transmitting the disease? Under
these circumstances the scientists may have to do exploratory
research. They may administer a weakened live virus. They hope,
of course, that it reduces the disease, but, if it is not weakened
enough, it might increase the incidence of the disease. Either
way, they learn something. If the disease increases after the
vaccine, then they at least know that they have the right organism
and that they need to weaken it more or administer it dead. If
it reduces the disease, then it is a matter of tweaking the dose.
The
important point is that either an increase or a decrease gives
important information. So they are simply looking for a vaccine
that affects (either way) the disease. This is not unlike the
psychotherapeutic strategy of prescribing the symptom. If you
can teach people to make their symptoms worse, then you and they
know that you've found the variables that control their symptoms.
Then you and they can work with these variables to lessen the
symptoms.
Sometimes
it's important simply to have an effect, without worrying about
direction of effect. Once you know how to get an effect, you can
worry about making it go the direction you want.
Nondirectional
hypothesis. Scientific
Hypothesis: The vaccine will affect the chance of a cold.
If the researchers are in early, exploratory stages of investigation,
they might have a nondirectional scientific hypothesis. They
think they've isolated the right virus, so they think the vaccine
will have an effect. But they don't know which direction this
effect will be. They don't know if it will reduce colds or increase
colds.
That's
the issue of directional versus nondirectional scientific hypotheses.
Let's go back to our story where the research is more advanced,
and the researchers are predicting that the vaccine will reduce
colds.
Modeling
Science with Statistics. The scientific hypothesis is that
chances of a cold are reduced in people who are vaccinated. The
skeptic replies that the vaccine has no effect on colds. If the
data were to favor the scientific hypothesis, then the skeptic
would reply with the PCH of Chance. Let's cast these ideas into
statistical terms. Remember that the baseline data is that P(Cold)
= .5. That is, (hypothetical) past research shows that the probability
of catching a cold during January and February in Salt Lake City
is .5.
H1.
H1 corresponds to the scientific hypothesis. We can rewrite the
prediction that the chances of a cold should be reduced as "the
probability of a cold is less than .5." To be more succinct,
we could write "P(Cold) < .5." The "<"
means "less than." Conversely the ">" sign
means "greater than." So the alternative hypothesis
is:
H1: P(Cold) < .5
Ho.
Ho corresponds to the PCH of Chance which is based on the skeptical
hypothesis that the vaccine has no effect on colds. We can rewrite
this as "the probability of a cold equals .5." Or more
succinctly as "P(Cold) = .5." So the null hypothesis is:
Ho: P(Cold) = .5
So now
we have translated the two verbal, qualitative hypotheses from
the scientific conversation into two symbolic, mathematical statistical
hypotheses. Let's go on to the next step.
Student
Question. Shouldn't Ho be written in this case as "the
probability of a cold is greater than or equal to .5?"
Reply.
Good question. Technically yes. While we're just going to ignore
that issue at this level of statistics I'll go ahead and address
it. [Ignoring both the question and my reply will not disadvantage
you in any way in future learning, so you may read the reply or
skip it.] The issue is that the two hypotheses should together
cover all possibilities (i.e., all probabilities of a cold from
0 to 1). Then if you eliminate one of them, you can by converse
logic choose the other. Clearly,
H1 must cover the range of probabilities of colds less than the
baseline .5. So H1: P(Cold) < .5 is correctly written. That
leaves Ho to cover all the other possibilities (i.e., the probabilities
from .5 to 1). That is, Ho covers "no effect" up through
unpredicted increases in colds. But the actual test of Ho which
we are about to develop tests specifically the point where
there is no effect, that is, where P(Cold) equals the baseline
probability of .5. So reducing the null
hypothesis to the simple form Ho: P(Cold) = .5 loses us nothing
in developing our test. And I have found over the years that including
the more technical logic (such as this paragraph) has caused confusion
without shedding much light. So I've simplified the null
hypothesis a bit.
Test
Statistic. We now have two statistical hypotheses, Ho and
H1. How can we decide between them? We're going to invent this
idea of a Test Statistic. We will use the test statistic to decide
between Ho and H1.
A couple
obvious questions are what is this test statistic and how will
it work?
DV.
In our example, what were the measurement operations? The researchers
simply measured the occurrence or nonoccurrence of a cold in
each person. The dependent variable was X, an indicator
variable. X was 0 if a person caught no cold. X was 1 if a person
caught at least one cold. Such a DV will generate a series of
1's and 0's as our data. Since we have 10 volunteers, we will
have 10 data points. Each data point will be a 0 or a 1.
Test
Statistic. Let's invent a very simple test statistic (call
it TS). TS will add up all the scores (either 0's or 1's). As
you can see on the graphic, TS = Sum of X. TS is simple;
it is not even as complicated as the mean. If you think about
it, TS = Sum of X will tell us the number of people who caught
cold in our sample of 10 volunteers. All those who caught no cold
get a 0. All those who caught at least one cold get a 1. Adding
up all the 1's gives us the number of people who caught colds.
Distinction:
Test Statistic versus DV. The DV is the measurement operations
which generate the data. In this case each person will generate
either a 0 or a 1, depending on if s/he gets a cold or not. X
will give us 10 1's and 0's, one for each person. In contrast,
TS is a statistic
calculated on the data generated by X. In this respect TS is no
different than the mean or variance, just simpler.
Later
we will get to complicated test statistics such as t, chi-square,
and F. For now we want a simple case to develop the idea of what
test statistics do.
Range
of TS. Everyone
who didn't get a cold was a 0, everyone who did get a cold was
a 1, and so when we add all those up, we get the number of colds.
The researchers took a sample of 10 from the Salt Lake City population
and vaccinated them. If there are 10 people in the sample then
our test statistic could vary from 0 to 10. That is, the group
could catch between 0 and 10 colds. It
could be that none of them caught a cold, one of them caught a
cold, two of them, and so on all the way up to all ten of them
caught a cold. So the range of TS, which is shown visually on
the current graphic, is 0 to 10.
Horizontal
Axis. In the graphic, the horizontal axis gives the number
of colds (values of TS) across its entire range, starting from
0 and going to 10.
Divide
the Range of TS. We're going to divide the range of the test
statistic up into two regions:
Reject Ho region and Do not reject Ho region.
Critical
Value. The line separating the Reject Ho region from the Do
not reject Ho region will be called the critical value. That line
is red on the current graphic.
Rejecting
Ho. Recall the New Orleans coin example. Remember that for
each person there came a point at which it no longer seemed plausible
to believe that the data were happening by chance alone. For some
people that point came after 4 straight heads, for others after
5 straight heads, for others after 6 heads, and so on. The
critical value is that point at which the data lead a person to
reject Ho.
Where
to put the critical value? Eventually (in the next section)
we will make up specific criteria based on probability for deciding
exactly where to put the critical value. But for now, let's just
put the critical value in several places and think about the common
sense implications of its placement. We'll examine putting the
critical value between 2 and 3, between 0 and 1, between 4 and
5, and between 1 and 2. We'll see if any or all of those placements
of the critical value make common sense.
If
Ho is true. If Ho is true and P(Cold) = .5, then it makes
sense that we should get around 5 colds out of 10 volunteers.
So if TS = 5 (that is, you get 5 colds in your sample of 10 people),
you wouldn't want to reject Ho because 5 colds is very likely
if Ho is true. Common sense would dictate that you would want
to reject Ho for some number of colds below 5.
Between
2 and 3. The critical
value could be put between 2 and 3 colds as it is in the current
graphic. That means we would decide to Reject Ho if our
sample yielded either 0 or 1 or 2 colds. If the sample yielded
3 or more colds we would decide to Not reject Ho. Think
about it. Ho would naturally lead to a number of colds around
5. The scientists predict fewer colds. Is 2 colds sufficiently
few that it would lead you to reject Ho?
Between
2 and 3 is just one place to put the critical value. Arguments
could be made for putting the critical value in many different
places.
Back
to science. Recall that when we reject Ho in
the realm of statistics we are saying that the PCH of Chance
is implausible in the realm of science. This means that
the scientists have eliminated one of the skeptics' PCH's. Moreover,
the data (2 or fewer colds) certainly fit both the alternative
hypothesis (H1: P(Cold) < .5) and the scientific hypothesis
that the vaccine reduces colds.
Between
4 and 5. People could make an argument that we should put
the critical value between 4 and 5 because you'd expect around
5 colds out of 10 people if Ho is true. That means you would reject
Ho for 4 or 3 or 2 or 1 or 0 colds. They could argue that 4 colds
or lower indicate a reduction in colds. The trouble with this
particular argument is that not everyone would believe you when
you rejected Ho because by chance alone it would be very easy
to get 4 colds out of 10 people even if P(Cold) = .5 were true. Still,
there is a certain logic to saying anything below 5 is a reduction.
Think about it and sense how you feel about it.
Between
0 and 1. Set the critical value between 0 and 1. Now you reject
Ho only if there are no colds in the sample of 10 people. That's
very stringent, but it's a logical place to put the critical value.
If we vaccinate 10 people and none of them catch a cold, we can
argue with conviction that that's not likely to happen
by chance alone. So we can reject Ho and therefore argue that
the PCH of Chance is no longer plausible. The problem with this
placement of the critical value is that even if the vaccine was
effective at reducing the number of colds it might not completely
eliminate them. So this stringent critical value might miss an
effective vaccine.
Question.
What he said is he feels a little lost. He wants to know how to
determine where to put the critical value because I'm just putting
it along the range of TS at all these different places.
Reply.
That's exactly right, that's the impression I wanted. The critical
value could be about anywhere. What are the criteria people use
to decide where to put it? That's a good sense of curiosity. To
answer that we will need to use our knowledge of probability,
specifically our knowledge of the Binomial Distribution. I like
you wondering about this because then when we go through how to
use the Binomial to establish criteria for setting the critical
value, you'll have some motivation for learning about it. The
simple answer to the question for the moment is that we need probability
theory and we're building toward that.
But...
for now, the point I'm making is there's a lot of places (as he's
pointed out) where we can put the critical value. Let me put it
in one more place.
Between
1 and 2. The critical value could be put between 1 and
2. That's a perfectly sensible place. If we only get 1 or 0 in
our sample we would reject Ho. And, by converse logic, the alternative
hypothesis that P(Cold) < .5 remains probable. Then mapping
back into the realm of science we could say that PCH of chance
is implausible and the scientific hypothesis (that the vaccine
reduces the number of colds) remains plausible.
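The four placements discussed above can be laid side by side with a short sketch. Representing "between 2 and 3" as the number 2.5 is a convenience assumed here, and the names reject_ho and rejection_region are illustrative only. No probabilities are involved yet; the code just lists which values of TS fall in the Reject Ho region for each placement.

```python
def reject_ho(ts, critical_value):
    """Lower-tail rule: reject Ho when the number of colds is below the critical value."""
    return ts < critical_value

def rejection_region(critical_value, n=10):
    """All values of TS (0..n) that would lead us to reject Ho."""
    return [ts for ts in range(n + 1) if reject_ho(ts, critical_value)]

# Critical value placements from the discussion, written as midpoints (an assumption)
placements = {
    "between 2 and 3": 2.5,
    "between 4 and 5": 4.5,
    "between 0 and 1": 0.5,
    "between 1 and 2": 1.5,
}

for label, cv in placements.items():
    print(f"critical value {label}: reject Ho for TS in {rejection_region(cv)}")
```

Moving the critical value toward 5 makes the rejection region larger and easier to reach by chance; moving it toward 0 makes it smaller and easier for an effective vaccine to miss.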
Summary.
We've established the idea of a TS and how it differs from
the data generated by the DV. (The TS is a statistic calculated
on the data.) Then we've introduced how we can divide the range of
the TS into regions where we would or would not reject Ho. The
Reject Ho and Do not reject Ho regions are separated
by a critical value. Then we found that we can argue for
setting the critical values in many different places.
What we
did not do is learn how to use the Binomial Distribution
to find the sampling distribution of the Test Statistic. If we
know the sampling distribution of the TS, we can establish specific
probability-based criteria for setting the critical values. I
have found that it is worthwhile to introduce the basic logic
and vocabulary of rejection regions before bringing in probability
theory, which can be a difficult step for many people.
Abduction:
Nature to Science to Probability. Suppose one of the volunteers
in the vaccine study is walking in a crowd, at some risk of contacting
a cold virus. Everyone is breathing the same air. Some people
are coughing and sneezing. Our measurement operations reduce that
volunteer to a DV symbolized by X. X is either a 0 or a
1, depending on whether or not the person caught a cold. Next
we model X as a Bernoulli trial.
Bernoulli
Trial. Assuming Ho is true (and we always assume Ho is
true in this part of the process) the volunteer's probability
of a cold is .5. In other words, we assume that the vaccine
is ineffective and so all data is generated by chance alone.
So there's a 50-50 chance that the volunteer will receive a score
of 0 and the same chance s/he will receive a 1. Only two things
can happen, either s/he gets a cold or s/he doesn't. In the past
we've learned that these kinds of two-outcome situations can be
modeled by the Bernoulli Distribution.
What
is the Probability Distribution of X? The probability distribution
of the DV (in this case, X) is modeled by a Bernoulli Trial. The
probability distribution of X is shown on the right in the graphic
(above). You can see that on the graph both 0 and 1 have the same
probability (height of the black bars).
Population.
Recall that the probability distribution of the DV is often called
a population. In the current example this is a Bernoulli Trial.
In the realm of science, when the scientists measure one volunteer
it is equivalent in the statistical model to drawing a single
score (X) from the population.
Sampling.
When the scientists running the study find and measure a sample
of 10 volunteers, the statistical model summarizes all that work
as simply sampling 10 people from a population. In this case we
can see on the above graphic that we will randomly draw 10 people
from the population. This gives us a sample of 10 scores. For
illustration, I have made the score of the first person (X1) equal
to 1, the score of the second person (X2) equal to 0, and the
score of the last (or nth) person equal to 1. So our sample data
will consist of a bunch of 0's and 1's.
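Drawing such a sample from the population can be imitated on a computer. The sketch below assumes Ho is true (P(Cold) = .5) and uses Python's standard random number generator; the function name draw_sample and the seed value are arbitrary choices made for illustration.

```python
import random

def draw_sample(n=10, p_cold=0.5, seed=None):
    """Draw n Bernoulli scores: 1 = caught a cold, 0 = did not."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_cold else 0 for _ in range(n)]

sample = draw_sample(seed=1)   # fixed seed so the draw can be repeated
print("scores:", sample)
print("number of colds:", sum(sample))
```

Changing the seed gives a different string of 0's and 1's, which is the statistical model's version of running the study again with 10 new volunteers.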
Calculate
TS. Once the scientists have their sample data, they calculate
the TS = Sum of X, which gives them the number of colds in their
sample. So the TS might have any value between 0 and 10.
Sampling
Distribution of TS. Each DV score (X) is a Bernoulli Trial.
There are 10 scores. TS gives us the number of colds out of 10.
If we define a cold as a success, then the number of colds in
the sample is the number of successes in 10 Bernoulli Trials.
Your past experience with the Binomial Distribution should
let you now remember that this is a Binomial Sampling Distribution.
Two distinctions
are important to keep in mind.
DV
versus TS. The DV (here, X) generates the individual data
points. The TS is a statistic calculated on those data points.
Population
versus Sampling Distribution. The population gives the probability
that a single score takes on one of its values. The Sampling Distribution
gives the probability that the TS will take on one of the values
in its range. If we want to know the probability TS = 3 (i.e.,
there are 3 colds in the sample) we find that out from the Sampling
Distribution.
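As a reminder of how the Sampling Distribution answers a question like "What is the probability that TS = 3?", here is a minimal sketch of the binomial formula for 10 Bernoulli trials with P(Cold) = .5. The function name binomial_pmf is chosen here for illustration; math.comb is the standard library's "n choose k".

```python
from math import comb

def binomial_pmf(k, n=10, p=0.5):
    """Probability of exactly k successes (colds) in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Sampling distribution of TS when Ho: P(Cold) = .5 is true
sampling_distribution = {k: binomial_pmf(k) for k in range(11)}

for k, prob in sampling_distribution.items():
    print(f"P(TS = {k:2d}) = {prob:.4f}")
```

The printed table shows the shape described in the text: most of the probability sits near 5 colds, and very little sits at 0 or 10.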
That's
all by way of review from the Binomial lecture. You have already
seen the current graphic (above) showing the 4 steps to a sampling
distribution when we studied sampling distributions. It's good
to be familiar with it and get a good concept of the overall 4-step
pattern because it's a very useful visual representation of a
complex set of variables and mathematical relationships.
BACK
TO THE BLUE DETOUR:
Three scientific cases. Let's use the vaccine example to
develop three distinct cases.
Case
1: This is the case we've focused on in the above examples.
The scientists think that the vaccine is well-developed and effective.
The scientific hypothesis is "The vaccine (IV) will reduce
the chances of a cold (DV)." The scientific hypothesis is
directional: it predicts fewer colds after vaccination.
Case
2: We mentioned this case briefly, above. The scientists don't
think the vaccine is well-developed. They are pretty sure that
they have the correct virus isolated, but they're not sure that
it has been weakened just the right amount. It might reduce colds
or it might increase colds. The scientific hypothesis is "The
vaccine (IV) will affect the chances of a cold (DV)." The
scientific hypothesis is nondirectional: it predicts either an
increase or decrease in colds after vaccination.
Case
3: We have not talked about this case yet. It is an early
stage of vaccine research. The whole research team is pretty sure
they have isolated the correct virus. But there is a strong difference
of opinion on how much to weaken the virus. A group of dissenters
believes that the virus has not been weakened nearly enough. The
dissenters believe the vaccine will surely increase the number
of colds. The dissenters' scientific hypothesis is "The vaccine
(IV) will increase the chances of a cold (DV)." This scientific
hypothesis is also directional, but in the other direction: it
predicts more colds after vaccination.
Keep
these three cases in mind for a minute.
Tails
of Distributions. When you picture in your mind the
graph of a probability distribution (e.g., normal or binomial)
you'll notice that it generally has a large bump of high probability
in the middle and then tapers off in both directions until the
probabilities are very low. In the current graphic (above) you
can see this general shape in the case of the binomial sampling
distribution: high probability in the center, stepping down on
both sides until the probability is negligible. The tails of a
distribution are where the probabilities taper off toward zero
(on both sides). Each distribution typically has two tails, an
upper tail and a lower tail (see graphic above).
Let's
now put the three scientific cases together with the idea of tails
and integrate all that into our idea of rejecting Ho.
One-tailed
rejection region: LOWER. In Case 1, the scientists are expecting
the reduction in the number of colds. So in our statistical model,
it makes sense to reject Ho when there are very few colds in our
sample. In the graphic above this case is shown on the far left.
You can see that there is only one rejection region and it is
placed in the lower tail of the distribution. This is called a
one-tailed rejection region or a one-tailed test of Ho. In this
case we would write a one-tailed (lower) alternative hypothesis:
H1: P(Cold) < .5
Two-tailed
rejection regions. In Case 2, the scientists are expecting
that the vaccine will affect colds, but they are unable to predict
in which direction. It is a nondirectional scientific hypothesis.
Therefore in our statistical model, it makes sense to reject Ho
if there are either very many colds or very few colds. You can
see this case in the center of the current graphic above. As the
graphic shows, there is a reject Ho region in both tails of the
distribution. This is called a two-tailed rejection region or
a two-tailed test of Ho. In this case we would write a two-tailed
alternative hypothesis:
H1: P(Cold) not equal
to .5
One-tailed rejection region: UPPER. In Case 3, the dissenting scientists
are expecting that the vaccine will increase the number of colds.
So in our statistical model, it makes sense for the dissenters
to reject Ho when there are very many colds in our sample. In
the graphic above this case is shown on the far right. You can
see that there is only one rejection region and it is placed in
the upper tail of the distribution. This is called a one-tailed
rejection region or a one-tailed test of Ho. In this case we would
write a one-tailed (upper) alternative hypothesis:
H1: P(Cold) > .5
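The three cases can be sketched in a few lines of Python. This is only an illustration, not part of the Authorware program: the function name and the two cutoffs (reject at 1 or fewer colds in the lower tail, 9 or more in the upper tail, out of 10 volunteers) are hypothetical values picked to show where the rejection region sits in each case.

```python
def rejection_region(case, n=10, lower_max=1, upper_min=9):
    """Return the set of cold counts for which Ho would be rejected.

    lower_max and upper_min are hypothetical cutoffs chosen only for
    illustration; the lecture has not yet said how to choose them.
    """
    lower_tail = set(range(0, lower_max + 1))   # very few colds
    upper_tail = set(range(upper_min, n + 1))   # very many colds
    if case == "lower":        # Case 1, H1: P(Cold) < .5
        return lower_tail
    if case == "two-tailed":   # Case 2, H1: P(Cold) not equal to .5
        return lower_tail | upper_tail
    if case == "upper":        # Case 3, H1: P(Cold) > .5
        return upper_tail
    raise ValueError(f"unknown case: {case!r}")


for case in ("lower", "two-tailed", "upper"):
    print(case, sorted(rejection_region(case)))
```

Notice that only the direction of the scientific hypothesis changes between the three calls; Ho stays the same throughout.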
The
overview shown in the last graphic above, along with the discussion
about it, requires the integration of the many new ideas which
we've been discussing. It is good to come back to this graphic
and discussion at various future points in the class when we are
applying these ideas.
For
now, what is important is to realize we are creating statistical
models of important scientific ideas (scientific hypothesis, skepticism,
PCH of Chance). To use the statistical model well, you need to
understand how the various aspects of the model (Ho, H1, critical
values, rejection regions) relate to the scientific context. We
have focused on how the nature of the scientific hypothesis (directional
versus nondirectional) affects how we write H1 and where we put
the rejection region(s).
Question.
How does this relate to the New Orleans coin example?
Reply.
I made the coin example as simple as possible. I made the scientific
hypothesis that the coin is two-headed. So we are predicting not
just an increase in the number of heads that would normally occur
with a fair coin, but that there would be nothing but heads. That
makes it strongly one-directional: way more heads than a fair
coin would generate. In the coin example Ho: P(Head) = .5, and
H1: P(Head) = 1.
We've
studied this before. Once again I want to remind you that we've
already developed the probability tools for this next section when
we studied the "Catching Cold" example in the Binomial
Distribution lecture. In fact, we've previously worked through all
the technical details of this section, both in the Binomial lecture
and in the hands-on homework assigned afterwards. We also worked
on foundations for this current material in the Sampling Distribution
lecture, particularly the Binomial Sampling Distribution. Consequently,
the current material should build naturally on your previous learning
without your needing to go back to it. But if the details seem fuzzy
or confusing, you might want to review the Binomial lecture and
the Sampling Distribution lecture as well as the homework you did
with those two lectures.
Review Example.
We have n = 10 volunteers who try a new vaccine. If the vaccine
is effective the probability of a cold should be less than .5; if
it is ineffective, the probability of a cold should be .5. So H1
is that P(Cold) < .5 and Ho is that P(Cold) = .5.
Whenever we construct
a test of Ho, we always assume Ho is true. (You can't test
it if you don't use it as the model.)
The test statistic (TS)
will be the number of colds in the 10 volunteers. If we call catching
a cold a "success" and assume that Ho is true (P(Cold) = .5), then,
in probability jargon, the Sampling Distribution of the TS is
Binomial with N = 10 and p = .5.
Review the binomial
sampling distribution. As we've argued, the sampling distribution
of TS is the Binomial with N = 10 and p =.5. Along the horizontal
axis is the number of colds in the sample starting from 0 and going
to 10. If we call a cold a "success" then the horizontal
axis is "r," the number of successes (colds) in 10 trials
(volunteers). This assumes Ho is true, P(Cold) = .5.
The current lecture graphic
(above) includes a small table on the left side so that we can get
explicit probabilities for each number of colds from 0 to 10. Looking
at that table you can see that the probability of 0 colds is .001,
that is, 1 in a thousand. In contrast, the probability of 5 colds
is .2461, that is, almost 1 in 4. We will use the probabilities
in the table to develop the ideas in this lecture. But when you
do homework you will have the Binomial Tool to work with, so you
won't need the table. You can directly use its output.
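If you do not have the Binomial Tool at hand, the same table can be reproduced with a short script. This is a minimal sketch in Python (my choice of language here, not something the course requires); the function name binomial_pmf is made up for this example.

```python
from math import comb


def binomial_pmf(r, n=10, p=0.5):
    """Probability of exactly r successes (colds) in n trials (volunteers)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Sampling distribution of the TS, assuming Ho is true: P(Cold) = .5, N = 10.
table = {r: binomial_pmf(r) for r in range(11)}

for r, prob in table.items():
    print(f"{r:2d} colds: {prob:.4f}")
```

The printed values are rounded to four decimals, like the table on the graphic, and the eleven probabilities add up to 1.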
What's the question?
Remember the question which we're leading up to. What are the criteria
for setting critical values? How do we know where to put the
critical values which define our rejection region(s)? A while
ago, we kept putting the critical value in different places: we
put it between 2 and 3, between 4 and 5, between 1 and 2. We kept
moving it around. At one point a student asked how to figure out
where to put the critical value. We're now going to use a bit of
probability theory and a bit of common sense to answer that question.
A Case 1 Example.
We will assume that the scientists believe the vaccine will reduce
the chances of a cold. That is, we will construct a one-tailed (lower)
rejection region (Case 1) because the scientific hypothesis predicts
a small number of colds. Ho: P(Cold) = .5. H1: P(Cold) < .5.
Critical value between
2 and 3. If the critical value is between 2 and 3 then we reject
Ho when the number of colds is 0 or 1 or 2 colds. If the data show
3 or more colds, then we do not reject Ho. What we are rejecting
when we reject Ho is the idea that there is a 50-50 chance of a
cold.
Alpha. Recall
that alpha = Probability of (incorrectly) rejecting Ho when it
is true. Rejecting Ho when it is true is, of course, a mistake,
an error. Alpha is the probability of such an error.
If you think of the logic
of it, the only way to calculate alpha is by assuming Ho is true.
What is alpha if critical
value is between 2 & 3? Assume Ho is true. The probabilities
of getting 0 or 1 or 2 colds by chance alone are .0010, .0098, and
.0439, respectively. These are circled in red in the table in the
current graphic. Adding these probabilities up, .0010 + .0098
+ .0439 = .0547.
This is very much like
flipping a fair coin. If you flipped a fair coin (where P(Head)
= .5) ten times, what would be the probability of getting either 0 or
1 or 2 heads? It would be the same, .0547.
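The addition above can be checked directly. The sketch below (Python, with names of my own choosing) computes the exact binomial probabilities under Ho and adds up the ones that fall in the rejection region.

```python
from math import comb

n, p = 10, 0.5  # Ho: P(Cold) = .5, ten volunteers


def pmf(r):
    """Probability of exactly r colds in n volunteers when Ho is true."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Critical value between 2 and 3: reject Ho for 0, 1, or 2 colds.
rejection_region = [0, 1, 2]
alpha = sum(pmf(r) for r in rejection_region)

print(f"alpha = {alpha:.4f}")
```

The exact value is 56/1024, which rounds to the .0547 we got by adding the table entries.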
What have we accomplished?
What we've done is discover the probability of mistakenly
rejecting Ho when Ho is true. That is, if the vaccine is
ineffective and Ho is true, then the probability is .0547 of getting
0 or 1 or 2 colds by chance alone (and therefore rejecting Ho and
consequently thinking incorrectly that the vaccine is effective).
.0547 is a little larger than 1 chance in 20.
Why is this important
to science? Consider what would happen if the vaccine were worthless,
but by chance alone so few colds occurred (say, only 1 cold) that
we rejected Ho. The scientists would then be inclined to think that
this did not happen by chance alone (even though it did). Consequently,
they would be inclined to think the vaccine had some value. Therefore,
they and other research labs might spend a great deal of effort,
time and money pursuing research on this vaccine, only to find out
later that it is worthless. It is very costly to science to think
an IV is effective when it is not.
Consequently, scientists
want alpha (the probability of rejecting Ho when it is true) to
be very low. By social convention and common sense, scientists
generally insist that alpha be smaller than 1 in 20. [Note:
If you divide 20 into 1 you will get .05.]
We just found that if
we put the critical value between 2 and 3, the probability of mistakenly
rejecting Ho when it's true is .0547. Let's compare .0547 with the
probability of making this mistake when we put the critical value
in some other places.
Critical Value between
1 & 2. Putting the critical value between 1 and 2 means
we will reject Ho if we get 0 or 1 cold in our sample. If we get
2 or more colds we do not reject Ho.
Alpha. What is
alpha? You can get that directly and dynamically if you are working
online with the Binomial Tool. But for this lecture look at the
table on the left of the current graphic. The probabilities for
0 and 1 cold are outlined in red. When we add these probabilities
up, we get alpha. Alpha = .0010 + .0098 = .0108. Looking
at the Reject Ho region on the graphic we can see that the probability
of falling in the Ho rejection region is .0108, or a little more
than 1 in 100. The scientific convention is that alpha should be
smaller than .05. Certainly .0108 is smaller than .05.
Remember the scientists
don't know if Ho is true or not. (If they knew, they wouldn't
have to do research.) All the scientists know is the data they got.
So if they get 0 colds or if they get 1 cold, they will reject Ho.
They hope they are making a correct decision. What they do know
is alpha. If Ho is true, the probability of getting 0 or 1 cold
by chance alone is about 1 chance in 100. So if Ho is true, their
chances of rejecting it incorrectly are only .0108. This is small.
They are willing to take the chance.
Between 0 & 1.
Putting the critical value between 0 and 1 means we will reject
Ho only if we get 0 colds in our sample. If we get 1 or more colds
we do not reject Ho.
Alpha. What is
alpha? You can get that directly and dynamically if you are working
online with the Binomial Tool. But for this lecture look at the
table on the left of the current graphic. The probability of 0 colds
is circled in red. Alpha = .0010. Looking at the Reject Ho
region on the graphic we can see that the probability of falling
in the Ho rejection region by chance alone is .0010, or 1 in 1000.
The scientific convention
is that alpha should be smaller than .05. Certainly .001 is smaller
than .05.
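To see the three placements of the critical value side by side, here is a small sketch in Python (the function name lower_tail_alpha is hypothetical). It uses exact probabilities, so one result differs in the fourth decimal from what you get by adding the rounded table entries: for the critical value between 1 and 2 the exact alpha rounds to .0107, while .0010 + .0098 from the table gives .0108.

```python
from math import comb


def lower_tail_alpha(largest_rejected, n=10, p=0.5):
    """Alpha for a lower one-tailed test that rejects Ho when the
    number of colds is 0, 1, ..., largest_rejected (Ho assumed true)."""
    return sum(
        comb(n, r) * p**r * (1 - p) ** (n - r)
        for r in range(largest_rejected + 1)
    )


# Critical value between 2 & 3, between 1 & 2, and between 0 & 1.
for largest_rejected in (2, 1, 0):
    alpha = lower_tail_alpha(largest_rejected)
    verdict = "smaller than .05" if alpha < 0.05 else "not smaller than .05"
    print(f"reject at {largest_rejected} or fewer colds: "
          f"alpha = {alpha:.4f} ({verdict})")
```

Only the last two placements bring alpha under the .05 convention; the placement between 2 and 3 is just over it.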
[Before going on with
the flow of the lecture, I'm going to answer some student questions
which may clarify the ideas we are learning. If you find an overview
useful, read on; otherwise skip down to the next graphic.]
Question. Can
you go over the whole thing again all at once?
Reply. Let me
summarize this whole process, even if it means repeating a lot.
These are tricky ideas the first time you run into them and repetition
helps. We have a vaccine. In Salt Lake City baseline data show that
there is normally a .5 chance of catching a cold. We think that
our vaccine is effective and the probability of catching a cold
will be less than .5. The skeptic thinks that the vaccine doesn't
work and the probability of a cold is .5. Okay, granting the skeptic
that s/he is right, we set up a sampling distribution of the TS
(number of colds) assuming the P(Cold) = .5. That is, we assume
Ho is true. That means that if the skeptic is right and this vaccine
is just salt water with no effect whatsoever, the most likely
outcome, of course, is that you'd get 5 colds. This has a probability
of .2461 (see table on graphic) or about 1 in 4. But there's a really
good chance you'd get 4 colds or 6 colds (.2051 each, or about 1 in 5)
and a good chance you'd get 3 or 7 colds (.1172 each, or about 1 in 9).
And so forth, as you can see from the table and the shape of the Binomial
Sampling Distribution. But, if we put the critical value between
2 and 3 and still assume Ho is true, then the probabilities of getting
0, 1, or 2 colds by chance alone add up to roughly .05 (a little
more than .05, in fact: .0547). So if the skeptic is right, the probability
of getting this few colds (0, 1, or 2) is rather small; it's about
.05 (1 in 20).
Question. How
does this relate to the New Orleans coin example?
Reply. Let's go
back to the New Orleans coin. That should be useful and integrative.
Is it a fair coin or not? Let's grant your skeptical friend his
or her due. Let's grant that it's a fair coin, i.e., assume Ho is
true. What is the probability of getting heads with a fair coin? Well,
it's .5 for each flip. Flip a coin once, and get a head. It's real
plausible to argue that 1 head in 1 flip happened by chance.
The probability of 1
head in 1 flip by chance alone with a fair coin is .5. If you decide
that 1 head in 1 flip is enough evidence to reject Ho (i.e., you
decide that it is a two-headed coin), then alpha would be .5. That
is, the probability of incorrectly rejecting Ho when it is true
is .5. That is very high.
Well, flip it twice and
get two heads. "Two straight heads!" you say, "This
ain't happenin' by chance." You reject Ho. For the critic, it's
still pretty plausible to argue 2 heads in 2 flips has happened
by chance (1 chance in 4). Your alpha would be .25, if you rejected
Ho for 2 heads in 2 flips.
You flip the coin three
times and get three heads. "Three straight heads!" you
say. Now you're certain it's not happening by chance alone. You
reject Ho. For the critic it's still plausible (1 in 8) that 3 heads
in 3 flips happened by chance. Alpha would be .125.
Flip it four times. You
get 4 heads in 4 flips. You reject Ho. Again, for the critic, it's
still kind of plausible that you could get 4 heads in a row with
a fair coin; the chances are 1 in 16. Alpha would be .0625.
Flip the coin 5 times
and get 5 heads. Reject Ho. Now it's starting to be less plausible
to believe that the data are happening by chance alone (1 in 32).
Alpha would be .03125. Notice that at this point alpha is less than
.05. The convention is that when alpha is less than .05, most scientists
will accept the rejection of Ho.
Somewhere between 4 heads
in 4 flips and 5 heads in 5 flips (between 1/16 and 1/32) is the
psychological point where many people find it compelling to reject
Ho. And .05 (the scientific convention) is 1 in 20 which is right
between 1/16 and 1/32.
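The sequence of alphas in the coin example is just .5 multiplied by itself once per flip. A quick sketch in Python (names are mine) lists them and marks where alpha first drops below the .05 convention:

```python
# Alpha if we reject Ho (fair coin, P(Head) = .5) after seeing
# k heads in k flips: the chance of that run with a fair coin.
alphas = {k: 0.5 ** k for k in range(1, 6)}

for k, alpha in alphas.items():
    note = "below .05" if alpha < 0.05 else "not below .05"
    print(f"{k} heads in {k} flips: alpha = {alpha:.5f} ({note})")
```

Five straight heads is the first run whose alpha falls below .05.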
Let's go back to the
flow of the lecture.
Summary of calculating
alpha. To calculate alpha, first assume Ho is true when
you construct the sampling distribution of the TS. This is because
you want to know the probabilities of getting data in the rejection
region by chance alone (i.e., when Ho is true). To put it yet another
way, if you assume Ho is true when you construct the sampling distribution
of the TS, then that sampling distribution will allow you to find
the probability that your TS falls in the rejection region.
Next set the critical
value. Alpha will always depend on where along the range of
the TS the critical value is placed. You set the critical value
at a place where you and your colleagues feel that alpha is acceptably
small.
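One way to turn "acceptably small" into a procedure is sketched below in Python. It is only an illustration under stated assumptions: a lower one-tailed test, the binomial sampling distribution with Ho true, and a ceiling of .05 on alpha. The function name and its parameters are hypothetical. It widens the rejection region one outcome at a time and stops just before alpha would reach the ceiling.

```python
from math import comb


def largest_lower_rejection_region(n=10, p=0.5, max_alpha=0.05):
    """Return (largest count rejected, alpha) for a lower one-tailed test.

    Assumes Ho is true (success probability p) and keeps alpha strictly
    below max_alpha. Returns (None, 0.0) if even 0 successes is too likely.
    """
    cumulative = 0.0
    chosen, chosen_alpha = None, 0.0
    for r in range(n + 1):
        cumulative += comb(n, r) * p**r * (1 - p) ** (n - r)
        if cumulative >= max_alpha:
            break  # including r would push alpha to the ceiling or past it
        chosen, chosen_alpha = r, cumulative
    return chosen, chosen_alpha


cutoff, alpha = largest_lower_rejection_region()
print(f"reject Ho at {cutoff} or fewer colds; alpha = {alpha:.4f}")
```

For the vaccine example this picks the critical value between 1 and 2, since adding 2 colds to the rejection region would raise alpha to about .0547.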
There will be one other
important variable which affects alpha but we will talk about it
in a later lecture called "Degrees of Freedom."
Statistical Conclusion
Validity. Let's make sure we tie up all the jargon. Statistical
conclusion validity is the validity with which we can refute the
hypothesis that the data pattern is due to chance alone. When we
reject Ho in the statistical model we translate that rejection back
into science by discarding the PCH of Chance. We are saying the
data did not happen by chance alone. It's not nothing. Something
is going on.
Alpha as validity.
Alpha is the probability of rejecting Ho when it is in fact true. To use a
different example than the one which is on the current screens (above),
suppose we put the critical value between 0 and 1, that is, we reject
Ho only if we get 0 colds. There is only 1 chance in 1000 that
we would get such a data pattern (i.e., 0 colds) by chance alone.
Chance alone has a very
low probability (.001) of generating the data. You can argue that
chance no longer is a plausible explanation of the data.
If a critic questions
our rejection of Ho, we can point out that, sure, we could
be mistaken, but the chances of that are only .001. We will claim
that .001 is so low that we have a valid argument for rejecting
Ho (and consequently discarding the PCH of Chance). A low alpha
is the centerpiece of our logical argument that we are validly
discarding chance as a PCH.
This ends the formal
lecture on Hypothesis testing. Below is an answer to one question
asked by a student.
Question. Can
you go over how to calculate alpha again? [If it's useful
to you, read the reply. If not, you have finished the lecture.]
Reply. When I'm
rejecting Ho, I'm rejecting the idea that the probability of getting
a cold is .5 and therefore the data are occurring by chance alone.
But what if Ho is actually true, what if the data are only
happening by chance? What's the probability of rejecting Ho when
it's true? Alpha gives you that probability.
Between 2 & 3.
But what is alpha in this particular case? Let's just do the mechanics.
I've outlined in red (on the above graphic) the various probabilities
that apply to alpha in this case. Those probabilities are .0010,
.0098, and .0439. If we add those probabilities up, we see that
they add up to .0547. This is how we calculate alpha.
