Hypothesis Testing Web Page

This is the text of the in-class lecture which accompanied the Authorware visual graphics on this topic. You may print this text out and use it as a textbook. Or you may read it online. In either case it is coordinated with the online Authorware teaching program.

Go To: Setting Critical Values | 1 & 2 tailed regions | Vaccine Example | PCH of Chance | Science and Statistics | H0 & H1
Topic Locator Map


This map allows you to--

  1. Jump directly to a topic which interests you.
  2. Co-ordinate the dynamic visual Authorware presentations with the corresponding text available on this web page.

1. To find a topic which interests you: Look at the map of menus above. Choose a menu that interests you. Notice that the menu buttons have topics printed on them. Click on any button (topic) on the menu; you will jump directly to the text that corresponds to the topic printed on the button.

2. To coordinate this web page with Authorware presentations: The corresponding Authorware program should already be open. Go to the menu of your choice in the Authorware program and click any button which interests you. Then on the topic locator map above click on the same button on the same menu; you will jump to the text that corresponds to the Authorware presentation.

End of Topic Locator Map

Begin Text Explaining Hypothesis Testing

Back to Topic Locator Map

Science and Statistics. We've gone to some length in previous lectures to clarify the relationships among natural phenomena, science, and mathematical models. We have defined what we mean by DV measurement operations in the realm of science and have shown how those are modeled by probability concepts such as the normal population and the binomial population. In this lecture we will further develop explicit correspondences between scientific ideas and statistical models.

Statistical Conclusion Validity & Hypothesis Testing. Statistical Conclusion Validity is an idea developed in classic discussions (e.g., Cook and Campbell, 1979) of scientific methodology. Hypothesis testing is a formal model developed in probability and statistics. The two ideas have direct and formal correspondences with each other. What I'm making clear is that when I use the term "Statistical Conclusion Validity" I'm speaking in the realm of science; and when I use the term "Hypothesis Testing" I am speaking in the realm of mathematical-statistical models. This lecture will focus on developing these two ideas and the relationships between them.

Back to Topic Locator Map

We are going to develop the idea of the PCH of Chance. But before we can explain what we mean by that we need to discuss several preliminaries.

Scientific Hypotheses. As you know from general knowledge, scientists have hypotheses and they do research to evaluate the validity of their hypotheses. A very general form of scientific hypothesis is that some IV will cause some change in some DV.

Scientific Skepticism. A healthy part of science is its value of skepticism. Scientists don't accept beliefs and hypotheses just because someone says they are true; rather, scientists have developed extensive common sense, logical, and cultural traditions for testing beliefs and hypotheses. The fundamental attitude of science is to be skeptical, to assume a new hypothesis is not true until there are compelling reasons to believe it.

So if one scientist says that the IV causes a change in the DV, the skeptic says, "No it doesn't. The IV has no effect on the DV." The skeptic shows up most clearly as your scientific competitors at a different university or in a different lab. In parody we could say that the knee-jerk reaction of any scientist is not to believe anything, especially if it's proposed by another scientist. Parodies aside, scientific skepticism is very important. And it is perhaps most important in stepping back and taking a skeptical attitude toward your own most cherished hypotheses. It would be difficult to do good research without, at times, challenging, in your own mind, the hypothesis you are evaluating.

Skeptical Hypothesis. I will establish a convention called the skeptical hypothesis. If the scientific hypothesis is that the IV causes a change in the DV, then the automatic skeptical hypothesis is that the IV has no effect on the DV.

Blood Pressure Example. The discussion to this point has been rather abstract. To be a little more specific, suppose that you develop a pill which you think will reduce blood pressure. So your scientific hypothesis is something like, "The blood pressure pill will reduce blood pressure."

Skeptical Hypothesis. "The blood pressure pill has no effect on blood pressure," will be the skeptic's reply. That is, implicit in the attitude of science is a skeptical hypothesis that negates the scientific hypothesis. Your own internal attitude should include this skepticism. In science it's healthy to have some amount of skepticism toward your own ideas.

One of the things you can do to counter the skepticism toward your hypothesis is design a study and collect data.

Research Design. The current Authorware graphic outlines the design of a simple study to evaluate the effectiveness of the blood pressure pill. The Pill Group takes the new blood pressure pill for 3 months. The Control Group continues as they have been prior to the study. Specifically, the volunteers in this control group do not take the new blood pressure pill nor any other pill administered by the experimenter. At the end of the three-month period, the blood pressure of both groups is measured.

Negative Results. One thing that can happen when you do research is that the results are not consistent with the scientific hypothesis. The current graphic shows such results. After three months of taking the pill the average blood pressure in the Pill Group is no different than the average blood pressure in the Control Group. The scientific hypothesis proposed that the Pill Group blood pressure would be lower.

Is the data pattern consistent with the scientific hypothesis? In the coming lectures we will focus on some sophisticated inferential statistics like t, chi-square and F. Often beginning students get caught up in using these statistics and forget to check the most fundamental question: Is the pattern of results consistent with the scientific hypothesis? If not, it means that you must rethink what is going on. Often this leads to revising the research design or the scientific hypothesis. You may or may not go on to apply inferential statistics. In any case, for the overall scientific project, these sophisticated inferential statistics are much less important than the pattern of results.

If the data pattern is not consistent with the scientific hypothesis the skeptic basically says "I told you so."

Positive Results. If your IV is effective and if you have some amount of skill and luck, the data pattern will be consistent with the scientific hypothesis. The current graphic shows such a case. You can see that the mean blood pressure in the Pill Group (red) is lower than the mean blood pressure in the Control Group (blue). This is consistent with the prediction of the scientific hypothesis.

Once you have positive results, the conversation between the scientist and the skeptic gets interesting.

The skeptic will grudgingly admit that it does appear that the data pattern fits with the scientific hypothesis. But the skeptic will start inventing ways (other than the scientific hypothesis) to interpret the results. In other words, the skeptic is not going to give up easily, if at all.

Let's say that the data have turned out just like the scientific hypothesis predicted. If this is so, the skeptic thinks of ways the data would have come out as they did even if the scientific hypothesis was not true. For example, in our little Pill Study, the skeptic would surely say that the results are due to the Placebo Effect.

Placebo Effect. It is well-known that if people believe they are receiving treatment they get better even if the treatment is bogus. If you give a group of people a pill with no active ingredient, telling them that there is a great new ingredient contained in the pill, they tend to get better. This is a very general and well-researched phenomenon. It cannot be cavalierly discounted in any research design.

The main idea here is that the skeptic can plausibly claim that even if the active ingredient in your pill is ineffective, the Pill group would have come out with a lower blood pressure because of the Placebo Effect. The participants in the Pill group were given a reason to believe that they should get better (the pill) while the participants in the Control group were not given such a reason.

So the Pill Study has no control for the Placebo effect. For this reason it is a weak study. We do not know if the blood pressure is lower in the Pill group because of the active ingredient or because of the placebo effect. The scientist argues that it is the ingredient. The skeptic argues that it is placebo. Neither side can convince the other. And so the scientist has not gained the desired logical advantage in the conversation by doing the research. The skeptic remains skeptical. She or he has a plausible hypothesis which competes with the scientific hypothesis in explaining the data pattern.

Plausible Competing Hypotheses. The placebo effect is one example of what can be called plausible competing hypotheses. The placebo effect is plausible within the research conversation. It also competes with the scientific hypothesis in explaining the results. Some books and researchers use the terms "rival hypotheses" or "rival conjectures" for what I'm calling plausible competing hypotheses. Also, to save you and me both time in writing down ideas about this topic, I will use the abbreviation PCH for a plausible competing hypothesis.

Redesign the Pill Study. The Pill study would have to be re-designed to control for the placebo effect. This could be easily done. The Pill group would remain the same. The Control group would be given a pill that is exactly the same in appearance and taste as the Pill group but without the active ingredient. The Control group would be told, of course, that they are getting the active ingredient in their pill. In this new study, the placebo effect should therefore apply to both the Pill group and the Control group. So the two groups would be equivalent except for the active ingredient. When you redo the research, you would hope that the Pill group would have an even lower blood pressure than the placebo control group. Then you could argue that both groups benefit from the placebo effect and the Pill group still produces lower blood pressures. Therefore the active ingredient must account for the lower blood pressures in the Pill group.

More PCH's. Scientific skepticism runs deep. An imaginative critic will continue to invent competing hypotheses. Many of these are standard and well-known and you will study them in research methods. For example, the critic might ask if the study was a double blind study. If you don't know the issues involved in double blind studies, that's ok, you'll learn about them in research methods. The point I'm making is that the skeptic will invent a long string of PCH's which you need to take into account in designing your research.

These PCH's generally have much more to do with research methods than they do statistics. There is one PCH, however, which is the focus of much of statistics.

In statistics, we are concerned about a special plausible competing hypothesis, and it's called "chance."

PCH of Chance. Chance is what the mathematician George Polya (1968) called "the ever-present rival conjecture." Chance is a universal plausible competing hypothesis which applies to any data set. The PCH of chance basically claims that the data pattern happened by chance alone.

Redesigned Study. Suppose we have redesigned our Pill study with a placebo control group. Both groups received a pill, one with the active ingredient, one without the active ingredient. Consequently, both groups experienced the placebo effect. Participants in both groups had reason to believe they're taking a good pill and that their blood pressure should go down. But only one group actually had the active ingredient.

Positive Results. Suppose we got positive results when we did this new study. The mean blood pressure in the Pill group was lower than the mean blood pressure in the Placebo-control group. We're feeling good about that. We show the skeptic the results of the new study.

Immediately the skeptic will say that the data pattern occurred by chance alone. The skeptic claims we were lucky. One of the two groups had to come out with a lower blood pressure and we were just lucky that it was the Pill group. It's not that our hypothesis about the active ingredient is right. The active ingredient is not effective. We were just lucky that the results make it appear as if it is effective.

Chance is plausible. I will elaborate the skeptic's argument in the following way. Suppose we divide any classroom right down the center into two sides, right side versus left side. Then we measure anything we want to measure: people's height, their weight, their GPA. Choose any dependent variable you want. Let's say we measure the height of the students, comparing the heights of those on the right side with the heights of those on the left side. We calculate the average height of those two groups.

By Chance Alone. Suppose also that we have no reason to believe students on the right are different in height than students on the left. Why should they be? But by chance alone one of these two groups will have a higher mean height. The probability is nil that the average height of those two groups would be exactly the same. The two means would not be identical. It depends on how many decimals you're willing to go out to, of course, but probably you won't have to carry decimals at all. By chance alone the two means of any two randomly created groups will be different on just about anything you measure. It's not only plausible, it's almost inevitable.
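A quick simulation makes the skeptic's point concrete. This is a minimal sketch; the class size, the mean height, and the spread are made-up numbers for illustration only:

```python
import random

random.seed(1)  # fixed seed so the demonstration is repeatable

# Hypothetical heights (inches) for a class of 30 students,
# all drawn from the same distribution -- no real group difference.
heights = [random.gauss(67, 3) for _ in range(30)]

# Split the room down the middle: 15 on the left, 15 on the right.
left, right = heights[:15], heights[15:]

mean_left = sum(left) / len(left)
mean_right = sum(right) / len(right)

# By chance alone, the two means will almost never be identical.
print(round(mean_left, 2), round(mean_right, 2))
```

Run it with different seeds; the two means differ essentially every time, just as the skeptic claims.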

What if I find that the average height on the left is higher than on the right? It is just as plausible, even more plausible, to believe that the result happened by chance alone than to believe some hypothesis about people on the left being taller.

Back to blood pressure. The skeptic says, "Of course one of the two average blood pressures was going to be higher than the other. No big deal." The skeptic claims you were just lucky that the data pattern came out with the group you predicted to be lower actually being lower. For the skeptic it is a 50-50 chance that either group is lower. It's like flipping a coin. If you flip a coin and predict heads there is a 50-50 chance you will be right. Getting a head as a result of the flip doesn't mean you are able to predict the outcome of coin flips. There's a .5 probability that you'd be right even if you have no ability to predict the outcome.

Same with blood pressure. You predicted the Pill group would be lower. It was lower. But even if the active ingredient were ineffective, there would be a 50-50 chance that the Pill group would be lower by chance alone. The skeptic is not impressed by this. The skeptic looks at your nice graph showing the blood pressure in the Pill group (red bar) is lower than the blood pressure in the Placebo Control group (blue bar). The skeptic plausibly interprets these results as having occurred by chance alone.
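The skeptic's 50-50 claim can itself be checked by simulation. In this sketch the "pill" has no effect at all (everyone's blood pressure comes from one distribution; the numbers and group sizes are made up for illustration), yet the predicted group still comes out lower about half the time:

```python
import random

random.seed(2)
trials = 10_000
predicted_lower = 0

# The "pill" is ineffective: all 40 blood pressures come from the
# same distribution. Randomly split them into two groups of 20.
for _ in range(trials):
    bp = [random.gauss(120, 10) for _ in range(40)]
    pill, control = bp[:20], bp[20:]
    if sum(pill) / 20 < sum(control) / 20:
        predicted_lower += 1

print(predicted_lower / trials)  # close to .5, just as the skeptic claims
```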

Chance universally applies to any data set. Any pattern of results might plausibly be argued to have occurred by chance alone.

What we are studying in this lecture is under what conditions is this universal rival conjecture no longer plausible. How do we argue against the plausibility of chance explaining our results? The editors of research journals will generally want to have this basic issue addressed before they agree to publish research.

Arguments against chance. The inferential statistics we will soon be studying (t, chi-square, F) are designed to address the PCH of Chance. Arguing against the plausibility that chance alone generated the research results is all that these statistics are going to accomplish. This lecture on Hypothesis Testing will develop the logic by which we can argue against the plausibility of chance generating the data pattern.

Caveat. Inferential statistics can help evaluate and possibly even strongly argue against the PCH of chance. But they won't tell you whether you did good science, whether you have a well-designed placebo group, or whether you had all the appropriate control groups. Those are different issues that we seem to put in research methods courses rather than statistics.

Therefore, just because we can validly argue that chance isn't plausible doesn't mean that we're done. And so there's a caveat--just because you get a "significant result" (that is, you can strongly argue against chance) doesn't mean that you've done a good study. The caveat is that you still have to eliminate many other PCH's, such as the placebo effect. Other PCH's will be examined in more depth in research methods.

Statistics, even good statistics, don't guarantee good science.

Jargon. We won't explain the jargon here, but merely introduce it. When we can validly argue that it isn't plausible that chance alone accounts for that data pattern, we say that we have a statistically significant result. When people say "statistically significant," they mean that they have a valid argument against the PCH of Chance.

Sometimes people say a "reliable result" to mean the same idea.

Summary. Up to this point we've talked very generally about hypothesis testing and statistical conclusion validity; and we've introduced the PCH of Chance.

Let's go back to the Hypothesis Testing menu and move on to the next topic.

Back to Topic Locator Map

We are going to talk about Ho and H1 (or the null and alternative hypotheses). These are important ideas which we have to deal with despite their unpopularity with beginning students.

A journey. To set up these ideas, we're going to take a long journey. Let's say that you and a friend go down to New Orleans, down to the French quarter, down, in fact, to Bourbon Street. And let's say you go down some stairs into the basement underneath a music club. Of course there wouldn't be such a place and if there were you surely wouldn't go there, but let's say that you and your friend find yourselves in an illegal gambling den. Now this is hypothetical. I know no one here would ever do that. It's just to provide a context in which it might be important to understand when things are happening by chance or not.

You go into a smoke-filled and tacky room, packed with people. There are big bouncers and they seem to have bulges in the right place under their suit coats to indicate they're packing heat. There's a certain electricity in this environment.

A simple coin game. Let's say that you decide to observe a certain kind of game. We'll keep the game as simple as possible because we don't have very much probability theory to work with. But a great deal of the logic of the argument applies to all gambling games, and even, by analogy, to research results. This simple little game involves a single coin lying on a table. The dealer, or maybe we should say the flipper, picks up the coin and flips it. Spinning end over end, it flies through the air and lands back on the table. Everyone is betting on whether it's a head or a tail.

And, just to help make my teaching point, we'll put a further restriction on the game. The house must always bet on heads and the client must always bet on tails.

Something is going on. Of course you're not gambling, just observing what's going on. Maybe your friend is gambling. You watch the game for a while and notice a string of heads. You say to yourself, "Something is going on. I wonder what it is?"

Scientific hypothesis. Since you think something is going on, you come up with a scientific hypothesis. The scientific hypothesis is that the coin is two-headed. If so, the players are doomed to lose.

Skeptical hypothesis. You mention your scientific hypothesis to your friend. Your friend is skeptical of your hypothesis. Perhaps your friend thinks you're being a bit paranoid. Your friend thinks that the coin is a fair coin and that the game's not fixed. In other words, your friend thinks nothing is going on.

Perhaps your friend pointedly asks you if you've picked up the coin and examined it. Of course you haven't. No one is letting players get close to the coin. Being unable to examine the coin, you can simply say that the coin is behaving like a two-headed coin.

We have a scientific hypothesis that the coin is two-headed. The skeptical hypothesis is that it's a fair coin. That is, the coin has a head on one side and a tail on the other.

Research design. So you suggest a simple research project. You and your friend observe the behavior of one flip of the coin. The scientific hypothesis predicts that it will land as a head. The skeptical hypothesis predicts that it will be either a head or a tail.

Results. The result of the flip is a head. That's the data.

Conclusions. Since the data pattern is consistent with the scientific hypothesis, you say, "See. It's a two-headed coin. I predicted the result would be heads and it was." Your friend replies, "Oh come on, how many times have we flipped coins? You know it's got to come out a head or a tail, and it just happened to come out a head once. The chances are 50% that it will be a head by chance alone."

This is a case where the PCH of Chance is really clear. It's easy to believe that it's a fair coin, and just by chance alone it came out heads. The data fit the prediction of the scientific hypothesis by chance alone.

Now remember our little story about the blood pressure pill. In a certain sense, what's the difference? You flipped a coin, you ran two groups. In either case one of two outcomes had to happen.

Or suppose a scientist divided a classroom into right and left sides and measured everybody's blood pressure. Suppose the scientist predicted that the group on the right would have higher blood pressures because conservatives have more Type A personalities. That's pretty loose thinking and silly. It confuses many things. One thing it confuses is the metaphor of political views (right vs. left) with spatial location (right vs. left) in the classroom. But suppose the data came out that the group on the right had higher blood pressures. This is consistent with the scientific hypothesis. I'm suggesting you should feel a bit skeptical about the scientist concluding his or her hypothesis is right. The skeptic replies that the two means will be at least a little different. One will be higher. And it's just a matter of 50-50 chance whether the one that's higher is the one on the right. This is the point of view of the skeptic. One of the two groups had to be higher, and the scientist was lucky it came out the way s/he predicted.

The same with the coin in New Orleans. One of the two sides had to come up, you were just lucky it came up heads which is in line with the two-headed coin scientific hypothesis. It is plausible to argue that the data (a head) happened by chance alone.

The point I'm spending some time making is that the PCH of Chance is not actually as absurd as it might seem when you first hear of it.

Model the Realm of Science with the Realm of Statistics. Now we're going to translate the hypotheses from the realm of science (scientific hypothesis and the skeptical PCH of Chance) into statistical hypotheses. As usual, this will lead to some new jargon.

Null Hypothesis. The PCH of Chance is modeled in the realm of statistics by a statistical hypothesis called the null hypothesis, frequently symbolized by H with a subscripted 0 as in the current graphic. Right now web text does not easily allow me to use subscripts, so I'll just write it as Ho. When speaking, people say "H oh," or "H zero," or "null hypothesis."

If you believe that the research data happened by chance alone, Ho is for you.

Alternative Hypothesis. The scientific hypothesis is modeled in the realm of statistics by a statistical hypothesis called the alternative hypothesis, frequently symbolized by H with a subscripted "1" or a subscripted "a." Due to the difficulty of writing subscripts in web text, I will write it H1. When speaking, people say "H one" or "H a" or "alternative hypothesis."

Ho. In the coin story, the PCH of Chance claims that the probability of a head for a fair coin is .5. So we can write

Ho: P(Head) = .5

H1. In the coin story, the scientific hypothesis is that the coin is two-headed. So the probability of a head is 1. So we can write

H1: P(Head) = 1

Connection from Science to Statistics. The exact form that Ho and H1 take will always depend on the scientific context. Obviously Ho is not always Ho: P(Head) = .5. It takes on that particular form because of the coin story. The same is true for H1. In this section we are taking great pains to make this connection between science and statistics explicit. This will help later when we develop statistics like t, chi-square and F.

Let's refocus on New Orleans and the coin game.

Let's say you respect your friend's skepticism. It is, after all, convincing that a fair coin has a high probability (.5) of generating the data in your previous experiment (which was one flip of a coin).

New study: Two flips. You do a new research project. You observe two flips of the coin. Suppose the results come out to be two heads. (I'll abbreviate "two heads" as "HH.") Now you might say to your friend, "See. Two heads. That's exactly what my scientific hypothesis predicted."

Independent events. Here's where you have to remember back to some ideas we learned in the Basic Probability lecture. We argued that coin flips are independent events and that the joint probability of independent events is simply the product of their probabilities. The probability of a head is one-half on each of the two flips. So, assuming the two flips are independent, the probability of a head and then another head is one-half times one-half, which is one-quarter. That is, P(HH) = (.5)(.5) = .25.
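As a small check on the product rule, brute-force enumeration of the four equally likely two-flip outcomes gives the same answer. A sketch using only the Python standard library:

```python
from fractions import Fraction
from itertools import product

# Product rule: the flips are independent, so joint probabilities multiply.
p_head = Fraction(1, 2)          # Ho: P(Head) = .5 (fair coin)
p_hh = p_head * p_head
print(p_hh)                      # 1/4

# Enumeration check: list all equally likely two-flip sequences.
outcomes = list(product("HT", repeat=2))    # HH, HT, TH, TT
p_hh_enum = Fraction(sum(1 for o in outcomes if o == ("H", "H")),
                     len(outcomes))
print(p_hh_enum)                 # 1/4
```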

Statistical Hypotheses. The Null Hypothesis is "Ho: P(H) = .5" If the null hypothesis is true, then P(HH) = .25. The Alternative Hypothesis is "H1: P(H) = 1." If the alternative hypothesis is true, then P(HH) = 1.

Skeptical reply. Your friend, though, is still going to argue that the results (HH) could be due to chance, saying "You know, a lot of people have flipped coins twice and gotten two heads in a row. That's a pretty likely thing to happen. In fact, if you flip coins a lot, it's hard to avoid getting two heads in a row. It happens about one-quarter of the time."

HHH. Let's say, in response to this criticism, you redesign your study. Now you observe the behavior of three flips of the coin. Suppose the results are HHH. At this point you might say, "Hey that's it. It's a two-headed coin. Let's go confront one of those bouncers and tell him that we're going right to the police unless he tells the manager to get a fair coin on the table." Well, your friend, looking at the size of the bouncers, might reply, "Shhh. If they hear you we might be nursing broken fingers in the morning. You want to risk all that pain just because a coin came up with three heads? That could still happen by chance."

Under the PCH of Chance and the null hypothesis the probability of a head is one-half. This means that P(HHH) is one-half times one-half times one-half, which is one-eighth. That is, P(HHH) = (.5)(.5)(.5) = .125. If H1 is true, then P(HHH) = 1.

If you're bored some time, perhaps standing in line, you can flip a fair coin for a while. Three heads in a row won't happen very often, but it will happen if you flip enough times. It happens in about one of every eight three-flip sequences. The result HHH still can come up by chance occasionally. But, perhaps, chance is starting to feel just a little less probable, a little less plausible. Still, three heads by chance is not so improbable as to be completely implausible.

HHHH. The story's the same. The new study is to observe the behavior of four flips. And the results are four straight heads: HHHH. You look at your friend with that "I told you so" look.

Your friend might say, "Hmm, well, four straight. But you know, I've been bored on airplanes and flipped fair coins a lot, and I can remember four coming up in a row. That can happen by chance alone. It has a 1/16 probability."

What you should sense in this example is that chance is slowly becoming a less and less plausible way to explain the results. Four heads in a row can happen by chance alone, but it happens with a probability of only .0625. That's just a little higher than .06, which is about 6 times in a hundred. That's not very probable. Still, it could happen.

HHHHH. Let's carry this logic on a few more times. Let's say the next study is to observe the behavior of 5 flips of the coin, and let's say the results are HHHHH. We got five straight heads in five flips. You would like to conclude that this is, indeed, a two-headed coin.

Your friend says, "Hmm, 1/2 multiplied by itself five times is 1/32, or pretty close to .03. I must admit that HHHHH happening by chance alone is starting to be a little improbable, a little implausible." You remind him that you want to go to the manager to complain. He looks at his fingers and says, "Well, really, it's not likely, but 5 heads in a row could happen with a fair coin."

When is chance implausible? Everyone has slightly different criteria for when they think it is implausible to argue that chance alone is producing the data. Some people start to feel it is implausible at HHHH, others at HHHHH. Others require even more data to decide.

HHHHHH. You observe 6 flips of the coin and it gives you six heads: HHHHHH. As you can see by the graphic, the chance of 6 heads in 6 flips of a fair coin is 1/64 or .0156. That's just a little more than one in a hundred. This is so improbable that it is becoming, for most people, implausible. They will start looking for some other way to explain six straight heads. You, the scientist, of course, have such an explanation--it's a two-headed coin. A two-headed coin must yield HHHHHH as data.

HHHHHHH. We'll do this one last time. Suppose seven flips give 7 heads. If Ho is true then P(HHHHHHH)=.0078. That's about 7 or 8 times in a thousand. To most people this is so improbable as to make the Plausible Competing Hypothesis of Chance implausible.
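The whole progression we just walked through can be tabulated in a few lines. Under Ho the probability of n heads in n flips is .5 raised to the power n; under H1 it is always 1. A minimal sketch:

```python
# Probability of a run of n heads in n flips under each hypothesis.
# Ho: P(Head) = .5 (fair coin)    H1: P(Head) = 1 (two-headed coin)
for n in range(1, 8):
    p_h0 = 0.5 ** n   # chance alone
    p_h1 = 1.0 ** n   # two-headed coin: heads are guaranteed
    print(f"{n} heads in {n} flips: P under Ho = {p_h0:.4f}, "
          f"P under H1 = {p_h1:.0f}")
```

The last line of output reproduces the number in the text: 7 heads in 7 flips has probability .0078 under Ho, but probability 1 under H1.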

Reject Ho. Ho corresponds to the PCH of Chance. At some point (4 heads? 5 heads? 6 heads? 7 heads? 8 heads? ...) the data are such that if Ho is true the data are extremely improbable. Rather than think the data are improbable we reject Ho. We say Ho: P(Head) = .5 can't be true.

Community standard. It is notable that in this example, most people feel that Ho gets implausible somewhere around 4 straight heads or 5 straight heads. P(HHHH) = 1/16 = .06+. P(HHHHH) = 1/32 = .03+. As we will see later, many in the scientific community have decided that the convention for rejecting Ho should be at 1/20 = .05. If assuming Ho is true makes the data less probable than .05, then we generally reject Ho. What is interesting is that .05 is right between .06 and .03, the points in the coin example where many people feel that the plausibility of chance starts to be shaky.
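Under the .05 convention, a few lines of arithmetic locate the cutoff in the coin example. This sketch simply finds the shortest run of straight heads whose probability under Ho drops below .05:

```python
alpha = 0.05  # the conventional cutoff for rejecting Ho

n = 1
while 0.5 ** n >= alpha:  # P(n heads in n flips | Ho)
    n += 1

# The first run length below the .05 cutoff:
print(n, 0.5 ** n)  # 5 flips: 1/32 = 0.03125
```

So with the .05 convention, four straight heads (.0625) is not quite enough to reject Ho, but five straight heads (.03125) is, which matches where many people's intuition about the coin starts to shift.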

Well, that is getting ahead of ourselves. If the last paragraph is a bit obscure for you, be sure we will gain considerable experience with these ideas in the future.

Rejecting Ho leads to discarding the PCH of Chance. In the realm of statistics, based upon a probability argument, we reject Ho. Translating back into the realm of science, that means we think that the plausible competing hypothesis of chance is no longer plausible. Psychologically and intellectually, there is some point for each person when she or he believes that chance alone cannot account for the results. We have eliminated one PCH.

George Polya (1968), in his discussion of the inductive method, has shown how each time we eliminate a plausible competing hypothesis, we make our own scientific hypothesis more plausible. For example, we have argued that if 7 heads come up in 7 flips, it is no longer plausible that the coin is generating heads by chance alone. That makes the scientific hypothesis of a two-headed coin more convincing.

This is the whole essence of the argument in a very simple case. All the major ideas of Hypothesis Testing have been introduced. The probability theory is going to get more complicated unfortunately, but we have laid out the main ideas here to serve as an overview and road map for you.

There are a couple more ideas worth mentioning at this point. The next one we will discuss is the "alpha level" or the "significance level."

Can rejecting Ho be an error? One limitation on the logic we've just summarized is that we might be wrong when we reject Ho. The data always could happen by chance alone. You could flip a coin 100 times, and it's conceptually possible to get 100 heads, incredibly unlikely, but conceptually possible.

Let's say we reject Ho after 7 heads in 7 flips. The P(HHHHHHH) = .0078... Let's round .0078 to .007 (or about seven in a thousand) just to make discussion more streamlined. If Ho is true, there is only seven in a thousand chances of getting the data (7 heads). So we reject Ho. But conceptually it is possible that a fair coin generated 7 heads.

So when we reject Ho, the probability we are making an error in doing so is the probability that a fair coin would generate the data that convinced us to reject Ho. For example, if we reject Ho because we got 7 heads in 7 flips, we have a .007 probability of being wrong.

Alpha Level. "Alpha" or the "alpha level" is the probability you are wrong when you reject Ho. As you can see from the graphic, if Ho is true, the probability of 4 heads is about .06, the probability of 5 heads is about .03, probability of 6 heads is about .015, probability of 7 heads is about .007.

The probability that you are wrong when you reject Ho depends on when you reject it.

If you reject Ho after 5 heads in 5 flips, the probability you are wrong is about .03 (which is the probability that that string of heads could happen by chance alone.) If you reject Ho after 6 heads in 6 flips, your alpha level (probability you incorrectly reject Ho) is about .015. If you reject Ho after 7 heads in 7 flips, the alpha level is .007.

Significance level and p value. While they have slightly different connotations, the terms "significance level" and "p value" are essentially synonymous with alpha level. They deal with the idea that, while it is compelling to reject Ho under certain circumstances, there is always a small probability that we are wrong in doing so.

Statistical Conclusion Validity. Statistical Conclusion Validity is a formal term from the realm of science. It refers to the validity with which we can make the argument that the results of our research were not due to chance alone.

Statistical Conclusion Validity encompasses the whole argument that we've just summarized in this lecture to this point.

Back to Science. The final graphic in this section summarizes visually what we have discussed up to this point. It also connects us back from statistics to science. Rejecting Ho on the basis of probability in the Realm of Statistics is used in the Realm of Science as an argument that the PCH of Chance is no longer plausible.

What about H1? The way the data turned out, H1 remains plausible. It predicted all heads, and that's exactly what the data turned out to be.

What about the scientific hypothesis? Well the scientific hypothesis is in a little stronger position. Chance, the ever-present rival hypothesis, has been discarded. The scientific hypothesis is a bit more plausible. But we should make it clear that the scientific hypothesis has not been proven. It will never be proven by these methods.

For one thing, as we mentioned, however improbable it might be, the data could have been due to chance; there is always a small probability that we were wrong in discarding chance. That alone means the scientific hypothesis is not proven.

Let's look at some other reasons why the scientific hypothesis is not proven by this method.

Further Limitations. Suppose we reject Ho and discard the PCH of Chance. Suppose your friend in New Orleans is finally convinced. You've just done a study where you observed 7 heads in 7 flips. Your friend agrees that the string of heads is so unlikely that s/he's willing to give up the PCH of Chance. S/he's even willing to say that it's not a fair coin. But that does not prove that it's a two-headed coin. There are other hypotheses which could explain the 7 heads in 7 flips.

Magnetism. The coin could very well have a head on one side and a tail on the other, but it might be a fake coin made of steel so that a complex magnetic machine might always make it land with heads facing up. That theory also accounts for 7 straight heads.

Holograms. Maybe the coin is a high-quality hologram which has a head and a tail but is programmed so that it always lands with the head side facing up. That also would account for 7 straight heads.

So Rejecting Ho and eliminating the PCH of Chance certainly gets rid of one annoying criticism of the study. But it does not prove the scientific hypothesis. There may be other, equally good, even better, hypotheses to be considered.

Since you cannot begin a convincing argument with a skeptic until chance has been eliminated as a PCH, discarding the PCH of Chance makes the case for the scientific hypothesis arguably stronger.

Not Nothing. What does Rejecting Ho do? In our coin story, first you noticed that "something is going on" with this coin. Then you turned this sense of "something going on" into a scientific hypothesis--The coin has two heads. Your friend said that "nothing is going on" with the coin. S/he created a skeptical hypothesis--It's a fair coin.

Rejecting Ho basically says It's not nothing. The results are too improbable to allow us to believe that nothing is going on and the data occurred by chance alone.

But rejecting Ho does not tell you which of all the possible somethings is the one that's actually going on. Perhaps what's going on is that it's a two-headed coin. Perhaps it could be a magnetic coin. Perhaps it could be a hologram. What's actually going on could be many things. Rejecting Ho and eliminating chance as a PCH doesn't help you to know which of these many possibilities is happening. Therefore rejecting Ho will not "prove" any particular theory.

Rejecting Ho always has this double negative feel. We are negating the idea that nothing is happening. If you keep in mind this double negative logic, it will help you understand how we use Ho in inferential statistics later on.

Back to Topic Locator Map

We've studied this before. I want to remind you that we've already developed the building block ideas for this current section when we studied the "Catching Cold" example in the Binomial Distribution lecture. We will now build on what we previously learned. You also had a chance to practice the foundation for the current discussion in the Binomial Homework.

Catching Cold. Let's say that baseline data show that the probability of catching a cold in Salt Lake City in January and February is known to be .5. That is, P(Cold) = .5. (In this example we're keeping the probability at .5 so that it's like the familiar flip of a coin.) Suppose also that some researchers develop a cold vaccine and therefore want to evaluate for themselves and other people whether their vaccine works.

Research design. They run a study with 10 volunteers to test whether the vaccine is effective or not. They administer the vaccine to the 10 volunteers and then determine if each volunteer gets a cold or not during January and February. If the vaccine works, the chances of catching cold among the volunteers should be less than .5.

Scientific Hypothesis. The scientific hypothesis is that the vaccine will improve health (reduce the chances of a cold).

Skeptical Hypothesis. The skeptic will propose that the vaccine is going to have no effect on the chances of getting a cold. Therefore the 10 volunteers have the same probability of catching a cold as unvaccinated people (i.e., P(Cold) = .5). According to the skeptic, the vaccine is completely worthless.

If you listen carefully to the debate at the beginning of any flu season, you'll hear both of these attitudes about flu vaccines. Some scientists believe in them and recommend them very strongly. Other scientists are skeptical of flu vaccines and think they're not worth very much.

PCH of Chance. Suppose the research results show very few colds among the 10 volunteers. Such results would favor the scientific hypothesis. The skeptic would immediately start inventing plausible competing hypotheses to explain these favorable results. Since the study is poorly designed (it doesn't even have a control group) the skeptic could invent many PCH's. But in statistics we are primarily concerned with one PCH--the PCH of Chance. The skeptic will say that you just got lucky. There were relatively few colds among your 10 vaccinated volunteers by chance alone. After all, among many different groups of 10 unvaccinated people, the number of colds would vary greatly. You were just lucky in sampling a group who caught few colds by chance alone.

IV and DV. The independent variable (IV) is the vaccine, and the dependent variable (DV) is health. The researchers will measure the DV by categorizing whether or not each volunteer gets a cold. More formally, they will invent X (called an indicator variable). For each volunteer, X = 0 if that person does not catch a cold, and X = 1 if that person catches one or more colds. These are very simple and straightforward measurement operations.

The scientists expect the IV (vaccine) will affect the DV (the occurrence of a cold). The skeptic thinks the IV will have no effect on the DV.

BLUE DETOUR: We will now take a brief detour into the issue of how to tell directional from non-directional scientific hypotheses. I will make this text blue so you know it is an aside. Below you will find more blue text where the ideas we learn here will be used. The two sets of blue text make a full idea.

Is the scientific hypothesis directional or non-directional? This is a new distinction we've not introduced before. Whether the scientific hypothesis is directional or not impacts how we will eventually write the alternative hypothesis, H1. Let's use the vaccine example to illustrate the difference between a directional and non-directional scientific hypothesis.

Directional hypothesis. Scientific hypothesis: The vaccine will reduce the chances of a cold. Whenever the scientific hypothesis predicts that the IV will cause the DV to change in a certain direction (either increase or decrease), then we say it is directional. As we've stated it, the scientific hypothesis is directional because it predicts that the number of colds will be reduced.

For contrast, let's change our example so that the scientific hypothesis is non-directional.

Exploratory Research. Sometimes in the early stages of vaccine research scientists don't know a great deal about the organism that causes a disease; they may not even be sure which of many organisms actually causes the disease. They also don't know whether the vaccine should be based on a dead virus or on a weakened, live virus. And if it is based on a live virus, how much should it be weakened to trigger an immune response without transmitting the disease? Under these circumstances the scientists may have to do exploratory research. They may administer a weakened live virus. They hope, of course, that it reduces the disease, but, if it is not weakened enough, it might increase the incidence of the disease. Either way, they learn something. If the disease increases after the vaccine, then they at least know that they have the right organism and that they need to weaken it more or administer it dead. If it reduces the disease, then it is a matter of tweaking the dose.

The important point is that either an increase or a decrease gives important information. So they are simply looking for a vaccine that affects (either way) the disease. This is not unlike the psychotherapeutic strategy of prescribing the symptom. If you can teach people to make their symptoms worse, then you and they know that you've found the variables that control their symptoms. Then you and they can work with these variables to lessen the symptoms.

Sometimes it's important simply to have an effect, without worrying about direction of effect. Once you know how to get an effect, you can worry about making it go the direction you want.

Non-directional hypothesis. Scientific Hypothesis: The vaccine will affect the chance of a cold. If the researchers are in early, exploratory stages of investigation, they might have a non-directional scientific hypothesis. They think they've isolated the right virus, so they think the vaccine will have an effect. But they don't know which direction this effect will be. They don't know if it will reduce colds or increase colds.

That's the issue of directional versus non-directional scientific hypotheses. Let's go back to our story where the research is more advanced, and the researchers are predicting that the vaccine will reduce colds.

Modeling Science with Statistics. The scientific hypothesis is that chances of a cold are reduced in people who are vaccinated. The skeptic replies that the vaccine has no effect on colds. If the data were to favor the scientific hypothesis, then the skeptic would reply with the PCH of Chance. Let's cast these ideas into statistical terms. Remember that the baseline data is that P(Cold) = .5. That is, (hypothetical) past research shows that the probability of catching a cold during January and February in Salt Lake City is .5.

H1. H1 corresponds to the scientific hypothesis. We can rewrite the prediction that the chances of a cold should be reduced as "the probability of a cold is less than .5." To be more succinct, we could write "P(Cold) < .5." The "<" means "less than." Conversely the ">" sign means "greater than." So the alternative hypothesis is:

H1: P(Cold) < .5

Ho. Ho corresponds to the PCH of Chance which is based on the skeptical hypothesis that the vaccine has no effect on colds. We can rewrite this as "the probability of a cold equals .5." Or more succinctly as "P(Cold) = .5." So the null hypothesis is:

Ho: P(Cold) = .5

So now we have translated the two verbal, qualitative hypotheses from the scientific conversation into two symbolic, mathematical statistical hypotheses. Let's go on to the next step.

Student Question. Shouldn't Ho be written in this case as "the probability of a cold is greater than or equal to .5?"

Reply. Good question. Technically yes. While we're just going to ignore that issue at this level of statistics I'll go ahead and address it. [Ignoring both the question and my reply will not disadvantage you in any way in future learning, so you may read the reply or skip it.] The issue is that the two hypotheses should together cover all possibilities (i.e., all probabilities of a cold from 0 to 1). Then if you eliminate one of them, you can by converse logic choose the other. Clearly, H1 must cover the range of probabilities of colds less than the baseline .5. So H1: P(Cold) < .5 is correctly written. That leaves Ho to cover all the other possibilities (i.e., the probabilities from .5 to 1). That is, Ho covers "no effect" up through unpredicted increases in colds. But the actual test of Ho which we are about to develop tests specifically the point where there is no effect, that is, where P(Cold) equals the baseline probability of .5. That is, where P(Cold) = .5. So reducing the null hypothesis to the simple form Ho: P(Cold) = .5 loses us nothing in developing our test. And I have found over the years that including the more technical logic (such as this paragraph) has caused confusion without spreading much light. So I've simplified the null hypothesis a bit.

Test Statistic. We now have two statistical hypotheses, Ho and H1. How can we decide between them? We're going to invent this idea of a Test Statistic. We will use the test statistic to decide between Ho and H1.

A couple of obvious questions are: what is this test statistic, and how will it work?

DV. In our example, what were the measurement operations? The researchers simply measured the occurrence or non-occurrence of a cold in each person. The dependent variable was X, an indicator variable. X was 0 if a person caught no cold. X was 1 if a person caught at least one cold. Such a DV will generate a series of 1's and 0's as our data. Since we have 10 volunteers, we will have 10 data points. Each data point will be a 0 or a 1.

Test Statistic. Let's invent a very simple test statistic (call it TS). TS will add up all the scores (either 0's or 1's). As you can see on the graphic, TS = Sum of X. TS is simple; it is not even as complicated as the mean. If you think about it, TS = Sum of X will tell us the number of people who caught cold in our sample of 10 volunteers. All those who caught no cold get a 0. All those who caught at least one cold get a 1. Adding up all the 1's gives us the number of people who caught colds.
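Because TS is just a sum, it is a one-line computation. Here is a minimal Python sketch (the particular 0's and 1's are made up for illustration; they are not data from the lecture):

```python
# Hypothetical data for 10 volunteers: X = 1 means "caught at least one cold",
# X = 0 means "caught no cold". These values are invented for illustration.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

TS = sum(sample)  # the test statistic: the number of colds in the sample
print(TS)  # -> 3
```

Adding up the 1's counts the people who caught colds, exactly as described above.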

Distinction: Test Statistic versus DV. The DV is the measurement operations which generate the data. In this case each person will generate either a 0 or a 1, depending on whether or not s/he gets a cold. X will give us 10 1's and 0's, one for each person. In contrast, TS is a statistic calculated on the data generated by X. In this respect TS is no different than the mean or variance, just simpler.

Later we will get to complicated test statistics such as t, chi-square, and F. For now we want a simple case to develop the idea of what test statistics do.

Range of TS. Everyone who didn't get a cold was a 0, everyone who did get a cold was a 1, and so when we add all those up, we get the number of colds. The researchers took a sample of 10 from the Salt Lake City population and vaccinated them. If there are 10 people in the sample then our test statistic could vary from 0 to 10. That is, the group could catch between 0 and 10 colds. It could be that none of them caught a cold, one of them caught a cold, two of them, and so on all the way up to all ten of them caught a cold. So the range of TS, which is shown visually on the current graphic, is 0 to 10.

Horizontal Axis. In the graphic, the horizontal axis gives the number of colds (values of TS) across its entire range, starting from 0 and going to 10.

Divide the Range of TS. We're going to divide the range of the test statistic up into two regions:

Reject Ho region --------- Do not reject Ho region.

Critical Value. The line separating the Reject Ho region from the Do not reject Ho region will be called the critical value. That line is red on the current graphic.

Rejecting Ho. Recall the New Orleans coin example. Remember that for each person there came a point at which it no longer seemed plausible to believe that the data were happening by chance alone. For some people that point came after 4 straight heads, for others after 5 straight heads, for others after 6 heads, and so on. The critical value is that point at which the data lead a person to reject Ho.

Where to put the critical value? Eventually (in the next section) we will make up specific criteria based on probability for deciding exactly where to put the critical value. But for now, let's just put the critical value in several places and think about the common sense implications of its placement. We'll examine putting the critical value between 2 and 3, between 0 and 1, between 4 and 5, and between 1 and 2. We'll see if any or all of those placements of the critical value make common sense.

If Ho is true. If Ho is true and P(Cold) = .5, then it makes sense that we should get around 5 colds out of 10 volunteers. So if TS = 5 (that is, you get 5 colds in your sample of 10 people), you wouldn't want to reject Ho because 5 colds is very likely if Ho is true. Common sense would dictate that you would want to reject Ho for some number of colds below 5.

Between 2 and 3. The critical value could be put between 2 and 3 colds as it is in the current graphic. That means we would decide to Reject Ho if our sample yielded either 0 or 1 or 2 colds. If the sample yielded 3 or more colds we would decide to Not reject Ho. Think about it. Ho would naturally lead to a number of colds around 5. The scientists predict fewer colds. Is 2 colds sufficiently few that it would lead you to reject Ho?

Between 2 and 3 is just one place to put the critical value. Arguments could be made for putting the critical value in many different places.

Back to science. Recall that when we reject Ho in the realm of statistics we are saying that the PCH of Chance is implausible in the realm of science. This means that the scientists have eliminated one of the skeptics' PCH's. Moreover, the data (2 or less colds) certainly fits both the alternative hypothesis (H1: P(Cold) < .5) and the scientific hypothesis that the vaccine reduces colds.

Between 4 and 5. People could make an argument that we should put the critical value between 4 and 5 because you'd expect around 5 colds out of 10 people if Ho is true. That means you would reject Ho for 4 or 3 or 2 or 1 or 0 colds. They could argue that 4 colds or lower indicate a reduction in colds. The trouble with this particular argument is that not everyone would believe you when you rejected Ho because by chance alone it would be very easy to get 4 colds out of 10 people even if P(Cold) = .5 were true. Still, there is a certain logic to saying anything below 5 is a reduction. Think about it and sense how you feel about it.

Between 0 and 1. Set the critical value between 0 and 1. Now you reject Ho only if there are no colds in the sample of 10 people. That's very stringent, but it's a logical place to put the critical value. If we vaccinate 10 people and none of them catch a cold, we can argue with conviction that that's not likely to happen by chance alone. So we can reject Ho and therefore argue that the PCH of Chance is no longer plausible. The problem with this placement of the critical value is that even if the vaccine was effective at reducing the number of colds it might not completely eliminate them. So this stringent critical value might miss an effective vaccine.

Question. A student says he feels a little lost. He wants to know how to determine where to put the critical value, because I'm just putting it along the range of TS at all these different places.

Reply. That's exactly right, that's the impression I wanted. The critical value could be about anywhere. What are the criteria people use to decide where to put it? That's a good sense of curiosity. To answer that we will need to use our knowledge of probability, specifically our knowledge of the Binomial Distribution. I like you wondering about this because then when we go through how to use the Binomial to establish criteria for setting the critical value, you'll have some motivation for learning about it. The simple answer to the question for the moment is that we need probability theory and we're building toward that.

But... for now, the point I'm making is there's a lot of places (as he's pointed out) where we can put the critical value. Let me put it in one more place.

Between 1 and 2. The critical value could be put between 1 and 2. That's a perfectly sensible place. If we get only 1 or 0 colds in our sample we would reject Ho. And, by converse logic, the alternative hypothesis that P(Cold) < .5 remains probable. Then mapping back into the realm of science we could say that the PCH of chance is implausible and the scientific hypothesis (that the vaccine reduces the number of colds) remains plausible.
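As a preview of the probability-based criteria to come, we can already compute how likely each placement is to reject Ho when Ho is actually true. This Python sketch (my illustration, not part of the lecture) uses the Binomial probabilities for the four cutoffs we just considered:

```python
from math import comb

def p_ts_at_most(c, n=10, p=0.5):
    """P(TS <= c) when Ho is true, i.e. TS ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# The four placements discussed above: "reject Ho" whenever TS <= cutoff.
for cutoff in (0, 1, 2, 4):
    print(f"reject if TS <= {cutoff}: P(reject | Ho) = {p_ts_at_most(cutoff):.4f}")
```

The cutoff between 4 and 5 rejects Ho about 38% of the time even when the vaccine is worthless, which is why "not everyone would believe you"; the cutoff between 0 and 1 rejects only about 0.1% of the time, which is why it is so stringent.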

Summary. We've established the idea of a TS and how it differs from the data generated by the DV. (The TS is a statistic calculated on the data.) Then we introduced how we can divide the range of the TS into regions where we would or would not reject Ho. The Reject Ho and Do not reject Ho regions are separated by a critical value. Then we found that we can argue for setting the critical value in many different places.

What we did not do is learn how to use the Binomial Distribution to find the sampling distribution of the Test Statistic. If we know the sampling distribution of the TS, we can establish specific probability-based criteria for setting the critical value. I have found that it is worthwhile to introduce the basic logic and vocabulary of rejection regions before bringing in probability theory, which can be a difficult step for many people.

Back to Topic Locator Map

Abduction: Nature to Science to Probability. Suppose one of the volunteers in the vaccine study is walking in a crowd, at some risk of contracting a cold virus. Everyone is breathing the same air. Some people are coughing and sneezing. Our measurement operations reduce that volunteer to a DV symbolized by X. X is either a 0 or a 1, depending on whether or not the person caught a cold. Next we model X as a Bernoulli trial.

Bernoulli Trial. Assuming Ho is true (and we always assume Ho is true in this part of the process) the volunteer's probability of a cold is .5. In other words, we assume that the vaccine is ineffective and so all data is generated by chance alone. So there's a 50-50 chance that the volunteer will receive a score of 0 and the same chance s/he will receive a 1. Only two things can happen, either s/he gets a cold or s/he doesn't. In the past we've learned that these kinds of two-outcome situations can be modeled by the Bernoulli Distribution.

What is the Probability Distribution of X? The probability distribution of the DV (in this case, X) is modeled by a Bernoulli Trial. The probability distribution of X is shown on the right in the graphic (above). You can see that on the graph both 0 and 1 have the same probability (height of the black bars).

Population. Recall that the probability distribution of the DV is often called a population. In the current example this is a Bernoulli Trial. In the realm of science, when the scientists measure one volunteer it is equivalent in the statistical model to drawing a single score (X) from the population.

Sampling. When the scientists running the study find and measure a sample of 10 volunteers, the statistical model summarizes all that work as simply sampling 10 people from a population. In this case we can see on the above graphic that we will randomly draw 10 people from the population. This gives us a sample of 10 scores. For illustration, I have made the score of the first person (X1) equal to 1, the score of the second person (X2) equal to 0, and the score of the last (or nth) person equal to 1. So our sample data will consist of a bunch of 0's and 1's.

Calculate TS. Once the scientists have their sample data, they calculate the TS = Sum of X, which gives them the number of colds in their sample. So the TS might have any value between 0 and 10.

Sampling Distribution of TS. Each DV score (X) is a Bernoulli Trial. There are 10 scores. TS gives us the number of colds out of 10. If we define a cold as a success, then the number of colds in the sample is the number of successes in 10 Bernoulli Trials. Your past experience with the Binomial Distribution should let you recognize that this is a Binomial Sampling Distribution.
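The whole Binomial sampling distribution of TS can be tabulated directly from the binomial formula. Here is a short Python sketch (my illustration, not from the lecture):

```python
from math import comb

n, p = 10, 0.5  # 10 volunteers; under Ho, P(Cold) = .5

# Sampling distribution of TS under Ho: Binomial(10, .5).
# pmf[k] is the probability that the sample contains exactly k colds.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

print(f"P(TS = 3) = {pmf[3]:.4f}")        # 120/1024, about .1172
assert abs(sum(pmf.values()) - 1) < 1e-12  # the probabilities sum to 1
```

Printing the whole `pmf` dictionary shows the familiar binomial shape: a bump of high probability around 5 colds, tapering toward 0 and 10.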

Two distinctions are important to keep in mind.

DV versus TS. The DV (here, X) generates the individual data points. The TS is a statistic calculated on those data points.

Population versus Sampling Distribution. The population gives the probability that a single score takes on one of its values. The Sampling Distribution gives the probability that the TS will take on one of the values in its range. If we want to know the probability TS = 3 (i.e., there are 3 colds in the sample) we find that out from the Sampling Distribution.

That's all by way of review from the Binomial lecture. You have already seen the current graphic (above) showing the 4 steps to a sampling distribution when we studied sampling distributions. It's good to be familiar with it and get a good concept of the overall 4-step pattern because it's a very useful visual representation of a complex set of variables and mathematical relationships.


Three scientific cases. Let's use the vaccine example to develop three distinct cases.

Case 1: This is the case we've focused on in the above examples. The scientists think that the vaccine is well-developed and effective. The scientific hypothesis is "The vaccine (IV) will reduce the chances of a cold (DV)." The scientific hypothesis is directional--it predicts fewer colds after vaccination.

Case 2: We mentioned this case briefly, above. The scientists don't think the vaccine is well-developed. They are pretty sure that they have the correct virus isolated, but they're not sure that it has been weakened just the right amount. It might reduce colds or it might increase colds. The scientific hypothesis is "The vaccine (IV) will affect the chances of a cold (DV)." The scientific hypothesis is non-directional--it predicts either an increase or decrease in colds after vaccination.

Case 3: We have not talked about this case yet. It is an early stage of vaccine research. The whole research team is pretty sure they have isolated the correct virus. But there is a strong difference of opinion on how much to weaken the virus. A group of dissenters believes that the virus has not been weakened nearly enough. The dissenters believe the vaccine will surely increase the number of colds. The dissenters' scientific hypothesis is "The vaccine (IV) will increase the chances of a cold (DV)." This scientific hypothesis is also directional, but in the other direction--it predicts more colds after vaccination.

Keep these three cases in mind for a minute.

Tails of Distributions. When you picture in your mind the graph of a probability distribution (e.g., normal or binomial) you'll notice that it generally has a large bump of high probability in the middle and then tapers off in both directions until the probabilities are very low. In the current graphic (above) you can see this general shape in the case of the binomial sampling distribution--high probability in the center stepping down on both sides until the probability is negligible. The tails of a distribution are where the probabilities taper off toward zero (on both sides). Each distribution typically has two tails, an upper tail and a lower tail (see graphic above).

Let's now put the three scientific cases together with the idea of tails and integrate all that into our idea of rejecting Ho.

One-tailed rejection region: LOWER. In Case 1, the scientists are expecting a reduction in the number of colds. So in our statistical model, it makes sense to reject Ho when there are very few colds in our sample. In the graphic above this case is shown on the far left. You can see that there is only one rejection region and it is placed in the lower tail of the distribution. This is called a one-tailed rejection region or a one-tailed test of Ho. In this case we would write a one-tailed (lower) alternative hypothesis:

H1: P(Cold) < .5

Two-tailed rejection regions. In Case 2, the scientists are expecting that the vaccine will affect colds, but they are unable to predict in which direction. It is a non-directional scientific hypothesis. Therefore in our statistical model, it makes sense to reject Ho if there are either very many colds or very few colds. You can see this case in the center of the current graphic above. As the graphic shows, there is a reject Ho region in both tails of the distribution. This is called a two-tailed rejection region or a two-tailed test of Ho. In this case we would write a two-tailed alternative hypothesis:

H1: P(Cold) not equal to .5

One-tailed rejection region: UPPER. In Case 3, the dissenting scientists are expecting that the vaccine will increase the number of colds. So in our statistical model, it makes sense for the dissenters to reject Ho when there are very many colds in our sample. In the graphic above this case is shown on the far right. You can see that there is only one rejection region and it is placed in the upper tail of the distribution. This is called a one-tailed rejection region or a one-tailed test of Ho. In this case we would write a one-tailed (upper) alternative hypothesis:

H1: P(Cold) > .5
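The three cases can also be checked numerically. The following Python sketch (not part of the lecture's Binomial Tool; it uses only the standard library) computes the probability of landing in each kind of rejection region when Ho is true, using illustrative cutoffs for the vaccine example: 2 or fewer colds for the lower tail and 8 or more for the upper tail (these particular cutoffs are an assumption for illustration, not the lecture's):

```python
from math import comb

def binom_pmf(r, n=10, p=0.5):
    """Probability of exactly r colds in n volunteers, assuming Ho (p = .5)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Case 1: lower-tailed test, H1: P(Cold) < .5 -- reject Ho for very few colds
lower = sum(binom_pmf(r) for r in range(0, 3))    # r = 0, 1, 2
# Case 3: upper-tailed test, H1: P(Cold) > .5 -- reject Ho for very many colds
upper = sum(binom_pmf(r) for r in range(8, 11))   # r = 8, 9, 10
# Case 2: two-tailed test, H1: P(Cold) != .5 -- reject Ho in both tails
two_tail = lower + upper

print(f"lower tail: {lower:.4f}")       # 0.0547
print(f"upper tail: {upper:.4f}")       # 0.0547 (symmetric when p = .5)
print(f"both tails: {two_tail:.4f}")    # 0.1094
```

Notice that with p = .5 the distribution is symmetric, so the upper and lower tails have equal probability, and a two-tailed region of the same width costs twice as much probability as a one-tailed region.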

The overview shown in the last graphic above, along with the discussion about it, requires the integration of the many new ideas which we've been discussing. It is good to come back to this graphic and discussion at various future points in the class when we are applying these ideas.

For now, what is important is to realize we are creating statistical models of important scientific ideas (scientific hypothesis, skepticism, PCH of Chance). To use the statistical model well, you need to understand how the various aspects of the model (Ho, H1, critical values, rejection regions) relate to the scientific context. We have focused on how the nature of the scientific hypothesis (directional versus non-directional) affects how we write H1 and where we put the rejection region(s).

Question. How does this relate to the New Orleans coin example?

Reply. I made the coin example as simple as possible. I made the scientific hypothesis that the coin is two-headed. So we are predicting not just an increase in the number of heads that would normally occur with a fair coin, but that there would be nothing but heads. That makes it strongly one directional--way more heads than a fair coin would generate. In the coin example Ho: P(Head) = .5, and H1: P(Head) = 1.

We've studied this before. Once again I want to remind you that we've already developed the probability tools for this next section when we studied the "Catching Cold" example in the Binomial Distribution lecture. In fact, we've previously worked through all the technical details of this section, both in the Binomial lecture and in the hands-on homework assigned afterwards. We also worked on foundations for this current material in the Sampling Distribution lecture, particularly the Binomial Sampling Distribution. Consequently, the current material should build naturally on your previous learning without your needing to go back to it. But if the details seem fuzzy or confusing, you might want to review the Binomial lecture and the Sampling Distribution lecture as well as the homework you did with those two lectures.

Back to Topic Locator Map

Review Example. We have n = 10 volunteers who try a new vaccine. If the vaccine is effective the probability of a cold should be less than .5; if it is ineffective, the probability of a cold should be .5. So H1 is that P(Cold) < .5 and Ho is that P(Cold) = .5.

Whenever we construct a test of Ho, we always assume Ho is true. (You can't test it if you don't use it as the model.)

The test statistic (TS) will be the number of colds in the 10 volunteers. The Sampling Distribution of the TS will be the Binomial, with p = .5 and N = 10. This assumes that a "success" is catching a cold and that Ho is true (P(Cold) = .5). In probability jargon, the Sampling Distribution of the TS is Binomial with N = 10 and p = .5.

Review the binomial sampling distribution. As we've argued, the sampling distribution of the TS is the Binomial with N = 10 and p = .5. Along the horizontal axis is the number of colds in the sample, starting from 0 and going to 10. If we call a cold a "success" then the horizontal axis is "r," the number of successes (colds) in 10 trials (volunteers). This assumes Ho is true, P(Cold) = .5.

The current lecture graphic (above) includes a small table on the left side so that we can get explicit probabilities for each number of colds from 0 to 10. Looking at that table you can see that the probability of 0 colds is .0010, that is, 1 in a thousand. In contrast, the probability of 5 colds is .2461, that is, almost 1 in 4. We will use the probabilities in the table to develop the ideas in this lecture. But when you do homework you will have the Binomial Tool to work with, so you won't need the table. You can directly use its output.
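If you don't have the Binomial Tool handy, the table's probabilities can also be reproduced with a few lines of Python (a standard-library sketch, not the lecture's tool):

```python
from math import comb

n, p = 10, 0.5  # assume Ho is true: P(Cold) = .5, with n = 10 volunteers
# Binomial probability of exactly r colds, for each r from 0 to 10
probs = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
for r, prob in enumerate(probs):
    print(f"{r:2d} colds: {prob:.4f}")
# e.g. 0 colds: 0.0010 ... 5 colds: 0.2461
```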

What's the question? Remember the question which we're leading up to. What is the criterion for setting critical values? How do we know where to put the critical values which define our rejection region(s)? A while ago, we kept putting the critical value in different places--between 2 and 3, between 4 and 5, between 1 and 2. We kept moving it around. At one point a student asked how to figure out where to put the critical value. We're now going to use a bit of probability theory and a bit of common sense to answer that question.

A Case 1 Example. We will assume that the scientists believe the vaccine will reduce the chances of a cold. That is, we will construct a one-tailed (lower) rejection region (Case 1) because the scientific hypothesis predicts a small number of colds. Ho: P(Cold) = .5. H1: P(Cold) < .5.

Critical value between 2 and 3. If the critical value is between 2 and 3 then we reject Ho when the number of colds is 0 or 1 or 2 colds. If the data show 3 or more colds, then we do not reject Ho. What we are rejecting when we reject Ho is the idea that there is a 50-50 chance of a cold.

Alpha. Recall that alpha = Probability of (incorrectly) rejecting Ho when it is true. Rejecting Ho when it is true is, of course, a mistake, an error. Alpha is the probability of such an error.

If you think of the logic of it, the only way to calculate alpha is by assuming Ho is true.

What is alpha if critical value is between 2 & 3? Assume Ho is true. The probabilities of getting 0 or 1 or 2 colds by chance alone are .0010, .0098, and .0439, respectively. These are circled in red in the table in the current graphic. Adding these probabilities up, .0010 + .0098 + .0439 = .0547.

This is very much like flipping a fair coin. If you flipped a fair coin (where P(Head) = .5) ten times, what would be the probability of getting either 0 or 1 or 2 heads? It would be the same, .0547.

What have we accomplished? What we've done is discover the probability of making a mistake when we reject Ho when Ho is true. That is, if the vaccine is ineffective and Ho is true, then the probability is .0547 of getting 0 or 1 or 2 colds by chance alone (and therefore rejecting Ho and consequently thinking incorrectly that the vaccine is effective). .0547 is a little larger than 1 chance in 20.

Why is this important to science? Consider what would happen if the vaccine were worthless, but by chance alone so few colds occurred (say, only 1 cold) that we rejected Ho. The scientists would then be inclined to think that this did not happen by chance alone (even though it did). Consequently, they would be inclined to think the vaccine had some value. Therefore, they and other research labs might spend a great deal of effort, time and money pursuing research on this vaccine, only to find out later that it is worthless. It is very costly to science to think an IV is effective when it is not.

Consequently, scientists want alpha (the probability of rejecting Ho when it is true) to be very low. By social convention and common sense, scientists generally insist that alpha be smaller than 1 in 20. [Note: If you divide 20 into 1 you will get .05.]

We just found that if we put the critical value between 2 and 3, the probability of mistakenly rejecting Ho when it's true is .0547. Let's compare .0547 with the probability of making this mistake when we put the critical value in some other places.

Critical Value between 1 & 2. Putting the critical value between 1 and 2 means we will reject Ho if we get 0 or 1 cold in our sample. If we get 2 or more colds we do not reject Ho.

Alpha. What is alpha? You can get that directly and dynamically if you are working online with the Binomial Tool. But for this lecture look at the table on the left of the current graphic. The probabilities for 0 and 1 cold are outlined in red. When we add these probabilities up, we get alpha. Alpha = .0010 + .0098 = .0108. Looking at the Reject Ho region on the graphic we can see that the probability of falling in the Ho rejection region is .0108, or a little more than 1 in 100. The scientific convention is that alpha should be smaller than .05. Certainly .0108 is smaller than .05.

Remember the scientists don't know if Ho is true or not. (If they knew, they wouldn't have to do research.) All the scientists know is the data they got. So if they get 0 colds or if they get 1 cold, they will reject Ho. They hope they are making a correct decision. What they do know is alpha. If Ho is true, the probability of getting 0 or 1 cold by chance alone is about 1 chance in a 100. So if Ho is true, their chances of rejecting it incorrectly are only .0108. This is small. They are willing to take the chance.

Between 0 & 1. Putting the critical value between 0 and 1 means we will reject Ho only if we get 0 colds in our sample. If we get 1 or more colds we do not reject Ho.

Alpha. What is alpha? You can get that directly and dynamically if you are working online with the Binomial Tool. But for this lecture look at the table on the left of the current graphic. The probability of 0 colds is circled in red. Alpha = .0010. Looking at the Reject Ho region on the graphic we can see that the probability of falling in the Ho rejection region by chance alone is .0010, or 1 in 1000.

The scientific convention is that alpha should be smaller than .05. Certainly .001 is smaller than .05.
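To see all three critical-value placements at once, here is a small Python sketch (again an alternative to the Binomial Tool, not part of the lecture) that computes alpha for each placement and checks it against the .05 convention. One caution: the exact alpha for the critical value between 1 and 2 is .0107; the .0108 quoted above comes from summing the table's rounded entries.

```python
from math import comb

def alpha_lower(c, n=10, p=0.5):
    """Alpha for a lower-tailed test that rejects Ho when r <= c colds occur."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(c + 1))

# Critical value between 2 & 3, between 1 & 2, and between 0 & 1
for c in (2, 1, 0):
    a = alpha_lower(c)
    print(f"reject Ho if r <= {c}: alpha = {a:.4f}, smaller than .05? {a < 0.05}")
```

Only the first placement (critical value between 2 and 3, alpha = .0547) fails the convention; the other two placements give alphas comfortably below .05.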

[Before going on with the flow of the lecture, I'm going to answer some student questions which may clarify the ideas we are learning. If you find an overview useful, read on; otherwise skip down to the next graphic.]

Question. Can you go over the whole thing again all at once?

Reply. Let me summarize this whole process--even if it means repeating a lot. These are tricky ideas the first time you run into them, and repetition helps. We have a vaccine. In Salt Lake City baseline data show that there is normally a .5 chance of catching a cold. We think that our vaccine is effective and the probability of catching a cold will be less than .5. The skeptic thinks that the vaccine doesn't work and the probability of a cold is .5. Okay, granting the skeptic that s/he is right, we set up a sampling distribution of the TS (number of colds) assuming that P(Cold) = .5. That is, we assume Ho is true. That means that if the skeptic is right and this vaccine is just salt water with no effect whatsoever, the most likely outcome, of course, is that you'd get about 5 colds. This has a probability of .2461 (see table on graphic) or about 1 in 4. But there's a really good chance you'd get either 4 or 6 colds (.2051 each, or about 1 in 5) and a good chance you'd get 3 or 7 colds (.1172 each, or about 1 in 9). And so forth, as you can see from the table and the shape of the Binomial Sampling Distribution. But if we put the critical value between 2 and 3 and still assume Ho is true, then the probabilities of getting either 0, 1, or 2 colds by chance alone add up to roughly .05--a little more than .05, in fact (.0547). So if the skeptic is right, the probability of getting this few colds (0, 1, or 2) is rather small; it's about .05 (1 in 20).

Question. How does this relate to the New Orleans coin example?

Reply. Let's go back to the New Orleans coin. That should be useful and integrative. Is it a fair coin or not? Let's grant your skeptical friend his or her due. Let's grant that it's a fair coin, i.e., assume Ho is true. What is the probability of getting heads with a fair coin? Well, it's .5 for each flip. Flip a coin once and get a head. It's real plausible to argue that 1 head in 1 flip happened by chance.

The probability of 1 head in 1 flip by chance alone with a fair coin is .5. If you decide that 1 head in 1 flip is enough evidence to reject Ho (i.e., you decide that it is a two-headed coin), then alpha would be .5. That is, the probability of incorrectly rejecting Ho when it is true is .5. That is very high.

Well, flip it twice and get two heads. "Two straight heads!" you say. "This ain't happenin' by chance." You reject Ho. For the critic, it's still pretty plausible to argue that 2 heads in 2 flips happened by chance (1 chance in 4). Your alpha would be .25 if you rejected Ho for 2 heads in 2 flips.

You flip the coin three times and get three heads. "Three straight heads!" you say. Now you're certain it's not happening by chance alone. You reject Ho. For the critic it's still plausible (1 in 8) that 3 heads in 3 flips happened by chance. Alpha would be .125.

Flip it four times. You get 4 heads in 4 flips. You reject Ho. Again, for the critic, it's still kind of plausible that you could get 4 heads in a row with a fair coin--the chances are 1 in 16. Alpha would be .0625.

Flip the coin 5 times and get 5 heads. Reject Ho. Now it's starting to be less plausible to believe that the data are happening by chance alone (1 in 32). Alpha would be .03125. Notice that at this point alpha is less than .05. The convention is that when alpha is less than .05, most scientists will accept the rejection of Ho.

Somewhere between 4 heads in 4 flips and 5 heads in 5 flips (between 1/16 and 1/32) is the psychological point where many people find it compelling to reject Ho. And .05 (the scientific convention) is 1 in 20 which is right between 1/16 and 1/32.
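The alpha values in this coin story are just .5 raised to the number of flips, which a short Python sketch (an illustration, not part of the lecture materials) can confirm:

```python
# Alpha for rejecting Ho ("the coin is fair") after seeing n heads in n flips:
# the probability of n straight heads by chance alone with a fair coin.
for n in range(1, 6):
    alpha = 0.5 ** n
    print(f"{n} heads in {n} flips: alpha = {alpha:.5f}")
# alpha drops .5, .25, .125, .0625, .03125 -- it first falls below
# the .05 convention at 5 heads in 5 flips
```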

Let's go back to the flow of the lecture.

Summary of calculating alpha. To calculate alpha, first assume Ho is true when you construct the sampling distribution of the TS. This is because you want to know the probabilities of getting data in the rejection region by chance alone (i.e., when Ho is true). To put it yet another way, if you assume Ho is true when you construct the sampling distribution of the TS, then that sampling distribution will allow you to find the probability that your TS falls in the rejection region.

Next set the critical value. Alpha will always depend on where along the range of the TS the critical value is placed. You set the critical value at a place where you and your colleagues feel that alpha is acceptably small.

There will be one other important variable which affects alpha but we will talk about it in a later lecture called "Degrees of Freedom."

Statistical Conclusion Validity. Let's make sure we tie up all the jargon. Statistical conclusion validity is the validity with which we can refute the hypothesis that the data pattern is due to chance alone. When we reject Ho in the statistical model we translate that rejection back into science by discarding the PCH of Chance. We are saying the data did not happen by chance alone. It's not nothing. Something is going on.

Alpha as validity. Alpha is the probability we are wrong in rejecting Ho. To use a different example than the one which is on the current screens (above), suppose we put the critical value between 0 and 1, that is, we reject Ho only if we get 0 colds. There is only 1 chance in a 1000 that we would get such a data pattern (i.e., 0 colds) by chance alone.

Chance alone has a very low probability (.001) of generating the data. You can argue that chance no longer is a plausible explanation of the data.

If a critic questions our rejecting of Ho, we can point out that, sure we could be mistaken, but the chances of that are only .001. We will claim that .001 is so low that we have a valid argument for rejecting Ho (and consequently discarding the PCH of Chance). A low alpha is the centerpiece of our logical argument that we are validly discarding chance as a PCH.

Back to Topic Locator Map

This ends the formal lecture on Hypothesis testing. Below is an answer to one question asked by a student.

Question. Can you go over how to calculate alpha again? [If it's useful to you, read the reply. If not, you have finished the lecture.]

Reply. When I'm rejecting Ho, I'm rejecting the idea that the probability of getting a cold is .5 and therefore the data are occurring by chance alone. But what if Ho is actually true, what if the data are only happening by chance? What's the probability of rejecting Ho when it's true? Alpha gives you that probability.

Between 2 & 3. But what is alpha in this particular case? Let's just do the mechanics. I've outlined in red (on the above graphic) the various probabilities that apply to alpha in this case. Those probabilities are .0010, .0098, and .0439. If we add those probabilities up, we see that they add up to .0547. This is how we calculate alpha.

Back to Topic Locator Map

