Sampling Distributions Web Page
©Copyright 1997, 2000 Tom Malloy
This is the text of the
in-class lecture which accompanied the Authorware visual graphics
on this topic. You may print this text out and use it as a textbook.
Or you may read it online. In either case it is coordinated with
the online Authorware graphics.

Sampling Distributions. Now we're going to move on to what's
generally considered the iceberg for the Titanic of statistics classes.
The topic of Sampling Distributions very often is the most difficult
and confusing concept for beginning students to learn. We've worked
hard to create graphical and interactive presentations that will
make this idea clear. We've also created interactive tools which
will give you extensive experience with sampling distributions.
All this experience is carefully designed to make your learning
easier. Still, this is a difficult topic the first time you come
across it.


Abduction. The first thing I want to do is review the abduction
process. I want to emphasize how these concepts connect to science.
There are complex processes in nature which a scientist wishes to
study. The scientist invents measurement operations which we call
in our jargon "dependent variables" or DV's. DV's turn
these natural processes into numbers. That is, scientists like to
measure the world. Once we have our numbers we then model the numbers
in terms of mathematical ideas such as probability distributions.
When we model our DV as a probability distribution we sometimes
call it a Random Variable.

Example of Abduction. Remember the baseball player we studied
in basic probability. She's about to take a swing at the ball. She's
the infinite process we want to study and to do so we're going to
take measurements on her. We're going to measure the number of times
she's been at bat, the number of hits she's made, and we're going
to get the relative frequency of her hits. This relative frequency is
generally called her batting average. Then we can model that batting
average in terms of some probability distribution as is shown on the
graphic. So we have one human woman, and for cultural reasons we decide
to summarize her as a single number (her batting average) which we print in
the newspaper next to her name. She's been reduced to this single
number. But this number is of considerable importance to us in this
culture. Some people become millionaires based on this number. As
of 1999 no women that I know of have become millionaires that way,
but people can become millionaires because of the performance that
leads to that number.

Abduction: Rolling a single die. As a review, the next screen
shows the roll of a single die. We've considered this in detail
before as an abductive process. We pointed out that we could measure
the roll of a die many ways because it is a complex, multi-dimensional
process. We could measure the number of times it hits the ground,
the amount of time it takes to come to rest, the rhythmic pattern
of the sounds it makes as it rolls, and so on. We choose to measure
it by counting the number of dots facing up when it comes to rest.
Then we model those numbers as an equiprobable distribution. So,
again, we take a complex process in nature, reduce it to numbers
via measurement operations (DV's) and finally model the numbers
in terms of probability distributions.


Sampling distribution overview. The visual schema on this
screen will be used over and over to conceptualize an important
process in the inferential statistics (such as t, F and chi square)
which we will be developing. This schema gives the major steps in
our statistical model.
1) Population. We start the process by assuming a population.
By population we mean a process in nature that has been reduced
to numbers by measurement operations and then modeled as a probability
distribution. So this is like the batting averages or the roll of
a die which we just reviewed. The population in statistics is just
a probability distribution. But it is also something which is connected
back to the context of science.
2) Sample. The next thing we do, from the scientific point
of view, is go to the lab or some other place and do a piece of
research. We do an experiment. From a statistical point of view
we think of doing an experiment as taking a random sample from the
population (which is a probability distribution). This statistical
model corresponds to the homework you did with the Sample from Normal
Tool (which is found in Difference to Inference on the StatCenter
main menu). Recall your experience with sampling from the normal
distribution. The homework asked you to define normal distributions
and then take samples. In the statistical model, doing research
is like sampling numbers from a well-defined probability distribution.
An important relationship: The population determines the
probability that a single score in the sample will take on a certain
value or fall in a certain region. Recall your work with the normal
distribution. You were able to find, for example, the probability
that one northern European male had a height between 165 and 175
cm. The important point is that the population relates to single
scores in the sample.
3) Define a Statistic. Statistics are just formulas which
apply to the data in the sample. In steps 1 and 2, we have defined
a probability distribution and sampled a bunch of numbers from it.
In step 3 we choose a statistical formula which we can apply to
those numbers. Some statistics which we are already familiar with
in this class include the mean, the standard deviation, and little
r. Up through step three, we're on familiar ground. We've worked
with probability distributions; we've taken samples; and we've applied
statistics to the numbers in a sample.
4) Find the Sampling Distribution of the Statistic. Step
4 is new territory; it will take some time to develop and understand
these ideas. In short, what we've got to do next is find the probability
distribution (sampling distribution) of the statistical formula.
I don't necessarily expect those words to mean anything much right
now; they are more a guide to what we will be learning.
A sampling distribution lets us find the probability that the sample
statistic takes on a certain value or falls in a certain interval.
So a sampling distribution does for the sample statistic what the
population does for a single score in the sample. We'll work with
that idea a lot so that it takes on some depth.


Binomial Sampling Distribution. Let's start by developing
a sampling distribution based on the binomial probability distribution.

Abduction from nature to science to statistics. Once again,
we're going to start with this abductive process. We will take an
infinite process in nature, a baby, and reduce it to gender, which
is a massive reduction. For data collection and statistical reasons,
we prefer to have numbers. So we will use a zero for a boy and a
one for a girl. Turning a child into a 0 or a 1 is a tremendous
loss of information since there is so much more to know about a
child. But measurement has profound advantages for science which
can make this loss worthwhile. Still, it is important to remember
that measurement is always a massive reduction of some infinite
process.
Next we are going to model our data (0's and 1's) as the Bernoulli
process which is a simple probability model we studied earlier.
A Bernoulli process can have only two outcomes (in our case a girl
or a boy). Each outcome has a known probability of occurring. In
our case, P(Boy) = P(Girl) = 0.5.

Four steps in getting the Sampling Distribution. Now we
are going to go through each step carefully.
1. Assume Population. The first step is to assume a probability
distribution. We have just argued in the previous section that gender
at birth can be reasonably modeled as a Bernoulli probability process.
So we will assume that the population of human births is a Bernoulli
process. When a child is born two things can happen, girl or boy,
and each has a 0.5 probability of occurring.
Recall that in a Bernoulli process one outcome is called a success
and the other a failure. As we mentioned, these terms are not evaluative
in this context. Suppose for some reason we want to know how many
girls are being born. In our research project we want to count the
number of girls. Using the "success" and "failure"
jargon, it makes sense then to call the birth of a girl a "success"
and the birth of a boy a "failure."
2. Construct a Sample. The next step is to take a random
sample from the population of human births. We've had practice using
StatCenter Probability Tools to take samples from populations so
this is something we are familiar with.
Science. In this example, the scientists are engaged in
a research project to count the number of girls being born, say
in Salt Lake County. They examine county birth records, taking a
random sample of 10 births during some period of time. They give
male births a 0 and female births a 1. Their data then will be n
= 10 scores. Each score will be a 0 or a 1. In step 2 of the graphic
you can see that the first birth is a girl (X1 = 1), the second
birth is a boy (X2 = 0), and so on down to the last birth which is
a girl (Xn = 1).
Statistics. From the point of view of the statistical model,
the whole research project is conceived of as simply randomly sampling
n = 10 times from a Bernoulli population. This random sample yields
the same n = 10 scores that the scientists got through all their
hard work (see step 2 of the graphic).
The point is simple. Scientists have to do a lot of work in a research
project to collect data. All this work is summarized in the statistical
model simply as "sampling from a population."
3. Define Statistical Formula. The third step of the process
is to define some statistic. By statistic we mean any of the formulas
you are familiar with such as the Mean or little r.
Or we might have in mind some statistic you have yet to study, such
as t or F or Chi Square. Don't worry about
these for now; we'll get to them later.
In our current example, we are going to make up a very simple statistic.
Since we're counting the number of girls in our sample, then our
statistic will be called G for number of girls. The formula
for our statistic will be very simple. It will just be the sum of
the X's. Each X is either a 1 or a 0, so if you sum all the
0's and 1's, the 0's won't count for anything and the 1's will.
If our ten scores were 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, then the sum of
X would be 6, because there are six 1's. Each 1 indicates a girl,
so G = 6 means there were 6 girls in the sample. X is called an
indicator variable because it indicates a particular event (in this
case a girl).
In our research we want to count the number of girls. The statistic
we have defined, G = sum of X, counts the number of girls in the
sample. So the statistic is a good one in the sense that it accomplishes
our goal.
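As a small illustration, here is how the statistic G could be computed for the ten example scores. This is only a sketch in Python; the variable names are ours and are not part of StatCenter.

```python
# Ten example scores from the text: 1 = girl, 0 = boy (the indicator variable X).
scores = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

# The statistic G is simply the sum of the X's.
G = sum(scores)

print("n =", len(scores))  # sample size
print("G =", G)            # number of girls in the sample
```

Because the 0's add nothing to the sum, G is exactly the count of 1's, which is the count of girls.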
4. Find the Sampling Distribution of the Statistic. The
last step is to figure out what the probability distribution of
the statistic is. Since our research question is leading us to count
the number of girls in our sample of 10 births, we are going to
want to know answers to questions like what is the probability of
exactly 7 girls in 10 births. Or we might want to know what is the
probability of between 3 and 7 girls in 10 births. Notice that the
statistic we defined in step three counted the number of girl births.
If we find the probability distribution of our statistic, we can
answer those kinds of questions.
The probability distribution of a sample statistic is called its
sampling distribution.
Let's start figuring out how to find the sampling distribution
of our statistic.

Argument that the sampling distribution should be the Binomial.
If you think about it, then the sampling distribution of the statistic
should be a binomial distribution. This is because our population
is a Bernoulli Trial. So our random sample of ten births is 10 independent
Bernoulli Trials. Our DV is X, where X = 0 for
boy and X = 1 for girl. So X is a Bernoulli Trial. If
a girl is a success, then p = .5. We have 10 births,
therefore N = 10.
This is all a perfect setup for using the Binomial Distribution,
which gives us the probability of r, the number of successes, in N
Bernoulli Trials for any value of p. Our sample statistic (G = Sum
of X) gives r, the number of girls (or successes). So G
is distributed as the Binomial Distribution.
In short, the sampling distribution of G is the Binomial Distribution.
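To make this concrete, the two questions raised in step 4 can be answered with the standard Binomial formula. The sketch below is ours, not a StatCenter tool; the helper name binomial_pmf is hypothetical, and we assume "between 3 and 7" includes both endpoints.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent Bernoulli trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.5  # 10 births, P(girl) = 0.5

# Probability of exactly 7 girls in 10 births.
p_exactly_7 = binomial_pmf(7, n, p)

# Probability of between 3 and 7 girls (3, 4, 5, 6, 7; endpoints included by assumption).
p_3_to_7 = sum(binomial_pmf(r, n, p) for r in range(3, 8))

print(round(p_exactly_7, 4))  # 0.1172
print(round(p_3_to_7, 4))     # 0.8906
```

These are probabilities about the statistic G, not about any single birth, which is exactly what a sampling distribution is for.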

So now we see the full four steps of the sampling distribution
process. 1) We take some process in nature, measure it, and model
it as a population. In this case the population is a Bernoulli Trial.
2) We take a random sample of size n from the population. This generates
a number of sample data points or scores. We generally give these
scores some symbol, like X. 3) We define a statistic on the sample
data. In this case the sample statistic is G = Sum of X. 4) We discover
via some logical-mathematical argument what the sampling distribution
of the sample statistic is. In this case, the sampling distribution
of G was the Binomial.
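The four steps can also be acted out by simulation. The following is a minimal sketch under our own assumptions (the seed and the number of repeated samples are arbitrary choices): it draws many samples of 10 births, computes G for each, and compares the relative frequency of each value of G with the Binomial probability.

```python
import random
from collections import Counter
from math import comb

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)

n = 10               # births per sample
num_samples = 20000  # number of simulated research projects (arbitrary choice)

counts = Counter()
for _ in range(num_samples):
    # Steps 1 and 2: sample n scores from a Bernoulli population with p = 0.5.
    sample = [random.randint(0, 1) for _ in range(n)]
    # Step 3: apply the statistic G = sum of X.
    G = sum(sample)
    counts[G] += 1

# Step 4: compare empirical relative frequencies with the Binomial probabilities.
for g in range(n + 1):
    empirical = counts[g] / num_samples
    theoretical = comb(n, g) * 0.5**n
    print(g, round(empirical, 3), round(theoretical, 3))
```

The two columns should agree closely, though not exactly, because the empirical frequencies come from a finite number of samples.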
Finding the Sampling Distribution is often difficult. Finding
the sampling distribution of various statistics is, in general,
non-trivial. It is one of the most mathematically sophisticated
parts of statistics. In our simple example, we were able to make
a rigorous argument that the sampling distribution of the statistic
G is the Binomial, even assuming very minimal math as a background.
But finding the sampling distribution for most applied statistics
is very difficult. So now that we have a simple example to form
a basis for our understanding, we will skip the logic of going from
step 3 to step 4, and simply tell you what the sampling distributions
of various statistics are. What you need to understand is this overall,
4-step schema.
The homework will give you lots of practice using and assimilating
this schema.
Population versus Sampling Distribution. One crucial thing
to notice is that the population and the sampling distribution are
different probability distributions. The Population gives you probabilities
that an individual score will take on certain values. The Sampling
Distribution gives you the probability that a statistic (which is
a function of many individual scores) will take on certain values.

We've looked at how the Binomial Distribution can be used as a
Sampling Distribution when the Population is a Bernoulli process.
Let's move on now to examining a case involving the Normal Probability
Distribution.
When the sample statistic is the mean, and the Population is a
Normal Distribution, what is the Sampling Distribution of the Mean?

Spatial Ability Example. Suppose a research group develops
a test of Spatial Ability. They will call the score a person gets
on their test a Spatial Ability Quotient (SAQ). The next screen
shows examples of the kinds of procedures (operations) that might
be included on a test of Spatial Ability.

A typical question on a spatial ability test is shown on the screen.
Which figure below (a or b) is the upper figure rotated by 180 degrees?
The test would have a long string of these kinds of questions. At
the end each person gets a number, depending on how many they got
right. The answer, by the way, to the sample question is "a."

Let's say SAQ scores can be modeled as a normal distribution with
mu equal to 150 and sigma equal to 30. This is another example of
how people might be measured in psychology and how the measurement
operations might be modeled as a normal distribution.

Abduction Again. Once again we have a summary of the process
by which a person (or other process in nature) can come to be modeled
as a probability distribution. In this case the operational definition
of how the person is measured consists of the SAQ test. Next the
SAQ test scores are modeled as a normal probability distribution.
This normal distribution is what we will call our population.
Statistically, when a person takes the SAQ test, it is as if we've
randomly sampled from a normal probability distribution, with
mu equal to 150 and standard deviation equal to 30.
Student Question: How does the researcher know that SAQ
scores can be modeled as N(150,30)? A first answer is that
people who make up standardized tests collect data on very large
samples of people so that they can get good information about this
question. A second answer is that in general in statistics we make
the assumption that our DV's are modeled as normal distributions.
We may not know mu and sigma, but we generally assume the population
is normal. A third answer is that for this example, I just made
up the parameters (mu and sigma) so we would have a clear example
to work with.
Summary. We have a person; from the picture he appears to
be a man. He used to be happy, but then he took this test. He had
to sit there and stare at the picture and decide if it's a or b
which is the top figure rotated 180 degrees. This poor guy has to
answer a whole bunch of similar questions, and if he misses any
he's going to lose points. When he completes the test he gets a
number, a test score. So the scientific measurement operations reduce
the human being to a number. Finally we model these numbers as a
normal probability distribution in which the average score is 150
and the standard deviation is 30.
This is how we come up with the population which we will use in
the next section dealing with the sampling distribution of the mean.


Find the Sampling Distribution of the Mean (SDM). Okay,
in the previous section we have developed the argument for Step
1. That is, we have assumed the probability distribution of spatial
ability quotient (SAQ) is normal with a mean of 150, standard deviation
of 30. In short, SAQ is modeled as N(150,30).
For Step 2, we've sampled 25 people, measured each of them with
our test and got 25 spatial ability quotient scores. In other words,
we've constructed our sample.
For Step 3, we find the mean of the sample which is simply the
sum of all the scores over n.
Now comes the tricky part. The new concept that we're working on
is finding the sampling distribution. In this case we have to find
the sampling distribution of the mean (SDM). The SDM is a probability
distribution which gives the probability that a sample mean will
take on a certain value. In contrast, the original population gives
the probability that an individual score will take on a certain
value.
Step 4. Finding the SDM. If we want to find the SDM, there's
a series of things we need to know.
The first thing we need to know is that it's been proven mathematically
that if the population is normal then the sampling distribution
of the mean must be normal. This is simple and elegant and important.
If we assume that the population is normal then the SDM is normal.
That solves most of our problems in finding the SDM. We know it
is a normal probability distribution. Now all we have to do is find
its mu and sigma. (Recall that a normal distribution is completely
specified by its mean and standard deviation.)
So the sampling distribution of the mean is normal if the original
population is normal. Next, let's find its sigma (its standard deviation).

Let's introduce some new jargon and notation. We're going to call
the standard deviation of the SDM the Standard Error of the Mean
or, for short, SEM.
In math notation we will symbolize the SEM as "sigma sub M."
Look on the graphic and you'll see a sigma with an M as a subscript.
That's our symbol for the standard deviation of the sampling distribution
of the mean.
In contrast the standard deviation of the population will be notated
simply as sigma (like it is for any normal distribution).
Okay so how do we calculate the SEM?

There's an extremely simple formula for finding the standard error
of the mean. SEM, or sigma sub m, is equal to the population standard
deviation, (plain old sigma) divided by the square root of n, where
n is sample size. So in other words, the standard error of the mean
is the population standard deviation divided by the square root
of the sample size.
In terms of the spatial ability example we've been developing,
the population of individual SAQ scores was a normal distribution
with a mean of 150 and standard deviation of 30. That is, the population
is N(150, 30). We gave the test to 25 people, so our sample
size is 25. Therefore, we have 30 over the square root of 25, which
is 6. So the standard error of the mean is 6 in the example we're
working with.
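The same arithmetic can be written as a short function. This is only an illustrative sketch; the function name standard_error_of_mean is ours.

```python
from math import sqrt

def standard_error_of_mean(sigma, n):
    """SEM = population standard deviation divided by the square root of n."""
    return sigma / sqrt(n)

# SAQ example: population sigma = 30, sample size n = 25.
sem = standard_error_of_mean(30, 25)
print(sem)  # 6.0
```

Notice that n sits under a square root, so quadrupling the sample size only cuts the SEM in half.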

It turns out that it's even easier to find the mean of the sampling
distribution of the mean because the mean of the SDM happens to
be equal to the mean of the original population. So we don't even have
to do a calculation.
I know this sounds like word salad the first time you hear it,
but the mean of the sampling distribution of the mean is the same
as the mean of the original population. Consequently we don't even
have a separate symbol for the mean of the SDM.
In terms of our example, all you have to do is recall that the
mean of the population of spatial ability quotients was 150. Therefore
the mean of the sampling distribution of the mean will be 150.
So the mu of the SDM is just equal to the mu of the population.
In this case it is equal to 150.
What we have done is go through several steps that give us three pieces
of information. First, if the population is normal, the sampling
distribution of the mean is normal. Second, the standard error of
the mean is given by a simple formula (population sigma over square
root of n). Third, the mu of the population and the SDM are the
same.
So we have fully specified the SDM. It's N(150, 6). In
contrast the population is N(150, 30).
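If you would like to check this result for yourself, you can simulate it. The sketch below is ours and rests on arbitrary choices (the seed and the number of repeated samples): it draws many samples of 25 scores from N(150, 30), takes the mean of each, and then looks at the mean and standard deviation of those sample means.

```python
import random
from statistics import stdev

random.seed(2)  # fixed seed so the run is repeatable (arbitrary choice)

mu, sigma, n = 150, 30, 25
num_samples = 5000  # number of simulated samples (arbitrary choice)

sample_means = []
for _ in range(num_samples):
    # One sample of n individual SAQ scores from the population N(150, 30).
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# Center and spread of the simulated sample means.
mean_of_means = sum(sample_means) / num_samples
sd_of_means = stdev(sample_means)

print(round(mean_of_means, 2))  # should land close to 150
print(round(sd_of_means, 2))    # should land close to 6, the SEM
```

The simulated values will not be exactly 150 and 6, but they should be close, which is what N(150, 6) predicts.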
Let's look back to our overview again. Now we have all four steps
fully developed.

This screen illustrates the four major ideas we have been talking
about: 1) a population; 2) a sample drawn from the population; 3)
a sample statistic (the mean); and 4) the sampling distribution
of the mean.

Summary. To find the sampling distribution of the mean we
need to know a few things. First, if the population is normal then
the SDM is normal. Second, the mu of the SDM is the same as the
mu of the population. Third, the standard deviation of the SDM is
simply the population sigma divided by the square root of n.

Here's another little aspect that's interesting. The population
is N(150, 30) and the SDM is N(150, 6). They are identical normal
distributions except that the SDM has a smaller standard deviation.
Therefore the SDM is taller and less spread out than the population.
The screen we are looking at puts them side by side. As you know
from your previous experience with the normal distribution, as sigma
gets smaller, the distribution gets thinner and taller. As sigma
gets larger the distribution gets shorter and wider.
The sampling distribution of the mean is going to be more compact,
more of the probability will be closer to the center, 150 in this
case, than is true of the population. The sampling distribution
of the mean is less variable than the population. That will turn
out to be a good and interesting characteristic. Sample means are
less variable than individual sample scores.
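To put a number on "taller," we can compare the height of each curve at its center. This is a sketch using Python's standard library NormalDist class; the variable names are ours.

```python
from statistics import NormalDist

population = NormalDist(mu=150, sigma=30)
sdm = NormalDist(mu=150, sigma=6)

# Height of each normal curve at its center, mu = 150.
peak_population = population.pdf(150)
peak_sdm = sdm.pdf(150)

# The ratio of the heights equals the ratio of the sigmas, 30 / 6 = 5.
ratio = peak_sdm / peak_population

print(round(peak_population, 4))
print(round(peak_sdm, 4))
print(round(ratio, 2))
```

Dividing sigma by 5 makes the curve 5 times taller at the center, because the total area under each curve must still equal 1.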

So let's go back and summarize. You start with some process in nature
and you measure it via some dependent variable measurement operations;
then you model your dependent variable as a population. Next you
go to the lab and do the research. That is, you collect a sample
from the population. Third, when you're done with the research,
you analyze the data, and one of the simplest things you can do
to analyze the data is find the average or the mean. Finally, you
find the probability distribution of sample means.
Questions? SEM is short for the standard error of the mean
(which itself is just a short way of saying the standard deviation
of the sampling distribution of the mean).
PRACTICE PROBLEM. Let's use what we've learned to solve
some probability problems.
Question 1: What is the probability that a single individual's
SAQ score falls between 140 and 160? You've already practiced problems
like this. If you want to know the probability that a single
SAQ score falls between 140 and 160, you use the population.
Question 2: What is the probability that the mean of a sample
of 25 individuals falls between 140 and 160? This is a different question.
It is about the mean not about an individual score. If you want
to know the probability that a sample mean falls between 140 and
160 you use the SDM to answer the question.
Let's solve those two questions. Open up StatCenter's "Normal
Tool." Suppose you want to answer the first question which
requires that you find the probability that one score falls between
140 and 160. To answer this, set Normal Tool's mu to 150 and sigma
to 30 because the population from which that individual is sampled
is N(150, 30). Then enter 140 as the lower score and 160 as the
upper score. The probability output window will show you that the
probability is .2586 of getting an individual score between 140
and 160.
The second question asks you to find the probability that the mean
of a sample of 25 individuals will fall between 140 and 160. On
the Normal Tool, keep mu the same (150) but change sigma to 6 because
the SDM is N(150, 6). Enter 140 and 160 as lower and upper scores.
Now the Normal Tool will output a .9031 probability. The probability
that a mean will fall between 140 and 160 is .9031.
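If you don't have the Normal Tool open, the same two questions can be sketched in Python with the standard library's NormalDist class. The helper name prob_between is ours. An exact calculation gives roughly .261 and .904, slightly different from the .2586 and .9031 quoted above; those quoted values appear to match what a z table gives when z is rounded to two decimals (0.33 and 1.66).

```python
from statistics import NormalDist

def prob_between(mu, sigma, lower, upper):
    """Probability that a normal variable with the given mu and sigma
    falls between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Question 1: a single score, so use the population N(150, 30).
p_single = prob_between(150, 30, 140, 160)

# Question 2: a mean of 25 scores, so use the SDM N(150, 6).
p_mean = prob_between(150, 6, 140, 160)

print(round(p_single, 4))
print(round(p_mean, 4))
```

The only thing that changes between the two calls is sigma: 30 for individual scores, 6 for means of 25 scores.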
Look back at one of the illustrations showing both the population
and the SDM. The SDM is the less variable of the two; it is taller
and thinner. Since the SDM is less variable than the population,
can you figure out why the probability of the mean falling between
140 and 160 is greater than the probability of a single score falling
between the same two points?


Two ways to think of sampling a mean. There are two ways
to think about sampling means. The first way is that we assume there's
a population, we sample it, and we get a mean. If we did that we
might get a specific mean, let's say 148.76, in a particular sample.
In other words, I've constructed a specific sample of 25 people.
The graphic shows that the first person's score was 156, the second
person's score was 137 and so on down to the last person's score,
which was 148. I haven't shown all of them but let's say we have
25 scores. I calculate the mean and I get 148.76. That's the whole
process. That's one way to think about how a mean comes into existence.

A second way to think about how a mean comes into existence is
to suppose that we already have defined the sampling distribution
of the mean. This option can be seen under the number 2 on the screen.
If the SDM is defined, we can think of sampling one mean from it.
In our particular case, we sample once from the SDM and get a mean
= 148.76.
Those are two rather distinct ways to think about how a mean comes
into existence. It can be sampled directly from the SDM or you can
think of sampling a whole large research sample from the original
population, and then calculating the mean.
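The two ways of producing a mean can be put side by side in a few lines. This is a sketch under our own assumptions (an arbitrary seed, and Python's random.gauss standing in for the sampling process); the two numbers it prints will differ from each other and from 148.76, since each is a fresh random draw.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable (arbitrary choice)

mu, sigma, n = 150, 30, 25
sem = sigma / n ** 0.5  # standard error of the mean, 30 / 5 = 6

# Way 1: sample 25 individual scores from the population, then average them.
scores = [random.gauss(mu, sigma) for _ in range(n)]
mean_way_1 = sum(scores) / n

# Way 2: sample a single mean directly from the SDM, N(150, 6).
mean_way_2 = random.gauss(mu, sem)

print(round(mean_way_1, 2), round(mean_way_2, 2))
```

Statistically the two procedures are equivalent: a mean produced either way follows the same N(150, 6) distribution.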
Just to practice, here's another example. Suppose I ask what is
the probability of sampling a mean between 144 and 156? Notice that
these two numbers are one SEM below and one SEM above
150. By now we might even have memorized that the probability of
falling between -1 and +1 standard deviation from mu is .6827. So
we know that the probability of getting a mean between 144 and 156
is .6827.
Review: Parameters versus Statistics. You need to make a
strong distinction between population parameters, like mu and sigma,
and sample statistics like the mean and the sample standard deviation.
In our SAQ example, the mu of the population is the same as the
mu of the sampling distribution of the mean; they are both 150.
And they are both population parameters.
In contrast, the mean of the sample is the average of a bunch of
scores. You have to calculate the sample mean from the data. In
our example, we didn't list all 25 scores in the sample. But if
we had them all we could calculate the sample mean. The answer we
gave for the sample mean was 148.76.
In general, when you draw a sample, you usually don't get a sample
mean that is exactly the same as the population mean.
It is important to make this distinction between population parameters
and sample statistics.
StatCenter Tools
StatCenter's SDM Tool is designed to let
you experience first hand all the processes involved in the Sampling
Distribution of the Mean. From these experiences you will naturally
learn about the relationship between a Population, a Sample,
a Mean of a Sample, and the Sampling Distribution of the Mean. As
a way of focusing on these learning experiences, you will be given
homework using the SDM Tool.

SDM Tool. Let's look at the StatCenter sampling distribution
of the mean tool. From Ducks in a Row simply click on the "Use
Sampling Distribution of the Mean Tool" link. From your Virtual
Desk, just click on your "Interactive Tool" icon and select
"SD of Mean Tool" from the list of interactive tools.
A "StatCenter" web page will pop up. If you scroll down
on it, it has a practice problem or two.
A menu for the Sampling Distribution of the Mean tool will also
pop up. It has two buttons. Click the lower button which says, "SDM
Tool."

Upper Right Hand Panel. You'll find two distributions in
the upper right hand panel of the SDM Tool. The black distribution
is the population. The red distribution is the sampling distribution
of the mean. They have the same mu, so the two distributions are
centered at the same value (mu is the center of a normal
distribution). The population and SDM differ only in that they have different
standard deviations. And so the population (which has a larger sigma)
is lower and wider and the SDM (which has a smaller sigma) is thinner
and taller.
Upper Left Hand Panel. Looking at the upper left
hand panel, you see an interface where you can enter information.
There is a place where you can set population mu depending
on what's given in the homework or test problem. You can also set
the population standard deviation. Finally, you can set the
sample size, n.
Sample size, n, along with population mu and sigma, are the three
really important pieces of information you need to get from a word
problem. You need to set all three of them to use the SDM Tool.
As an example to work with, set population mu = 100, sigma = 5,
and n = 10. Then PRESS UPDATE.
The SDM tool will immediately and automatically give you the mean
and standard deviation of the SDM. (Remember that the standard deviation
of the SDM is also called the standard error of the mean or SEM.)
Press the "Sample" button in the lower right hand
corner panel.

Lower Left Hand Panel. Sample scores will appear in the
lower left hand panel when you press the "Sample" button.
(The "Sample" button is in the lower right panel.)
The sample scores are called empirical data since they correspond
to the data collected in a research project. The SDM Tool calculates
the mean of the scores automatically for you. This mean is the empirical
mean that a scientist would calculate in a research project. It
is always important to distinguish between the theoretical population
mean, mu, and the empirical research mean.
New sets of empirical data will appear every time you press the
"Sample" button. Notice that (empirical) individual scores
are written in black. That is because they are sampled from the
(theoretical) population which is black. In contrast, the (empirical)
sample mean is written in red. That is because it is sampled from
the (theoretical) SDM which is red.
Click the "Sample" button many times. Notice how the
sample data and the sample mean change with each new sample you
take. The theoretical distributions (the population and the SDM) are constant and unchanging. The
empirical data change with every sample. Just stare at the data
as you click or just stare at the sample mean. Notice how their
values change.
Lower Right Panel. As you click "Sample" many
times, also notice the lower right panel. Every time you click "Sample"
a small red hatch mark appears representing that sample's mean. Each empirical
mean is a different value so the hatch marks are placed in different
places along the number line. If two means have values very close
to each other the hatch marks are piled on top of one another. Click
"Sample" quickly many times to get a sense of this.
If you click quickly, over and over, these hatch marks will eventually
stack up and begin to take the shape of the Sampling Distribution
of the Mean. This shows you that across huge numbers of samples
there is an empirical frequency distribution of sample means which
looks roughly like the theoretical SDM (shown above it in the upper
right hand panel).
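The piling up of hatch marks can be imitated outside the tool. The following is only a sketch, assuming a normal population with mu = 100, sigma = 5, and n = 10; the function names and the text histogram are invented for illustration and are not how the SDM Tool itself is written.

```python
import math
import random
import statistics

def draw_sample_mean(mu, sigma, n, rng):
    """One click of "Sample": draw n scores, return their mean."""
    scores = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(scores)

def simulate_means(mu, sigma, n, clicks, seed=0):
    """Repeat the click many times and collect the sample means."""
    rng = random.Random(seed)
    return [draw_sample_mean(mu, sigma, n, rng) for _ in range(clicks)]

means = simulate_means(mu=100, sigma=5, n=10, clicks=2000)

# Pile the means into 1-point-wide bins, like hatch marks stacking up.
counts = {}
for m in means:
    left_edge = math.floor(m)
    counts[left_edge] = counts.get(left_edge, 0) + 1

# One '#' stands for 20 sample means.
for left_edge in sorted(counts):
    print(f"{left_edge:>4} | {'#' * (counts[left_edge] // 20)}")

print("mean of the sample means:", round(statistics.fmean(means), 2))
print("SD of the sample means:  ", round(statistics.stdev(means), 2))
```

The printed rows should bulge near 100 and thin out on both sides, and the SD of the sample means should land close to the SEM rather than close to the population sigma.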

1 to 10,000 Samples per click. There's a pop-up menu in
the lower right panel just below where the red hatch marks appear.
There's a small white window. Next to the window is an arrow. Click
on the arrow. A menu will pop down. This menu will give you the
choice of how many samples you can draw with a single click. You
can draw 1, 5, 10, 50, 100, 500, 1,000, or 10,000 samples with a
single click. This allows you to easily see the evolution of the
shape of the frequency distribution of sample means.
Play with taking large numbers of samples with a single click.
You'll notice that the shape of the empirical distribution of sample
means quickly conforms to the normal shape of the theoretical SDM.
How many samples are necessary before you think the frequency distribution
of empirical means closely takes on the shape of the theoretical
SDM? Do you need hundreds of samples? thousands? tens of thousands?
A central idea in using this tool is to compare the theoretical
SDM (upper right panel) with the distribution of sample means which
pile up empirically (lower right panel).
[Note: The empirical data and mean in the lower left panel only
change when the number of samples per click is set to 1. This is
because when more than one sample is collected with a single click,
the computer samples means directly from the SDM rather than sampling
n different scores from the population and then computing the sample
mean. This allows it to get 1,000 or even 10,000 means very quickly.]
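The shortcut described in the note works because, for a normal population, a mean drawn directly from the SDM behaves the same as a mean computed from n sampled scores. Here is a hedged sketch of that comparison. It is not the tool's actual code; the constants and function names are assumptions made for the example.

```python
import math
import random
import statistics

MU, SIGMA, N = 100, 5, 10   # illustrative values (assumed)
rng = random.Random(1)

def mean_from_scores():
    """Slow route: sample N scores from the population, then average them."""
    return statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])

def mean_from_sdm():
    """Fast route: draw a single value directly from the SDM."""
    return rng.gauss(MU, SIGMA / math.sqrt(N))

slow = [mean_from_scores() for _ in range(10_000)]
fast = [mean_from_sdm() for _ in range(10_000)]

for label, values in (("from scores", slow), ("from SDM", fast)):
    print(f"{label:>12}: mean = {statistics.fmean(values):.2f}, "
          f"SD = {statistics.stdev(values):.2f}")
```

The two routes should report nearly the same mean and SD, while the fast route needs only one random draw per mean instead of N.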
Summary of Sampling Distributions. We will wind up this
topic with a review. It's important to realize that a scientist
is filled with curiosity about the mystery of nature. What interests
us most is that somewhere out there is stuff that's vastly beyond
all of our theories. As Shakespeare put it, "There are more things
in heaven and earth, Horatio, than are dreamt of in your philosophy."
It's these undreamt-of, undiscovered things that fascinate us so.
In psychology we're fascinated with human beings who are always
elusively beyond the reach of any theory no matter how good that
theory is. Other scientists might be fascinated with plant ecology
in a rain forest. This fascination leads scientists to study the
world.
Abduction. In studying the universe, one thing scientists
do is to measure things, to reduce them to dependent variable measurement
operations. This reduction of what we are interested in to numbers
is what leads to statistics. In statistics we model the numbers
we get as random variables or, in different words, as probability
distributions. These probability distributions we call populations.
This process of moving sideways across ideas and models from nature
to measurement to probability distribution is called abduction.
By way of contrast with abduction, induction is to infer upward
from information to a higher order principle. Deduction is to infer
downward from a principle to lower order consequences. Abduction
is neither upward nor downward. It is the sideways movement from
one way of conceptualizing to another way of conceptualizing on
the same level. For statistics we take the scientific idea of measurement
and model it as a probability distribution.
Four Step Schema. Step one is (through abduction) to model
our dependent variable as a population. The population gives us
the probability that a single score will take on certain values.
Second, draw a sample of many scores. This is not a simple process.
It requires that we do research projects that often take years to
complete. In statistics we summarize years of thought and effort
with the simple phrase "draw a sample." Third, we analyze
the data in the sample with some statistic, such as the mean. Finally,
we find the sampling distribution of the statistic.
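The four steps can be laid out as a short simulation. This is a rough sketch under stated assumptions: the population is taken to be normal with mu = 100 and sigma = 5, the function names are hypothetical, and step four is only approximated empirically by repeating steps two and three many times. The median is included next to the mean simply to show that the schema applies to any statistic.

```python
import random
import statistics

def population(rng):
    """Step 1: the dependent variable modeled as a population (assumed normal)."""
    return rng.gauss(100, 5)

def sampling_distribution(draw, n, statistic, replications, seed=0):
    """Approximate the sampling distribution of a statistic by simulation."""
    rng = random.Random(seed)
    values = []
    for _ in range(replications):
        sample = [draw(rng) for _ in range(n)]   # Step 2: draw a sample
        values.append(statistic(sample))         # Step 3: compute the statistic
    return values                                # Step 4: its (empirical) distribution

results = {}
for name, stat in (("mean", statistics.fmean), ("median", statistics.median)):
    results[name] = sampling_distribution(population, n=10,
                                          statistic=stat, replications=5000)
    print(f"{name:>6}: center = {statistics.fmean(results[name]):.2f}, "
          f"spread = {statistics.stdev(results[name]):.2f}")
```

Each statistic gets its own sampling distribution with its own spread, which is why step four always has to name the statistic it is about.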

Word Salad. This four step process often leads us to speak
in confusing phrases which sound like word salad. For example, we
need to find the mean of the sampling distribution of the mean,
which might be shortened to the mean of the mean. Of course we just
found out that the mean of the SDM is the mean (mu) of the population.
But, there we go; until you are used to these ideas the previous
sentence might sound like word salad.
We also might talk about the standard deviation of the mean (which
is sometimes called the standard error of the mean). In this lecture
we found out that the standard deviation of the mean is equal to
the population standard deviation divided by the square root of
sample size (which is more word salad until you work with these
ideas for a while).
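Written in symbols, the two word salad phrases above are much shorter. The notation here is a common convention and an assumption on my part (X-bar for the sample mean, with subscripts marking the SDM), and the one-line justification assumes the n scores are drawn independently from the population.

```latex
% Mean and standard deviation (standard error) of the SDM
\mu_{\bar{X}} = \mu,
\qquad
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

% Why the square root of n appears (independent scores X_1, ..., X_n):
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{n\,\sigma^{2}}{n^{2}}
  = \frac{\sigma^{2}}{n}
```

Taking the square root of the last expression gives sigma divided by the square root of n, the standard error of the mean.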
The point is that it's ok to feel confusion at this point. These
are abstract and tricky concepts and you need to work with them
extensively using StatCenter tools and doing homework. You should
be aiming for a time when these word salad phrases take on clear
meaning.
|