Hello everyone.
This is our third and last lecture for Chapter 2.
In this lecture, we will focus on Measures of Variability: the Range, the Variance, and the Standard Deviation.
Remember that there are three things we want
to know about a set of data: its shape, its "typical" value or measures of central
tendency, and its "spread" of scores or measures of variability.
We've already looked at how to describe the shape of distributions and how to calculate
and interpret measures of central tendency.
The purpose of measures of central tendency is to describe the "typical" value of
a variable.
Measures of central tendency, however, can be misleading by themselves: two distributions can have the same mean while one distribution's scores are much more tightly clustered together than the other's.
So we need measures of variability as well.
It's not enough to report the mean of our data values.
We also need to describe how spread out our data values are.
How much variation is there within our data set?
Is there lots of variation between data values or is there very little variation?
As you can see here, the distribution for height has less variability; scores are closer
together.
The distribution for weight has more variability; its scores are more spread out.
Variation always exists in a data set, regardless of which characteristic you're measuring,
because not every individual is going to have the same exact value for every variable.
Variability is what makes the field of statistics what it is.
For example, the price of homes varies from house to house, from year to year, and from
state to state.
Household income varies from household to household, from country to country, and from
year to year.
Variability can be defined several ways: variability is a quantitative measure of the difference
between scores; variability describes the degree to which the scores are spread out
or clustered together.
Measures of variability serve three purposes: 1.
To describe the distribution in terms of the average distance between the scores and the mean, 2.
To measure how well an individual score represents the distribution, and 3.
To indicate how much error to expect when using a sample to represent a population.
So, let's start our discussion on measures of variability with the range.
The simplest way to measure variability or spread is to look at the largest and smallest
values.
This gives us information about the tails of the distribution.
The range is the difference between the largest and smallest data values.
The largest data value is also referred to as the maximum and the smallest data value
is referred to as the minimum.
So let's look at an example of 10 Psychology students' exam scores:
82 77 90 71 62 68 74 84 94 88
The highest test score is 94 and the lowest test score is 62.
The range, R, is equal to 94 – 62 = 32.
All the students in the class scored between 62 and 94 on this exam.
The difference between the best and the worst score is 32 points.
The range is affected by outliers or extreme values in the data set, so the range is not
resistant to outliers.
If the student with the worst score (62) actually didn't study and scored even lower (28),
the range becomes 94 – 28 = 66.
The difference between the best and worst score here is 66 points.
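The range calculation above, and its sensitivity to a single outlier, can be sketched in a few lines of Python (using the exam scores from the example):

```python
scores = [82, 77, 90, 71, 62, 68, 74, 84, 94, 88]

# Range = maximum - minimum
r = max(scores) - min(scores)
print(r)  # 94 - 62 = 32

# Replace the lowest score (62) with 28 to see the outlier effect
scores_outlier = [28 if s == 62 else s for s in scores]
r_outlier = max(scores_outlier) - min(scores_outlier)
print(r_outlier)  # 94 - 28 = 66
```

Changing one score doubled the range, which is why the range is called "not resistant" to outliers.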
Also, the range is calculated using only 2 values in the data set (the largest and smallest).
Researchers rarely use the range because it is a very crude way of describing variability
since it only considers 2 scores from the group of scores and it does not take into
account how clustered together the scores are within the distribution.
The variance and the standard deviation, on the other hand, use all the data values in
the calculations.
Measures of variability are meant to describe how spread out data are.
Another way to think about this is to describe how far, on average, each observation is from
the mean.
The standard deviation of a group of scores tells us how spread out the scores are around
the mean.
To be precise, the standard deviation is a measure of the standard, or average distance,
from the mean.
Just as there is a population mean and sample mean, we also have a population standard deviation
& variance and a sample standard deviation & variance.
The calculations are different for population and sample data.
Here are the steps to calculate the population standard deviation:
If we don't know what the mean is, our first order of business is going to be to calculate
the mean for the variable of interest.
Next, we're going to subtract the mean from each score.
This gives each score's deviation score.
The deviation score is how far away the actual score is from the mean.
The distance separating a score from the mean is called the score's deviation, indicating
the amount the score "deviates" from the mean.
Take each data value and subtract the mean from it.
(X – μ) for population mean.
If the deviation is positive then the raw score is larger than the mean.
If the deviation is negative then the raw score is less than the mean.
The size of the deviation (regardless of its sign) indicates the distance the raw score
lies from the mean: this means that the larger the deviation, the farther the score is away
from the mean.
A deviation of 0 indicates that the raw score equals the mean.
The sum of the deviations in this step always equals zero.
Since the sum of the deviations is always zero, the mean deviation (or average distance)
will always be zero.
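The fact that the deviations always sum to zero is easy to verify; here is a quick Python check (the scores are just illustrative):

```python
scores = [1, 3, 5, 7]
mu = sum(scores) / len(scores)         # mean = 4.0
deviations = [x - mu for x in scores]  # [-3.0, -1.0, 1.0, 3.0]

# Positive and negative deviations cancel out exactly
print(sum(deviations))  # 0.0
```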
So, we need a new strategy to calculate the average deviation or distance.
So, the next step is to square each of the deviation scores.
This gives each score's squared deviation score.
(X – μ)².
Once we've squared each deviation score, we then add up the squared deviations.
This total is called the sum of squared deviations.
The formula is Sum of Squares (SS) = ∑(X – μ)².
The sum of squared deviations is also known as sum of squares, or SS, for short.
Next step is to divide the sum of squared deviations (SS) by the number of scores.
This gives us the average of the squared deviations, called the variance.
As you see, along the way to calculating population standard deviation we get the population variance.
Population variance equals mean squared distance (or average squared distance) of the scores
from the population mean.
Population variance measures variability in squared units.
The formula for population variance, sigma squared, is the Sum of Squares (SS) divided by the number of scores (N):
σ² = SS/N
Why do we square the deviation scores?
Well, first, squaring each difference makes them all positive numbers (remember the sum of the deviations always equals zero).
If we did not square the deviations, then we would always divide 0 by the number of
scores.
Second, it also makes the bigger differences stand out; for example, 100² is equal to 10,000.
This is a lot bigger than 50², which is equal to 2,500.
A downside, however, is that squaring the differences makes the final answer really big and harder to understand.
The variance is difficult to interpret because it is measured in squared units.
However, variance plays a really important role in statistical inference, as you will
see in later chapters.
Remember that the goal here is to calculate a measure of the "standard," or average
distance of the scores from the mean.
So the next step is to adjust for having squared all the differences by taking the square root
of the variance.
The standard deviation is the positive square root of the variance.
Now, there are two ways to calculate the sum of squares (the numerator of the variance
and standard deviation formulas).
One way is what I just covered, where we calculate each deviation score, square each of the deviations, and then add up the squared deviations.
This is the definitional formula for sum of squares.
All of the textbooks use this formula.
However, there is an easier way to calculate sum of squares called the computational formula.
This formula leads to fewer calculation errors.
The first step is to square each score and add the squared scores: Σ(X²).
The next step is to add up the scores, square the sum, and divide it by the number of scores: (ΣX)²/N.
Then, subtract your answer in the 2nd step from your answer in the 1st step.
The formula is SS = Σ(X²) - (ΣX)²/N.
Remember that this is just the numerator of the formula and not the final answer for variance and standard deviation.
We would still need to finish the calculation and divide sum of squares by N to find the
average squared distance from the mean, or variance & standard deviation.
I will show you both of these formulas in action so you can see the difference between
the two.
Let's take a simple example and work through the steps of the definitional formula.
Suppose you have a population with four numbers: 1, 3, 5, 7.
Step 1: find the mean (1+3+5+7)/4 = 16/4 = 4.
Step 2: subtract the mean from each number (1-4) = -3; (3-4) = -1; (5-4) = 1; (7-4) = 3
**To double check your math in this step…the sum of the deviations always equals zero.
Step 3: square each deviation (-3)² = 9; (-1)² = 1; (1)² = 1; (3)² = 9
Step 4: add up results from step 3 (9+1+1+9) = 20. This is our sum of squares.
Step 5: divide sum of squares by N. 20/4 = 5.
This is our population variance.
Step 6: take the square root: √5 = 2.2361.
This is our population standard deviation, σ.
The interpretation for this population parameter, σ, from our data set of 1, 3, 5, and 7, is
that the average distance of the scores from the mean is 2.2 points.
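The six steps of the definitional formula can be sketched in Python, using the same population of 1, 3, 5, and 7:

```python
import math

population = [1, 3, 5, 7]
N = len(population)

mu = sum(population) / N                   # Step 1: mean = 4.0
deviations = [x - mu for x in population]  # Step 2: deviation scores
squared = [d ** 2 for d in deviations]     # Step 3: squared deviations
ss = sum(squared)                          # Step 4: sum of squares = 20.0
variance = ss / N                          # Step 5: population variance = 5.0
sigma = math.sqrt(variance)                # Step 6: population SD ≈ 2.2361

print(ss, variance, round(sigma, 4))  # 20.0 5.0 2.2361
```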
Notice how many calculations are needed here and how many ways to make calculation errors.
Now, let's take this same example and work through the steps of the computational formula.
Step 1: square each score and add the squared scores.
(1)² = 1; (3)² = 9; (5)² = 25; (7)² = 49; add up results (1+9+25+49) = 84
Step 2: add up the scores, square the sum and divide it by the number of scores.
(1+3+5+7) = 16; square the sum 16² = 256; divide result by number of scores 256/4 = 64
Step 3: subtract your answer in the 2nd step from your answer in the 1st step.
84 – 64 = 20.
As you can see, we get the same sum of squares (SS) with fewer calculations and fewer ways to make calculation errors.
So, the next two steps are the same as the definitional formula where population variance
is 5 and population standard deviation is 2.2.
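The computational formula is just as short in Python, and you can see it produces the same sum of squares as the definitional formula:

```python
import math

population = [1, 3, 5, 7]
N = len(population)

sum_x2 = sum(x ** 2 for x in population)  # Step 1: Σ(X²) = 84
correction = sum(population) ** 2 / N     # Step 2: (ΣX)²/N = 256/4 = 64.0
ss = sum_x2 - correction                  # Step 3: SS = 84 - 64 = 20.0

variance = ss / N            # population variance = 5.0
sigma = math.sqrt(variance)  # population SD ≈ 2.2361
print(ss, variance)  # 20.0 5.0
```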
One of the goals of inferential statistics is to draw conclusions about the population
based on limited information from a sample.
Samples are different from populations.
Samples tend to have less variability than their populations, and calculating sample variance and standard deviation in
the same way that we do for a population would result in a biased estimate of the population
parameters.
We want our sample statistics to be unbiased estimates for the population parameters.
In order to remove this bias, we will need to divide our sum of squares by (n-1) instead
of N.
For sample variance and standard deviation, the sum of squares is calculated in the same
way as before.
So the numerator stays the same.
Our denominator is the only thing that changes here.
The denominator changes from N (number of scores) to (n – 1).
Using our previous example, so that
now we're looking at sample data of 1, 3, 5, & 7.
Remember that our sum of squares (SS) was equal to 20.
This does not change because the numerator of our formulas does not change from population
to sample.
What changes is our denominator.
So now to calculate sample variance, we need to divide sum of squares by (n – 1), 20/(4-1)
is equal to 20/3, which is equal to 6.6667.
This is our sample variance, s2.
To calculate sample standard deviation, we take the square root of 6.6667 and that gives
us 2.58.
This is our sample standard deviation, s.
The interpretation for this sample statistic, s, from our data set of 1, 3, 5, and 7, is
that the average distance of the scores from the mean is 2.6 points.
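Python's standard-library statistics module makes exactly this population/sample distinction: pvariance and pstdev divide the sum of squares by N, while variance and stdev divide by (n – 1). A quick check with our data set of 1, 3, 5, and 7:

```python
import statistics

data = [1, 3, 5, 7]

# Population formulas: divide the sum of squares by N
pop_var = statistics.pvariance(data)  # SS/N = 20/4 = 5
pop_sd = statistics.pstdev(data)      # sqrt(5) ≈ 2.2361

# Sample formulas: divide the sum of squares by (n - 1) to remove bias
samp_var = statistics.variance(data)  # 20/3 ≈ 6.6667
samp_sd = statistics.stdev(data)      # ≈ 2.58

print(pop_var, round(pop_sd, 4), round(samp_var, 4), round(samp_sd, 2))
```

Notice that the sample statistics are always a bit larger than the population versions, because dividing by the smaller denominator (n – 1) inflates the result to correct for the bias.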
Standard deviation can be difficult to interpret as a single number on its own.
Basically, a small standard deviation means that the values in the data set are close
to the middle of the data set, on average, while a large standard deviation means that
the values in the data set are farther away from the middle, on average.
Why does this matter?
Another goal of inferential statistics is to detect meaningful and significant patterns
in our data.
Variability in the data influences how easy it is to see these patterns.
High variability hides the patterns that we would be able to see in low variability samples.
So, we want small standard deviations in our sample data.
Small standard deviations represent low variability.
Think of low variability and small standard deviations as scores that are closely clustered
around their mean, which will make it easier for us to find meaningful patterns in our
data.
We have just completed Chapter 2.
Again, lots of information to take in.
Remember to practice the calculations and think about the logic of each formula.
This will help you remember which formula to use when.
Make sure to complete the Top Hat homework for both Chapters 1 and 2.
I have also assigned Lab #2.
It is a sample quiz to help you practice the concepts you have learned and has similar
questions to the proctored quiz for Chapters 1 & 2.
Let me know if you have any questions, you can always click on the Canvas Chat to see
if I am online and send me a message.