Hello everyone. This is our 2nd lecture for Chapter 14 – Introduction to Correlations
and Regression
In our last video, we introduced correlations and the Pearson r coefficient. In this lecture
video, we will discuss other types of correlation coefficients, chi-square (briefly), and linear
regression.
The Pearson correlation measures the direction and strength of a linear relationship between
two continuous variables (when X and Y consist of numerical scores from an interval or ratio
scale of measurement). Other correlations have been developed for nonlinear relationships
and for other types of data. In this video, we will take a quick look at three additional
correlations: Spearman, point-biserial, and the phi-coefficient. The beauty of these other correlations is that they use the same formula as the Pearson r; they just have different names because of the type of data being analyzed.
The Spearman correlation, rs, is used to measure the relationship between X and Y when both
variables are measured on ordinal scales. Remember from Chapter 1 that an ordinal scale
typically involves ranking individuals rather than obtaining numerical scores. The Spearman
correlation can be used as a valuable alternative to the Pearson correlation, even when the
original raw scores are on an interval or ratio scale. Remember that Pearson r measures the strength and direction of the linear relationship between two variables – how well the data points fit on a straight line. Sometimes, however, a researcher expects the data to show a consistent one-directional relationship, but not necessarily a linear relationship. For example, for nearly
any skill, increasing amounts of practice tend to be associated with improvements in
performance (the more you practice, the better you get). However, it is not a straight-line
relationship. When you are first learning a new skill, practice produces large improvements
in performance. After performing a skill for years, additional practice produces only minor
changes. If you were to measure this with a Pearson r correlation, you would not find
a correlation of 1 because the data do not fit perfectly on a straight line. So the Spearman
correlation can be used to measure the consistency of the relationship, regardless of its form.
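To make this concrete, here is a quick sketch in Python (with made-up practice/performance numbers, not data from the lecture) showing that the Spearman correlation is just the Pearson formula applied to ranks:

import numpy as np
from scipy import stats

# Hypothetical data: practice (X) and performance (Y) increase together,
# but the relationship is curved, not a straight line.
X = np.array([1, 2, 4, 8, 16, 32])
Y = np.array([10, 30, 45, 55, 60, 62])

# Pearson r on the raw scores is high, but less than 1.00,
# because the points do not fall on a straight line.
r, _ = stats.pearsonr(X, Y)

# Convert each variable to ranks, then apply the same Pearson formula.
rs_manual, _ = stats.pearsonr(stats.rankdata(X), stats.rankdata(Y))

# scipy's built-in Spearman function gives the same answer.
rs_builtin, _ = stats.spearmanr(X, Y)

print(r)           # less than 1.00 (relationship is not linear)
print(rs_manual)   # 1.00 (relationship is perfectly consistent)
print(rs_builtin)  # 1.00 (matches the rank-then-Pearson approach)

Because more practice always goes with better performance in these made-up data, the Spearman correlation is a perfect 1.00 even though Pearson r is not.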
The point-biserial correlation is used to measure the relationship between two variables
in situations where one variable consists of numerical (continuous, interval or ratio)
scores, but the second variable has only two values. A variable with only two values is
called a dichotomous variable. Some examples of dichotomous variables are: male versus
female, college grad versus not college grad, first-born child versus later-born child,
success versus failure, and so on.
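Here is a minimal sketch (again with made-up numbers) showing that the point-biserial correlation is just Pearson r with the dichotomous variable coded as 0 and 1:

import numpy as np
from scipy import stats

# Hypothetical data: exam score (numerical) and college grad status,
# coded 0 = not a college grad, 1 = college grad.
scores = np.array([72, 85, 64, 90, 78, 88, 61, 95])
grad   = np.array([ 0,  1,  0,  1,  0,  1,  0,  1])

# The point-biserial r is simply the Pearson formula applied to these values.
r_pb, _ = stats.pearsonr(scores, grad)

# scipy also has a dedicated function that returns the same value.
r_pb2, _ = stats.pointbiserialr(grad, scores)

print(r_pb, r_pb2)   # identical values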
The phi-coefficient is used to measure the relationship between two variables in situations
where both variables are dichotomous. An example of this would be if a researcher were interested
in examining the relationship between birth order position and personality for individuals
who have at least one sibling. Birth order position would be classified as first-born child versus later-born child, and personality would be classified as introvert versus extrovert. Although the phi-coefficient
can be used to assess the relationship between two dichotomous variables, the more common
statistical procedure is a chi-square statistic, which is discussed in Chapter 13.
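Before we get to chi-square, here is a quick sketch of the phi-coefficient with hypothetical 0/1 codes, where both variables are dichotomous:

import numpy as np
from scipy import stats

# Hypothetical data: birth order (0 = first-born, 1 = later-born) and
# personality (0 = introvert, 1 = extrovert) for eight individuals.
birth_order = np.array([0, 0, 0, 0, 1, 1, 1, 1])
personality = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# The phi-coefficient is simply Pearson r with both variables coded 0/1.
phi, _ = stats.pearsonr(birth_order, personality)
print(phi)   # 0.50 for these made-up codes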
The chi-square statistic is more appropriate because it uses the proportions obtained from sample data to test hypotheses about the corresponding proportions in the population. Sometimes, a researcher has questions about the proportions or relative frequencies for a distribution, especially when the data are nominal. In these situations, the chi-square statistic is the appropriate procedure to use. Chi-square tests are nonparametric tests.
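Before we talk about what "nonparametric" means, here is a quick illustration of how a chi-square test could look in Python, using a made-up 2x2 frequency table (scipy's chi2_contingency does the work):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical frequency counts:
# rows = first-born vs. later-born; columns = introvert vs. extrovert.
observed = np.array([[18, 12],
                     [ 7, 23]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)   # test statistic, degrees of freedom, p-value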
In statistics, we have parametric tests, which include all of the statistical procedures you have learned in this class (z-test, t-tests, F-ratio, and correlations). They are called parametric because they test hypotheses about population parameters and require that certain assumptions about the population (such as normality) be met. There are times when experimental situations do not conform to the requirements of a parametric test. In these situations, when assumptions are violated, we need alternatives to parametric tests; these alternatives are called nonparametric tests. The common parametric tests each have a nonparametric alternative. So that was a very brief intro to chi-square, and all you will need to know for the final.
So, now let's go back to chapter 14 and talk about linear regression. Earlier, we
introduced the Pearson correlation as a technique for describing and measuring the linear relationship
between two variables. If there is a strong linear relationship, we can draw a line through
the middle of the data points. This line serves several purposes:
1. The line makes the relationship between the two variables easier to see.
2. The line identifies the center, or central tendency, of the relationship.
3. Finally, and most importantly, the line can be used for prediction. The line establishes a precise one-to-one relationship between each X value and a corresponding Y value.
Now we will look at a procedure that identifies and defines the straight line that provides
the best fit for any specific set of data. This straight line does not have to be drawn
on a graph; it can be represented in a simple equation (that most, if not all, of you will
recognize). Our goal is to find the equation for the line that best describes the relationship
for a set of X and Y values.
In general, the linear relationship between two variables X and Y can be expressed by
the equation Y-hat = a + bX, where Y-hat is the predicted Y. This formula should look familiar (y = mx + b from algebra). In statistics, we just use different placeholders for the slope and the y-intercept. In this general linear equation, the value of b is called the slope. The slope
determines how much the Y variable changes when X is increased by 1 point. The value
of a, in this general linear equation, is called the y-intercept because it determines
the value of Y when X = 0 (on a graph, the a value identifies the point where the line
intercepts the y-axis).
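A tiny sketch in Python, using made-up values a = 4 and b = 2, shows how the equation turns each X into a predicted Y:

def predict(x, a=4, b=2):
    # Y-hat = a + bX: a is the y-intercept, b is the slope.
    return a + b * x

print(predict(0))   # 4 -> when X = 0, Y-hat equals the intercept a
print(predict(1))   # 6 -> each 1-point increase in X changes Y-hat by b = 2
print(predict(2))   # 8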
Because a straight line can be extremely useful for describing a relationship between two
variables, we use a statistical technique that provides a standardized way to determine
the best-fitting straight line for any set of data. This statistical procedure is called
regression and the resulting straight line is called the regression line. The goal for
regression is to find the best-fitting straight line for a set of data. To do this, however,
we first need to define precisely what we mean by "best fit." For any particular
set of data, it is possible to draw lots of different straight lines that all appear to
pass through the center of the data points. Each of these lines can be defined by a linear
equation. The problem is to find the specific line that provides the best fit to the actual
data points. We define the best fitting line as the one that minimizes prediction error.
Here is some background on how to minimize prediction error. To determine how well a
line fits the data points, the first step is to define mathematically the distance between
the line and each data point. For every X value in the data, the linear equation determines
a Y value on the line. This value is the predicted Y or Y-hat. The distance between this predicted
value and the actual Y value is the error of prediction (Y – Y-hat). This distance
measures the error between the line and the actual data. Because some of the distances
are positive and some are negative, we have to square each distance to get a positive
measure of error. To determine the total error between the line and the data, we add the
squared errors for all the data points. This method is called the least-squared-error solution. The regression line is also known as the least-squares regression line (or LSRL for short).
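Here is a small sketch (with made-up X and Y values) of how the total squared error is computed for any candidate line, which is exactly the quantity the least-squared-error criterion minimizes:

import numpy as np

# Hypothetical data points.
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 5, 7, 8, 12])

def total_squared_error(a, b):
    # For the line Y-hat = a + bX, sum the squared errors (Y - Y-hat)^2.
    y_hat = a + b * X
    return np.sum((Y - y_hat) ** 2)

# Two candidate lines that both pass near the center of the points:
print(total_squared_error(a=1.0, b=2.0))   # 3.0 (worse fit)
print(total_squared_error(a=0.0, b=2.3))   # about 1.95 (better fit)
# The regression line is the unique line with the smallest total squared error.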
Here is the graphical representation of this least-squares idea. As you can see, the regression line Y-hat = a + bX is the one that best fits these data because it minimizes the distances (or errors) between the line and the actual data points.
Here is how we find the slope and y-intercept to construct our regression line equation.
To find the slope, either formula works:
b = SP / SSX (the sum of products divided by the sum of squares for X), or
b = r(sY / sX) (Pearson r times the standard deviation of Y, divided by the standard deviation of X).
To find the y-intercept:
a = MY – b(MX) (the sample mean of Y minus the slope times the sample mean of X).
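Putting these formulas to work on the same made-up data from the error sketch above, a few lines of Python find the slope and y-intercept:

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 5, 7, 8, 12])

M_X, M_Y = X.mean(), Y.mean()

SP   = np.sum((X - M_X) * (Y - M_Y))   # sum of products
SS_X = np.sum((X - M_X) ** 2)          # sum of squares for X

b = SP / SS_X        # slope: b = SP / SSX
a = M_Y - b * M_X    # y-intercept: a = MY - b(MX)

print(b, a)          # about 2.3 and -0.1 for these data

# Equivalent slope using b = r(sY / sX):
SS_Y  = np.sum((Y - M_Y) ** 2)
r     = SP / np.sqrt(SS_X * SS_Y)
b_alt = r * (Y.std(ddof=1) / X.std(ddof=1))
print(b_alt)         # about 2.3, same as SP / SSX

So for these hypothetical data, the least-squares regression line is Y-hat = -0.1 + 2.3X.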
I will post a video that walks you through the calculations and interpretations of a
Pearson r correlation and regression equation.