Hello. Welcome to a new episode.
Today's topic is Linear Regression, part #2.
Model Building & Application
In this episode, we will continue with the topic of linear regression.
We are going to focus on a case study.
We will build linear models and fit their coefficients.
Then, we will interpret the coefficients.
There is an obvious prerequisite for today.
Please watch the previous episode if you haven't done so already.
We will assume that you are familiar with the basic concepts and notations.
The data set we will use today comes from the 1974 Motor Trend US magazine.
This data set lists the fuel consumption, or MPG, along with 10 other aspects of the car design.
It is a relatively small data set: we have only 32 car models.
As we will see later, there is a column in the data table named "mpg".
mpg stands for miles per gallon, and it will be our dependent variable.
Then, we have 10 more columns, or variables, that may have some effect on the miles per gallon.
For example, "cyl" stands for the number of cylinders in the car engine.
"hp" stands for the horsepower.
"wt" stands for the weight, given in units of 1,000 lbs.
"am" is a variable that tells you whether the transmission is manual or automatic.
All of these are possible independent variables for the linear model.
Now, the questions we are asking about this data set are:
Which transmission type (manual or automatic) is better for the MPG?
Can we detect a difference in the MPG between the two?
If so, by how much?
OK. Let's do some exploratory data analysis.
We have the data set as we can see here.
The columns are named after the variables.
These columns correspond to the mpg and 10 other features.
The simplest thing we can do would be to calculate the group averages.
To do so, we will use the Excel spreadsheet function AVERAGEIF.
The syntax is as shown here.
The first argument is the cell range of the variable am.
The second argument is the condition.
For example, it is "=0" for the group average of the automatic transmission.
The third argument is the cell range of the variable mpg.
So, here we have the group averages.
Please notice that the average MPG for the automatic transmission is 17.147.
And the average MPG for the manual transmission is higher by 7.245, at 24.392.
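For viewers following along without Excel, here is a minimal Python sketch of the same group averages. The helper name `average_if` is made up for illustration and only mimics what AVERAGEIF does here. The mpg and am columns are typed in from the public mtcars table, so please check them against your own copy.

```python
# mpg and am columns of the 32 cars, in the usual mtcars row order.
# Assumption: transcribed by hand from the public data set; verify before reuse.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
# am: 0 = automatic, 1 = manual
am = [1] * 3 + [0] * 14 + [1] * 3 + [0] * 5 + [1] * 7


def average_if(conditions, target, values):
    """Mean of `values` over the rows where `conditions` equals `target`."""
    selected = [v for c, v in zip(conditions, values) if c == target]
    return sum(selected) / len(selected)


mean_automatic = average_if(am, 0, mpg)
mean_manual = average_if(am, 1, mpg)
difference = mean_manual - mean_automatic

# prints: 17.147 24.392 7.245
print(round(mean_automatic, 3), round(mean_manual, 3), round(difference, 3))
```

The three arguments are in the same order as in the spreadsheet function: the condition column, the condition value, and the column to average.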
The next simplest thing we can do with the data would be to calculate the correlation matrix.
We go to the PrimaXL tab and click on this small button that looks like a sigma squared sandwiched between square brackets.
Here, we specify the cell range of the data columns.
We check on this box to stream the output to a new sheet.
Then, we select the correlation matrix in the drop down menu.
Finally, we press the [Run] button to execute.
Great.
Here, I have copied and pasted the labels manually to guide the eye.
The first row, or the first column, shows the correlations with the variable mpg.
There seem to be strong correlations between mpg and some of these variables.
Those variables are cyl, disp, hp and wt, to name a few.
These are good candidates for explanatory variables.
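As a rough cross-check outside of PrimaXL, the correlation matrix can be sketched in plain Python. To keep the example short, it only covers four of the eleven columns (mpg, am, hp, wt). The function name `pearson` and the dictionary layout are choices made for this sketch, and the data are typed in from the public mtcars table.

```python
from math import sqrt

# Assumption: columns transcribed by hand from the public mtcars data set.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
am = [1] * 3 + [0] * 14 + [1] * 3 + [0] * 5 + [1] * 7
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264,
      175, 335, 109]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440,
      3.440, 4.070, 3.730, 3.780, 5.250, 5.424, 5.345, 2.200, 1.615, 1.835,
      2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513, 3.170, 2.770,
      3.570, 2.780]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


columns = {"mpg": mpg, "am": am, "hp": hp, "wt": wt}
# corr[a][b] holds the correlation between columns a and b.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns}

for name, row in corr.items():
    print(name, [round(value, 3) for value in row.values()])
```

The first printed row plays the same role as the first row of the spreadsheet output: it lists the correlations of every column with mpg.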
Now, let's apply the linear regression method.
Do you remember that the linear models have this generic form?
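The formula shown on screen is not part of this transcript. As a reminder, the generic form is usually written as below. The number of variables k and the error term epsilon are notation assumed here, which may differ slightly from the slide.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon
```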
Let's do some model building.
As mentioned before, the mpg will be our dependent variable Y.
However, we have a lot of freedom in choosing the rest of the model.
Our first model is one of the simplest.
We only have one independent variable.
X1 is simply the variable am, which equals 0 for the automatic transmission and 1 for the manual transmission.
A variable that can only take the values 0 or 1 is often called a "dummy variable".
We can see that this variable acts like a switch that turns the beta_1 term on and off.
The intercept, beta_0, can be interpreted as the baseline MPG for the automatic transmission.
This is because X1 is zero for the automatic transmission, so the MPG is fully accounted for by beta_0 alone.
Then, what is the meaning of beta_1?
Let's remember that this term is on when the transmission type is manual.
So, beta_1 is the difference in the MPG of the manual transmission relative to the automatic.
If beta_1 is positive, the manual transmission has better MPG than the automatic.
If beta_1 is negative, the opposite is true.
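Written out, the first model and its two cases look like this. The expectation notation E[Y] and the error term epsilon are assumptions added here for clarity.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \varepsilon, \qquad X_1 = \mathrm{am} \in \{0, 1\} \\
E[Y \mid X_1 = 0] &= \beta_0 && \text{(automatic)} \\
E[Y \mid X_1 = 1] &= \beta_0 + \beta_1 && \text{(manual)}
\end{aligned}
```

Subtracting the second line from the third leaves beta_1, which is why it reads as the manual-minus-automatic difference.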
Now, let's go to the Excel spreadsheet and fit the model coefficients.
We have a simplified data table with only the columns that we are going to use.
We go to the PrimaXL tab, select linear and then fit.
We select the cell range for the X variable.
Then, we select the cell range for the Y variable.
Finally, we specify the output location.
For the time being, we uncheck these boxes. We will return to them in a future video.
Now, we press the [Run] button to execute.
Great! We have the result.
The intercept, beta_0, is 17.147, which is the same value we got from the group average.
So far so good!
And beta_1 is 7.245, which is the difference between the group averages we calculated before.
By the way, we already discussed how to interpret p-values in some of the previous episodes.
We can notice that the p-value of beta_1 is quite small: smaller than the reference value of 0.05.
So, beta_1 is statistically significant, and we can claim that we found a meaningful difference in the MPG between the transmission types.
So, we have just confirmed by two different methods that the average MPG for the manual transmission is higher.
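The agreement between the two methods can be checked with a short Python sketch of a one-variable least-squares fit. The function name `fit_simple` is made up for this example, and it is not how PrimaXL computes the fit internally. P-values are left out, and the data are typed in from the public mtcars table.

```python
# Assumption: columns transcribed by hand from the public mtcars data set.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
# am: 0 = automatic, 1 = manual
am = [1] * 3 + [0] * 14 + [1] * 3 + [0] * 5 + [1] * 7


def fit_simple(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


beta_0, beta_1 = fit_simple(am, mpg)
print(round(beta_0, 3), round(beta_1, 3))
```

With a single 0/1 dummy variable, the fitted intercept is the automatic group average and the slope is the gap between the two group averages.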
Is everyone happy with this result? Can we just call it a day?
Hmm… wait! What about the remaining features (variables)?
We haven't accounted for the influence of the other variables.
So, we press on.
In this second model, we will have more variables: in fact, 3 independent variables.
X1 is as before, the transmission type, am.
X2 is the horsepower, hp.
X3 is the weight, wt.
Also, as in the previous model, beta_1 can be interpreted as the difference in the MPG for the manual transmission, now for cars of equal horsepower and weight.
Beta_2 and beta_3 tell you how much impact the horsepower and the weight of the car have on the mpg.
However, unlike before, the intercept beta_0 cannot be interpreted as the baseline mpg for the automatic transmission.
This is because beta_0 is now the predicted mpg of an automatic car whose horsepower and weight are both zero, which is not a realistic car.
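For reference, the second model can be written out as follows. The expectation notation and the error term epsilon are assumptions added for this sketch.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon,
  \qquad X_1 = \mathrm{am},\; X_2 = \mathrm{hp},\; X_3 = \mathrm{wt} \\
E[Y \mid X_1 = 0] &= \beta_0 + \beta_2 X_2 + \beta_3 X_3 && \text{(automatic)} \\
E[Y \mid X_1 = 1] &= (\beta_0 + \beta_1) + \beta_2 X_2 + \beta_3 X_3 && \text{(manual)}
\end{aligned}
```

The intercept beta_0 alone is only reached when X1, X2 and X3 are all zero.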
Let's go back to the Excel spreadsheet and fit the model coefficients.
Again, we have a simplified data table with only the columns that we are going to use.
We go to the PrimaXL tab, select linear and then fit.
As before, we select the cell ranges for X and Y.
Then, we specify the output location.
Again, we leave these boxes unchecked.
Now, we press the [Run] button to execute.
Great!
The p-values for beta_2 and beta_3 are smaller than the reference value of 0.05.
So the horsepower and the weight have a statistically significant impact on the mileage of the car.
However, beta_1, which is about 2, has a relatively large p-value: 0.1413.
Thus, we cannot say that the MPG difference is statistically significant.
According to this result, we cannot tell which type of transmission is better.
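The coefficients of this second model can be reproduced with a small Python sketch that solves the normal equations. The helper names `solve` and `ols` are made up for this example and say nothing about PrimaXL's internals. The sketch returns only the coefficients, because the Python standard library has no t-distribution for the p-values. The data are typed in from the public mtcars table.

```python
# Assumption: columns transcribed by hand from the public mtcars data set.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
am = [1] * 3 + [0] * 14 + [1] * 3 + [0] * 5 + [1] * 7
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264,
      175, 335, 109]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440,
      3.440, 4.070, 3.730, 3.780, 5.250, 5.424, 5.345, 2.200, 1.615, 1.835,
      2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513, 3.170, 2.770,
      3.570, 2.780]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(predictors, y):
    """Least-squares coefficients [beta_0, beta_1, ...] via (X'X) beta = X'y."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


beta = ols([am, hp, wt], mpg)  # order: intercept, am, hp, wt
print([round(b, 4) for b in beta])
```

The second entry of `beta` should come out close to the value of about 2 quoted above for beta_1.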
Again, we increase the model complexity.
We still have 3 independent variables: X1, X2 and X3.
However, there is one more term: X1 * X3.
This term is called the "interaction term".
X1, X2 and X3 are the same as in the previous model.
But this last term is am * wt.
As the am variable is 0 for the automatic transmission and 1 for the manual transmission,
beta_4 can be interpreted as the additional impact of the weight on the MPG of the manual type.
We can notice that beta_1 is still a difference in the miles per gallon between the transmission types, but now the full difference at a given weight is beta_1 + beta_4 * wt.
Also here, the intercept beta_0 cannot be interpreted as the baseline mpg for the automatic transmission.
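Spelled out, the third model and its two cases are as follows. The expectation notation and the error term epsilon are assumptions added for this sketch.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_3 + \varepsilon \\
E[Y \mid X_1 = 0] &= \beta_0 + \beta_2 X_2 + \beta_3 X_3 && \text{(automatic)} \\
E[Y \mid X_1 = 1] &= (\beta_0 + \beta_1) + \beta_2 X_2 + (\beta_3 + \beta_4) X_3 && \text{(manual)} \\
E[Y \mid X_1 = 1] - E[Y \mid X_1 = 0] &= \beta_1 + \beta_4 X_3
\end{aligned}
```

So the interaction term gives the manual type its own weight slope, beta_3 + beta_4, and makes the gap between the two types depend on the weight.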
OK, let's go back once more to the Excel spreadsheet and fit the model coefficients.
Again, we have a simplified data table with only the columns that we are going to use.
Please notice this column that corresponds to the interaction term.
We bring up the PrimaXL menu form and proceed just like before.
Great!
Here, all the coefficients, or betas, are statistically significant, as the p-values are quite small.
In particular, we can notice that beta_1 is about 11.55.
According to this model, the MPGs become distinguishable again!
And, before the weight-dependent interaction term is applied, the MPG for the manual transmission is higher by 11.55.
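Here is a Python sketch of the same fit with the interaction column am * wt added. The helper names `solve`, `ols` and `mpg_gap` are made up for this example, the fit is a plain normal-equations solve rather than PrimaXL's own routine, and p-values are again left out. The data are typed in from the public mtcars table.

```python
# Assumption: columns transcribed by hand from the public mtcars data set.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
am = [1] * 3 + [0] * 14 + [1] * 3 + [0] * 5 + [1] * 7
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264,
      175, 335, 109]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440,
      3.440, 4.070, 3.730, 3.780, 5.250, 5.424, 5.345, 2.200, 1.615, 1.835,
      2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513, 3.170, 2.770,
      3.570, 2.780]
# Interaction column: equals wt for manual cars and 0 for automatic cars.
am_wt = [a * w for a, w in zip(am, wt)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(predictors, y):
    """Least-squares coefficients [beta_0, beta_1, ...] via (X'X) beta = X'y."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


beta = ols([am, hp, wt, am_wt], mpg)  # order: intercept, am, hp, wt, am*wt


def mpg_gap(weight):
    """Fitted manual-minus-automatic MPG difference at a weight in 1,000 lbs."""
    return beta[1] + beta[4] * weight


print([round(b, 4) for b in beta])
print(round(mpg_gap(2.0), 2), round(mpg_gap(3.5), 2))
```

The second entry of `beta` should land near the 11.55 quoted above, and `mpg_gap` shows how the fitted difference changes with the weight of the car.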
Summary
Our findings today are:
By group average or by doing the regression analysis with a single dummy variable, we found that the MPGs are distinguishable.
The MPG for the manual transmission is higher by 7.245.
However, if we take into account the effect of the horsepower and the weight, the MPGs are not distinguishable anymore.
Finally, with the additional contribution of the weight just for the manual transmission, the MPGs become distinguishable again!
And, the MPG for the manual transmission is higher by 11.55.
We will wrap up with a question possibly paving the way for the next episode.
We may ask ourselves: why don't we just keep adding more terms?
Why not use more variables such as these?
What we really need to know is whether it makes sense to make the model more complicated.
Are there any criteria?
Well, if you are curious about the answer, please watch the coming episode.
Please visit our Facebook page for more information and videos.
Also, make sure to subscribe to our YouTube channel.
You can find the PrimaXL installer at our GitHub repository.
Thank you so much!
Until next time, bye!