Linear regression analysis is a widely used statistical technique for exploring the relationships between numerical variables. When more than two variables are involved (one outcome and two or more predictors), we call it "multiple regression".
In multiple regression analysis, the aims and methods are much the same as in simple regression.
The basic equation shifts from Y = a + b*X to Y = a + b1*X1 + b2*X2 + … + bp*Xp.
We still use a linear model that minimizes the sum of squared residuals, but now with several predictor variables instead of just one.
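To make the idea concrete, here is a minimal sketch of fitting that multi-predictor equation with ordinary least squares using numpy. The data is synthetic and the variable names are ours, not the portal's; the point is just that minimizing the sum of squared residuals recovers the intercept a and the slopes b1, b2:

```python
import numpy as np

# Synthetic data: two predictors and a known true model
# y = 5.0 + 2.0*x1 - 1.0*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add the intercept term a as a column of ones.
A = np.column_stack([np.ones(len(X)), X])

# Least squares minimizes the sum of squared residuals ||y - A @ coef||^2.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)  # close to the true values 5.0, 2.0, -1.0
```

The same machinery extends to any number of predictors: each extra variable simply adds a column to A and a coefficient to the solution.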
Some questions naturally arise: which combination of variables does the best job of prediction? Which variable is most important in explaining the target?
In addition, various diagnostic measures provide information about the stability and suitability of the analysis, and offer hints on how to improve it. Let's explore them in this post.
Multiple Linear Regression Example
Let's consider a very simple example where we want to understand how columns such as the fertility rate, gross domestic product, population density, percentage of people living in cities, and female literacy impact female life expectancy in the analyzed countries.
To perform a Multiple Linear Regression in our portal, we just need to go to the Analytical Arena and select the Linear Regression option.
Select the variables of interest and press Go!
Output Breakdown - Model Summary
The first table is the model summary.
All five variables covering economic and social aspects of the countries have been entered into the regression model at the same time. The model shows a strong linear fit, with an R-square of 0.83: together, the predictors account for about 83% of the variation in female life expectancy across the analyzed countries. The adjusted R-square is very close to the R-square.
We can also check the F-statistic (and its p-value), which tests the hypothesis that all coefficients are equal to 0; that is, it tests for a significant linear relationship between the response variable and the predictor variables taken together. It is a good overall indicator of whether there is a relationship between our predictors and the response.
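The portal computes these quantities for us, but they are easy to reproduce by hand. A sketch on synthetic data (variable names and coefficients are our own illustration): R-square compares residual to total variation, adjusted R-square penalizes for the number of predictors p, and the F-statistic compares explained to unexplained variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.8, -0.5, 0.0]) + rng.normal(size=n)

# Fit by least squares (intercept as a column of ones).
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

ss_res = np.sum(resid**2)                  # unexplained variation
ss_tot = np.sum((y - y.mean())**2)         # total variation
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F tests H0: b1 = b2 = ... = bp = 0 (all slopes zero at once).
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
f_pvalue = stats.f.sf(f_stat, p, n - p - 1)
print(r2, adj_r2, f_stat, f_pvalue)
```

A tiny F p-value, as here, says at least one predictor carries a real linear relationship with the response.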
Although the B coefficients are important for prediction and interpretation, we usually look first at the t test at the end of each line to determine which independent variables are significantly related to the outcome measure. Since five variables are in the equation, each test asks whether there is a linear relationship between that independent variable and the dependent measure after adjusting for the effects of the four other independent variables.
We can see that some variables are not significantly related to female life expectancy, such as population density (density) and gross domestic product (gdp_cap), both with p-value > 0.05.
Thus we could drop density and gross domestic product, since they are not significantly related to female life expectancy. Typically you would rerun the regression after removing the non-significant variables, but here we will proceed and interpret these results.
The Standardized Beta column contains the standardized partial regression coefficients: the partial regression coefficients you would obtain if all variables were standardized. Because they are directly comparable, they provide better insight into each predictor's importance in the model, and they are what we use to judge the relative importance of the predictors in a multiple regression analysis.
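A short sketch of why standardized betas matter, on made-up data: when predictors live on very different scales, the raw slopes are not comparable, but rescaling each slope by sd(X_j) / sd(y) puts them on a common footing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(scale=1.0, size=n)    # predictor on a small scale
x2 = rng.normal(scale=10.0, size=n)   # predictor on a much larger scale
y = 2.0 * x1 + 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([x1, x2])
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b = coef[1:]  # raw (unstandardized) slopes

# Standardized betas: rescale each slope by sd(X_j) / sd(y) so the
# coefficients are comparable regardless of the variables' units.
betas = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(b)      # raw slopes: roughly 2.0 and 0.3
print(betas)  # standardized: x2 now shows the larger relative effect
```

The raw slope of x1 is larger, yet x2 actually moves y more over its observed range; the standardized betas make that visible, which is exactly how we ranked female literacy, urban, and fertility above.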
We can also see that female literacy is one of the variables with the highest impact (beta = 0.371), followed by the percentage of people living in cities, urban (beta = 0.329). It is also interesting to notice that the fertility rate is negatively related to female life expectancy (beta = -0.269), meaning that as the fertility rate increases, female life expectancy decreases.
Output Breakdown - Collinearity Diagnostics
Perfect collinearity exists when one of the independent variables in a regression equation has a perfect linear relation to one or more of the other independent variables in the equation.
The problem with multicollinearity is easy to understand. When we interpret a variable's coefficient, we say it gives the effect of that variable while controlling for the other variables in the equation. But if two or more variables vary together, we can't hold one constant while varying the others.
The Collinearity Statistics table allows us to make a formal assessment of multicollinearity. The standard thresholds are Tolerance > 0.1 and VIF < 10 for all variables, which is the case here.
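These two measures come from a simple construction: regress each predictor on all the others, and let Tolerance = 1 - R² of that auxiliary regression, with VIF = 1 / Tolerance. A sketch on synthetic data (the helper name and data are ours) where one predictor is deliberately near-collinear:

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance = 1 - R^2 of predictor j regressed on the others; VIF = 1 / Tolerance."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)  # nearly collinear with x1

results = tolerance_and_vif(np.column_stack([x1, x2, x3]))
for tol, vif in results:
    print(f"tolerance={tol:.3f}  VIF={vif:.1f}")
```

The independent predictor x2 shows tolerance near 1 and VIF near 1, while x3 blows past the VIF < 10 threshold, which is exactly the pattern the portal's table is screening for.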
We also can see more technical measures of multicollinearity. The eigenvalues range from 4.210 to 0.008, with the last two being rather close to zero. This meets one standard for multicollinearity. On the other hand, none of the condition indices is above 25 or 30. As is typical of the multicollinearity measures, they will not always agree.
To use the Variance Proportions, concentrate on the dimensions (rows) with eigenvalues near 0. Looking at each dimension separately, you search for two or more variables with relatively high variance proportions. In dimension 5, for instance, the variable urban has a very high value (0.958).
Note that if multicollinearity is detected in your data set, you should attempt to adjust for it, since your regression coefficients become unstable, and those coefficients are typically what you interpret.
Output Breakdown - Residual Analysis
Residuals are important in regression first because they provide a measure of how accurate the prediction equation is. Second, studying characteristics of data points having large residuals may suggest ways of improving the equation.
A residual summary table appears in the residual analysis. It displays the minimum and maximum residuals and standardized residuals; a negative residual corresponds to an over-prediction and a positive one to an under-prediction.
We can also display the fit information, which is very important in regression analysis. In general, the data doesn't fall exactly on a line, so the regression equation includes an explicit error term. We have some charts to assist us in this analysis.
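A sketch of how such a summary can be produced, on synthetic data of our own (the standardization here simply divides by the residual standard error; the portal's exact formula may differ):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([1.5, -0.7]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef  # positive = under-prediction, negative = over-prediction

# Simple standardization: residual divided by the residual standard error.
p = X.shape[1]
rse = np.sqrt(np.sum(resid**2) / (n - p - 1))
std_resid = resid / rse

print(f"min residual:      {resid.min():.3f}  max residual:      {resid.max():.3f}")
print(f"min std. residual: {std_resid.min():.3f}  max std. residual: {std_resid.max():.3f}")
```

Standardized residuals well beyond about |3| are the large-residual points worth inspecting for ways to improve the equation.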
Looking at this histogram, we can see some signs of asymmetry and the presence of more extreme residuals than we would expect from a normal distribution.
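Those two visual impressions can be quantified. As a sketch (using simulated normal residuals as a stand-in for the real ones), skewness measures the asymmetry, and the share of standardized residuals beyond |2| measures the tail heaviness; a normal error distribution would give skewness near 0 and roughly 5% beyond |2|:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
resid = rng.normal(size=200)  # stand-in for the standardized residuals

skew = stats.skew(resid)                     # ~0 under normality
extreme_share = np.mean(np.abs(resid) > 2)   # ~5% under normality
print(f"skewness: {skew:.3f}  share beyond |2|: {extreme_share:.1%}")
```

Clearly larger values on either measure would back up what the histogram suggests by eye.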
And this was a brief introduction to Multiple Linear Regression Analysis. I hope it helps to stress the importance of the technique and how easy it is to get some quick insights inside the Quark Analytics Portal.