Hello there,

Let's talk about linear regression today, starting with its simplest case: simple regression analysis.

The last blog post (How Correlation Analysis Works?) introduced the concept of a statistical relationship between variables. That relationship indicates the degree and direction of their association, but it does not answer the following question:

Is there an algebraic relationship between the variables? Can it be used to estimate the most likely value of one variable, given the value of the other?

Regression analysis is the statistical technique that expresses the relationship between two or more variables in the form of an equation. In the simplest case, involving just two measures (called simple regression), it can be used to explore and quantify the relationship between the two variables.

In addition, one can develop a prediction equation for producing estimates for new data values or exploring “what if” scenarios. If more than two variables are involved, we generalize to “multiple regression”, which I will cover in a future post.

Simple regression analysis is concerned with specifying the relationship between a single numeric dependent variable (the value to be predicted) and one numeric independent variable (the predictor).

And, as the name implies, the dependent variable depends upon the value of the independent variable. The simplest forms of regression assume that the relationship between the independent and dependent variables follows a straight line. You might recall from basic algebra that lines can be defined in a slope-intercept form similar to:

y = β₀ + β₁x + ε

- The letter **y** indicates the dependent variable and x indicates the independent variable.
- The slope β₁ specifies how much the line rises for each one-unit increase in x. Positive values define lines that slope upward, while negative values define lines that slope downward.
- The term β₀ is known as the intercept because it specifies the point where the line crosses, or intercepts, the vertical y axis. It indicates the value of y when x = 0.
- ε is called the random error: the part of y that the line does not account for.
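To make the slope and intercept more concrete, here is a minimal sketch of how they can be estimated from data with ordinary least squares. The post does not say which estimation method produced its results, so least squares is an assumption here, and both the function name `fit_simple_regression` and the tiny data set are made up for illustration.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x (so there is no random error).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_regression(x, y)
print(intercept, slope)
```

Because the made-up points sit exactly on a line, the function recovers an intercept of 1 and a slope of 2. With real data the points scatter around the line, and that scatter is the random error term.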

## Simple Regression Example:

Let’s say we want to know the impact of alcohol content on the number of calories in a beer.

A good starting point is to plot the data and look for any visible relationship. Let's check:

Looking at this scatterplot, there seems to be a positive relationship between the two variables. We could simply say that as alcohol content increases, so do calories. This covers the general pattern but is not very specific. So let's go into more detail:

The next table shows a very popular measure of fit: R-squared. It can be described as the fraction of the total variance in the dependent variable that is explained by the model:

- R² = 0: bad model. No evidence of a linear relationship.
- R² = 1: good model. The line perfectly fits the data.

In other words, the R-squared value indicates that almost 84% of the variation in calories from beer to beer is accounted for by alcohol content.
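For readers who like to see the arithmetic, here is a small sketch of the usual R-squared calculation: one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the numbers are hypothetical; they are not the beer data.

```python
def r_squared(observed, predicted):
    """Fraction of the variance in `observed` that `predicted` accounts for."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the observations around their mean.
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    # Variation left over after using the predictions.
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical values, invented for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]      # predictions equal the observations
mean_only = [2.5, 2.5, 2.5, 2.5]    # always predict the mean, ignoring x

print(r_squared(observed, perfect))
print(r_squared(observed, mean_only))
```

The first call gives 1 (a perfect fit) and the second gives 0 (no better than always guessing the mean), which are the two extremes of the scale.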

From R-squared we can derive another statistic (using the degrees of freedom) that follows a known distribution called the **F-distribution**. The p-value is, as usual, the probability of observing a result at least this extreme under the null hypothesis of no linear relationship. With a p-value reported as 0.00 < 0.05, we reject that null hypothesis and conclude that **there is a linear relationship**.
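As a rough sketch of how that statistic comes out of R-squared in the simple (one predictor) case: divide the explained share by its one degree of freedom, and the unexplained share by the n - 2 residual degrees of freedom. The function name `f_statistic` is made up, and since the post does not give the sample size, the n = 50 below is purely an assumption.

```python
def f_statistic(r2, n):
    """F statistic for a regression with one predictor and n observations."""
    df_model = 1          # one predictor
    df_residual = n - 2   # n observations minus two estimated coefficients
    return (r2 / df_model) / ((1 - r2) / df_residual)

# r2 = 0.84 echoes the "almost 84%" above; n = 50 is an invented sample size.
print(round(f_statistic(0.84, 50), 1))
```

Turning the statistic into a p-value means looking it up in the F-distribution with 1 and n - 2 degrees of freedom, which statistical software does for you.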

The coefficients provide the information for the best-fitting line. The regression coefficient of the predictor indicates the average change in the dependent measure corresponding to a one-unit change in the independent variable.

The prediction formula would be Calories = -30.449 + 36.678*Alcohol. The slope indicates that, on average, each one-percentage-point increase in alcohol content comes with an increase of about 37 calories. The intercept term suggests that a beer with zero alcohol content would contain -30.449 calories. Maybe we found a "miracle formula" :) for losing weight, or maybe we need to add more zero-alcohol beers to our sample in order to produce a more plausible intercept term.
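Here is the prediction formula written as a tiny function, in case you want to try a few "what if" values yourself. The two coefficients come from the equation above; the function name `predict_calories` and the alcohol levels are just examples I picked.

```python
INTERCEPT = -30.449   # from the fitted equation in the post
SLOPE = 36.678        # calories per percentage point of alcohol

def predict_calories(alcohol_pct):
    """Predicted calories for a beer with the given alcohol content (in %)."""
    return INTERCEPT + SLOPE * alcohol_pct

# Example alcohol levels chosen for illustration.
for alcohol in (0.0, 4.5, 5.0):
    print(alcohol, round(predict_calories(alcohol), 1))
```

A 5% beer comes out at roughly 153 calories, while the 0% case reproduces the implausible negative intercept. That is a reminder not to trust the line far outside the range of the data it was fitted on.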

Residual analysis is very important. Remember that one of the assumptions of the general linear model is that the residuals should be independent of the predicted values. Residuals provide a measure of how accurate the prediction equation is, and studying the characteristics of data points with large residuals may suggest ways of improving the equation. The output displays the minimum and maximum residuals and standardized residuals; a negative residual corresponds to an overprediction and a positive one to an underprediction.
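To illustrate the sign convention, here is a short sketch that computes residuals as observed minus predicted, using the equation from this post. The four beers are invented for illustration (they are not the real sample), and the names `beers`, `residuals` and `labels` are my own.

```python
def predict_calories(alcohol_pct):
    # Fitted equation from the post.
    return -30.449 + 36.678 * alcohol_pct

# Hypothetical (alcohol %, observed calories) pairs, invented for illustration.
beers = [(4.2, 110.0), (5.0, 153.0), (5.5, 180.0), (6.0, 185.0)]

# Residual = observed value minus predicted value.
residuals = [cal - predict_calories(alc) for alc, cal in beers]

# Negative residual: the line predicted too much. Positive: too little.
labels = ["overprediction" if r < 0 else "underprediction" for r in residuals]

for (alc, cal), r, label in zip(beers, residuals, labels):
    print(f"{alc}% / {cal} cal: residual {r:+.2f} ({label})")

# The beer with the largest absolute residual is the first one worth a closer look.
worst = max(zip(beers, residuals), key=lambda pair: abs(pair[1]))[0]
print("largest residual:", worst)
```

Statistical packages usually also report standardized residuals, which rescale these values so they can be compared against a common yardstick. That step is left out here to keep the sketch short.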

In this blog post I did not cover the assumptions of regression analysis, but they should not be taken lightly. Remember, a successful model is one that does not violate its assumptions.

And this was a short introduction to Simple Regression Analysis!

Regression analysis is a widely used statistical technique for exploring the relationships between continuous variables. Its applications are numerous and occur in almost every field, including economics, management, life sciences, and the social sciences. In fact, regression analysis may be one of the most widely used statistical techniques.

Hope it helps!

See you soon,