In this post, I will talk a little bit about Correlation Analysis.

In general, correlation analysis shows how two continuous variables are related, how strong that relationship is, and what direction it takes. In other words, correlation helps you understand the dependency between variables and answer questions like:

  • Is there any statistically significant relationship between temperature and the ice cream sales?
  • Is there any relationship between total fat and protein in food?
  • Is there any relationship between educational level and salary?

How to interpret the correlation coefficient?

This relationship is measured by the correlation coefficient (r), which ranges from -1 to +1. The following situations can occur:

  • Positively correlated variables (r>0): when one variable increases, the other also tends to increase. r=1 indicates a perfect increasing linear relationship;
  • Negatively correlated variables (r<0): when one variable increases, the other tends to decrease. r=-1 indicates a perfect decreasing linear relationship;
  • Uncorrelated variables (r=0): there is no linear relationship between the variables.

correlation chart
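
To make the three situations above concrete, here is a minimal sketch that computes r directly from its definition (the covariance of the two variables divided by the product of their spreads). The function name `pearson_r` and the small data sets are made up for illustration; in practice a library routine such as `scipy.stats.pearsonr` would normally be used instead.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]          # deviations from the mean of x
    dy = [b - mean_y for b in y]          # deviations from the mean of y
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Illustrative (invented) data for the three situations.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfect increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfect decreasing line
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # no linear trend

print(round(r_up, 3), round(r_down, 3), round(r_none, 3))
```

The first two values sit at the ends of the range and the third is essentially zero, which matches the three bullet points.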

There are different methods for correlation analysis. The two best known are Pearson and Spearman, and the right choice depends on the types of variables being studied. The Pearson correlation, also known as the parametric correlation test, is used when the variables are continuous, the observations are independent, and the data are approximately normally distributed. The Spearman correlation, on the other hand, is the nonparametric version. It measures the strength and direction of the monotonic association between two ranked variables (the variables must be ordinal, interval or ratio).
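
The difference between the two methods is easier to see in code. The sketch below treats the Spearman coefficient as the Pearson coefficient computed on ranks, with tied values sharing the average of their ranks. The helper names (`pearson_r`, `average_ranks`, `spearman_rho`) and the data are assumptions made for this example; SciPy offers `scipy.stats.spearmanr` for real work.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1       # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Invented data: y always grows with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson :", round(pearson_r(x, y), 3))
print("Spearman:", round(spearman_rho(x, y), 3))
```

Because y always increases with x, the ranks line up perfectly and the Spearman coefficient reaches its maximum, while the Pearson coefficient stays below 1 because the relationship is not a straight line.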

Where is correlation used?

One of the main goals in research is to establish relationships between a set of observations or variables in order to reach conclusions that are close to reality. Finding such relationships is often an initial step toward identifying causal relationships.

Let's think about the following example:

Is there any relationship between the daily temperature and the overall sales of ice-cream, hot-chocolate and cookies?

The next tables show the correlation coefficient matrix and its significance levels, respectively.

correlation table

We can see that temperature is positively correlated with ice-cream sales (r=0.600). This means that when the daily temperature increases, ice-cream sales also tend to increase. The negative correlation between hot-chocolate sales and temperature (r=-0.75) indicates the opposite: higher temperatures are associated with lower hot-chocolate sales. From the table we can also see that there is no significant correlation (i.e. sig. >= 0.05) between temperature and cookie sales, or between the sales variables themselves.

When estimating the correlation coefficient between sales and temperature, the analyst must be confident that the outcome is not due to biased sampling or sampling error. In other words, the analyst needs to show that the coefficient is statistically significant and not just the result of random sampling error.

Let's check the hypotheses:
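
A correlation matrix like the one above is just the coefficient computed for every pair of variables. Here is a rough sketch of how it could be built. The daily figures are invented for this example and are not the data behind the tables in this post, so the coefficients will differ; `pearson_r` and `correlation_matrix` are hypothetical helper names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def correlation_matrix(columns):
    """Nested dict with the pairwise Pearson coefficient of every column pair."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

# Invented daily observations (NOT the data used for the tables in the post).
data = {
    "temperature":   [12, 15, 18, 21, 24, 27, 30, 33],
    "ice_cream":     [20, 26, 24, 35, 33, 45, 41, 52],
    "hot_chocolate": [48, 44, 45, 36, 33, 30, 24, 22],
    "cookies":       [30, 34, 29, 33, 31, 35, 28, 32],
}

matrix = correlation_matrix(data)
for row in matrix:
    cells = " ".join(f"{matrix[row][col]:+.2f}" for col in matrix)
    print(f"{row:>14} {cells}")
```

The matrix is symmetric and has 1 on the diagonal, because every variable is perfectly correlated with itself.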

  • H0: There is no linear relationship between the two variables.
  • H1: There is a linear relationship between the two variables.

For instance, with sig. = 0.00 < 0.05 between temperature and ice-cream sales, we reject the null hypothesis and conclude that there is a significant relationship between the two variables. For more help, check the details explained in “Quark Tips & Hits”.
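
One way to see what the significance value means is a permutation test. This is a different technique from the one that produced the sig. values in the table, which most likely come from the usual t-based test. If H0 is true, shuffling one variable should give coefficients as large as the observed one fairly often. The sketch below is only an illustration under that assumption, with invented data, hypothetical function names, and a fixed random seed so the result is repeatable.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided p-value for H0: no linear relationship between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)             # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Invented data, for illustration only.
temperature = [12, 15, 18, 21, 24, 27, 30, 33]
ice_cream = [20, 26, 24, 35, 33, 45, 41, 52]
cookies = [30, 34, 29, 33, 31, 35, 28, 32]

p_ice = permutation_p_value(temperature, ice_cream)
p_cookies = permutation_p_value(temperature, cookies)
print("temperature vs ice cream: p =", round(p_ice, 4))
print("temperature vs cookies:   p =", round(p_cookies, 4))
```

A p-value below 0.05 leads to rejecting H0, exactly as with the sig. column: here that happens for ice cream but not for cookies.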

Another chart that can help you explore those relationships is the scatterplot matrix, which is especially useful for displaying the relationships between multiple variables. In the example below, as noted before, we can see that temperature is positively correlated with ice-cream sales, negatively correlated with hot-chocolate sales, and not correlated with cookie sales.

scatterplot matrix

The correlation plot is another type of visual that can help us identify patterns. This visualization draws a correlation network from the pairwise correlations between the selected variables. It's a dynamic graph that shows the strongest correlations (bold links), the weakest (thin links) and their direction (positive in blue and negative in orange). You can increase or decrease the correlation threshold to check the relationships.

correlation graph
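
The idea behind the threshold can be sketched in a few lines: compute every pairwise coefficient and keep only the links whose absolute value reaches the threshold. This is not how the visual itself is implemented, just an illustration of the filtering step. The data and the names `pearson_r` and `correlation_edges` are invented for this example.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def correlation_edges(columns, threshold):
    """Links of the correlation network with |r| >= threshold, strongest first."""
    edges = []
    for a, b in combinations(columns, 2):     # every unordered pair of variables
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            direction = "positive" if r > 0 else "negative"
            edges.append((a, b, r, direction))
    return sorted(edges, key=lambda edge: abs(edge[2]), reverse=True)

# Invented data, for illustration only.
data = {
    "temperature":   [12, 15, 18, 21, 24, 27, 30, 33],
    "ice_cream":     [20, 26, 24, 35, 33, 45, 41, 52],
    "hot_chocolate": [48, 44, 45, 36, 33, 30, 24, 22],
    "cookies":       [30, 34, 29, 33, 31, 35, 28, 32],
}

edges = correlation_edges(data, threshold=0.5)
for a, b, r, direction in edges:
    print(f"{a} - {b}: r={r:+.2f} ({direction})")
```

Lowering the threshold adds weaker links to the network, while raising it leaves only the strongest ones.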

And this was a brief introduction to correlation analysis. I hope it helps to stress the importance of correlation analysis and how easy it is to get some quick insights.