“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.” - John W. Tukey
John W. Tukey was a mathematician and the pioneer who coined the term Exploratory Data Analysis (EDA). EDA is a very important step in the analytics process because it helps us make sense of our data. Before performing a formal analysis, it is very valuable (probably essential) to explore the data set; no first model should be built without a proper EDA. It helps us, for instance, to better understand the patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Exploratory data analysis is a critical component of any analysis. It serves several purposes:
- Get an overall view of the data
- Focus on describing our sample – the actual data we observe – as opposed to making inference about some larger population or prediction about future data to be collected
- Identify unusual cases and extreme values
- Identify obvious errors
- As an end in itself
In short, EDA takes a mass of raw data and condenses it in some useful manner. :)
How can we perform exploratory data analysis?
To start, we look at the variables one by one. If we have domain knowledge about them, we check whether the data values conform to how they usually behave. If we do not know much about how the variables in the dataset behave, we need to take a naive approach to the data and examine it for striking values and patterns. With that said, let's continue...
Hit! Data Types - Why do we bother with a taxonomy of data types?
For the purposes of data analysis, the data type is very important because it helps determine the type of visual display or model to create.
Let's keep it simple: there are two basic types of structured data, numeric and categorical.
- Numeric data comes in two forms:
  - Continuous, where the data can take on any value in an interval and the ‘distance’ between adjacent points is the same throughout the scale.
  - Discrete, where the data can take on only integer values, such as counts.
- Categorical data comes in two forms:
  - Nominal, in which values are used simply to distinguish between different categories (for instance, marital status or state name: Arizona, California, etc.).
  - Ordinal, in which the values serve to place categories in some meaningful order (for example, customer satisfaction scales).
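To make the taxonomy concrete, here is a minimal Python sketch of which summaries make sense for each data type. All variable names and values are hypothetical and invented for illustration; they do not come from the dataset discussed in this post.

```python
from statistics import mean, median_low, mode

# All values below are hypothetical, invented only to illustrate the taxonomy.
call_minutes = [12.5, 3.2, 47.9, 8.0, 21.4]   # numeric, continuous
num_calls = [3, 1, 7, 2, 2]                   # numeric, discrete (counts)
marital_status = ["single", "married", "married", "divorced", "married"]  # nominal
satisfaction = ["low", "high", "medium", "high", "medium"]                # ordinal

# Numeric data: arithmetic summaries such as the mean are meaningful.
mean_minutes = mean(call_minutes)
mean_calls = mean(num_calls)

# Nominal data: categories have no order, so we can only count them
# or report the most frequent one (the mode).
most_common_status = mode(marital_status)

# Ordinal data: categories have an order, so a median category is meaningful.
# We map each label to its rank, take the median rank, and map it back.
SATISFACTION_ORDER = ["low", "medium", "high"]
ranks = [SATISFACTION_ORDER.index(s) for s in satisfaction]
median_satisfaction = SATISFACTION_ORDER[median_low(ranks)]

print("Mean call minutes:", mean_minutes)
print("Mean number of calls:", mean_calls)
print("Most common marital status:", most_common_status)
print("Median satisfaction:", median_satisfaction)
```

Averaging the marital status labels would be meaningless, which is exactly why the data type guides the choice of summary or chart.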
Let's check some examples:
Summarizing Categorical Variables
Imagine that you want to know: what percentage of customers use a credit card (CC) as their payment method?
Explore first displays information about missing data. Here all 1477 cases had valid values for the Payment Method (PAYMTD) variable.
A frequency analysis provides a summary that indicates the number and percentage of cases falling into each category of a variable. To represent this information graphically, we use bar or pie charts. For instance, in this particular sample we can see that about 58% of the cases use credit card (CC).
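A frequency analysis like the one described above can be sketched in a few lines of Python. The payment codes below are a small hypothetical sample (the codes "CH" and "CA" are invented placeholders), not the 1477 cases discussed in this post, and `frequency_table` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical payment-method codes, invented for illustration only.
# None stands for a missing value.
payments = ["CC", "CC", "CH", "CC", "CA", "CC", "CH", "CC", "CA", "CC"]

def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first.

    Missing values (None) are excluded, so percentages are based on
    valid cases only.
    """
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    total = len(valid)
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

n_missing = sum(v is None for v in payments)
table = frequency_table(payments)

print(f"Valid cases: {len(payments) - n_missing}, missing: {n_missing}")
for cat, n, pct in table:
    print(f"{cat:<4}{n:>4}{pct:>8.1f}%")
```

The rows of `table` are what a bar or pie chart would display, with one bar or slice per category.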
Summarizing Numerical Variables
Non-graphical exploratory data analysis is the first step when beginning to analyze your data as part of the general data analysis approach.
This preliminary data analysis step focuses on points that include:
- measures of central tendency, i.e. the mean, the median and the mode,
- measures of spread (variability), i.e. the variance and the standard deviation,
- the shape of the distribution.
Regarding the shape of the distribution, populations with the same mean and standard deviation can still have distributions with very different shapes.
For the variable Local Calls, the distribution has a strong positive skew. We can see this both from the skewness value (2.3) and from the shape of the histogram on the right side of the image.
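The measures of central tendency, spread and shape discussed above can be computed with Python's standard library. This is a minimal sketch on a hypothetical right-skewed sample, not the Local Calls data. The `skewness` helper is an assumption of this sketch: it uses the simple moment-based coefficient (third central moment divided by the cubed population standard deviation), while statistical packages often report an adjusted version, so values may differ slightly from tool to tool.

```python
from statistics import mean, median, mode, variance, stdev

# Hypothetical right-skewed sample, invented for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Central tendency
print("mean:    ", mean(sample))
print("median:  ", median(sample))
print("mode:    ", mode(sample))

# Spread (sample variance and sample standard deviation)
print("variance:", round(variance(sample), 2))
print("std dev: ", round(stdev(sample), 2))

# Shape: a positive value indicates a longer right tail
print("skewness:", round(skewness(sample), 2))
```

In this sample the mean is pulled above the median by the few large values, which is the typical signature of positive skew.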
Another type of visual that can be very useful is the boxplot, which is usually the first look at the distribution of a continuous variable.
A boxplot consists of a box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and of whether it is symmetric about the median or skewed to one side. The boxplot is also useful for identifying outliers: in this case there are some outliers in the sample, and before going further in the analysis we should consider how to deal with them.
Remember that exploratory data analysis is a necessary and informative step in any data analysis. The nice thing about this exploratory work is that you can explore fairly basic questions and hypotheses. By looking at the results, you can also identify useful modeling strategies for the "next step".
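The numbers behind a boxplot can be sketched as follows, again on a hypothetical sample. Two assumptions are worth flagging. First, the post does not say how outliers were flagged; this sketch uses the common convention of marking points more than 1.5 × IQR beyond the quartiles. Second, quartile definitions vary between tools; here we use the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical sample, invented for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]

# The three lines of the box: 25th, 50th (median) and 75th percentiles.
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sample if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers)
```

Here the median sits closer to Q1 than to Q3 and all flagged points lie above the upper fence, both of which point to a right-skewed distribution.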