"A bottle of wine contains more philosophy than all the books in the world"

- Louis Pasteur.

Tasting and enjoying wine is one thing; understanding what makes the wine taste good is another.

One of the joys of analytics is that you can pick one of your favorite themes and explore the reasons why, uncovering factors that are sometimes not so evident.

That said, the goal of today's post is to talk about Wine & Analytics.

Let's do the following exercise, using a very exploratory approach:

  1. Evaluate the influence of the wines' physicochemical properties on their quality and type. Are there any differences between Red and White Wines?
  2. Does the perception of overall quality (measured by the number of points that Wine Enthusiast gives the wine) vary according to type?

To address the first point, we use a sample of 1599 red wines and 4898 white wines from the study by Cortez et al. For the second point, we scraped data from Wine Enthusiast Magazine (http://www.winemag.com/) and kept only the reviews of Portuguese wines. From a total of 4875 Portuguese wines reviewed by connoisseurs, we took a subset of 1134 Portuguese Douro wines for our analysis.

1. Evaluate the physicochemical properties on their quality and type. Are there any differences between Red and White Wines?

One of the goals of this post is to evaluate the influence of wine physicochemical properties on quality and type. For this we used a sample of red and white variants of the Portuguese “Vinho Verde” wine. This dataset has 1599 red wines and 4898 white wines.

analytics wine barchart1

The quality variable is a sensory property, graded between 0 and 10 by specialists. Let's check the distribution of the quality ratings:

anlytics wine quality1

For more details, check the source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Although this is a complex relationship to study, let's see whether we can draw some interesting conclusions from this dataset.

analytics wine whitered

Regarding the physicochemical properties of red wine, the ones with the highest correlation with quality are alcohol (r = 0.476) and volatile acidity (r = -0.391).
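For readers who want to see what sits behind an r value, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the handful of rows are hypothetical and only for illustration; they are not taken from the Vinho Verde dataset and will not reproduce the figures quoted in this post.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy rows (NOT from the real dataset)
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 12.8]
volatile_acidity = [0.70, 0.66, 0.58, 0.45, 0.40, 0.31]
quality = [5, 5, 6, 6, 7, 7]

r_alcohol = pearson_r(alcohol, quality)           # positive in this toy sample
r_acidity = pearson_r(volatile_acidity, quality)  # negative in this toy sample
print(round(r_alcohol, 3), round(r_acidity, 3))
```

A positive r means the two variables tend to rise together, a negative r means one tends to fall as the other rises, and values near 0 indicate little linear relationship.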


Red Wine Correlation Matrix

analytics wine correlation matrix red.jpg

Alcohol level seems to be the physicochemical characteristic most related to quality. Volatile acidity, measured in grams of acetic acid per litre (g/L), is responsible for an unpleasant vinegar taste when present at high levels. For red wine, quality tends to increase as volatile acidity decreases.


White Wine Correlation Matrix

analytics wine correlation matrix white

Regarding white wine, the highest correlation with quality is again alcohol (r = 0.436), followed by a moderate negative correlation with density (r = -0.307). Interestingly, volatile acidity does not seem to be strongly related to quality in white wine. As for density (g/cm^3), it depends on the percentage of alcohol and the amount of sugar: a wine that contains more sugar is denser. This is consistent with our sample, where perceived quality decreases as density increases.

analytics wine boxplot red

analytics wine boxplot white

We grouped the quality ratings into classes and looked at the trend for white and red wines. Ratings were classified as low (below 5), medium (5 and 6) and high (above 6). Using these classes, we can see a clear distinction in alcohol between high-quality wines and the remaining ones. The differences between groups were statistically significant (p-value < 0.05) for both white and red wines.
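The bucketing described above can be sketched in a few lines of Python. The helper names (`quality_class`, `mean_by_class`) and the toy values are assumptions for illustration only, and the sketch stops at group averages: it does not reproduce the significance test, whose exact method is not stated here.

```python
from statistics import mean


def quality_class(rating):
    """Map a 0-10 quality rating to the low / medium / high classes."""
    if rating < 5:
        return "low"      # below 5
    if rating <= 6:
        return "medium"   # 5 and 6
    return "high"         # above 6


def mean_by_class(ratings, values):
    """Average a numeric property (e.g. alcohol) within each quality class."""
    groups = {}
    for rating, value in zip(ratings, values):
        groups.setdefault(quality_class(rating), []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


# Hypothetical toy rows (NOT from the real dataset)
ratings = [3, 4, 5, 6, 6, 7, 8]
alcohol = [9.0, 10.0, 9.5, 10.5, 11.0, 12.0, 13.0]

alcohol_means = mean_by_class(ratings, alcohol)
print(alcohol_means)
```

Comparing the per-class averages is the numeric counterpart of reading the boxplots side by side.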


2. Does the perception of overall quality vary according to type?

As mentioned, for this second point we scraped data from Wine Enthusiast Magazine and selected a sample of n = 1134 reviews of Portuguese Douro wines.

analytics wine piechart2

This dataset is divided into 913 red wine reviews (~80%) and 221 white wine reviews (~20%).

analytics wine histogram

We can see that the points awarded range from a minimum of 81 to a maximum, perfect score of 100. The average is 89.14 points.

analytics wine boxplot2

We can see that there's greater variation among the red wines, but it is nevertheless in this color that we find the highest-scoring wines.


What is the Relationship Between Price and Points Given?

analytics wine relation1

This is a complex relationship that several studies have addressed (Ashenfelter (2008), Shewbridge (1998), Reuter (2000)); most of these studies are what economists call "hedonic pricing" analyses. That is, the "price of good or service depends both on internal and external factors".

According to Miu (2001), when looking for a good wine, the uninformed consumer, who has limited knowledge of wine types and quality, will often set a price floor on the amount that he is willing to pay. Our results show a positive relation between the points given and the wine price (r = 0.6).

From this chart we can identify the presence of some outliers or, we might say, wines that are overpriced. Let's filter out the wines with a price greater than 100 and see how it looks.
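As a rough sketch of this filtering step, the snippet below drops rows priced above 100 and compares the correlation before and after. The `(price, points)` pairs, `drop_expensive` and `pearson_r` are hypothetical names and values made up for illustration; they are not the scraped Wine Enthusiast data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_expensive(rows, max_price=100):
    """Keep only (price, points) rows whose price is at most max_price."""
    return [(price, points) for price, points in rows if price <= max_price]


# Hypothetical toy reviews as (price, points); the last row plays the outlier
reviews = [(12, 86), (18, 88), (25, 89), (40, 91), (65, 93), (90, 94), (450, 95)]
filtered = drop_expensive(reviews)

r_all = pearson_r([p for p, _ in reviews], [s for _, s in reviews])
r_filtered = pearson_r([p for p, _ in filtered], [s for _, s in filtered])
print(len(reviews), len(filtered), round(r_all, 2), round(r_filtered, 2))
```

A single very expensive bottle can stretch the price axis so much that the rest of the cloud is squeezed together, which is why it is worth looking at the relationship both with and without such points.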

analytics wine relation2

Splitting the analysis by type, we can see that the correlation coefficient is similar for both types of wine (r = 0.59 and r = 0.61).

analytics wine relation3

It becomes clearer that some wines seem to be over-priced and others under-priced given their quality level, especially among the white wines.

Of course, this is a first analysis with a simplistic approach to a complex problem, but we can see that there is some variation in price for a given quality score. This is a clue that more than price alone determines perceived quality.

Quality is not only influenced by price, although the two are related to some degree. For further analysis it would be interesting to add extra information such as regional effects, variety, alcohol level, etc.


Additional Exploration - Common Words

Taking advantage of having the reviews extracted, we decided to go a step further and check whether the reviews themselves differ by type, so let's do a quick exploration of the most common words for both white and red wines.
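One simple way to get at the most common words is a plain word count per wine type. The sketch below assumes a tiny, hand-picked stopword list and uses made-up review snippets; `top_words`, `STOPWORDS` and the sample texts are hypothetical and not drawn from the scraped reviews.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real analysis would use a fuller one)
STOPWORDS = {"a", "an", "and", "the", "with", "of", "is", "it", "this", "in", "to"}


def top_words(reviews, n=5):
    """Return the n most frequent non-stopword tokens across the given reviews."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical review snippets (NOT real Wine Enthusiast text)
red_reviews = [
    "Ripe black fruit with firm tannins",
    "Dense tannins and dark fruit",
]
white_reviews = [
    "Crisp citrus and fresh acidity",
    "Fresh and crisp with lemon notes",
]

print(top_words(red_reviews))
print(top_words(white_reviews))
```

The resulting counts are the kind of input a word cloud is typically drawn from, with more frequent words shown larger.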

analytics wine wordcloud

OK, it's evident that the two types also differ in this respect, which makes total sense from both the consumer and the expert perspective.  ;)

analytics wine boxplot quality

Just for the sake of understanding, we divided our ratings into high (above 95), medium (90 to 95) and the remainder as "low". We also found an interesting result regarding the number of words an expert spends on each wine and the points given.

Is this also telling us something? Are the points given related to the number of words the connoisseur puts into a review? Would the same conclusions hold in other wine samples? Hmm, more things to explore in future posts, maybe ;)  using our portal.

But one thing is sure: the analytics behind this theme is a world with lots of layers to explore.

Speaking of which... how are the brands being reviewed positioned in this analysis?  How can we use analytics to choose a bottle of wine?
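A minimal sketch of this comparison: bucket each review by points and average the word count per bucket. The helper names (`points_class`, `mean_words_by_class`) and the sample reviews are hypothetical, made up for illustration, and say nothing about the real results.

```python
from statistics import mean


def points_class(points):
    """Map Wine Enthusiast points to the high / medium / low classes."""
    if points > 95:
        return "high"     # above 95
    if points >= 90:
        return "medium"   # 90 to 95
    return "low"          # everything else


def mean_words_by_class(reviews):
    """Average number of whitespace-separated words per review, per class."""
    groups = {}
    for points, text in reviews:
        groups.setdefault(points_class(points), []).append(len(text.split()))
    return {label: mean(counts) for label, counts in groups.items()}


# Hypothetical (points, review text) pairs, NOT real reviews
sample = [
    (85, "Simple and fruity"),
    (88, "Light fresh easy drinking wine"),
    (92, "Structured wine with ripe fruit and firm tannins"),
    (97, "Powerful dense and complex wine with great aging potential ahead"),
]

print(mean_words_by_class(sample))
```

Splitting on whitespace is a crude word count, but it is enough to compare review lengths across the three classes.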

See you later, 

Best Regards,