This report explores a dataset containing quality and attributes for approximately 1600 red wines.
## [1] 1599 13
- So, our data consists of 1599 observations and 13 variables.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Univariate Plots Section
- First, as quality is our main concern, let’s look at its distribution.

- From the above plot we can see that the distribution of quality is roughly bell-shaped, with two dominant grades.
- Most red wines are graded 5 or 6; no wine scores below 3 or above 8.
- Now, let’s plot simple histograms of all the other variables to see their distributions.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
- Here the distribution of fixed acidity is roughly normal.
- Most fixed acidity values fall between 6 and 9.
- There are outliers as high as 15.9.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
- The distribution of volatile acidity is roughly normal, with two peaks.
- The mean and median values are close.
- Most values lie between 0.2 and 0.8, and some outliers are present.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
- Some red wines have zero citric acid, and there are still some outliers.
- The distribution here is positively skewed with two peaks.
- The mode is 0, which is consistent with the data description: "found in small quantities, citric acid can add freshness and flavor to wines."

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
- The distribution is right-skewed.
- Most values are below 2.6, with some outliers that would be considered **’Sweet’** in this case.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
- There are some outliers above 0.6, while most values are below 0.1.
- All red wines in this dataset contain some chlorides.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
- The distribution here is right-skewed, with the most frequent value (the mode) around 5.
- The vertical peaks indicate common concentrations used in different red wines.
- 75% of the data points have a free SO2 concentration below 21, and there are outliers as high as 72.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
- The distribution is again right-skewed.
- The outliers are very large (up to 289), while about 75% of the values are below 62.

- The density distribution is bell-shaped, with most values between 0.995 and 1.

- Most pH values are between 3 and 3.5, and the distribution is approximately normal.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
- The distribution of sulphates in our dataset is right-skewed.
- The median and mean values are very close.
- Most red wines in this dataset fall in the range 0.5 to 1.
- Again, some outliers are present.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
- The alcohol distribution is right-skewed, with very close values for the mean and median.
- The average alcohol concentration here is around 10%.
- There are some outlier values above 14 (max = 14.90), while the minimum alcohol concentration is 8.4%.
- The most frequent value is a little above 9.
## Bad Average Good
## 63 1319 217

- From the above we conclude that the majority of red wines are rated average.
Univariate Analysis
What is the structure of your dataset?
- There are 1599 wines in the dataset with 13 features. X is the row index.
- The variable of interest is quality, from which we generate rating.
- There are 11 chemical variables: (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol)
What is/are the main feature(s) of interest in your dataset?
- The main feature here is quality.
- I will dive into the relationships between quality and the other variables.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
- I think the following variables could contribute to wine quality:
- Alcohol
- Volatile acidity, as high values lead to an unpleasant, vinegar taste
- Citric acid, as it adds freshness and flavor to wine
- Residual sugar, as it affects the taste.
Did you create any new variables from existing variables in the dataset?
- Yes, I created a variable rating, which is an ordered factor with levels Bad, Average and Good.
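A variable like rating can be built with `cut()`. The following is only a minimal sketch: the cut points (3-4 = Bad, 5-6 = Average, 7-8 = Good) are inferred from the group counts reported in this report, and the short `quality` vector is made up for illustration.

```r
# Illustrative quality scores (made-up values, not the real dataset)
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed cut points: (2,4] -> Bad, (4,6] -> Average, (6,8] -> Good
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("Bad", "Average", "Good"),
              ordered_result = TRUE)

# Count of wines per rating level
table(rating)
```

Using an ordered factor keeps the levels in the Bad < Average < Good order in tables and plots, instead of alphabetical order.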
Bivariate Plots Section
- First, let’s plot the correlations between our features of interest.

- Here we can notice that both alcohol and volatile.acidity are moderately correlated with quality, which has almost no relationship with residual.sugar.
- There is a fairly strong negative correlation between citric.acid and volatile.acidity.
- Now let’s plot all the features, as we might find something new.

- From the above plot we can notice some strong relationships between features, such as pH and acidity, but my aim here is to focus on quality.
Correlation:
- alcohol (moderate positive correlation with quality)
- volatile acidity (weak negative correlation with quality)
- sulphates (moderate positive correlation with quality)
- So, we can take sulphates into consideration.
- I want to dive into scatter plots for quality and some other variables such as alcohol, sulphates, and volatile acidity.

## # A tibble: 3 x 3
## rating alcohol_median n
## <fct> <dbl> <int>
## 1 Bad 10 63
## 2 Average 10 1319
## 3 Good 11.6 217
- From the above plot we can notice that good quality red wines have the largest median alcohol compared to the other ratings.
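A per-rating summary of this kind (a median plus a group size) can be sketched in base R as below. The tibble output suggests the original used a dplyr pipeline; this sketch uses `tapply()` instead, and the `wines` data frame holds made-up values purely for illustration.

```r
# Made-up toy data standing in for the wine data frame
wines <- data.frame(
  rating  = factor(c("Bad", "Bad", "Average", "Average", "Average", "Good", "Good"),
                   levels = c("Bad", "Average", "Good"), ordered = TRUE),
  alcohol = c(9.8, 10.2, 9.5, 10.0, 10.9, 11.4, 11.8)
)

# Median alcohol per rating group
alcohol_median <- tapply(wines$alcohol, wines$rating, median)

# Number of wines per rating group
n <- table(wines$rating)

# Combine both into one summary table
data.frame(rating         = names(alcohol_median),
           alcohol_median = as.numeric(alcohol_median),
           n              = as.integer(n))
```

The same pattern applies to any other variable by swapping `alcohol` for another column.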

## # A tibble: 3 x 3
## rating sulphates_median n
## <fct> <dbl> <int>
## 1 Bad 0.56 63
## 2 Average 0.61 1319
## 3 Good 0.74 217
- Here too, good quality red wines have the largest median sulphates compared to the other ratings.

## # A tibble: 3 x 3
## rating volatile_acidity_median n
## <fct> <dbl> <int>
## 1 Bad 0.68 63
## 2 Average 0.54 1319
## 3 Good 0.37 217
- Here too, as expected, good quality red wines have the lowest median volatile.acidity compared to the other ratings.
Now let’s explore the other variables.

## # A tibble: 3 x 3
## rating fixed_acidity_median n
## <fct> <dbl> <int>
## 1 Bad 7.5 63
## 2 Average 7.8 1319
## 3 Good 8.7 217
- From the above, it seems that higher quality wines have the largest median fixed acidity, while the medians for the average and bad groups are close, with a wide range in the average group. This trend was not obvious.

- Another noticeable trend: although the correlation between citric acid and quality was weak, the above plot shows that higher quality red wines have the largest median citric acid, while the lowest quality wines have the lowest.

- Median residual sugar is almost the same across ratings, with outliers present. Although I assumed residual sugar would have an important effect on quality, the above plot makes me withdraw that assumption.

- The median of chlorides is almost the same for all ratings, except that the highest quality group has the smallest range. Outliers are also present here.
## # A tibble: 3 x 3
## rating chlorides_median n
## <fct> <dbl> <int>
## 1 Bad 0.08 63
## 2 Average 0.08 1319
## 3 Good 0.073 217
- Here higher quality wines have the lowest median chlorides, by a minor difference.


- From the above two plots for free SO2 and total SO2, it seems that the largest median is in the average rating group.
- Another observation: there was a weak negative correlation between density and quality, but the above plot can mislead us into seeing a strong trend in which higher quality red wines have the lowest median density and the lowest quality wines the highest. That is because the y-axis does not start from 0; in fact, all red wines have similar densities in the range 0.995-1, which is also close to that of water.

- From the above we can notice a trend between pH and quality: the highest quality red wines have the lowest median pH.
Now let’s figure out the relationships between different variables.

- It is worth examining how these variables correlate with each other:
- Positively: pH with volatile.acidity; fixed acidity with both density and citric.acid.
- Negatively: pH with both citric.acid and fixed.acidity; citric.acid with volatile acidity.
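Pairwise relationships like these can be read from a correlation matrix. The sketch below uses base R's `cor()` (Pearson by default) on a handful of made-up rows, so the numbers it prints are illustrative only and are not the correlations of the real dataset.

```r
# Made-up values for illustration only (not the real measurements)
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2, 7.9, 6.7, 10.5, 8.9),
  volatile.acidity = c(0.70, 0.88, 0.28, 0.60, 0.58, 0.31, 0.40),
  citric.acid      = c(0.00, 0.00, 0.56, 0.06, 0.08, 0.50, 0.30),
  pH               = c(3.51, 3.20, 3.16, 3.30, 3.40, 3.10, 3.25)
)

# Pearson correlation between every pair of columns, rounded for reading
corr <- round(cor(wines), 2)
corr

# Pull out a single pair, e.g. pH against fixed acidity
corr["pH", "fixed.acidity"]
```

Indexing the matrix by column name makes it easy to check one specific pair without scanning the whole plot.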
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
- Quality correlates most with alcohol, volatile acidity and sulphates.
- Red wines with high alcohol, sulphates, fixed acidity and citric acid tend to have higher quality.
- Red wines with low volatile acidity, density and pH tend to have higher quality.
- Average quality red wines have the highest free SO2 and total SO2. This seems unusual, since I would expect bad quality wines to have higher median free SO2 and total SO2 than the other groups, which might affect their taste in the test.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
- Positive correlations: alcohol with quality; pH with volatile.acidity; fixed acidity with both density and citric.acid.
- Negative correlations: pH with both citric.acid and fixed.acidity; citric.acid with volatile acidity.
What was the strongest relationship you found?
- Strongest positive correlation: alcohol with quality.
- Strongest negative correlation: pH with fixed.acidity.
Multivariate Plots Section
From the previous bivariate analysis we conclude that alcohol has the strongest correlation with quality.

## # A tibble: 6 x 5
## rating quality mean_alc median_alc n
## <fct> <ord> <dbl> <dbl> <int>
## 1 Bad 3 9.96 9.93 10
## 2 Bad 4 10.3 10 53
## 3 Average 5 9.90 9.7 681
## 4 Average 6 10.6 10.5 638
## 5 Good 7 11.5 11.5 199
## 6 Good 8 12.1 12.2 18
The median alcohol concentration generally increases across quality levels, apart from a dip at quality 5.
Now, let’s dive deeper into alcohol together with other variables, starting with fixed acidity.

- Higher quality red wines tend to have higher fixed acidity.

- On the other hand, they tend to have lower volatile acidity, so let’s see how they behave in terms of the ratio of fixed to volatile acidity.

- Let’s focus more on the difference between the bad and good ratings.

- This is interesting to me: the higher quality red wines tend to have a fixed acidity to volatile acidity ratio above 20, while the majority of average and bad red wines, at different alcohol concentrations, tend to have a fixed:volatile ratio below that line.
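The fixed:volatile ratio and the share of wines above the 20 line can be computed per rating group as sketched below. The column name `acidity.ratio` is a hypothetical name chosen for this example, and the rows are made-up values, so the resulting shares only illustrate the calculation.

```r
# Made-up toy rows (not the real measurements)
wines <- data.frame(
  rating           = factor(c("Bad", "Average", "Average", "Good", "Good"),
                            levels = c("Bad", "Average", "Good"), ordered = TRUE),
  fixed.acidity    = c(7.4, 7.8, 7.9, 11.2, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.28, 0.31)
)

# Ratio of fixed to volatile acidity (hypothetical column name)
wines$acidity.ratio <- wines$fixed.acidity / wines$volatile.acidity

# Share of wines in each rating group whose ratio exceeds 20
share_above_20 <- tapply(wines$acidity.ratio > 20, wines$rating, mean)
share_above_20
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean` is used as the summary function here.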

- Although high quality red wines tend to have higher sulphate concentrations, the effect of sulphates on quality is not quite clear here, especially given the higher variability in the average group.

- Although citric acid is responsible for freshness at low concentrations, in the above plot I noticed that most higher quality red wines have a higher concentration of citric acid, while the majority of bad quality wines are at 0.

- Although the correlation between quality and pH was -0.1, here high quality wines tend to have lower pH, especially below 3.2. This sounds logical, as higher quality wines tend to have more citric acid and fixed acidity, which leads to lower pH values.

- An interesting point here: although higher quality red wines tend to have less volatile acidity, their overall combined acidity is higher, which is reflected in lower pH values in comparison to bad quality ones.

- In an attempt to calculate the ratio of free SO2 to bound SO2, it seems that all quality levels follow the same trend, with the majority having a ratio less than 1, except for the variance in the average group.
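One way to derive such a ratio is sketched below, assuming bound SO2 is taken as total SO2 minus free SO2 (an assumption of this example, since the report does not state the formula). The ten rows are the first values shown in the structure output at the top of the report; `bound.sulfur.dioxide` and `so2.ratio` are hypothetical column names.

```r
# First ten rows as printed in the str() output of this report
wines <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Assumption: bound SO2 = total SO2 - free SO2
wines$bound.sulfur.dioxide <- wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

# Ratio of free to bound SO2 (hypothetical column name)
wines$so2.ratio <- wines$free.sulfur.dioxide / wines$bound.sulfur.dioxide

# Share of these rows with a ratio below 1
share_below_1 <- mean(wines$so2.ratio < 1)
share_below_1
```

A wine whose total SO2 equals its free SO2 would give a bound value of 0 and an infinite ratio, so such rows would need separate handling on the full dataset.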
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
High quality red wines have the largest median and mean alcohol concentration.
When comparing against alcohol, I found that high quality wines tend to have higher fixed acidity and citric acid, but lower volatile acidity and pH.
Neither sulphates nor the ratio of free SO2 to bound SO2 showed a clear effect.
Were there any interesting or surprising interactions between features?
By measuring the ratio of fixed acidity to volatile acidity, I noticed that the higher quality red wines tend to have a ratio above 20.
Despite the positive correlation between quality and sulphates, I conclude that sulphates behave differently from alcohol across all rating levels.
Although the correlation between quality and pH was very weak and negative, high quality wines tend to have lower pH, especially below 3.2.
Citric acid is higher in high quality red wines than in bad quality ones.
Final Plots and Summary
Plot One

Description One
The histogram shows clearly that most red wines have a quality of 5 or 6, and so they have the average rating.
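The share of wines at grades 5 and 6 can be checked numerically. The sketch below rebuilds the quality vector from the per-grade counts in the alcohol summary table of the Multivariate Plots Section (10, 53, 681, 638, 199 and 18 wines for grades 3 to 8) rather than reading the original file.

```r
# Rebuild the quality scores from the per-grade counts reported in this report
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Frequency of each grade, i.e. the heights of the histogram bars
table(quality)

# Proportion of wines graded 5 or 6
share_5_or_6 <- mean(quality %in% c(5, 6))
share_5_or_6
```

The rebuilt vector has 1599 entries, matching the number of observations in the dataset.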
Plot Two

Description Two
The boxplots here show that good quality red wines have the largest median alcohol compared to the other ratings.
Plot Three

Description Three
The higher quality red wines tend to have a fixed acidity to volatile acidity ratio above 20, while the majority of average and bad red wines, at different alcohol concentrations, tend to have a fixed:volatile ratio below that line.
Reflection
The red wines data set contains information on 1,599 red wines across 11 chemical variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of red wines across many variables.
There was a clear trend between the alcohol concentration or volatile acidity of a red wine and its quality. I was surprised that citric acid did not have a strong positive correlation with quality. I struggled to understand the variance in quality as the levels of free SO2 and total SO2 increase, but this became clearer when I realized that most of the data had an average rating, with quality ranks 5 and 6.
Some limitations of this data include:
- The source of the data, given that the red wines date to 2009.
- The imbalanced nature of the data: most of the wines belong to the average group.
- The data relate to only one brand; quality could differ for other brands.
- I have no prior experience with wine, so I suspect domain experience would be useful for this topic; this could also suggest using experienced wine drinkers for the tasting test.
- According to my Google search, people generally have individual preferences for wines, and each occasion has its own atmosphere and kind of wine: with a meal or without, for a party or as a gift.
- Some other variables could be added, such as tannin content.
Resources
FOR FULL CODE on Github here
A work by Mohamed Hindam
hindamosh@gmail.com