Tidy Tuesday Coffee Ratings

I decided to use tidymodels to build some models predicting washed vs. not-washed (natural) coffees.

Categories: Visualizations, Analysis, Tidy Tuesday

Published: May 13, 2021

Predicting Process of Green Coffee Beans

With coffee being a hobby of mine, I was scrolling through past Tidy Tuesdays and found one on coffee ratings. Originally I thought about predicting total cup points, but I assumed that with all the coffee tasting characteristics it wouldn’t really tell me anything. Instead, I decided to look into the processing method, as there are different taste characteristics between washed and other processing methods.

Code
coffee <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv') %>% 
  # Collapse processing method into a binary factor: washed vs. not_washed
  mutate(species = as.factor(species),
         process = recode(processing_method, "Washed / Wet" = "washed",
                          "Semi-washed / Semi-pulped" = "not_washed",
                          "Pulped natural / honey" = "not_washed",
                          "Other" = "not_washed",
                          "Natural / Dry" = "not_washed",
                          "NA" = NA_character_),
         process = as.factor(process),
         # Make "washed" the reference level
         process = relevel(process, ref = "washed"),
         country_of_origin = as.factor(country_of_origin)) %>% 
  drop_na(process) %>% 
  # "Cote d?Ivoire" is spelled with a literal "?" in the raw data
  filter(country_of_origin != "Cote d?Ivoire")
Rows: 1339 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

After looking at the distributions of processing methods, I decided to make the processing method binary: washed and not washed. This worked out better for the prediction models. There are also some descriptives of each variable below.

Code
coffee %>% 
  ggplot(aes(processing_method)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()

Code
coffee %>% 
  ggplot(aes(process)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()

Code
psych::describe(coffee, na.rm = TRUE)[c("n", "mean", "sd", "min", "max", "skew", "kurtosis")]
                          n    mean      sd   min       max  skew kurtosis
total_cup_points       1168   82.06    2.71 59.83     90.58 -1.95     9.26
species*               1168    1.01    0.09  1.00      2.00 10.65   111.61
owner*                 1161  134.40   77.04  1.00    287.00  0.11    -0.99
country_of_origin*     1168   14.79   10.08  1.00     36.00  0.31    -1.12
farm_name*              883  264.34  153.73  1.00    524.00 -0.01    -1.28
lot_number*             239   93.18   59.86  1.00    202.00  0.18    -1.27
mill*                   931  188.25  123.79  1.00    419.00  0.23    -1.29
ico_number*            1057  360.89  242.06  1.00    753.00  0.08    -1.27
company*               1077  142.68   76.32  1.00    266.00 -0.10    -1.24
altitude*              1014  154.49   98.46  1.00    351.00  0.32    -1.04
region*                1137  167.49   87.31  1.00    325.00 -0.08    -1.10
producer*               995  307.70  175.31  1.00    624.00 -0.02    -1.11
number_of_bags         1168  153.80  130.08  1.00   1062.00  0.37     0.50
bag_weight*            1168   23.50   16.67  1.00     45.00 -0.23    -1.71
in_country_partner*    1168    9.68    7.04  1.00     25.00  0.41    -1.31
harvest_year*          1161    5.82    2.95  1.00     14.00  0.76    -0.46
grading_date*          1168  242.39  144.52  1.00    495.00  0.11    -1.21
owner_1*               1161  136.29   77.93  1.00    290.00  0.10    -0.99
variety*               1089   12.61    9.77  1.00     29.00  0.63    -1.24
processing_method*     1168    3.98    1.67  1.00      5.00 -1.13    -0.62
aroma                  1168    7.56    0.31  5.08      8.75 -0.55     4.46
flavor                 1168    7.51    0.34  6.08      8.83 -0.34     1.73
aftertaste             1168    7.39    0.34  6.17      8.67 -0.45     1.36
acidity                1168    7.53    0.31  5.25      8.75 -0.30     3.31
body                   1168    7.52    0.28  6.33      8.50 -0.10     0.89
balance                1168    7.51    0.34  6.08      8.58 -0.10     1.17
uniformity             1168    9.84    0.50  6.00     10.00 -4.21    20.83
clean_cup              1168    9.84    0.75  0.00     10.00 -6.98    62.29
sweetness              1168    9.89    0.52  1.33     10.00 -7.53    80.78
cupper_points          1168    7.48    0.40  5.17      8.75 -0.64     2.79
moisture               1168    0.09    0.05  0.00      0.17 -1.41     0.35
category_one_defects   1168    0.51    2.70  0.00     63.00 14.43   279.42
quakers                1167    0.17    0.82  0.00     11.00  6.87    57.30
color*                 1070    2.80    0.64  1.00      4.00 -1.51     2.57
category_two_defects   1168    3.79    5.54  0.00     55.00  3.54    18.53
expiration*            1168  241.61  144.11  1.00    494.00  0.12    -1.21
certification_body*    1168    9.38    6.65  1.00     24.00  0.37    -1.31
certification_address* 1168   16.58    7.33  1.00     29.00 -0.19    -1.06
certification_contact* 1168    9.39    7.26  1.00     26.00  0.36    -0.96
unit_of_measurement*   1168    1.87    0.34  1.00      2.00 -2.21     2.87
altitude_low_meters    1012 1796.86 9073.21  1.00 190164.00 19.36   384.12
altitude_high_meters   1012 1834.27 9071.86  1.00 190164.00 19.36   384.03
altitude_mean_meters   1012 1815.56 9072.31  1.00 190164.00 19.36   384.12
process*               1168    1.30    0.46  1.00      2.00  0.86    -1.27

Now it’s time to split the data into training and testing sets. I used the strata argument to stratify the sampling on process.

Code
set.seed(05132021)

coffee_split <- initial_split(coffee, strata = "process")

coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)

I also set up 10-fold cross-validation on the training data and defined the set of metrics I was most interested in.

Code
set.seed(05132021)

coffee_fold <- vfold_cv(coffee_train, strata = "process", v = 10)

metric_measure <- metric_set(accuracy, mn_log_loss, roc_auc)

From the beginning I was interested in the tasting characteristics and how they would predict whether the green coffee was washed or not washed. I also included the total cup points because I wanted to see the importance of that predictor on the processing method. The only feature engineering I did was to remove any zero-variance predictors from the model.

Code
set.seed(05132021)

char_recipe <- recipe(process ~ aroma + flavor + aftertaste +
                        acidity + body + balance + uniformity + clean_cup +
                        sweetness + total_cup_points,
                      data = coffee_train) %>% 
  step_zv(all_predictors(), -all_outcomes())

char_recipe %>% 
  prep() %>% 
  bake(new_data = NULL) %>%
  head()
# A tibble: 6 × 11
  aroma flavor aftertaste acidity  body balance uniformity clean_cup sweetness
  <dbl>  <dbl>      <dbl>   <dbl> <dbl>   <dbl>      <dbl>     <dbl>     <dbl>
1  8.17   8.58       8.42    8.42  8.5     8.25         10        10        10
2  8.08   8.58       8.5     8.5   7.67    8.42         10        10        10
3  8.17   8.17       8       8.17  8.08    8.33         10        10        10
4  8.42   8.17       7.92    8.17  8.33    8            10        10        10
5  8.5    8.5        8       8     8       8            10        10        10
6  8      8          8       8.25  8       8.17         10        10        10
# ℹ 2 more variables: total_cup_points <dbl>, process <fct>

Logistic Regression

The first model I wanted to test with the current recipe was logistic regression. The accuracy and ROC AUC were alright for a starting model.
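
The model-fitting code is folded in the original post; a minimal sketch of how `lr_fit` could have been produced (the object names and the use of `fit_resamples()` here are my assumptions) looks like this:

# Assumed sketch: plain logistic regression with the "glm" engine,
# resampled over the 10 cross-validation folds
lr_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

lr_fit <- workflow() %>% 
  add_recipe(char_recipe) %>% 
  add_model(lr_spec) %>% 
  fit_resamples(resamples = coffee_fold, metrics = metric_measure)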

Code
collect_metrics(lr_fit)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.698    10 0.00342 Preprocessor1_Model1
2 mn_log_loss binary     0.589    10 0.00775 Preprocessor1_Model1
3 roc_auc     binary     0.648    10 0.0188  Preprocessor1_Model1

Lasso Regression

Now for the first penalized regression. The lasso regression did not meaningfully improve on any metric. Let’s try the next penalized regression.
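
A lasso in tidymodels is a glmnet spec with `mixture = 1`, so `lasso_fit` was presumably built along these lines (the fixed `penalty` value is a placeholder assumption, not the author’s setting):

# Assumed sketch: lasso = glmnet with mixture = 1; penalty is a placeholder
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% 
  set_engine("glmnet")

lasso_fit <- workflow() %>% 
  add_recipe(char_recipe) %>% 
  add_model(lasso_spec) %>% 
  fit_resamples(resamples = coffee_fold, metrics = metric_measure)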

Code
collect_metrics(lasso_fit)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.702    10 0.00397 Preprocessor1_Model1
2 mn_log_loss binary     0.592    10 0.00850 Preprocessor1_Model1
3 roc_auc     binary     0.645    10 0.0191  Preprocessor1_Model1

Ridge Regression

The ridge regression turned out not to fit well either. So I tested an additional penalized regression while tuning the hyperparameters.
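
For ridge, the only change from the lasso sketch is `mixture = 0` (again, the names and the penalty value are assumptions):

# Assumed sketch: ridge = glmnet with mixture = 0; penalty is a placeholder
ridge_spec <- logistic_reg(penalty = 0.01, mixture = 0) %>% 
  set_engine("glmnet")

ridge_fit <- workflow() %>% 
  add_recipe(char_recipe) %>% 
  add_model(ridge_spec) %>% 
  fit_resamples(resamples = coffee_fold, metrics = metric_measure)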

Code
collect_metrics(ridge_fit)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.697    10 0.00133 Preprocessor1_Model1
2 mn_log_loss binary     0.603    10 0.00212 Preprocessor1_Model1
3 roc_auc     binary     0.612    10 0.0186  Preprocessor1_Model1

Elastic Net Regression

The elastic net regression had slightly better accuracy than the non-penalized logistic regression, but the best ROC AUC was exactly the same. While the elastic net regression did not take long computationally, given the small amount of data, this model would not be chosen over the logistic regression.
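
The 300 metric rows below (100 penalty/mixture combinations × 3 metrics) are consistent with a regular 10 × 10 tuning grid, so `elastic_fit` was presumably tuned with something like this (the object names and exact grid are my assumptions):

# Assumed sketch: tune penalty and mixture over a regular 10 x 10 grid
elastic_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

elastic_grid <- grid_regular(penalty(), mixture(), levels = 10)

elastic_fit <- workflow() %>% 
  add_recipe(char_recipe) %>% 
  add_model(elastic_spec) %>% 
  tune_grid(resamples = coffee_fold,
            grid = elastic_grid,
            metrics = metric_measure)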

Code
collect_metrics(elastic_fit)
# A tibble: 300 × 8
         penalty mixture .metric     .estimator  mean     n std_err .config     
           <dbl>   <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>       
 1 0.0000000001        0 accuracy    binary     0.698    10 0.00342 Preprocesso…
 2 0.0000000001        0 mn_log_loss binary     0.589    10 0.00775 Preprocesso…
 3 0.0000000001        0 roc_auc     binary     0.648    10 0.0188  Preprocesso…
 4 0.00000000129       0 accuracy    binary     0.698    10 0.00342 Preprocesso…
 5 0.00000000129       0 mn_log_loss binary     0.589    10 0.00775 Preprocesso…
 6 0.00000000129       0 roc_auc     binary     0.648    10 0.0188  Preprocesso…
 7 0.0000000167        0 accuracy    binary     0.698    10 0.00342 Preprocesso…
 8 0.0000000167        0 mn_log_loss binary     0.589    10 0.00775 Preprocesso…
 9 0.0000000167        0 roc_auc     binary     0.648    10 0.0188  Preprocesso…
10 0.000000215         0 accuracy    binary     0.698    10 0.00342 Preprocesso…
# ℹ 290 more rows
Code
show_best(elastic_fit, metric = "accuracy", n = 5)
# A tibble: 5 × 8
        penalty mixture .metric  .estimator  mean     n std_err .config         
          <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>           
1 0.0774          0.111 accuracy binary     0.703    10 0.00434 Preprocessor1_M…
2 0.0000000001    0.333 accuracy binary     0.703    10 0.00422 Preprocessor1_M…
3 0.00000000129   0.333 accuracy binary     0.703    10 0.00422 Preprocessor1_M…
4 0.0000000167    0.333 accuracy binary     0.703    10 0.00422 Preprocessor1_M…
5 0.000000215     0.333 accuracy binary     0.703    10 0.00422 Preprocessor1_M…
Code
show_best(elastic_fit, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
        penalty mixture .metric .estimator  mean     n std_err .config          
          <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
1 0.0000000001        0 roc_auc binary     0.648    10  0.0188 Preprocessor1_Mo…
2 0.00000000129       0 roc_auc binary     0.648    10  0.0188 Preprocessor1_Mo…
3 0.0000000167        0 roc_auc binary     0.648    10  0.0188 Preprocessor1_Mo…
4 0.000000215         0 roc_auc binary     0.648    10  0.0188 Preprocessor1_Mo…
5 0.00000278          0 roc_auc binary     0.648    10  0.0188 Preprocessor1_Mo…
Code
select_best(elastic_fit, metric = "accuracy")
# A tibble: 1 × 3
  penalty mixture .config               
    <dbl>   <dbl> <chr>                 
1  0.0774   0.111 Preprocessor1_Model019
Code
select_best(elastic_fit, metric = "roc_auc")
# A tibble: 1 × 3
       penalty mixture .config               
         <dbl>   <dbl> <chr>                 
1 0.0000000001       0 Preprocessor1_Model001

New Recipe

Even though the elastic net regression was only slightly better, I decided to update the workflow using that model. This time I updated the recipe to include additional predictors: the counts of defects in the green coffee beans, the species of the coffee (Robusta or Arabica), and the country of origin. I also added recipe steps to transform the defect-count predictors and to handle the factor predictors (species and country of origin). As the warning below shows, the Box-Cox step could not be estimated because the defect counts include zeros, and Box-Cox requires strictly positive values. Even so, the additional steps and predictors produced a much better fitting elastic net model.

Code
set.seed(05132021)

bal_rec <- recipe(process ~ aroma + flavor + aftertaste +
                    acidity + body + balance + uniformity + clean_cup +
                    sweetness + total_cup_points + category_one_defects +
                    category_two_defects + species + country_of_origin,
                  data = coffee_train) %>% 
  step_BoxCox(category_two_defects, category_one_defects) %>%  # transform skewed defect counts
  step_novel(species, country_of_origin) %>%                   # handle factor levels unseen in training
  step_other(species, country_of_origin, threshold = .01) %>%  # pool rare levels into "other"
  step_unknown(species, country_of_origin) %>%                 # send missing values to an "unknown" level
  step_dummy(species, country_of_origin) %>%                   # dummy encode the factors
  step_zv(all_predictors(), -all_outcomes())                   # drop zero-variance predictors
→ A | warning: Non-positive values in selected variable., No Box-Cox transformation could be estimated for: `category_two_defects`, `category_one_defects`
There were issues with some computations   A: x10
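
The tuning code is folded again; presumably the same tunable spec and grid were rerun with the new recipe, roughly like this (assuming the `elastic_spec` and `elastic_grid` objects from the sketch above):

# Assumed sketch: same tuned glmnet spec and grid, swapping in the new recipe
elastic_bal_fit <- workflow() %>% 
  add_recipe(bal_rec) %>% 
  add_model(elastic_spec) %>% 
  tune_grid(resamples = coffee_fold,
            grid = elastic_grid,
            metrics = metric_measure)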
Code
collect_metrics(elastic_bal_fit) 
# A tibble: 300 × 8
         penalty mixture .metric     .estimator  mean     n std_err .config     
           <dbl>   <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>       
 1 0.0000000001        0 accuracy    binary     0.839    10 0.00923 Preprocesso…
 2 0.0000000001        0 mn_log_loss binary     0.431    10 0.0153  Preprocesso…
 3 0.0000000001        0 roc_auc     binary     0.840    10 0.0163  Preprocesso…
 4 0.00000000129       0 accuracy    binary     0.839    10 0.00923 Preprocesso…
 5 0.00000000129       0 mn_log_loss binary     0.431    10 0.0153  Preprocesso…
 6 0.00000000129       0 roc_auc     binary     0.840    10 0.0163  Preprocesso…
 7 0.0000000167        0 accuracy    binary     0.839    10 0.00923 Preprocesso…
 8 0.0000000167        0 mn_log_loss binary     0.431    10 0.0153  Preprocesso…
 9 0.0000000167        0 roc_auc     binary     0.840    10 0.0163  Preprocesso…
10 0.000000215         0 accuracy    binary     0.839    10 0.00923 Preprocesso…
# ℹ 290 more rows
Code
show_best(elastic_bal_fit, metric = "accuracy", n = 5)
# A tibble: 5 × 8
   penalty mixture .metric  .estimator  mean     n std_err .config              
     <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1 0.000464   0.111 accuracy binary     0.840    10 0.0101  Preprocessor1_Model0…
2 0.000464   0.222 accuracy binary     0.840    10 0.0101  Preprocessor1_Model0…
3 0.00599    0.111 accuracy binary     0.840    10 0.00933 Preprocessor1_Model0…
4 0.00599    0.222 accuracy binary     0.840    10 0.00933 Preprocessor1_Model0…
5 0.00599    0.333 accuracy binary     0.840    10 0.00933 Preprocessor1_Model0…
Code
show_best(elastic_bal_fit, metric = "mn_log_loss", n = 5)
# A tibble: 5 × 8
   penalty mixture .metric     .estimator  mean     n std_err .config           
     <dbl>   <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>             
1 0.000464   1     mn_log_loss binary     0.420    10  0.0179 Preprocessor1_Mod…
2 0.000464   0.889 mn_log_loss binary     0.420    10  0.0178 Preprocessor1_Mod…
3 0.000464   0.778 mn_log_loss binary     0.420    10  0.0178 Preprocessor1_Mod…
4 0.000464   0.667 mn_log_loss binary     0.420    10  0.0178 Preprocessor1_Mod…
5 0.000464   0.556 mn_log_loss binary     0.420    10  0.0177 Preprocessor1_Mod…
Code
show_best(elastic_bal_fit, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
   penalty mixture .metric .estimator  mean     n std_err .config               
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1 0.00599    0.778 roc_auc binary     0.843    10  0.0150 Preprocessor1_Model078
2 0.000464   0.667 roc_auc binary     0.843    10  0.0141 Preprocessor1_Model067
3 0.00599    0.889 roc_auc binary     0.843    10  0.0148 Preprocessor1_Model088
4 0.000464   0.444 roc_auc binary     0.843    10  0.0141 Preprocessor1_Model047
5 0.000464   0.556 roc_auc binary     0.843    10  0.0141 Preprocessor1_Model057
Code
select_best(elastic_bal_fit, metric = "accuracy")
# A tibble: 1 × 3
   penalty mixture .config               
     <dbl>   <dbl> <chr>                 
1 0.000464   0.111 Preprocessor1_Model017
Code
select_best(elastic_bal_fit, metric = "mn_log_loss")
# A tibble: 1 × 3
   penalty mixture .config               
     <dbl>   <dbl> <chr>                 
1 0.000464       1 Preprocessor1_Model097
Code
select_best(elastic_bal_fit, metric = "roc_auc")
# A tibble: 1 × 3
  penalty mixture .config               
    <dbl>   <dbl> <chr>                 
1 0.00599   0.778 Preprocessor1_Model078

Now, using the testing dataset, we can see how well the final model fit. While not the best at predicting washed green coffee beans, this was a good test showing that penalized regressions are not always better fitting models than regular logistic regression. In the end, the recipe seemed to be the most important component for predicting washed green coffee beans.
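
The final-fit code is folded; a typical tidymodels finish finalizes the workflow with the selected hyperparameters and calls `last_fit()` on the initial split, so `final_results` was presumably created along these lines (choosing the ROC AUC winner here is my assumption):

# Assumed sketch: finalize with the best hyperparameters, refit on the full
# training set, and evaluate once on the held-out test set
final_results <- workflow() %>% 
  add_recipe(bal_rec) %>% 
  add_model(elastic_spec) %>% 
  finalize_workflow(select_best(elastic_bal_fit, metric = "roc_auc")) %>% 
  last_fit(split = coffee_split)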

→ A | warning: Non-positive values in selected variable., No Box-Cox transformation could be estimated for: `category_two_defects`, `category_one_defects`
There were issues with some computations   A: x1
Code
final_results %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.823 Preprocessor1_Model1
2 roc_auc  binary         0.817 Preprocessor1_Model1