Predicting the Processing Method of Green Coffee Beans
With coffee being a hobby of mine, I was scrolling through past Tidy Tuesdays and found one on coffee ratings. Originally I thought about predicting total cup points, but I assumed that with all the coffee tasting characteristics included it wouldn't really tell me anything new. Instead, I decided to look into the processing method, since there are different taste characteristics between washed and other processing methods.
Code
# packages used throughout: tidyverse for wrangling and plots, tidymodels for modeling
library(tidyverse)
library(tidymodels)

coffee <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv') |>
  mutate(species = as.factor(species),
         process = recode(processing_method,
                          "Washed / Wet" = "washed",
                          "Semi-washed / Semi-pulped" = "not_washed",
                          "Pulped natural / honey" = "not_washed",
                          "Other" = "not_washed",
                          "Natural / Dry" = "not_washed",
                          "NA" = NA_character_),
         process = as.factor(process),
         process = relevel(process, ref = "washed"),
         country_of_origin = as.factor(country_of_origin)) |>
  drop_na(process) |>
  filter(country_of_origin != "Cote d?Ivoire")
After looking at the distribution of processing methods, I decided to make the processing method binary: washed versus not washed. This worked out better for the prediction models. Below are also some descriptives of each variable.
Code
coffee |>
  ggplot(aes(processing_method)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()
Code
coffee |>
  ggplot(aes(process)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()
Code
psych::describe(coffee, na.rm = TRUE)[c("n", "mean", "sd", "min", "max", "skew", "kurtosis")]
n mean sd min max skew kurtosis
total_cup_points 1168 82.06 2.71 59.83 90.58 -1.95 9.26
species* 1168 1.01 0.09 1.00 2.00 10.65 111.61
owner* 1161 134.40 77.04 1.00 287.00 0.11 -0.99
country_of_origin* 1168 14.79 10.08 1.00 36.00 0.31 -1.12
farm_name* 883 264.34 153.73 1.00 524.00 -0.01 -1.28
lot_number* 239 93.18 59.86 1.00 202.00 0.19 -1.27
mill* 931 188.25 123.79 1.00 419.00 0.23 -1.29
ico_number* 1057 360.91 242.06 1.00 753.00 0.08 -1.27
company* 1077 142.68 76.32 1.00 266.00 -0.10 -1.24
altitude* 1014 154.46 98.45 1.00 351.00 0.32 -1.04
region* 1137 167.49 87.31 1.00 325.00 -0.08 -1.10
producer* 995 307.70 175.31 1.00 624.00 -0.02 -1.11
number_of_bags 1168 153.80 130.08 1.00 1062.00 0.37 0.50
bag_weight* 1168 23.50 16.67 1.00 45.00 -0.23 -1.71
in_country_partner* 1168 9.68 7.04 1.00 25.00 0.41 -1.31
harvest_year* 1161 5.82 2.95 1.00 14.00 0.76 -0.46
grading_date* 1168 242.39 144.52 1.00 495.00 0.11 -1.21
owner_1* 1161 136.29 77.93 1.00 290.00 0.10 -0.99
variety* 1089 12.61 9.77 1.00 29.00 0.63 -1.24
processing_method* 1168 3.98 1.67 1.00 5.00 -1.13 -0.62
aroma 1168 7.56 0.31 5.08 8.75 -0.55 4.46
flavor 1168 7.51 0.34 6.08 8.83 -0.34 1.73
aftertaste 1168 7.39 0.34 6.17 8.67 -0.45 1.36
acidity 1168 7.53 0.31 5.25 8.75 -0.30 3.31
body 1168 7.52 0.28 6.33 8.50 -0.10 0.89
balance 1168 7.51 0.34 6.08 8.58 -0.10 1.17
uniformity 1168 9.84 0.50 6.00 10.00 -4.21 20.83
clean_cup 1168 9.84 0.75 0.00 10.00 -6.98 62.29
sweetness 1168 9.89 0.52 1.33 10.00 -7.53 80.78
cupper_points 1168 7.48 0.40 5.17 8.75 -0.64 2.79
moisture 1168 0.09 0.05 0.00 0.17 -1.41 0.35
category_one_defects 1168 0.51 2.70 0.00 63.00 14.43 279.42
quakers 1167 0.17 0.82 0.00 11.00 6.87 57.30
color* 1070 2.80 0.64 1.00 4.00 -1.51 2.57
category_two_defects 1168 3.79 5.54 0.00 55.00 3.54 18.53
expiration* 1168 241.61 144.11 1.00 494.00 0.12 -1.21
certification_body* 1168 9.38 6.65 1.00 24.00 0.37 -1.31
certification_address* 1168 16.58 7.33 1.00 29.00 -0.19 -1.06
certification_contact* 1168 9.39 7.26 1.00 26.00 0.36 -0.96
unit_of_measurement* 1168 1.87 0.34 1.00 2.00 -2.21 2.87
altitude_low_meters 1012 1796.86 9073.21 1.00 190164.00 19.36 384.12
altitude_high_meters 1012 1834.27 9071.86 1.00 190164.00 19.36 384.03
altitude_mean_meters 1012 1815.56 9072.31 1.00 190164.00 19.36 384.12
process* 1168 1.30 0.46 1.00 2.00 0.86 -1.27
Now it's time to split the data into training and testing sets. I also used the strata argument to stratify the sampling on process.
Code
set.seed(05132021)
coffee_split <- initial_split(coffee, strata = "process")
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
I also set up 10-fold cross-validation on the training dataset and defined the set of metrics I was most interested in.
Code
set.seed(05132021)
coffee_fold <- vfold_cv(coffee_train, strata = "process", v = 10)
metric_measure <- metric_set(accuracy, mn_log_loss, roc_auc)
From the beginning I was interested in how the tasting characteristics would predict whether the green coffee was washed or not washed. I also included total cup points because I wanted to see how important that predictor would be for the processing method. The only feature engineering I did was to remove any zero-variance predictors from the model.
Code
set.seed(05132021)
char_recipe <- recipe(process ~ aroma + flavor + aftertaste +
                        acidity + body + balance + uniformity + clean_cup +
                        sweetness + total_cup_points,
                      data = coffee_train) |>
  step_zv(all_predictors(), -all_outcomes())

char_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  head()
# A tibble: 6 × 11
aroma flavor aftertaste acidity body balance uniformity clean_cup sweetness
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.17 8.58 8.42 8.42 8.5 8.25 10 10 10
2 8.08 8.58 8.5 8.5 7.67 8.42 10 10 10
3 8.17 8.17 8 8.17 8.08 8.33 10 10 10
4 8.42 8.17 7.92 8.17 8.33 8 10 10 10
5 8.5 8.5 8 8 8 8 10 10 10
6 8 8 8 8.25 8 8.17 10 10 10
# ℹ 2 more variables: total_cup_points <dbl>, process <fct>
Logistic Regression
The first model I wanted to test with the current recipe was logistic regression. The accuracy and ROC AUC were alright for a starting model.
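The chunk that fit this model is folded in the post, so here is a minimal sketch of how the fit could be set up with tidymodels, reusing the recipe, folds, and metric set from above. The object names log_spec, log_wf, and log_fit are my own placeholders, not the names used in the original chunk.
Code
# plain logistic regression (glm engine), no penalty
log_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

log_wf <- workflow() |>
  add_recipe(char_recipe) |>
  add_model(log_spec)

# resample over the 10 folds using the metric set defined earlier
log_fit <- fit_resamples(log_wf,
                         resamples = coffee_fold,
                         metrics = metric_measure)
The metrics table below would then come from a call like collect_metrics(log_fit).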
Code
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.698 10 0.00342 Preprocessor1_Model1
2 mn_log_loss binary 0.589 10 0.00775 Preprocessor1_Model1
3 roc_auc binary 0.648 10 0.0188 Preprocessor1_Model1
Lasso Regression
Now for the first penalized regression. The lasso regression did not meaningfully improve on the logistic regression, so let's try the next penalized regression.
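The lasso specification is also not shown. Below is a sketch, assuming a glmnet engine with mixture = 1 and a fixed penalty (the single .config in the output suggests nothing was tuned here; the 0.01 penalty is a placeholder, not the value actually used). The ridge fit in the next section would differ only in setting mixture = 0.
Code
# lasso: glmnet with a pure L1 penalty (mixture = 1)
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>  # penalty value is a placeholder
  set_engine("glmnet") |>
  set_mode("classification")

lasso_fit <- workflow() |>
  add_recipe(char_recipe) |>
  add_model(lasso_spec) |>
  fit_resamples(resamples = coffee_fold, metrics = metric_measure)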
Code
collect_metrics(lasso_fit)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.702 10 0.00397 Preprocessor1_Model1
2 mn_log_loss binary 0.592 10 0.00850 Preprocessor1_Model1
3 roc_auc binary 0.645 10 0.0191 Preprocessor1_Model1
Ridge Regression
The ridge regression turned out not to be a good-fitting model, so I next tested an additional penalized regression while tuning its hyperparameters.
Code
collect_metrics(ridge_fit)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.697 10 0.00133 Preprocessor1_Model1
2 mn_log_loss binary 0.603 10 0.00212 Preprocessor1_Model1
3 roc_auc binary 0.612 10 0.0186 Preprocessor1_Model1
Elastic Net Regression
The elastic net regression had slightly better accuracy than the non-penalized logistic regression, but the ROC AUC was exactly the same. While the elastic net did not take long to tune given the small amount of data, this model would not be chosen over the logistic regression.
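The tuning code is folded as well. The penalty and mixture values in the output below are consistent with a regular 10 x 10 grid over the default parameter ranges, so the setup might have looked like the sketch below (the grid choice and the object names elastic_spec and elastic_grid are assumptions):
Code
# elastic net: tune both the penalty and the L1/L2 mixture
elastic_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("classification")

# 10 levels of each parameter -> 100 candidate models
elastic_grid <- grid_regular(penalty(), mixture(), levels = 10)

elastic_fit <- workflow() |>
  add_recipe(char_recipe) |>
  add_model(elastic_spec) |>
  tune_grid(resamples = coffee_fold,
            grid = elastic_grid,
            metrics = metric_measure)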
Code
collect_metrics(elastic_fit)
# A tibble: 300 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0000000001 0 accuracy binary 0.698 10 0.00342 Preprocesso…
2 0.0000000001 0 mn_log_loss binary 0.589 10 0.00775 Preprocesso…
3 0.0000000001 0 roc_auc binary 0.648 10 0.0188 Preprocesso…
4 0.00000000129 0 accuracy binary 0.698 10 0.00342 Preprocesso…
5 0.00000000129 0 mn_log_loss binary 0.589 10 0.00775 Preprocesso…
6 0.00000000129 0 roc_auc binary 0.648 10 0.0188 Preprocesso…
7 0.0000000167 0 accuracy binary 0.698 10 0.00342 Preprocesso…
8 0.0000000167 0 mn_log_loss binary 0.589 10 0.00775 Preprocesso…
9 0.0000000167 0 roc_auc binary 0.648 10 0.0188 Preprocesso…
10 0.000000215 0 accuracy binary 0.698 10 0.00342 Preprocesso…
# ℹ 290 more rows
Code
show_best(elastic_fit, metric = "accuracy", n = 5)
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0774 0.111 accuracy binary 0.703 10 0.00434 Preprocessor1_M…
2 0.0000000001 0.333 accuracy binary 0.703 10 0.00422 Preprocessor1_M…
3 0.00000000129 0.333 accuracy binary 0.703 10 0.00422 Preprocessor1_M…
4 0.0000000167 0.333 accuracy binary 0.703 10 0.00422 Preprocessor1_M…
5 0.000000215 0.333 accuracy binary 0.703 10 0.00422 Preprocessor1_M…
Code
show_best(elastic_fit, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0000000001 0 roc_auc binary 0.648 10 0.0188 Preprocessor1_Mo…
2 0.00000000129 0 roc_auc binary 0.648 10 0.0188 Preprocessor1_Mo…
3 0.0000000167 0 roc_auc binary 0.648 10 0.0188 Preprocessor1_Mo…
4 0.000000215 0 roc_auc binary 0.648 10 0.0188 Preprocessor1_Mo…
5 0.00000278 0 roc_auc binary 0.648 10 0.0188 Preprocessor1_Mo…
Code
select_best(elastic_fit, metric = "accuracy")
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.0774 0.111 Preprocessor1_Model019
Code
select_best(elastic_fit, metric = "roc_auc")
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.0000000001 0 Preprocessor1_Model001
New Recipe
Even though the elastic net regression was only slightly better, I decided to keep that model and update the workflow. This time I updated the recipe to include additional predictors: the counts of defects in the green coffee beans, the species of the coffee (e.g., Robusta or Arabica), and the country of origin. I also added steps to the recipe to transform the defect-count predictors and to handle the factor predictors, species and country of origin. The additional steps and predictors produced a better-fitting elastic net model.
Code
set.seed(05132021)
bal_rec <- recipe(process ~ aroma + flavor + aftertaste +
                    acidity + body + balance + uniformity + clean_cup +
                    sweetness + total_cup_points + category_one_defects +
                    category_two_defects + species + country_of_origin,
                  data = coffee_train) |>
  step_BoxCox(category_two_defects, category_one_defects) |>
  step_novel(species, country_of_origin) |>
  step_other(species, country_of_origin, threshold = .01) |>
  step_unknown(species, country_of_origin) |>
  step_dummy(species, country_of_origin) |>
  step_zv(all_predictors(), -all_outcomes())
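The re-tuning step itself is not shown. Assuming the same elastic net specification and grid from my earlier sketch (elastic_spec and elastic_grid are those placeholder names), swapping in the new recipe could look like this:
Code
# re-tune the elastic net with the richer recipe
elastic_bal_fit <- workflow() |>
  add_recipe(bal_rec) |>
  add_model(elastic_spec) |>
  tune_grid(resamples = coffee_fold,
            grid = elastic_grid,
            metrics = metric_measure)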
Code
collect_metrics(elastic_bal_fit)
# A tibble: 300 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0000000001 0 accuracy binary 0.839 10 0.00923 Preprocesso…
2 0.0000000001 0 mn_log_loss binary 0.431 10 0.0153 Preprocesso…
3 0.0000000001 0 roc_auc binary 0.840 10 0.0163 Preprocesso…
4 0.00000000129 0 accuracy binary 0.839 10 0.00923 Preprocesso…
5 0.00000000129 0 mn_log_loss binary 0.431 10 0.0153 Preprocesso…
6 0.00000000129 0 roc_auc binary 0.840 10 0.0163 Preprocesso…
7 0.0000000167 0 accuracy binary 0.839 10 0.00923 Preprocesso…
8 0.0000000167 0 mn_log_loss binary 0.431 10 0.0153 Preprocesso…
9 0.0000000167 0 roc_auc binary 0.840 10 0.0163 Preprocesso…
10 0.000000215 0 accuracy binary 0.839 10 0.00923 Preprocesso…
# ℹ 290 more rows
Code
show_best(elastic_bal_fit, metric = "accuracy", n = 5)
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.000464 0.111 accuracy binary 0.840 10 0.0101 Preprocessor1_Model0…
2 0.000464 0.222 accuracy binary 0.840 10 0.0101 Preprocessor1_Model0…
3 0.00599 0.111 accuracy binary 0.840 10 0.00933 Preprocessor1_Model0…
4 0.00599 0.222 accuracy binary 0.840 10 0.00933 Preprocessor1_Model0…
5 0.00599 0.333 accuracy binary 0.840 10 0.00933 Preprocessor1_Model0…
Code
show_best(elastic_bal_fit, metric = "mn_log_loss", n = 5)
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.000464 1 mn_log_loss binary 0.420 10 0.0179 Preprocessor1_Mod…
2 0.000464 0.889 mn_log_loss binary 0.420 10 0.0178 Preprocessor1_Mod…
3 0.000464 0.778 mn_log_loss binary 0.420 10 0.0178 Preprocessor1_Mod…
4 0.000464 0.667 mn_log_loss binary 0.420 10 0.0178 Preprocessor1_Mod…
5 0.000464 0.556 mn_log_loss binary 0.420 10 0.0177 Preprocessor1_Mod…
Code
show_best(elastic_bal_fit, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00599 0.778 roc_auc binary 0.843 10 0.0150 Preprocessor1_Model078
2 0.000464 0.667 roc_auc binary 0.843 10 0.0141 Preprocessor1_Model067
3 0.00599 0.889 roc_auc binary 0.843 10 0.0148 Preprocessor1_Model088
4 0.000464 0.444 roc_auc binary 0.843 10 0.0141 Preprocessor1_Model047
5 0.000464 0.556 roc_auc binary 0.843 10 0.0141 Preprocessor1_Model057
Code
select_best(elastic_bal_fit, metric = "accuracy")
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.000464 0.111 Preprocessor1_Model017
Code
select_best(elastic_bal_fit, metric = "mn_log_loss")
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.000464 1 Preprocessor1_Model097
Code
select_best(elastic_bal_fit, metric = "roc_auc")
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.00599 0.778 Preprocessor1_Model078
Now, using the testing dataset, we can see how well the final model performs on data it has never seen. While not the best at predicting washed green coffee beans, this was a good exercise showing that penalized regressions do not always fit better than regular logistic regression. In the end, the recipe seemed to be the most important component for predicting whether green coffee beans were washed.
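The finalization step is folded too. A sketch of how final_results could be produced, assuming the best ROC AUC candidate is selected (the metric actually used for selection isn't shown) and the finalized workflow is fit on the training set and evaluated once on the test set with last_fit():
Code
# pick the winning penalty/mixture combination
best_params <- select_best(elastic_bal_fit, metric = "roc_auc")

# lock the tuned values into the workflow
final_wf <- workflow() |>
  add_recipe(bal_rec) |>
  add_model(elastic_spec) |>
  finalize_workflow(best_params)

# fit on the full training set, then evaluate on the held-out test set
final_results <- last_fit(final_wf, coffee_split)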
Code
final_results |>
  collect_metrics()
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.823 Preprocessor1_Model1
2 roc_auc binary 0.817 Preprocessor1_Model1
3 brier_class binary 0.142 Preprocessor1_Model1