I used tidymodels to build some models predicting whether a coffee was washed or not washed (natural).
Categories: Visualizations, Analysis, Tidy Tuesday
Published: May 13, 2021
Predicting Process of Green Coffee Beans
With coffee being a hobby of mine, I was scrolling through past Tidy Tuesdays and found one on coffee ratings. Originally I thought about predicting total cup points, but I assumed that, with all the coffee tasting characteristics in the data, it wouldn’t really tell me anything. Instead, I decided to look into the processing method, as washed coffees have different taste characteristics from coffees processed in other ways.
Rows: 1339 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
After looking at the distribution of processing methods, I decided to make the processing method binary: washed versus not washed. This worked out better for the prediction models. I also looked at some descriptive statistics for each variable.
Code
coffee %>%
  ggplot(aes(processing_method)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()
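The recode to a binary outcome is described but not shown. A minimal sketch of how it might be done with dplyr follows. The toy data frame, the level names such as "Washed / Wet", and the choice to drop rows with a missing method are all assumptions made for illustration, not the post's actual code.

```r
library(dplyr)

# Toy stand-in for the coffee data; these level names are assumed
coffee_toy <- data.frame(
  processing_method = c(
    "Washed / Wet", "Natural / Dry", "Semi-washed / Semi-pulped",
    "Pulped natural / honey", "Other", NA
  ),
  stringsAsFactors = FALSE
)

coffee_toy <- coffee_toy %>%
  # Assumption: rows with no recorded method cannot be labelled, so drop them
  filter(!is.na(processing_method)) %>%
  mutate(
    process = if_else(processing_method == "Washed / Wet", "washed", "not_washed"),
    # "washed" goes first because yardstick treats the first level as the event by default
    process = factor(process, levels = c("washed", "not_washed"))
  )

count(coffee_toy, process)
```

Storing the outcome as a factor matters here because classification models in tidymodels expect a factor outcome.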
Code
coffee %>%
  ggplot(aes(process)) +
  geom_bar(color = "white", fill = "dodgerblue") +
  coord_flip()
I also set up 10-fold cross-validation on the training dataset, stratified by process, and chose the metrics I was most interested in: accuracy, log loss, and ROC AUC.
Code
set.seed(05132021)
coffee_fold <- vfold_cv(coffee_train, strata = "process", v = 10)
metric_measure <- metric_set(accuracy, mn_log_loss, roc_auc)
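The folds above are built from coffee_train, but the split that produced it is not shown. One way it could have been created is sketched below with rsample on simulated data. The 75/25 proportion, the stratification of the split, and the toy values are assumptions, not details taken from the post.

```r
library(rsample)

set.seed(05132021)

# Simulated stand-in for the coffee data (made-up values)
coffee_toy <- data.frame(
  process = factor(
    sample(c("washed", "not_washed"), 500, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("washed", "not_washed")
  ),
  flavor = rnorm(500, mean = 7.5, sd = 0.4)
)

# Assumed 75/25 split, stratified so both parts keep a similar class balance
coffee_split <- initial_split(coffee_toy, prop = 0.75, strata = process)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)

# 10 stratified folds built from the training portion only
coffee_fold <- vfold_cv(coffee_train, v = 10, strata = process)
coffee_fold
```

Building the folds from the training portion alone keeps the testing data untouched until the final evaluation.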
From the beginning I was interested in the tasting characteristics and how well they would predict whether the green coffee was washed or not washed. I also included the total cup points because I wanted to see how important that predictor was for the processing method. The only feature engineering I did was to remove any zero-variance predictors from the model.
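As an illustration of that setup, here is a minimal sketch of a recipe that only drops zero-variance predictors, paired with an unpenalized logistic regression in a workflow. The simulated data, the constant_col column, and the object names are hypothetical. The real recipe would list the tasting characteristics and total cup points from the coffee data.

```r
library(tidymodels)

set.seed(05132021)

# Simulated stand-in for the training data; values are made up
n_rows <- 300
coffee_train <- tibble(
  flavor = rnorm(n_rows, 7.5, 0.4),
  acidity = rnorm(n_rows, 7.5, 0.4),
  total_cup_points = rnorm(n_rows, 82, 3),
  constant_col = 1 # hypothetical zero-variance column, to show step_zv() at work
) %>%
  mutate(
    process = factor(
      if_else(flavor + rnorm(n_rows, 0, 0.4) > 7.5, "washed", "not_washed"),
      levels = c("washed", "not_washed")
    )
  )

# Recipe: the only preprocessing step removes zero-variance predictors
coffee_rec <- recipe(process ~ ., data = coffee_train) %>%
  step_zv(all_predictors())

# Plain (unpenalized) logistic regression
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_wf <- workflow() %>%
  add_recipe(coffee_rec) %>%
  add_model(log_spec)

log_fit <- fit(log_wf, data = coffee_train)

# Columns that survive the recipe
baked <- bake(prep(coffee_rec), new_data = NULL)
names(baked)
```

With the post's objects, the same workflow could be passed to fit_resamples() together with coffee_fold and metric_measure to get cross-validated metrics.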
The elastic net regression had slightly better accuracy than the non-penalized logistic regression, but the ROC AUC was exactly the same. Although the elastic net regression did not take long to compute because the dataset is small, with such a small gain this model would not normally be chosen over the simpler logistic regression.
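For reference, an elastic net specification in tidymodels might be sketched as follows. Leaving both penalty and mixture to tuning and using a 5 by 5 regular grid are assumptions here. The post does not say which values or grid were used.

```r
library(tidymodels)

# penalty is the amount of shrinkage; mixture runs from 0 (ridge) to 1 (lasso)
enet_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# A small regular grid; 5 levels per parameter is an arbitrary choice
enet_grid <- grid_regular(penalty(), mixture(), levels = 5)

nrow(enet_grid)
```

With the objects defined earlier in the post, this specification would go into a workflow and be tuned with tune_grid() using coffee_fold, enet_grid, and metric_measure, after which select_best() picks the winning combination. Fitting requires the glmnet package to be installed.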
Even though the elastic net regression was only slightly better, I decided to update the workflow using that model. This time I updated the recipe by including additional predictors, such as whether there were any defects in the green coffee beans, the species of the coffee (e.g., Robusta and Arabica), and the country of origin. I also added steps to the recipe to transform the defect-count predictors (category_one_defects and category_two_defects) and to handle the factor predictors, like species and country of origin. The additional steps and predictors produced a better-fitting elastic net model.
→ A | warning: Non-positive values in selected variable., No Box-Cox transformation could be estimated for: `category_two_defects`, `category_one_defects`
There were issues with some computations A: x10
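The warning above appears because the Box-Cox transformation is only defined for strictly positive values, and the defect counts include zeros, so those two columns are left untransformed. One possible alternative, not what the post used, is the Yeo-Johnson transformation, which also handles zero and negative values. A minimal sketch on simulated counts (made-up values, hypothetical object names):

```r
library(recipes)

set.seed(05132021)

# Simulated defect counts with plenty of zeros (made-up values)
toy <- data.frame(
  category_one_defects = rgeom(200, prob = 0.3),
  category_two_defects = rgeom(200, prob = 0.1)
)

yj_rec <- recipe(~ ., data = toy) %>%
  # Yeo-Johnson is defined for zero and negative values, unlike Box-Cox
  step_YeoJohnson(all_numeric_predictors())

yj_prepped <- prep(yj_rec)
transformed <- bake(yj_prepped, new_data = NULL)

# Estimated lambda for each transformed column
tidy(yj_prepped, number = 1)
```

Whether this would change the model's performance on the coffee data is not something the post tests, so treat it as an option to try rather than a fix.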
Now, using the testing dataset, we can see how well the final model fits data it was not trained on. While the model was not the best at predicting washed green coffee beans, this was a good exercise to show that penalized regressions are not always better-fitting than regular logistic regression. In the end, the recipe seemed to be the most important component in predicting washed green coffee beans.
→ A | warning: Non-positive values in selected variable., No Box-Cox transformation could be estimated for: `category_two_defects`, `category_one_defects`
There were issues with some computations A: x1
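The final evaluation step can be sketched with last_fit(), which fits a workflow on the training portion of a split and scores it once on the testing portion. In this sketch a plain logistic regression stands in for the tuned elastic net so that the example needs no extra packages, and the data and object names are simulated and hypothetical. With the post's objects, finalize_workflow() with the result of select_best() would come first.

```r
library(tidymodels)

set.seed(05132021)

# Simulated stand-in for the coffee data; values are made up
n_rows <- 400
coffee_toy <- tibble(
  flavor = rnorm(n_rows, 7.5, 0.4),
  acidity = rnorm(n_rows, 7.5, 0.4)
) %>%
  mutate(
    process = factor(
      if_else(flavor + rnorm(n_rows, 0, 0.4) > 7.5, "washed", "not_washed"),
      levels = c("washed", "not_washed")
    )
  )

coffee_split <- initial_split(coffee_toy, strata = process)

final_rec <- recipe(process ~ ., data = training(coffee_split)) %>%
  step_zv(all_predictors())

# Stand-in model: unpenalized logistic regression instead of the tuned elastic net
final_wf <- workflow() %>%
  add_recipe(final_rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Train on the training portion, evaluate once on the held-out portion
final_res <- last_fit(
  final_wf,
  split = coffee_split,
  metrics = metric_set(accuracy, mn_log_loss, roc_auc)
)

test_metrics <- collect_metrics(final_res)
test_metrics
```

Using the same three metrics as in cross-validation makes the test-set numbers directly comparable with the resampled estimates.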