Tune R tidymodels with the Python optuna package
By Niels van der Velden in reticulate R Python Machine Learning
February 16, 2022
Introduction
I really like tidymodels for building Machine Learning models in R because of its tidy way of implementing all the data processing steps. However, when it comes to hyperparameter tuning I found the options rather limited and not as easy to use as, for instance, the Python optuna package (see my previous post). Luckily, there is the reticulate package, which lets R and Python code call each other and so makes it possible to tune R models with any Python package.
In this post I will show how to use tidymodels to set up an xgboost model that predicts flower species. I will then tune the hyperparameters directly in R using a grid search, and in Python using optuna.
Build a tidymodels model
Required packages
#Load all libraries
library(reticulate)
library(tidyverse)
library(tidymodels)
library(xgboost)
library(ggplot2)
library(knitr)
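Before mixing in Python code, make sure reticulate can find a Python environment with optuna installed. A minimal setup sketch (the environment name "r-reticulate" is just an example, not something this post depends on):

#One-time setup: create a virtualenv and install optuna into it
#virtualenv_create("r-reticulate")
#py_install("optuna", envname = "r-reticulate")
use_virtualenv("r-reticulate", required = TRUE)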
To build a simple Machine Learning model I will use the Iris dataset. This is a built-in dataset in R that contains measurements of 4 different attributes (in centimeters) for 50 flowers from each of 3 different species. We can use these attributes to fit an xgboost model that predicts the flower species.
set.seed(123)
#Load data
df <- as_tibble(iris)
head(df)
## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
To train and test the model, we split the data into training and test sets and create 10-fold cross-validation resamples from the training set.
# Split data
split <- initial_split(df, strata = Species)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train)
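Because the split is stratified on Species, the training set should contain roughly equal numbers of each species. A quick sanity check (my addition, purely for illustration):

#Verify that the stratified split kept the classes balanced
train %>% count(Species)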
An xgboost model is fitted and the average accuracy is calculated across the 10 folds. Using the default parameters, an accuracy of 0.911 is achieved.
#Model
xg_model <-
  boost_tree(
    trees = 1000
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
#Recipe
xg_recipe <-
  recipe(Species ~ ., data = train)
#Workflow
xg_workflow <-
  workflow() %>%
  add_recipe(xg_recipe) %>%
  add_model(xg_model)
#Fit on the training set and evaluate on the test set
xg_fit <-
  xg_workflow %>%
  last_fit(split)

#Accuracy on the cross-validation folds
xg_fit_rs <-
  xg_workflow %>%
  fit_resamples(folds)
accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy multiclass 0.911    10 0.0255  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.960    10 0.0216  Preprocessor1_Model1
Next, we can tune the parameters with a grid search using a Latin hypercube of size 30. The grid size is stored in a variable so it can be reused later as the number of optuna trials.
set.seed(123)
grid_size <- 30

#Set up the tuning grid using a latin hypercube
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = grid_size
)
#Model
xg_model <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
#Workflow
xg_workflow <-
  workflow() %>%
  add_recipe(xg_recipe) %>%
  add_model(xg_model)
xgb_res <- tune_grid(
  xg_workflow,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
## # A tibble: 1 x 7
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .config
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>
## 1     2     6          3 0.00000689       2.14e-10       0.875 Preprocessor1_Mo~
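Besides the single best combination, it can be useful to inspect the top candidates. A quick sketch using tune's show_best() (not part of the original analysis):

#Show the five best parameter combinations ranked by ROC AUC
show_best(xgb_res, "roc_auc", n = 5)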
The workflow is finalized with the best-found parameters and the accuracy is again calculated on the 10 folds.
final_xgb <- finalize_workflow(
  xg_workflow,
  best_auc
)

set.seed(123)
xg_fit_rs <-
  final_xgb %>%
  fit_resamples(folds)

accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy multiclass 0.928    10 0.0181  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.987    10 0.00581 Preprocessor1_Model1
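Note that the held-out test set from the initial split has not been touched yet. For a final, unbiased performance estimate you could fit the finalized workflow with last_fit(), for example:

#Fit on the training set and evaluate once on the held-out test set
final_xgb %>%
  last_fit(split) %>%
  collect_metrics()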
After tuning, an accuracy of 0.928 was achieved, which is a slight improvement over the 0.911 obtained with the default parameters.
Tune R model using Python
The grid search only led to a slight improvement in accuracy. Can we achieve better results using the optuna package? Because optuna runs in Python, we need to wrap the tidymodels workflow that we created above into an R function. Notice that I used the rlang !! injection operator to pass in the parameter values. This is because NULL values otherwise result in an error, due to the delayed evaluation of the arguments by the boost_tree() function.
tune_r_model <- function(trees = 1000,
                         tree_depth = NULL,
                         min_n = NULL,
                         loss_reduction = NULL,
                         sample_size = NULL,
                         mtry = NULL,
                         learn_rate = NULL) {
  set.seed(123)

  #Model: inject the values with !! to avoid delayed-evaluation errors on NULL
  xg_model <-
    boost_tree(
      trees = !!trees,
      tree_depth = !!tree_depth,
      min_n = !!min_n,
      loss_reduction = !!loss_reduction,
      sample_size = !!sample_size,
      mtry = !!mtry,
      learn_rate = !!learn_rate
    ) %>%
    set_engine("xgboost") %>%
    set_mode("classification")

  #Recipe
  xg_recipe <-
    recipe(Species ~ ., data = train)

  #Workflow
  xg_workflow <-
    workflow() %>%
    add_recipe(xg_recipe) %>%
    add_model(xg_model)

  #Cross-validated accuracy
  xg_fit_rs <-
    xg_workflow %>%
    fit_resamples(folds)

  accuracy_xg <- xg_fit_rs %>% collect_metrics()

  #Return the mean accuracy across the folds
  return(accuracy_xg$mean[1])
}
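Before handing the function to optuna, it is worth checking that it runs on its own. A quick call from R, with parameter values chosen arbitrarily for illustration:

#Sanity check: should return the mean cross-validated accuracy
tune_r_model(trees = 1000, tree_depth = 6, learn_rate = 0.01)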
Now that we have wrapped the model in an R function, we can use optuna to suggest values for the different parameters. Using the reticulate package we can run Python code directly in RStudio and call the tune_r_model function defined above simply by prefixing its name with r.. To compare the results of optuna with the grid search performed in tidymodels, I give each suggested value a search space with upper and lower bounds equal to the min and max values used in the grid. I then run 30 optimization trials, equal to the size of the grid.
import optuna
from optuna.samplers import TPESampler

def objective(trial):
    # Search spaces match the min/max values used in the tidymodels grid
    tree_depth = trial.suggest_int("tree_depth", 1, 15)
    min_n = trial.suggest_int("min_n", 2, 40)
    loss_reduction = trial.suggest_float("loss_reduction", 1e-8, 1.0, log=True)
    sample_size = trial.suggest_float("sample_size", 0.2, 1.0)
    mtry = trial.suggest_int("mtry", 1, 5)
    learn_rate = trial.suggest_float("learn_rate", 1e-8, 1.0, log=True)
    # Call the R function defined above through reticulate's `r` object
    out = r.tune_r_model(
        trees = 1000,
        tree_depth = tree_depth,
        min_n = min_n,
        loss_reduction = loss_reduction,
        sample_size = sample_size,
        mtry = mtry,
        learn_rate = learn_rate
    )
    return out

study = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=123) # For reproducible results
)
# r.grid_size arrives from R as a float, so cast it to an int
study.optimize(objective, n_trials=int(r.grid_size))
print("The best found Accuracy = {}".format(round(study.best_value,3)))
## The best found Accuracy = 0.937
print("With Parameters: {}".format(study.best_params))
## With Parameters: {'tree_depth': 9, 'min_n': 5, 'loss_reduction': 1.7181734243476516e-07, 'sample_size': 0.7143436449949954, 'mtry': 2, 'learn_rate': 1.0998448330505116e-05}
The best accuracy is 0.937, which is quite a bit better than the 0.928 found using the grid search.
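To actually use these parameters, the study can be pulled back into R. A minimal sketch, assuming the Python chunk above ran in the same reticulate session so the study is reachable as py$study:

#Access the optuna study from R; best_params converts to a named list
best <- py$study$best_params

#Rebuild the model with the best found parameters and refit on the folds
xg_workflow %>%
  update_model(
    boost_tree(
      trees = 1000,
      tree_depth = best$tree_depth,
      min_n = best$min_n,
      loss_reduction = best$loss_reduction,
      sample_size = best$sample_size,
      mtry = best$mtry,
      learn_rate = best$learn_rate
    ) %>%
      set_engine("xgboost") %>%
      set_mode("classification")
  ) %>%
  fit_resamples(folds) %>%
  collect_metrics()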
Conclusion
The reticulate package makes it very easy to combine R and Python code. To demonstrate this, I built a Machine Learning model to predict Iris flower species in R using the tidymodels package. I then tuned the model using the Python optuna package. This resulted in a model with higher accuracy than the one obtained from a Latin hypercube grid search performed directly in R with tidymodels.