Tune R tidymodels with the Python optuna package
By Niels van der Velden in reticulate R Python Machine Learning
February 16, 2022
Introduction
I really like tidymodels for building Machine Learning models in R because of its tidy way of implementing all the data processing steps. However, when it comes to hyperparameter tuning I found the options rather limited and not as easy to use as, for instance, the Python optuna package (see my previous post). Luckily, there is the reticulate package, which lets R and Python code call each other and so makes it possible to tune R models with any Python package.
In this post I will show how to use tidymodels to set up an xgboost model that predicts flower species. I will then tune the hyperparameters directly in R using a grid search, and in Python using optuna.
Build a tidymodels model
Required packages
#Load all libraries
library(reticulate)
library(tidyverse)
library(tidymodels)
library(xgboost)
library(ggplot2)
library(knitr)
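Before mixing in Python code, make sure reticulate can find a Python environment with optuna installed. A minimal setup sketch (the environment name "r-reticulate" is just an example, not something this post depends on):

#One-time setup: create a virtualenv and install optuna into it
#virtualenv_create("r-reticulate")
#py_install("optuna", envname = "r-reticulate")
use_virtualenv("r-reticulate", required = TRUE)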
To build a simple Machine Learning model I will use the Iris dataset. This is a built-in dataset in R that contains measurements of 4 different attributes (in centimeters) for 50 flowers from each of 3 different species. We can use these attributes to fit an xgboost model that predicts the flower species.
set.seed(123)
#Load data
df <- as_tibble(iris)
head(df)
## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
To train and test the model, we split the data into training and test sets and create 10-fold cross-validation resamples from the training set.
# Split data
split <- initial_split(df, strata = Species)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train)
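Because the split is stratified on Species, the training set should contain roughly equal numbers of each species. A quick sanity check (my addition, purely for illustration):

#Verify that the stratified split kept the classes balanced
train %>% count(Species)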
An xgboost model is fitted and the average accuracy is calculated across the 10 folds. Using the default parameters, an accuracy of 0.911 is achieved.
#Model
xg_model <-
  boost_tree(
    trees = 1000
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
#Recipe
xg_recipe <-
  recipe(Species ~ ., data = train)
#Workflow
xg_workflow <-
  workflow() %>%
  add_recipe(xg_recipe) %>%
  add_model(xg_model)
#Fit on the training set and evaluate on the test set
xg_fit <-
  xg_workflow %>%
  last_fit(split)

#Accuracy on the cross-validation folds
xg_fit_rs <-
  xg_workflow %>%
  fit_resamples(folds)
accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy multiclass 0.911    10 0.0255  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.960    10 0.0216  Preprocessor1_Model1
Next, we can tune the parameters with a grid search using a Latin hypercube of size 30. The grid size is stored in a variable so it can be reused later as the number of optuna trials.
set.seed(123)
grid_size <- 30

#Set up the tuning grid using a latin hypercube
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train),
  learn_rate(),
  size = grid_size
)
#Model
xg_model <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
#Workflow
xg_workflow <-
  workflow() %>%
  add_recipe(xg_recipe) %>%
  add_model(xg_model)
xgb_res <- tune_grid(
  xg_workflow,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
## # A tibble: 1 x 7
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .config
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>
## 1     2     6          3 0.00000689       2.14e-10       0.875 Preprocessor1_Mo~
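Besides the single best combination, it can be useful to inspect the top candidates. A quick sketch using tune's show_best() (not part of the original analysis):

#Show the five best parameter combinations ranked by ROC AUC
show_best(xgb_res, "roc_auc", n = 5)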
The workflow is finalized with the best-found parameters and the accuracy is again calculated on the 10 folds.
final_xgb <- finalize_workflow(
  xg_workflow,
  best_auc
)

set.seed(123)
xg_fit_rs <-
  final_xgb %>%
  fit_resamples(folds)

accuracy_xg <- xg_fit_rs %>% collect_metrics()
accuracy_xg
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy multiclass 0.928    10 0.0181  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.987    10 0.00581 Preprocessor1_Model1
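Note that the held-out test set from the initial split has not been touched yet. For a final, unbiased performance estimate you could fit the finalized workflow with last_fit(), for example:

#Fit on the training set and evaluate once on the held-out test set
final_xgb %>%
  last_fit(split) %>%
  collect_metrics()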
After tuning, an accuracy of 0.928 was achieved, which is a slight improvement over the 0.911 obtained with the default parameters.
Tune R model using Python
The grid search only led to a slight improvement in accuracy. Can we achieve better results using the optuna package? Because optuna runs in Python, we need to wrap the tidymodels workflow that we created above into an R function. Notice that I used the rlang !! injection operator to pass in the parameter values. This is because NULL values otherwise result in an error, due to the delayed evaluation of the arguments by the boost_tree() function.
tune_r_model <- function(trees = 1000,
                         tree_depth = NULL,
                         min_n = NULL,
                         loss_reduction = NULL,
                         sample_size = NULL,
                         mtry = NULL,
                         learn_rate = NULL) {
  set.seed(123)

  #Model: inject the values with !! to avoid delayed-evaluation errors on NULL
  xg_model <-
    boost_tree(
      trees = !!trees,
      tree_depth = !!tree_depth,
      min_n = !!min_n,
      loss_reduction = !!loss_reduction,
      sample_size = !!sample_size,
      mtry = !!mtry,
      learn_rate = !!learn_rate
    ) %>%
    set_engine("xgboost") %>%
    set_mode("classification")

  #Recipe
  xg_recipe <-
    recipe(Species ~ ., data = train)

  #Workflow
  xg_workflow <-
    workflow() %>%
    add_recipe(xg_recipe) %>%
    add_model(xg_model)

  #Cross-validated accuracy
  xg_fit_rs <-
    xg_workflow %>%
    fit_resamples(folds)

  accuracy_xg <- xg_fit_rs %>% collect_metrics()

  #Return the mean accuracy across the folds
  return(accuracy_xg$mean[1])
}
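Before handing the function to optuna, it is worth checking that it runs on its own. A quick call from R, with parameter values chosen arbitrarily for illustration:

#Sanity check: should return the mean cross-validated accuracy
tune_r_model(trees = 1000, tree_depth = 6, learn_rate = 0.01)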
Now that we have wrapped the model in an R function, we can use optuna to suggest values for the different parameters. Using the reticulate package we can run Python code directly in RStudio and call the tune_r_model function defined above simply by prefixing its name with r.. To compare the results of optuna with the grid search performed in tidymodels, I give each suggested value a search space with upper and lower bounds equal to the min and max values used in the grid. I then run 30 optimization trials, equal to the size of the grid.
import optuna
from optuna.samplers import TPESampler

def objective(trial):
    # Search spaces match the min/max values used in the tidymodels grid
    tree_depth = trial.suggest_int("tree_depth", 1, 15)
    min_n = trial.suggest_int("min_n", 2, 40)
    loss_reduction = trial.suggest_float("loss_reduction", 1e-8, 1.0, log=True)
    sample_size = trial.suggest_float("sample_size", 0.2, 1.0)
    mtry = trial.suggest_int("mtry", 1, 5)
    learn_rate = trial.suggest_float("learn_rate", 1e-8, 1.0, log=True)
    # Call the R function defined above through reticulate's `r` object
    out = r.tune_r_model(
        trees = 1000,
        tree_depth = tree_depth,
        min_n = min_n,
        loss_reduction = loss_reduction,
        sample_size = sample_size,
        mtry = mtry,
        learn_rate = learn_rate
    )
    return out

study = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=123) # For reproducible results
)
# r.grid_size arrives from R as a float, so cast it to an int
study.optimize(objective, n_trials=int(r.grid_size))
print("The best found Accuracy = {}".format(round(study.best_value,3)))
## The best found Accuracy = 0.937
print("With Parameters: {}".format(study.best_params))
## With Parameters: {'tree_depth': 9, 'min_n': 5, 'loss_reduction': 1.7181734243476516e-07, 'sample_size': 0.7143436449949954, 'mtry': 2, 'learn_rate': 1.0998448330505116e-05}
The best accuracy is 0.937, which is quite a bit better than the 0.928 found using the grid search.
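To actually use these parameters, the study can be pulled back into R. A minimal sketch, assuming the Python chunk above ran in the same reticulate session so the study is reachable as py$study:

#Access the optuna study from R; best_params converts to a named list
best <- py$study$best_params

#Rebuild the model with the best found parameters and refit on the folds
xg_workflow %>%
  update_model(
    boost_tree(
      trees = 1000,
      tree_depth = best$tree_depth,
      min_n = best$min_n,
      loss_reduction = best$loss_reduction,
      sample_size = best$sample_size,
      mtry = best$mtry,
      learn_rate = best$learn_rate
    ) %>%
      set_engine("xgboost") %>%
      set_mode("classification")
  ) %>%
  fit_resamples(folds) %>%
  collect_metrics()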
Conclusion
The reticulate package makes it very easy to combine R and Python code. To demonstrate this, I built a Machine Learning model to predict Iris flower species in R using the tidymodels package. I then tuned the model using the Python optuna package. This resulted in a model with higher accuracy than the one obtained from a Latin hypercube grid search performed directly in R with tidymodels.