Overview

This is an R package for tuning hyperparameters of machine learning algorithms with Bayesian optimization based on Gaussian processes. Currently supported algorithms are: support vector machine, random forest, and XGboost.

This package has the following features:

  • It is very easy to write a Bayesian optimization function, and you can also customize your model easily.
  • The label column can be of any class (character, integer, factor).

Hyperparameter Tuning in Machine Learning

In many machine learning methods, it is very important to tune the hyperparameters. “Grid Search” has often been used to find appropriate hyperparameters, but it takes too much time to compute.

To solve this problem, Bayesian optimization is often used to tune hyperparameters quickly. It is a sequential design strategy for global optimization of black-box functions.

While grid search is simply an exhaustive search through a manually specified subset of the hyperparameter space, Bayesian optimization constructs a posterior distribution over functions (a Gaussian process) that best describes the function you want to optimize, and then searches for the points that are most likely to improve the score.
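As a toy illustration (not part of the package), the difference can be sketched with the rBayesianOptimization package introduced below: grid search evaluates every point of a fixed grid, while Bayesian optimization chooses each new point with an acquisition function.

library(rBayesianOptimization)

# A simple one-dimensional function to maximize (illustrative only)
f <- function(x) list(Score = -(x - 2)^2, Pred = 0)

# Grid search: evaluate every point of a manually specified grid
grid <- data.frame(x = seq(0, 5, by = 0.5))
grid$score <- sapply(grid$x, function(x) f(x)$Score)
grid[which.max(grid$score), ]

# Bayesian optimization: fit a Gaussian process to the evaluated points
# and choose the next point with an acquisition function (here UCB)
opt <- BayesianOptimization(f,
                            bounds = list(x = c(0, 5)),
                            init_points = 5, n_iter = 10,
                            acq = "ucb", verbose = FALSE)
opt$Best_Par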

Machine Learning and Bayesian Optimization in R

We could already run Bayesian optimization in R with the rBayesianOptimization package, which takes three steps:

  1. Make data
  2. Make the function to maximize
  3. Execute the Bayesian Optimization

For example, if you want to tune hyperparameters of XGboost with 5-fold cross validation, you have to write something like the following:

library(xgboost)
library(Matrix)
library(rBayesianOptimization)

## 1. Make data
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data,
                      label = agaricus.train$label)
cv_folds <- KFold(agaricus.train$label, nfolds = 5,
                  stratified = TRUE, seed = 0)
## 2. Make the function to maximize (returns the cross-validated AUC)
xgb_cv_bayes <- function(max_depth, min_child_weight, subsample) {
  cv <- xgb.cv(params = list(booster = "gbtree", eta = 0.01,
                             max_depth = max_depth,
                             min_child_weight = min_child_weight,
                             subsample = subsample, colsample_bytree = 0.3,
                             lambda = 1, alpha = 0,
                             objective = "binary:logistic",
                             eval_metric = "auc"),
               data = dtrain, nrounds = 100,
               folds = cv_folds, prediction = TRUE, showsd = TRUE,
               early_stopping_rounds = 5, maximize = TRUE, verbose = 0)
  list(Score = cv$evaluation_log$test_auc_mean[cv$best_iteration],
       Pred = cv$pred)
}
## 3. Execute the Bayesian Optimization
OPT_Res <- BayesianOptimization(xgb_cv_bayes,
                                bounds = list(max_depth = c(2L, 6L),
                                              min_child_weight = c(1L, 10L),
                                              subsample = c(0.5, 0.8)),
                                init_grid_dt = NULL, init_points = 10, n_iter = 20,
                                acq = "ucb", kappa = 2.576, eps = 0.0,
                                verbose = TRUE)

On the other hand, we can write this very easily with the MlBayesOpt package.

library(MlBayesOpt)

res0 <- xgb_cv_opt(data = agaricus.train$data,
                   label = agaricus.train$label,
                   objectfun = "binary:logistic",
                   evalmetric = "auc",
                   n_folds = 5,
                   acq = "ucb",
                   init_points = 10,
                   n_iter = 20)

When the data is a data.frame, you only have to write the column name to specify the label.

# This takes a lot of time
# fashion data is included in this package
res0 <- xgb_cv_opt(data = fashion,
                   label = y,
                   objectfun = "multi:softmax",
                   evalmetric = "merror",
                   n_folds = 15,
                   classes = 10)

Installation

You can install MlBayesOpt from CRAN:

install.packages("MlbayesOpt")

You can install MlBayesOpt (latest development version) from GitHub with:

# install.packages("githubinstall")
githubinstall::githubinstall("MlBayesOpt")

# install.packages("devtools")
devtools::install_github("ymattu/MlBayesOpt")

To use this package, load it with the library() function.

library(MlBayesOpt)

The source code for the MlBayesOpt package is available on GitHub at https://github.com/ymattu/MlBayesOpt.

Details

First of all, the *_opt() functions perform hold-out tuning, and the *_cv_opt() functions perform cross-validation tuning. In the hold-out functions, you need to specify at least the data and the name of the label column for both the training and the test data. In the cross-validation functions, you only have to specify the data, the column name of the label, and the number of folds in the n_folds option.

Second, you can set the Bayesian optimization options, such as init_points or n_iter. For details of these options, see the help of the rBayesianOptimization::BayesianOptimization() function.

Except for the xgb_cv_opt() function, the returned score is the test accuracy. In xgb_cv_opt(), the returned score is the metric you specified in the evalmetric option.
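For example, a result can be inspected as follows. This is a sketch assuming the functions return the list produced by rBayesianOptimization::BayesianOptimization(), so the elements Best_Par, Best_Value and History are available; the iris call is only illustrative.

# Sketch: inspecting a tuning result (assumes the return value is the
# list produced by rBayesianOptimization::BayesianOptimization())
res0 <- svm_cv_opt(data = iris,
                   label = Species,
                   n_folds = 3,
                   init_points = 4,
                   n_iter = 5)

res0$Best_Par    # hyperparameters with the best score
res0$Best_Value  # best score (test accuracy, or the evalmetric value for xgb_cv_opt())
res0$History     # all evaluated points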

Support Vector Machine

For SVM, this package supports hold-out tuning (svm_opt()) and cross-validation tuning (svm_cv_opt()).

In the SVM functions, you can choose the kernel (default is “radial”) from the following options.

  • linear: \(u'v\)
  • polynomial: \((\gamma u'v + \mathrm{coef0})^{\mathrm{degree}}\)
  • radial basis: \(\exp(-\gamma|u-v|^2)\)
  • sigmoid: \(\tanh(\gamma u'v + \mathrm{coef0})\)

You can also adjust the search range of each tuning parameter (for example, degree_range in the example below).

For details of these SVM options, see the help of the e1071::svm() function.

Example

res0 <- svm_cv_opt(data = fashion_train,
                   label = y,
                   svm_kernel = "polynomial",
                   degree_range = c(2L, 4L),
                   n_folds = 3,
                   kappa = 5,
                   init_points = 4,
                   n_iter = 5)
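
Hold-out tuning can be written in a similar way. The following is only a sketch: the train_data/train_label/test_data/test_label argument names are assumed to mirror rf_opt() in the next section.

# Sketch of hold-out tuning with svm_opt(); the train_*/test_* argument
# names are assumed to mirror rf_opt() in the next section
res1 <- svm_opt(train_data = fashion_train,
                train_label = y,
                test_data = fashion_test,
                test_label = y,
                svm_kernel = "radial",
                init_points = 4,
                n_iter = 5)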

Random Forest

For random forest, this package currently supports only hold-out tuning (rf_opt()).

In rf_opt(), you can specify the number of trees (num_tree), the range of mtry values (mtry_range), and the range of minimum node sizes to be tested (min_node_size_range).

For details of these random forest options, see the help of the ranger::ranger() function.

Example

res0 <- rf_opt(train_data = fashion_train,
               train_label = y,
               test_data = fashion_test,
               test_label = y,
               mtry_range = c(1L, ncol(fashion_train)-1),
               num_tree = 10L,
               init_points = 4,
               n_iter = 5)
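
The minimum node size can be searched in the same way via min_node_size_range; the range values in the following sketch are illustrative.

# Sketch: also searching over the minimum node size (illustrative range)
res1 <- rf_opt(train_data = fashion_train,
               train_label = y,
               test_data = fashion_test,
               test_label = y,
               mtry_range = c(1L, ncol(fashion_train) - 1),
               num_tree = 10L,
               min_node_size_range = c(1L, 10L),
               init_points = 4,
               n_iter = 5)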

XGboost

For XGboost, this package supports hold-out tuning (xgb_opt()) and cross-validation tuning (xgb_cv_opt()).

In the XGboost functions, you must specify the objective function (e.g. objectfun = "multi:softmax") and the evaluation metric (e.g. evalmetric = "merror"). You can also specify the search range of xgboost parameters such as eta (eta_range) and subsample (subsample_range).

Only in the xgb_cv_opt() function, the returned score is the value of the metric you specified in the evalmetric option, so when the evaluation metric is “merror” or “(m)logloss” (metrics to be minimized), the returned value is negative.

For details of these XGboost options, see the help of the xgboost::xgb.train() or xgboost::xgb.cv() functions.

Example

res0 <- xgb_cv_opt(data = fashion_train,
                   label = y,
                   objectfun = "multi:softmax",
                   evalmetric = "merror",
                   n_folds = 3,
                   classes = 10,
                   init_points = 4,
                   n_iter = 5)
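
Hold-out tuning with xgb_opt() can be written in the same style. The following is only a sketch: the train_*/test_* argument names are assumed to mirror rf_opt(), and the eta_range and subsample_range values are illustrative.

# Sketch of hold-out tuning with xgb_opt(); the train_*/test_* argument
# names are assumed to mirror rf_opt(), and the ranges are illustrative
res1 <- xgb_opt(train_data = fashion_train,
                train_label = y,
                test_data = fashion_test,
                test_label = y,
                objectfun = "multi:softmax",
                evalmetric = "merror",
                classes = 10,
                eta_range = c(0.1, 0.5),
                subsample_range = c(0.5, 1),
                init_points = 4,
                n_iter = 5)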