Bayesian Optimization

Imam Firdaus · 6 minute read

Bayesian optimization is an optimization algorithm that uses Bayes' Theorem to guide the sampling process toward a function's minimum or maximum. It is best suited to optimization over continuous domains of fewer than 20 dimensions and tolerates stochastic noise in the evaluations. One way to perform Bayesian optimization in Python is with scikit-optimize, which can optimize functions of a single parameter or of multiple parameters.

Function Optimization

Function optimization, or global function optimization, is a branch of mathematics that attempts to find the global minima or maxima of a function, or of a set of functions, on a given set. Some common terms in this discussion are:

  • Samples. Individual examples or points drawn from the function's domain
  • Search space. The extent of the domain from which samples are drawn
  • Objective function. The function being optimized, which evaluates the value of a sample
  • Cost. The value of the objective function for a given sample

Function optimization can take on several roles, such as:

  • Algorithm Training. Optimization of model parameters
  • Algorithm Tuning. Optimization of model hyperparameters
  • Predictive Modeling. Optimization of the data, data preparation, and algorithm selection

The problem we are interested in solving is formulated as

$$ x^* = \arg\min_{x} f(x) $$

There are several methods for optimization; the one discussed in this article is Bayesian optimization.
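As a point of reference, the formulation above can be attacked by brute force when evaluations are cheap: sample the search space densely and keep the sample with the lowest cost. The sketch below is purely illustrative (the example objective is the same noisy-free function used later in this article); Bayesian optimization exists precisely because this approach breaks down when each evaluation is expensive.

import numpy as np

def objective(x):
    # Illustrative, cheap-to-evaluate objective; any f(x) could stand in here.
    return np.sin(5 * x) * (1 - np.tanh(x ** 2))

# Search space: a dense grid over the one-dimensional domain [-2, 2].
search_space = np.linspace(-2.0, 2.0, 10001)

# Evaluate the cost of every sample and keep the argmin.
costs = objective(search_space)
x_star = search_space[np.argmin(costs)]

print(f"x* = {x_star:.4f}, f(x*) = {costs.min():.4f}")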

Bayesian Optimization

Bayesian optimization is an optimization algorithm that uses Bayes' Theorem to guide the sampling process toward the function's minima or maxima. Bayesian optimization is usually applied to functions that have properties such as:

  • The objective function is a black box for which no closed form is known (nor is its gradient)
  • The objective function is expensive to evaluate
  • Evaluations of the objective function may be noisy

Bayesian optimization is best suited for optimization over continuous domains of fewer than 20 dimensions, and it tolerates stochastic noise in the evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian technique. The method then uses an acquisition function defined from the surrogate to decide which point to sample next.

Bayes Theorem

Recall that Bayes' Theorem is stated as

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

The conditional probability we are interested in calculating is generally referred to as the posterior probability. The reverse conditional probability is called the likelihood, and the individual probability is referred to as the prior.
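As a quick worked example of the formula above (the numbers are made up purely for illustration), suppose the prior is P(A) = 0.3, the likelihood is P(B|A) = 0.8, and the evidence is P(B) = 0.5:

# Worked Bayes' Theorem example with illustrative (made-up) numbers.
p_a = 0.3           # prior P(A)
p_b_given_a = 0.8   # likelihood P(B|A)
p_b = 0.5           # evidence P(B)

p_a_given_b = p_b_given_a * p_a / p_b   # posterior P(A|B)
print(p_a_given_b)  # 0.48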

We can take specific samples ($x_1$, $x_2$, …, $x_n$) and evaluate each with the objective function $f(x_i)$, producing a cost for sample $x_i$. The samples and their costs are collected in a set $D$ and used to define the prior. Thus,

$$ P(f|D) \propto P(D|f) \, P(f) $$

The posterior represents the knowledge we have about the objective function. It is an approximation that can be used to estimate the cost of the different candidate samples that we may wish to evaluate.

Surrogate Function

The surrogate function is a Bayesian approximation of the objective function that can be sampled efficiently. It gives us an estimate of the objective function, which can be used to direct future sampling. One of the more common choices of surrogate is a Gaussian process.

A Gaussian Process, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution. As such, it is capable of efficiently and effectively summarizing a large number of functions and of transitioning smoothly as more observations are made available to the model.

This smooth structure, and the smooth transition to new functions as data arrive, are desirable properties as we sample the domain, and the multivariate Gaussian basis of the model means that an estimate from the model comes as the mean of a distribution together with a standard deviation; that will be helpful later in the acquisition function.
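As a minimal sketch of a GP surrogate, we can fit scikit-learn's GaussianProcessRegressor (scikit-optimize builds on scikit-learn GPs internally) to a handful of observations and query its mean and standard deviation at candidate points. The kernel choice and toy data here are illustrative assumptions, not prescribed by this article.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A few noisy observations of an unknown objective (toy data for illustration).
X_obs = np.array([[-1.5], [-0.5], [0.0], [0.7], [1.8]])
y_obs = np.sin(5 * X_obs[:, 0]) * (1 - np.tanh(X_obs[:, 0] ** 2)) \
        + 0.1 * np.random.randn(5)

# Fit a GP surrogate; alpha accounts for the observation noise variance.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.1 ** 2)
gp.fit(X_obs, y_obs)

# The surrogate returns a mean and a standard deviation at any candidate point,
# which is exactly what the acquisition function needs.
X_cand = np.linspace(-2, 2, 5).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)
print(mu, sigma)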

Acquisition Function

The acquisition function is the technique by which the posterior is used to select the next sample from the search space. We want to use our belief about the objective function to sample the area of the search space that is most likely to pay off; therefore, the acquisition function optimizes the conditional probability of locations in the search space to generate the next sample. Some commonly used acquisition functions are:

  • Expected improvement: $-EI(x) = -\mathbb{E}[f(x)-f(x_t^+)]$
  • Lower confidence bound: $LCB(x) = \mu_{GP}(x) - \kappa\sigma_{GP}(x)$
  • Probability of improvement: $-PI(x) = -P(f(x) \geq f(x_t^+) + \kappa)$
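As a rough sketch of how an acquisition function is computed from the surrogate's mean and standard deviation, the snippet below implements the standard expected-improvement form for minimization using scipy.stats.norm. The exact expression and defaults used inside skopt may differ in details; this is only an illustration.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # Expected improvement over the best observed cost f_best (minimization).
    # mu, sigma: surrogate mean and standard deviation at the candidate points.
    improvement = f_best - mu - xi
    z = improvement / np.maximum(sigma, 1e-12)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

# Using the GP surrogate from the previous sketch, the next sample would be
# the candidate with the highest expected improvement:
# x_next = X_cand[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]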

The Bayesian optimization algorithm can be summarized as follows:

  • Build a surrogate model of the objective from the samples evaluated so far
  • Optimize the acquisition function over the surrogate to choose the next sample
  • Evaluate the objective at that sample and add the result to the data
  • Repeat until the evaluation budget is exhausted, then return the best sample found

Bayesian Optimization using Python

For this example we use scikit-optimize. An installation guide is available at https://scikit-optimize.github.io/stable/install.html.

Single parameter optimization

import numpy as np
import matplotlib.pyplot as plt

noise_level = 0.1

# Noisy one-dimensional objective; x is a list containing a single value.
def f(x, noise_level=noise_level):
    return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2))\
           + np.random.randn() * noise_level

# Plot the noise-free f(x) and a 95% noise band
x = np.linspace(-2, 2, 400).reshape(-1, 1)
fx = [f(x_i, noise_level=0.0) for x_i in x]
plt.plot(x, fx, "r--", label="True (unknown)")
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate(([fx_i - 1.9600 * noise_level for fx_i in fx],
                         [fx_i + 1.9600 * noise_level for fx_i in fx[::-1]])),
         alpha=.2, fc="r", ec="None")
plt.legend()
plt.grid()
plt.show()

from skopt import gp_minimize

res = gp_minimize(f,                  # the function to minimize
                  [(-2.0, 2.0)],      # the bounds on each dimension of x
                  acq_func="EI",      # the acquisition function
                  n_calls=15,         # the number of evaluations of f
                  n_random_starts=5,  # the number of random initialization points
                  noise=0.1**2,       # the noise level (optional)
                  random_state=1234)   # the random seed

print(res)
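The returned OptimizeResult also exposes the best point and value directly, and skopt ships a convergence-plot helper. A brief sketch, assuming matplotlib is available as above:

from skopt.plots import plot_convergence

# Best input found and its (noisy) objective value.
print("Best x:", res.x)
print("Best f(x):", res.fun)

# Plot the best objective value found so far versus the number of calls.
plot_convergence(res)
plt.show()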

Multiple parameter optimization

# Define the search-space dimensions. They must all have names!
from skopt.space import Real
from skopt import forest_minimize
from skopt.utils import use_named_args

dim1 = Real(name='foo', low=0.0, high=1.0)
dim2 = Real(name='bar', low=0.0, high=1.0)
dim3 = Real(name='baz', low=0.0, high=1.0)

"""
The dimension may be real number or integer
The sampling process may be using uniform probability
or logarithmic-uniform probability
For further information check the documentation page 
in reference section below this article
"""

# Gather the search-space dimensions in a list.
dimensions = [dim1, dim2, dim3]

# Define the objective function with named arguments
# and use this function-decorator to specify the
# search-space dimensions.
@use_named_args(dimensions=dimensions)
def my_objective_function(foo, bar, baz):
    return foo ** 2 + bar ** 4 + baz ** 8

# Note: the function is now callable from the outside as
# `my_objective_function(x)` where `x` is a list of unnamed arguments,
# which then wraps your objective function that is callable as
# `my_objective_function(foo, bar, baz)`.
# The conversion from a list `x` to the named parameters `foo`,
# `bar`, `baz` is done automatically.

# Run the optimizer on the wrapped objective function which is called
# as `my_objective_function(x)` as expected by `forest_minimize()`.
result = forest_minimize(func=my_objective_function,
                         dimensions=dimensions,
                         n_calls=20, base_estimator="ET",
                         random_state=4)

# Print the best-found results.
print("Best fitness:", result.fun)

print("Best parameters:", result.x)

Bayesian Optimization Applications

  • Finding maxima/minima of a function
  • Machine learning hyperparameter tuning (see the sketch below)
  • Reinforcement learning hyperparameter tuning
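For the hyperparameter-tuning use case, scikit-optimize also provides a drop-in scikit-learn-style search, BayesSearchCV. A minimal sketch, assuming scikit-learn and its toy digits dataset; the estimator and search ranges are illustrative:

from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Bayesian optimization over SVC hyperparameters with 3-fold cross-validation.
search = BayesSearchCV(
    estimator=SVC(),
    search_spaces={
        'C': Real(1e-3, 1e3, prior='log-uniform'),
        'gamma': Real(1e-4, 1e0, prior='log-uniform'),
    },
    n_iter=20,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)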

Summary

In this tutorial you discovered Bayesian optimization. The outline of this tutorial covered:

  • What function optimization is and why it is relevant
  • How Bayesian optimization works, including the surrogate function and the acquisition function
  • How to perform Bayesian optimization using the Python library scikit-optimize

References

https://machinelearningmastery.com/what-is-bayesian-optimization/
https://arxiv.org/abs/1012.2599
https://arxiv.org/abs/1807.02811
https://github.com/scikit-optimize/scikit-optimize