BayesOpt
Bayesian optimization


Introduction to Bayesian Optimization

Many problems in engineering, computer science, economics, etc., require finding the extremum of a real-valued function. These functions are typically continuous and sometimes smooth (e.g., Lipschitz continuous). However, they often lack a closed-form expression, or they might be multimodal, with some local extrema whose outcome is much worse than that of the global extremum. Furthermore, evaluating those functions might be costly.

Global optimization is a special case of non-convex optimization where we want to find the global extremum of a real-valued function, the so-called target function. The search is done by pointwise evaluation of the target function.

The objective of a global optimization algorithm is to find the sequence of points

\[ x_n \in \mathcal{A} \subset \mathbb{R}^m , \;\;\; n = 1,2,\ldots \]

which converges to the point $x^*$, that is, the extremum of the target function, as $n$ grows. The algorithm should be able to find that sequence at least for all functions from a given family.

As explained in [16], this search procedure is a sequential decision-making problem, where the point at step $n+1$ is selected by a decision $d_n$ that considers all previous data:

\[ x_{n+1} = d_n(x_{1:n},y_{1:n}) \]

where $y_i = f(x_i) + \epsilon_i$. For simplicity, many works assume $\epsilon_i = 0$, that is, function evaluations are deterministic. However, we can easily extend the description to include stochastic functions (e.g., homoscedastic noise $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$).
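To make this interface concrete, the decision rule can be viewed as a function mapping the evaluation history to the next query point. The following Python sketch fixes that interface with a toy quadratic target and a placeholder random-search decision; the target, noise level and bounds are arbitrary choices for illustration, not part of any particular method:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(f, x, noise_std=0.1):
    """Noisy oracle: y_i = f(x_i) + eps_i, with eps_i ~ N(0, noise_std^2)."""
    return f(x) + noise_std * rng.normal()

def decision(xs, ys):
    """d_n: maps the history (x_{1:n}, y_{1:n}) to the next query x_{n+1}.
    Placeholder rule (uniform random search) just to fix the interface."""
    return rng.uniform(-1.0, 1.0)

f = lambda x: (x - 0.3) ** 2           # toy target with minimum at x* = 0.3
xs, ys = [], []
for n in range(10):
    x_next = decision(xs, ys)          # x_{n+1} = d_n(x_{1:n}, y_{1:n})
    xs.append(x_next)
    ys.append(evaluate(f, x_next))
```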

The search method is the sequence of decisions $d = \{d_0,\ldots, d_{n-1}\}$, which leads to the final decision $x_{n} = x_{n}(d)$. In most applications, the objective is to optimize the response at the final decision. Then, the criterion relies on the optimality error or optimality gap, which can be expressed as:

\[ \delta_n(f,d) = f\left(x_n\right) - f(x^*) \]

In other applications, the objective may be to converge to $x^*$ in the input space. Then, we can use, for example, the Euclidean distance error:

\[ \delta_n(f,d) = \|x_n - x^*\|_2 \]

The previous equations can also be interpreted as variants of the loss function for the decision at each step. Thus, the optimal decision is defined as the function that minimizes the loss function:

\[ d_n = \arg \min_d \delta_n(f,d) \]
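Both criteria are straightforward to compute when the optimum is known, as is the case for benchmark functions. A minimal sketch, continuing the toy run from the previous snippet (where the target has its minimum at $x^* = 0.3$):

```python
def optimality_gap(f, x_n, x_star):
    """delta_n(f, d) = f(x_n) - f(x*): error in the function value."""
    return f(x_n) - f(x_star)

def distance_error(x_n, x_star):
    """delta_n(f, d) = ||x_n - x*||_2: error in the input space."""
    return float(np.linalg.norm(np.atleast_1d(x_n) - np.atleast_1d(x_star)))

x_star = 0.3                           # known optimum of the toy target
x_best = xs[int(np.argmin(ys))]        # final decision: best observed point
print(optimality_gap(f, x_best, x_star), distance_error(x_best, x_star))
```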

This requires full knowledge of the function $f$, which is unavailable. Instead, let us assume that the target function $f = f(x)$ belongs to a family of functions $f \in F$, e.g., continuous functions on $\mathbb{R}^m$. Let us also assume that the function can be represented as a sample from a probability distribution over functions, $f \sim P(f)$. Then, the best response case analysis for the search process is defined as the decision that optimizes the expectation of the loss function:

\[ d^{BR}_n = \arg \min_d \mathbb{E}_{P(f)} \left[ \delta_n(f,d)\right]= \arg \min_d \int_F \delta_n(f,d) \; dP(f) \]

where $P$ is a prior distribution over functions.
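Since the integral over $F$ is rarely tractable, in practice the expectation can be approximated, for instance by Monte Carlo over functions sampled from the prior. The following sketch illustrates the idea with an assumed squared-exponential Gaussian process prior; the kernel, grid and sample sizes are arbitrary choices for the example:

```python
import numpy as np

def se_kernel(X1, X2, ell=0.2):
    """Squared-exponential covariance k(x, x') for one-dimensional inputs."""
    d = np.subtract.outer(X1, X2)
    return np.exp(-0.5 * (d / ell) ** 2)

# Draw sample functions f ~ P(f) from a zero-mean GP prior on a dense grid.
grid = np.linspace(-1.0, 1.0, 200)
K = se_kernel(grid, grid) + 1e-8 * np.eye(grid.size)  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(grid.size), K, size=500)

# Monte Carlo estimate of E_{P(f)}[delta_n(f, d)] for one candidate x_n,
# with the optimality gap f(x_n) - f(x*) as the loss:
x_n_idx = 120                          # grid index of the candidate x_n
gaps = samples[:, x_n_idx] - samples.min(axis=1)
print(gaps.mean())                     # approximate expected loss at x_n
```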

However, we can improve on this equation by considering that, at decision $d_n$, we have already observed the actual response of the function at $n-1$ points, $\{x_{1:n-1},y_{1:n-1}\}$. Thus, the prior information about the function can be updated with the observations using Bayes' rule:

\[ P(f|x_{1:n-1},y_{1:n-1}) = \frac{P(x_{1:n-1},y_{1:n-1}|f) P(f)}{P(x_{1:n-1},y_{1:n-1})} \]

In fact, we can rewrite the equation so that the updates are applied sequentially:

\[ P(f|x_{1:i},y_{1:i}) = \frac{P(x_{i},y_{i}|f) P(f|x_{1:i-1},y_{1:i-1})}{P(x_{i},y_{i})}, \qquad \forall \; i=1 \ldots n-1 \]
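For the homoscedastic Gaussian noise model introduced above, the likelihood term in this update has an explicit form:

\[ P(x_{i},y_{i}|f) \propto \mathcal{N}\left(y_i ;\, f(x_i),\, \sigma^2\right) \]

which, for deterministic evaluations ($\epsilon_i = 0$), collapses to a point mass at $f(x_i)$.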

Thus, the best response decision can be rewritten with respect to the posterior:

\[ d^{BO}_n = \arg \min_d \mathbb{E}_{P(f|x_{1:n-1},y_{1:n-1})} \left[ \delta_n(f,d)\right] = \arg \min_d \int_F \delta_n(f,d) \; dP(f|x_{1:n-1},y_{1:n-1}) \]

This equation is the root of Bayesian optimization: the Bayesian part comes from the fact that we compute the expectation with respect to the posterior distribution, also called the belief, over functions. In this sense, Bayesian optimization is a memory-based optimization algorithm.
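Algorithmically, this equation corresponds to the familiar loop: condition the distribution over functions on all data observed so far, then choose the next point by optimizing a criterion computed under that posterior. The sketch below, reusing `se_kernel`, `grid` and the toy target `f` from the previous snippets, uses a lower confidence bound as the criterion; that is an assumption of this sketch, one common stand-in for the expected loss rather than the only choice:

```python
def gp_posterior(X, y, Xs, ell=0.2, noise=1e-6):
    """Posterior mean and variance of the GP at test points Xs, given (X, y)."""
    K = se_kernel(X, X, ell) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs, ell)                   # cross-covariances
    mu = Ks.T @ np.linalg.solve(K, y)
    V = np.linalg.solve(K, Ks)
    var = 1.0 - np.einsum('ij,ij->j', Ks, V)     # k(x, x) = 1 for this kernel
    return mu, np.maximum(var, 0.0)

# The loop behind d_n^{BO}: condition on all data seen so far, then decide
# under the posterior.
X = np.array([-0.8, 0.8])
y = np.array([f(x) for x in X])
for n in range(8):
    mu, var = gp_posterior(X, y, grid)
    lcb = mu - 2.0 * np.sqrt(var)                # optimistic bound on f
    x_next = grid[np.argmin(lcb)]                # decision under the posterior
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))
```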

Although, as commented before, most of the theory of Bayesian optimization concerns deterministic functions, we also consider stochastic functions, that is, we assume there might be a random error in the function output; evaluations can thus produce different outputs if repeated. In that case, the target function is the expected output. Furthermore, it has been shown in [4] that, even for deterministic functions, it is better to assume a certain error in the observations. The main reason is that, in practice, there might be some mismodeling errors which can lead to instability of the recursion if neglected.

Bayesian optimization general model

To simplify the description, we are going to use a special case of the Bayesian optimization model defined previously, which corresponds to the most common application. In subsequent sections we will introduce some generalizations for different applications.

Without loss of generality, consider the problem of finding the minimum of an unknown real-valued function $f:\mathbb{X} \rightarrow \mathbb{R}$, where $\mathbb{X}$ is a compact space, $\mathbb{X} \subset \mathbb{R}^d$, $d \geq 1$. Let $P(f)$ be a prior distribution over functions represented as a stochastic process, for example, a Gaussian process $\xi(\cdot)$, with inputs $x \in \mathbb{X}$ and an associated kernel or covariance function $k(\cdot,\cdot)$. Let us also assume that the target function is a sample of the stochastic process, $f \sim \xi(\cdot)$.

In order to find the minimum, the algorithm has a maximum budget of $N$ evaluations of the target function $f$. The purpose of the algorithm is to make the decisions that yield the best performance at the end of that budget.

One advantage of using Gaussian processes as prior distributions over functions is that new observations of the target function $(x_i,y_i)$ can easily be used to update the distribution over functions. Furthermore, the posterior distribution is also a Gaussian process $\xi_i(\cdot) = \left[ \xi(\cdot) \mid x_{1:i},y_{1:i} \right]$. Therefore, the posterior can be used as an informative prior for the next iteration in a recursive algorithm.
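Concretely, for a zero-mean Gaussian process prior with kernel $k$ and Gaussian observation noise of variance $\sigma^2_n$, the posterior after $i$ observations has the standard closed-form moments:

\[ \mu_i(x) = \mathbf{k}_i(x)^T \left(\mathbf{K}_i + \sigma^2_n \mathbf{I}\right)^{-1} \mathbf{y}_{1:i}, \qquad k_i(x,x') = k(x,x') - \mathbf{k}_i(x)^T \left(\mathbf{K}_i + \sigma^2_n \mathbf{I}\right)^{-1} \mathbf{k}_i(x') \]

where $[\mathbf{K}_i]_{jl} = k(x_j,x_l)$ and $\mathbf{k}_i(x) = \left[k(x,x_1),\ldots,k(x,x_i)\right]^T$.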

In a more general setting, many authors have suggested modifying the standard zero-mean Gaussian process with different variations, including semi-parametric models [8][6][9][17], hyperpriors on the parameters [13][2][7], Student's t processes [5][19][25], etc.

We use a generalized linear model of the form:

\[ f(x) = \phi(\mathbf{x})^T \mathbf{w} + \epsilon(\mathbf{x}) \]

where

\[ \epsilon(\mathbf{x}) \sim \mathcal{NP} \left( 0, \sigma^2_s (\mathbf{K}(\theta) + \sigma^2_n \mathbf{I}) \right) \]

The term $\mathcal{NP}$ denotes a non-parametric process, which can refer to a Gaussian process $\mathcal{GP}$ or a Student's t process $\mathcal{TP}$. In both cases, $\sigma^2_n$ is the observation noise variance, sometimes called the nugget, and it is problem specific. Many authors decide to fix this value, $\sigma^2_n = 0$, when the function $f(x)$ is deterministic, for example, a computer simulation. However, as cleverly pointed out in [4], there might be more reasons to include this term apart from the observation noise, for example, to consider model inaccuracies.
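Numerically, the nugget simply adds $\sigma^2_n$ to the diagonal of the kernel matrix before factorization, which also stabilizes the computation when inputs nearly coincide. A minimal sketch, reusing `se_kernel` from the earlier snippet (the input values and nugget size are arbitrary choices for the example):

```python
# Gram matrix of nearly coincident inputs: ill-conditioned without a nugget.
X = np.array([0.5, 0.5 + 1e-8, 1.0])
K = se_kernel(X, X)
sigma2_n = 1e-6                        # small nugget, even for noiseless f
print(np.linalg.cond(K))               # enormous; Cholesky may fail
print(np.linalg.cond(K + sigma2_n * np.eye(len(X))))  # well behaved
L = np.linalg.cholesky(K + sigma2_n * np.eye(len(X))) # factorization succeeds
```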

This model has been presented in different ways depending on the field where it was used: