Models and functions

This library was originally developed for as part of a robotics research project[13][12], where a Gaussian process with hyperpriors on the mean and signal covariance parameters. Then, the metamodel was constructed using the Maximum a Posteriory (MAP) of the parameters. By that time, it only supported one kernel function, one mean function and one criterion.

However, the library now has grown to support many more surrogate models, with different distributions (Gaussian processes, Student's-t processes, etc.), with many kernels and mean functions. It also provides different criteria (even some combined criteria) so the library can be used to any problem involving some bounded optimization, stochastic bandits, active learning for regression, etc.

As seen in Section modopt this library implements only one general regression model. However, we can assign a set of priors on the parameters of the model , (the kernel hyperparameter will be discussed in Section learnker). Thus, the options are:

- "sGaussianProcess": a standard Gaussian process where the hyperparameters are known.
- "sGaussianProcessML": a standard Gaussian process where the hyperparameters are estimated directly from data using maximum likelihood estimates.
- "sGaussianProcessNormal": a Gaussian process with a Normal prior on the mean function parameters and known .
- "sStudentTProcessJef": in this case we use the Jeffreys prior for and . This is a kind of uninformative prior which is invariant to reparametrizations. Once we set a prior on the posterior becomes a Student's t Process.
- "sStudentTProcessNIG": in this case we standard conjugate priors, that is, a Normal prior on and a Inverse Gamma on . Therefore, the posterior is again a Student's t process.

Gaussian processes are a very general model that can achieve good performance with a reasonable computational cost. However, Student's t processes, thanks to the hierarchical structure of priors, provide an even more general setup for a minor extra cost. Furthermore, the Student's t distribution is robust to outliers and heavy tails in the data.

One of the critical components of Gaussian and Student's t processes is the definition of the kernel function, which defines the correlation between points in the input space. As a correlation function, the kernel must satisfy a set of properties (e.g.: being positive definite). All the kernel models available and its combinations satisfy the kernel restrictions.

The functions with **"ISO"** in their name are *isotropic* function, that is, they share a single set of parameters for all the dimensions of the input space.

The functions with **"ARD"** in their name use *Automatic Relevance Determination*, that is, they use independent parameters for every dimension of the problem. Therefore, they can be use to find the *relevance* of the each feature in the input space. In the limit, this can be used for feature selection.

- "kConst": a simple constant function.
- "kLinear", "kLinearARD": a linear function.
- "kMaternISO1", "kMaternISO3","kMaternISO5","kMaternARD1","kMaternARD3","kMaternARD5": Matern kernel functions. The number divided by 2 represents the order of the function. See[18] for a description.
- "kPolyX": Polynomial kernel function. X is a number 1-6 which represents the exponent of the function.
- "kSEARD","kSEISO": Squared exponential kernel, also known as Gaussian kernel.
- "kRQISO": Rational quadratic kernel, also known as Student's t kernel.

This kernels allow to combine some of the previous kernels.

- "kSum": Sum of kernels.
- "kProd": Product of kernels.

Note that the binary kernels only admits two terms. However, we can combine them for more complex operations. For example if we write:

"kSum(kMaternISO3,kSum(kRQISO,kProd(kPoly4,kConst))"

it represents the expresion: Matern(3) + RationalQuadratic + C*Polynomial^4

In this case, the vector of parameters is splited from left to right: 1 for the Matern function, 2 for the RQ function, 2 for polynomial function and 1 for the constant. If the vector of parameters have more or less than 6 elements, the system complains.

Although the nonparametric process is able to model a large amount of funtions, we can model the expected value of the nonparametric process as a parametric function. This parametric model will help to capture large offsets and global trends.

The usage is analogous to the kernel functions.

- "mZero","mOne","mConst": constant functions. For simplicity and because they are largely used, we provide special cases f(x) = 0 and f(x) = 1.
- "mLinear": linear function.
- "mSum": binary function which can be used to combine other functions.

As discussed in Introduction to Bayesian Optimization, one of the critical aspects for Bayesian optimization is the decision (loss) function. Unfortunately, the functions described there are unavailable, because they assume knowledge of the optimal value . However, we can define proxies for those functions.

Some criteria, such as the expected improvement and the lower confidence bound admits an annealed version "cXXXa". In that version, the parameter that is used to trade off exploration and exploitation changes over time to priorize exploration at the begining and exploitation at the end.

Many criteria depends on the prediction function, which can be a Gaussian or a Student's t distribution, depending on the surrogate model. However, the library includes all the criteria for both distributions, and the system automatically selected the correct one.

- "cEI","cBEI","cEIa": The most extended and reliable algorithm is the Expected Improvement algorithm[14]. In this case we provide the general version from[21] which includes an exponent to trade off exploration and exploitation "cEI". Whe also includes a variation from[15] which add a
*bias*or*threshold*to the improvement "cBEI". - "cLCB", "cLCBa": Another popular algorithm is the Lower Confidence Bound (LCB), or UCB in case of maximization. Introduced by [3] as Sequential Design for Optimization (SDO).
- "cPOI": Probability of improvement, by[10]
- "cExpReturn","cThompsonSampling","cOptimisticSampling": This criteria are related with the predicted return of the function. The first one is literally the expected return of the function (mean value). The second one is based on the Thompson sampling (drawing a random sample from the predicted distribution). Finally, the optimistic sampling takes the minimum of the other two (mean vs random).

- "cAopt": This is based on the A-optimality criteria. It is the predicted variance at the query point. Thus, this criteria is intended for
**exploration**of the input space, not for optimization. - "cDistance": This criteria adds a cost to a query point based on the distance with respect to the previous evaluation. Combined with other criteria functions, it might provide a more realistic setup for certain applications[11]

- "cSum","cProd": Sum and product of different criteria functions.
- "cHedge", "cHedgeRandom": Bandit based selection of the best criteria based on the GP-Hedge algorithm[7]. It automatically learns based on the behaviour of the criteria during the optimization process. The original version "cHedge" uses the maximum expected return as a
*reward*for each criteria. We add a variant "cHedgeRandom" where the*reward*is defined in terms of Thompson sampling.

In this case, the combined criteria admits more that two functions. For example:

"cHedge(cSum(cEI,cDistance),cLCB,cPOI,cOptimisticSampling)"

The posterior distribution of the model, which is necessary to compute the criterion function, cannot be computed in closed form if the kernel hyperparameters are unknown. Thus, we need a find to approximate this posterior distribution conditional on the kernel hyperparameters.

First, we need to consider if we are going to use a full Bayesian approach or an empirical Bayesian approach. The first one, computes the full posterior distribution by propagation of the uncertainty of each element and hyperparameter to the posterior. In this case, it can be done by discretization of the hyperparameter space or by using MCMC (not yet implemented). In theory, it is more precise but the computation burden can be orders of magnitude higher. The empirical approach on the other hand computes a point estimate of the hyperparameters based on some score function and use it as a "true" value. Although the uncertainty estimation in this case might not be as accurate as the full Bayesian, the computation overhead is minimal.

For the score function, we need to find the likelihood function of the observed data for the parameters. Depending on the model, this function will be a multivariate Gaussian distribution or multivariate t distribution. In general, we present the likelihood as a log-likelihood function up to a constant factor, that is, we remove the terms independent of from the log likelihood. In practice, whether we use a point estimate (maximum score) or full Bayes MCMC/discrete posterior, the constant factor is not needed.

We are going to consider the following score functions to learn the kernel hyperparameters:

- Leave one out cross-validation (SC_LOOCV): In this case, we try to maximize the average predicted log probability by the
*leave one out*(LOO) cross-validation strategy. This is sometimes called a pseudo-likelihood.

- Maximum Total Likelihood (SC_MTL) For any of the models presented, one approach to learn the hyperparameters is to maximize the likelihood of all the parameters , and . Then, the likelihood function is a multivariate Gaussian distribution. We can obtain a better estimate if we adjust the number of degrees of freedom, this is called
*restricted maximum likelihood*. The library automatically selects the restricted version, if it is suitable.

- Posterior maximum likelihood (SC_ML): In this case, the likelihood function is modified to consider the posterior estimate of based on the different cases defined in Section surrmods. In this case, the function will be a multivariate Gaussian or t distribution, depending on the kind of prior used for .

- Maximum a posteriori (SC_MAP): We can modify the previous algorithms by adding a prior distribution . By default, we add a joint normal prior on all the kernel hyperparameters. However, if the variance of the prior
*hp_std*is invalid (<=0), then we assume a flat prior on that hyperparameter. Since we assume that the hyperparameters are independent, we can apply priors selectively only to a small set.

In order to build a suitable surrogate function, we a need a preliminar set of samples. In Bayesian optimization this is typically performed using alternative experimental design criteria. In this first step, usually the main criteria is space filling. Thus, we have implemented the subsequent designs:

- Latin hypercube sampling: Each dimension of the space is divided in several intervals. Samples are then taken according to a generalization of the Latin square scheme. http://en.wikipedia.org/wiki/Latin_hypercube_sampling

- Sobol sequences: It is a set of quasi-random low-discrepancy sequences. Thus the space is sampled more evenly than with uniform sampling. http://en.wikipedia.org/wiki/Sobol_sequence

- Uniform sampling: The search space is sampled uniformly.

Note: Since we do not assume any struture in the set of discrete points during discrete optimization, only uniform sampling of the discrete set is available in that case.

Generated on Tue Mar 25 2014 23:31:59 for BayesOpt by 1.8.4