The one-shot Knowledge Gradient acquisition function
The Knowledge Gradient (KG) (see [2, 3]) is a look-ahead acquisition function that quantifies the expected increase in the maximum of the modeled black-box function $f$ from obtaining additional (random) observations collected at the candidate set $\mathbf{x}$. KG often shows improved Bayesian Optimization performance relative to simpler acquisition functions such as Expected Improvement, but in its traditional form it is computationally expensive and hard to implement.
BoTorch implements a generalized variant of parallel KG [3] given by

$$\alpha_{\text{KG}}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}_{\mathbf{x}}}\Bigl[\max_{x' \in \mathbb{X}} \mathbb{E}\bigl[g(\xi)\bigr]\Bigr] - \mu,$$

where $\xi \sim \mathbb{P}\bigl(f(x') \mid \mathcal{D} \cup \mathcal{D}_{\mathbf{x}}\bigr)$ is the posterior at $x'$ conditioned on $\mathcal{D}_{\mathbf{x}}$, the (random) dataset observed at $\mathbf{x}$, and $\mu := \max_{x'} \mathbb{E}\bigl[g(f(x')) \mid \mathcal{D}\bigr]$.
In general, we recommend using Ax for a simple BO setup like this one, since this will simplify your setup (including the amount of code you need to write) considerably. You can use a custom BoTorch model and acquisition function in Ax, following Ax's Modular BoTorch tutorial. To use the KG acquisition function, it is sufficient to add `"botorch_acqf_class": qKnowledgeGradient,` to `model_kwargs`. The linked tutorial shows how to use a custom BoTorch model. If you'd like to let Ax choose which model to use based on the properties of the search space, you can skip the `surrogate` argument in `model_kwargs`.
Optimizing KG
The conventional approach for optimizing parallel KG (where $q > 1$) is to apply stochastic gradient ascent, with each gradient observation potentially being an average over multiple samples. For each sample $i$, the inner optimization problem for the posterior mean conditioned on $\mathcal{D}_{\mathbf{x}}^{(i)}$ is solved numerically. An unbiased stochastic gradient of KG can then be computed by leveraging the envelope theorem and the optimal points $x'^{*(i)}$. In this approach, every iteration requires solving numerous inner optimization problems, one for each outer sample, in order to estimate just one stochastic gradient.
The "one-shot" formulation of KG in BoTorch treats optimizing $\alpha_{\text{KG}}(\mathbf{x})$ as an entirely deterministic optimization problem. It involves drawing