The one-shot Knowledge Gradient acquisition function
The Knowledge Gradient (KG) (see [2, 3]) is a look-ahead acquisition function that quantifies the expected increase in the maximum of the modeled black-box function $f$ from obtaining additional (random) observations collected at the candidate set $\mathbf{x}$. KG often shows improved Bayesian Optimization performance relative to simpler acquisition functions such as Expected Improvement, but in its traditional form it is computationally expensive and hard to implement.
BoTorch implements a generalized variant of parallel KG [3] given by

$$\alpha_{\text{KG}}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}_{\mathbf{x}}}\left[ \max_{x' \in \mathbb{X}} \mathbb{E}\left[ g(\xi) \right] \right] - \mu,$$

where $\xi \sim \mathbb{P}\left( f(x') \mid \mathcal{D} \cup \mathcal{D}_{\mathbf{x}} \right)$ is the posterior at $x'$ conditioned on $\mathcal{D}_{\mathbf{x}}$, the (random) dataset observed at $\mathbf{x}$, and $\mu := \max_{x} \mathbb{E}\left[ g(f(x)) \mid \mathcal{D} \right]$.
In general, we recommend using Ax for a simple BO setup like this one, since this will simplify your setup (including the amount of code you need to write) considerably. You can use a custom BoTorch model and acquisition function in Ax, following the Using BoTorch with Ax tutorial. To use the KG acquisition function, it is sufficient to add "botorch_acqf_class": qKnowledgeGradient to model_kwargs. The linked tutorial shows how to use a custom BoTorch model. If you'd like to let Ax choose which model to use based on the properties of the search space, you can skip the surrogate argument in model_kwargs.
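As a concrete illustration, here is a minimal sketch of such an Ax setup using Ax's modular BoTorch interface (the Sobol initialization step and the trial counts are illustrative assumptions, not part of this tutorial):

from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from botorch.acquisition import qKnowledgeGradient

# sketch: let Ax choose the surrogate, but use one-shot KG as the acquisition
gs = GenerationStrategy(
    steps=[
        GenerationStep(model=Models.SOBOL, num_trials=5),  # space-filling start
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,  # run this step for all remaining trials
            model_kwargs={"botorch_acqf_class": qKnowledgeGradient},
        ),
    ]
)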
Optimizing KG
The conventional approach for optimizing parallel KG (where $q > 1$) is to apply stochastic gradient ascent, with each gradient observation potentially being an average over multiple samples. For each sample $i$, the inner optimization problem $\max_{x' \in \mathbb{X}} \mathbb{E}\left[ g(\xi^i) \right]$ for the posterior mean is solved numerically. An unbiased stochastic gradient of KG can then be computed by leveraging the envelope theorem and the optimal points $\{ {x'}_i^* \}$: by the envelope theorem, the inner maximizers can be held fixed when differentiating with respect to $\mathbf{x}$. In this approach, every iteration requires solving numerous inner optimization problems, one for each outer sample, in order to estimate just one stochastic gradient.
The "one-shot" formulation of KG in BoTorch treats optimizing
as an entirely deterministic optimization problem. It
involves drawing num_fantasies
fixed base samples
for the outer expectation,
sampling fantasy data
, and constructing
associated fantasy models . The
inner maximization can then be moved outside of the sample average, resulting in the
following optimization problem:
where and .
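To make this construction concrete, here is a minimal sketch of drawing fixed base samples and building fantasy models in BoTorch (the sample count and candidate set are illustrative; model is the fitted GP constructed in the next section):

from botorch.sampling import SobolQMCNormalSampler

# draw N_f = 8 fixed base samples Z_f and fantasize at a q-batch X;
# the result is a batched model containing one fantasy model per sample
sampler = SobolQMCNormalSampler(sample_shape=torch.Size([8]))
X = torch.rand(2, 2)  # q = 2 candidate points in d = 2
fantasy_model = model.fantasize(X, sampler)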
If the inner expectation does not have an analytic expression, one can also draw fixed base samples and use an MC approximation as with the standard MC acquisition functions of type MCAcquisitionFunction. In either case one is left with a deterministic optimization problem.
The key difference from the envelope theorem approach is that we do not solve the inner optimization problem to completion for every fantasy point for every gradient step with respect to $\mathbf{x}$. Instead, we solve the nested optimization problem jointly over $\mathbf{x}$ and the fantasy points $\mathbf{x}'$. The resulting optimization problem is of higher dimension, namely $(q + N_f) d$ instead of $q d$, but unlike the envelope theorem formulation it is a single, deterministic optimization problem that can be solved using standard optimization methods.
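For instance, with the settings used later in this tutorial (q=2, d=2, num_fantasies=128), the difference in problem size is easy to compute:

# one-shot problem: (q + N_f) * d variables vs. q * d for the
# envelope theorem formulation
q, d, num_fantasies = 2, 2, 128
print((q + num_fantasies) * d, "vs", q * d)  # 260 vs 4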
[1] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. Advances in Neural Information Processing Systems 33, 2020.
[2] P. Frazier, W. Powell, and S. Dayanik. A Knowledge-Gradient Policy for Sequential Information Collection. SIAM Journal on Control and Optimization, 2008.
[3] J. Wu and P. Frazier. The Parallel Knowledge Gradient Method for Batch Bayesian Optimization. Advances in Neural Information Processing Systems 29, 2016.
Setting up a toy model
We'll fit a standard SingleTaskGP model on noisy observations of the synthetic function $f(x) = \sin(2\pi x_1) \cos(2\pi x_2)$ in $d = 2$ dimensions on the hypercube $[0, 1]^2$.
# Install dependencies if we are running in colab
import sys
if 'google.colab' in sys.modules:
    %pip install botorch

import os
import math
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.utils import standardize
from gpytorch.mlls import ExactMarginalLogLikelihood

SMOKE_TEST = os.environ.get("SMOKE_TEST")

# generate 20 random training points in [0, 1]^2 and noisy observations
bounds = torch.stack([torch.zeros(2), torch.ones(2)])
train_X = bounds[0] + (bounds[1] - bounds[0]) * torch.rand(20, 2)
train_Y = torch.sin(2 * math.pi * train_X[:, [0]]) * torch.cos(
    2 * math.pi * train_X[:, [1]]
)
train_Y = standardize(train_Y + 0.05 * torch.randn_like(train_Y))

# fit a SingleTaskGP via exact marginal log likelihood
model = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll);
Defining the qKnowledgeGradient acquisition function
The qKnowledgeGradient complies with the standard MCAcquisitionFunction API. The only mandatory argument in addition to the model is num_fantasies, the number of fantasy samples. More samples result in a better approximation of KG, at the expense of both memory and wall time.

qKnowledgeGradient also supports the other parameters of MCAcquisitionFunction, such as a generic objective objective and pending points X_pending. It also accepts a current_value argument, the maximum posterior mean of the current model (which can be obtained by maximizing the PosteriorMean acquisition function). Providing it does not change the optimizer, so it is not required; if it is omitted, however, the acquisition value is a constant shift away from the actual "Knowledge Gradient" value.
from botorch.acquisition import qKnowledgeGradient
NUM_FANTASIES = 128 if not SMOKE_TEST else 4
qKG = qKnowledgeGradient(model, num_fantasies=NUM_FANTASIES)
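Note that when evaluated directly (rather than through optimize_acqf), qKnowledgeGradient expects the augmented q-batch containing both the actual candidates and one fantasy point per fantasy model. A small sketch (the batch shapes here are illustrative):

# the augmented q-batch size is q + NUM_FANTASIES
q_aug = qKG.get_augmented_q_batch_size(q=2)
X_eval = torch.rand(3, q_aug, 2)  # 3 augmented q-batches in [0, 1]^2
with torch.no_grad():
    kg_vals = qKG(X_eval)  # one value per q-batch, shape: 3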
Optimizing qKG
qKnowledgeGradient subclasses OneShotAcquisitionFunction, which makes sure that the fantasy parameterization is automatically generated and optimized when calling optimize_acqf on the acquisition function. This means that optimizing one-shot KG in BoTorch is just as easy as optimizing any other acquisition function (from an API perspective, at least). It turns out that a careful initialization of the fantasy points can significantly help with the optimization (see the logic in botorch.optim.initializers.gen_one_shot_kg_initial_conditions for more information, and the sketch below).
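For illustration, here is a minimal sketch of calling the initializer explicitly (optimize_acqf does this automatically under the hood, so the explicit call is only needed if you want to inspect or customize the initialization):

from botorch.optim.initializers import gen_one_shot_kg_initial_conditions

# heuristic initial conditions for the one-shot problem; the result has
# shape num_restarts x (q + NUM_FANTASIES) x d and can be passed to
# optimize_acqf via batch_initial_conditions
Xinit = gen_one_shot_kg_initial_conditions(
    acq_function=qKG,
    bounds=bounds,
    q=2,
    num_restarts=10,
    raw_samples=512,
)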
Here we use num_restarts=10 random initial q-batches with q=2 in parallel, with the initialization heuristic starting from raw_samples=512 raw points (note that since qKnowledgeGradient is significantly more expensive to evaluate than other acquisition functions, large values of num_restarts and raw_samples, which are typically feasible in other settings, can result in long wall times and potential memory issues).
Finally, since we do not pass a current_value argument, this value is not actually the KG value, but offset by the constant (w.r.t. the candidates) $\mu = \max_{x} \mathbb{E}\left[ g(f(x)) \mid \mathcal{D} \right]$.
from botorch.optim import optimize_acqf
from botorch.utils.sampling import manual_seed
NUM_RESTARTS = 10 if not SMOKE_TEST else 2
RAW_SAMPLES = 512 if not SMOKE_TEST else 4
with manual_seed(1234):
    candidates, acq_value = optimize_acqf(
        acq_function=qKG,
        bounds=bounds,
        q=2,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )

candidates

acq_value
Computing the actual KG value
We first need to find the maximum posterior mean; we can use a large number of random restarts and raw_samples to increase the likelihood that we do indeed find it (this is a non-convex optimization problem, after all).
from botorch.acquisition import PosteriorMean

NUM_RESTARTS = 20 if not SMOKE_TEST else 2
RAW_SAMPLES = 2048 if not SMOKE_TEST else 4

# maximize the posterior mean of the current model to obtain its maximizer
# and the maximum value (the constant shift discussed above)
argmax_pmean, max_pmean = optimize_acqf(
    acq_function=PosteriorMean(model),
    bounds=bounds,
    q=1,
    num_restarts=NUM_RESTARTS,
    raw_samples=RAW_SAMPLES,
)
Now we can optimize KG after passing the current value. We also pass in the sampler from the original qKG above, which contains the fixed base samples $Z_f$. This ensures that we optimize the same approximation, so that our values are an apples-to-apples comparison (as num_fantasies increases, the effect of this randomness becomes less and less important).
qKG_proper = qKnowledgeGradient(
    model,
    num_fantasies=NUM_FANTASIES,
    sampler=qKG.sampler,
    current_value=max_pmean,
)

with manual_seed(1234):
    candidates_proper, acq_value_proper = optimize_acqf(
        acq_function=qKG_proper,
        bounds=bounds,
        q=2,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )

candidates_proper

acq_value_proper
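As a quick sanity check (approximate, since the two optimization runs may land on slightly different candidates), the two acquisition values should differ by roughly the constant shift max_pmean discussed above:

# acq_value ≈ acq_value_proper + max_pmean, up to optimization error
print(acq_value - (acq_value_proper + max_pmean))  # expected to be near zero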