In this tutorial, we're going to explore composite Bayesian optimization (Astudillo & Frazier, ICML 2019) with the High Order Gaussian Process (HOGP) model of Zhe et al., AISTATS 2019. The setup for composite Bayesian optimization is that we have an unknown (black box) function mapping input parameters to several outputs, and a second, known function describing the quality of the functional output. We wish to find input parameters that maximize the output metric function in a black-box manner.
Specifically, this can be described as $\max_{x \in \mathcal{X}} g(f(x)),$ where $f$ is unknown and $g$ is known. As in traditional Bayesian optimization, we construct a Gaussian process surrogate model of the expensive-to-evaluate function $f(\cdot)$; here, we use a HOGP as that surrogate.
The High Order Gaussian Process (HOGP) model is a Gaussian process model designed specifically to operate over tensors (multi-dimensional arrays), exploiting the structure of the tensor for efficient computation. Specifically, the HOGP takes as input $y \in \mathbb{R}^{N \times d_2 \times \cdots \times d_M}$ and assumes that $\text{vec}(y) \sim \mathcal{N}(0, \otimes_{i=1}^M K_i + \sigma^2 I),$ where $K_1 = K_{XX}.$ Each dimension of the tensor has its own kernel function $K_i,$ as well as a set of $d_i$ latent parameters that can be optimized over.
Recently, Maddox et al., '21 proposed a method for computing posterior samples from the HOGP by exploiting structure in the posterior distribution, thereby enabling its usage in BO settings. While they show that this approach enables composite BO on problems with tens of thousands of outputs, for simplicity we consider a much smaller example here (one that does not require GPU acceleration).
import torch
import os
import logging
import numpy as np
import math
import matplotlib.pyplot as plt
import time
import gpytorch.settings as gpt_settings
from botorch.acquisition import qSimpleRegret, qExpectedImprovement
from botorch.acquisition.objective import GenericMCObjective
from botorch.models import HigherOrderGP, SingleTaskGP
from botorch.models.higher_order_gp import FlattenedStandardize
from botorch.models.transforms import Normalize, Standardize
from botorch import fit_gpytorch_model
from botorch.optim import optimize_acqf
from botorch.sampling import IIDNormalSampler
from botorch.utils.sampling import draw_sobol_samples
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.optim.fit import fit_gpytorch_torch

%matplotlib inline
SMOKE_TEST = os.environ.get("SMOKE_TEST")
torch.manual_seed(0)
device = torch.device("cpu") if not torch.cuda.is_available() else torch.device("cuda:0")
dtype = torch.float
print("Using ", device)
Using cuda:0
models_used = (
"rnd",
"ei",
"ei_hogp_cf",
)
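To make the HOGP interface described above concrete, here is a minimal toy sketch (the data, shapes, and optimizer settings are arbitrary and purely illustrative): we fit a HigherOrderGP on random $3 \times 4$ tensor-valued outputs and draw joint posterior samples, which is the operation that enables composite BO.
# Toy illustration only: random inputs in R^4 and random 3 x 4 tensor-valued outputs.
toy_X = torch.rand(10, 4, device=device, dtype=dtype)
toy_Y = torch.randn(10, 3, 4, device=device, dtype=dtype)
toy_model = HigherOrderGP(toy_X, toy_Y)
toy_mll = ExactMarginalLogLikelihood(toy_model.likelihood, toy_model)
# fit with a first-order optimizer (see the note on fitting the HOGP further below)
fit_gpytorch_torch(toy_mll, options={"lr": 0.01, "maxiter": 100, "disp": False})
# posterior samples retain the tensor structure of the outputs
toy_post = toy_model.posterior(torch.rand(2, 4, device=device, dtype=dtype))
print(toy_post.rsample(torch.Size([5])).shape)  # expected: torch.Size([5, 2, 3, 4])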
We use a simple test problem from Astudillo & Frazier, ICML 2019 describing the concentration of pollutants after a chemical spill, defined over a $3 \times 4$ grid of values $(s, t).$ Our goal is to recover the true parameters $x = [M, D, L, \tau].$ The concentration function is given by $$ c(s,t \mid M, D, L, \tau) := \frac{M}{\sqrt{4 \pi D t}} \exp\left\{-\frac{s^2}{4Dt}\right\} + \frac{1_{t > \tau} M}{\sqrt{4 \pi D(t - \tau)}} \exp\left\{- \frac{(s - L)^2}{4 D (t - \tau)}\right\}, $$ with the cheap-to-evaluate, differentiable metric given by $g(y):= \sum_{(s,t) \in S \times T} \left(c(s, t \mid x_{\text{true}}) - y_{s,t}\right)^2,$ which we wish to minimize (in the code below we maximize its negation). As the objective function itself is implemented in PyTorch, we will be able to differentiate through it, enabling the use of gradient-based optimization of the objective with respect to the inputs.
def env_cfun(s, t, M, D, L, tau):
    # first spill term
    c1 = M / torch.sqrt(4 * math.pi * D * t)
    exp1 = torch.exp(-(s ** 2) / 4 / D / t)
    term1 = c1 * exp1
    # second spill term, only active for t > tau
    c2 = M / torch.sqrt(4 * math.pi * D * (t - tau))
    exp2 = torch.exp(-((s - L) ** 2) / 4 / D / (t - tau))
    term2 = c2 * exp2
    # for t <= tau the expressions above are NaN; zeroing them implements 1_{t > tau}
    term2[torch.isnan(term2)] = 0.0
    return term1 + term2
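As a quick sanity check (the grid and parameter values below are simply the defaults used later in prepare_data), we can evaluate env_cfun directly; for $t \le \tau$ the second plume term evaluates to NaN before being zeroed out, which is how the indicator $1_{t > \tau}$ is implemented.
# Sanity check using the default grid and true parameters from prepare_data below.
S_chk = torch.tensor([0.0, 1.0, 2.5], device=device, dtype=dtype)
T_chk = torch.tensor([15.0, 30.0, 45.0, 60.0], device=device, dtype=dtype)
Sg_chk, Tg_chk = torch.meshgrid(S_chk, T_chk)
M_chk, D_chk, L_chk, tau_chk = (
    torch.tensor(v, device=device, dtype=dtype) for v in (10.0, 0.07, 1.505, 30.1525)
)
# columns with t <= tau (the first two) only receive the first plume term
print(env_cfun(Sg_chk, Tg_chk, M_chk, D_chk, L_chk, tau_chk))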
Below are a few helper functions for generating random points and for maximizing the acquisition function.
def gen_rand_points(bounds, num_samples):
points_nlzd = torch.rand(num_samples, bounds.shape[-1]).to(bounds)
return bounds[0] + (bounds[1] - bounds[0]) * points_nlzd
def optimize_ei(qEI, bounds, **options):
with gpt_settings.fast_computations(covar_root_decomposition=False):
cands_nlzd, _ = optimize_acqf(
qEI, bounds, **options,
)
return cands_nlzd
Below is a helper function that sets up the problem: it defines the bounds on the parameter space and constructs the composite objective. It also lets us vary the size of the $s \times t$ grid if we'd like to.
def prepare_data(s_size=3, t_size=4, device=device, dtype=dtype):
print("---- Running the environmental problem with ", s_size, t_size, " ----")
# X = [M, D, L, tau]
bounds = torch.tensor(
[[7.0, 0.02, 0.01, 30.010], [13.0, 0.12, 3.00, 30.295]],
device=device,
dtype=dtype,
)
M0 = torch.tensor(10.0, device=device, dtype=dtype)
D0 = torch.tensor(0.07, device=device, dtype=dtype)
L0 = torch.tensor(1.505, device=device, dtype=dtype)
tau0 = torch.tensor(30.1525, device=device, dtype=dtype)
# we can vectorize everything, no need for loops
if s_size == 3:
S = torch.tensor([0.0, 1.0, 2.5], device=device, dtype=dtype)
else:
S = torch.linspace(0.0, 2.5, s_size, device=device, dtype=dtype)
if t_size == 4:
T = torch.tensor([15.0, 30.0, 45.0, 60.0], device=device, dtype=dtype)
else:
T = torch.linspace(15.0, 60.0, t_size, device=device, dtype=dtype)
Sgrid, Tgrid = torch.meshgrid(S, T)
# X = [M, D, L, tau]
    def c_batched(X):
        # evaluate the environmental function on the grid for each row of X
        return torch.stack([env_cfun(Sgrid, Tgrid, *x) for x in X])
c_true = env_cfun(Sgrid, Tgrid, M0, D0, L0, tau0)
    def neg_sum_squared_diff(samples):
        # reshape flattened outputs (e.g. posterior samples) back into an s x t grid
        if samples.shape[-1] == (s_size * t_size):
            samples = samples.unsqueeze(-1).reshape(
                *samples.shape[:-1], s_size, t_size
            )
        # negative sum of squared differences to the true-parameter concentrations
        sq_diffs = (samples - c_true).pow(2)
        return sq_diffs.sum(dim=(-1, -2)).mul(-1.0)

    objective = GenericMCObjective(neg_sum_squared_diff)
num_samples = 32
return c_batched, objective, bounds, num_samples
In the above, we construct a GenericMCObjective instance to codify the objective function, which is the negative sum of squared differences between the output tensors and the outputs corresponding to the "true" parameter values (so maximizing the objective minimizes the discrepancy). Note that the objective function is encoded in PyTorch and is differentiable (although it technically doesn't have to be). Ultimately, we backpropagate through the objective with respect to the input parameters (and through the HOGP as well).
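Since the objective is a plain PyTorch function, we can check this directly. Below is a small sketch (the candidate points are random and purely illustrative) that evaluates the composite objective on a couple of output tensors and backpropagates through it.
# Illustrative check: evaluate the composite objective on outputs from two random
# parameter settings and backpropagate through it (here w.r.t. the output tensors).
c_batched, objective, bounds, num_samples = prepare_data(device=device, dtype=dtype)
Y_demo = c_batched(gen_rand_points(bounds, 2)).requires_grad_(True)  # 2 x 3 x 4
obj_demo = objective(Y_demo)  # negative sum of squared errors; closer to 0 is better
obj_demo.sum().backward()
print(obj_demo)               # shape: torch.Size([2])
print(Y_demo.grad.shape)      # gradients w.r.t. the outputs: torch.Size([2, 3, 4])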
Finally, we run the BO loop for $15$ batches of $4$ candidates each. This loop might take a while.
We will compare against both random candidate selection and batch expected improvement on the aggregated metric.
n_init = 30
if SMOKE_TEST:
n_batches = 1
batch_size = 2
n_trials = 1
else:
n_batches = 15
batch_size = 4
n_trials = 3
As a word of caution: when fitting the HOGP model, we've found that first-order optimizers (e.g. Adam, as used in fit_gpytorch_torch) tend to outperform second-order optimizers such as L-BFGS-B, due to the large number of free parameters in the HOGP. L-BFGS-B tends to overfit in practice here.
all_objective_vals = []
with gpt_settings.cholesky_jitter(1e-4):
c_batched, objective, bounds, num_samples = prepare_data(device=device, dtype=dtype)
train_X_init = gen_rand_points(bounds, n_init)
train_Y_init = c_batched(train_X_init)
# these will keep track of the points explored
train_X = {k: train_X_init.clone() for k in models_used}
train_Y = {k: train_Y_init.clone() for k in train_X}
# run the BO loop
for i in range(n_batches):
tic = time.time()
# get best observations, log status
best_f = {k: objective(v).max().detach() for k, v in train_Y.items()}
        logging.info(
            f"It {i+1:>2}/{n_batches}, best obs.: "
            + ", ".join([f"{k}: {v:.3f}" for k, v in best_f.items()])
        )
# generate random candidates
cands = {}
cands["rnd"] = gen_rand_points(bounds, batch_size)
optimize_acqf_kwargs = {
"q": batch_size,
"num_restarts": 10,
"raw_samples": 512,
}
sampler = IIDNormalSampler(num_samples=128)
train_Y_ei = objective(train_Y["ei"]).unsqueeze(-1)
model_ei = SingleTaskGP(
train_X["ei"],
train_Y_ei,
input_transform=Normalize(train_X["ei"].shape[-1]),
outcome_transform=Standardize(train_Y_ei.shape[-1]),
)
mll = ExactMarginalLogLikelihood(model_ei.likelihood, model_ei)
fit_gpytorch_torch(mll, options={"lr": 0.01, "maxiter": 3000, "disp": False})
# generate qEI candidate (single output modeling)
qEI = qExpectedImprovement(model_ei, best_f=best_f["ei"], sampler=sampler)
cands["ei"] = optimize_ei(qEI, bounds, **optimize_acqf_kwargs)
model_ei_hogp_cf = HigherOrderGP(
train_X["ei_hogp_cf"],
train_Y["ei_hogp_cf"],
outcome_transform=FlattenedStandardize(train_Y["ei_hogp_cf"].shape[1:]),
input_transform=Normalize(train_X["ei_hogp_cf"].shape[-1]),
latent_init="gp",
)
mll = ExactMarginalLogLikelihood(model_ei_hogp_cf.likelihood, model_ei_hogp_cf)
fit_gpytorch_torch(mll, options={"lr": 0.01, "maxiter": 3000, "disp": False})
# generate qEI candidate (multi-output modeling)
qEI_hogp_cf = qExpectedImprovement(
model_ei_hogp_cf,
best_f=best_f["ei_hogp_cf"],
sampler=sampler,
objective=objective,
)
cands["ei_hogp_cf"] = optimize_ei(qEI_hogp_cf, bounds, **optimize_acqf_kwargs)
# make observations and update data
for k, Xold in train_X.items():
Xnew = cands[k]
if Xnew.shape[0] > 0:
train_X[k] = torch.cat([Xold, Xnew])
train_Y[k] = torch.cat([train_Y[k], c_batched(Xnew)])
logging.info(f"Wall time: {time.time() - tic:1f}")
objective_dict = {k: objective(train_Y[k]) for k in train_Y}
---- Running the environmental problem with 3 4 ----
[Per-iteration logging output omitted; final best observed objective values after 15 batches: rnd: -0.406, ei: -0.009, ei_hogp_cf: -0.000.]
methods_dict = {k: objective_dict[k].cpu().cummax(0)[0] for k in models_used}
mean_results = {k: -methods_dict[k][n_init:] for k in models_used}
Finally, we plot the results. The HOGP-based composite approach performs well on this task, converging to parameter values closer to the truth than batch EI on the aggregated metric.
plt.figure(figsize=(8, 6))
labels_dict = {"rnd": "Random", "ei": "EI", "ei_hogp_cf": "Composite EI"}
for k in models_used:
    plt.plot(
        torch.arange(n_batches * batch_size),
        mean_results[k],
        label=labels_dict[k],
    )
plt.legend(fontsize=20)
plt.semilogy()
plt.xlabel("Number of Function Queries")
plt.ylabel("Difference from True Parameter")
Text(0, 0.5, 'Difference from True Parameter')
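Since the best achievable objective value is $0$ (attained at $x_{\text{true}}$), we can also inspect the best parameter setting found by the composite approach and compare it against the true parameters from prepare_data. Below is a small follow-up sketch using the variables from the loop above.
# Compare the best parameters found by composite EI against the true parameters
# (the true values are those hard-coded in prepare_data above).
x_true = torch.tensor([10.0, 0.07, 1.505, 30.1525], device=device, dtype=dtype)
best_idx = objective_dict["ei_hogp_cf"].argmax()
print("best x found:", train_X["ei_hogp_cf"][best_idx])
print("true x:      ", x_true)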