Design Philosophy
Main Design Tenets
BoTorch adheres to the following main design tenets:
Modularity & Simplicity
- Make it easy for researchers to develop and implement new ideas by following a modular design philosophy & making heavy use of auto-differentiation. Most BoTorch components are torch.nn.Module instances, so that users familiar with PyTorch can easily implement new differentiable components (see the sketch below).
- Facilitate model-agnostic Bayesian Optimization by maintaining lightweight APIs and first-class support for Monte-Carlo-based acquisition functions.
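To illustrate this modularity, here is a minimal sketch of a custom analytic acquisition function written as an ordinary differentiable PyTorch module. The class name SimpleUCB and its UCB-style formula are illustrative choices, not part of the library:

```python
import torch
from botorch.acquisition import AnalyticAcquisitionFunction
from botorch.utils.transforms import t_batch_mode_transform


class SimpleUCB(AnalyticAcquisitionFunction):
    """Illustrative UCB-style acquisition: posterior mean + sqrt(beta) * std."""

    def __init__(self, model, beta: float = 0.1):
        super().__init__(model=model)
        self.register_buffer("beta", torch.as_tensor(beta))

    @t_batch_mode_transform(expected_q=1)
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X has shape batch_shape x 1 x d after the transform.
        posterior = self.model.posterior(X)
        mean = posterior.mean.squeeze(-1).squeeze(-1)  # shape: batch_shape
        sigma = posterior.variance.clamp_min(1e-9).sqrt().squeeze(-1).squeeze(-1)
        return mean + self.beta.sqrt() * sigma
```

Because forward is written in plain PyTorch, gradients with respect to $X$ come for free via auto-differentiation.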
Performance & Scalability
- Achieve high levels of performance across different platforms with device-agnostic code by using highly parallelized batch operations.
- Expand the applicability of Bayesian Optimization to very large problems by harnessing scalable modeling frameworks such as GPyTorch.
Parallelism Through Batched Computations
Batching (as in batching data or batching computations) is a central component of all modern deep learning platforms and plays a critical role in the design of BoTorch. Examples of batch computations in BoTorch include:
- A batch of candidate points $X$ to be evaluated in parallel on the black-box function we are trying to optimize. In BoTorch, we refer to this kind of batch as a "q-batch".
- A batch of q-batches to be evaluated in parallel on the surrogate model of the black-box function. These facilitate fast evaluation on modern hardware such as GPUs and multi-core CPUs with advanced instruction sets (e.g. AVX). In BoTorch, we refer to a batch of this type as a "t-batch" (as in "torch-batch").
- A batched surrogate model, each batch of which models a different output (which is useful for multi-objective Bayesian Optimization). This kind of batching also aims to exploit modern hardware architecture.
Note that none of these notions of batch pertains to the batching of training data, which is commonly done in training Neural Network models (sometimes called "mini-batching"). BoTorch aims to be agnostic with regard to the particular model used: while model fitting may indeed be performed via stochastic gradient descent using mini-batch training, BoTorch itself abstracts away from this.
For an in-depth look at the different batch notions in BoTorch, take a look at the Batching in BoTorch section.
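To make the q-batch and t-batch notions concrete, the following sketch (with made-up toy data and arbitrary shapes) evaluates a Monte-Carlo acquisition function on a t-batch of q-batches in a single batched forward pass:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition import qExpectedImprovement

# Toy training data: 20 observed points in a d=6 dimensional design space.
train_X = torch.rand(20, 6, dtype=torch.double)
train_Y = train_X.sin().sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)

# A t-batch of b=50 q-batches, each holding q=3 candidate points: shape b x q x d.
X = torch.rand(50, 3, 6, dtype=torch.double)

acqf = qExpectedImprovement(model=model, best_f=train_Y.max())
values = acqf(X)  # one acquisition value per q-batch -> shape (50,)
```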
Optimizing Acquisition Functions
While BoTorch aligns with PyTorch as closely as possible, optimizing acquisition functions requires a somewhat different approach. We now describe this difference and explain in detail why we made this design decision.
In PyTorch, modules typically map (batches of) data to an output, where the mapping is parameterized by the parameters of the modules (often the weights of a Neural Network). Fitting the model means optimizing some loss (which is defined with respect to the underlying distribution of the data). As this distribution is unknown, one cannot directly evaluate this function. Instead, one considers the empirical loss function, i.e. the loss evaluated on all data available. In typical machine learning model training, a stochastic version of the empirical loss, obtained by "mini-batching" the data, is optimized using stochastic optimization algorithms.
In BoTorch, AcquisitionFunction modules map an input design $X$ to the acquisition function value. Optimizing the acquisition function means optimizing the output over the possible values of $X$. If the acquisition function is deterministic, then so is the optimization problem.
For large Neural Network models, the number of optimization variables is very high and can be in the hundreds of thousands or even millions of parameters. The resulting optimization problem is often solved using first-order stochastic gradient descent algorithms (e.g. SGD and its many variants), many of which are implemented in the torch.optim module. The typical way of optimizing a model with these algorithms is to extract the module's parameters (e.g. using parameters()) and write a manual optimization loop that calls step() on a torch Optimizer object.
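For reference, that standard pattern looks roughly like the following generic sketch (toy data, not BoTorch-specific code):

```python
import torch

# Toy regression problem and model.
X_train = torch.rand(100, 10)
y_train = torch.rand(100, 1)
model = torch.nn.Linear(10, 1)

# Extract the module's parameters and hand them to a torch Optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Manual optimization loop that repeatedly calls step() on the Optimizer.
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X_train), y_train)
    loss.backward()
    optimizer.step()
```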
Optimizing acquisition functions is different since the problem dimensionality
is often much smaller. Indeed, optimizing over $q$ design points in a
$d$-dimensional feature space results in $qd$ scalar parameters to optimize
over. Both $q$ and $d$ are often quite small, and hence so is the dimensionality
of the problem.
Moreover, the optimization problem can be cast as a deterministic one (either
because an analytic acquisition function is used, or because the
reparameterization trick is employed to render the Monte-Carlo-based evaluation
of the acquisition function deterministic in terms of the input tensor $X$).
As a result, optimization algorithms that are typically inadmissible for
problems such as training Neural Networks become promising alternatives to
standard first-order methods. In particular, this includes quasi-second order
methods (such as L-BFGS or SLSQP) that approximate local curvature of the
acquisition function by using past gradient information.
These methods are currently not well supported in the torch.optim package, which is why BoTorch provides a custom interface that wraps the optimizers from the scipy.optimize module.
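In practice this interface is exposed through helpers such as optimize_acqf in botorch.optim, which by default performs multi-start optimization with a scipy-based quasi-second order routine. Below is a rough sketch, reusing the qExpectedImprovement instance acqf from the batching example above; the bounds, restart, and sample counts are arbitrary illustrative values:

```python
import torch
from botorch.optim import optimize_acqf

# Box bounds for a d=6 dimensional design space, given as a 2 x d tensor.
bounds = torch.stack([torch.zeros(6), torch.ones(6)]).to(torch.double)

candidates, acq_value = optimize_acqf(
    acq_function=acqf,   # e.g. the qExpectedImprovement instance from above
    bounds=bounds,
    q=3,                 # jointly optimize a q-batch of 3 candidate points
    num_restarts=10,     # multi-start to mitigate local optima
    raw_samples=64,      # raw samples used to seed the restarts
)
```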