Design Philosophy

Main Design Tenets

BoTorch adheres to the following main design tenets:

Modularity & Simplicity
- Make it easy for researchers to develop and implement new ideas by following a modular design philosophy & making heavy use of auto-differentiation. Most BoTorch components are torch.nn.Module instances, so that users familiar with PyTorch can easily implement new differentiable components.
- Facilitate model-agnostic Bayesian Optimization by maintaining lightweight APIs and first-class support for Monte-Carlo-based acquisition functions.
Performance & Scalability
- Achieve high levels of performance across different platforms with device-agnostic code by using highly parallelized batch operations.
- Expand the applicability of Bayesian Optimization to very large problems by harnessing scalable modeling frameworks such as GPyTorch.

Parallelism Through Batched Computations

Batching (as in batching data or batching computations) is a central component to all modern deep learning platforms and plays a critical role in the design of BoTorch. Examples of batch computations in BoTorch include:

A batch of candidate points $X$ to be evaluated in parallel on the black-box function we are trying optimize. In BoTorch, we refer to this kind of batch as a "q-batch".
A batch of q-batches to be evaluated in parallel on the surrogate model of the black-box function. These facilitate fast evaluation on modern hardware such as GPUs and multi-core CPUs with advanced instruction sets (e.g. AVX). In BoTorch, we refer to a batch of this type as "t-batch" (as in "torch-batch").
A batched surrogate model, each batch of which models a different output (which is useful for multi-objective Bayesian Optimization). This kind of batching also aims to exploit modern hardware architecture.

Note that none of these notions of batch pertains to the batching of training data, which is commonly done in training Neural Network models (sometimes called "mini-batching"). BoTorch aims to be agnostic with regards to the particular model used - so while model fitting may indeed be performed via stochastic gradient descent using mini-batch training, BoTorch itself abstracts away from this.

For an in-depth look at the different batch notions in BoTorch, take a look at the Batching in BoTorch section.

Optimizing Acquisition Functions

While BoTorch tries to align as closely as possible with PyTorch when possible, optimization of acquisition functions requires a somewhat different approach. We now describe this discrepancy and explain in detail why we made this design decision.

In PyTorch, modules typically map (batches of) data to an output, where the mapping is parameterized by the parameters of the modules (often the weights of a Neural Network). Fitting the model means optimizing some loss (which is defined with respect to the underlying distribution of the data). As this distribution is unknown, one cannot directly evaluate this function. Instead, one considers the empirical loss function, i.e. the loss evaluated on all data available. In typical machine learning model training, a stochastic version of the empirical loss, obtained by "mini-batching" the data, is optimized using stochastic optimization algorithms.

In BoTorch, AcquisitionFunction modules map an input design $X$ to the acquisition function value. Optimizing the acquisition function means optimizing the output over the possible values of $X$. If the acquisition function is deterministic, then so is the optimization problem.

For large Neural Network models, the number of optimization variables is very high, and can be in the hundreds of thousands or even millions of parameters. The resulting optimization problem is often solved using first-order stochastic gradient descent algorithms (e.g. SGD and its many variants). Many of these are implemented in the torch.optim module. The typical way of optimizing a model with these algorithms is by extracting the module's parameters (e.g. using parameters()), and writing a manual optimization loop that calls step() on a torch Optimizer object.

Optimizing acquisition functions is different since the problem dimensionality is often much smaller. Indeed, optimizing over $q$ design points in a $d$-dimensional feature space results in $qd$ scalar parameters to optimize over. Both $q$ and $d$ are often quite small, and hence so is the dimensionality of the problem. Moreover, the optimization problem can be cast as a deterministic one (either because an analytic acquisition function is used, or because the reparameterization trick is employed to render the Monte-Carlo-based evaluation of the acquisition function deterministic in terms of the input tensor $X$). As a result, optimization algorithms that are typically inadmissible for problems such as training Neural Networks become promising alternatives to standard first-order methods. In particular, this includes quasi-second order methods (such as L-BFGS or SLSQP) that approximate local curvature of the acquisition function by using past gradient information. These methods are currently not well supported in the torch.optim package, which is why BoTorch provides a custom interface that wraps the optimizers from the scipy.optimize module.