Interruptible optimization runs with checkpoints#

Christian Schell, May 2018. Reformatted by Holger Nahrstaedt, 2020.

Problem statement#

Optimization runs can take a very long time, sometimes several days. If the process has to be interrupted for any reason, the results are irreversibly lost and the routine has to start over from the beginning.

With the help of the callbacks.CheckpointSaver callback, the optimizer’s current state can be saved after each iteration, allowing you to restart from that point at any time.

This is useful, for example,

  • if you don’t know how long the process will take and cannot hog computational resources forever

  • if there might be system failures due to shaky infrastructure (or colleagues…)

  • if you want to adjust some parameters and continue with the already obtained results
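To make the mechanism concrete: a skopt callback is a callable that receives the current result object after each iteration. The following is a minimal sketch of that idea, not the implementation of callbacks.CheckpointSaver. The class name PickleCheckpoint, the use of plain pickle, and the write-then-rename step are assumptions made for illustration, and a stand-in object takes the place of a real optimization result.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


class PickleCheckpoint:
    """Hypothetical callback: pickle the latest result after every iteration."""

    def __init__(self, path):
        self.path = path

    def __call__(self, res):
        # Write to a temporary file first, then rename it into place, so an
        # interruption in the middle of a write cannot corrupt the previous
        # checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(res, fh)
        os.replace(tmp_path, self.path)


# Stand-in for the result object an optimizer would pass to the callback.
fake_res = SimpleNamespace(x_iters=[[-20.0], [5.0]], func_vals=[0.1, -0.2])

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
saver = PickleCheckpoint(checkpoint_path)
saver(fake_res)  # an optimizer would make this call once per iteration

# Restoring the checkpoint gives back the state that was saved last.
with open(checkpoint_path, "rb") as fh:
    restored = pickle.load(fh)
```

Because the file is overwritten on every call, only the most recent state is kept, which is all that is needed to resume.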

import numpy as np


Simple example#

We will use pretty much the same optimization problem as in the Bayesian optimization with skopt notebook. Additionally, we will instantiate the callbacks.CheckpointSaver and pass it to the minimizer:

from skopt import gp_minimize
from skopt.callbacks import CheckpointSaver

noise_level = 0.1

def obj_fun(x, noise_level=noise_level):
    return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) + np.random.randn() * noise_level

checkpoint_saver = CheckpointSaver("./checkpoint.pkl", compress=9) # kwargs passed to `skopt.dump`

gp_minimize(
    obj_fun,  # the function to minimize
    [(-20.0, 20.0)],  # the bounds on each dimension of x
    x0=[-20.0],  # the starting point
    acq_func="LCB",  # the acquisition function (optional)
    n_calls=10,  # number of evaluations of f including at x0
    n_random_starts=3,  # the number of random initial points
    callback=[checkpoint_saver],  # a list of callbacks including the checkpoint saver
)

         fun: -0.17524445239614728
           x: [-18.660711608230713]
   func_vals: [-4.682e-02 -8.228e-02 -6.538e-03 -7.134e-02  9.064e-02
                7.662e-02  8.261e-02 -1.324e-01 -1.752e-01  1.002e-01]
     x_iters: [[-20.0], [5.857990176187936], [-11.97095004855501], [5.450171667295798], [10.524218484747195], [-17.111120867646253], [7.251301457256783], [-19.16709880389749], [-18.660711608230713], [-18.28429723556215]]
      models: [GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                                       n_restarts_optimizer=2, noise='gaussian',
                                       normalize_y=True, random_state=655685735)]
       space: Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')])
random_state: RandomState(MT19937)
       specs:     args:                    func: <function obj_fun at 0x0000020BCE5B9940>
                                     dimensions: Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')])
                                 base_estimator: GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5),
                                                                          n_restarts_optimizer=2, noise='gaussian',
                                                                          normalize_y=True, random_state=655685735)
                                        n_calls: 10
                                n_random_starts: 3
                               n_initial_points: 10
                        initial_point_generator: random
                                       acq_func: LCB
                                  acq_optimizer: auto
                                             x0: [-20.0]
                                             y0: None
                                   random_state: RandomState(MT19937)
                                        verbose: False
                                       callback: [<skopt.callbacks.CheckpointSaver object at 0x0000020BCF20E710>]
                                       n_points: 10000
                           n_restarts_optimizer: 5
                                             xi: 0.01
                                          kappa: 1.96
                                         n_jobs: 1
                               model_queue_size: None
                               space_constraint: None
              function: base_minimize

Now let’s assume this did not finish at once but took a long time: you started it on Friday night, went out for the weekend, and now, on Monday morning, you’re eager to see the results. However, instead of the notebook server you see only a blank page, and your colleague Garry tells you that he had an update scheduled for Sunday noon. Who doesn’t like updates?

gp_minimize did not finish, and there is no res variable with the actual results!

Restoring the last checkpoint#

Luckily, we employed the callbacks.CheckpointSaver and can now restore the latest result with skopt.load (see Store and load skopt optimization results for more information on that).

from skopt import load

res = load('./checkpoint.pkl')
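The restored result can then seed a new run by passing the points evaluated so far as x0 and y0. The following is a minimal sketch, assuming the loaded object exposes x_iters and func_vals as in the printed output above. The helper name warm_start_kwargs is hypothetical, and a stand-in object replaces the loaded result so that the snippet runs without a checkpoint file.

```python
from types import SimpleNamespace


def warm_start_kwargs(res):
    """Build x0/y0 keyword arguments from a restored result (hypothetical helper)."""
    return {
        "x0": [list(x) for x in res.x_iters],
        "y0": [float(y) for y in res.func_vals],
    }


# Stand-in for `res = load("./checkpoint.pkl")`; the first three points are
# taken from the printed output above (function values rounded as displayed).
res = SimpleNamespace(
    x_iters=[[-20.0], [5.857990176187936], [-11.97095004855501]],
    func_vals=[-4.682e-02, -8.228e-02, -6.538e-03],
)

kwargs = warm_start_kwargs(res)

# With skopt installed and obj_fun / checkpoint_saver defined as above, the
# search could then be continued along these lines:
# gp_minimize(obj_fun, [(-20.0, 20.0)], acq_func="LCB", n_calls=10,
#             callback=[checkpoint_saver], **kwargs)
```

Passing the same checkpoint_saver again means the continued run keeps writing checkpoints, so it can itself be interrupted and resumed.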

Possible problems#

  • changes in search space: You can use this technique to interrupt the search, tune the search space and continue the optimization. Note that the optimizers will complain if x0 contains parameter values not covered by the dimension definitions, so in many cases shrinking the search space will not work without deleting the offending runs from x0 and y0.

  • see Store and load skopt optimization results for more information on how the results get saved and possible caveats
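For the search-space caveat, one way to delete the offending runs is to filter the restored points before passing them on as x0 and y0. This is a sketch under the assumption that every dimension is numeric and given as a (low, high) tuple; the function name drop_points_outside and the sample values are made up for illustration.

```python
def drop_points_outside(x_iters, func_vals, bounds):
    """Keep only the evaluations whose point lies inside the new bounds."""
    kept_x, kept_y = [], []
    for x, y in zip(x_iters, func_vals):
        inside = all(low <= value <= high for value, (low, high) in zip(x, bounds))
        if inside:
            kept_x.append(x)
            kept_y.append(y)
    return kept_x, kept_y


# Illustrative points from a search on (-20, 20), to be reused on (-15, 15).
x_iters = [[-20.0], [5.86], [-11.97], [5.45], [10.52]]
func_vals = [-0.047, -0.082, -0.007, -0.071, 0.091]
new_bounds = [(-15.0, 15.0)]

x0, y0 = drop_points_outside(x_iters, func_vals, new_bounds)
# Only [-20.0] falls outside the new bounds, so four points remain.
```

The filtered x0 and y0 stay aligned because points and function values are dropped together.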
