Kormos

The kormos package provides an interface between scipy.optimize.minimize and Keras for training models with deterministic minimization algorithms like L-BFGS.

It provides Keras users with:

keras.Model subclasses that may be optimized without changes in the API—a model may be trained using either the built-in stochastic mini-batch algorithms or the deterministic batch algorithms from scipy.optimize
Out-of-the-box interoperability with the usual Keras utilities and workflows; e.g.:
- fit() still returns the history object with optimization metadata and validation metrics from each iteration of the solver, and is usable by KerasTuner
- Support for distributed training strategies (at least in principle—this has admittedly not been integration tested)
The ability to use second order optimization methods from scipy by evaluating Hessian-vector-products if you have a very specific need for this (spoiler: you almost certainly do not)

Motivation

Why would anyone want to go full batch in this day and age?

Keras is a powerful tool for developing predictive models and optimizing their parameters in a high-level language. While its primary use case is large-scale deep learning, Keras’s auto-differentiation utilities (from Tensorflow) that enable rapid prototyping and optimization with gradient-based minimization algorithms are great for other use cases in mathematical modelling and numerical optimization.

If you are working with models or datasets in which training data can reasonably fit in memory on a single machine, you may have situations in which deterministic algorithms like L-BFGS or Newton-CG are complementary or viable alternatives to the stochastic optimizers available in Keras, since:

deterministic algorithms don’t require additional hyperparameter tuning to ensure convergence; if you’re just prototyping something small and having trouble tuning learning rates, you may just want to crank out L-BFGS for a few minutes as a sanity check that your model can in fact be optimized
these algorithms may have faster convergence to accurate solutions of the optimization problem if the dataset is small enough that full batch gradient and Hessian-vector-product computation times aren’t prohibitive

So TL;DR: because Luddites exist even in the field of numerical optimization.

Why The Name Kormos?

Because Keras is a powerful and useful tool, and is named after the Greek word κέρας, which means horn.

Kormos is related to Keras, but it’s not very powerful or useful. It’s named after the Greek word κορμός, which means stump.

License

This project is released under the MIT license, and contains adaptations of other codes released under the Apache and MIT licenses. Please see the header in each source file for the applicable license and copyright notices therein.

Setup

Requirements

The kormos package is built for:

Python 3+
Tensorflow 2+ (and the respective tensorflow.keras module with it)
Scipy 0.1+ (any version really, since the scipy.optimize.minimize signature is stable)

Installation

Install via the PyPI package kormos using pip:

pip3 install kormos

Alternatively, if you like your releases bloody rare you may install from git directly:

pip3 install git+https://github.com/mbhynes/kormos.git

Usage Examples

A kormos model is drag-and-drop replaceable with any Keras model. Below we provide some toy code examples, including Collaborative Filtering and MNIST classification examples adapted from the Code Examples section of keras.io.

Example: Linear Regression with Sequential API

import numpy as np
from tensorflow import keras

import kormos

rank = 50

# Define the model using the keras.model.Model Sequential API
model = kormos.models.BatchOptimizedSequentialModel()
model.add(
    keras.layers.Dense(
        units=1,
        input_shape=(rank,),
        activation=None,
        use_bias=False,
        kernel_regularizer=keras.regularizers.L2(1e-3),
        kernel_initializer="ones",
    )
)
loss = keras.losses.MeanSquaredError()
model.compile(loss=loss, optimizer='l-bfgs-b', metrics=['mean_absolute_error'])

# Generate samples of normally distributed random data
np.random.seed(1)
w = np.random.normal(size=rank)
X = np.random.normal(size=(1000, rank))
y = np.expand_dims(X.dot(w), axis=1)

Xval = np.random.normal(size=(1000, rank))
yval = np.expand_dims(Xval.dot(w), axis=1)

# Fit the model
history = model.fit(
    x=X,
    y=y,
    epochs=10,
    validation_data=(Xval, yval),
    options={"maxcor": 3},  # a solver options payload may be passed if so desired
)
best_fit_weights = np.reshape(model.trainable_weights[0].numpy(), (1, -1))
assert np.allclose(best_fit_weights, w, 1e-2)

We can now inspect the optimization metrics traced in the history object returned from fit(). The training metrics captured by kormos include the:

training loss function value (including regularization terms)
2-norm of the batch gradient
number of evaluations of the loss/gradient function (equivalent to an epoch for a stochastic optimizer)
number of evaluations of the Hessian-vector-product function, if applicable (equivalent to an epoch for a stochastic optimizer)

>>> import pandas as pd; pd.DataFrame(history.history)
        loss       grad  fg_evals  hessp_evals   val_loss  val_mean_absolute_error
79.121972  17.946233         2            0  78.418121                 7.137860
 0.192005   0.713242         3            0   0.232164                 0.344657
 0.056429   0.186013         4            0   0.059140                 0.088700
 0.047397   0.042760         5            0   0.047348                 0.015531
 0.047006   0.008019         6            0   0.047006                 0.006401
 0.046991   0.001854         7            0   0.046994                 0.005846
 0.046990   0.000350         8            0   0.046992                 0.005675
 0.046990   0.000073         9            0   0.046992                 0.005642
 0.046990   0.000051        11            0   0.046992                 0.005642
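As a small illustration of how these traces might be used, the sketch below scans the gradient norms for the first iteration that falls under a tolerance. The values are copied from the grad and fg_evals columns of the table above; the helper first_iteration_below and the tolerance are hypothetical and not part of kormos.

```python
# Values copied from the `grad` and `fg_evals` columns of the table above.
history = {
    "grad": [17.946233, 0.713242, 0.186013, 0.042760, 0.008019,
             0.001854, 0.000350, 0.000073, 0.000051],
    "fg_evals": [2, 3, 4, 5, 6, 7, 8, 9, 11],
}


def first_iteration_below(grad_norms, tol):
    """Return the 0-based index of the first gradient norm below tol, or None."""
    for i, g in enumerate(grad_norms):
        if g < tol:
            return i
    return None


tol = 1e-3  # hypothetical convergence tolerance
k = first_iteration_below(history["grad"], tol)
print(f"gradient norm < {tol} at iteration index {k} "
      f"after {history['fg_evals'][k]} loss/gradient evaluations")
```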

We can now also recompile the model to use a stochastic optimizer; let’s refit the model using ADAM:

# Recompile the model to use a different optimizer (this doesn't change its weights)
model.compile(loss=model.loss, optimizer='adam', metrics=['mean_absolute_error'])

# Reset the weights
model.set_weights([np.random.random(size=(rank, 1))])

# Fit the model using ADAM
history = model.fit(
    x=X,
    y=y,
    epochs=150,
    validation_data=(Xval, yval),
)

This is a somewhat contrived example in modern machine learning (small dataset and simple model with very few parameters), but it’s the kind of classical use case in which a deterministic algorithm will converge faster than a stochastic algorithm. If you were interested in Keras primarily for the nice tensorflow API and autodifferentiation routines, but had unsexy, non-deep modelling goals, this bud’s for you:

>>> import pandas as pd; pd.DataFrame(history.history)
          loss  mean_absolute_error   val_loss  val_mean_absolute_error
  59.751369             6.218111  52.518566                 5.756832
  50.042812             5.688218  45.344589                 5.346300
  43.674156             5.308869  40.368832                 5.043641
  39.074280             5.021304  36.492527                 4.795147
  35.389912             4.781666  33.423710                 4.588754
..         ...                  ...        ...                      ...
 0.047031             0.008966   0.047031                 0.009047
 0.047023             0.008606   0.047025                 0.008718
 0.047017             0.008268   0.047019                 0.008344
 0.047012             0.007934   0.047013                 0.007977
 0.047008             0.007655   0.047009                 0.007717

[150 rows x 4 columns]

Example: Linear Regression using the Functional API

The same linear regression model as above may be expressed equivalently by the functional API. Here we specify a different scipy solver, the Newton-CG algorithm that uses Hessian-vector-products:

# Define the model using the keras.model.Model functional API
model_input = keras.Input(shape=(rank,), name="input")
model_output = keras.layers.Dense(
    units=1,
    input_shape=(rank,),
    activation=None,
    use_bias=False,
    kernel_regularizer=keras.regularizers.L2(1e-3),
    kernel_initializer="ones",
)(model_input)
model = kormos.models.BatchOptimizedModel(
    inputs=model_input,
    outputs=model_output,
)
loss = keras.losses.MeanSquaredError()
model.compile(loss=loss, optimizer='newton-cg', metrics=['mean_absolute_error'])

# Fit the model on the same data as previously
history = model.fit(
    x=X,
    y=y,
    epochs=10,
    validation_data=(Xval, yval),
)
best_fit_weights = np.reshape(model.trainable_weights[0].numpy(), (1, -1))
assert np.allclose(best_fit_weights, w, 1e-2)

The Newton-CG algorithm has second order convergence, so we should find that the gradient norm has decreased by several orders of magnitude more than with the L-BFGS-B algorithm. (Of course, practically speaking this is a moot point in the world of approximate parameter estimation due to the limitations of both imperfect models and sampling bias that exists in training datasets: the numerical error in the solution is orders of magnitude smaller than other errors…)
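To get a feel for this comparison without Keras in the loop, here is a minimal sketch that calls scipy.optimize.minimize directly on a ridge-regularized least squares problem of the same shape. The loss, grad, and hessp functions are written by hand in NumPy and are assumptions for illustration only; kormos builds the equivalent callables from the Keras model. The xtol option is tightened arbitrarily so that Newton-CG keeps iterating.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, rank = 1000, 50
X = rng.normal(size=(n, rank))
w_true = rng.normal(size=rank)
y = X @ w_true
lam = 1e-3  # L2 penalty, mirroring the regularizer in the example above


def loss(w):
    r = X @ w - y
    return r @ r / n + lam * (w @ w)


def grad(w):
    return 2.0 * (X.T @ (X @ w - y)) / n + 2.0 * lam * w


def hessp(w, v):
    # Hessian-vector product; the Hessian is constant for a quadratic loss
    return 2.0 * (X.T @ (X @ v)) / n + 2.0 * lam * v


w0 = np.ones(rank)
res_lbfgs = minimize(loss, w0, jac=grad, method="L-BFGS-B")
res_ncg = minimize(loss, w0, jac=grad, hessp=hessp, method="Newton-CG",
                   options={"xtol": 1e-10})

# Compare the final gradient norms reached by each solver
for name, res in [("L-BFGS-B", res_lbfgs), ("Newton-CG", res_ncg)]:
    print(f"{name:10s} |grad| = {np.linalg.norm(grad(res.x)):.3e}")
```

Both solvers land on essentially the same weights; the difference shows up only in how far the gradient norm is driven down, which is the moot point described above.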

Example: Collaborative Filtering for Item Recommendation

We present a simple linear matrix factorization model for building a recommender system using the MovieLens dataset, and use the same preprocessing steps as in the Keras example, Collaborative Filtering for Movie Recommendations.

Define the Model

We define a simple matrix factorization model for factorizing the ratings matrix into the product of 2 latent feature matrices, represented by user and item embeddings:

import tensorflow as tf
from tensorflow import keras
import kormos

def build_model(rank, num_users, num_items, **kwargs):
    inputs = [
        keras.Input(shape=(1,), name="user", dtype=tf.int32),
        keras.Input(shape=(1,), name="item", dtype=tf.int32),
    ]
    user_embedding = keras.layers.Embedding(
        input_dim=(num_users + 1),
        output_dim=rank,
        mask_zero=True,
        embeddings_initializer="normal",
        embeddings_regularizer=keras.regularizers.L2(1e-5),
        name="user_embedding",
    )
    item_embedding = keras.layers.Embedding(
        input_dim=(num_items + 1),
        output_dim=rank,
        mask_zero=True,
        embeddings_initializer="normal",
        embeddings_regularizer=keras.regularizers.L2(1e-5),
        name="item_embedding",
    )
    features = [
        user_embedding(inputs[0]),
        item_embedding(inputs[1]),
    ]
    output = keras.layers.Dot(axes=2, normalize=False)(features)
    model = kormos.models.BatchOptimizedModel(
        inputs=inputs,
        outputs=output,
        **kwargs
    )
    return model

Prepare the Data

We run the same pre-processing steps as in the Keras example above. (Please be aware that there are methodological errors in these steps that we have left unchanged: (1) it is not correct to split the training and testing data uniformly at random, since some movies have only 1 rating and hence should not be members of the testing set, and (2) it is not possible to construct a well-posed factorization model that represents each user/item by a vector of rank k if k is greater than the number of observations (ratings) that that user/item has in the training data; such a system is underdetermined.)

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path

# Download the data from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# Use the ratings.csv file
movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip:
        # Extract files
        print("Extracting all the files now...")
        zip.extractall(path=keras_datasets_path)
        print("Done!")

ratings_file = movielens_dir / "ratings.csv"
df = pd.read_csv(ratings_file)

user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])

print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)

df = df.sample(frac=1, random_state=42)
x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
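The second caveat noted above (a rank-k vector fitted from fewer than k ratings) can be illustrated with a small NumPy sketch. Everything here is hypothetical: a single user with 2 ratings, fixed item vectors V, and rank 5. With more unknowns than observations, many different user vectors reproduce the ratings exactly, so the fit is not unique without regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5              # rank of the latent vectors
num_ratings = 2    # ratings observed for one hypothetical user
V = rng.normal(size=(num_ratings, k))  # fixed item vectors for the rated items
r = np.array([0.8, 0.3])               # the user's (normalized) ratings

# Minimum-norm solution of V @ u = r
u_min, *_ = np.linalg.lstsq(V, r, rcond=None)

# Any vector in the null space of V can be added without changing the fit
_, _, Vt = np.linalg.svd(V)
null_direction = Vt[-1]  # orthogonal to both rows of V
u_other = u_min + 3.0 * null_direction

print(np.allclose(V @ u_min, r), np.allclose(V @ u_other, r))
```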

Train the Model

We may now train our factorization model:

rank = 5
model = build_model(rank, num_users, num_movies)
model.compile(
    loss=tf.keras.losses.MeanSquaredError(),
    optimizer="l-bfgs-b",
)

history = model.fit(
  x=(x_train[:, 0], x_train[:, 1]),
  y=y_train,
  batch_size=2**14,
  epochs=10,
  verbose=1,
  validation_data=((x_val[:, 0], x_val[:, 1]), y_val),
)

>>> import pandas as pd; pd.DataFrame(history.history)
        loss      grad  fg_evals  hessp_evals  val_loss
 0.499431  0.001055         2            0  0.497424
 0.492091  0.010318         5            0  0.496749
 0.491067  0.015367         7            0  0.499127
 0.461140  0.012731         9            0  0.472772
 0.271020  0.017515        12            0  0.327173
 0.228658  0.021585        14            0  0.298120
 0.156481  0.012698        16            0  0.226349
 0.125350  0.007833        17            0  0.193145
 0.101411  0.007957        18            0  0.169513
 0.093375  0.013233        19            0  0.162208
 0.082876  0.005307        20            0  0.152423
 0.077789  0.004717        21            0  0.149731
 0.072867  0.004420        22            0  0.144979
 0.066927  0.006463        23            0  0.137852
 0.063850  0.004983        24            0  0.136306
 0.061897  0.002353        25            0  0.133633
 0.060514  0.001867        26            0  0.132471
 0.058629  0.002211        27            0  0.131402
 0.057408  0.003710        28            0  0.130704
 0.056111  0.001484        29            0  0.129850

Example: MNIST convnet

As a more realistic example of using kormos on a canonical dataset, we adapt the sample classification problem from the MNIST convnet example. Please note that this convolutional network model has a large number of highly correlated parameters to optimize, and stochastic algorithms like ADAM will generally perform better and provide better results. However we provide it as an example of how both stochastic and deterministic algorithms may be combined by recompiling a kormos model.

Prepare the Data

import numpy as np

from tensorflow import keras
from keras import layers

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# Load the data and split it between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Build the Model

from kormos.models import BatchOptimizedSequentialModel

def build_model():
    model = BatchOptimizedSequentialModel(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ]
    )
    return model

model = build_model()
model.summary()

Model: "batch_optimized_sequential_model"
_________________________________________________________________
 Layer (type)                   Output Shape              Param #
=================================================================
 conv2d (Conv2D)                (None, 26, 26, 32)        320

 max_pooling2d (MaxPooling2D)   (None, 13, 13, 32)        0

 conv2d_1 (Conv2D)              (None, 11, 11, 64)        18496

 max_pooling2d_1 (MaxPooling2D) (None, 5, 5, 64)          0

 flatten (Flatten)              (None, 1600)              0

 dropout (Dropout)              (None, 1600)              0

 dense (Dense)                  (None, 10)                16010

=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________

Train the Model

We use this example to train the model by running a combination of different algorithms. We start by running ADAM for 1 epoch, and then use this solution as a warm-start initial guess for a batch solver by recompiling the model:

import kormos.optimizers

loss = keras.losses.CategoricalCrossentropy()
# Train a model with ADAM
model = build_model()
model.compile(loss=loss, optimizer="adam", metrics=["accuracy"])
hist1 = model.fit(x_train, y_train, batch_size=2**5, epochs=1, validation_data=(x_test, y_test))

# Continue training the model with a batch algorithm.
# We can instantiate the optimizer as well instead of a string identifier
optimizer = kormos.optimizers.ScipyBatchOptimizer()
model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])

# We can specify the method and any options for it in fit as keyword args
hist2 = model.fit(
    x_train,
    y_train,
    batch_size=2**14, # this is much larger than for stochastic solvers!
    epochs=3,
    validation_data=(x_test, y_test),
    method='bfgs',
)

Implementation Details

The kormos package implements an interface for batch optimization and wraps scipy.optimize.minimize in that interface in the following steps:

We create a subclass of keras.Model, BatchOptimizedModel (and BatchOptimizedSequentialModel to extend the Sequential API).
The subclass provides a fit_batch() method with a nearly identical signature to the parent fit(), but does not perform stochastic mini-batch optimization. Instead, this method offloads all optimization to the model’s optimizer attribute, which must implement the method minimize() to perform training by minimizing the loss function provided during model compilation.
When a BatchOptimizedModel is compiled with a BatchOptimizer (or a string identifier for one) as its optimizer argument, the fit() method inherited from keras.Model is overridden with a pointer to fit_batch() (such that a BatchOptimizedModel may be trained with either stochastic or deterministic solvers, depending on how it’s compiled).
The ScipyBatchOptimizer class extends the BatchOptimizer interface and uses the scipy.optimize.minimize routine to fit the model.

At first glance this is more complicated than the recommended way of extending Keras to perform custom training (i.e. by overriding keras.Model.train_step(), as in the article Customizing what happens in fit()). However, we found extending train_step() to be awkward or infeasible for implementing a batch optimization algorithm while still making use of the standard Keras utilities for computing validation metrics at the end of each iteration (epoch). Overriding the model’s train_step() (and putting the call to scipy.optimize.minimize inside it) would mean that, from the Keras model’s perspective, only a single epoch would be performed, such that validation metrics would only be computed at the very end of the optimization routine.
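The dispatch pattern described in the steps above can be sketched in plain Python. This is a toy illustration only: ToyModel and ToyBatchOptimizer are hypothetical stand-ins, not the kormos classes, and the methods just return strings so the control flow is visible.

```python
class ToyBatchOptimizer:
    """Hypothetical stand-in for an optimizer exposing minimize()."""

    def minimize(self, model, epochs):
        return f"full-batch training for up to {epochs} iterations"


class ToyModel:
    """Hypothetical stand-in for a keras.Model subclass."""

    def fit(self, epochs=1):
        return f"stochastic mini-batch training for {epochs} epochs"

    def fit_batch(self, epochs=1):
        # Delegate the whole optimization to the optimizer
        return self.optimizer.minimize(self, epochs)

    def compile(self, optimizer):
        self.optimizer = optimizer
        if isinstance(optimizer, ToyBatchOptimizer):
            # Shadow the class-level fit() with fit_batch() on this instance
            self.fit = self.fit_batch
        else:
            # Drop any instance-level override so the class-level fit() is used
            self.__dict__.pop("fit", None)


model = ToyModel()
model.compile(ToyBatchOptimizer())
print(model.fit(epochs=10))  # dispatches to fit_batch()
model.compile("adam")
print(model.fit(epochs=10))  # back to the class-level fit()
```

Because the swap happens at compile time, callers keep using fit() either way, which is what lets utilities built around fit() keep working.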

API Reference

Models

class kormos.models.BatchOptimizedModel(*args, **kwargs)[source]

A keras.Model allowing parameter optimization by either stochastic or deterministic (full batch) algorithms.

Optimization may be performed on a BatchOptimizedModel using either the standard stochastic optimizers in keras.optimizers.* or a deterministic kormos.optimizers.BatchOptimizer. The optimizer argument to the model’s compile method determines which training procedure is used.

For instance, the same model may be optimized either by SGD or by L-BFGS-B as follows:

from tensorflow import keras
from kormos.models import BatchOptimizedSequentialModel

# Create an Ordinary Least Squares regressor
model = BatchOptimizedSequentialModel()
model.add(keras.layers.Dense(
  units=1,
  input_shape=(5,),
))

# compile the model for stochastic optimization
model.compile(loss=keras.losses.MeanSquaredError(), optimizer="sgd")
model.fit(...)

# compile the model for deterministic optimization using scipy.optimize.minimize
model.compile(loss=keras.losses.MeanSquaredError(), optimizer="L-BFGS-B")
model.fit(...)

compile(**kwargs)[source]

Configure the model for training.

If the optimizer argument is specified as one of the keras.optimizers.* (or a string identifier thereof), then this method will simply call the parent method keras.Model.compile with the arguments provided. Subsequent calls to model.fit will perform training using the standard keras.Model.fit method.

If the optimizer argument is specified as a valid kormos.optimizers.BatchOptimizer (or a valid string identifier to create one using kormos.optimizers.get()), then the model will be configured to map calls to fit to the method fit_batch.

Keyword Arguments

optimizer – A keras.optimizers.Optimizer or string identifier, or a kormos.optimizers.BatchOptimizer or string identifier
**kwargs – all other kwargs as passed to keras.Model.compile

fit_batch(x=None, y=None, batch_size=None, epochs=20, verbose='auto', callbacks=None, validation_split=0.0, validation_data=None, shuffle=False, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None, validation_batch_size=None, validation_freq=1, max_queue_size=10, workers=1, use_multiprocessing=False, pretrain_fn=None, **kwargs)[source]

Train the model using a batch optimization algorithm for up to the maximum number of iterations (epochs), and return the optimization history object.

This method will prepare the training dataset and execution context, and then call the model’s optimizer’s minimize method (please note that the optimizer attribute must implement the kormos.optimizers.BatchOptimizer interface). The optimizer should implement a gradient-based optimization algorithm that uses the entire set of training data to make parameter updates.

This is different than the standard keras model training procedure in which mini-batches (sample subsets) of the training dataset are created and the optimizer makes updates to the model parameters with a stochastic algorithm using only that subset.

The arguments to this method have the same meaning as for the keras.models.Model.fit method; please see the documentation there except for the arguments noted below.

Parameters

batch_size (int) – the number of training examples to process in a single “mini-batch” if the training dataset provided to this method must be converted to a tensorflow.data.Dataset. If unspecified and the training data provided must be converted to a tensorflow.data.Dataset, the value in this model’s optimizer (of type BatchOptimizer) will be used. Please note the choice of batch_size here is different than in stochastic optimization, in which the optimizer typically performs best with small mini-batches on the order of 2^4 or 2^5. The batch_size here should be determined based on the available machine memory, and good values will be on the order of 2^12 or larger, depending on the memory requirements for the evaluation of the model (memory limitations are more likely to be encountered if second order algorithms like the Newton-CG method are used).
pretrain_fn (callable) – univariate function that accepts the model as its argument and performs an arbitrary operation on it. Or, a list of such functions if the model’s optimizer accepts it. The pretrain_fn will be applied prior to the start of the optimization routine.

Keyword Arguments

**kwargs – keyword arguments to be passed to this model’s optimizer.minimize(**kwargs) method

Returns

A History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

Raises

RuntimeError – if the model was never compiled, or if model.fit is wrapped in tf.function.
ValueError – in case of a mismatch between the provided input data and what the model expects, or when the input data is empty.

class kormos.models.BatchOptimizedSequentialModel(*args, **kwargs)[source]

Optimizers

class kormos.optimizers.BatchOptimizer(name='batchoptimizer', batch_size=1024, max_cache_entries=10, dtype=tf.float64)[source]

An optimizer class providing function callables for minimization by batch optimization algorithms.

Codes implementing batch minimization routines, such as scipy.optimize.minimize, typically have a call signature like the following:

minimize(
  fun,        # callable, the objective function f(x) to minimize
  ...
  jac=None,   # callable, the jacobian of f(x)
  hessp=None, # callable implementing a hessian/vector product for the hessian of f(x)
  ...
)

The BatchOptimizer factory class provides methods with the appropriate signature for the above callables:

fun, in the method func(x) or func_and_grad(x)
jac, in the method grad(x) or func_and_grad(x)
hessp, in the method hessp(x, vector)
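As a rough sketch of how callables with these signatures plug into scipy, the example below passes a func_and_grad-style method and a hessp method to scipy.optimize.minimize. The QuadraticObjective class is hypothetical and only mimics the method names listed above on a tiny quadratic; it is not the kormos implementation.

```python
import numpy as np
from scipy.optimize import minimize


class QuadraticObjective:
    """Hypothetical objective f(x) = 0.5 x^T A x - b^T x, for illustration only."""

    def __init__(self, A, b):
        self.A, self.b = A, b

    def func_and_grad(self, x):
        # Return the pair (f(x), grad f(x)) from a single evaluation
        g = self.A @ x - self.b
        return 0.5 * x @ self.A @ x - self.b @ x, g

    def hessp(self, x, vector):
        # The Hessian of a quadratic is A, independent of x
        return self.A @ vector


A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
obj = QuadraticObjective(A, b)

# jac=True tells scipy that fun returns both the value and the gradient
res = minimize(obj.func_and_grad, x0=np.zeros(2), jac=True,
               hessp=obj.hessp, method="Newton-CG")
print(res.x, np.linalg.solve(A, b))
```

Returning the value and gradient together avoids evaluating the model twice per iterate, which matters when each evaluation is a full pass over the dataset.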

This class implements the callables above for a provided keras.Model and training dataset under a full-batch, gradient-based execution paradigm, in which every sample in the provided dataset is used to compute the loss function and gradient values.

The BatchOptimizer class implements the function and gradient by evaluating them over mini-batches of data and accumulating the result of each minibatch such that the computation is robust even when the training dataset is large or the model requires lots of RAM or GPU memory. To ensure the correct summation, the model’s loss function attribute must have the reduction=keras.losses.Reduction.SUM explicitly set (after which the total sum will be averaged).
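The accumulation scheme can be sketched in NumPy as follows, assuming a squared-error loss on synthetic data. The helper batch_sums is hypothetical; kormos performs the equivalent computation with TensorFlow on the Keras model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)
w = rng.normal(size=5)


def batch_sums(Xb, yb, w):
    """Return the SUM-reduced squared error and its gradient for one mini-batch."""
    r = Xb @ w - yb
    return r @ r, 2.0 * Xb.T @ r


batch_size = 128  # does not need to divide the dataset size evenly
loss_sum, grad_sum = 0.0, np.zeros_like(w)
for start in range(0, len(X), batch_size):
    l, g = batch_sums(X[start:start + batch_size], y[start:start + batch_size], w)
    loss_sum += l
    grad_sum += g

# Average once at the end, over the total number of samples
loss, grad = loss_sum / len(X), grad_sum / len(X)

# Reference: the same quantities computed in a single pass
full_loss, full_grad = batch_sums(X, y, w)
print(np.isclose(loss, full_loss / len(X)), np.allclose(grad, full_grad / len(X)))
```

Summing within each mini-batch and dividing once at the end gives the exact full-batch mean; averaging per-batch means instead would weight the shorter final batch incorrectly.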

Child classes extending this base class must implement minimize, which is the method that implements a gradient-based (or Hessian-based) optimization algorithm.

__init__(name='batchoptimizer', batch_size=1024, max_cache_entries=10, dtype=tf.float64)[source]

Construct a BatchOptimizer object, but do not build it. Please note that the build method must be called before parameter optimization may be performed on a model.

The provided batch_size here will only be used if build() is called with a training dataset that is not already adapted into a tensorflow.data.Dataset. In this case, the choice of batch_size here is different than in stochastic optimization, in which the optimizer typically performs best with small mini-batches on the order of 2^4 or 2^5. The batch_size here should be determined based on the available machine memory, and good values will be on the order of 2^12 or larger, depending on the memory requirements for the evaluation of the model (memory limitations are more likely to be encountered if second order algorithms like the Newton-CG method are used).

The dtype of a model’s backing tensors when converting to a numpy.ndarray may be explicitly specified. In general, tensorflow.float64 will make the optimization routine more numerically robust for loss functions with flat regions, and as such double precision is recommended for the tensorflow model itself. While single precision (tensorflow.float32) is often used in deep learning applications, numerical minimization routines that use line searches may require double precision (tensorflow.float64) for the Wolfe convergence criteria (sufficient decrease) to be numerically resolvable. In addition, since scipy.optimize.minimize routines wrap Fortran codes that expect double precision, it is generally necessary to use double precision for robust operation with scipy. Altering this will [probably] only work on bespoke locally-compiled scipy distributions, and the kormos package isn’t tested against anything other than double precision. In reality this really shouldn’t even be exposed as an argument: just walk away and let’s forget the whole thing.
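The point about sufficient decrease can be seen with a two-line NumPy experiment. The loss value of 1.0 and the decrease of 1e-9 are arbitrary numbers chosen for illustration; the spacing between adjacent float32 values just below 1.0 is roughly 6e-8, so a decrease of this size disappears in single precision.

```python
import numpy as np

f0 = 1.0         # loss at the current iterate
decrease = 1e-9  # a small but genuine decrease along the search direction

for dtype in (np.float32, np.float64):
    f_old = dtype(f0)
    f_new = dtype(f0 - decrease)
    # A line search can only accept the step if it sees f_new < f_old
    print(dtype.__name__, "decrease resolvable:", bool(f_new < f_old))
```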

Parameters

batch_size (int) – the number of training examples to process in a single “mini-batch” if the training dataset provided to build() must be converted to a tensorflow.data.Dataset
max_cache_entries (int) – the number of previous loss/gradient/hessp values to cache during training.
dtype – the tensorflow dtype to use for weight matrices, defaults to tf.float64.

build(model, Xyw)[source]

Dynamically build this BatchOptimizer object to prepare for a minimization routine for parameter fitting.

The build process consists of 2 steps:

Store the size metadata about this object’s model’s trainable_weights such that we may convert between the stacked numpy.ndarray vector of parameters and the model’s list of multidimensional tensors.
Create the tensorflow function-wrapped callables to evaluate the loss function, the gradient, and the Hessian/vector product. These callables are built dynamically on each call to build, such that the model metadata may be used to create them. Tensorflow’s autograph tracing should occur only once on these callables.
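The first step, converting between a list of weight tensors and a single stacked parameter vector, might look like the following NumPy sketch. The weight shapes and the helper names stack and unstack are hypothetical; the actual build method works on the model’s trainable_weights.

```python
import numpy as np

# Hypothetical weights: a (3, 2) kernel and a (2,) bias
weights = [np.arange(6.0).reshape(3, 2), np.array([10.0, 20.0])]

# Size metadata recorded once, so the flat vector can be split up again later
shapes = [w.shape for w in weights]
sizes = [w.size for w in weights]
offsets = np.cumsum([0] + sizes)


def stack(weights):
    """Flatten a list of arrays into one 1-D parameter vector."""
    return np.concatenate([w.ravel() for w in weights])


def unstack(x):
    """Split a 1-D parameter vector back into arrays of the recorded shapes."""
    return [x[offsets[i]:offsets[i + 1]].reshape(shapes[i])
            for i in range(len(shapes))]


x = stack(weights)
restored = unstack(x)
print(x.shape, [r.shape for r in restored])
```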

This method has been partly adapted from Pi-Yueh Chuang’s MIT-licensed code, and this file retains the original 2019 copyright notice above.

Parameters

model (keras.Model) – the model to be optimized
Xyw (tensorflow.data.Dataset) – the training dataset for optimization. If a type other than Dataset is provided, this method will attempt to create one using the BatchOptimizer’s batch_size attribute. Providing a Dataset is recommended.

Raises

ValueError – if the model.loss.reduction is not keras.losses.reduction.Reduction.SUM

get_weights()[source]

Retrieve the model weights as a numpy.ndarray vector.

Returns: vector of parameters used by scipy.optimize.minimize
Return type: numpy.ndarray

set_weights(x)[source]

Set a keras model’s model.trainable_weights (by default, use this object’s model).

Parameters: x – a numpy.ndarray (or similar iterable) parameter vector

func_and_grad(x)[source]

Evaluate the loss function and its gradient.

This method will loop over mini-batches of the training data and compute the loss and gradient in a tensorflow.GradientTape for each mini-batch, such that models that require large amounts of memory or large training sets (or both) may be used robustly.

Readers familiar with the standard keras model training procedure will note the absence of regularization losses provided to self.model.compiled_loss. The regularization terms are computed separately from the loop snippet above, and then added to the loss & gradient values.

Please note that providing a sample_weight does not necessarily produce a weighted average over the dataset. Rather, like keras, the sample_weight is applied element-wise to each data sample, and the final total is divided by the number of training examples if the reduction mechanism is keras.losses.reduction.Reduction.SUM_OVER_BATCH_SIZE. If an actual weighted average of losses over a dataset is desired, it is the caller’s responsibility to ensure that the sum of the values in sample_weight is 1.
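
The difference between the two quantities can be checked with plain NumPy arithmetic. This is an illustrative calculation with made-up numbers, not a call into Keras or kormos.

```python
import numpy as np

per_sample_loss = np.array([1.0, 4.0, 9.0])
sample_weight = np.array([1.0, 2.0, 1.0])

# Element-wise weighting, then division by the number of samples
# (the SUM_OVER_BATCH_SIZE behaviour described above).
weighted = sample_weight * per_sample_loss
sum_over_batch_size = weighted.sum() / len(weighted)

# Rescaling the weights so that they sum to 1 and then summing
# gives an actual weighted average.
normalised = sample_weight / sample_weight.sum()
weighted_average = (normalised * per_sample_loss).sum()
```

Here sum_over_batch_size is 18 / 3 = 6.0, while weighted_average is 0.25 * 1 + 0.5 * 4 + 0.25 * 9 = 4.5.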

Parameters: x (numpy.ndarray) – vector at which to evaluate the loss function
Returns: value of the loss function and its gradient
Return type: (float, numpy.ndarray)
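
As a rough illustration of the accumulation pattern, the sketch below sums a least-squares loss and its gradient over mini-batches with NumPy and hands the combined callable to scipy. The data, the batch size and the loss are assumptions made for the example; the actual method works on a Keras model with tensorflow.GradientTape.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

def func_and_grad(w, batch_size=64):
    """Sum-of-squares loss and gradient, accumulated over mini-batches."""
    loss, grad = 0.0, np.zeros_like(w)
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        residual = Xb @ w - yb
        loss += 0.5 * np.sum(residual ** 2)   # summed, not averaged
        grad += Xb.T @ residual
    return loss, grad

# jac=True tells scipy that the callable returns (loss, gradient) together.
result = minimize(func_and_grad, x0=np.zeros(3), jac=True, method="L-BFGS-B")
```

Because each batch contributes a sum rather than a mean, the accumulated totals equal the full-batch loss and gradient regardless of the batch size.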

func(x)[source]

Evaluate the loss function at x.

Parameters: x (numpy.ndarray) – vector at which to evaluate the loss function
Returns: the value of the loss function at x
Return type: float

grad(x)[source]

Evaluate the gradient of the loss function at x.

Parameters: x (numpy.ndarray) – vector at which to evaluate the loss function
Returns: the gradient of the loss function at x
Return type: numpy.ndarray

hessp(x, vector)[source]

Evaluate the product of the Hessian matrix of the loss function evaluated at x with a given vector for use in second order optimization methods.

The implementation here has been adapted from the Hessian-vector-product benchmarking suite in tensorflow, available here.

The computation of the Hessian-vector product is performed over mini-batches and accumulated, as described in the docstring for the method func_and_grad.

Parameters

x (numpy.ndarray) – point at which to evaluate the loss function
vector (numpy.ndarray) – direction vector to project the Hessian onto

Returns

the Hessian vector product

Return type

numpy.ndarray
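
To show how a callable of this shape is consumed, here is a minimal sketch that passes a Hessian-vector product to scipy's Newton-CG solver. The quadratic objective and the matrix A are assumptions chosen so that the product is known in closed form; no tensorflow is involved.

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def func(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

def hessp(x, vector):
    # The Hessian of this quadratic is A at every x, so the product is
    # A @ vector. The solver only ever asks for products like this one.
    return A @ vector

result = minimize(func, x0=np.zeros(2), jac=grad, hessp=hessp,
                  method="Newton-CG")
expected = np.linalg.solve(A, b)          # closed-form minimiser
```

The solver never receives the Hessian itself, which is what makes this approach usable when the matrix would be too large to form.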

minimize(epochs=1, callback=None, pretrain_fn=None, **kwargs)[source]

class kormos.optimizers.ScipyBatchOptimizer(name='scipy', method=None, epochs=None, pretrain_fn=None, options=None, bounds=None, constraints=None, tol=None, **kwargs)[source]

A BatchOptimizer that wraps the scipy.optimize.minimize library for multivariate gradient-based optimization.

DEFAULT_METHOD = 'L-BFGS-B'[source]

__init__(name='scipy', method=None, epochs=None, pretrain_fn=None, options=None, bounds=None, constraints=None, tol=None, **kwargs)[source]

Initialize a ScipyBatchOptimizer and store any default arguments to pass to scipy.optimize.minimize. Arguments may be passed either here in the constructor or directly in minimize(), which takes precedence.

The ability to pass solver arguments to the constructor here is a mechanism for interoperability with KerasTuner, such that hyperparameter search over the solver configurations could be performed if desired.

Parameters

epochs (int or list[int]) – maximum number of iterations to perform. This will be passed as options['maxiter'] to scipy.
callback (callable) – univariate callable function to be executed at the end of each iteration. The argument to this function is the current np.ndarray parameter vector.
pretrain_fn (callable) – univariate callable function to be executed at the end of each iteration. The argument to this function is the model instance. This may be used in conjunction with a list-like method to modify a model or set certain model layers as trainable or not trainable in an alternating fashion.
method (str or list[str]) – method to use for optimization
options (dict or list[dict]) – dictionary payload of options to configure an optimization algorithm
bounds (sequence or scipy.optimize.Bounds) – Bounds on variables for Nelder-Mead, L-BFGS-B, TNC, SLSQP, Powell, and trust-constr methods. Please see the scipy.optimize.minimize docstring for details. Please note that the bounds specified are used as-is for successive calls to minimize, even if the pretrain_fn has modified the model’s architecture.
constraints – Constraints definition for the COBYLA, SLSQP and trust-constr solvers. Please see the scipy.optimize.minimize docstring for details. Please note that the constraints specified are used as-is for successive calls to minimize, even if the pretrain_fn has modified the model’s architecture.
tol (float) – Numerical tolerance for termination. When tol is specified, the selected solver sets the relevant solver-specific tolerance(s) equal to tol. For detailed control, set the solver-specific configuration value through options.

minimize(epochs=10, callback=None, pretrain_fn=None, method=None, options=None, bounds=None, constraints=None, tol=None)[source]

Minimize the loss function for a compiled model using an algorithm from scipy.optimize.minimize.

The arguments to this method replicate those of scipy.optimize.minimize and may be used analogously. However, it is also possible to pass list values for the following arguments:

epochs
method
options

When lists are provided, this will result in multiple successive calls to scipy.optimize.minimize using the method and options requested, allowing for combinations of algorithms to be used. The result of one method will be passed in a daisy chain as the initial value for the next method.
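
The daisy chain can be sketched directly against scipy, without kormos. The loop below is an assumption about the general pattern, not the library's code; it uses scipy's built-in Rosenbrock test function, and the method names and iteration counts are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

methods = ["L-BFGS-B", "BFGS"]
epochs = [10, 200]           # iteration cap for each stage

x = np.array([-1.2, 1.0])    # starting point
results = []
for method, maxiter in zip(methods, epochs):
    res = minimize(rosen, x, jac=rosen_der, method=method,
                   options={"maxiter": maxiter})
    results.append(res)
    x = res.x                # the result of one solver seeds the next
```

Each stage starts where the previous one stopped, so a cheap method can be used to get close before a different one takes over.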

Parameters

epochs (int or list[int]) – maximum number of iterations to perform. This will be passed as options['maxiter'] to scipy.
callback (callable) – univariate callable function to be executed at the end of each iteration. The argument to this function is the current np.ndarray parameter vector.
pretrain_fn (callable) – univariate callable function to be executed at the end of each iteration. The argument to this function is the model instance. This may be used in conjunction with a list-like method to modify a model or set certain model layers as trainable or not trainable in an alternating fashion.
method (str or list[str]) – method to use for optimization
options (dict or list[dict]) – dictionary payload of options to configure an optimization algorithm
bounds (sequence or scipy.optimize.Bounds) – Bounds on variables for Nelder-Mead, L-BFGS-B, TNC, SLSQP, Powell, and trust-constr methods. Please see the scipy.optimize.minimize docstring for details. Please note that the bounds specified are used as-is for successive calls to minimize, even if the pretrain_fn has modified the model’s architecture.
constraints – Constraints definition for the COBYLA, SLSQP and trust-constr solvers. Please see the scipy.optimize.minimize docstring for details. Please note that the constraints specified are used as-is for successive calls to minimize, even if the pretrain_fn has modified the model’s architecture.
tol (float) – Numerical tolerance for termination. When tol is specified, the selected solver sets the relevant solver-specific tolerance(s) equal to tol. For detailed control, set the solver-specific configuration value through options.

Returns

scipy.optimize.OptimizeResult
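
Since the callback, bounds and tol arguments are described as mirroring scipy, their behaviour can be previewed with scipy alone. The toy objective below is an assumption made for the example; the keyword names are those of scipy.optimize.minimize.

```python
import numpy as np
from scipy.optimize import minimize

def func_and_grad(x):
    # Without bounds, the minimum is at x = 2 in every coordinate.
    return np.sum((x - 2.0) ** 2), 2.0 * (x - 2.0)

history = []

def callback(xk):
    # Called by scipy once per iteration with the current parameter vector.
    history.append(np.copy(xk))

result = minimize(func_and_grad, x0=np.zeros(3), jac=True,
                  method="L-BFGS-B", bounds=[(0.0, 1.0)] * 3,
                  callback=callback, tol=1e-8)
```

With the box constraint in place the solver stops at the upper bound instead of the unconstrained minimum, and history holds one parameter vector per iteration.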

kormos.optimizers.get(identifier)[source]

Return an instantiated kormos.optimizers.BatchOptimizer.

Parameters: identifier (str or kormos.optimizers.BatchOptimizer) – A string identifier to use to create a BatchOptimizer. May be either ‘scipy’ or the name of a specific method implemented by scipy.optimize.minimize(method=<name>)

Caching Utilities

The kormos.optimizers.BatchOptimizer uses a simple dictionary cache for the func_and_grad method with the signature as described below. Please be aware that this cache utility is bespoke and not intended for downstream usage outside of the BatchOptimizer.
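
For orientation only, the idea of a bounded dictionary cache keyed on the parameter vector can be sketched as follows. TinyStateCache and the wrapper functions are hypothetical and are not the kormos implementation; the sketch assumes the point of the cache is that a loss and gradient computed together at some x can be returned later without being recomputed.

```python
import hashlib
from collections import OrderedDict

import numpy as np

class TinyStateCache:
    """Hypothetical bounded dictionary cache keyed on the parameter vector."""

    def __init__(self, max_entries=100):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def hash(x):
        # Hash the raw bytes so that equal vectors map to the same key.
        data = np.ascontiguousarray(x, dtype=np.float64).tobytes()
        return hashlib.sha1(data).hexdigest()

    def update(self, x, **kwargs):
        key = self.hash(x)
        self._store.setdefault(key, {}).update(kwargs)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # evict the oldest entry

    def get(self, x, key):
        return self._store.get(self.hash(x), {}).get(key)

calls = {"n": 0}

def expensive_func_and_grad(x):
    calls["n"] += 1
    return float(np.sum(x ** 2)), 2.0 * x

cache = TinyStateCache(max_entries=2)

def cached_value(x, key):
    # Evaluate once per x, then serve both loss and gradient from the cache.
    if cache.get(x, key) is None:
        loss, grad = expensive_func_and_grad(x)
        cache.update(x, loss=loss, grad=grad)
    return cache.get(x, key)

x = np.array([1.0, 2.0])
value = cached_value(x, "loss")
gradient = cached_value(x, "grad")   # served from the cache
```

In this sketch the expensive function runs once even though the loss and the gradient are requested separately.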

class kormos.utils.cache.OptimizationStateCache(max_entries=100)[source]

__init__(max_entries=100)[source]

static hash(x)[source]

clear()[source]

update(x, **kwargs)[source]

get(x, key)[source]

get_last(key, k=-1)[source]

static cached(key, max_entries=10, cache_name='cache')[source]
