Author: Kuriko Iwai
This article explores how to use Bayesian Optimization to tune hyperparameters of deep learning models (a Keras Sequential model), in comparison with a traditional approach, Grid Search.
Bayesian Optimization is a sequential design strategy for global optimization of black-box functions.
It is particularly well-suited for functions that are expensive to evaluate, lack an analytical form, or have unknown derivatives.
In the context of hyperparameter optimization, the unknown function is typically a validation metric (e.g., validation loss or recall) viewed as a function of the model's hyperparameters.
Unlike traditional optimization methods that require many direct function evaluations or gradient information, Bayesian Optimization builds and refines a probabilistic model of the objective function, using this model to intelligently select the next evaluation point.
The core idea revolves around two key components:
The model approximates the unknown objective function f(x) with a surrogate model such as a Gaussian Process (GP).
A GP is a non-parametric Bayesian model that defines a distribution over functions. For any unobserved point, it provides both a mean prediction and an uncertainty estimate.
Mathematically, for a Gaussian Process, the prediction at an unobserved point x∗, given observed data (X, y), is normally distributed:

f(x∗) | X, y, x∗ ~ N(μ(x∗), σ²(x∗))

where μ(x∗) is the posterior mean and σ²(x∗) is the posterior variance (uncertainty) of the prediction at x∗.
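To make the surrogate idea concrete, here is a minimal sketch using scikit-learn's GaussianProcessRegressor (an illustrative choice, not KerasTuner's internal surrogate implementation) that exposes the posterior mean μ(x∗) and standard deviation σ(x∗) at unobserved points:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# a toy "expensive" objective, standing in for a validation metric over one hyperparameter
def objective(x):
    return np.sin(3 * x) + 0.5 * x

# a handful of observed points (X, y)
X_obs = np.array([[0.2], [0.9], [1.7], [2.5]])
y_obs = objective(X_obs).ravel()

# fit the GP surrogate to the observations
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# posterior mean mu(x*) and standard deviation sigma(x*) at unobserved points
X_new = np.linspace(0.0, 3.0, 5).reshape(-1, 1)
mu, sigma = gp.predict(X_new, return_std=True)
print(mu, sigma)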
The acquisition function determines the next point x_(t+1) to evaluate by quantifying how “promising” a candidate point is for improving the objective function, balancing exploration (sampling where the surrogate is uncertain) and exploitation (sampling where the predicted mean is favorable).
Common acquisition functions include the following three; a combined numerical sketch is given after their definitions.
Probability of Improvement (PI)
PI selects the point with the highest probability of improving upon the current best observed value f(x+):

PI(x) = Φ((μ(x) − f(x+) − ξ) / σ(x))

where Φ is the cumulative distribution function (CDF) of the standard normal distribution, and ξ ≥ 0 controls the trade-off between exploration and exploitation; a larger ξ encourages more exploration.
Expected Improvement (EI)
EI quantifies the expected amount of improvement over the current best observed value:

EI(x) = E[max(f(x) − f(x+), 0)]

Assuming a Gaussian Process surrogate, the analytical form of EI is:

EI(x) = (μ(x) − f(x+)) Φ(Z) + σ(x) ϕ(Z), with Z = (μ(x) − f(x+)) / σ(x) and EI(x) = 0 when σ(x) = 0

where Φ is the cumulative distribution function (CDF) and ϕ is the probability density function (PDF) of the standard normal distribution.
EI is one of the most widely used acquisition functions because, unlike PI, it accounts for the magnitude of the improvement, not just its probability.
Upper Confidence Bound (UCB)
UCB balances exploitation (high mean) and exploration (high variance), focusing on points that have both a high predicted mean and high uncertainty:

UCB(x) = μ(x) + κ σ(x)

where κ ≥ 0 is a tuning parameter that controls the balance between exploration and exploitation.
A larger κ puts more emphasis on exploring uncertain regions.
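The sketch below (an illustrative example, not KerasTuner's internal code) computes all three acquisition functions from a surrogate's posterior mean μ(x), standard deviation σ(x), and the current best observed value f(x+), using scipy.stats.norm:

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x))
    z = (mu - f_best - xi) / np.maximum(sigma, 1e-12)
    return norm.cdf(z)

def expected_improvement(mu, sigma, f_best):
    # EI(x) = (mu(x) - f(x+)) * Phi(Z) + sigma(x) * phi(Z)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # UCB(x) = mu(x) + kappa * sigma(x)
    return mu + kappa * sigma

# posterior statistics at three candidate points (toy numbers)
mu = np.array([0.70, 0.65, 0.55])
sigma = np.array([0.05, 0.15, 0.30])
f_best = 0.68  # current best observed value f(x+)

print(probability_of_improvement(mu, sigma, f_best))
print(expected_improvement(mu, sigma, f_best))
print(upper_confidence_bound(mu, sigma))

With these toy numbers, PI favors the safe first candidate, while EI and UCB prefer the more uncertain candidates, showing how the choice of acquisition function shifts the exploration-exploitation balance.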
Bayesian Optimization iteratively updates the surrogate model and optimizes the acquisition function.
It guides the search towards optimal regions while minimizing the number of expensive objective function evaluations.
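Putting the two components together, here is a compact sketch of the full loop on a toy one-dimensional problem (again illustrative, using a scikit-learn GP and a UCB acquisition maximized over a candidate grid rather than KerasTuner's internals):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # stand-in for an expensive black-box function (e.g., a validation metric)
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
candidates = np.linspace(0, 3, 200).reshape(-1, 1)

# Step 1: a few random initial observations
X_obs = rng.uniform(0, 3, size=(5, 1))
y_obs = objective(X_obs).ravel()

kappa = 2.0
for t in range(10):
    # Step 2: refit the surrogate on all observations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)

    # Step 3: maximize the acquisition function (UCB) over the candidate grid
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + kappa * sigma)].reshape(1, -1)

    # Step 4: evaluate the expensive objective at the proposed point
    y_next = objective(x_next).ravel()

    # Step 5: add the new observation and repeat
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.concatenate([y_obs, y_next])

print("best observed value:", y_obs.max())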
Now, let us see the process with code snippets, using KerasTuner for a fraud detection task (binary classification where the fraud class, y = 1, costs us the most when misclassified).
Step 1: Initialization

Initialize the process by sampling the hyperparameter space randomly or with a low-discrepancy sequence (usually 5 to 10 points) to get an idea of the objective function.
These initial observations are used to build the first version of the surrogate model.
As we build a Keras Sequential model, we first define and compile the model, then define the BayesianOptimization tuner with the number of initial points to assess.
import keras_tuner as kt
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

# custom hypermodel (class name is illustrative): builds and compiles a fresh model per trial
class FraudHyperModel(kt.HyperModel):
    def __init__(self, input_shape, initial_bias_value, class_weights_dict):
        self.input_shape = input_shape
        self.initial_bias_value = initial_bias_value
        self.class_weights_dict = class_weights_dict

    def build(self, hp):
        # initialize a Keras Sequential model with tunable layer widths and dropout rates
        model = Sequential([
            Input(shape=(self.input_shape,)),
            Dense(
                units=hp.Int('neurons1', min_value=20, max_value=60, step=10),
                activation='relu'
            ),
            Dropout(
                hp.Float('dropout_rate1', min_value=0.0, max_value=0.5, step=0.1)
            ),
            Dense(
                units=hp.Int('neurons2', min_value=20, max_value=60, step=10),
                activation='relu'
            ),
            Dropout(
                hp.Float('dropout_rate2', min_value=0.0, max_value=0.5, step=0.1)
            ),
            Dense(
                1, activation='sigmoid',
                # initialize the output bias to reflect the class imbalance
                bias_initializer=keras.initializers.Constant(self.initial_bias_value)
            )
        ])

        # the original snippet does not show how `optimizer` is defined;
        # Adam is assumed here, mirroring the Grid Search comparison below
        optimizer = keras.optimizers.Adam()

        # compile the model, tracking recall alongside the other metrics
        model.compile(
            optimizer=optimizer,
            loss='binary_crossentropy',
            metrics=[
                'accuracy',
                keras.metrics.Precision(name='precision'),
                keras.metrics.Recall(name='recall'),
                keras.metrics.AUC(name='auc')
            ]
        )
        return model
# instantiate the custom hypermodel (input_shape, initial_bias, and class_weights_dict
# are computed from the training data, as shown in the Grid Search section below)
custom_hypermodel = FraudHyperModel(
    input_shape=input_shape,
    initial_bias_value=initial_bias,
    class_weights_dict=class_weights_dict
)

# define a tuner with the initial points
tuner = kt.BayesianOptimization(
    hypermodel=custom_hypermodel,
    objective=kt.Objective("val_recall", direction="max"),
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    directory=directory,
    project_name=project_name,
    num_initial_points=num_initial_points,
    overwrite=True,
)
num_initial_points defines how many initial, randomly selected hyperparameter configurations are evaluated before the algorithm starts to guide the search. If not given, KerasTuner uses a default value of 3 times the dimensionality of the hyperparameter space.
Step 2: Build and update the surrogate model

Build and train the probabilistic model (the surrogate model, often a Gaussian Process or a Tree-structured Parzen Estimator) using all available observed data points (input values and their corresponding output values) to approximate the true function.
The surrogate model provides the mean prediction (μ(x)) (most likely from the Gaussian process) and uncertainty (σ(x)) for any unobserved point.
KerasTuner uses an internal surrogate model to model the relationship between hyperparameters and the objective function’s performance.
After each objective function evaluation via a training run, the observed data points (hyperparameters and validation metrics) are used to update the internal surrogate model.
Step 3: Optimize the acquisition function

Use an optimization algorithm (often a cheap, local optimizer like L-BFGS or even random search) to find the next point x_(t+1) that maximizes the chosen acquisition function.
This step is crucial because it identifies the most promising next candidate for evaluation by balancing exploration (trying new, uncertain areas of the hyperparameter space) and exploitation (refining promising areas).
KerasTuner maximizes an acquisition function such as Expected Improvement or Upper Confidence Bound over its surrogate model to propose the next set of hyperparameters.
Step 4: Evaluate the objective function

Evaluate the true, expensive objective function f(x) at the new candidate point x_(t+1).
The Keras model is trained on the provided training data and evaluated on the validation data. We use val_recall as the result of this evaluation.
# fit() method of the same custom HyperModel: tunes batch size and epochs per trial
def fit(self, hp, model=None, *args, **kwargs):
    # build the model from the trial's hyperparameters if one was not supplied
    model = self.build(hp=hp) if not model else model
    batch_size = hp.Choice('batch_size', values=[16, 32, 64])
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)
    return model.fit(
        *args,
        batch_size=batch_size,
        epochs=epochs,
        # weight the minority (fraud) class more heavily during training
        class_weight=self.class_weights_dict,
        **kwargs
    )
Step 5: Update the observations

Add the newly observed data point (x_(t+1), f(x_(t+1))) to the set of observations.
Repeat Steps 2 to 5 until a stopping criterion is met.
Technically, the tuner.search() method orchestrates the entire Bayesian optimization process from Step 2 to Step 5:
tuner.search(
X_train, y_train,
validation_data=(X_val, y_val),
callbacks=[early_stopping_callback]
)
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
best_keras_model_from_tuner = tuner.get_best_models(num_models=1)[0]
The method repeats these steps until the max_trials limit is reached, while callbacks such as early_stopping_callback can end individual training runs early.
Here, we set recall as our key metric to penalize misclassification, because a False Negative (a missed fraud case) costs us the most in the fraud detection setting.
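The early_stopping_callback passed to tuner.search() (and later to grid_search.fit()) is not defined in the snippets above; a minimal sketch, assuming it monitors validation recall to match the tuning objective, might look like this:

from tensorflow import keras

# stop an individual training run once validation recall stops improving
early_stopping_callback = keras.callbacks.EarlyStopping(
    monitor='val_recall',
    mode='max',                 # recall should be maximized
    patience=10,                # tolerate a few stagnant epochs before stopping
    restore_best_weights=True   # keep the weights from the best epoch
)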
Learn More: KerasTuner Source Code
The Bayesian Optimization process aimed to enhance the model’s performance, primarily by maximizing recall.
The tuning efforts yielded a trade-off across key metrics, resulting in a model with significantly improved recall at the expense of some precision and overall accuracy compared to the initial state:
Best performing hyperparameter set:
Optimal Neural Network Summary:
Key Performance Metrics:
For comparison, we tuned a Keras Sequential model with Grid Search, using the Adam optimizer:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from sklearn.model_selection import GridSearchCV
from sklearn.utils import class_weight
from scikeras.wrappers import KerasClassifier
param_grid = {
'model__learning_rate': [0.001, 0.0005, 0.0001],
'model__neurons1': [20, 30, 40],
'model__neurons2': [20, 30, 40],
'model__dropout_rate1': [0.1, 0.15, 0.2],
'model__dropout_rate2': [0.1, 0.15, 0.2],
'batch_size': [16, 32, 64],
'epochs': [50, 100],
}
input_shape = X_train.shape[1]

# output-layer bias initialized to the log-odds of the positive (fraud) class
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])

# balanced class weights to offset the skewed class distribution
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))
# wrap the Keras model builder (create_model, defined elsewhere) for scikit-learn compatibility
keras_classifier = KerasClassifier(
model=create_model,
model__input_shape=input_shape,
model__initial_bias_value=initial_bias,
loss='binary_crossentropy',
metrics=[
'accuracy',
keras.metrics.Precision(name='precision'),
keras.metrics.Recall(name='recall'),
keras.metrics.AUC(name='auc')
]
)
# exhaustive search over the grid, scored by recall with 3-fold cross-validation
grid_search = GridSearchCV(
estimator=keras_classifier,
param_grid=param_grid,
scoring='recall',
cv=3,
n_jobs=-1,
error_score='raise'
)
grid_result = grid_search.fit(
X_train, y_train,
validation_data=(X_val, y_val),
callbacks=[early_stopping_callback],
class_weight=class_weights_dict
)
# retrieve the best hyperparameter set and the refitted best estimator
optimal_params = grid_result.best_params_
best_keras_classifier = grid_result.best_estimator_
Grid Search tuning resulted in a model with strong precision and good overall accuracy, though with a lower recall compared to the Bayesian Optimization approach:
Best performing hyperparameter set:
Optimal Neural Network Summary:
Grid Search Performance:
Comparison with Bayesian Optimization:
The experiment effectively demonstrated the distinct strengths of Bayesian Optimization and Grid Search in hyperparameter tuning.
Bayesian Optimization, by design, proved highly effective at intelligently navigating the search space and prioritizing a specific objective, in this case, maximizing recall.
It successfully achieved a higher recall rate (0.8400) than Grid Search, indicating its ability to identify more of the positive (fraud) instances.
This capability comes with an inherent trade-off, leading to reduced precision and overall accuracy.
Such an outcome is highly valuable in applications where minimizing false negatives is critical (e.g., medical diagnosis, fraud detection).
Its efficiency, stemming from probabilistic modeling that guides the search towards promising areas, makes it a preferred method for optimizing costly experiments or simulations where each evaluation is expensive.
In contrast, Grid Search, while exhaustive, yielded a more balanced model with superior precision (0.8304) and overall accuracy (0.7825).
This suggests Grid Search was more conservative in its predictions, resulting in fewer false positives.
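To put the cost of that exhaustiveness in numbers, here is a quick, illustrative calculation based on the param_grid and cv=3 defined above (the Bayesian Optimization budget depends on max_trials, which is left as a variable in the snippets):

# number of hyperparameter combinations in the Grid Search param_grid above:
# learning_rate (3) x neurons1 (3) x neurons2 (3) x dropout_rate1 (3) x dropout_rate2 (3)
# x batch_size (3) x epochs (2)
grid_combinations = 3 * 3 * 3 * 3 * 3 * 3 * 2
cv_folds = 3
total_grid_fits = grid_combinations * cv_folds
print(grid_combinations, total_grid_fits)  # 1458 combinations, 4374 cross-validation fits

# Bayesian Optimization, by contrast, trains only max_trials * executions_per_trial
# models, typically a few dozen, regardless of how finely each range is discretized.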
In summary, while Grid Search offers a straightforward and exhaustive approach, Bayesian Optimization stands out as a more sophisticated and efficient method capable of finding superior results with fewer evaluations, particularly when optimizing for a specific, often complex, objective like maximizing recall in a high-dimensional space.
The optimal choice of tuning method ultimately depends on the specific performance priorities and resource constraints of the application.
Author: Kuriko IWAI
Portfolio / LinkedIn / Github
May 26, 2025
All images, unless otherwise noted, are by the author.
The article utilizes synthetic data, licensed under Apache 2.0 for commercial use.