
Bayesian Optimization for Hyperparameter Tuning of Deep Learning Models | Towards Data Science


This article explores how to use Bayesian Optimization to tune the hyperparameters of deep learning models (a Keras Sequential model), in comparison with a traditional approach, Grid Search.

Bayesian Optimization

Bayesian Optimization is a sequential design strategy for global optimization of black-box functions.

It is particularly well-suited for functions that are expensive to evaluate, lack an analytical form, or have unknown derivatives.
In the context of hyperparameter optimization, the unknown function can be:

  • an objective function,
  • accuracy value for a training or validation set,
  • loss value for a training or validation set,
  • entropy gained or lost,
  • AUC for ROC curves,
  • A/B test results,
  • computation cost per epoch,
  • model size,
  • reward amount for reinforcement learning, and more.

Unlike traditional optimization methods that rely on direct function evaluations, Bayesian Optimization builds and refines a probabilistic model of the objective function, using this model to intelligently select the next evaluation point.

The core idea revolves around two key components:

1. Surrogate Model (Probabilistic Model)

The surrogate model approximates the unknown objective function f(x); a common choice is a Gaussian Process (GP).

A GP is a non-parametric Bayesian model that defines a distribution over functions. It provides:

  • a prediction of the function value at a given point μ(x) and
  • a measure of uncertainty around that prediction σ(x), often represented as a confidence interval.

Mathematically, for a Gaussian Process, the prediction at an unobserved point x∗, given observed data (X, y), is normally distributed:

f(x∗) | X, y ∼ N(μ(x∗), σ²(x∗))

where

  • μ(x∗): the mean prediction and
  • σ²(x∗): the predictive variance.
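To make μ(x) and σ(x) concrete, here is a minimal, self-contained sketch that fits a Gaussian Process surrogate with scikit-learn's GaussianProcessRegressor (an illustration only; this is not the surrogate implementation KerasTuner uses internally):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# a handful of observations (X, y) of an expensive black-box function f(x)
X_obs = np.array([[0.1], [0.4], [0.7], [0.9]])
y_obs = np.sin(3 * X_obs).ravel()

# fit the GP surrogate to the observed data
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# posterior mean mu(x*) and uncertainty sigma(x*) at unobserved points
X_new = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
mu, sigma = gp.predict(X_new, return_std=True)

Each candidate point thus receives both a predicted value and a confidence estimate, which is exactly what the acquisition function needs.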

2. Acquisition Function

The acquisition function determines the next point x_(t+1) to evaluate by quantifying how “promising” a candidate point is for improving the objective function, balancing:

  • Exploration (High Variance): Sampling in areas with high uncertainty to discover new promising regions and
  • Exploitation (High Mean): Sampling in areas where the surrogate model predicts high objective values.

Common acquisition functions include:

Probability of Improvement (PI)
PI selects the point that has the highest probability of improving upon the current best observed value f(x+):

PI(x) = Φ((μ(x) − f(x+) − ξ) / σ(x))

where

  • Φ: the cumulative distribution function (CDF) of the standard normal distribution, and
  • ξ≥0 is a trade-off parameter (exploration vs. exploitation).

ξ controls a trade-off between exploration and exploitation, and a larger ξ encourages more exploration.

Expected Improvement (EI)
EI quantifies the expected amount of improvement over the current best observed value f(x+):

EI(x) = E[max(f(x) − f(x+), 0)]

Assuming a Gaussian Process surrogate, the analytical form of EI is:

EI(x) = (μ(x) − f(x+) − ξ) Φ(Z) + σ(x) ϕ(Z),  with Z = (μ(x) − f(x+) − ξ) / σ(x) when σ(x) > 0, and EI(x) = 0 when σ(x) = 0,

where ϕ is the probability density function (PDF) of the standard normal distribution.

EI is one of the most widely used acquisition functions; unlike PI, it also accounts for the magnitude of the improvement, not just its probability.

Upper Confidence Bound (UCB)
UCB balances exploitation (high mean) and exploration (high variance), focusing on points that have both a high predicted mean and high uncertainty:

UCB(x) = μ(x) + κ σ(x)

κ≥0 is a tuning parameter that controls the balance between exploration and exploitation.

A larger κ puts more emphasis on exploring uncertain regions.
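All three acquisition functions can be computed in a few lines from the surrogate's μ(x) and σ(x). A minimal sketch (the helper name acquisition_scores is mine, not a library function):

import numpy as np
from scipy.stats import norm

def acquisition_scores(mu, sigma, f_best, xi=0.01, kappa=2.0):
    """Return PI, EI, and UCB scores given surrogate predictions mu and sigma."""
    sigma = np.maximum(sigma, 1e-12)                 # guard against zero variance
    z = (mu - f_best - xi) / sigma
    pi = norm.cdf(z)                                 # Probability of Improvement
    ei = (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
    ucb = mu + kappa * sigma                         # Upper Confidence Bound
    return pi, ei, ucb

With the GP sketch from the previous section, acquisition_scores(mu, sigma, f_best=y_obs.max()) scores every candidate, and the point with the highest score becomes the next evaluation.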

Bayesian Optimization Strategy (Iterative Process)

Bayesian Optimization iteratively updates the surrogate model and optimizes the acquisition function.

It guides the search towards optimal regions while minimizing the number of expensive objective function evaluations.

Now, let us walk through the process with code snippets, using KerasTuner for a fraud detection task (a binary classification problem in which missing y=1 (fraud) cases costs us the most).

Step 1. Initialization

Initialize the process by sampling the hyperparameter space randomly or with a low-discrepancy sequence (usually 5 to 10 points) to get a first picture of the objective function.

These initial observations are used to build the first version of the surrogate model.

As we build a Keras Sequential model, we first define and compile the model, then define the BayesianOptimization tuner with the number of initial points to assess.

import keras_tuner as kt
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

# build() of a custom kt.HyperModel subclass: defines and compiles the Keras
# Sequential model from the hyperparameter space `hp`. The attributes
# self.input_shape and self.initial_bias_value are set elsewhere in the class.
def build(self, hp):
    # initialize a Keras Sequential model
    model = Sequential([
        Input(shape=(self.input_shape,)),
        Dense(
            units=hp.Int('neurons1', min_value=20, max_value=60, step=10),
            activation='relu'
        ),
        Dropout(
            hp.Float('dropout_rate1', min_value=0.0, max_value=0.5, step=0.1)
        ),
        Dense(
            units=hp.Int('neurons2', min_value=20, max_value=60, step=10),
            activation='relu'
        ),
        Dropout(
            hp.Float('dropout_rate2', min_value=0.0, max_value=0.5, step=0.1)
        ),
        Dense(
            1, activation='sigmoid',
            bias_initializer=keras.initializers.Constant(self.initial_bias_value)
        )
    ])

    # compile the model
    # `optimizer` is constructed earlier in build() from tunable choices
    # (e.g. optimizer_name, learning_rate); its definition is omitted here
    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=[
            'accuracy',
            keras.metrics.Precision(name='precision'),
            keras.metrics.Recall(name='recall'),
            keras.metrics.AUC(name='auc')
        ]
    )
    return model

# define a tuner with the initial points
tuner = kt.BayesianOptimization(
    hypermodel=custom_hypermodel,
    objective=kt.Objective("val_recall", direction="max"),
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    directory=directory,
    project_name=project_name,
    num_initial_points=num_initial_points,
    overwrite=True,
)

num_initial_points defines how many initial, randomly selected hyperparameter configurations should be evaluated before the algorithm starts to guide the search.

If not given, KerasTuner takes a default value: 3 * dimensions of the hyperparameter space.

Step 2. Surrogate Model Training

Build and train the probabilistic model (the surrogate model, often a Gaussian Process or a Tree-structured Parzen Estimator) using all available observed data points (input values and their corresponding output values) to approximate the true function.

The surrogate model provides a mean prediction μ(x) (the most likely value under the Gaussian Process) and an uncertainty estimate σ(x) for any unobserved point.

KerasTuner uses an internal surrogate model to model the relationship between hyperparameters and the objective function’s performance.

After each objective function evaluation (a training run), the observed data points (hyperparameters and validation metrics) are used to update the internal surrogate model.
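KerasTuner handles this update internally, but the idea can be sketched by explicitly refitting a Gaussian Process on the trial history (hyperparameter vectors and their observed val_recall); the hyperparameter encoding and the metric values below are purely illustrative:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# trial history so far: [neurons1, dropout_rate1, neurons2, dropout_rate2]
X_trials = np.array([
    [40, 0.0, 20, 0.4],
    [30, 0.2, 40, 0.1],
    [50, 0.1, 30, 0.3],
])
y_val_recall = np.array([0.83, 0.78, 0.81])   # illustrative observed metrics

# refit the surrogate after every completed trial
surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
surrogate.fit(X_trials, y_val_recall)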

Step 3. Acquisition Function Optimization

Use an optimization algorithm (often a cheap, local optimizer like L-BFGS or even random search) to find the next point x_(t+1) that maximizes the chosen acquisition function.

This step is crucial because it identifies the most promising next candidate for evaluation by balancing exploration (trying new, uncertain areas of the hyperparameter space) and exploitation (refining promising areas).

KerasTuner uses an optimization strategy such as Expected Improvement or Upper Confidence Bound to find the next set of hyperparameters.
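Continuing the sketch above, a cheap way to approximate this step is to score a random pool of candidate configurations with Expected Improvement and pick the best one (KerasTuner uses its own internal optimizer; this is only an illustration):

from scipy.stats import norm

rng = np.random.default_rng(0)
candidates = np.column_stack([
    rng.integers(20, 61, size=1000),        # neurons1
    rng.uniform(0.0, 0.5, size=1000),       # dropout_rate1
    rng.integers(20, 61, size=1000),        # neurons2
    rng.uniform(0.0, 0.5, size=1000),       # dropout_rate2
])

mu, sigma = surrogate.predict(candidates, return_std=True)
sigma = np.maximum(sigma, 1e-12)            # guard against zero variance
z = (mu - y_val_recall.max()) / sigma
ei = (mu - y_val_recall.max()) * norm.cdf(z) + sigma * norm.pdf(z)

next_hp = candidates[np.argmax(ei)]         # the candidate x_(t+1) to train next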

Step 4. Objective Function Evaluation

Evaluate the true, expensive objective function f(x) at the new candidate point x_(t+1).

The Keras model is trained on the provided training data and evaluated on the validation data; the validation recall (val_recall) is the result of this evaluation.

# fit() of the custom HyperModel: builds the model if one is not passed in,
# then trains it with a tunable batch size and number of epochs
def fit(self, hp, model=None, *args, **kwargs):
    model = self.build(hp=hp) if model is None else model
    batch_size = hp.Choice('batch_size', values=[16, 32, 64])
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)

    return model.fit(
        batch_size=batch_size,
        epochs=epochs,
        class_weight=self.class_weights_dict,  # weight the rare fraud class more heavily
        *args,
        **kwargs
    )

Step 5. Data Update

Add the newly observed data point (x_(t+1), f(x_(t+1))) to the set of observations.

Step 6. Iteration

Repeat Steps 2 to 5 until a stopping criterion is met.

Technically, the tuner.search() method orchestrates the entire Bayesian optimization process from Step 2 to 5:

tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback]
)

best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
best_keras_model_from_tuner = tuner.get_best_models(num_models=1)[0]

The method repeatedly performs these steps until the max_trials limit is reached or other internal stopping criteria such as early_stopping_callback are met.
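The early_stopping_callback referenced above is not defined in the snippets; a typical definition (an assumption on my part, monitoring the same validation recall the tuner optimizes) could look like:

from tensorflow import keras

# hypothetical callback definition; monitoring val_recall mirrors the tuning objective
early_stopping_callback = keras.callbacks.EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=10,
    restore_best_weights=True
)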

Here, we set recall as our key metric to penalize the costliest misclassification: in the fraud detection case, false negatives (missed fraud cases) cost us the most.

Learn More: KerasTuner Source Code

Results

The Bayesian Optimization process aimed to enhance the model’s performance, primarily by maximizing recall.

The tuning efforts yielded a trade-off across key metrics, resulting in a model with significantly improved recall at the expense of some precision and overall accuracy compared to the initial state:

  • Recall: 0.9055 (0.6595 -> 0.6450) to 0.8400
  • Precision: 0.6831 (0.8338 -> 0.8113) to 0.6747
  • Accuracy: 0.7427 (0.7640 -> 0.7475) to 0.7175
    (from the development phase (training / validation combined) to the test phase)
History of Learning Rate in the Bayesian Optimization Process

Best performing hyperparameter set:

  • neurons1: 40
  • dropout_rate1: 0.0
  • neurons2: 20
  • dropout_rate2: 0.4
  • optimizer_name: lion
  • learning_rate: 0.004019639999963362
  • batch_size: 64
  • epochs: 200
  • beta_1_lion: 0.9
  • beta_2_lion: 0.99


Optimal Neural Network Summary:

Optimal Neural Network Summary (Bayesian Optimization)

Key Performance Metrics:

  • Recall: The model demonstrated a significant improvement in recall, increasing from an initial value of approximately 0.66 (or 0.645) to 0.8400. This indicates the optimized model is notably better at identifying positive cases.
  • Precision: Concurrently, precision experienced a decrease. Starting from around 0.83 (or 0.81), it settled at 0.6747 post-optimization. This suggests that while more positive cases are being identified, a higher proportion of those identifications might be false positives.
  • Accuracy: The overall accuracy of the model also saw a decline, moving from an initial 0.7640 (or 0.7475) down to 0.7175. This is consistent with the observed trade-off between recall and precision, where optimizing for one often impacts the others.

Comparing with Grid Search

For comparison, we tuned a Keras Sequential model with Grid Search, using the Adam optimizer:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from sklearn.model_selection import GridSearchCV
from sklearn.utils import class_weight
from scikeras.wrappers import KerasClassifier

param_grid = {
    'model__learning_rate': [0.001, 0.0005, 0.0001],
    'model__neurons1': [20, 30, 40],
    'model__neurons2': [20, 30, 40],
    'model__dropout_rate1': [0.1, 0.15, 0.2],
    'model__dropout_rate2': [0.1, 0.15, 0.2],
    'batch_size': [16, 32, 64],
    'epochs': [50, 100],
}

input_shape = X_train.shape[1]
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

keras_classifier = KerasClassifier(
    model=create_model,
    model__input_shape=input_shape,
    model__initial_bias_value=initial_bias,
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')
    ]
)

grid_search = GridSearchCV(
    estimator=keras_classifier,
    param_grid=param_grid,
    scoring='recall',
    cv=3,
    n_jobs=-1,
    error_score='raise'
)

grid_result = grid_search.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict
)

optimal_params = grid_result.best_params_
best_keras_classifier = grid_result.best_estimator_

Results

Grid Search tuning resulted in a model with strong precision and good overall accuracy, though with a lower recall compared to the Bayesian Optimization approach:

  • Recall: 0.8214 (0.7735 -> 0.7150) to 0.7100
  • Precision: 0.7884 (0.8331 -> 0.8034) to 0.8304
  • Accuracy: 0.8005 (0.8092 -> 0.7700) to 0.7825

Best performing hyperparameter set:

  • neurons1: 40
  • dropout_rate1: 0.15
  • neurons2: 40
  • dropout_rate2: 0.1
  • learning_rate: 0.001
  • batch_size: 16
  • epochs: 100


Optimal Neural Network Summary:

Optimal Neural Network Summary (GridSearch CV)
Evaluation During Training (Grid Search Tuning)
Evaluation During Validation (Grid Search Tuning)
Evaluation During Test (Grid Search Tuning)

Grid Search Performance:

  • Recall: Achieved a recall of 0.7100, a slight decrease from its initial range (0.7150–0.7735).
  • Precision: Showed robust performance at 0.8304, in line with its initial range (0.8034–0.8331).
  • Accuracy: Settled at 0.7825, maintaining a solid overall predictive capability, within its initial range (0.7700–0.8092).

Comparison with Bayesian Optimization:

  • Recall: Bayesian Optimization (0.8400) significantly outperformed Grid Search (0.7100) in identifying positive cases.
  • Precision: Grid Search (0.8304) achieved much higher precision than Bayesian Optimization (0.6747), indicating fewer false positives.
  • Accuracy: Grid Search’s accuracy (0.7825) was notably higher than Bayesian Optimization’s (0.7175).

General Comparison with Grid Search

1. Approaching the Search Space

Bayesian Optimization

  • Intelligent/Adaptive: Bayesian Optimization builds a probabilistic model (often a Gaussian Process) of the objective function (e.g., model performance as a function of hyperparameters). It uses this model to predict which hyperparameter combinations are most likely to yield better results.
  • Informed: It learns from previous evaluations. After each trial, the probabilistic model is updated, guiding the search towards more promising regions of the hyperparameter space. This allows it to make “intelligent” choices about where to sample next, balancing exploration (trying new, unknown regions) and exploitation (focusing on regions that have shown good results).
  • Sequential: It typically operates sequentially, evaluating one point at a time and updating its model before selecting the next.

Grid Search:

  • Exhaustive/Brute-force: Grid Search systematically tries every possible combination of hyperparameter values from a pre-defined set of values for each hyperparameter. You specify a “grid” of values, and it evaluates every point on that grid.
  • Uninformed: It doesn’t use the results of previous evaluations to inform the selection of the next set of hyperparameters to try. Each combination is evaluated independently.
  • Deterministic: Given the same grid, it will always explore the same combinations in the same order.

2. Computational Cost

Bayesian Optimization

  • More Efficient: Designed to find optimal hyperparameters with significantly fewer evaluations compared to Grid Search. This makes it particularly effective when evaluating the objective function (e.g., training a Machine Learning model) is computationally expensive or time-consuming.
  • Scalability: Generally scales better to higher-dimensional hyperparameter spaces than Grid Search, though it can still be computationally intensive for very high dimensions due to the overhead of maintaining and updating the probabilistic model.

Grid Search

  • Computationally Expensive: As the number of hyperparameters and the range of values for each hyperparameter increase, the number of combinations grows exponentially. This leads to very long run times and high computational cost, making it impractical for large search spaces. This is often referred to as the “curse of dimensionality.” (A quick count based on the grid used earlier in this article follows this list.)
  • Scalability: Does not scale well with high-dimensional hyperparameter spaces.
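To make this concrete, the grid defined earlier in this article already implies over a thousand configurations; a quick count (a sketch based on that param_grid and the cv=3 setting):

from math import prod

# sizes of each value list in the param_grid used above
grid_sizes = [3, 3, 3, 3, 3, 3, 2]   # learning_rate, neurons1, neurons2,
                                     # dropout_rate1, dropout_rate2, batch_size, epochs
n_combinations = prod(grid_sizes)    # 1458 hyperparameter combinations
n_fits = n_combinations * 3          # x 3 CV folds = 4374 model trainings
print(n_combinations, n_fits)

Bayesian Optimization, by contrast, is bounded by the max_trials setting of the tuner.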

3. Guarantees and Exploration

Bayesian Optimization

  • Probabilistic guarantee: It aims to find the global optimum efficiently, but it does not offer a hard guarantee like Grid Search for finding the absolute best within a discrete set. Instead, it converges probabilistically towards the optimum.
  • Smarter exploration: Its balance of exploration and exploitation helps it avoid getting stuck in local optima and discover optimal values more effectively.

Grid Search

  • Guaranteed to find best in grid: If the optimal hyperparameters are within the defined grid, Grid Search is guaranteed to find them because it tries every combination.
  • Limited exploration: It can miss optimal values if they fall between the discrete points defined in the grid.

4. When to Use Which

Bayesian Optimization:

  • Large, high-dimensional hyperparameter spaces: When evaluating models is expensive and you have many hyperparameters to tune.
  • When efficiency is paramount: To find good hyperparameters quickly, especially in situations with limited computational resources or time.
  • Black-box optimization problems: When the objective function is complex, non-linear, and doesn’t have a known analytical form.

Grid Search

  • Small, low-dimensional hyperparameter spaces: When you have only a few hyperparameters and a limited number of values for each, Grid Search can be a simple and effective choice.
  • When exhaustiveness is critical: If you absolutely need to explore every single defined combination.

Conclusion

The experiment effectively demonstrated the distinct strengths of Bayesian Optimization and Grid Search in hyperparameter tuning.
Bayesian Optimization, by design, proved highly effective at intelligently navigating the search space and prioritizing a specific objective, in this case, maximizing recall.

It successfully achieved a higher recall rate (0.8400) compared to Grid Search, indicating its ability to find more positive instances.
This capability comes with an inherent trade-off, leading to reduced precision and overall accuracy.

Such an outcome is highly valuable in applications where minimizing false negatives is critical (e.g., medical diagnosis, fraud detection).
Its efficiency, stemming from probabilistic modeling that guides the search towards promising areas, makes it a preferred method for optimizing costly experiments or simulations where each evaluation is expensive.

In contrast, Grid Search, while exhaustive, yielded a more balanced model with superior precision (0.8304) and overall accuracy (0.7825).

This suggests Grid Search was more conservative in its predictions, resulting in fewer false positives.

In summary, while Grid Search offers a straightforward and exhaustive approach, Bayesian Optimization stands out as a more sophisticated and efficient method capable of finding superior results with fewer evaluations, particularly when optimizing for a specific, often complex, objective like maximizing recall in a high-dimensional space.

The optimal choice of tuning method ultimately depends on the specific performance priorities and resource constraints of the application.


Author: Kuriko IWAI
Portfolio / LinkedIn / Github
May 26, 2025


All images, unless otherwise noted, are by the author.
The article utilizes synthetic data, licensed under Apache 2.0 for commercial use.
