Deep Learning: Optimization Techniques
Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general nonconvex case. Even convex optimization is not without its complications.
In this article, we will discuss the main types of ML optimization techniques.
Top Optimization Techniques in Machine Learning
Gradient Descent
The gradient descent method is the most popular optimization method. The idea is to update the variables iteratively in the direction opposite to the gradients of the objective function. With every update, the method guides the model toward the target and gradually converges to the optimal value of the objective function.
We start with random model parameters, calculate the error on each learning iteration, and keep updating the model parameters to move closer to the values that result in the minimum cost.
Gradient descent algorithms multiply the gradient (slope) by a scalar known as the learning rate (or step size) to determine the next point. This parameter controls how far to move the weights along the negative gradient direction.
If we denote the gradients as dW and db, the gradient descent updates for the parameters W and b are as follows:
W = W - learning rate * dW
b = b - learning rate * db
If the learning rate is small, training is more reliable, but it takes a long time because the steps toward the minimum of the loss function are tiny.
If the learning rate is high, training may fail to converge or may even diverge: weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse. Thus our aim is to find an optimal learning rate that reaches the minimum loss quickly.
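To make this concrete, here is a minimal sketch of the update loop on a toy linear regression problem (all names such as X, y, and true_W are illustrative assumptions, not from the original article):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)              # toy inputs (illustrative)
true_W = np.array([1.0, -2.0, 0.5])      # target parameters (illustrative)
y = X @ true_W

W = np.zeros(3)                          # initial model parameters
learning_rate = 0.1
for epoch in range(200):
    preds = X @ W
    dW = X.T @ (preds - y) / len(y)      # gradient of mean squared error
    W = W - learning_rate * dW           # step opposite to the gradient

print(W)                                 # approaches true_W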
Gradient Descent with Momentum
Gradient descent with momentum usually works much faster than standard gradient descent. The basic idea is to calculate an exponentially weighted average of the gradients and then use that average, rather than the raw gradient, to update the weights.
Implementation
During backpropagation we compute dW and db and use them to update our parameters W and b as follows:
W = W - learning rate * dW
b = b - learning rate * db
In momentum we take the exponentially weighted averages of dW and db, instead of using dW and db independently for each epoch.
VdW = β * VdW + (1 - β) * dW
Vdb = β * Vdb + (1 - β) * db
Here β is another hyperparameter, called momentum, ranging from 0 to 1. It sets the weight between the average of the previous values and the current value when calculating the new weighted average.
We’ll update our parameters after calculating the exponentially weighted averages.
W = W - learning rate * VdW
b = b - learning rate * Vdb
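A minimal sketch of a single momentum step, assuming dW and db have already been computed by backpropagation (the function name and default values are illustrative):

def momentum_update(W, b, dW, db, VdW, Vdb, learning_rate=0.01, beta=0.9):
    # Exponentially weighted averages of the gradients
    VdW = beta * VdW + (1 - beta) * dW
    Vdb = beta * Vdb + (1 - beta) * db
    # Update the parameters with the smoothed gradients
    W = W - learning_rate * VdW
    b = b - learning_rate * Vdb
    return W, b, VdW, Vdb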
RMSprop
There is an algorithm called RMSprop, which stands for root mean square prop, that can also accelerate gradient descent. RMSprop uses the same concept of the exponentially weighted average of the gradients as gradient descent with momentum, but differs in how the parameters are updated.
Implementation
During backpropagation we compute dW and db and use them to update our parameters W and b as follows:
W = W - learning rate * dW
b = b - learning rate * db
In RMSprop we take the exponentially weighted averages of the squares of dW and db instead of using dW and db separately for each epoch.
SdW = β * SdW + (1 - β) * dW^2
Sdb = β * Sdb + (1 - β) * db^2
Here β is another hyperparameter ranging from 0 to 1. It sets the weight between the average of the previous squared values and the current squared value when calculating the new weighted average.
We’ll update our parameters after calculating the exponentially weighted averages.
W = W - learning rate * dW / sqrt(SdW + ε)
b = b - learning rate * db / sqrt(Sdb + ε)
SdW is relatively small, so here we divide dW by a relatively small number (we add ε to avoid division by zero), while Sdb is relatively large, so we divide db by a comparatively larger number to slow down the changes in the vertical dimension.
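A minimal sketch of a single RMSprop step under the same assumptions (illustrative names; ε is placed inside the square root to match the formulas above):

import numpy as np

def rmsprop_update(W, b, dW, db, SdW, Sdb, learning_rate=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted averages of the squared gradients
    SdW = beta * SdW + (1 - beta) * dW**2
    Sdb = beta * Sdb + (1 - beta) * db**2
    # Divide each gradient by the root of its running average
    W = W - learning_rate * dW / np.sqrt(SdW + eps)
    b = b - learning_rate * db / np.sqrt(Sdb + eps)
    return W, b, SdW, Sdb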
Adam
The Adam optimization algorithm is one of the few algorithms that have stood the test of time and proven effective across a wide variety of deep learning models. Adam essentially ties together momentum and RMSprop. Adam stands for Adaptive Moment Estimation.
Implementation
To implement the Adam optimization algorithm:
1. First, initialize:
VdW = 0, SdW = 0, Vdb = 0, Sdb = 0
2. Then, on iteration t:
Compute the derivatives dW and db on the current mini-batch.
3. Compute the momentum exponentially weighted averages:
VdW = β1 * VdW + (1 - β1) * dW
Vdb = β1 * Vdb + (1 - β1) * db
4. Compute the RMSprop update as well:
SdW = β2 * SdW + (1 - β2) * dW^2
Sdb = β2 * Sdb + (1 - β2) * db^2
5. A typical Adam implementation includes bias correction, so we compute Vcorrected (the value after correction of the bias):
VdWcorrected = VdW / (1 - β1^t)
Vdbcorrected = Vdb / (1 - β1^t)
6. And then similarly, we implement this bias correction on S as well.
SdWcorrected = SdW / (1 - β2^t)
Sdbcorrected = Sdb / (1 - β2^t)
7. Finally, perform the update:
W = W - learning rate * VdWcorrected / sqrt(SdWcorrected + ε)
b = b - learning rate * Vdbcorrected / sqrt(Sdbcorrected + ε)
where:
- Epsilon ‘ε’ is a very small number to avoid dividing by zero (typically ε = 10^-8).
- β1 and β2 are hyperparameters that control the two weighted averages. In practice, the default values are β1 = 0.9 and β2 = 0.999.
- Alpha (α) is the learning rate; a range of values should be tested to see what works best for different problems.
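Putting these steps together, here is a minimal sketch of a single Adam step (illustrative names; t is the 1-based iteration count, and ε is placed inside the square root to match the update written above):

import numpy as np

def adam_update(W, b, dW, db, VdW, Vdb, SdW, Sdb, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Steps 3-4: momentum and RMSprop exponentially weighted averages
    VdW = beta1 * VdW + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    SdW = beta2 * SdW + (1 - beta2) * dW**2
    Sdb = beta2 * Sdb + (1 - beta2) * db**2
    # Steps 5-6: bias correction
    VdW_c = VdW / (1 - beta1**t)
    Vdb_c = Vdb / (1 - beta1**t)
    SdW_c = SdW / (1 - beta2**t)
    Sdb_c = Sdb / (1 - beta2**t)
    # Step 7: parameter update
    W = W - learning_rate * VdW_c / np.sqrt(SdW_c + eps)
    b = b - learning_rate * Vdb_c / np.sqrt(Sdb_c + eps)
    return W, b, VdW, Vdb, SdW, Sdb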
Learning Rate Schedules
The initial rate can be left as the system default or can be selected using a range of techniques. A learning rate schedule changes the learning rate during learning, most often between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules, but the most common are time-based, step-based, and exponential.
Decay serves to settle the learning in a nice place and avoid oscillations, a situation that may arise when a too-high constant learning rate makes the learning jump back and forth over a minimum; it is controlled by a hyperparameter.
Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient is heading in the same direction for a long time and also avoids local minima by ‘rolling over’ small bumps.
Time-Based Decay
The mathematical form of time-based decay is lr = lr0 / (1 + k*t), where lr0 and k are hyperparameters and t is the iteration number. The SGD optimizer takes decay and lr arguments and updates the learning rate by a decreasing factor in each epoch:
lr *= (1. / (1. + self.decay * self.iterations))
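As a standalone sketch of the same formula (names and example values are illustrative):

def time_based_decay(lr0, k, t):
    # lr0: initial rate, k: decay hyperparameter, t: iteration number
    return lr0 / (1.0 + k * t)

# e.g. lr0 = 0.1, k = 0.01: the rate halves to 0.05 by iteration t = 100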
Step Decay
A step decay schedule drops the learning rate by a factor every few epochs. The mathematical form of step decay is:
lr = lr0 * drop^floor(epoch / epochs_drop)
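A minimal sketch of this schedule (the drop factor and interval are illustrative defaults):

import math

def step_decay(epoch, lr0=0.1, drop=0.5, epochs_drop=10):
    # Multiply the rate by `drop` once every `epochs_drop` epochs
    return lr0 * math.pow(drop, math.floor(epoch / epochs_drop))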
Exponential Decay
Another common schedule is exponential decay. It has the mathematical form lr = lr0 * e^(-k*t), where lr0 and k are hyperparameters and t is the iteration number. Similarly, we can implement this by defining an exponential decay function and passing it to LearningRateScheduler:
from math import exp
from keras.callbacks import LearningRateScheduler

def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * exp(-k * epoch)  # decay with the epoch number
    return lrate

lrate = LearningRateScheduler(exp_decay)
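The resulting lrate callback can then be passed to training, for example via model.fit(..., callbacks=[lrate]) in Keras (a sketch of typical usage, not code from the original article).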
Choosing the Right Optimization Algorithm
We have discussed a series of related algorithms that each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which algorithm should one choose?
Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates performed fairly robustly, no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam. The choice of algorithm, at this point, seems to depend largely on the user's familiarity with it (for ease of hyperparameter tuning).
To delve deeper into this topic and understand it more, I advise you to follow up and watch this series of lessons.