Regularization Techniques in Deep Learning

Khawlajlassi
Jan 30, 2021

Regularization is a set of techniques that help avoid overfitting in neural networks, thereby improving the accuracy of a deep learning model when it is fed entirely new data from the problem domain. There are various regularization techniques; some of the most popular ones, which I will try to explain in this post, are:

  1. L1 and L2 Regularization
  2. Dropout
  3. Data Augmentation
  4. Early stopping

1. L1 and L2 Regularization:

These are the most common types of regularization. In regression models, L1 regularization is called Lasso Regression and L2 regularization is called Ridge Regression.

Both update the general cost function by adding an extra regularization term:

Cost function = Loss (e.g. cross-entropy) + regularization term

In regression, the underlying loss is typically the RSS (residual sum of squares).

Lasso (L1 Regularization):
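For a linear model with n observations and p predictors, the Lasso objective is typically written as:

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

The first sum is the RSS and the second is the L1 penalty.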

Ridge (L2 Regularization):
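Ridge keeps the same RSS term but penalizes the squared coefficients instead:

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^{2}$$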

y: the learned relation.

β: the coefficient estimates for the different variables or predictors (x).

λ: the tuning parameter that decides how much we want to penalize the flexibility of our model.

The difference between the two lies in the penalty term. Ridge adds the squared magnitude of the coefficients as a penalty to the loss function, while Lasso (Least Absolute Shrinkage and Selection Operator) adds the absolute value of the magnitude of the coefficients.

When the data set has a huge number of features, Lasso is useful for feature selection, since it shrinks the coefficients of the less important features to exactly zero.
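To make this concrete in a deep learning setting, here is a minimal sketch (assuming TensorFlow/Keras; the layer sizes and the penalty strength 0.01 are arbitrary illustrative choices) of attaching an L2 penalty to a network's weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A small classifier whose hidden layers carry an L2 (Ridge-style) penalty on
# their weights. Swap regularizers.l2 for regularizers.l1 to get a Lasso-style
# penalty; the factor 0.01 plays the role of the tuning parameter lambda.
model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(10, activation="softmax"),
])

# The penalty terms are added to the cross-entropy loss automatically,
# giving the "loss + regularization" cost function described above.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```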

2. Dropout

This is one of the most interesting regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.

At every training iteration, dropout randomly selects some nodes of the neural network and removes them, along with all of their incoming and outgoing connections.

So each iteration uses a different set of nodes, and this results in a different set of outputs. Dropout can therefore also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model because they average over many slightly different models. Similarly, a network trained with dropout usually generalizes better than a normal neural network model.

The probability with which nodes are dropped is the hyper-parameter of the dropout function. Dropout can be applied to the hidden layers as well as the input layer.

For these reasons, dropout is usually preferred when we have a large neural network structure, in order to introduce more randomness.
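As a minimal sketch (again assuming TensorFlow/Keras, with arbitrary drop rates of 0.2 for the input and 0.5 for the hidden layers):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Dropout layers randomly zero out a fraction of their inputs at each training
# step (the rate is the probability of dropping a unit); they are switched off
# automatically at inference time.
model = tf.keras.Sequential([
    layers.Dropout(0.2, input_shape=(784,)),   # dropout applied to the input layer
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                       # dropout applied to a hidden layer
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```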

3. Data Augmentation

The simplest way to reduce overfitting is to increase the size of the training data. In practice, however, increasing the amount of training data is often not possible because labeled data is too costly.

But now let's consider that we are dealing with images. In this case, there are a few ways of increasing the size of the training data: rotating the image, flipping, scaling, shifting, and so on (think of applying these transformations to a photo of a dog).

This technique is known as data augmentation. It usually provides a big boost in model accuracy and can be considered a mandatory trick for improving our predictions.
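As a sketch of how this can be done on the fly (assuming TensorFlow/Keras; the transformation ranges and the dummy data are arbitrary illustrative choices):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generate randomly transformed copies of the training images on the fly:
# small rotations, shifts, zooms and horizontal flips.
datagen = ImageDataGenerator(
    rotation_range=20,        # rotate by up to +/-20 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
    zoom_range=0.1,           # zoom in or out by up to 10%
    horizontal_flip=True,     # randomly flip left/right
)

# Dummy batch of 8 RGB images (32x32) just to demonstrate the generator;
# in practice x_train and y_train would be the real training set.
x_train = np.random.rand(8, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(8,))

# Each call yields a freshly augmented batch; model.fit(datagen.flow(...))
# would train on such batches.
augmented_images, labels = next(datagen.flow(x_train, y_train, batch_size=8))
print(augmented_images.shape)  # (8, 32, 32, 3)
```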

4. Early stopping

A very common training scenario that leads to overfitting is training a model for too long, especially on a relatively small data set. Beyond a certain point, training the model for a longer period of time does not increase its generalization capability; instead, it leads to overfitting.

After a certain number of training steps, and after a significant reduction in the training error, there comes a point when the validation error starts to increase. This signifies that overfitting has started. With the early stopping technique, we stop training and keep the parameters as they are as soon as we see the validation error increase.
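A minimal sketch with Keras' built-in callback (assuming TensorFlow/Keras; the patience of 5 epochs is an arbitrary choice):

```python
import tensorflow as tf

# Stop training as soon as the validation loss has stopped improving for 5
# consecutive epochs, and restore the weights from the best epoch seen.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# Assuming a compiled `model` and training data x_train / y_train already
# exist (placeholders here), the callback is passed to fit():
# model.fit(x_train, y_train, validation_split=0.2, epochs=100,
#           callbacks=[early_stopping])
```

With restore_best_weights=True, the model keeps the parameters from the epoch with the lowest validation loss, which matches the idea of holding the parameters as they are once overfitting begins.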

Happy Learning…
