
Guide to latest AdaBelief optimizer for deep learning – Analytics India Magazine

We have previously covered many of the optimizers available in the TensorFlow and PyTorch libraries; today we will discuss a specific one: AdaBelief. Almost every neural network and machine learning algorithm uses an optimizer to minimize its loss function via gradient descent. PyTorch and TensorFlow offer many optimizers suited to different kinds of problems, such as SGD, Adam, and RMSprop. To choose the best optimizer, we need to consider several factors: speed of convergence, generalization of the model, and the loss metrics.

SGD (Stochastic Gradient Descent) generalizes better, while Adam converges faster.

Recently, researchers from Yale University introduced a novel optimizer in the paper “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”. It combines the strengths of several optimizers into one. The most popular optimizers for deep learning and machine learning tasks fall into two categories: adaptive methods (e.g. Adam) and accelerated schemes (e.g. SGD). For most tasks, such as training convolutional neural networks (CNNs), adaptive methods converge faster but generalize worse than SGD; for more complex tasks like generative adversarial networks (GANs), adaptive methods are typically the default because of their stability.

The authors introduce AdaBelief, which can simultaneously achieve three goals:

fast convergence as in adaptive methods,
good generalization as in SGD, 
training stability.

Adam and AdaBelief

Let’s look at both optimizers in detail to see what has been changed and optimized.


Adam (Adaptive Moment Estimation)

The Adam Optimizer is one of the most used optimizers to train different kinds of neural networks.

In Adam, the update direction is m_t / √v_t, where m_t is the EMA (Exponential Moving Average) of the gradient g_t and v_t is the EMA of the squared gradient. Adam essentially combines the optimization techniques of momentum and RMSprop. It maintains two internal states: the momentum m_t and the squared momentum v_t of the gradient g. With every training batch, each of them is updated using exponential weighted averaging (EWA):

m_t = β1 · m_{t-1} + (1 − β1) · g_t
v_t = β2 · v_{t-1} + (1 − β2) · g_t²

Here β1 and β2 are hyperparameters. These states are then used to update the parameters at each step:

θ_t = θ_{t-1} − α · m_t / (√v_t + ε)

where α is the learning rate, and ε is added to improve numerical stability.
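As a quick illustration, here is a minimal sketch of a single Adam update on a scalar parameter; the function name, defaults, and the omission of bias correction are simplifications for clarity, not taken from any particular library.

```python
import math

# Minimal sketch of one Adam update on a scalar parameter.
# Bias correction is omitted for brevity; names are illustrative.
def adam_step(theta, g, m, v, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradient
    theta = theta - lr * m / (math.sqrt(v) + eps)
    return theta, m, v

# One step with theta = 5 and gradient g = 10 (e.g. d/dθ of θ² at θ = 5):
theta, m, v = adam_step(5.0, 10.0, 0.0, 0.0)
```

After this single step, m = 0.1 · 10 = 1.0 and v = 0.001 · 100 = 0.1, and the parameter moves in the direction −m / √v.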

The AdaBelief optimizer is extremely similar to Adam, with one slight difference. Instead of v_t, the EMA of the squared gradient, it tracks a new state s_t, the EMA of the squared deviation of the gradient from its own EMA:

s_t = β2 · s_{t-1} + (1 − β2) · (g_t − m_t)² + ε

This s_t replaces v_t to form the update direction:

θ_t = θ_{t-1} − α · m_t / (√s_t + ε)
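For comparison, the same kind of sketch with the AdaBelief state s_t in place of v_t. Again, this is a hypothetical, simplified scalar version without bias correction; names and defaults are illustrative.

```python
import math

# Minimal sketch of one AdaBelief update on a scalar parameter.
# Bias correction is omitted; names and defaults are illustrative.
def adabelief_step(theta, g, m, s, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-16):
    m = beta1 * m + (1 - beta1) * g                   # EMA of the gradient
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # "belief": EMA of (g - m)^2
    theta = theta - lr * m / (math.sqrt(s) + eps)
    return theta, m, s

# One step with theta = 5 and gradient g = 10:
theta, m, s = adabelief_step(5.0, 10.0, 0.0, 0.0)
```

Note that s_t measures how far the observed gradient deviates from the optimizer’s “belief” m_t, rather than the raw gradient magnitude that v_t measures.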

Installation and Usage


1. PyTorch implementations

See the folder PyTorch_Experiments; for each subfolder, execute the provided sh script. See readme.txt in each subfolder, or refer to the Jupyter notebook, for visualization.

pip install adabelief-pytorch==0.2.0
from adabelief_pytorch import AdaBelief
AdaBelief_optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16, betas=(0.9, 0.999), weight_decouple=True, rectify=False)

2. Tensorflow implementation

Example projects include text classification and word embedding in TensorFlow.


pip install adabelief-tf==0.2.0
from adabelief_tf import AdaBeliefOptimizer
adabelief_optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)

Below are some of the experimental results comparing the performance of the AdaBelief optimizer with other optimizers on different neural networks, including CNNs, LSTMs, and GANs:

1. Results on Image Classification:

2. Results on LSTM (Time Series Modeling):

3. Results on a small GAN(Generative Adversarial Network) with vanilla CNN generator:

4. Results on Transformer

5. Results on Toy Example


To summarize, AdaBelief is an optimizer derived from Adam, with no extra parameters to tune: only one of Adam’s internal states is changed. It delivers both fast convergence and good generalization by adapting its step size according to its “belief” in the current gradient direction. It performs especially well in “large gradient, small curvature” cases, since it considers both the amplitude and the sign of the gradients.
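The “large gradient, small curvature” behaviour can be seen numerically: when the gradient is large but consistent, the deviation (g − m)² stays small, so s_t stays much smaller than v_t and AdaBelief takes the larger step. A small illustrative script (not from the paper’s code; the constants are arbitrary):

```python
import math

# With a large, constant gradient, AdaBelief's s (EMA of (g - m)^2) stays
# small, so the step m / sqrt(s) exceeds Adam's step m / sqrt(v).
beta1, beta2 = 0.9, 0.999
m = v = s = 0.0
g = 10.0  # large, consistent gradient (low curvature)
for _ in range(100):
    m = beta1 * m + (1 - beta1) * g            # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g        # Adam's second moment
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 # AdaBelief's "belief" state

adam_step = m / (math.sqrt(v) + 1e-8)
adabelief_step = m / (math.sqrt(s) + 1e-8)
print(adabelief_step > adam_step)  # AdaBelief takes the larger step here
```

Because m quickly converges to g, the deviation (g − m)² shrinks toward zero while g² stays large, which is exactly why AdaBelief accelerates in flat, high-gradient regions.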


