Nesterov Momentum Equivalence Derivation
source link: https://galenwong.github.io/blog/2020-02-08-nesterov-momentum-equivalence/
That Missing Piece in the Papers and the Slides
2020-02-08 • 5 min read
During the Winter Quarter of 2020 at UCLA, I was taking a class on neural networks and deep learning. The course is numbered ECE C147/247, and it was adapted from the famous Stanford CS 231n.
Note: I assume you (the reader) already know what the momentum update is, and how Nesterov momentum differs from plain momentum.
In the chapter on optimization during training (the gradient descent stage), we learned about a technique called Nesterov Momentum, used to update the weights to improve training speed.
The formulation of Nesterov Momentum update is given as follows (with $\alpha$ as the momentum coefficient and $\varepsilon$ as the learning rate).

Updating momentum:

$$\mathbf{v}_\text{new} = \alpha\mathbf{v}_\text{old} - \varepsilon\nabla_\theta J(\theta_\text{old} + \alpha\mathbf{v}_\text{old})$$

Updating parameter:

$$\theta_\text{new} = \theta_\text{old} + \mathbf{v}_\text{new}$$
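To see why this formulation is awkward, here is a minimal sketch of one update step in plain Python (the names `grad`, `alpha`, and `eps` are illustrative, not from the article; `grad` stands in for whatever computes $\nabla_\theta J$):

```python
def nesterov_step_original(theta, v, grad, alpha=0.9, eps=0.1):
    """One Nesterov step in the original formulation.

    grad(x) must return the gradient of the loss J at x. Note that we
    need it at the lookahead point theta + alpha * v, which typical
    training code does not have readily available.
    """
    v_new = alpha * v - eps * grad(theta + alpha * v)
    theta_new = theta + v_new
    return theta_new, v_new
```

The awkward part is the call `grad(theta + alpha * v)`: backpropagation gives us the gradient at the current parameter, not at the lookahead point.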
Here $J$ is the loss function. However, this formulation is difficult to implement in code. In code, we usually have $\nabla_\theta J(\theta)$ available to us instead of $\nabla_\theta J(\theta + \alpha\mathbf{v})$.
To solve this problem, an alternative formulation is given. We first let

$$\tilde\theta = \theta + \alpha\mathbf{v}$$

Then, the formulation becomes

$$\mathbf{v}_\text{new} = \alpha\mathbf{v}_\text{old} - \varepsilon\nabla_{\tilde\theta}J(\tilde\theta_\text{old})$$

$$\tilde\theta_\text{new} = \tilde\theta_\text{old} - \alpha\mathbf{v}_\text{old} + (1+\alpha)\mathbf{v}_\text{new}$$
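This version is straightforward to implement, because the gradient is evaluated directly at the stored parameter. A sketch under the same illustrative assumptions as before:

```python
def nesterov_step_alternative(theta_tilde, v, grad, alpha=0.9, eps=0.1):
    """One Nesterov step in the alternative formulation.

    theta_tilde is the stored parameter (already shifted by the
    momentum term), so grad is evaluated directly at it -- exactly
    the gradient that backpropagation gives us.
    """
    v_new = alpha * v - eps * grad(theta_tilde)
    theta_tilde_new = theta_tilde - alpha * v + (1 + alpha) * v_new
    return theta_tilde_new, v_new
```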
In class, no explanation was given of how the two formulations are equivalent. Instead, we were given a hint that the proof can be done through a substitution of variables. This formulation still confused me a lot.
There are multiple points of confusion here:

1. How do we calculate $\nabla_{\tilde\theta_\text{old}}J(\tilde\theta_\text{old})$?
2. What is the actual meaning of $\tilde\theta$?
3. What are we storing as our parameter?
I then sought help from the Stanford course webpage on this chapter. It points to two papers on the derivation of Nesterov momentum. Only one of them (section 3.5) discusses the alternative formulation, but it simply states the two formulations without proof. I understand that the target audience of the paper is very likely capable of figuring out the equivalence, but I am not one of them.
The paper did offer some insight:
> the parameters $\tilde\theta$ are a completely equivalent replacement of $\theta$
This offers the answer to confusion 3: we store $\tilde\theta$ as the parameter. In code, what is given to us is the derivative of $J(\tilde\theta)$ with respect to $\theta$. Using the chain rule, we can show that

$$\nabla_\theta J(\tilde\theta) = \nabla_{\tilde\theta}J(\tilde\theta)\,\frac{\partial\tilde\theta}{\partial\theta} = \nabla_{\tilde\theta}J(\tilde\theta),$$

since $\partial\tilde\theta/\partial\theta$ is the identity ($\mathbf{v}$ is treated as a constant with respect to $\theta$).
This answers confusion 1: the gradient of the loss with respect to $\theta$, which is what code gives us, is equivalent to the gradient with respect to $\tilde\theta$. Now back to confusion 2: what is the meaning of $\tilde\theta$? Is it okay to use it as the parameter for prediction and testing? The answer is yes. By definition, $\tilde\theta$ is simply the parameter after one momentum update, so we can think of it as a parameter value that is closer to the optimum.
I had to take the time to do the derivation on my own. Here, I offer the derivation of the equivalence. First, we have the substitution and the original formulation:

$$\tilde\theta = \theta + \alpha\mathbf{v} \tag{1}$$

$$\mathbf{v}_\text{new} = \alpha\mathbf{v}_\text{old} - \varepsilon\nabla_\theta J(\theta_\text{old} + \alpha\mathbf{v}_\text{old}) \tag{2}$$

$$\theta_\text{new} = \theta_\text{old} + \mathbf{v}_\text{new} \tag{3}$$
We substitute $(1)$ into $(2)$ and apply the relationship $\nabla_\theta J(\tilde\theta) = \nabla_{\tilde\theta}J(\tilde\theta)$:

$$\mathbf{v}_\text{new} = \alpha\mathbf{v}_\text{old} - \varepsilon\nabla_{\tilde\theta}J(\tilde\theta_\text{old}) \tag{4}$$
We can rewrite $(1)$ as $\theta = \tilde\theta - \alpha\mathbf{v}$. Then, we substitute $(1)$ into $(3)$. Note that we have to substitute for both $\theta_\text{new}$ and $\theta_\text{old}$:

$$\tilde\theta_\text{new} - \alpha\mathbf{v}_\text{new} = \tilde\theta_\text{old} - \alpha\mathbf{v}_\text{old} + \mathbf{v}_\text{new}$$

$$\tilde\theta_\text{new} = \tilde\theta_\text{old} - \alpha\mathbf{v}_\text{old} + (1+\alpha)\mathbf{v}_\text{new} \tag{5}$$
Now we have arrived at the alternative formulation with $(4)$ and $(5)$. Therefore, in the code, it is $\tilde\theta_\text{new}$ that is returned as the new parameter instead of $\theta_\text{new}$. Again, the idea is that the parameter actually stored and used for evaluation is always the one already updated with the momentum, so the gradient of the loss is taken with respect to that momentum-updated parameter.
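The equivalence can also be checked numerically: run both formulations side by side on the same loss and verify that the substitution $\tilde\theta = \theta + \alpha\mathbf{v}$ holds at every step. A sketch on a simple quadratic loss (all names illustrative):

```python
# Quadratic loss J(theta) = 0.5 * theta**2, so grad J(theta) = theta.
grad = lambda th: th

alpha, eps = 0.9, 0.1
theta, v = 2.0, 0.0                  # original formulation state
theta_tilde, v2 = theta + alpha * v, 0.0  # substituted state, from (1)

for _ in range(50):
    # Original formulation: gradient at the lookahead point.
    v = alpha * v - eps * grad(theta + alpha * v)
    theta = theta + v
    # Alternative formulation: gradient at the stored parameter.
    v2_new = alpha * v2 - eps * grad(theta_tilde)
    theta_tilde = theta_tilde - alpha * v2 + (1 + alpha) * v2_new
    v2 = v2_new
    # The substitution theta_tilde = theta + alpha * v holds throughout.
    assert abs(theta_tilde - (theta + alpha * v)) < 1e-9
```

Both trajectories stay in lockstep: the parameter returned by the alternative update is always the original parameter shifted one momentum step ahead.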