Gradient descent

Gradient descent is a simple algorithm for optimization, and empirically works extremely well for modern deep learning.

The idea is very simple:

  • At any given point in parameter space, compute the gradient of the loss with respect to parameters – the direction of steepest ascent. Since we want to decrease loss, we step in the negative gradient direction – the direction of steepest descent.
  • This brings us to a new point in parameter space, and we repeat.

A key hyperparameter for gradient descent is the step size (i.e., the learning rate), commonly denoted as \(\eta\). This hyperparameter dictates how large of a step we take along the direction of steepest descent.

Formally, we can write gradient descent as follows:

\[\theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{L}(\theta_t),\]

where \(\theta_t\) denotes the parameters at iteration \(t\), \(\eta\) denotes the step size, and \(\mathcal{L}\) represents a loss function mapping parameters to a scalar.

Gradient flow

Gradient flow is an idealized, continuous version of gradient descent, taking \(\eta \rightarrow 0\).

We can write gradient flow as follows:

\[\frac{d\theta}{dt} = -\nabla \mathcal{L}(\theta),\]

where we think of \(\theta\) evolving over continuous time \(t\).

Approximating gradient descent with gradient flow

While gradient flow is always continuously moving in the direction of steepest descent, gradient descent takes discrete steps and does not account for the change in loss landscape between each step. This can lead to a mismatch between the paths taken by gradient flow and gradient descent.

One natural question is: can we understand finite-step gradient descent as gradient flow on a slightly different objective?

More precisely, suppose the original loss is \(\mathcal{L}(\theta)\). We can write a single step of gradient descent as

\[\theta_{\mathrm{GD}} = \theta_0 - \eta \nabla \mathcal{L}(\theta_0).\]

We want to find a nearby loss \(\mathcal{L}_\eta\) whose gradient flow, started at the same \(\theta_0\) and run for time \(\eta\), ends up near the endpoint of the discrete gradient descent step. Formally, we want to find a loss \(\mathcal{L}_\eta\) such that the gradient flow trajectory \(\theta(t)\), defined by

\[\frac{d\theta(t)}{dt} = -\nabla \mathcal{L}_\eta(\theta(t)), \qquad \theta(0)=\theta_0,\]

has endpoint \(\theta_{\mathrm{flow}} := \theta(\eta)\) satisfying

\[\theta_{\mathrm{flow}} \approx \theta_{\mathrm{GD}}.\]

This adjusted loss \(\mathcal{L}_{\eta}\) has different names in different fields; in this note we’ll refer to it as the “modified loss”.

A quadratic warmup

We’ll start with a simple quadratic loss:

\[\mathcal{L}(\theta) = \frac{\lambda}{2}\theta^2,\]

where \(\theta \in \mathbb{R}\) is our learned parameter, and \(\lambda \in \mathbb{R}^{+}\) is a constant that controls the curvature.

The gradient is $\nabla \mathcal{L}(\theta) = \lambda\theta$, and so one step of gradient descent gives

\[\begin{aligned} \theta_{\mathrm{GD}} &= \theta_0 - \eta \nabla \mathcal{L}(\theta_0) \\ &= \theta_0 - \eta (\lambda\theta_0) \\ &= (1 - \eta \lambda)\theta_0. \end{aligned}\]

Now we’ll approximate the endpoint of gradient flow on the same loss. Recall that gradient flow is defined by the differential equation

\[\begin{aligned} \frac{d\theta}{dt} &= -\nabla \mathcal{L}(\theta) \\ &= -\lambda\theta. \end{aligned}\]

The solution to this differential equation is

\[\theta(t) = e^{-\lambda t}\theta_0.\]

After flowing for time \(\eta\),

\[\theta_{\mathrm{flow}} = e^{-\lambda\eta}\theta_0.\]

Recalling the Taylor expansion of \(e^x\), we can write

\[\begin{aligned} e^{-\lambda\eta} &= 1 - \lambda\eta + \frac{\lambda^2\eta^2}{2} + O(\eta^3). \end{aligned}\]

Using this approximation, we can write the endpoint of gradient flow as

\[\theta_{\mathrm{flow}} = \left(1 - \lambda\eta + \frac{\lambda^2\eta^2}{2} + O(\eta^3)\right)\theta_0.\]

Comparing this to the result of discrete gradient descent, we see that the two agree to first order in \(\eta\), but disagree at order \(\eta^2\) and higher.

Left: the quadratic bowl \(\mathcal{L}(\theta) = \frac{\lambda}{2}\theta^2\) and the starting point \(\theta_0=1.2\). Right: parameter trajectories over time: GF (blue, continuous) and discrete GD (black dots, plotted at times \(t = k\eta\) with \(\eta=0.5\)). Note that the two paths do not match. In the case of our quadratic loss, GD approaches the minimum more quickly than GF – this is because GD locks in the starting gradient for the entire step, while GF’s gradient continuously shrinks as \(|\theta|\) decreases.

The modified quadratic

From the above analysis, we see that gradient descent lands at \(\theta_{\mathrm{GD}} = (1 - \eta \lambda)\theta_0\), while gradient flow lands at \(\theta_{\mathrm{flow}} = \left(1 - \lambda\eta + \frac{\lambda^2\eta^2}{2} + O(\eta^3)\right)\theta_0\). The mismatch primarily stems from the extra second-order term in gradient flow: \(\frac{\lambda^2\eta^2}{2}\theta_0\). Note that this extra (positive) term causes gradient flow to move slightly less far down the quadratic bowl than gradient descent.

So, in order to construct a modified gradient flow that behaves similarly to gradient descent here, we can try making the bowl effectively steeper:

\[\mathcal{L}_\eta(\theta) = \frac{1}{2}\left(\lambda+\frac{\eta}{2}\lambda^2\right)\theta^2.\]

We can now analyze how gradient flow behaves on this modified loss.

At the start of the flow, the initial velocity is

\[\begin{aligned} \left.\frac{d\theta}{dt}\right|_{t=0} &= \left.-\frac{d\mathcal{L}_\eta (\theta)}{d\theta} \right|_{\theta=\theta_0}\\ &= -\left(\lambda+\frac{\eta}{2}\lambda^2\right)\theta_0 \\ &= -\lambda\theta_0-\frac{\eta}{2}\lambda^2\theta_0. \end{aligned}\]

To get the acceleration, we differentiate the flow equation once more with respect to time:

\[\begin{aligned} \frac{d^2\theta}{dt^2} &= -\left(\lambda+\frac{\eta}{2}\lambda^2\right)\frac{d\theta}{dt}. \end{aligned}\]

At \(t=0\), this gives

\[\begin{aligned} \left.\frac{d^2\theta}{dt^2}\right|_{t=0} &= \left[-\left(\lambda+\frac{\eta}{2}\lambda^2\right)\frac{d\theta}{dt}\right]_{t=0} \\ &= -\left(\lambda+\frac{\eta}{2}\lambda^2\right) \left[-\left(\lambda+\frac{\eta}{2}\lambda^2\right)\theta_0\right] \\ &= \left(\lambda+\frac{\eta}{2}\lambda^2\right)^2\theta_0 \\ &= \lambda^2\theta_0+O(\eta). \end{aligned}\]

Note that the \(O(\eta)\) part of the acceleration will end up contributing only to an order \(\eta^3\) term, because acceleration will be multiplied by \(\eta^2/2\).

Now we can approximate the endpoint of gradient flow on the modified loss:

\[\begin{aligned} \theta_{\mathrm{flow}} &= \theta(0) + \eta \cdot \left.\frac{d\theta}{dt}\right|_{t=0} + \frac{\eta^2}{2} \cdot \left.\frac{d^2\theta}{dt^2}\right|_{t=0} + O(\eta^3) \\ &= \theta_0 + \eta\left(-\lambda\theta_0-\frac{\eta}{2}\lambda^2\theta_0\right) + \frac{\eta^2}{2}\left(\lambda^2\theta_0+O(\eta)\right) + O(\eta^3) \\ &= \theta_0 -\eta\lambda\theta_0 -\frac{\eta^2}{2}\lambda^2\theta_0 +\frac{\eta^2}{2}\lambda^2\theta_0 +O(\eta^3) \\ &= \theta_0-\eta\lambda\theta_0+O(\eta^3). \end{aligned}\]

The order-\(\eta^2\) terms cancel, so gradient flow on the modified loss matches one gradient descent step on the original loss (up to order \(\eta^2\)). Nice!

Rewriting the modified loss in terms of gradient norm

Let’s look back at the modified loss that we used. The modified loss is just the original loss plus one extra correction term:

\[\mathcal{L}_\eta(\theta) = \underbrace{\frac{1}{2}\lambda\theta^2}_{\text{original loss } \mathcal{L}(\theta)} + \underbrace{\frac{\eta}{4}\lambda^2\theta^2}_{\text{finite-step correction}}.\]

Since \(\nabla \mathcal{L}(\theta)=\lambda\theta\), we can write the extra term as

\[\frac{\eta}{4}\lambda^2\theta^2 = \frac{\eta}{4}\|\nabla\mathcal{L}(\theta)\|^2.\]

We will see that, in general, the modified loss can be written as

\[\mathcal{L}_\eta(\theta) = \mathcal{L}(\theta) + \frac{\eta}{4}\|\nabla \mathcal{L}(\theta)\|^2 + O(\eta^2).\]

Left: the original loss \(\mathcal{L}(\theta)\) (blue) and the modified loss \(\mathcal{L}_\eta(\theta)\) (orange), with \(\lambda=1\) and \(\eta=0.5\). Note that the modified loss is a slightly steeper bowl. Right: parameter trajectories over time: GF on the original loss (blue), GF on the modified loss (orange), and discrete GD (black dots). Note that GF on the modified loss now tracks the discrete GD steps more closely, compared to GF on the original loss.

The general case

Now we’ll try to generalize the modified loss construction.

We’ll denote the initial gradient as \(g := \nabla\mathcal{L}(\theta_0)\) and the initial Hessian as \(H := \nabla^2\mathcal{L}(\theta_0)\).

We can write the endpoint of one gradient descent step as

\[\theta_{\mathrm{GD}} = \theta_0-\eta g.\]

We can similarly write the endpoint of gradient flow on the original loss as

\[\theta_{\mathrm{flow}} = \theta_0-\eta g+\frac{\eta^2}{2}Hg+O(\eta^3).\]

Note that, just like in our toy quadratic case, gradient descent and gradient flow agree up to first order in \(\eta\), but disagree at order \(\eta^2\) and higher. Gradient flow has an extra second-order term: \(\frac{\eta^2}{2}Hg\).

Now we try a nearby loss whose first correction is proportional to the step size:

\[\mathcal{L}_\eta(\theta) = \mathcal{L}(\theta)+\eta R(\theta)+O(\eta^2),\]

where \(R\) is the correction term.

For gradient flow on this modified loss, initial velocity is

\[\begin{aligned} \left.\frac{d\theta}{dt}\right|_{t=0} &= -\nabla \mathcal{L}_\eta(\theta_0) \\ &= -\nabla \mathcal{L}(\theta_0)-\eta\nabla R(\theta_0)+O(\eta^2) \\ &= -g-\eta\nabla R(\theta_0)+O(\eta^2). \end{aligned}\]

Initial acceleration is

\[\begin{aligned} \left.\frac{d^2\theta}{dt^2}\right|_{t=0} &= -\nabla^2\mathcal{L}_\eta(\theta_0) \left.\frac{d\theta}{dt}\right|_{t=0} \\ &= -\left(\nabla^2\mathcal{L}(\theta_0)+O(\eta)\right) \left(-\nabla\mathcal{L}(\theta_0)+O(\eta)\right) \\ &= \nabla^2\mathcal{L}(\theta_0)\nabla\mathcal{L}(\theta_0)+O(\eta) \\ &= Hg+O(\eta). \end{aligned}\]

Note that the \(R\)-dependent pieces in the acceleration are only order \(\eta\), so after multiplying by \(\eta^2/2\) they only matter at order \(\eta^3\).

We can now write the approximation for the endpoint of gradient flow on the modified loss:

\[\begin{aligned} \theta_{\mathrm{flow}} &= \theta_0 +\eta \cdot \left.\frac{d\theta}{dt}\right|_{t=0} +\frac{\eta^2}{2} \cdot \left.\frac{d^2\theta}{dt^2}\right|_{t=0} +O(\eta^3) \\ &= \theta_0 +\eta\left(-g-\eta\nabla R(\theta_0)+O(\eta^2)\right) +\frac{\eta^2}{2}\left(Hg+O(\eta)\right) +O(\eta^3) \\ &= \theta_0-\eta g +\eta^2\left(\frac{1}{2}Hg-\nabla R(\theta_0)\right) +O(\eta^3). \end{aligned}\]

To match \(\theta_{\mathrm{GD}} = \theta_0-\eta g\) through order \(\eta^2\), we need the \(\eta^2\) term to vanish; i.e., we need

\[\nabla R(\theta_0)=\frac{1}{2}Hg.\]

A function with this gradient is

\[R(\theta)=\frac{1}{4}\|\nabla\mathcal{L}(\theta)\|^2.\]
Proof

Write \(R\) coordinate-wise as

\[R(\theta) = \frac{1}{4}\sum_i \left(\frac{\partial \mathcal{L}(\theta)}{\partial \theta_i}\right)^2.\]

Differentiate with respect to coordinate \(\theta_j\):

\[\begin{aligned} \frac{\partial R(\theta)}{\partial \theta_j} &= \frac{1}{2}\sum_i \frac{\partial \mathcal{L}(\theta)}{\partial \theta_i} \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_j\,\partial \theta_i}. \end{aligned}\]

At \(\theta_0\), this becomes

\[\frac{\partial R(\theta_0)}{\partial \theta_j} = \frac{1}{2}\sum_i H_{ji}g_i.\]

That is the \(j\)th coordinate of \(\frac{1}{2}Hg\). Therefore \(\nabla R(\theta_0)=\frac{1}{2}Hg\).

Thus

\[\mathcal{L}_\eta(\theta) = \mathcal{L}(\theta) + \frac{\eta}{4}\|\nabla \mathcal{L}(\theta)\|^2 + O(\eta^2).\]

Sources