How close can noisy gradient descent get?

Suppose we want to minimize some loss $\mathcal{L}(\boldsymbol{\theta}): \mathbb{R}^d \to \mathbb{R}$ via gradient descent. At each step, we receive the true gradient $\nabla \mathcal{L}(\boldsymbol{\theta}_t)$ plus some randomly sampled noise $\boldsymbol{\varepsilon}_t$ . We run this noisy gradient descent with a constant learning rate.

What happens? Do we converge to the true minimum of $\mathcal{L}$ ? Or do we hover within some range of it?

Turns out we hover within some range (a sort of “cloud”) around the minimum. The size of this cloud depends on the step size, the magnitude of the noise, and the local curvature of the loss.

The 1D quadratic case

Consider a simple 1D quadratic loss function

\mathcal{L}(\theta)=\frac{\lambda}{2}\theta^2,

where $\theta \in \mathbb{R}$ is the parameter to be learned, and $\lambda \in \mathbb{R}^{+}$ is a constant that controls the curvature of the quadratic. The minimum is at $\theta=0$ , and the true gradient is $\nabla\mathcal{L}(\theta)=\lambda\theta$ .

Now suppose noisy gradient descent sees

g_t=\lambda\theta_t+\varepsilon_t,

where each noise term $\varepsilon_t$ is an independently drawn random variable with $\mathbb{E}[\varepsilon_t]=0$ and $\text{Var}(\varepsilon_t)=\sigma^2$ .

We can write one step of noisy gradient descent as

\begin{aligned} \theta_{t+1} &= \theta_t-\eta g_t \\ &= \theta_t-\eta(\lambda\theta_t+\varepsilon_t) \\ &= (1-\eta\lambda)\theta_t-\eta\varepsilon_t. \end{aligned}

Letting $a:=1-\eta\lambda$ , we can write

\theta_{t+1}=a\theta_t-\eta\varepsilon_t.

Note that we’ll assume $\lvert a \rvert < 1$ (or equivalently $0 < \eta\lambda < 2$ ). If this were not the case, then learning would be unstable, with the iterate blowing up to infinity.¹

Analysis of parameter iterates

The expectation of an iterate

Note that we can think of each iterate $\theta_t$ as a random variable – each $\theta_t$ depends on all the random noise terms sampled up to that point: $\varepsilon_0,\varepsilon_1,\ldots,\varepsilon_{t-1}$ .

Let $m_t:=\mathbb{E}[\theta_t].$ We write the expectation of the next iterate as a simple recursion:

\begin{aligned} m_{t+1} &= \mathbb{E}[a\theta_t-\eta\varepsilon_t] \\ &= a\mathbb{E}[\theta_t]-\eta\mathbb{E}[\varepsilon_t] \\ &= am_t. \end{aligned}

Thus, the expectation of iterates evolves as

m_1=am_0,\qquad m_2=am_1=a^2m_0,\qquad m_t=a^tm_0.

If $\lvert a \rvert < 1$ , then $a^t \to 0$ , so $\mathbb{E}[\theta_t]\to 0.$

From this, we see that the expectation of iterates goes to the minimum. However, we will see that actual iterates do not stay there – they have non-zero variance, driven by the fact that new noise is injected at every step.

The variance of an iterate

Let $v_t:=\text{Var}(\theta_t).$ Then

\begin{aligned} v_{t+1} &= \text{Var}(a\theta_t-\eta\varepsilon_t) \\ &= a^2\text{Var}(\theta_t)+\eta^2\text{Var}(\varepsilon_t) \\ &= a^2v_t+\eta^2\sigma^2. \end{aligned}

Note that there is no cross term because the noise $\varepsilon_t$ is independent of the iterate $\theta_t$ .

At stationarity (i.e., when we’re in the final “cloud”), the variance stops changing, so $v_{t+1}=v_t=v$ . Therefore

\begin{aligned} v &= a^2v+\eta^2\sigma^2 \\ v(1-a^2) &= \eta^2\sigma^2 \\ v &= \frac{\eta^2\sigma^2}{1-a^2}. \end{aligned}

Now substitute $a=1-\eta\lambda$ :

\begin{aligned} 1-a^2 &= 1-(1-\eta\lambda)^2 \\ &= \eta\lambda(2-\eta\lambda). \end{aligned}

So the stationary parameter variance is

\text{Var}(\theta) = \frac{\eta\sigma^2}{\lambda(2-\eta\lambda)}.

This is the noise floor in the toy model. Smaller learning rates and lower gradient noise shrink the cloud, while smaller curvature expands it.

Analysis of loss

So far, we have described the cloud in parameter space.

At stationarity, $\theta$ is a random variable with

\mathbb{E}[\theta]=0, \qquad \text{Var}(\theta) = \frac{\eta\sigma^2}{\lambda(2-\eta\lambda)}.

We can compute statistics of the loss, since the loss is just a function of parameters $\theta$ .

The expected loss is

\begin{aligned} \mathbb{E}[\mathcal{L}(\theta)] &= \mathbb{E}\left[\frac{\lambda}{2}\theta^2\right]\\ &= \frac{\lambda}{2} \mathbb{E}\left[\theta^2\right]. \end{aligned}

Since $\mathbb{E}[\theta] = 0$ , $\mathbb{E}[\theta^2] = \text{Var}(\theta)$ . Therefore,

\begin{aligned} \mathbb{E}[\mathcal{L}(\theta)] &= \frac{\lambda}{2}\text{Var}(\theta) \\ &= \frac{\lambda}{2} \cdot \frac{\eta\sigma^2}{\lambda(2-\eta\lambda)} \\ &= \frac{\eta\sigma^2}{2(2-\eta\lambda)}. \end{aligned}

Extending to general cases

We spent a bunch of time trying to understand a simple 1D quadratic. Why care about a 1D quadratic?

Near a nice local minimum, a smooth loss looks quadratic. The 1D quadratic is the simplest local model of what noisy gradient descent sees when it is close to a minimum.

In 1D, if the minimum is at $\theta^\star$ , then for nearby $\theta$ ,

\mathcal{L}(\theta) \approx \mathcal{L}(\theta^\star) + \frac{1}{2}\mathcal{L}''(\theta^\star)(\theta-\theta^\star)^2.

So the same calculation applies locally, with curvature $\lambda=\mathcal{L}''(\theta^\star).$

In multiple dimensions, the same idea applies direction by direction. Near a minimum, the loss is approximately a quadratic bowl. The Hessian tells us the curvature in each local direction. Flatter directions have weaker force pulling the iterates back toward the minimum, so the noisy iterates spread out more along those directions; sharper directions have stronger force pulling the iterates back toward the minimum, so the cloud is narrower along those directions.

References

References are listed in alphabetical order.

Gradient descent on neural networks typically occurs at the edge of stability [link]
Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. International Conference on Learning Representations. 2021.
Noise and fluctuation of finite learning rate stochastic gradient descent [link]
Kangqiao Liu, Liu Ziyin, and Masahito Ueda. Proceedings of the 38th International Conference on Machine Learning. 2021.
Acceleration of stochastic approximation by averaging [link]
B. T. Polyak and A. B. Juditsky. SIAM Journal on Control and Optimization. 1992.

Footnotes

This is the same basic stability threshold behind the “edge-of-stability” phenomenon: for a quadratic with curvature $\lambda$ , gradient descent becomes unstable once the learning rate reaches $\eta \geq 2/\lambda$ . ↩