Muon
One recurring theme in empirical deep learning is that we want to keep intermediate activations at a healthy size – we don’t want them to blow up or vanish.
This principle has inspired many design choices that are popular today – for example, initialization schemes (e.g., Xavier and Kaiming initialization), layer normalization, and gradient clipping.
Muon is inspired by this same idea, applied to optimization. The goal is to keep activations at every layer reasonably sized (we’ll define “reasonably sized” in a moment); these activations should be reasonably sized not just at initialization, but also throughout training.
Natural norms
We first need to define what we mean when we say that an activation vector \(v \in \mathbb{R}^d\) is “reasonably sized”.
One standard way to measure the size of a vector is the \(\ell_2\) norm: \(\|v\|_2 = \sqrt{\sum_{i=1}^d v_i^2}\).
However, this norm is dimension-dependent: for example, a vector with all entries \(\pm 1\) has \(\|v\|_2 = \sqrt{d}\), which grows with dimension.
The natural norm for vectors: RMS norm
The root mean square (RMS) norm fixes this by averaging over entries instead of summing:
\[\|v\|_{\text{RMS}} := \sqrt{\frac{1}{d} \sum_{i=1}^d v_i^2} = \frac{1}{\sqrt{d}} \|v\|_2.\]Now a \(\pm 1\) vector has \(\|v\|_{\text{RMS}} = 1\) regardless of dimension.
Sanity check with a Gaussian random vector
If \(v\) has i.i.d. standard Gaussian entries, then
\[\mathbb{E}[\|v\|_{\text{RMS}}^2] = \frac{1}{d} \sum_{i=1}^d \mathbb{E}[v_i^2] = 1,\]since each \(v_i \sim \mathcal{N}(0, 1)\) has \(\mathbb{E}[v_i^2] = \text{Var}(v_i) = 1\).
Compare with the Euclidean norm:
\[\mathbb{E}[\|v\|_2^2] = \mathbb{E}\!\left[\sum_{i=1}^d v_i^2\right] = \sum_{i=1}^d \mathbb{E}[v_i^2] = d \cdot 1 = d,\]which grows with dimension.
Intuitively, the RMS norm measures the scale of a typical entry of \(v\), rather than accumulating across all entries like the \(\ell_2\) norm does. This makes it the natural norm for dense vectors1: \(\|v\|_{\text{RMS}} = \Theta(1)\) exactly captures “the entries of \(v\) are \(\Theta(1)\)-sized”, independent of dimension.
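The dimension-independence of the RMS norm is easy to check numerically. The sketch below (using numpy; the helper name `rms_norm` is just for illustration) compares the two norms on an all-ones vector and a Gaussian random vector as the dimension grows:

```python
import numpy as np

def rms_norm(v):
    # RMS norm: sqrt of the mean squared entry, i.e. ||v||_2 / sqrt(d).
    return np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(0)
for d in (10, 1_000, 100_000):
    ones = np.ones(d)                # all entries +1
    gauss = rng.standard_normal(d)   # i.i.d. N(0, 1) entries
    # The l2 norm of the ones vector grows like sqrt(d);
    # the RMS norm stays ~1 for both vectors, at every dimension.
    print(d, np.linalg.norm(ones), rms_norm(ones), rms_norm(gauss))
```
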
The natural norm for matrices: RMS-to-RMS operator norm
What does it mean for a weight matrix \(W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}\) to be “well-behaved”? It should map reasonably-sized inputs to reasonably-sized outputs: if \(\|x\|_{\text{RMS}} = \Theta(1)\), then we want \(\|Wx\|_{\text{RMS}} = \Theta(1)\).
The natural way to measure this is the RMS-to-RMS operator norm – the worst-case RMS stretch:
\[\|W\|_{\text{RMS} \to \text{RMS}} := \max_{x \neq 0} \frac{\|Wx\|_{\text{RMS}}}{\|x\|_{\text{RMS}}}.\]This is the natural spectral norm. It relates to the standard spectral norm \(\|W\|_*\) (largest singular value) by a dimensional factor:
\[\|W\|_{\text{RMS} \to \text{RMS}} = \sqrt{\frac{d_\text{in}}{d_\text{out}}} \cdot \|W\|_*.\]Relating RMS-to-RMS and spectral norm
For any \(x \neq 0\), rewrite the RMS norms in terms of \(\ell_2\) norms (recall \(Wx \in \mathbb{R}^{d_\text{out}}\) and \(x \in \mathbb{R}^{d_\text{in}}\)):
\[\frac{\|Wx\|_{\text{RMS}}}{\|x\|_{\text{RMS}}} = \frac{\|Wx\|_2 / \sqrt{d_\text{out}}}{\|x\|_2 / \sqrt{d_\text{in}}} = \sqrt{\frac{d_\text{in}}{d_\text{out}}} \cdot \frac{\|Wx\|_2}{\|x\|_2}.\]Taking the max over \(x\), and using \(\|W\|_* = \max_{x \neq 0} \|Wx\|_2 / \|x\|_2\):
\[\|W\|_{\text{RMS} \to \text{RMS}} = \sqrt{\frac{d_\text{in}}{d_\text{out}}} \cdot \|W\|_*.\]The spectral scaling condition
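This identity can be verified numerically. The sketch below (numpy) estimates the RMS-to-RMS norm directly – the max over \(x\) is attained at the top right singular vector of \(W\) – and compares it to the rescaled spectral norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 256
W = rng.standard_normal((d_out, d_in))

# Spectral norm: the largest singular value of W.
spec = np.linalg.svd(W, compute_uv=False)[0]

# The RMS-to-RMS norm's maximizing x is the top right singular vector.
_, _, Vt = np.linalg.svd(W)
x = Vt[0]
rms = lambda v: np.sqrt(np.mean(v ** 2))
ratio = rms(W @ x) / rms(x)

# The two quantities agree: ||W||_{RMS->RMS} = sqrt(d_in/d_out) * ||W||_*.
print(ratio, np.sqrt(d_in / d_out) * spec)
```
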
Let’s recall what we’re trying to achieve. We want every layer’s weight matrix \(W\) to map reasonably-sized inputs to reasonably-sized outputs – that is, \(\|W\|_{\text{RMS} \to \text{RMS}} = \Theta(1)\). Using the relationship we derived above, this is equivalent to:
\[\|W\|_* = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right).\]We want this to hold not just at initialization, but throughout training. After a weight update \(W \to W + \Delta W\), the change in output for input \(x\) is \(\Delta y = \Delta W \, x\). By definition of the operator norm:
\[\|\Delta y\|_{\text{RMS}} = \|\Delta W \, x\|_{\text{RMS}} \leq \|\Delta W\|_{\text{RMS} \to \text{RMS}} \cdot \|x\|_{\text{RMS}}.\]So if the input is reasonably sized (\(\|x\|_{\text{RMS}} = \Theta(1)\)) and we want the output change to also be reasonably sized (\(\|\Delta y\|_{\text{RMS}} = \Theta(1)\)), we need \(\|\Delta W\|_{\text{RMS} \to \text{RMS}} = \Theta(1)\), i.e.:
\[\|\Delta W\|_* = \Theta\!\left(\sqrt{\frac{d_\text{out}}{d_\text{in}}}\right).\]Together, these two requirements form the spectral scaling condition from Yang, Simon, and Bernstein, 2024: both the weights and their updates should have \(\Theta(1)\) RMS-to-RMS operator norm, or equivalently, spectral norm \(\Theta(\sqrt{d_\text{out}/d_\text{in}})\).
Deriving Muon
We’ve established that \(\Delta W\) should have \(\Theta(1)\) RMS-to-RMS operator norm. That is, \(\|\Delta W\|_{\text{RMS} \to \text{RMS}} \leq \alpha\) for some constant \(\alpha\). Since \(\|W\|_{\text{RMS} \to \text{RMS}} = \sqrt{d_\text{in}/d_\text{out}} \cdot \|W\|_*\), this is equivalent to \(\|\Delta W\|_* \leq \eta\) where \(\eta := \alpha \sqrt{d_\text{out}/d_\text{in}}\).
Given this budget, we want to decrease the loss as much as possible. To first order, the change in loss is \(\langle \nabla_W \mathcal{L}, \Delta W \rangle\), where \(\langle A, B \rangle := \sum_{ij} A_{ij} B_{ij}\) is the entrywise inner product between matrices. So we want to solve:
\[\min_{\Delta W} \; \langle \nabla_W \mathcal{L}, \Delta W \rangle \quad \text{subject to} \quad \|\Delta W\|_* \leq \eta.\]We can write the gradient in its SVD:
\[\nabla_W \mathcal{L} = \sum_{i=1}^r \sigma_i u_i v_i^\top,\]with singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\).
Plugging in the SVD, and using linearity of inner product, we get:
\[\langle \nabla_W \mathcal{L}, \Delta W \rangle \;=\; \left\langle \sum_{i=1}^r \sigma_i u_i v_i^\top, \; \Delta W \right\rangle \;=\; \sum_{i=1}^r \sigma_i \, \langle u_i v_i^\top, \Delta W \rangle.\]Note that only the projection of \(\Delta W\) onto the singular directions \(\{u_i v_i^\top\}\) appears in the objective – any orthogonal component doesn’t help decrease the loss, but can only increase \(\|\Delta W\|_*\). Thus, without loss of generality, we can write \(\Delta W = \sum_{i=1}^r c_i u_i v_i^\top\).
This is an SVD of \(\Delta W\) with singular values \(\lvert c_i \rvert\), so \(\|\Delta W\|_* = \max_i \lvert c_i \rvert\), and the optimization reduces to:
\[\min_{c_1, \ldots, c_r} \; \sum_{i=1}^r \sigma_i \, c_i \quad \text{subject to} \quad \max_i \lvert c_i \rvert \leq \eta.\]Since all \(\sigma_i > 0\), the minimum is achieved by setting \(c_i = -\eta\) for all \(i\), giving:
\[\Delta W = -\eta \sum_{i=1}^r u_i v_i^\top = -\eta \, UV^\top.\]This is the Muon update: take the gradient’s SVD, replace all singular values with 1, and then scale by \(-\eta\).
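The ideal (exact-SVD) form of this update is a few lines of numpy. The function name below is illustrative; this computes the update exactly, before any of the practical approximations discussed next:

```python
import numpy as np

def muon_update_exact(grad, eta):
    # Take the gradient's SVD, replace all singular values with 1
    # (U Sigma V^T -> U V^T), then scale by -eta.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return -eta * (U @ Vt)

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 64))
dW = muon_update_exact(G, eta=0.1)
# Every singular value of the update equals eta, as derived above.
print(np.linalg.svd(dW, compute_uv=False))
```
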
Comparison with gradient descent
What if we had constrained the Frobenius norm instead of the spectral norm? That is, what if we solved:
\[\min_{\Delta W} \; \langle \nabla_W \mathcal{L}, \Delta W \rangle \quad \text{subject to} \quad \|\Delta W\|_F \leq \eta.\]If we treat \(\Delta W\) as a flat vector of entries, the Frobenius norm is just the \(\ell_2\) norm, and this is asking: which direction decreases the loss most per unit \(\ell_2\) step? That’s the definition of the gradient. So the solution is \(\Delta W \propto -\nabla_W \mathcal{L}\): gradient descent.
So the key difference between gradient descent and Muon stems from the norm constraint. Gradient descent effectively constrains the Frobenius norm, which treats the weight matrix as an unstructured vector of numbers; Muon effectively constrains the spectral norm, which measures how the matrix acts on inputs.
The spectral norm only constrains the largest singular value of \(\Delta W\) – once you’ve spent your budget \(\eta\) on the top singular direction, the remaining directions are free. Muon takes advantage of this by stepping equally in every singular direction. Gradient descent can’t do this: under the Frobenius norm, every direction draws from a shared budget (\(\sum c_i^2 \leq \eta^2\)), so stepping more in one direction means stepping less in another. The optimal allocation sets \(c_i \propto -\sigma_i\), concentrating on the largest singular directions at the expense of the smaller ones.
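This difference in budget allocation is visible in the singular-value spectra of the two steps. A small numpy demo (the gradient matrix here is just a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((128, 32))  # stand-in for a gradient matrix

U, S, Vt = np.linalg.svd(G, full_matrices=False)
gd_step = -G           # gradient descent direction
muon_step = -(U @ Vt)  # Muon direction (before the learning-rate scale)

# GD's step inherits the gradient's spectrum (c_i proportional to sigma_i);
# Muon's step has every singular value equal to 1.
print(np.linalg.svd(gd_step, compute_uv=False)[:3])
print(np.linalg.svd(muon_step, compute_uv=False)[:3])
```
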
Making it practical: Newton-Schulz
The (ideal) Muon update requires computing \(UV^\top\) from the gradient \(\nabla_W \mathcal{L} = U\Sigma V^\top\). Computing the full SVD at every step is too expensive. But we don’t need the full SVD – we just need the map \(U\Sigma V^\top \mapsto UV^\top\).
It turns out there are algorithms that do exactly this – Newton-Schulz iterations. The algorithm works iteratively: each step applies a fixed polynomial to the matrix that acts independently on each singular value while preserving the singular vectors, and the polynomial is chosen so that all singular values converge to 1. In practice, 5-10 iterations suffice, and each iteration is just a few matrix multiplications – fast on GPUs. See the blog posts in the sources below for more details.
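Here is a minimal sketch of the idea using the classical cubic Newton-Schulz polynomial. Note this is an assumption-laden simplification: practical Muon implementations use a tuned quintic polynomial that converges in far fewer steps, so the cubic variant below needs more iterations than the 5-10 quoted above:

```python
import numpy as np

def newton_schulz(G, steps=30):
    # Approximate U V^T from G = U Sigma V^T without an explicit SVD.
    # Each step applies X -> 1.5 X - 0.5 X X^T X, which maps every
    # singular value s to 1.5 s - 0.5 s^3 (fixed point s = 1) while
    # leaving the singular vectors unchanged.
    X = G / np.linalg.norm(G)  # Frobenius normalization puts all s in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 64))
O = newton_schulz(G)
# For a full-rank tall matrix, the result is nearly orthonormal: O^T O ~ I.
print(np.max(np.abs(O.T @ O - np.eye(64))))
```

Each iteration is three matrix multiplications and an axpy – no SVD, no eigendecomposition – which is why this maps well onto GPU hardware.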
The Muon update rule
Putting it all together, the Muon update for a weight matrix \(W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}\) is:
\[W \leftarrow W - \alpha \cdot \sqrt{\frac{d_\text{out}}{d_\text{in}}} \cdot \text{NewtonSchulz}(\nabla_W \mathcal{L}).\]Let’s unpack this:
- \(\text{NewtonSchulz}(\nabla_W \mathcal{L})\) approximately orthogonalizes the gradient: \(U\Sigma V^\top \mapsto UV^\top\). This has spectral norm \(1\).
- The \(\sqrt{d_\text{out}/d_\text{in}}\) factor scales the RMS-to-RMS learning rate \(\alpha\) to the spectral norm budget: recall \(\eta = \alpha \sqrt{d_\text{out}/d_\text{in}}\).
Note that the \(\sqrt{d_\text{out}/d_\text{in}}\) factor absorbs the layer dimensions into the update, which helps \(\alpha\) transfer across weight matrices of different shapes and sizes.
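Putting the pieces together, one full update step can be sketched as follows. For clarity this uses exact SVD orthogonalization in place of Newton-Schulz, and omits momentum (which practical Muon implementations apply to the gradient before orthogonalizing); the function name is illustrative:

```python
import numpy as np

def muon_step(W, grad, alpha):
    # W <- W - alpha * sqrt(d_out/d_in) * Orthogonalize(grad).
    # Exact SVD orthogonalization for clarity; a practical implementation
    # would substitute Newton-Schulz. Momentum is omitted.
    d_out, d_in = W.shape
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - alpha * np.sqrt(d_out / d_in) * (U @ Vt)

# The update's RMS-to-RMS norm is alpha, independent of the layer's shape:
rng = np.random.default_rng(0)
for shape in [(256, 64), (64, 256), (512, 512)]:
    W = rng.standard_normal(shape)
    G = rng.standard_normal(shape)
    dW = muon_step(W, G, alpha=0.02) - W
    d_out, d_in = shape
    rms_to_rms = np.sqrt(d_in / d_out) * np.linalg.svd(dW, compute_uv=False)[0]
    print(shape, rms_to_rms)  # ~0.02 for every shape
```

This is exactly the sense in which the dimensional factor helps \(\alpha\) transfer: the same \(\alpha\) produces the same RMS-to-RMS step size for every layer shape.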
Sources
- A Spectral Condition for Feature Learning - Yang, Simon, and Bernstein, 2024
- Deriving Muon - Jeremy Bernstein, 2025
- Muon: An optimizer for hidden layers in neural networks - Jordan et al., 2024
Footnotes
- A dense vector is one where every entry contributes a comparable amount to the squared norm. By contrast, a sparse vector has only a constant number of non-negligible entries, regardless of dimension – e.g., a one-hot encoding vector. For sparse vectors, the ordinary \(\ell_2\) norm is already dimension-independent (e.g., a one-hot vector has \(\|e_i\|_2 = 1\), regardless of dimension), so no correction is needed. The RMS norm is the natural norm specifically for dense vectors. Throughout this note we focus on dense vectors, since hidden activations in a transformer can be thought of as dense; the embedding layer (which maps sparse inputs to dense activations) requires separate treatment.