Abstract
The paper under review (Liu et al. *Muon is Scalable for LLM Training*, arXiv:2502.16982) claims that Muon, an optimizer that orthogonalizes momentum matrices via a Newton-Schulz iteration before applying updates, trains LLMs with roughly the sample efficiency of a well-tuned AdamW (Loshchilov & Hutter, 2019) at 16B MoE scale. The contribution is a scaling recipe: a weight decay correction, an update RMS-matching rescaling, and a demonstration on *Moonlight* (3B activated / 16B total MoE). I find the empirical demonstration substantive but the theoretical story incomplete: the spectral-norm control argument is clean at the single-step level, yet the global convergence claim leans on assumptions about LLM loss-landscape anisotropy that are asserted rather than established. The evaluation protocol is where the real scrutiny lies, and on that front the paper makes several nontrivial choices, most notably the compute-matched rather than token-matched baseline, that deserve examination before the community declares AdamW obsolete.
Overall novelty rating: moderate to significant as an empirical scaling study; moderate as a theoretical contribution, since the spectral-norm framing was already present in Jordan's 2024 Muon writeup and in Bernstein & Newhouse's (2024) modular duality work.
1. The Claim, Formally
Muon operates on 2D weight matrices $W \in \mathbb{R}^{n \times m}$. Given a gradient $G_t$ and a momentum buffer $M_t = \mu M_{t-1} + G_t$, the update is:

$$W_{t+1} = W_t - \eta \, \mathrm{NS}(M_t)$$

where $\mathrm{NS}(\cdot)$ is a degree-5 Newton-Schulz polynomial iteration that, to high accuracy in roughly 5 steps, returns an approximate orthogonalization $\mathrm{NS}(M_t) \approx U V^\top$ for the SVD $M_t = U \Sigma V^\top$. Informally, Muon replaces the momentum matrix with its *nearest* semi-orthogonal matrix in Frobenius norm, modulo the polynomial approximation error.
The central formal claim is that this projection bounds the update's spectral norm by 1, implying that per-step changes to $W$ are controlled in operator norm rather than in Frobenius norm (as in AdamW's coordinate-wise rescaling). The paper does not prove a convergence rate. What it claims, via experiment, is a scaling law: at matched compute, Muon attains the same validation loss as AdamW at approximately half the training tokens.
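To make the update rule concrete, here is a minimal sketch using an exact SVD polar factor in place of the Newton-Schulz approximation (the real optimizer avoids the SVD for speed; the hyperparameter values are illustrative, and the weight-decay and RMS-matching fixes discussed later are omitted):

```python
import numpy as np

def orthogonalize(M):
    """Exact polar factor U V^T via SVD; a stand-in for the cheaper
    Newton-Schulz approximation used in the actual optimizer."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_step(W, G, M, lr=1e-3, momentum=0.95):
    """One bare-bones Muon step: accumulate momentum, orthogonalize,
    apply. Weight decay and RMS rescaling are omitted here."""
    M = momentum * M + G
    return W - lr * orthogonalize(M), M

rng = np.random.default_rng(0)
W, G = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
W2, M = muon_step(W, G, np.zeros_like(W))
# The update direction is semi-orthogonal: every singular value is 1,
# so its spectral norm is exactly 1 (before any rescaling).
print(np.linalg.svd(orthogonalize(G), compute_uv=False))
```

Note the contrast with AdamW: the controlled quantity is the operator norm of the whole matrix update, not the magnitude of individual coordinates.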
This is a strong claim, and each moving piece warrants dissection.
2. Derivation Walkthrough: What Newton-Schulz Actually Computes
The Newton-Schulz iteration for computing the matrix sign function, equivalently, the orthogonal polar factor, takes the form:

$$X_{k+1} = \tfrac{1}{2}\, X_k \left(3I - X_k^\top X_k\right)$$

for the standard cubic variant (Higham, 2008, *Functions of Matrices*, Ch. 8). Jordan's original Muon uses a quintic variant with coefficients chosen to widen the basin of attraction for singular values far from 1. Given any $X_0$ with nonzero singular values in $(0, \sqrt{3})$, repeated application drives those singular values to 1 while preserving the factors $U$ and $V$ from the SVD $X_0 = U \Sigma V^\top$. The fixed point is $U V^\top$, the orthogonal polar factor of $X_0$.
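This convergence is quick to check numerically with the cubic variant (a toy sketch; Muon's released kernel uses the tuned quintic instead):

```python
import numpy as np

def ns_orthogonalize(M, steps=30):
    """Cubic Newton-Schulz: X <- 0.5 * X (3I - X^T X), after Frobenius
    pre-normalization so the initial spectral norm is at most 1."""
    X = M / np.linalg.norm(M)
    I = np.eye(M.shape[1])
    for _ in range(steps):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
U, _, Vt = np.linalg.svd(M)
X = ns_orthogonalize(M)
# The iterate converges to the orthogonal polar factor U V^T.
print(np.max(np.abs(X - U @ Vt)))
```

The iteration needs only matmuls, which is the whole point: it maps cleanly onto accelerator hardware, unlike an SVD.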
Crucially, convergence requires $\|X_0\|_2 \le 1$ initially (more precisely, singular values in $(0, \sqrt{3})$). The Moonshot paper handles this with the pre-normalization $X_0 = M_t / \|M_t\|_F$, a convenient upper bound since $\|M_t\|_2 \le \|M_t\|_F$. This is the first place an implicit assumption slides in. If $M_t$ has a long tail of small singular values, as momentum matrices tend to, especially early in training, Frobenius normalization aggressively shrinks the top singular value, and the NS iteration must then spend its budget lifting small singular values toward 1. The net effect is that Muon treats the *rank profile* of the momentum as less informative than its *direction profile* on the Stiefel manifold. Whether that is the right prior for LLM gradients is an empirical question the paper does not interrogate.
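The shrinkage is easy to quantify: for a spectrum with $r$ comparable singular values, $\|M\|_F \approx \sigma_1 \sqrt{r}$, so the normalized top singular value lands near $1/\sqrt{r}$ rather than near 1. A toy illustration (the spectra are made up):

```python
import numpy as np

def top_sv_after_prenorm(spectrum):
    """Spectral norm of M / ||M||_F for a matrix with the given
    singular value spectrum."""
    s = np.asarray(spectrum, dtype=float)
    return s.max() / np.linalg.norm(s)

print(top_sv_after_prenorm([1.0]))               # rank 1: top value stays 1.0
print(top_sv_after_prenorm([1.0] * 100))         # flat spectrum: drops to 0.1
print(top_sv_after_prenorm([1.0] + [0.3] * 99))  # long tail: ~0.32
```

The longer the tail, the more NS iterations are spent re-inflating the dominant direction before the small ones catch up.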
Relax this assumption and Muon does something subtle. If the effective rank of $M_t$ is much smaller than $\min(n, m)$, then orthogonalization amplifies the contributions of the *small* singular directions to match the large ones. This is the opposite of what second-order methods like Shampoo (Gupta et al. 2018) or K-FAC (Martens & Grosse, 2015) do, which rescale directions by inverse curvature rather than equalizing them. The choice to *equalize* rather than *inverse-scale* singular values is a preconditioning prior: the authors are betting that LLM gradient matrices are approximately low-rank in a way that under-represents useful update directions, and that rebalancing them helps.
I would like to see that prior tested. A direct measurement of the singular value spectrum of $M_t$ at several checkpoints during training, for a Muon-trained and an AdamW-trained run, would immediately clarify whether Muon is rescuing suppressed directions or merely homogenizing noise. The paper does not report this.
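A concrete version of that diagnostic, using the entropy-based effective rank (Roy & Vetterli's measure; my choice of summary statistic, not the paper's):

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """exp(Shannon entropy) of the normalized singular value
    distribution: ~1 for a rank-1 matrix, ~min(n, m) for a flat
    spectrum."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
lowrank = np.outer(rng.normal(size=64), rng.normal(size=64))
dense = rng.normal(size=(64, 64))
print(effective_rank(lowrank))  # ~1
print(effective_rank(dense))    # a large fraction of 64
```

Logging this scalar per layer per checkpoint would cost almost nothing relative to a pretraining run.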
3. The RMS-Matching Rescaling: A Hyperparameter Transfer Trick
A nontrivial practical contribution is the per-matrix rescaling:

$$W_{t+1} = W_t - \eta \cdot 0.2\sqrt{\max(n, m)} \cdot \mathrm{NS}(M_t)$$
The motivation is that a semi-orthogonal $n \times m$ matrix $O$ has $\|O\|_F = \sqrt{\min(n,m)}$ and entrywise RMS $\sqrt{\min(n,m)/(nm)} = 1/\sqrt{\max(n,m)}$. AdamW's update, by contrast, has an entrywise RMS on the order of a small constant (coordinate-wise normalized). To make Muon's effective step size comparable to AdamW's for learning-rate transfer, one multiplies by $0.2\sqrt{\max(n,m)}$, bringing the update RMS to $0.2$. The constant $0.2$ is empirical.
This is clever, but also telling. It means the authors *want* Muon to look like AdamW in update magnitude, then rely on the angular (directional) difference to produce the win. The theoretical framing of spectral-norm control weakens here: once rescaled by $0.2\sqrt{\max(n,m)}$, the operator-norm bound is no longer 1 but $0.2\sqrt{\max(n,m)}$. For a representative hidden dimension $\max(n,m) = 4096$, that amounts to an operator-norm bound of roughly $12.8$ times the learning rate, which is not small. The "spectral-norm-constrained" story becomes a statement about the *ratio* of operator to Frobenius norm of the update, not an absolute bound.
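The norm arithmetic can be verified directly; with illustrative shapes $n = 256$, $m = 1024$, the rescaled semi-orthogonal update has entrywise RMS $0.2$ but operator norm $0.2\sqrt{1024} = 6.4$:

```python
import numpy as np

def rms(X):
    """Entrywise root-mean-square of a matrix."""
    return float(np.sqrt(np.mean(X ** 2)))

rng = np.random.default_rng(0)
n, m = 256, 1024
U, _, Vt = np.linalg.svd(rng.normal(size=(n, m)), full_matrices=False)
O = U @ Vt                            # semi-orthogonal update direction
scale = 0.2 * np.sqrt(max(n, m))      # the RMS-matching factor

print(rms(O))                         # 1/sqrt(max(n, m)) = 0.03125
print(rms(scale * O))                 # 0.2, AdamW-like entrywise magnitude
print(np.linalg.norm(scale * O, 2))   # 0.2 * sqrt(1024) = 6.4
```

The rescale trades the clean unit spectral-norm bound for hyperparameter transferability, which is exactly the tension the section describes.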
4. Weight Decay: The Quiet Fix
The second scaling fix is weight decay. The authors report that naive Muon without weight decay drifts: $\|W\|$ grows unboundedly, because the orthogonalized update has no inherent mechanism to shrink $W$. Adding standard AdamW-style decoupled weight decay stabilizes training.
This is a known failure mode of sign-like optimizers; signSGD (Bernstein et al. 2018) exhibits the same pathology. The paper's ablation shows that weight decay is not cosmetic: removing it degrades validation loss substantially at the 1.5B scale. What remains unclear is whether Muon's weight decay coefficient transfers across scales in the same way AdamW's does. The paper fixes a single weight decay value $\lambda$ but does not sweep it at the 16B scale, leaving open whether the reported efficiency gain is robust to a suboptimal $\lambda$ or an artifact of a particular tuning.
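The pathology and its fix can be reproduced in a toy simulation (random orthogonal update directions standing in for Muon's outputs; none of the constants here are from the paper): without decay the weight norm random-walks upward, while decoupled decay caps it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, steps = 32, 0.05, 2000

def final_norm(weight_decay):
    """Iterate W <- (1 - eta*wd) W - eta*Q for random orthogonal Q,
    mimicking a stream of unit-spectral-norm updates."""
    W = np.zeros((n, n))
    for _ in range(steps):
        Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
        W = (1 - eta * weight_decay) * W - eta * Q
    return float(np.linalg.norm(W))

no_decay = final_norm(0.0)
with_decay = final_norm(0.1)
print(no_decay, with_decay)  # the undecayed norm ends up far larger
```

With decay $\lambda$, the contraction $(1 - \eta\lambda)$ gives an effective averaging window of $\sim 1/(\eta\lambda)$ steps, so the norm equilibrates instead of accumulating.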
5. Comparison to Alternative Approaches
Muon is not the first attempt to move beyond coordinate-wise adaptive optimizers. The relevant landscape:
| Method | Preconditions by | Memory overhead vs SGD | Spectral-norm aware |
|---|---|---|---|
| Adam/AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019) | Coordinate variance | $2\times$ params | No |
| Shampoo (Gupta et al. 2018) | Kronecker factor inverse roots $L^{-1/4}, R^{-1/4}$ | $1\times$ params + factor buffers | Approximately |
| K-FAC (Martens & Grosse, 2015) | Fisher Kronecker factors | Similar | Yes, via curvature |
| LAMB (You et al. 2020) | Layer-wise trust ratio | $2\times$ params | No |
| Lion (Chen et al. 2023) | Sign of momentum | $1\times$ params | Crudely |
| SOAP (Vyas et al. 2024) | Shampoo preconditioner in rotated basis | Similar to Shampoo | Approximately |
| Muon (Jordan, 2024; Liu et al. 2025) | NS orthogonalization of momentum | $1\times$ params | Yes, by construction |
The natural comparison is to Shampoo and SOAP, which also exploit matrix structure. Shampoo applies $(G G^\top)^{-1/4}\, G \,(G^\top G)^{-1/4}$, which for a single gradient coincides exactly with the polar factor $U V^\top$: each side contributes a factor $\Sigma^{-1/2}$, canceling the singular values. In this sense, Muon is a memory-light approximation of accumulator-free Shampoo preconditioning, with the NS iteration replacing expensive matrix roots. The paper acknowledges this connection but does not run Shampoo or SOAP as baselines at 16B scale, the comparison that would most clearly isolate Muon's contribution.
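The coincidence with the polar factor is a short numerical check (accumulator-free, single-gradient, full-rank case):

```python
import numpy as np

def inv_quarter_root(S):
    """S^{-1/4} for a symmetric positive-definite matrix,
    via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.25) @ V.T

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))
shampoo = inv_quarter_root(G @ G.T) @ G @ inv_quarter_root(G.T @ G)

U, _, Vt = np.linalg.svd(G)
# Accumulator-free Shampoo equals the orthogonal polar factor U V^T.
print(np.max(np.abs(shampoo - U @ Vt)))
```

Once Shampoo accumulates statistics over steps, the equivalence breaks, which is exactly why a head-to-head baseline (rather than the algebraic identity) is needed to separate the hypotheses.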
The missing baseline is informative. If Muon's advantage is fundamentally about orthogonalizing momentum (i.e. aligning with the polar factor), then SOAP, which rotates into an eigenbasis and applies Adam there, should be competitive. If Muon wins against SOAP at scale, the effect is real. If Shampoo matches Muon, the win is about matrix-level preconditioning generally, and Muon is primarily a *systems* contribution (cheaper NS iteration, better memory footprint). The paper's experimental design does not permit us to distinguish these hypotheses.
6. Experimental Validation Assessment
The headline result: Moonlight (16B MoE, 3B activated) trained with Muon reaches AdamW-equivalent validation loss at roughly half the compute. Key numbers from the paper:
| Configuration | Tokens | Val loss | Compute (rel.) |
|---|---|---|---|
| AdamW baseline, 16B MoE | 5.7T | baseline | $1\times$ |
| Muon, 16B MoE | ~2.5T | ≈ baseline | ~$0.5\times$ |
| Muon small scale (400M dense) | ablation | better than AdamW | - |
The scaling-law plot shows Muon's loss curve sitting below AdamW's across the entire compute range tested, with the gap approximately constant in log-log coordinates. That is the cleanest evidence presented.
Consider the error bars. The paper reports results from single training runs at the large scales, standard practice in the LLM training literature, but methodologically fragile. At 16B scale, cost forbids repeated runs, so no confidence interval on the claimed efficiency gain is available. At smaller scales where multiple runs would be feasible, the paper does not report variance across seeds. Without that, the claim reduces to "in one 16B run, Muon reached AdamW's loss at roughly half the tokens", weaker than a scaling law.
Four evaluation choices are worth flagging:
1. Baseline tuning. The AdamW baseline uses standard LLM values for the learning rate and $(\beta_1, \beta_2)$; these are not swept at 16B. If AdamW's optimal learning rate at 16B differs from the transferred value, the baseline is handicapped. The baseline was not tuned at scale; it was tuned at proxy scale and transferred. Defensible, but not ironclad.
2. Data mixture. Both runs use the same Moonlight mixture, which is proprietary. Fine for a head-to-head, impossible to replicate externally.
3. Validation loss vs downstream. The efficiency claim is on validation loss. Downstream task scores are reported at final checkpoints only, not along the compute axis. Two optimizers reaching the same loss can produce different downstream behavior (see Hoffmann et al. 2022, *Chinchilla*, on loss-vs-capability decoupling at marginal differences).
4. Numerical precision. The NS iteration is performed in bfloat16 in the released code. No error analysis of NS in low precision is presented; given that NS is iterative and the quintic variant has coefficients that amplify floating-point error near singular values of 0, this matters for reproducibility.
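The precision concern can at least be gestured at in NumPy, using float16 as a crude stand-in for bfloat16 (NumPy has no native bf16, and this toy cubic iteration is not the released quintic kernel):

```python
import numpy as np

def ns_cubic(M, steps=25, dtype=np.float64):
    """Cubic Newton-Schulz run entirely in the chosen precision:
    every intermediate is rounded back to `dtype` each step."""
    X = (M / np.linalg.norm(M)).astype(dtype)
    I = np.eye(M.shape[1], dtype=dtype)
    for _ in range(steps):
        X = (0.5 * X @ (3.0 * I - X.T @ X)).astype(dtype)
    return X.astype(np.float64)

def orthogonality_error(X):
    """Frobenius distance of X^T X from the identity."""
    return float(np.linalg.norm(X.T @ X - np.eye(X.shape[1])))

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
err64 = orthogonality_error(ns_cubic(M, dtype=np.float64))
err16 = orthogonality_error(ns_cubic(M, dtype=np.float16))
print(err64, err16)  # the half-precision run is markedly less orthogonal
```

Whether the analogous bfloat16 error materially affects training dynamics at scale is exactly the unanswered question; this sketch only shows the error is nonzero and precision-dependent.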
7. Failure Mode Analysis
Where does Muon break?
7.1 Small or tall matrices. The scaling argument assumes approximately square matrices ($n \approx m$). For embedding matrices ($V \times d$ with vocabulary size $V \gg d$) and logit heads, the $\sqrt{\max(n,m)}$ rescaling becomes enormous. The paper applies Muon to hidden matrices and uses AdamW for embeddings and the LM head. This is a pragmatic patch, but it means Muon is not a drop-in replacement; it is an *optimizer mixture*. Any scaling-law claim is about the hidden-layer contribution, not end-to-end.
7.2 Low-effective-rank gradients. If $M_t$ has effective rank $r \ll \min(n,m)$, orthogonalization inflates each small-singular-value direction by a factor of $1/\sigma_i$, which can be very large. Early in training, when gradients are structured and low-rank, this risks injecting noise. The paper does not measure gradient rank trajectories.
7.3 Interaction with MoE routing. Moonlight is MoE. Expert gradients are sparse (only activated experts receive gradients per token) and biased by the router. Orthogonalizing expert weight matrices treats the sparse-activation structure as if it were dense, which may distort the effective update. Muon's advantage at 16B MoE may not transfer to dense 16B, or may be *larger* at MoE, through some interaction with routing balance. The paper reports only the MoE configuration.
7.4 Long-horizon training. The claimed efficiency holds only up to the largest compute budget tested. At Chinchilla-optimal token-to-parameter ratios of 20:1, and on into the overtrained regime of 100:1+ that frontier labs now use, does Muon's advantage persist, diminish, or widen? The paper does not extrapolate or report behavior deep in the overtraining regime.
7.5 Adversarial curvature. If the loss landscape has directions of extreme anisotropy, e.g. in the embedding of rare tokens, Muon's equalization of singular values may update *away* from the stable manifold. This would manifest as instability spikes. The paper reports stable training without spikes, which is reassuring, but does not probe adversarial inputs or tail-token loss.
8. Related Work Positioning
- Jordan (2024), the original Muon blog-post writeup, established the NS-orthogonalized momentum idea and demonstrated it at ~100M parameters on NanoGPT-style experiments. The Moonshot paper's novelty is *scaling* plus the specific weight-decay and RMS-matching fixes that make scaling work.
- Bernstein & Newhouse (2024), *Modular Duality in Deep Learning* (arXiv:2410.21265), provides a theoretical framework in which operator-norm steepest descent naturally produces orthogonalized updates, giving Muon a principled justification as dualization under the spectral norm. The Moonshot paper cites this connection but does not extend it.
- Shampoo and SOAP (Gupta et al. 2018; Vyas et al. 2024) are the strongest alternative matrix-aware optimizers. SOAP in particular has shown competitive LLM results at up to 1B scale. The absence of a SOAP baseline at 16B is the single biggest gap in the experimental design.
- Lion (Chen et al. 2023, arXiv:2302.06675) is a sign-based optimizer discovered by search; it shares Muon's memory profile without the spectral-norm machinery. A Lion baseline at 16B would clarify how much of the gain is from memory efficiency (enabling different hyperparameter regimes) versus from orthogonalization specifically.
- Adafactor (Shazeer & Stern, 2018) offers another low-memory alternative, used widely for large-scale training. Its omission is less critical, but still notable.
9. Open Technical Questions
1. Is the win from orthogonalization, or from matrix-aware preconditioning more broadly? A SOAP baseline at 16B MoE would answer this directly.
2. Does Muon's efficiency gain persist deep in the overtrained regime (token:parameter > 100)? The scaling plot stops before this regime.
3. What is the singular value spectrum of $M_t$ under Muon vs AdamW throughout training? A small diagnostic experiment with large interpretive value.
4. How sensitive is the $2\times$ claim to $\lambda$, $\mu$, and the NS iteration depth? The paper sweeps these only partially, and not at the largest scale.
5. Does Muon's advantage transfer across architectures? Only decoder-only Transformers are tested. State-space models and diffusion transformers remain untouched.
10. Assessment
The Moonshot paper is the most serious scaling evidence we have for a non-AdamW optimizer on LLMs. The empirical result is credible. But "credible" is not "settled." The theoretical framing, spectral-norm control, is a useful intuition rather than a convergence proof; the comparison to Shampoo and SOAP is missing at scale; and the experiments are single-run at 16B, which means the number carries an unquantified variance. A fairer comparison would include properly tuned Shampoo and SOAP baselines at identical compute, report multiple seeds at the largest feasible scale, and probe the singular spectrum of momentum to validate the mechanism.
Reproducibility is not optional; it is the minimum. Moonshot released code and a checkpoint, to their credit, and substantially more than most large-scale optimizer papers. But the training data is proprietary, so external replication of the exact claim is impossible. What the community can do, and should, is reproduce the scaling curves at 1B to 7B dense on public data (e.g. SlimPajama, FineWeb) against a properly tuned AdamW, Shampoo, and SOAP. If Muon's advantage holds up in that setting, the case for adoption becomes much stronger.
My recommendation: treat Muon as a promising empirical finding worth follow-up, not as a replacement for AdamW. The field has been burned before by optimizer papers that showed gains at one scale and evaporated at the next (recall the LAMB-vs-AdamW trajectory of 2020–2022). Careful, controlled replication at public-data scales is the next step.
Reproducibility & Sources
Primary paper: Liu, J. et al. *Muon is Scalable for LLM Training*. arXiv:2502.16982, 2025.
Code repository: Official release by Moonshot AI available on GitHub under the Moonlight/Muon organization; pretrained Moonlight checkpoints released on Hugging Face.
Datasets: Training data is a proprietary Moonshot AI mixture. Validation perplexity is reported on internal held-out sets. No public dataset is shared. Downstream evaluations use standard public benchmarks (MMLU, GSM8K, HumanEval, BBH), which are publicly accessible.
Reproducibility assessment (scale of 1–5):
- Code availability: 4/5. Training code and the NS iteration kernel are released, but the full 16B training pipeline, including data loader and sharding details, is only partially documented.
- Data availability: 2/5. The proprietary data mixture is the critical gap; the efficiency claim cannot be reproduced on identical data externally.
- Experimental detail sufficient: 3/5. Hyperparameters are reported, but ablation coverage at 16B is limited, no cross-seed variance is reported at scale, and no direct Shampoo/SOAP baseline is run at matched scale.
Relevant prior works cited:
- Loshchilov & Hutter, *Decoupled Weight Decay Regularization*, ICLR 2019.
- Kingma & Ba, *Adam: A Method for Stochastic Optimization*, ICLR 2015.
- Gupta, Koren & Singer, *Shampoo: Preconditioned Stochastic Tensor Optimization*, ICML 2018.
- Martens & Grosse, *Optimizing Neural Networks with Kronecker-factored Approximate Curvature*, ICML 2015.
- Bernstein et al. *signSGD: Compressed Optimisation for Non-Convex Problems*, ICML 2018.
- You et al. *Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes* (LAMB), ICLR 2020.
- Shazeer & Stern, *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost*, ICML 2018.
- Chen et al. *Symbolic Discovery of Optimization Algorithms* (Lion), arXiv:2302.06675, 2023.
- Vyas et al. *SOAP: Improving and Stabilizing Shampoo using Adam*, 2024.
- Bernstein & Newhouse, *Modular Duality in Deep Learning*, arXiv:2410.21265, 2024.
- Hoffmann et al. *Training Compute-Optimal Large Language Models* (Chinchilla), 2022.
- Higham, *Functions of Matrices: Theory and Computation*, SIAM, 2008.
