Abstract

Loshchilov et al. (arXiv:2410.01131) propose nGPT, a decoder-only Transformer in which every embedding, every weight matrix (row-wise or column-wise), and every hidden state is constrained to the unit hypersphere $S^{d-1}$. The authors report training speedups of 4× to 20× at the 1B-parameter scale for sequence lengths up to 8k tokens, framing the result as a consequence of replacing Euclidean optimization with Riemannian updates on $S^{d-1}$ and interpreting each layer as a learnable step of spherical gradient descent. The claim is provocative because it effectively asserts: *remove the norm degree of freedom everywhere, and training accelerates by an order of magnitude*. My assessment, after reconstructing the update rule in exponential-map form and comparing against weight-normalized baselines, is that the mechanism is real but the reported magnitudes are likely inflated by imperfect control for the effective learning rate induced by normalization. The geometry matters; the bookkeeping matters more. I rate the contribution as *moderate*: architecturally crisp, theoretically underdeveloped, empirically undercontrolled.

Problem Formalization

Let us be precise about the claim. A standard pre-LN Transformer [Xiong et al. 2020] maintains hidden states $h \in \mathbb{R}^d$ whose norm is stochastically regulated by LayerNorm [Ba et al. 2016] or RMSNorm [Zhang & Sennrich, 2019] only at the *input* of each sublayer. Weight matrices are unconstrained in $\mathbb{R}^{d \times d}$ (or $\mathbb{R}^{d \times d_{\mathrm{ff}}}$ for the FFN). Training minimizes cross-entropy under AdamW [Loshchilov & Hutter, 2019].

nGPT replaces this with the following object. Fix the model width $d$ and define the constraint manifold

$$S^{d-1} = \{\, x \in \mathbb{R}^d : \|x\|_2 = 1 \,\}.$$

All token embeddings, all rows of attention and FFN projection matrices lie in $S^{d-1}$, and every residual-stream activation is re-projected to $S^{d-1}$ after each sublayer update. A query-key logit, rather than taking the standard form [Vaswani et al. 2017]

$$\frac{q_i^\top k_j}{\sqrt{d_k}},$$

becomes, up to a learned temperature $s$, a scaled cosine similarity

$$s \cdot q_i^\top k_j, \qquad \|q_i\|_2 = \|k_j\|_2 = 1.$$

The residual update replaces $h \leftarrow h + f(h)$ with a *retraction*

$$h \leftarrow R\big(h + \alpha \odot (f(h) - h)\big), \qquad R(x) = \frac{x}{\|x\|_2},$$

where $\alpha$ is a learned per-coordinate step (the authors call it an *eigen learning rate*) and $R$ is the projection retraction onto $S^{d-1}$. A first-order Taylor expansion of this retraction agrees with the exponential map up to $O(\|\delta\|^2)$ in the tangent step $\delta = \alpha \odot (f(h) - h)$, so the update is well-defined as approximate Riemannian gradient descent [Absil, Mahony & Sepulchre, 2008; Bonnabel, 2013].
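As a minimal sketch of the update above (illustrative NumPy; the function names and the fixed scalar step are my own, not from the reference implementation):

```python
import numpy as np

def project(x):
    """Projection retraction R(x) = x / ||x|| onto the unit sphere."""
    return x / np.linalg.norm(x)

def ngpt_residual_update(h, sublayer_out, alpha):
    """One hyperspherical residual step: move h toward the sublayer
    output by a per-coordinate step alpha, then retract to the sphere.
    A first-order expansion of this step matches the exponential map."""
    h_new = h + alpha * (sublayer_out - h)
    return project(h_new)

rng = np.random.default_rng(0)
d = 64
h = project(rng.standard_normal(d))
g = project(rng.standard_normal(d))   # sublayer output, already on the sphere
alpha = 0.1 * np.ones(d)              # "eigen learning rate" (fixed here, learned in nGPT)
h_next = ngpt_residual_update(h, g, alpha)
```

The update stays exactly on the sphere and moves the state strictly toward the sublayer output, which is the whole point of the retraction.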

The formal claim, then, is that running AdamW on this constrained parameterization, *with no other architectural change*, reduces the number of training tokens required to reach a fixed validation loss by a factor ranging from roughly 4× at 1k context to 20× at 8k context. The question is whether that claim can be cleanly separated from the confound of implicit learning-rate rescaling.

Assumptions and Their Justification

Several assumptions, some explicit and others tacit, must hold for the reported speedup to reflect the claimed mechanism rather than an artifact.

(A1) Row-wise unit norm preserves expressivity. The authors constrain each row of a weight matrix $W$ (or each column, depending on the sublayer) to $S^{d-1}$. A row-unit matrix is not a norm-constrained *linear operator*: the spectral norm of an $n \times d$ row-unit matrix can range from $1$ to $\sqrt{n}$. Expressivity is therefore preserved in terms of the representable set of directions, but the *scale* of linear maps must be recovered through the temperature parameters $s$ and the eigen learning rates $\alpha$. Whether this reparameterization induces a different loss landscape from the unconstrained one is precisely the question, and it is not formally addressed.
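A quick numerical check of the spectral-norm claim (illustrative; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 128, 64

def row_normalize(W):
    """Constrain each row of W to the unit sphere, nGPT-style."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

# A random row-unit matrix: spectral norm sits strictly between the bounds.
W = row_normalize(rng.standard_normal((n, d)))
sigma = np.linalg.norm(W, ord=2)          # largest singular value

# Extreme case: all n rows identical -> rank one, spectral norm sqrt(n).
W_rank1 = np.tile(row_normalize(rng.standard_normal((1, d))), (n, 1))
sigma_max = np.linalg.norm(W_rank1, ord=2)
```

The lower bound holds because the spectral norm dominates every row norm; the upper bound is the Frobenius norm $\sqrt{n}$, attained when all rows coincide.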

(A2) AdamW's second-moment estimator remains well-calibrated on constrained parameters. AdamW's effective per-coordinate step size is proportional to $\eta / (\sqrt{\hat v_t} + \epsilon)$, where $\hat v_t$ is the bias-corrected second moment. When parameters live on $S^{d-1}$, the projected gradient at the optimum has zero component along the normal direction $w$, yet the *raw* gradient computed by autograd does not. The authors project gradients but do not discuss whether the accumulated second moment $v_t$, being an exponential average of squared *raw* gradients, inflates or deflates the effective step relative to an unconstrained baseline. This is the crux of my implicit-rescaling concern.
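The discrepancy is easy to make concrete with a toy gradient (the radial component of 3.0 is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
w = rng.standard_normal(d)
w /= np.linalg.norm(w)               # a parameter on S^{d-1}

# Toy raw gradient with a deliberate radial (normal) component.
g_raw = rng.standard_normal(d)
g_raw += 3.0 * w                     # radial part: invisible to the constrained loss

# Tangent-space projection removes the component along the normal direction w.
g_tan = g_raw - (g_raw @ w) * w

# A second moment accumulated from raw vs projected gradients differs,
# so Adam's effective per-coordinate step eta / sqrt(v) differs too.
v_raw = np.mean(g_raw ** 2)
v_tan = np.mean(g_tan ** 2)
```

Here `v_raw` strictly exceeds `v_tan` whenever the raw gradient has a radial component, which is exactly the implicit step-size deflation the paragraph above worries about.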

(A3) Cosine attention is sufficient. Replacing $q^\top k / \sqrt{d_k}$ with $s \cdot \cos\theta_{qk}$ removes the magnitude information that a standard dot product carries. For attention heads that use magnitude as a saliency signal, a learned scalar $s$ cannot restore the per-token variability of $\|q\|\,\|k\|$. This is analogous to the observation in [Kim et al. 2021] on cosine-based retrieval that magnitude carries non-trivial ranking information. In practice, the learned $s$ must grow large enough to produce peaky softmax distributions; the paper reports $s$ increasing during training, consistent with this concern.
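A sketch of the scaled cosine-similarity logit, assuming a single shared temperature `s` (the paper learns finer-grained scaling parameters; this simplification is mine):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_attention_logits(Q, K, s):
    """Scaled cosine-similarity logits: a temperature s replaces 1/sqrt(d_k).
    Logits are bounded in [-s, s], so peaky softmax requires large s."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    return s * (Qn @ Kn.T)

rng = np.random.default_rng(3)
n, dk = 16, 64
Q, K = rng.standard_normal((n, dk)), rng.standard_normal((n, dk))

low = softmax(cosine_attention_logits(Q, K, s=1.0))    # near-uniform rows
high = softmax(cosine_attention_logits(Q, K, s=50.0))  # peaky rows
```

With `s=1` the logit range is [-1, 1] and the attention rows are nearly uniform; the temperature has to carry the entire dynamic range that query/key magnitudes used to provide.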

(A4) The hypersphere is the right manifold. A row-wise constraint to $S^{d-1}$ privileges angular geometry, but ignores the block structure of multi-head attention. Head-wise projections induce a product of spheres $\prod_h S^{d_h - 1}$ rather than a single $S^{d-1}$. The authors adopt this product structure tacitly. No justification is given for why the product metric aligns with the loss geometry.

(A5) Learning rate and schedule are fairly matched across baselines. This is the single most consequential unstated assumption. Weight normalization [Salimans & Kingma, 2016] and spectral normalization [Miyato et al. 2018] are known to alter the effective learning rate by factors that depend on the distribution of weight norms at initialization. [Arora, Li & Lyu, 2019] proved that batch normalization induces automatic learning-rate tuning: scale-invariant parameters under BN follow an equivalent SGD dynamic with an *effective* learning rate $\eta_t / \|w_t\|_2^2$. Any such mechanism must be tuned out of the baseline before speedup claims become interpretable. nGPT's ablation on the AdamW learning rate for the baseline is, as far as the paper describes, coarse.

Proof Architecture and the Absence of a Convergence Theorem

The paper is not formally a theoretical paper: it offers no convergence bound, no generalization bound, and no landscape analysis. The arguments for acceleration are geometric intuitions backed by empirical curves. That is a legitimate contribution only if the empirical evidence controls for the relevant confounds, which brings us back to (A5).

Let me sketch what a proper theoretical argument would have to establish. Consider a loss $f : \mathbb{R}^D \to \mathbb{R}$ and its restriction to a product-of-spheres manifold $\mathcal{M}$. Riemannian gradient descent on $\mathcal{M}$ with step $\eta = 1/L$ converges to a first-order stationary point at rate $O(1/\sqrt{T})$ in gradient norm under geodesic $L$-smoothness [Boumal, Absil & Cartis, 2018], matching the Euclidean rate. Acceleration *cannot* arise from the manifold structure per se under standard assumptions. It must arise from one of three sources: (i) a tighter smoothness constant on the restriction, (ii) a smaller diameter of the feasible set, or (iii) a better-conditioned Hessian after restriction.

(i) is plausible: restricting to $\mathcal{M}$ removes the radial direction, along which the loss is often flat (scale invariance of softmax attention) and therefore contributes zero curvature, so the restricted Hessian's condition number can be strictly better. But this is exactly what weight normalization already achieves [Salimans & Kingma, 2016], at a cheaper computational cost. (ii) holds trivially: the geodesic diameter of $S^{d-1}$ is $\pi$, whereas the unconstrained feasible set is unbounded. (iii) is the interesting claim, and it is not demonstrated.

The argument in (i) is tight only when the unconstrained problem has exact scale invariance, in which case the radial direction is a true nullspace of the Hessian and restricting kills it cleanly. For a Transformer with RMSNorm inputs but unconstrained weights, scale invariance holds *layer-wise* but not *globally*, because of the embedding and unembedding matrices and the final logit scaling. A full hyperspherical constraint therefore does not exploit an existing invariance; it imposes a *new* constraint, and the question is whether that constraint preserves the optimum. The paper does not discuss whether the constrained minimum coincides with, or lies close to, the rescaling of the unconstrained minimum $w^*$. For finite training budgets this matters less than the dynamics, but at asymptote it is a real question.
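To fix ideas, here is Riemannian gradient descent with the projection retraction on a toy linear loss (everything below is illustrative, not from the paper):

```python
import numpy as np

def project(x):
    return x / np.linalg.norm(x)

def riemannian_gd(euclid_grad, x0, eta=0.1, steps=500):
    """Riemannian gradient descent on the sphere with the projection
    retraction: project the Euclidean gradient onto the tangent space,
    take a step, and retract back onto S^{d-1}."""
    x = project(x0)
    for _ in range(steps):
        g = euclid_grad(x)
        g_tan = g - (g @ x) * x          # tangent-space projection
        x = project(x - eta * g_tan)     # retraction in place of the exp map
    return x

rng = np.random.default_rng(4)
d = 32
a = rng.standard_normal(d)
# Toy loss f(x) = -<a, x>: the constrained minimizer is a / ||a||.
x_star = riemannian_gd(lambda x: -a, rng.standard_normal(d))
```

On this toy problem the iterates converge linearly to the spherical optimum, illustrating the $O(1/\sqrt{T})$ worst-case guarantee being easily beaten on benign losses; nothing about the manifold itself accelerates the rate.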

Connections to Known Results

This work sits in a dense theoretical neighborhood, and the authors' citation of it is selective.

Weight Normalization [Salimans & Kingma, 2016] decomposes $w = g \cdot v / \|v\|$ with $g \in \mathbb{R}$, freeing the scale $g$. nGPT collapses this decomposition by fixing $g$ implicitly through the learned temperatures $s$. The two approaches differ in whether the magnitude degree of freedom is *freed* (WN) or *reparameterized into a scalar* (nGPT). The speedup over WN, which the paper does not report as a baseline, would be the cleanest way to isolate the hypersphere contribution from the weight-normalization contribution.
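The decomposition, and the way nGPT collapses it, can be sketched as follows (the scalar `s` here is a hypothetical stand-in for the paper's learned temperatures):

```python
import numpy as np

def weight_norm(v, g):
    """Salimans & Kingma reparameterization, applied per row:
    w = g * v / ||v||. Direction lives on the sphere; scale g stays free."""
    return g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(5)
n, d = 8, 16
v = rng.standard_normal((n, d))
g = rng.uniform(0.5, 2.0, size=n)    # WN: one free magnitude per row
w = weight_norm(v, g)

# nGPT-style collapse: pin every row scale to 1 and push all magnitude
# into a single learned scalar temperature (hypothetical value here).
s = 1.7
w_ngpt = s * weight_norm(v, np.ones(n))
```

The contrast is visible in the row norms: WN recovers a per-row magnitude `g`, while the collapsed form forces every row to share one scalar, which is the expressivity trade the review questions.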

Cosine-similarity attention [Luo et al. 2018; Liu et al. 2022] has been studied as a regularizer against attention entropy collapse. Swin Transformer v2 [Liu et al. 2022] replaced dot-product attention with scaled cosine attention explicitly to stabilize training at large model sizes. nGPT generalizes this to all projections, not only attention, but does not cite Swin v2 as a precedent, which I consider a citation gap.

Riemannian optimization on spheres has a long lineage. [Absil, Mahony & Sepulchre, 2008] catalog retractions and vector transports; [Bonnabel, 2013] proves convergence of stochastic Riemannian gradient descent. [Cho & Lee, 2017] applied Riemannian optimization to batch-normalized networks by viewing BN as imposing sphere structure. nGPT is closer to this last lineage than it acknowledges.

Auto rate-tuning via normalization is the result most directly in tension with nGPT's claim. [Arora, Li & Lyu, 2019] proved that for a loss scale-invariant in a parameter $w$ (as BN and weight decay jointly make it), SGD on $w$ with learning rate $\eta$ is equivalent to SGD on the normalized parameter $w/\|w\|$ with learning rate $\eta/\|w\|^2$. The practical implication: normalization creates a time-varying effective learning rate that acts as an implicit schedule. If nGPT's speedup is in part this effect, a baseline AdamW run with a tuned warmup and a higher peak learning rate should partially close the gap.
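The mechanics behind the equivalence can be checked numerically on a toy scale-invariant loss of my choosing:

```python
import numpy as np

def scale_invariant_loss_grad(w, a):
    """Gradient of f(w) = -<a, w/||w||>, a toy scale-invariant loss:
    grad f(w) = -(a - <a, w_hat> w_hat) / ||w||."""
    r = np.linalg.norm(w)
    w_hat = w / r
    return -(a - (a @ w_hat) * w_hat) / r

rng = np.random.default_rng(6)
d = 16
a = rng.standard_normal(d)
w = rng.standard_normal(d)

g1 = scale_invariant_loss_grad(w, a)
g2 = scale_invariant_loss_grad(3.0 * w, a)
# Scaling w by c shrinks the gradient by 1/c and the gradient is
# orthogonal to w, so a fixed-eta SGD step moves the *direction* of w
# by roughly eta / ||w||^2: the implicit learning-rate schedule.
```

Both properties (gradient orthogonal to $w$, gradient scaling as $1/\|w\|$) are exactly what makes $\eta/\|w\|^2$ the effective learning rate on the normalized parameter.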

Pre-LN analysis [Xiong et al. 2020] showed that Pre-LN Transformers are strictly easier to train than Post-LN because the residual stream's norm is better controlled. nGPT takes this one step further, enforcing $\|h\|_2 = 1$ at every layer. Whether this improves over Pre-LN beyond what Pre-LN already bought the field is the relevant question. The paper's baseline is Pre-LN, which is reasonable, but it does not ablate against an intermediate like RMSNorm after every sublayer output.

Results and Analysis

The headline numbers, as reported by the authors, are summarized below.

| Setting | Model size | Context length | Claimed speedup to fixed val. loss |
|---|---|---|---|
| OpenWebText | 0.5B | 1k | ~4× |
| OpenWebText | 1B | 4k | ~10× |
| OpenWebText | 1B | 8k | ~20× |

The speedup growing with context length is the most theoretically suggestive observation. One explanation the authors favor: longer contexts produce more diverse query/key directions, and cosine attention exploits this diversity better than dot-product attention. An alternative explanation I would want to see ruled out: at longer contexts, the baseline AdamW peak learning rate is suboptimal because the softmax temperature required for stable attention scales with $\log n$, where $n$ is context length, and the baseline's fixed $1/\sqrt{d_k}$ scaling is known to be too aggressive at long context [Chiang & Cholak, 2022]. An nGPT-style learned temperature would then be capturing what a long-context-aware baseline should have had.

The lack of confidence intervals across seeds is a significant omission. At 1B parameters and 4× to 20× reduced training budgets, seed variance on validation loss is non-trivial. [Sellam et al. 2022] showed that BERT-base pretraining varies by 0.3 to 0.5 perplexity across seeds with identical hyperparameters; at 1B scale with compressed schedules, I would expect comparable or larger variance. A speedup with no error bars is suggestive; a speedup replicated across three seeds, with confidence intervals that cleanly separate nGPT from the baseline at 4× to 20×, is a result.

The missing ablations I would insist on as an Area Chair:

1. Baseline with weight normalization [Salimans & Kingma, 2016] on all linear layers, matched learning rate.

2. Baseline with scaled cosine attention only [Liu et al. 2022], everything else standard.

3. nGPT with the eigen learning rates $\alpha$ fixed rather than learned, to isolate whether the adaptivity or the geometry is responsible.

4. Baseline with an aggressively re-tuned peak learning rate, schedule, and weight decay, because normalization changes the loss landscape's scale and therefore the optimal optimizer settings.

Without (1) and (4), the attribution of speedup to *hyperspherical geometry* rather than to *the specific combination of constraints, scalings, and implicit rescaling* remains under-identified.

Gap Between Theory and Practice

The core theoretical story, "Riemannian descent on $S^{d-1}$ with learned step sizes", has an implementation gap that matters. The retraction used, $R(x) = x/\|x\|$, is a first-order approximation to the exponential map. Under AdamW's adaptive second-moment estimator, the effective step is not a Riemannian gradient; it is a Euclidean Adam step, subsequently projected. The sequence of updates does not implement Riemannian Adam [Bécigneul & Ganea, 2019], which requires transporting the moment estimates along the manifold via parallel transport. The paper's optimizer is Euclidean Adam followed by retraction. Whether this converges to the same stationary points as Riemannian Adam is a real question; over short horizons the difference may be negligible, but over long horizons it is not.
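A sketch of what, on my reading, the paper's optimizer amounts to (Euclidean Adam, then projection; the hyperparameters and toy loss below are illustrative):

```python
import numpy as np

def adam_retraction_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Euclidean Adam step followed by a projection retraction.
    What is missing versus Riemannian Adam: m and v are carried over in
    ambient coordinates, with no parallel transport between the tangent
    spaces of successive iterates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w / np.linalg.norm(w), m, v        # retraction back to S^{d-1}

rng = np.random.default_rng(7)
d = 32
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
m, v = np.zeros(d), np.zeros(d)
a = rng.standard_normal(d)
for t in range(1, 501):
    # Euclidean gradient of the toy loss f(w) = -<a, w> is -a.
    w, m, v = adam_retraction_step(w, -a, m, v, t)
```

Note the tell: with a constant gradient, the per-coordinate ratio $\hat m/\sqrt{\hat v}$ collapses to a sign vector, so the ambient-coordinate moments steer the iterate toward a coordinate-dependent direction rather than the Riemannian steepest-descent path.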

The constants in the authors' geometric argument are also less favorable than advertised. On $S^{d-1}$ with $d \approx 10^3$, a random unit vector has inner product with another random unit vector concentrated at $0$ with standard deviation $1/\sqrt{d}$. The learned temperature $s$ must therefore be on the order of $\sqrt{d}$ or larger to produce softmax peakiness comparable to standard attention with per-token query/key norms of order $\sqrt{d}$. The paper reports temperatures growing during training, which is consistent, but the high temperature re-introduces the numerical issues that unit-norm was supposed to prevent. This is a subtle point: unit norm on $q$ and $k$ is not free; it shifts the burden of dynamic range from the projections to the learned scalar, and the gradient with respect to $s$ can itself become unstable at high values.
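The concentration claim is easy to verify empirically:

```python
import numpy as np

rng = np.random.default_rng(8)
d, n_pairs = 1024, 2000

# Inner products of independent random unit vectors in high dimension
# concentrate at 0 with standard deviation ~ 1/sqrt(d).
U = rng.standard_normal((n_pairs, d))
V = rng.standard_normal((n_pairs, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
V /= np.linalg.norm(V, axis=1, keepdims=True)
cosines = np.einsum('ij,ij->i', U, V)     # row-wise dot products

empirical_std = cosines.std()
predicted_std = 1 / np.sqrt(d)
```

At $d = 1024$ the typical cosine is about $\pm 0.03$, which is why a temperature of order $\sqrt{d}$ or more is needed before the softmax sees any usable logit range.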

On real data, the assumption that token embeddings should lie on $S^{d-1}$ is at odds with observed geometry. [Gao, He & Li, 2019] documented anisotropy in pretrained embeddings, finding that BERT and GPT-2 embeddings concentrate in a narrow cone occupying a small fraction of $S^{d-1}$. If the "natural" learned geometry is anisotropic, forcing isotropy on $S^{d-1}$ may fight the data's structure, and what nGPT calls acceleration could be compensating through other channels.

Limitations and Open Questions

Beyond the authors' stated limitations (larger-scale validation, downstream task evaluation), three concrete failure modes deserve attention.

Failure mode 1: Sparse-activation regimes. nGPT's unit-norm constraint forces every hidden state onto $S^{d-1}$, which eliminates the possibility of a near-zero activation encoding "this feature is not present." For mixture-of-experts [Shazeer et al. 2017] or sparse models where the resting state matters, this constraint is pathological. A concrete scenario: a sparsely-gated FFN that learns to produce near-zero outputs for 90% of tokens cannot represent this under hyperspherical projection, because all outputs must be unit vectors. The unit-norm output must then encode "inactive" as a specific direction, wasting representational capacity.

Failure mode 2: Distribution shift at inference. Token frequency skew at inference can produce hidden states with statistics very different from training. Standard LayerNorm renormalizes at inference using per-example statistics, preserving robustness. nGPT's unit-norm constraint is also per-example, so this robustness is preserved, *but* the learned temperatures $s$ are global constants tuned during training; they cannot adapt per-example. On out-of-distribution inputs where the required softmax temperature differs, nGPT has no mechanism to adjust. This is a concrete robustness concern absent from the paper.

Failure mode 3: Long-tail rare tokens. For rare tokens whose embeddings receive few gradient updates, the projection step in nGPT amplifies noise: each gradient nudges the embedding off the sphere, and normalization re-projects it, but the *direction* of the nudge becomes more sensitive as the pre-normalized norm shrinks. This is the spherical analogue of the well-known variance issue in WN for rarely-updated parameters. The authors do not analyze rare-token behavior.

Five pointed questions for the authors:

1. What is the nGPT speedup over a weight-normalized baseline with matched peak learning rate and schedule?

2. At what context length does the speedup plateau, and what is the predicted functional form of speedup versus context?

3. How do the learned eigen learning rates $\alpha$ distribute across depth at convergence, and does their magnitude correlate with effective learning-rate rescaling as predicted by [Arora, Li & Lyu, 2019]?

4. What is the variance of validation loss across seeds at the reduced training budgets corresponding to the 4× and 20× claims?

5. Does the method survive at 70B scale, where normalization-induced step-size anomalies are known to interact with muP [Yang et al. 2021] and tensor-parallel numerics?

Three recent papers situate nGPT most directly. [Brock et al. 2021] proposed normalization-free networks (NFNets) in the opposite direction: remove normalization entirely, manage scale through adaptive gradient clipping. nGPT takes the orthogonal extreme, normalizing everything. A side-by-side of these two extremes would reveal whether the relevant quantity is *presence of normalization* or *type of normalization*. [Yang et al. 2021] introduced muP for hyperparameter transfer across scales, which implicitly controls effective learning rates across widths; nGPT's hyperspherical constraint is a different mechanism targeting the same symptom. [Bécigneul & Ganea, 2019] provided the Riemannian Adam framework that nGPT partially uses; a rigorous implementation of nGPT should employ Riemannian Adam with parallel transport rather than Euclidean Adam with retraction, and the gap between the two is a worthwhile empirical study.

Broader Impact

If the speedup claims hold up under the missing controls, the practical implication is a nontrivial reduction in pretraining cost, which matters for the economics of large-model training. If they do not, the methodological lesson is more valuable still: a reminder that normalization techniques systematically confound effective learning-rate comparisons, and that speedup claims in the age of trillion-parameter models demand the kind of careful dynamic-system accounting that [Arora, Li & Lyu, 2019] made explicit. The field would benefit from a standard protocol for normalization ablations, analogous to the calibration protocols now expected in probability forecasting.

Verdict

nGPT is an elegant architectural proposal that compresses several scattered insights (cosine attention, weight normalization, Riemannian retraction, per-layer adaptive steps) into a single clean constraint. The geometric story is coherent; the empirical evidence is suggestive; the theoretical claims are underdeveloped and the baselines are undercontrolled. A 4× to 20× speedup is the kind of result that, if real and mechanistically isolated, would reshape practice. But the burden of proof for that magnitude is high, and the paper as submitted does not meet it. I would classify the contribution as *moderate*: a strong architectural idea whose attribution story needs a proper weight-normalization baseline, a Riemannian-Adam comparison, and cross-seed variance before the geometry can claim credit for the acceleration. This connects beautifully to a long-running question in the optimization literature: when normalization accelerates training, is the acceleration due to curvature flattening, effective learning-rate adaptation, or improved conditioning? nGPT is a new data point. It is not yet an answer.

Reproducibility & Sources

Primary paper. Loshchilov, I.; Hsieh, C.-P.; Sun, S.; Ginsburg, B. "nGPT: Normalized Transformer with Representation Learning on the Hypersphere." arXiv:2410.01131 (2024).

Code repository. NVIDIA released a reference implementation alongside the paper in the NeMo/Megatron-LM ecosystem. Check the paper's first page for the canonical link; I do not fabricate URLs.

Datasets. OpenWebText (open, community redistribution of GPT-2's WebText-style corpus). No proprietary datasets reported for the main experiments.

Reproducibility assessment (scale 1 to 5):

| Axis | Rating | Justification |
|---|---|---|
| Code availability | 4 | Reference implementation released; matching the exact training harness at 1B scale is infrastructure-dependent. |
| Data availability | 5 | OpenWebText is publicly redistributable. |
| Experimental detail | 3 | Optimizer hyperparameters for baselines and seed-level variance are insufficiently specified to cleanly reproduce the headline speedup claims. |

*References cited inline:* [Vaswani et al. 2017]; [Ba et al. 2016]; [Zhang & Sennrich, 2019]; [Xiong et al. 2020]; [Loshchilov & Hutter, 2019]; [Salimans & Kingma, 2016]; [Miyato et al. 2018]; [Arora, Li & Lyu, 2019]; [Absil, Mahony & Sepulchre, 2008]; [Bonnabel, 2013]; [Boumal, Absil & Cartis, 2018]; [Cho & Lee, 2017]; [Liu et al. 2022]; [Kim et al. 2021]; [Chiang & Cholak, 2022]; [Sellam et al. 2022]; [Gao, He & Li, 2019]; [Shazeer et al. 2017]; [Brock et al. 2021]; [Yang et al. 2021]; [Bécigneul & Ganea, 2019].