The Question That Will Not Die
Four years ago, Power et al. (2022) deposited a strange empirical observation into the machine learning literature and walked away. A small transformer, trained on modular arithmetic with weight decay, would fit the training set in a few hundred steps, sit at chance-level test accuracy for hundreds of thousands of steps, and then, abruptly, *generalize*. The loss curve looked like two sigmoids separated by a flat desert. They called it grokking.
The field has spent four years arguing about what, if anything, grokking reveals about neural generalization. The debate matters because grokking is one of the rare phenomena in deep learning where we have clean, reproducible dynamics on a toy problem with a known ground-truth representation (the Fourier basis on $\mathbb{Z}_p$, as shown by Nanda et al. 2023). If we cannot explain grokking, our understanding of implicit regularization is in worse shape than we admit.
This review is not a summary of Power et al. The paper's empirical content is well known. What I want to do is situate the grokking phenomenon within a taxonomy of explanations, evaluate which survive rigorous scrutiny, and ask whether the "phase transition" framing is a genuine theoretical handle or an artifact of how weight decay interacts with the loss-landscape geometry of attention. I will argue that this reduction, if it holds, reveals something fundamental.
Historical Context: How We Got Here
The intellectual lineage of grokking runs through several overlapping programs in learning theory and optimization. Understanding the lineage is half of understanding the phenomenon.
The first thread is the implicit bias of gradient descent, formalized by Soudry et al. (2018) for logistic regression on separable data. Their theorem states that gradient descent on the logistic loss converges in direction to the max-margin solution, even without explicit regularization, at rate $O(1/\log t)$. This result established that the optimizer itself encodes a preference over zero-training-loss solutions. It is the theoretical backdrop against which grokking makes sense at all: if every interpolating solution were equally likely, there would be no reason for delayed generalization to prefer the Fourier solution over any memorizing one.
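The implicit-bias claim is easy to make concrete. The sketch below is my own construction, not Soudry et al.'s code: full-batch gradient descent on the logistic loss over synthetic separable data, where the weight norm grows without bound while the direction freezes onto a fixed separator.

```python
import numpy as np

# Minimal sketch of the Soudry et al. (2018) implicit bias: on separable
# data, GD on logistic loss drives ||w|| -> infinity (roughly like log t)
# while the *direction* w/||w|| converges to the max-margin separator.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
score = X[:, 0] + 0.5 * X[:, 1]
keep = np.abs(score) > 0.3           # enforce a positive margin -> separable
X, y = X[keep], np.sign(score[keep])

w = np.zeros(2)
lr = 0.1
for t in range(1, 100_001):
    m = y * (X @ w)                                  # per-example margins
    g = -((y / (1.0 + np.exp(m)))[:, None] * X).mean(axis=0)
    w -= lr * g
    if t in (10, 100, 1_000, 10_000, 100_000):
        print(t, round(np.linalg.norm(w), 3), w / np.linalg.norm(w))
# The printed norm keeps growing while the printed direction stabilizes:
# a preference over interpolating solutions with no explicit regularizer.
```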
The second thread is double descent (Belkin et al. 2019; Nakkiran et al. 2020). Double descent showed that test error can decrease past the interpolation threshold, violating classical bias-variance intuitions. Grokking is arguably a temporal analog: not a function of model size or data, but of *optimization time* holding everything else fixed. The mathematical connection is not accidental. Both phenomena concern the geometry of the set of interpolating solutions and the dynamics that traverse it.
The third thread is mechanistic interpretability, and specifically the result by Nanda et al. (2023) on progress measures. They showed that the Power et al. transformer implements modular arithmetic via a discrete Fourier transform: each attention head computes $\cos(\omega_k (a+b))$ and $\sin(\omega_k (a+b))$ for a small set of frequencies $\omega_k$, and the unembedding performs the inverse transform. Critically, they constructed a *progress measure*, the Fourier gap, which rises smoothly during the memorization plateau and predicts the onset of generalization *before* test accuracy moves. This is a major result: it reframes grokking from a discontinuous phase transition into a continuous representational transition that is invisible to loss-based observables.
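The flavor of such a progress measure is easy to convey in code. The sketch below is an illustrative proxy of my own, not Nanda et al.'s exact Fourier-gap definition: it measures how much of an embedding matrix's spectral power concentrates on a few nonconstant frequencies, which is diffuse for a random (memorizing-like) embedding and near total for a Fourier-circuit embedding.

```python
import numpy as np

def fourier_concentration(E: np.ndarray) -> float:
    """Fraction of power on the top-5 nonconstant frequencies of E.

    E has shape (p, d): one row per residue a in Z_p. A diffuse embedding
    spreads power over all frequencies; a Fourier circuit concentrates it
    on a few key frequencies, so this proxy rises as the circuit forms.
    """
    F = np.fft.rfft(E, axis=0)             # DFT over the residue axis
    power = (np.abs(F) ** 2).sum(axis=1)   # total power per frequency
    power = power[1:]                      # drop the constant (DC) bin
    top = np.sort(power)[::-1]
    return float(top[:5].sum() / power.sum())

p, d = 113, 128
rng = np.random.default_rng(0)
print(fourier_concentration(rng.normal(size=(p, d))))    # small: diffuse

omegas = 2 * np.pi * np.arange(1, 6) / p                 # 5 key frequencies
a = np.arange(p)[:, None]
E_fourier = np.concatenate([np.cos(a * omegas), np.sin(a * omegas)], axis=1)
print(fourier_concentration(E_fourier))                  # ~1.0: concentrated
```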
The fourth thread is the lottery ticket / sparse circuit hypothesis (Frankle & Carbin, 2019; Merrill et al. 2023). Merrill et al. proposed that grokking is a competition between a dense memorizing subnetwork and a sparse generalizing circuit, with weight decay slowly starving the dense subnetwork until the sparse one dominates. This framing is appealing because it makes a concrete prediction: the time-to-grok should scale with the ratio of the two subnetworks' norm decay rates under weight decay.
The fifth thread, perhaps the most important for this review, is the lazy-to-rich transition (Chizat et al. 2019; Woodworth et al. 2020). In the Neural Tangent Kernel regime, networks behave like kernel methods and cannot learn feature structure. In the rich, feature-learning regime, the network aligns its representations with task structure. Kumar et al. (2024) conjectured that grokking is the manifestation of a network transitioning from a lazy fit (effectively memorization in a large-norm kernel regime) to a rich fit (the Fourier circuit) under the slow pressure of weight decay. If this is right, grokking is not a mysterious phase transition at all. It is the standard feature-learning dynamic, merely slowed down until we can observe it.
Finally, work by Barak et al. (2022) on SGD learning parities established that gradient-based optimization can exhibit long plateaus followed by rapid learning on problems with high statistical dimension. Grokking on modular arithmetic is plausibly the same phenomenon: modular arithmetic over $\mathbb{Z}_p$ has a Fourier structure whose statistical-query (SQ) dimension (Kearns, 1998) is large in the relevant parameter regime, and gradient descent must discover the right representation before it can make progress.
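A minimal version of the parity setup makes the plateau tangible. The following sketch is my own construction with illustrative hyperparameters (the choices of `n`, `k`, width, and learning rate are assumptions, not Barak et al.'s exact experiment), and whether and when the accuracy rises within the shown budget depends on those choices.

```python
import numpy as np

# Sketch of a Barak et al. (2022)-style experiment: online SGD on a small
# 2-layer ReLU net learning a k-sparse parity over n bits. Test accuracy
# often sits near chance (~0.5) for a long stretch before rising quickly.
rng = np.random.default_rng(0)
n, k, width, lr = 20, 3, 128, 0.1
S = rng.choice(n, size=k, replace=False)       # hidden parity support

def batch(m=256):
    x = rng.choice([-1.0, 1.0], size=(m, n))
    y = np.prod(x[:, S], axis=1)               # k-sparse parity in {-1,+1}
    return x, y

W = rng.normal(size=(n, width)) / np.sqrt(n)
a = rng.normal(size=width) / np.sqrt(width)

for step in range(20_001):
    x, y = batch()
    h = np.maximum(x @ W, 0.0)                 # ReLU features
    err = h @ a - y                            # squared-loss residual
    a -= lr * (h.T @ err) / len(y)
    W -= lr * (x.T @ ((err[:, None] * a) * (h > 0))) / len(y)
    if step % 2_000 == 0:
        xt, yt = batch(2_000)
        acc = (np.sign(np.maximum(xt @ W, 0) @ a) == yt).mean()
        print(step, float(acc))                # plateau, then (often) a jump
```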
Taxonomy of Explanations
Let me classify the existing accounts of grokking into coherent families, because the literature is now large enough that readers routinely conflate incompatible hypotheses.
Family 1: The Phase Transition Account. Grokking is a genuine dynamical phase transition, analogous to spin-glass transitions in statistical physics. The evidence: the abruptness of test-accuracy gain and the existence of sharp critical hyperparameters (learning rate, weight decay, data fraction) beyond which grokking disappears. Power et al. (2022) implicitly adopt this framing. Thilak et al. (2022) on the "slingshot mechanism" and related work on adaptive-optimizer instabilities lean the same direction.
Family 2: The Representation-Learning Account. Grokking is the slow formation of a structured representation that was never present at memorization time. Nanda et al. (2023) is the canonical evidence. The transformer discovers the Fourier basis over hundreds of thousands of steps; the test-accuracy jump reflects when the representation crosses a threshold of usability, not a dynamical discontinuity. Liu et al. (2022) formalize this with an effective theory on a toy model in which representation quality and generalization are continuously linked.
Family 3: The Circuit Competition Account. Weight decay induces a competition between a dense, memorizing subnetwork and a sparse, generalizing one. The dense subnetwork wins the training-loss race; the sparse one wins the norm race. Varma et al. (2023) quantify this with a notion of "circuit efficiency." Merrill et al. (2023) make similar claims using Gini-based sparsity measures.
Family 4: The Lazy-to-Rich Account. Grokking is a feature-learning transition slowed by initialization scale and weight decay. On this view, no special grokking mechanism is needed. Kumar et al. (2024) argue that if you initialize small enough and let weight decay pull norms further down, any sufficient-capacity network will eventually leave the lazy regime.
Family 5: The Optimization-Geometry Account. Grokking reflects the geometry of the interpolating manifold under weight decay. On modular arithmetic, the set of zero-train-loss solutions contains both memorizing and generalizing points. Weight decay defines an energy on this set. The generalizing solutions lie at lower energy, but the dynamics must traverse a saddle to reach them. The traversal time is the grokking delay.
Power et al. (2022) belongs historically to Family 1 in framing, but the subsequent literature has largely moved away from treating grokking as a physical phase transition. The weight of evidence now favors a combination of Families 2, 4, and 5, with Family 3 providing a useful local picture.
Comparative Analysis
Here is a structured comparison of the most influential grokking papers, which any reader entering the field should internalize.
| Paper | Primary Mechanism | Scale (params / task) | Key Quantitative Claim |
|---|---|---|---|
| Power et al. (2022) | Empirical discovery; attributes to regularization pressure | 2-layer transformer (~$4\times10^5$ params) / modular arithmetic mod 97 | Test accuracy jumps from <5% to >95% in under 10% of training steps, after a memorization plateau lasting up to $10^5$–$10^6$ steps |
| Nanda et al. (2023) | Fourier-circuit formation; progress measures | 1-layer transformer / mod 113 | Fourier gap rises smoothly; excluded loss shows continuous progress during the "plateau" |
| Liu et al. (2022) | Effective theory: representation norm vs. decoder norm | Toy 2-layer models | Phase diagram as a function of weight decay and representation scale |
| Varma et al. (2023) | Circuit efficiency; sparse circuit wins under weight decay | Modular arithmetic and beyond | Predicts "ungrokking" if weight decay is removed mid-training; verifies empirically |
| Thilak et al. (2022) | Slingshot instability of Adam | Attention and MLP | Grokking correlates with norm oscillations under Adam, not SGD |
| Kumar et al. (2024) | Lazy-to-rich transition with scale | Controlled 2-layer | Grokking time scales logarithmically in the initialization scale |
Two features of this table deserve further attention. First, the claimed mechanisms are not mutually exclusive. A Fourier circuit *can* be the sparse subnetwork *and* the feature-learning target *and* the low-weight-decay-energy attractor. Much of the "disagreement" in the literature is really about emphasis. Second, the experimental scales are uniformly small. Nobody has demonstrated grokking, in its sharp form, on tasks with input dimension much beyond these toy settings or on networks much larger than the small transformers in the table. The scaling behavior therefore remains speculative.
Technical Analysis of Power et al.
Let me now audit the original paper's methodology against the standard I would apply as an Area Chair.
Setup. Power et al. train a 2-layer transformer on the task of predicting $c = a \circ b$ for various binary operations $\circ$ on $\mathbb{Z}_p$ with $p = 97$, using fractions $\alpha$ of the $p^2$ examples for training. The loss is cross-entropy over $p$ output logits. They use AdamW with weight decay $1$ (an unusually large value) and report test accuracy versus optimization step on log axes.
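For concreteness, the data pipeline and optimizer configuration look roughly like the sketch below. The embedding-plus-MLP stand-in for the paper's 2-layer transformer and the learning rate are my assumptions, so this should not be expected to reproduce the published curves exactly.

```python
import itertools
import torch

# Sketch of the Power et al. setup for a + b (mod p) with a train fraction.
p, train_frac = 97, 0.3
pairs = list(itertools.product(range(p), repeat=2))
data = torch.tensor([[a, b, (a + b) % p] for a, b in pairs])
perm = torch.randperm(len(data))
n_train = int(train_frac * len(data))
train, test = data[perm[:n_train]], data[perm[n_train:]]

# Stand-in model (assumption): the paper uses a small decoder-only
# transformer; an embedding + MLP keeps this sketch self-contained.
model = torch.nn.Sequential(
    torch.nn.Embedding(p, 128),
    torch.nn.Flatten(),
    torch.nn.Linear(2 * 128, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
                        weight_decay=1.0)   # the unusually large value

for step in range(100_000):                 # grokking needs long horizons
    logits = model(train[:, :2])
    loss = torch.nn.functional.cross_entropy(logits, train[:, 2])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            acc = (model(test[:, :2]).argmax(-1) == test[:, 2]).float().mean()
        print(step, loss.item(), acc.item())
```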
Claim 1: Grokking is a distinct phenomenon from overfitting followed by convergence. Supported. Train accuracy reaches 100% well before test accuracy moves, and the gap persists over orders of magnitude of optimization steps. The absence of a learning-rate-warmup artifact is plausible but not formally controlled.
Claim 2: Weight decay is necessary for grokking. Partially supported. They show that without weight decay, test accuracy remains low. They do not cleanly disentangle weight decay from its interaction with AdamW's second-moment normalization. As Thilak et al. (2022) later showed, Adam induces a different effective regularization than SGD, and the "slingshot" dynamics are Adam-specific. A rigorous version of the claim would require the same experiment with SGD plus explicit regularization, which Power et al. do not fully report.
Claim 3: The phenomenon generalizes across operations. They test several binary operations on $\mathbb{Z}_{97}$ and one non-abelian group ($S_5$). Grokking appears in all of them, with varying delays. This is good evidence that the phenomenon is not specific to addition mod $p$, but it remains within the narrow class of algebraic group operations with strong harmonic structure.
What is missing from the paper? A clean quantitative relationship between weight decay magnitude $\lambda$, learning rate $\eta$, data fraction $\alpha$, and grokking time $t_{\text{grok}}$. The paper presents grokking as a qualitative phenomenon. The subsequent literature (Liu et al. 2022; Varma et al. 2023) has tried to fill this gap with phase diagrams, but the theoretical scaling laws remain conjectural. One specific missing experiment: fix all hyperparameters except $\lambda$, and plot $t_{\text{grok}}(\lambda)$ on log-log axes. The prediction from the lazy-to-rich account is $t_{\text{grok}} \propto 1/\lambda$; from the circuit-efficiency account, a different scaling. Distinguishing these would be the decisive experiment.
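The proposed sweep is cheap to express. In the sketch below, `run_grokking` is a hypothetical harness (train one model at a given weight decay and return the first step at which test accuracy exceeds 95%, or `None` if it never does); only the fitting logic is concrete.

```python
import numpy as np

def sweep_weight_decay(run_grokking, lambdas=(0.03, 0.1, 0.3, 1.0, 3.0)):
    """Estimate gamma in t_grok ~ lambda^(-gamma) from a lambda sweep.

    `run_grokking(weight_decay=...)` is an assumed training harness that
    returns the grokking step, or None if grokking never occurs.
    """
    pts = [(lam, run_grokking(weight_decay=lam)) for lam in lambdas]
    lams, ts = zip(*[(l, t) for l, t in pts if t is not None])
    slope, _ = np.polyfit(np.log(lams), np.log(ts), 1)  # log-log line fit
    return -slope   # gamma ~ 1 would support the saddle-traversal 1/lambda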
Mathematical Insight: Why Weight Decay Matters
Consider the simplest toy model that reproduces grokking: a 2-layer linear network with factorized parameters $W = UV^\top$ trained on a rank-1 target $M$, with weight decay $\lambda$. The loss is

$$\mathcal{L}(U, V) \;=\; \tfrac{1}{2}\,\lVert UV^\top - M\rVert_F^2 \;+\; \tfrac{\lambda}{2}\,\bigl(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\bigr).$$
The set of global minima at $\lambda = 0$ is a continuous manifold: any $(U, V)$ with $UV^\top = M$ works. At $\lambda > 0$, the minimizer is unique (up to orthogonal ambiguity), and it is the *low-norm* factorization, which corresponds to the minimum nuclear norm solution. Gradient flow on this objective has the balancing property $\frac{d}{dt}\,(U^\top U - V^\top V) = -2\lambda\,(U^\top U - V^\top V)$, so any initial imbalance decays and the dynamics approach a balanced manifold. The time to reach the unique $\lambda$-regularized solution from an initialization at a high-norm memorizing point is governed by the spectral gap of the Hessian at the saddle, which scales as $\lambda$ for small $\lambda$; the traversal time therefore scales as $1/\lambda$.
This is the mathematical content of the lazy-to-rich story. The saddle traversal time is the grokking delay. The Fourier circuit in Power et al.'s transformer is the low-norm solution; the memorizing solution is the high-norm attractor; weight decay slowly pushes the dynamics across the saddle. On this reading, grokking is not a phase transition. It is a slow saddle traversal whose *apparent* discontinuity arises from the exponential approach dynamics near the saddle.
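The whole story compresses into a few lines in one dimension. The sketch below is my own toy, not anything from the literature: a scalar factorized model $f = uv$ fit to a target $m = 1$, initialized at a high-norm interpolating point. The training residual stays near zero throughout while weight decay drags the parameters toward the balanced low-norm solution $u = v = 1$, in a number of steps roughly proportional to $1/\lambda$.

```python
import numpy as np

# Scalar factorized model: loss = 0.5*(u*v - m)^2 + (lam/2)*(u^2 + v^2).
# Start at a high-norm interpolant (u=3, v=m/3); the imbalance u^2 - v^2
# decays at rate ~2*lam, so time-to-balance scales like 1/lam.
def steps_to_balance(lam, lr=0.05, tol=1e-2, m=1.0):
    u, v = 3.0, m / 3.0                    # interpolates, but high norm
    for t in range(1_000_000):
        r = u * v - m                      # residual (stays near zero)
        u, v = u - lr * (r * v + lam * u), v - lr * (r * u + lam * v)
        if abs(u - v) < tol:               # balanced <=> low norm here
            return t
    return None

for lam in (0.01, 0.003, 0.001):
    print(lam, steps_to_balance(lam))      # roughly proportional to 1/lam
```

The printed step counts scale inversely with $\lambda$, which is exactly the spectral-gap prediction above: the "grokking delay" of this toy is the time weight decay needs to drain the norm imbalance.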
A sharper claim. If grokking is really saddle traversal, then $t_{\text{grok}}$ should depend on the initialization norm as $t_{\text{grok}} \sim \tfrac{1}{\mu}\,\log\bigl(\lVert\theta_0\rVert / \lVert\theta^{*}\rVert\bigr)$, where $\mu$ is the local curvature and $\lVert\theta^{*}\rVert$ is the saddle-point norm. Kumar et al. (2024) provide evidence consistent with the logarithmic dependence. The logarithmic scaling is the signature of saddle dynamics, distinguishable from polynomial scaling under other accounts. As far as I know, no paper has cleanly verified the log scaling in a full-size transformer setting, and this remains an open empirical question.
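Discriminating the two scalings is a small model-comparison exercise. In this sketch, `alpha` and `t_grok` stand for hypothetical measured (initialization scale, grokking time) pairs; everything else is concrete.

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination of predictions yhat against data y."""
    return 1.0 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def compare_scalings(alpha, t_grok):
    """Fit t ~ a*log(alpha)+b (saddle) vs t ~ C*alpha^c (power law)."""
    alpha, t = np.asarray(alpha, float), np.asarray(t_grok, float)
    a, b = np.polyfit(np.log(alpha), t, 1)
    c, d = np.polyfit(np.log(alpha), np.log(t), 1)
    return {"log-fit R2": r2(t, a * np.log(alpha) + b),
            "power-fit R2": r2(t, np.exp(d) * alpha ** c)}

# Synthetic sanity check: data generated with log scaling is identified.
alpha = np.geomspace(1e-3, 1e-1, 8)
t = 50.0 * np.log(1.0 / alpha) + 100.0
print(compare_scalings(alpha, t))   # log fit ~1.0, power fit visibly worse
```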
Trend Analysis
The grokking subfield has evolved through three roughly distinguishable phases. Phase one (2022) was discovery and initial bewilderment. Phase two (2023) was the mechanistic-interpretability breakthrough: Nanda et al. (2023) and Chughtai et al. (2023) on group representations demystified *what* the network was computing. Phase three (2024 onward) concerns unification: does grokking reduce to more general feature-learning dynamics, and can we predict when it will occur in realistic large-scale training?
The field is clearly moving toward Family 4 (lazy-to-rich) and Family 5 (optimization-geometry) explanations. The phase-transition framing is now a minority view among theoretically oriented researchers, though it persists in popularizations. This trajectory is healthy. The more we reduce grokking to standard optimization dynamics, the more useful it becomes as a probe of implicit regularization in the broader sense.
Simultaneously, there is divergence. Some researchers are extending grokking to hidden progress in realistic training runs: Barak et al. (2022) on parities, and subsequent work on in-context learning dynamics, suggest that grokking-like delayed generalization occurs in settings where the delay is masked by loss averaging or curriculum effects. If true, grokking is not a toy phenomenon. It is a ubiquitous feature of feature learning under sparsity-preferring regularization, and we routinely fail to notice it.
Gap Identification
Several important problems remain unresolved, and together they define the research frontier.
Gap 1: The scale question. Does grokking occur in models orders of magnitude larger than the toy settings in which it has been studied? The answer matters because if grokking is truly about saddle traversal in the feature-learning regime, it should persist at all scales, merely with different time constants. Anecdotal evidence from LLM training suggests there are plateaus in downstream metric curves that look grokking-like, but no controlled study exists.
Gap 2: The task-structure question. Grokking has been demonstrated almost exclusively on tasks with sharp algebraic structure (group operations, sparse parities). Does it occur on natural language, vision, or protein structure prediction? If not, what distinguishes tasks that grok from those that do not? A candidate answer: tasks where the minimum-description-length solution differs substantially from the memorizing solution in norm, which is plausibly a function of task entropy and representational compressibility.
Gap 3: The optimizer-dependence question. Thilak et al. (2022) showed that Adam is not the same as SGD for grokking. What is the right theoretical frame for Adam-induced implicit regularization? Work by Cohen et al. (2021) on the edge of stability is relevant, but a clean theory of how Adam's variance normalization interacts with weight decay to produce grokking dynamics is still missing.
Gap 4: The observability question. If grokking is a representational transition that precedes the behavioral jump, we need better progress measures. Nanda et al.'s Fourier gap works for modular arithmetic because the target representation is known. For realistic tasks, we do not know the target representation in advance. A general-purpose progress measure, perhaps based on representation-compression ratios or effective-rank dynamics, would be transformative.
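One candidate, sketched below under my own assumptions (this is an illustration, not a published grokking diagnostic), is the effective rank of a layer's activation matrix, i.e., the exponential of its spectral entropy; the hypothesis would be that it drops as a compressed, generalizing representation forms, before test accuracy moves.

```python
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """Effective rank of an activation matrix H of shape (n_examples, d).

    Exponential of the entropy of the normalized (squared) singular-value
    spectrum: a variance-weighted variant of Roy & Vetterli's effective
    rank. High for diffuse representations, low for compressed ones.
    """
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    p = s**2 / (s**2).sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
print(effective_rank(rng.normal(size=(500, 128))))     # high: diffuse
low = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 128))
noisy = low + 0.01 * rng.normal(size=(500, 128))
print(effective_rank(noisy))                           # ~5: compressed
```

Unlike the Fourier gap, this requires no knowledge of the target representation, which is precisely what Gap 4 demands; whether it actually predicts grokking onset is an open empirical question.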
Limitations and Failure Modes of the Grokking Framework
Beyond what Power et al. acknowledge, several failure modes deserve attention.
Failure mode 1: The finite-data artifact. Grokking appears only for data fractions $\alpha$ in a narrow band: too little data and the network cannot generalize at all; too much and it generalizes without a plateau. This suggests grokking is parameterized by a specific ratio of training examples to model capacity, not a robust phenomenon. Scaling this ratio to modern LLM training regimes (where $\alpha$ is effectively 1 in a single-epoch sense but effectively small per-token) is nontrivial.
Failure mode 2: The distribution-shift blindness. All grokking experiments are in-distribution. If the network generalizes to all held-out pairs in $\mathbb{Z}_p \times \mathbb{Z}_p$, what does this tell us about out-of-distribution behavior? Very little. The "generalization" in grokking is closer to "interpolation of a finite structure" than to robust generalization in the learning-theoretic sense.
Failure mode 3: The representation-degeneracy risk. Nanda et al.'s Fourier circuit is one of many representations that fit modular arithmetic. Gromov (2023) showed that analytic solutions exist in closed form. A network that partially forms a Fourier circuit but with wrong frequencies will exhibit test accuracy that is positive yet below 100%, and this may look like "partial grokking" in larger systems, making the sharp-transition framing misleading.
Prediction: Where This Goes in Two to Three Years
My prediction is that by 2028, the community will no longer treat grokking as a distinct phenomenon requiring its own theory. Instead, it will be absorbed into a unified account of feature learning under norm regularization, where the lazy-to-rich transition is the parent phenomenon and grokking is the special case with particularly clean dynamics. The theoretical tools for this unification (mirror-descent analyses of weight decay, reparameterization-invariant measures of representation richness, and nonconvex saddle-traversal bounds) already exist in nascent form.
Simultaneously, I expect at least one major empirical study demonstrating grokking-like delayed generalization in frontier-scale pretraining runs, using representation-based progress measures rather than loss curves. This would validate the view that implicit-bias phenomena we observe in toy models are present, though masked, in realistic training. The consequence for practice would be significant. Training curves for frontier models would be read differently, and the "how much longer to train" question would be answered by representation dynamics rather than by loss extrapolation.
A lower bound would tell us what is fundamentally impossible, and that is liberating. If we can prove that some forms of delayed generalization cannot be detected by any polynomial-time loss-based observable, the field will pivot decisively toward representation-based diagnostics. I would bet modestly on this direction.
Novelty Rating and Assessment
Power et al. (2022) is an empirical paper with modest theoretical content. Its novelty, in the field-historical sense, is significant. It is not transformative, because the phenomenon it reports would have been discovered eventually by someone probing the feature-learning transition, but it crystallized the question in a reproducible, concrete form. Its lasting contribution is not the phase-transition framing (which subsequent work has largely superseded) but the clean benchmark it established.
On the four axes:
- Significance of contribution: high, because it catalyzed a productive subfield.
- Technical correctness: adequate for an empirical paper; the claims made are supported by the experiments shown, though the interpretive claims extend beyond the evidence.
- Clarity: good; the paper is short and transparent.
- Novelty vs. engineering refinement: pure discovery, minimal engineering.
Key Questions
1. If grokking is truly a phase transition rather than a slow saddle traversal, what is the order parameter, and can it be measured independently of the test loss? Progress measures from Nanda et al. (2023) suggest the answer may be "no, there is no sharp order parameter," in which case the phase-transition framing should be abandoned.
2. Does grokking time scale as $1/\lambda$ (saddle traversal), as $1/\lambda^2$ (diffusive escape), or logarithmically (critical dynamics)? The experiment is straightforward but, to my knowledge, has not been published in clean form.
3. What is the relationship between grokking and the double-descent time-axis phenomenon? Both involve non-monotonic dependence on a resource (time or capacity). Is there a unified dynamical theory?
4. On tasks without known algebraic structure, can we define a progress measure that predicts grokking onset? Without such a measure, grokking remains a curiosity of group theory, not a tool for understanding realistic training.
5. How does grokking interact with curriculum and batch composition in large-scale training? If frontier models exhibit hidden delayed generalization, changes to data schedules might systematically reveal or suppress it.
Verdict
Power et al. (2022) is a landmark empirical paper whose framing has been partially superseded by subsequent work. As an Area Chair, I would have accepted it at ICLR or NeurIPS without hesitation, despite its theoretical thinness, because the phenomenon it documents is real, reproducible, and deeply interesting. The right assessment today is that grokking is not a phase transition in the rigorous sense, but a slow feature-learning dynamic whose apparent discontinuity is an artifact of loss-based observables. The phenomenon remains a rare, clean empirical handle on the implicit-bias question, and the subfield it has spawned is one of the healthier corners of contemporary deep-learning theory.
The right abstraction, I suspect, is feature-learning-as-saddle-traversal under norm regularization, and the hard part is finding the progress measure that makes the saddle visible without knowing the target in advance. That is the open conjecture worth proving.
Reproducibility and Sources
Primary paper: Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). *Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.* arXiv:2201.02177.
Code repository: The authors released minimal training scripts via OpenAI's research blog; the most widely used community reimplementation is the one accompanying Nanda et al. (2023) for mechanistic analysis. Readers entering the area should consult the most recent GitHub ecosystem rather than rely on the original release.
Datasets: All tasks are synthetic and fully specified by the modulus and the binary operation. Generation is trivial from the paper's description; no external data access is required.
Reproducibility rating:
- Code availability: 3/5 (minimal original code, strong community reimplementations)
- Data availability: 5/5 (fully synthetic)
- Experimental detail: 4/5 (hyperparameters specified; some interaction effects with AdamW implementation details require care)
References Cited
- Barak, B., Edelman, B. L., Goel, S., Kakade, S., Malach, E., & Zhang, C. (2022). Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit.
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS.
- Chizat, L., Oyallon, E., & Bach, F. (2019). On Lazy Training in Differentiable Programming. NeurIPS.
- Chughtai, B., Chan, L., & Nanda, N. (2023). A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations.
- Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., & Talwalkar, A. (2021). Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability.
- Frankle, J., & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.
- Gromov, A. (2023). Grokking modular arithmetic.
- Kearns, M. (1998). Efficient Noise-Tolerant Learning from Statistical Queries. JACM.
- Kumar, T., Bordelon, B., Gershman, S. J., & Pehlevan, C. (2024). Grokking as the transition from lazy to rich training dynamics. ICLR.
- Liu, Z., Kitouni, O., Nolte, N., Michaud, E., Tegmark, M., & Williams, M. (2022). Towards Understanding Grokking: An Effective Theory of Representation Learning. NeurIPS.
- Merrill, W., Tsilivis, N., & Shukla, A. (2023). A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks.
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). Deep Double Descent. ICLR.
- Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR.
- Neyshabur, B., Tomioka, R., & Srebro, N. (2015). In Search of the Real Inductive Bias.
- Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177.
- Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). The Implicit Bias of Gradient Descent on Separable Data. JMLR.
- Thilak, V., Littwin, E., Zhai, S., Saremi, O., Paiss, R., & Susskind, J. (2022). The Slingshot Mechanism.
- Varma, V., Shah, R., Kenton, Z., Kramár, J., & Kumar, R. (2023). Explaining grokking through circuit efficiency.
- Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., & Srebro, N. (2020). Kernel and Rich Regimes in Overparametrized Models.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR.
