Opening: A Reframing That Deserves Scrutiny

Grokking has long been treated as a representation-learning phenomenon. The standard story, descending from [Power et al. 2022] through the mechanistic work of [Nanda et al. 2023] and the circuit-efficiency framework of [Varma et al. 2023], treats the long plateau as evidence that the *right* internal structure has not yet crystallized. The network memorizes; eventually a cleaner representation emerges; then test accuracy snaps into place.

The authors of arXiv:2604.13082 propose a sharper decomposition. The encoder, they claim, already knows. It organizes parity and residue structure within the first few thousand steps. What lags is not the geometry of the hidden space but the decoder's ability to *read* that geometry into correct output tokens. The 2.75x grokking speedup obtained through encoder transplantation is the headline causal evidence. The reframing is elegant, and if correct, it reorients a substantial body of mechanistic interpretability work. It also, I will argue, rests on assumptions the abstract does not fully discharge, and admits an alternative reading the authors may not have considered.

This review proceeds in the critical-commentary tradition. I steelman the decoder-bottleneck hypothesis, locate its weakest inferential link, propose a competing interpretation consistent with the same evidence, and specify what experimental result would falsify each reading. If it holds, the reduction of grokking to a readout problem reveals something fundamental, and that is worth taking seriously on its own terms.

1. Steelman: The Strongest Version of the Decoder-Bottleneck Claim

Let $\phi_t$ denote the encoder at training step $t$, and $\psi_t$ the decoder (including cross-attention and output projection). The paper's central empirical claim can be stated as two inequalities separated in time. For some early $t_1$ and late $t_2$:

$$\mathrm{Acc}_{\text{probe}}(\phi_{t_1}) \approx 1 \gg \mathrm{Acc}_{\text{task}}(\psi_{t_1} \circ \phi_{t_1}), \qquad \mathrm{Acc}_{\text{task}}(\psi_{t_2} \circ \phi_{t_2}) \approx \mathrm{Acc}_{\text{probe}}(\phi_{t_2}) \approx 1.$$

The encoder's representation carries near-sufficient statistics for the target (parity, residue classes modulo small primes $p$) almost immediately. Task accuracy, however, depends on the composed map $\psi_t \circ \phi_t$, and the decoder's projection onto the output simplex lags. On this reading, grokking is not representation acquisition. It is a *readout capacity* problem.

The causal intervention is the load-bearing piece. Transplanting a pretrained $\phi$ into a fresh model with a randomly initialized $\psi$ and observing a 2.75x acceleration demonstrates that the encoder's state is not merely correlated with eventual generalization. It is *sufficient input* to a substantially faster trajectory. This is a cleaner causal argument than the correlational analyses of [Nanda et al. 2023] or the norm-growth diagnostics of [Thilak et al. 2022]. At its strongest, the paper's contribution is a reattribution of a canonical mechanistic puzzle from the encoder to the decoder, backed by an intervention that controls for the joint training dynamic.

For a one-step Collatz predictor, the relevant structure is modular. Parity determines whether $n \mapsto n/2$ or $n \mapsto 3n+1$ applies. Residues modulo small primes govern the downstream arithmetic. If the encoder learns these features early, the task reduces to a low-dimensional classification that a linear probe should solve. The paper's implicit prediction, which a careful reader should extract, is that a linear probe on $\phi_{t_1}$ would achieve near-perfect test accuracy while the full model remains at chance. That is a testable, quantitative consequence.
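This modular structure is easy to verify directly. A minimal sketch (my own illustration, not the paper's code): the branch predicate is exactly the parity bit, and the output's residue mod 3 is a function of the input's residue mod 6, so a probe that recovers parity and small residues has recovered the arithmetic the task needs.

```python
def collatz_step(n: int) -> int:
    # One step of the Collatz map: halve if even, else 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1

# The branch predicate is exactly parity.
assert all((n % 2 == 1) == (collatz_step(n) == 3 * n + 1) for n in range(2, 10_000))

# The output's residue mod 3 is fully determined by the input's residue mod 6:
# each residue class mod 6 maps to a single output residue class mod 3.
out_mod3_by_in_mod6 = {}
for n in range(2, 10_000):
    out_mod3_by_in_mod6.setdefault(n % 6, set()).add(collatz_step(n) % 3)
assert all(len(s) == 1 for s in out_mod3_by_in_mod6.values())
```

Nothing here requires a network; the point is that the feature set the probe targets is genuinely sufficient for the branch and residue structure of the task.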

Contribution classification. This is primarily an empirical finding with a causal-intervention design. It is not a new theoretical result, nor a new algorithm. The transplant procedure is a diagnostic, not a training improvement. I would rate the novelty as moderate-to-significant, conditional on the robustness of the transplant experiment across architectural variations not reported in the abstract.

2. Positioning Against Prior Work

The decoder-as-bottleneck framing intersects several prior threads, and a precise accounting is essential.

  • [Power et al. 2022] established the grokking phenomenon on modular arithmetic but offered no mechanistic decomposition.
  • [Nanda et al. 2023] reverse-engineered the Fourier structure of modular addition in a single-layer transformer, showing that the encoder's *and* unembedding's Fourier bases co-evolve. Their progress measures conflate representation and readout.
  • [Liu et al. 2022] framed grokking as a transition driven by weight decay and representation quality.
  • [Varma et al. 2023] proposed the circuit-efficiency account: a generalizing circuit is more parameter-efficient than a memorizing one, and weight decay drives the transition.
  • [Barak et al. 2022] in *Hidden Progress in Deep Learning* documented that non-generalizing networks accumulate useful structure long before behavioral signals appear. This is arguably the closest conceptual antecedent.
  • [Gromov, 2023] gave an analytical treatment of grokking modular arithmetic, explicitly decomposing it into feature learning and classification-head alignment.

Against this landscape, the new work's contribution is a *causal* rather than correlational separation of encoder and decoder dynamics in an encoder-decoder (seq2seq) setting, rather than in the decoder-only or single-layer architectures that dominate the prior literature. The shift from modular arithmetic to one-step Collatz is also notable. Collatz has no closed-form algebraic generator; the structure is modular, but the output branches on a learned predicate. This raises the bar for the encoder's representation relative to pure modular addition, which makes the observed *early* emergence of parity and residue structure more surprising, not less.

What is genuinely new is the transplantation protocol as a causal probe for mechanistic attribution. What is not new is the observation that representations precede behavior, which [Barak et al. 2022] already argued at the phenomenological level.

3. Three Non-Equivalent Readings of the Bottleneck Claim

Here is where I want to push hard. The claim that the decoder is the bottleneck admits at least three non-equivalent interpretations, and the abstract does not disambiguate them.

Interpretation A (Readout Capacity). The decoder's function class at step $t_1$ cannot realize the correct map from $\phi_{t_1}(x)$ to $y$, even though such a map exists within the decoder's eventual function class. The bottleneck is expressive.

Interpretation B (Optimization Geometry). The decoder *can* represent the correct map, but SGD on the joint loss follows a trajectory that fits the training set via a memorizing route first, and only later finds the generalizing decoder. The bottleneck is optimization-path dependent.

Interpretation C (Coupling). The encoder and decoder are entangled during training; gradients flowing back through $\psi_t$ shape $\phi_t$. A pretrained $\phi$ accelerates grokking not because the decoder was the bottleneck, but because it *removes a destabilizing feedback loop* that joint training induces.

The 2.75x transplant speedup is compatible with all three. Under A, the decoder simply needs to learn against a fixed good representation, which is faster than co-training. Under B, the fixed encoder prunes the memorizing basin from the loss landscape. Under C, the acceleration is about decoupled dynamics, not about where the "real" learning was happening.

This is not a pedantic distinction. The three interpretations predict different follow-up experiments and different practical takeaways. If A is correct, then larger or better-initialized decoders should grok faster without any encoder intervention. If B is correct, then curriculum learning or modified optimizers should suffice. If C is correct, then the key is decoupling, and *any* sufficiently rich frozen encoder, including one trained on a different task, should accelerate grokking.
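The underdetermination can be made concrete with a toy model (my construction, not the paper's setup): a two-layer *linear* student, where interpretation A is ruled out by construction because capacity is never binding, still shows a large transplant-style speedup, purely from optimization geometry near the saddle at small initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, lr = 20, 300, 0.05

# Teacher: y = v . x. Student: y_hat = u . (W x), a two-layer LINEAR net,
# so "readout capacity" can never be the binding constraint.
v = rng.standard_normal(d)
X = rng.standard_normal((500, d))
y = X @ v

def train(W, u, freeze_encoder):
    W, u, losses = W.copy(), u.copy(), []
    for _ in range(n_steps):
        err = X @ W.T @ u - y
        # Gradient steps on squared error (constant factors absorbed into lr).
        u = u - lr * (X @ W.T).T @ err / len(X)            # readout step
        if not freeze_encoder:
            W = W - lr * np.outer(u, X.T @ err / len(X))   # encoder step
        losses.append(float(np.mean((X @ W.T @ u - y) ** 2)))
    return losses

u0 = 1e-5 * rng.standard_normal(d)
W0 = 1e-5 * rng.standard_normal((d, d))  # small init: joint training starts near a saddle
joint = train(W0, u0, freeze_encoder=False)
frozen = train(np.eye(d), u0, freeze_encoder=True)  # "transplanted" good encoder

# Early on, the frozen-good-encoder run is far ahead; both eventually learn.
assert frozen[40] < joint[40]
assert joint[-1] < joint[0]
```

The frozen run is simply convex regression against a fixed representation, while the joint run must first escape the origin's saddle. A transplant speedup therefore arises even where the decoder's expressivity was never in question, which is the B/C worry stated above.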

The abstract's phrase "limited access to already learned structure" leans toward A, but the transplantation experiment alone cannot distinguish among these possibilities. This is the weakest link: the inference from "transplantation accelerates grokking" to "the decoder was the bottleneck" requires an auxiliary assumption that the paper may or may not defend in its body.

4. An Alternative Interpretation the Authors May Not Have Considered

Consider a fourth possibility that reframes the entire finding. In joint training, the encoder is continuously perturbed by gradients from an incompetent decoder. Let $\phi^*$ denote the idealized encoder representation. Define the *useful* component of the encoder as the projection onto the subspace spanned by the residue and parity features, and the *noise* component as everything else. The linear probe measures the useful component. Test accuracy, however, depends on the entire encoder, because the decoder uses all of it.

Under this reading, the encoder *does* contain the right features early, but it also carries a large amount of decoder-gradient-induced noise. The decoder cannot generalize not because it lacks capacity, but because it is attempting to read a noisy linear combination of features from a representation that is still settling. Grokking occurs when the encoder's signal-to-noise ratio crosses a threshold, at which point the decoder's optimization becomes well-posed.

Transplanting a *trained* encoder into a fresh model removes the noise component, because training has already consolidated $\phi$ toward $\phi^*$. The decoder then groks faster not because readout was the bottleneck, but because the effective task it faces is lower-dimensional and better conditioned. In this view, the bottleneck is still representational; it is simply that representation quality is not captured by linear probe accuracy on parity and residue classes.

This alternative is consistent with the theoretical intuition from [Saxe et al. 2019] on linear network dynamics and with the implicit bias literature. It also predicts a specific experiment: compare transplanting a fully trained encoder against transplanting an encoder at step $t_1$ (when probe accuracy is high but training accuracy is still low). If the $t_1$-transplanted encoder does *not* accelerate grokking as much as the fully trained one, the linear probe is not capturing what matters, and the decoder-bottleneck claim weakens substantially.
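The noise story also has a concrete optimization signature. A small numpy sketch (my construction): the same signal feature embedded beside quiet versus loud nuisance coordinates yields second-moment spectra whose largest eigenvalue differs by orders of magnitude, and a gradient-descent readout must shrink its learning rate accordingly, slowing learning of the signal direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2000, 8

s = rng.choice([-1.0, 1.0], size=n)    # the "useful" feature (e.g. parity)
noise = rng.standard_normal((n, k))    # nuisance coordinates

# Same signal coordinate in both; only the nuisance variance differs.
H_settled = np.column_stack([s, 0.5 * noise])   # consolidated encoder
H_noisy   = np.column_stack([s, 10.0 * noise])  # encoder still churned by decoder gradients

def lam_max(H):
    # Largest eigenvalue of the representation's second-moment matrix.
    # GD on a quadratic readout loss is stable only for lr < 2 / lam_max,
    # so the signal direction (eigenvalue ~1 in both cases) is learned
    # roughly lam_max times slower when the nuisance subspace is loud.
    return float(np.linalg.eigvalsh(H.T @ H / len(H)).max())

assert lam_max(H_noisy) > 50 * lam_max(H_settled)
```

On this account, transplantation buys conditioning, not content: the probe sees identical signal in both representations, but only one of them presents the decoder with a well-posed optimization problem.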

5. Technical Analysis: Formal Considerations

Let me formalize the empirical setup. In an encoder-decoder transformer, the composed predictor is $f_t = \psi_t \circ \phi_t$. The paper's linear-probe methodology estimates

$$R_{\text{probe}}(t) = \min_{W,\,b}\ \mathbb{E}_{(x,y)}\big[\ell(W\phi_t(x) + b,\ y)\big],$$

which upper-bounds the Bayes risk achievable from $\phi_t(x)$. The decoder's actual loss is

$$R_{\text{dec}}(t) = \mathbb{E}_{(x,y)}\big[\ell(\psi_t(\phi_t(x)),\ y)\big].$$

The gap between $R_{\text{dec}}(t)$ and $R_{\text{probe}}(t)$ is what the paper attributes to decoder readout. But the decoder in a transformer is not a linear probe: it is an autoregressive stack with self-attention, cross-attention, and an output projection. Its function class is vastly larger than the probe's. A more careful framing would ask whether the decoder's *realized* function at step $t$ approaches the optimal decoder for $\phi_t$, which is a question about optimization, not expressivity.

Complexity consideration. For one-step Collatz on $n$-bit integers, the target function has $O(n)$ circuit complexity in standard Boolean models. The encoder's hidden dimension is presumably well in excess of what the task requires, so expressivity is not the binding constraint for either module. The question is *sample-efficient learnability under SGD*, which is an implicit-bias problem, not a VC-dimension problem.
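To see why the $O(n)$ bound holds: one step is a parity test (one gate), a right shift (pure wiring), or $3n+1$ computed as $(n \ll 1) + n + 1$, a single $n$-bit addition of linear circuit size. A quick check of the shift-and-add identities:

```python
def collatz_step_bitops(n: int) -> int:
    # Shift-and-add form: even -> right shift; odd -> (n << 1) + n + 1 == 3n + 1.
    return n >> 1 if (n & 1) == 0 else (n << 1) + n + 1

# The bit-operation form agrees with the arithmetic definition everywhere.
assert all(
    collatz_step_bitops(n) == (n // 2 if n % 2 == 0 else 3 * n + 1)
    for n in range(1, 1 << 16)
)
```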

Statistical assessment. The 2.75x speedup requires error bars. Grokking times are notoriously seed-sensitive; [Nanda et al. 2023] report substantial seed-to-seed variation in grokking step on modular addition. Without the number of seeds and confidence intervals, the 2.75x figure is a point estimate of unknown reliability. This is a reproducibility flag.

6. Experimental Assessment and Missing Ablations

Based on the abstract alone, the critical experiments I want to see are:

| Experiment | Purpose | Inference it enables |
| --- | --- | --- |
| Transplant encoder at varying $t$ | Separate linear-probe quality from transplant efficacy | Tests alternative interpretation |
| Linear probe accuracy vs. task accuracy curves | Quantify the representation-behavior gap | Calibrates the bottleneck claim |
| Transplant random-init encoder of matched norm | Control for norm effects | Rules out trivial explanations |
| Freeze decoder and retrain encoder | Symmetric counterfactual | Tests whether encoder is *not* the bottleneck |
| Multiple seeds with variance reporting | Statistical reliability | Validates 2.75x figure |
| Cross-task transplant (encoder from a different arithmetic task) | Tests coupling interpretation | Distinguishes A from C |

The symmetric experiment, freezing the decoder and retraining the encoder, is the most conspicuous missing ablation. If the decoder is the bottleneck, freezing it at a late checkpoint and re-initializing the encoder should *not* accelerate grokking. If both directions accelerate, the "bottleneck" framing collapses into a coupling story.

The cross-task transplant is equally important. A pretrained encoder from a different modular task, if it accelerates grokking, would suggest the benefit is about decoupling dynamics rather than task-specific representation content. I would expect the authors to have run this; if they did not, it is a significant gap.

7. Limitations and Failure Modes

Limitation 1: Task specificity. One-step Collatz is a narrow testbed. The residue and parity structure the encoder learns early is *present in the input distribution* without any task supervision. For tasks where the relevant structure is not discoverable from marginal statistics, the encoder may have no early head start. Modular multiplication, lookup-table tasks with random permutations, or compositional tasks without algebraic structure would likely show different dynamics. The claim "representations outrun behavior" may be task-dependent.

Limitation 2: Encoder-decoder architecture. The decomposition relies on a clear encoder/decoder separation, which is natural for seq2seq but less so for the decoder-only transformers that dominate modern LLM work. Extending the claim to decoder-only models requires a layer-wise analogue (early vs. late layers), which introduces confounds the current setup avoids.

Limitation 3: Linear probe as the representation metric. Linear probes detect linearly decodable features. If the encoder encodes the relevant structure non-linearly, the probe will underestimate representation quality at later steps, potentially inverting the claimed order. A non-linear probe ablation is essential.

Concrete failure scenario. Consider a task where the target depends on a product of hidden features, $y = h_1(x)\,h_2(x)$. An encoder may learn $h_1$ and $h_2$ separately, each linearly probeable, while the decoder must learn a bilinear combination that is not linearly probeable from the encoder output. In this regime, linear probe accuracy saturates early while task accuracy requires a genuine decoder capacity upgrade. The decoder-bottleneck claim would be *correct* here, but for reasons the current framework does not articulate.
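This scenario is the XOR configuration in miniature, and it is checkable in a few lines (my illustration; $h_1, h_2$ are hypothetical hidden features): least squares on the factors explains none of the variance of their product, while adding the bilinear feature explains all of it.

```python
import numpy as np

# Four points of the XOR / product-of-features configuration.
h1 = np.array([1.0, 1.0, -1.0, -1.0])
h2 = np.array([1.0, -1.0, 1.0, -1.0])
y = h1 * h2                     # the target is the bilinear combination

def lstsq_mse(features, target):
    # Mean squared error of the best least-squares linear readout (with bias).
    X = np.column_stack(features + [np.ones_like(target)])
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((X @ w - target) ** 2))

assert lstsq_mse([h1, h2], y) > 0.9             # linear readout of the factors: no signal
assert lstsq_mse([h1, h2, h1 * h2], y) < 1e-10  # bilinear feature: exact
```

Each factor is itself perfectly linearly decodable (it is a coordinate of the representation), so per-feature probes would report success while the task-level linear readout carries zero signal.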

8. Questions for the Authors

1. What happens under the symmetric intervention: freezing a trained decoder and re-initializing the encoder? Does grokking still accelerate? If so, the bottleneck framing requires revision.

2. How does the transplant speedup vary with the source encoder's training step $t$? Is there a minimum step $t_{\min}$ below which transplantation provides no benefit, and how does $t_{\min}$ relate to the onset of high linear probe accuracy?

3. Does a *non-linear* probe (e.g. a two-layer MLP) on $\phi_{t_1}$ achieve the same accuracy as the final model? If not, the linear probe is underselling the representation's limitations.

4. Across how many seeds was the 2.75x figure measured, and what is the variance? Grokking timing is heavy-tailed.

5. Does the result extend to tasks where the relevant structure is not recoverable from input marginals alone, for example a random-lookup permutation task? This would test whether the finding concerns *encoder-decoder dynamics* or *input structure*.

9. What Would Change My Mind

  • Validates the paper: A symmetric experiment shows no speedup when the decoder is frozen and the encoder reinitialized, *and* non-linear probes confirm the encoder's representation is task-sufficient at $t_1$, *and* the 2.75x effect holds across seeds with narrow confidence intervals.
  • Falsifies the paper: Cross-task encoder transplantation produces a comparable speedup, indicating that the benefit is about decoupling rather than learned structure. Or: non-linear probes show that representation quality at $t_1$ is substantially below final, contradicting the "already learned" framing.

10. Broader Implications

If the decoder-bottleneck claim is correct in its strong form, the field's grokking research agenda needs recalibration. Mechanistic interpretability work that reads off circuits from the encoder's weight matrices [Nanda et al. 2023; Chughtai et al. 2023] may have been measuring the easy half of the problem. Progress measures based on encoder structure would be insufficient; readout-progress measures would need to be developed. On the practical side, architectural interventions that widen or deepen the decoder, or apply stronger weight decay selectively to the output projection, would become first-order research targets.

If the claim is incorrect, and the alternative interpretation (encoder-decoder coupling as the real effect) holds, the implication is different but equally significant. It would suggest that training dynamics in jointly optimized modules are fundamentally non-separable, and that "which component learns first" is not a well-posed question without specifying the counterfactual decoupling protocol.

Either way, the paper opens a productive line of inquiry. The lower bound on what this study establishes is that causal intervention via component transplantation is a viable methodology for mechanistic attribution, and that is worth something independent of whether the specific bottleneck claim survives scrutiny.

Verdict

As an Area Chair, I would rate this as a borderline accept with major revision requests. The causal intervention is a genuine methodological contribution, and the empirical phenomenon is interesting. The interpretation is underdetermined by the evidence as presented in the abstract, and the missing ablations (symmetric intervention, cross-task transplant, non-linear probes, seed variance) are not optional. With those experiments, this becomes a clear accept. Without them, it risks being the kind of reframing that gets cited widely before anyone notices the inferential gap.

The negative direction is the durable one: transplantation can falsify some accounts of grokking, even if it cannot by itself establish the positive claim the paper makes. That methodological point may outlast the specific conclusion.

Reproducibility & Sources

  • Primary paper: *The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior.* arXiv:2604.13082.
  • Code repository: No official code link found in the abstract. Reproducibility depends on release.
  • Datasets: One-step Collatz prediction, programmatically generated from the Collatz recurrence on integer inputs. The generation procedure must be specified (input range, tokenization, train/test split) for reproducibility.
  • Reproducibility rating:
| Axis | Rating (1-5) | Rationale |
| --- | --- | --- |
| Code availability | 1 | No repository referenced in abstract |
| Data availability | 4 | Task is synthetic and trivially regenerable |
| Experimental detail | 2 | Abstract lacks seed counts, hyperparameters, variance |
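For concreteness, a minimal generation sketch for the dataset described above; the input range, digit tokenization, and split fraction are my assumptions, not specifications from the paper:

```python
import random

def collatz_step(n: int) -> int:
    # One Collatz step: halve if even, else 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1

def make_dataset(lo=2, hi=10_000, test_frac=0.5, seed=0):
    # Tokenize integers as digit sequences; target is the one-step successor.
    pairs = [(list(str(n)), list(str(collatz_step(n)))) for n in range(lo, hi)]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_frac))
    return pairs[:cut], pairs[cut:]

train, test = make_dataset()
assert len(train) + len(test) == 10_000 - 2
```

Any replication should pin down exactly these choices, since the input range and tokenization determine which residue features are even recoverable from marginal statistics.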

Selected prior work referenced

  • [Power et al. 2022] *Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.*
  • [Nanda et al. 2023] *Progress Measures for Grokking via Mechanistic Interpretability.*
  • [Barak et al. 2022] *Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit.*
  • [Varma et al. 2023] *Explaining Grokking Through Circuit Efficiency.*
  • [Liu et al. 2022] *Towards Understanding Grokking: An Effective Theory of Representation Learning.*
  • [Thilak et al. 2022] *The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon.*
  • [Gromov, 2023] *Grokking Modular Arithmetic.*
  • [Saxe et al. 2019] *A Mathematical Theory of Semantic Development in Deep Neural Networks.*