Abstract
The common framing is that attention is the only game in town for long-range dependencies, and that state-space models (SSMs) and linear attention variants are strictly a compression tradeoff against it. *Titans* [Behrouz et al. 2025, arXiv:2501.00663] challenges that framing by proposing a neural long-term memory module whose parameters are updated *at inference time* via a surprise-driven gradient step on an associative-memory objective. The headline claim is a model family that scales beyond 2M context with favorable accuracy and throughput on needle-in-haystack (NIAH) and language modeling benchmarks. In this review I dissect the central technical object: the surprise-based update rule. I argue that the rule is mathematically a momentum-SGD step on a local reconstruction loss, that this reframing clarifies both why it works and where it will fail, and that the experimental protocol, while respectable, leaves several load-bearing claims undersupported. Reproducibility is moderate: the paper is detailed on the update rule but light on the hyperparameter schedules that almost certainly matter.
1. The Claim, Stated Formally
Let $x_1, \dots, x_T$ denote a stream of tokens, and let $k_t = W_K x_t$, $v_t = W_V x_t$ be learned key and value projections. A *neural memory* is a parametric map $\mathcal{M}_\theta$ (in Titans, a small MLP) whose parameters $\theta_t$ evolve during inference. The associative-memory objective at step $t$ is

$$\ell_t(\theta) = \big\| \mathcal{M}_\theta(k_t) - v_t \big\|_2^2.$$

The authors define *surprise* as the instantaneous gradient $\nabla_\theta \ell_t(\theta_{t-1})$ and update

$$S_t = \eta_t\, S_{t-1} - \lambda_t\, \nabla_\theta \ell_t(\theta_{t-1}), \qquad \theta_t = (1 - \alpha_t)\, \theta_{t-1} + S_t,$$

where $\eta_t, \lambda_t, \alpha_t$ are *data-dependent* gates produced by small linear maps from $x_t$. Retrieval at query $q_t = W_Q x_t$ returns $\mathcal{M}_{\theta_t}(q_t)$, which is then fused with self-attention via one of three variants: Memory-as-Context (MAC), Memory-as-Gating (MAG), or Memory-as-Layer (MAL).
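Concretely, the update is easy to state in code. The sketch below substitutes a linear memory matrix for the paper's small MLP (the rule is identical in form) and plain NumPy for any real kernel; all function and variable names here are mine, not the paper's:

```python
import numpy as np

def memory_step(M, S, k, v, eta, lr, alpha):
    """One surprise-driven update of a linear associative memory.

    M    -- memory parameters (d_v x d_k matrix; Titans uses a small MLP)
    S    -- momentum ("surprise") buffer, same shape as M
    k, v -- key and value vectors for the current token
    eta, lr, alpha -- data-dependent momentum, learning rate, weight decay
    """
    # Gradient of the reconstruction loss ||M k - v||^2 with respect to M.
    err = M @ k - v
    grad = 2.0 * np.outer(err, k)
    # Momentum-accumulated (negative) surprise, then gated decay of old state.
    S = eta * S - lr * grad
    M = (1.0 - alpha) * M + S
    return M, S

# Toy usage: repeatedly presenting one (k, v) pair drives M @ k toward v.
rng = np.random.default_rng(0)
d = 8
k = rng.normal(size=d)
k /= np.linalg.norm(k)          # unit-norm key gives a clean contraction
v = rng.normal(size=d)
M, S = np.zeros((d, d)), np.zeros((d, d))
for _ in range(200):
    M, S = memory_step(M, S, k, v, eta=0.0, lr=0.1, alpha=0.0)
retrieval_error = np.linalg.norm(M @ k - v)
```

With the momentum and decay gates zeroed, repeated presentation of a single pair contracts the retrieval error geometrically, which is exactly what the associative objective asks for.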
That is the entire novelty, stripped of adornment. The remainder (architectural variants, chunked parallelization, persistent-memory tokens) is engineering scaffolding around this core object. My dissection will focus on that core.
2. Derivation Walkthrough: What Is This Update, Really?
Strip away the notation and the update is a momentum-SGD step on a per-token associative-memory loss, with a data-dependent weight-decay term. $S_t$ is the velocity buffer, $\eta_t$ is a per-token momentum coefficient, $\lambda_t$ (written $\theta_t$ by the authors, confusingly reusing the usual parameter symbol) is a per-token learning rate, and $\alpha_t$ is a per-token weight decay. Set $\eta_t \equiv \eta$ and $\alpha_t \equiv 0$ and you recover vanilla Polyak-momentum SGD on $\ell_t$. Additionally set $\eta = 0$ and you recover the *fast-weights* update of [Ba et al. 2016] and its linear-attention reformulation by [Schlag et al. 2021]. The contribution, then, is not a new class of update but a new *parameterization* of an already-known family: the gates are predicted from the input, and the inner module is an MLP rather than a bilinear form.
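The reduction is easy to verify numerically for a linear memory: with momentum and weight decay switched off, one step of the gated update on the quadratic loss coincides exactly with a delta-rule write at twice the learning rate. A sketch (my notation, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
M = rng.normal(size=(d, d))   # current memory state
k = rng.normal(size=d)
v = rng.normal(size=d)
lam = 0.05                    # per-token learning-rate gate

# Gated update with momentum and weight decay switched off:
grad = 2.0 * np.outer(M @ k - v, k)        # gradient of ||M k - v||^2
M_sgd = M - lam * grad

# Delta rule (Schlag et al. 2021) with write strength beta = 2 * lam:
beta = 2.0 * lam
M_delta = M + beta * np.outer(v - M @ k, k)

max_diff = np.abs(M_sgd - M_delta).max()   # identical up to float rounding
```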
The critical assumptions enter at three points. First, the quadratic reconstruction loss is assumed to be the right surrogate for "useful to remember." This is a strong prior. It asserts that what the model should store is whatever most reduces the immediate key-value reconstruction error, a local, token-wise notion of surprise. If the relevant information is *globally* non-obvious but *locally* predictable, the gradient will be small and the token will not be memorized. That failure mode is not hypothetical (see §5).
Second, the data-dependent gates are assumed to be learnable end-to-end via backprop-through-the-online-update. This requires differentiating through a sequence of gradient steps, which the authors handle by the now-standard chunked-parallel trick familiar from GLA and DeltaNet [Yang et al. 2024]. The chunk size $C$ becomes a silent hyperparameter that trades memory fidelity (small $C$, many small steps) against parallel throughput (large $C$, fewer larger steps). The paper's ablations on $C$ are limited.
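To make the fidelity side of that trade-off concrete, here is a toy version of the chunked approximation, assuming (as in GLA-style implementations) that gradients inside a chunk are taken at the parameters frozen at the chunk boundary; the exact kernel in the paper may differ in detail:

```python
import numpy as np

def run_memory(keys, vals, chunk_size, lr=0.1):
    """Online linear memory where gradients within a chunk are computed at
    the parameters frozen at the chunk start (the chunked-parallel
    approximation). chunk_size=1 recovers fully sequential SGD."""
    d = keys.shape[1]
    M = np.zeros((d, d))
    for start in range(0, len(keys), chunk_size):
        M_frozen = M.copy()   # within-chunk gradients use stale parameters
        for t in range(start, min(start + chunk_size, len(keys))):
            grad = 2.0 * np.outer(M_frozen @ keys[t] - vals[t], keys[t])
            M = M - lr * grad
    return M

rng = np.random.default_rng(2)
d, T = 16, 64
keys = rng.normal(size=(T, d)) / np.sqrt(d)   # keys with roughly unit norm
vals = rng.normal(size=(T, d))

def recall_error(M):
    return np.linalg.norm(keys @ M.T - vals, axis=1).mean()

err_seq = recall_error(run_memory(keys, vals, chunk_size=1))
err_chunk = recall_error(run_memory(keys, vals, chunk_size=16))
```

With chunk size 1 the loop is exactly sequential SGD; with chunk size 16 the within-chunk gradients are stale, so the resulting memory differs, which is precisely why the chunk size is a real hyperparameter and not an implementation detail.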
Third, the MLP memory is assumed to have enough capacity to absorb useful associations without catastrophic interference. The authors use a 2-layer MLP. That is a design choice with no theoretical grounding in the paper, and the sensitivity to depth and width is reported only in passing.
What breaks if we relax these? Relax the quadratic loss and you get a different notion of surprise; arguably a cross-entropy loss in a discrete-code memory would better match downstream language-modeling objectives, yet the authors do not run this comparison. Relax the data-dependent gating and you get a plain online-learning memory in the style of [Schlag et al. 2021], which the paper's own ablations show is substantially worse; good, the gating earns its keep. Relax the MLP, reverting to a linear outer-product memory, and you lose the non-linear retrieval the authors credit with most of the NIAH gains; this is the one ablation that is done cleanly.
3. Comparison to Alternatives
The relevant comparison set is *not* vanilla Transformers. It is the family of sub-quadratic sequence models that have proliferated since 2023.
| Model | Memory state | Update rule | Per-token cost | Recall capacity |
|---|---|---|---|---|
| Transformer [Vaswani et al. 2017] | KV cache, $O(T)$ | Append | $O(T)$ | Exact |
| Linear Attention [Katharopoulos et al. 2020] | Matrix, $d \times d$ | Outer-product add | $O(d^2)$ | Lossy, linear |
| RetNet [Sun et al. 2023] | Matrix + decay | Exp-decayed sum | $O(d^2)$ | Lossy, exponential fade |
| Mamba / S6 [Gu & Dao, 2023] | SSM hidden state | Selective scan | $O(d \cdot n)$ | Lossy, input-selective |
| DeltaNet / GLA [Yang et al. 2024] | Matrix + delta | Rank-1 correction | $O(d^2)$ | Lossy, overwriting |
| Titans memory | MLP params $\theta_t$ | Momentum-SGD on $\ell_t$ | $O(P)$ per step | Lossy, non-linear |
The substantive question: why prefer an MLP with SGD updates over a matrix with delta-rule updates? The authors' answer is that non-linear memory can store more associations before interference, echoing the classical Hopfield-network capacity results and the modern Hopfield reformulation of [Ramsauer et al. 2021]. That is plausible, but the paper does not actually measure capacity in a controlled way. A clean experiment would fix the parameter budget across Titans, DeltaNet, and a modern Hopfield layer, sweep the number of stored key-value pairs, and report the recall curve. That experiment is missing.
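At toy scale the missing experiment is cheap to prototype. The sketch below implements only the outer-product/delta-rule arm; the Titans MLP and a modern Hopfield layer would slot in as alternative write/read procedures at matched parameter count. Everything here is my construction, not the paper's protocol:

```python
import numpy as np

def recall_curve(d=32, max_pairs=200, step=20, seed=0):
    """Write N random key-value pairs into a d x d outer-product memory
    with one delta-rule pass, then report nearest-value recall vs N."""
    rng = np.random.default_rng(seed)
    curve = []
    for n in range(step, max_pairs + 1, step):
        keys = rng.normal(size=(n, d))
        keys /= np.linalg.norm(keys, axis=1, keepdims=True)
        vals = rng.normal(size=(n, d))
        M = np.zeros((d, d))
        for k, v in zip(keys, vals):           # sequential delta-rule writes
            M += np.outer(v - M @ k, k)
        retrieved = keys @ M.T                 # retrieval for every stored key
        # A pair counts as recalled if its retrieval lands nearest its own value.
        dists = np.linalg.norm(retrieved[:, None, :] - vals[None, :, :], axis=-1)
        recall = float(np.mean(dists.argmin(axis=1) == np.arange(n)))
        curve.append((n, recall))
    return curve

curve = recall_curve()
```

Recall should degrade as the number of stored pairs grows past the memory's effective capacity; plotting this curve per method, at a fixed parameter budget, is the comparison the paper omits.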
Computationally, Titans' memory step costs $O(P)$ per token, where $P$ is the MLP parameter count, but crucially it requires *backprop through the MLP* at each token to compute $\nabla_\theta \ell_t$. For a 2-layer MLP of width $d$, that is $O(d^2)$ per token, with an additional constant factor from momentum buffering. This is heavier than DeltaNet's rank-1 update. The paper reports throughput numbers favorable to Titans, but the comparison baselines are not always matched on parameter count or on kernel-level optimization effort, a familiar issue in this literature.
4. Experimental Validation Assessment
Here is where I want to slow down. The devil is in the evaluation protocol.
Language modeling. Titans is compared to Transformer++, Mamba, Mamba-2, DeltaNet, and a few others on standard perplexity and zero-shot downstream tasks at 340M, 400M, and 760M parameter scales. The reported perplexity gaps fall in the range of 0.1 to 0.5 points. At this scale, with single-seed runs (which I believe is the case based on the paper's reporting style, though it should be stated explicitly), a 0.1 to 0.5 perplexity gap is *within run-to-run noise* in my experience reproducing similar setups. I would want to see at least three seeds and a paired bootstrap confidence interval before calling any of these differences real.
Consider the error bars. The paper does not, to my reading, provide them on the main perplexity tables. That is not disqualifying; most papers in this lineage do not either. But it means the claim "Titans outperforms Mamba-2" at these scales should be read as "the point estimate is better on this run," not as a verified effect.
Needle-in-a-haystack. The NIAH and BABILong results are more striking. Titans reportedly maintains high retrieval accuracy at 2M context on single-needle and multi-needle variants, where attention-free baselines degrade. This is the cleanest evidence for the method. But NIAH is an odd benchmark: it is a test of *lossless* retrieval of an injected string, which is precisely what the associative-memory objective optimizes directly. The alignment between benchmark and training objective is suspicious. A fairer test would use distributional reasoning tasks, for instance, RULER [Hsieh et al. 2024] variants that require aggregation over retrieved content rather than mere retrieval of a span.
BABILong. Gains here are more convincing because the task genuinely requires multi-hop reasoning over long context. The reported improvements over RMT and Mamba variants are substantial. However, the baseline hyperparameters, particularly for RMT, are inherited from prior work rather than re-tuned for these context lengths. That the baseline was not properly tuned is a hypothesis I cannot rule out without running it myself.
Ablations. The paper ablates: (a) linear vs. MLP memory, (b) with and without momentum, (c) with and without weight decay, (d) the three architectural variants MAC, MAG, and MAL. These are the right ablations. Missing: sensitivity to chunk size $C$, sensitivity to MLP depth and width, and sensitivity to the scheduling of the gates $\eta_t, \lambda_t, \alpha_t$: are the learned gates actually doing something nontrivial, or collapsing to near-constant values? Without the gate-statistics analysis, we cannot distinguish "data-dependent gating is critical" from "any reasonable fixed momentum would work."
5. Failure Mode Analysis
Four failure modes seem predictable from the formulation.
Failure mode 1: low-surprise-but-high-utility tokens. The update magnitude scales with $\|\nabla_\theta \ell_t\|$, which scales with the reconstruction error $\|\mathcal{M}_{\theta_{t-1}}(k_t) - v_t\|$. If a token's key-value association is already well-predicted by the current memory, say, a frequently seen entity name, the model will *not* update on it, even if that token is the answer to a downstream query. The surprise criterion is a proxy for information gain under a reconstruction loss, not for downstream utility. A concrete adversarial construction: embed a critical fact as a statistically typical continuation in the prefix, then query on it later. I would expect Titans to underperform a straight attention model on this.
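The mechanism is visible in a few lines of arithmetic: once an association is stored, its gradient, and hence its surprise, is numerically zero, no matter how important the token is downstream. A minimal sketch with a linear memory (my construction):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
k = rng.normal(size=d)
k /= np.linalg.norm(k)
v = rng.normal(size=d)

M = np.zeros((d, d))
M += np.outer(v - M @ k, k)    # the association (k, v) is already stored

# Surprise (gradient magnitude) for the familiar pair vs a novel one:
grad_familiar = 2.0 * np.outer(M @ k - v, k)
k2 = rng.normal(size=d)
k2 /= np.linalg.norm(k2)
v2 = rng.normal(size=d)
grad_novel = 2.0 * np.outer(M @ k2 - v2, k2)

surprise_familiar = np.linalg.norm(grad_familiar)   # ~0: no update happens
surprise_novel = np.linalg.norm(grad_novel)         # large: gets memorized
```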
Failure mode 2: gradient interference in the MLP. Per-token SGD on a small MLP is known to suffer from catastrophic interference when the input distribution is non-stationary [McCloskey & Cohen, 1989, and the large subsequent literature on continual learning]. Over a 2M-token context, the MLP is being updated 2M times with per-token gradients. Unless the effective learning rate schedule decays sufficiently, later tokens will overwrite earlier associations. The weight-decay term is meant to mitigate this, yet it also *actively erases* old memories. The paper does not report memory-retention curves as a function of distance; I would want a plot of recall accuracy versus needle-to-query distance at 2M context, with error bars over needle positions.
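A toy version of the retention curve I am asking for, using a linear memory with per-token SGD and constant weight decay (the constants are mine and arbitrary; the paper's learned gates would modulate them):

```python
import numpy as np

def retention_profile(T=400, d=32, lr=0.5, alpha=0.01, seed=4):
    """Write T pairs sequentially with SGD plus constant weight decay,
    then measure per-pair retrieval error after all writes are done."""
    rng = np.random.default_rng(seed)
    keys = rng.normal(size=(T, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    vals = rng.normal(size=(T, d))
    M = np.zeros((d, d))
    for k, v in zip(keys, vals):
        grad = 2.0 * np.outer(M @ k - v, k)
        M = (1.0 - alpha) * M - lr * grad      # decay the old state, then write
    # errors[0] is the oldest association, errors[-1] the newest.
    return np.linalg.norm(keys @ M.T - vals, axis=1)

errors = retention_profile()
old_err = errors[:50].mean()    # first-written 50 pairs
new_err = errors[-50:].mean()   # last-written 50 pairs
```

In this toy, the oldest associations are both decayed and overwritten, so their retrieval error is markedly higher than the newest ones; whether Titans' learned gates flatten this curve at 2M tokens is exactly what the paper does not show.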
Failure mode 3: distribution shift between train and test context length. The test-time update rule is trained end-to-end on training sequences of some maximum length $L_{\text{train}}$. At inference on 2M-token sequences, the gate networks are being asked to produce learning rates and decay rates for a regime they were never trained on. Extrapolation of gating behavior is not free. The authors claim length generalization, but the evidence is largely on NIAH, which is a narrow test.
Failure mode 4: adversarial prompts that exploit the update rule. Because the memory is updated by gradient descent on attacker-controllable tokens, a carefully crafted prefix can in principle drive $\theta_t$ to a target state. This is a novel attack surface, analogous to but distinct from prompt injection. The paper does not discuss it. I flag it as an open safety question rather than a claimed result.
6. Related Work: Positioning the Contribution
Titans sits at the intersection of three lineages.
The fast-weights lineage begins with [Hinton & Plaut, 1987] and [Ba et al. 2016], was reformulated as linear attention by [Schlag et al. 2021] ("Linear Transformers Are Secretly Fast Weight Programmers"), and extended with the delta rule by [Yang et al. 2024]. Titans generalizes this by replacing the bilinear fast-weight matrix with a non-linear MLP and the delta rule with momentum-SGD.
The test-time training lineage [Sun et al. 2020; Gandelsman et al. 2022] proposes updating model parameters on test inputs via self-supervised objectives. Titans adopts this framing explicitly but applies it at token-level granularity rather than at the example level, and with a memory-specific rather than task-general objective.
The modern-SSM lineage [Gu & Dao, 2023; Dao & Gu, 2024] provides selective, input-dependent state updates with efficient hardware-aware implementations. Titans' data-dependent gates stand in direct analogy to Mamba's parameterization, and the chunked parallel computation inherits from this lineage.
The novel synthesis, as I read it, is this: *test-time-trained fast weights with input-dependent optimizer hyperparameters, inside a hybrid attention architecture*. Each of those elements existed separately. The combination is of moderate novelty, not transformative.
Novelty rating: moderate. The update rule is a principled reparameterization of known ideas; the empirical engineering required to make it work at 2M context is non-trivial; the theoretical contribution is limited.
7. Broader Impact
If the method replicates, the practical implication is that long-context inference becomes cheaper at accuracy parity, because the memory cost is $O(P)$, constant in context length, rather than the $O(T)$ growth of an attention KV cache. For deployment contexts where long documents must be processed repeatedly (legal corpora, scientific literature, code repositories), this matters. The ethical considerations are two-fold: first, the adversarial-prompt failure mode (§5, mode 4) introduces a new vector that deserves safety evaluation before production use; second, cheaper long-context models accelerate the deployment of pervasive conversational memory, with the familiar privacy implications.
8. Open Technical Questions
Five questions I would want answered before citing this result as established.
1. Are the perplexity gaps real? Three-seed runs with confidence intervals at 340M and 760M on the standard subset of The Pile or SlimPajama.
2. Is the data-dependent gating actually data-dependent? Report the empirical distribution of the gates $\eta_t, \lambda_t, \alpha_t$ on held-out text. If they collapse to narrow bands, the gating is decorative.
3. Does memory retention degrade with distance at 2M context? Needle-recall accuracy as a function of needle position, with error bars over positions and seeds.
4. Does the MLP memory beat a modern Hopfield layer [Ramsauer et al. 2021] at matched parameter count? This is the cleanest capacity comparison.
5. Is the method robust to adversarial prefixes designed to drive $\theta_t$? A red-team evaluation specific to the update rule.
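Question 4 can at least be prototyped at toy scale: a modern Hopfield read is just softmax attention over explicitly stored patterns, which at matched parameter budget ($n$ stored pairs of dimension $d$ against a $d \times d$ matrix) can be compared directly to a delta-rule memory. A sketch (my construction, not a result from the paper):

```python
import numpy as np

def hopfield_read(stored_keys, stored_vals, query, beta=16.0):
    """One-step modern-Hopfield retrieval: softmax attention over
    explicitly stored patterns (Ramsauer et al. 2021 style)."""
    logits = beta * stored_keys @ query
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ stored_vals

rng = np.random.default_rng(5)
d = 32
n = d // 2            # n pairs of two d-vectors = d^2 params, matching M below
keys = rng.normal(size=(n, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
vals = rng.normal(size=(n, d))

# Delta-rule linear memory at the same parameter budget (a d x d matrix).
M = np.zeros((d, d))
for k, v in zip(keys, vals):
    M += np.outer(v - M @ k, k)

q = keys[0]   # query with the first (oldest, most-interfered) stored key
err_hopfield = np.linalg.norm(hopfield_read(keys, vals, q) - vals[0])
err_linear = np.linalg.norm(M @ q - vals[0])
```

This is only a single-query sanity check; the real experiment would sweep $n$ and report full recall curves per method.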
Reproducibility is not optional; it is the minimum. None of these questions require new methodology, only that the experiments be run properly.
Assessment
Titans is a well-executed exploration of a moderately novel architectural idea: turning the key-value memory of a transformer into a small neural network that performs test-time gradient descent on an associative loss. The paper is clear about the update rule and thoughtful about architectural integration. The empirical claims on long-context retrieval are suggestive but protocol-dependent; the language-modeling claims at sub-1B scale fall within the noise range I would expect without multi-seed validation. The core theoretical object, momentum-SGD with input-dependent hyperparameters on a per-token reconstruction loss, is a productive reframing that invites direct comparison with prior fast-weights work. I would recommend acceptance at a top venue conditional on the authors reporting seed variance and adding the capacity-matched comparison against Hopfield and DeltaNet variants.
Reproducibility & Sources
Primary paper. Behrouz, A., Zhong, P., & Mirrokni, V. (2025). *Titans: Learning to Memorize at Test Time*. arXiv:2501.00663.
Code repository. At the time of review, no official code repository is referenced in the paper body. Community reimplementations exist but have not been verified for numerical parity. Treat any unofficial implementation as a starting point, not as ground truth.
Datasets. Standard language-modeling corpora (The Pile, SlimPajama-style mixtures) for pretraining; NIAH and BABILong [Kuratov et al. 2024] for long-context evaluation. All are publicly accessible through standard HuggingFace mirrors; no proprietary data is claimed.
Reproducibility assessment (1–5 scale):
- Code availability: 2/5. No verified official release at review time; reimplementation requires non-trivial work on the chunked-parallel online update.
- Data availability: 5/5. All benchmarks and pretraining corpora are public and standard.
- Experimental detail sufficiency: 3/5. The update rule is specified clearly; hyperparameter schedules, chunk sizes, MLP dimensions, and seed counts are under-reported. A careful reimplementer will encounter two or three meaningful ambiguities.
Overall: moderate reproducibility. The ideas are specified; the experimental numbers are not yet reproducible without guesswork.
