The central claim of Dao and Gu's recent work (arXiv:2405.21060) is that a restricted class of state-space models and a restricted class of attention mechanisms are not merely analogous but are two factorizations of the same underlying object, which they term the Structured State Space Duality (SSD). From this algebraic correspondence they derive Mamba-2, a refined selective SSM whose core operator admits both a linear-time recurrent form and a quadratic-time matrix form. The empirical headline is that Mamba-2 matches or exceeds Transformer quality at 2.7B parameters while training 2 to 8 times faster than its predecessor Mamba.
Let us be precise about the claim. The duality is not an assertion that every Transformer is an SSM, or that every SSM is a Transformer. It is the statement that a semiseparable matrix decomposition sits on both sides of a bridge: on the SSM side, the recurrence $h_t = A_t h_{t-1} + B_t x_t$, with scalar-times-identity $A_t = a_t I$, yields a matrix transformation $y = Mx$ in which $M$ is lower-triangular and 1-semiseparable; on the attention side, masked linear attention $y = (L \circ QK^\top)v$ with a causal mask $L$ yields the same class of matrix, up to the choice of $L$. This is an elegant result, and the correct frame in which to read it is Mahoney and Drineas's line of work on structured matrix factorizations applied to sequence models, not the journalistic frame of 'attention equals SSM.'
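The identity is small enough to verify numerically. The sketch below is my own construction, not the paper's code: it runs the scalar-decay recurrence, then materializes the equivalent quadratic "attention" form, where the mask entries $L_{ts} = a_{s+1}\cdots a_t$ are cumulative decay products, which is exactly the 1-semiseparable structure.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                        # sequence length, state size
a = rng.uniform(0.5, 1.0, size=T)  # scalar decay a_t (A_t = a_t * I)
B = rng.standard_normal((T, N))    # input maps (play the role of keys)
C = rng.standard_normal((T, N))    # output maps (play the role of queries)
x = rng.standard_normal(T)         # one input channel

# SSM side: linear recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Attention side: materialize M = L o (C B^T), where the causal mask
# L[t, s] = a_{s+1} * ... * a_t (for s <= t) is lower-triangular 1-semiseparable.
P = np.cumprod(a)                            # P[t] = a_0 * ... * a_t
L_mask = np.tril(P[:, None] / P[None, :])    # L[t, s] = P[t] / P[s], s <= t
y_mat = (L_mask * (C @ B.T)) @ x             # quadratic masked-matmul form

assert np.allclose(y_rec, y_mat)             # both factorizations agree
```

The assertion passing for random inputs is the duality in miniature: the same operator computed once as a recurrence and once as a masked matrix multiply.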
Contribution classification: primarily (b) a new algorithm (Mamba-2) derived from (a) a theoretical observation (the SSD equivalence), supported by (c) empirical findings at moderate scale. I rate the theoretical novelty moderate, the algorithmic novelty moderate to significant, and the empirical contribution moderate. The paper is best understood not as proving Transformers are SSMs, but as identifying the narrow intersection at which they coincide and exploiting that intersection to recover hardware efficiency lost in the original Mamba (Gu and Dao, 2023).
1. Claims vs. Evidence Map
The paper makes, by my count, five load-bearing claims. I audit each.
Claim 1: SSMs with scalar-times-identity structure on $A_t$ are mathematically equivalent to a restricted form of linear attention. The evidence is a constructive proof via semiseparable matrix representation. This is a mathematical identity rather than an empirical claim, and the derivation is tight. The restriction, however, is severe. Standard Mamba (Gu and Dao, 2023) uses a diagonal $A_t$ with distinct entries per channel; collapsing to scalar-times-identity is a real loss of expressivity that the paper treats as a design choice rather than a theoretical limitation. This is the key concession enabling the duality, and readers should not overlook it.
Claim 2: Mamba-2 matches Transformer++ at matched parameter count on language modeling. Evidence: perplexity and downstream zero-shot results at 125M, 350M, 1.3B, and 2.7B parameters. The comparison is to a strong Transformer recipe with RoPE, SwiGLU, and RMSNorm. The evidence is moderate. Perplexity curves are reported, but I did not find per-seed variance, and at 2.7B the gap to Transformer++ on several downstream tasks sits within what one would expect from tokenizer and data-curation noise.
Claim 3: Mamba-2 trains 2 to 8 times faster than Mamba. Evidence: wall-clock throughput on A100 hardware. This is the most methodologically clean claim, because it is a hardware measurement and the mechanism (matrix multiplication units versus scan primitives) is unambiguous. I rate the evidence strong.
Claim 4: The SSD form scales to larger state sizes without proportional compute cost. Evidence: ablation over state dimension $N$. The evidence is moderate; the scan-based Mamba-1 becomes memory-bound at large $N$, and matrix-multiply SSD sidesteps this. Whether the quality gains from larger $N$ persist at 7B or 70B remains unestablished.
Claim 5: Hybrid architectures combining SSD blocks with a small fraction of attention layers outperform pure SSD or pure attention. Evidence: a handful of hybrid configurations, primarily on associative recall tasks from the MQAR benchmark of Arora et al. (2024). The evidence here is weak to moderate: the hybrids are tuned, the baselines are not uniformly tuned, and the claim that 'about 10% attention is enough' is presented without a principled sweep.
2. Baseline Audit
Baselines are the place at which SSM papers have historically been weakest, and this one is better than most but still incomplete.
| Baseline family | Included? | Fairly tuned? | Notes |
|---|---|---|---|
| Transformer++ (RoPE, SwiGLU, RMSNorm) | Yes | Plausibly | Matches modern recipes; LR and batch size reported |
| Mamba (Gu and Dao, 2023) | Yes | Yes | Same codebase, strong apples-to-apples |
| RetNet (Sun et al. 2023) | Partial | Unclear | Not in headline scaling tables |
| RWKV-v5/v6 (Peng et al. 2023) | No | N/A | Notable omission; RWKV-v6 is a close competitor |
| Hyena (Poli et al. 2023) | No | N/A | Absent from main results |
| GLA / Gated Linear Attention (Yang et al. 2024) | Partial | Unclear | Closely related; deserved head-to-head |
| xLSTM (Beck et al. 2024) | No | N/A | Contemporaneous; understandable omission |
The missing baseline that matters most is Gated Linear Attention (Yang et al. 2024). GLA is algebraically very close to SSD, uses a similar matrix-form chunked kernel, and was published essentially in parallel. A head-to-head comparison at matched training recipe would clarify how much of the Mamba-2 gain comes from SSD specifically, as opposed to the broader class of hardware-aware linear attention variants. The closest comparison derivable from the paper's tables is only indirect.
A fairer comparison would also include Based (Arora et al. 2024), whose Taylor-approximation linear attention attacks the same efficiency frontier. The authors include MQAR from the same group as an evaluation but not Based as a training baseline, which is an odd asymmetry.
3. Ablation Completeness
The paper's ablations cover (i) state dimension $N$, (ii) head structure, (iii) the role of convolutional short paths, and (iv) the fraction and placement of attention layers in hybrids. Missing ablations include the following.
The scalar-$A_t$ restriction itself. The whole SSD derivation hinges on collapsing the per-channel diagonal $A_t$ of Mamba-1 to a scalar. What does Mamba-1 look like if one applies the same collapse? Is the quality loss from scalar $A_t$ negligible, moderate, or severe? Without this ablation, one cannot tell whether SSD is paying a quality tax for its speed gain, or whether the tax is zero.
Chunk size $Q$. The SSD kernel chunks sequences into blocks of size $Q$ and mixes a quadratic intra-chunk computation with a linear inter-chunk recurrence. The choice of $Q$ trades memory against arithmetic. A sweep of $Q$ at fixed model size would expose the throughput/quality Pareto frontier. I did not locate such a sweep.
Position encoding. Transformer++ uses RoPE. Mamba-2 uses no explicit positional encoding because the recurrence is intrinsically causal. A controlled ablation that strips RoPE from the Transformer baseline, or adds ALiBi to both, would isolate how much of the quality gap is architectural versus a matter of positional prior.
Hybrid placement. Where in depth the attention layers sit within a hybrid stack matters. Early, middle, late, interleaved? The paper reports a single successful configuration without a placement sweep.
The ablation I would have run as reviewer is as follows. Fix total parameter count at 1.3B. Sweep the axis (scalar $A_t$ versus diagonal $A_t$) crossed with (scan kernel versus SSD matrix kernel). This is a 2x2 design. It cleanly separates the algorithmic contribution (SSD kernel) from the architectural restriction (scalar $A_t$). Without this 2x2, the paper conflates a hardware improvement with an architectural simplification.
4. Statistical Rigor
This is where I have the most concern. The headline tables report single-run perplexities and zero-shot accuracies without error bars, confidence intervals, or seed counts. In my experience, zero-shot accuracy on tasks such as PIQA, HellaSwag, and ARC has between 0.3 and 0.8 points of run-to-run variance at 1.3B scale. Several reported gaps fall within this band.
The paper does not report a significance test for any comparison. The throughput numbers do not quote variance across runs or across hardware batches. For a paper whose central empirical claim is 'match Transformer quality and beat it on speed,' the absence of variance estimates on both axes is a real gap. I would want, at minimum, three seeds per configuration at 125M and 350M, and a bootstrap confidence interval on the perplexity gap.
A concrete failure mode: suppose Mamba-2 at 2.7B shows a 1-seed perplexity advantage of 0.04 over Transformer++, and the ARC-c accuracy gap is 0.6 points. Without seed variance, we cannot tell whether we are looking at signal or at the particular initialization the authors happened to run. This is not an accusation of cherry-picking; it is a request that the empirical claim be stated at its true confidence level.
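Concretely, assuming per-document mean negative log-likelihoods are available for both models on a shared evaluation set, the paired bootstrap I am asking for is a few lines. The sketch below is my own; `perplexity_gap_ci` is a hypothetical name, not an API from the paper's codebase.

```python
import numpy as np

def perplexity_gap_ci(nll_a, nll_b, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap CI for ppl(model A) - ppl(model B).

    nll_a, nll_b: per-document mean negative log-likelihoods (nats),
    evaluated on the same documents in the same order.
    """
    nll_a, nll_b = np.asarray(nll_a), np.asarray(nll_b)
    rng = np.random.default_rng(seed)
    n = len(nll_a)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample documents with replacement
    gaps = np.exp(nll_a[idx].mean(axis=1)) - np.exp(nll_b[idx].mean(axis=1))
    return np.quantile(gaps, [alpha / 2, 1 - alpha / 2])

# Example with a synthetic, tiny gap; if the interval straddles 0,
# the headline comparison is not resolved at this sample size.
rng = np.random.default_rng(1)
docs_b = rng.normal(2.00, 0.30, size=500)
docs_a = docs_b + 0.005 + rng.normal(0, 0.05, size=500)
lo, hi = perplexity_gap_ci(docs_a, docs_b)
```

The paired resampling matters: because both models are scored on the same documents, per-document difficulty cancels, and the interval reflects only the model difference.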
5. Dataset & Evaluation Concerns
Training is on the Pile (Gao et al. 2020), or a curated subset, and evaluation is on the standard LM Harness suite (Gao et al. 2021). This is the accepted protocol, but three concerns persist.
First, the Pile has known contamination with some downstream benchmarks. This affects the Transformer baseline and Mamba-2 equally, so relative comparisons are defensible, but absolute zero-shot numbers should be interpreted accordingly. The paper does not discuss contamination.
Second, the associative-recall benchmarks (MQAR from Arora et al. 2024) are known to be diagnostic of linear-attention failure modes. Mamba-2's partial recovery of recall via hybrid attention is the load-bearing evidence that the architecture handles information-routing tasks adequately. But MQAR is synthetic. The one natural-language analog, needle-in-a-haystack, is reported only briefly. A fuller evaluation on RULER (Hsieh et al. 2024) would be the appropriate modern test of long-context retrieval fidelity, and its absence is a meaningful gap.
Third, no evaluation at context lengths beyond 8k is reported in the main tables, despite the architecture's intrinsic long-context motivation. This is surprising. The original Mamba paper at least attempted 128k-length synthetic tasks. For a successor whose pitch rests partly on scaling state and context, the lack of 32k/64k/128k natural-language evaluations is a gap the next version should address.
6. Reproducibility Assessment
Code is released at github.com/state-spaces/mamba and the Mamba-2 SSD kernel is included. The training hyperparameters (learning rates, batch sizes, warmup, weight decay) are reported in the appendix. Data mixture weights and tokenizer choices are stated. This is a relatively high bar; I rate code availability strong (4/5), data availability moderate (3/5) because the exact Pile subset and shuffling seed are not always identifiable, and experimental detail strong (4/5).
Compute estimate: training a 2.7B model for roughly 300B tokens on A100-80GB hardware requires on the order of $6ND \approx 5 \times 10^{21}$ FLOPs. At typical cluster utilization this amounts to tens of thousands of A100-hours, roughly a week on 128 A100s. Replicating the full suite from 125M to 2.7B is a substantial but not prohibitive investment, well within reach of a funded academic lab or a mid-sized industry team.
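For the record, the arithmetic follows from the standard $6ND$ approximation. The numbers below are my back-of-envelope, with an assumed 30% model-FLOPs utilization; neither figure comes from the paper.

```python
# Back-of-envelope training cost for the 2.7B run, using the standard
# 6*N*D FLOPs approximation (N = parameters, D = training tokens).
# The utilization figure is my assumption, not a number from the paper.
params = 2.7e9
tokens = 300e9
flops = 6 * params * tokens                        # ~4.9e21 FLOPs

a100_bf16_peak = 312e12                            # A100 dense bf16 peak, FLOP/s
mfu = 0.30                                         # assumed model-FLOPs utilization
gpu_hours = flops / (a100_bf16_peak * mfu) / 3600  # ~1.4e4 A100-hours
days_on_128_gpus = gpu_hours / (128 * 24)          # ~5 days on 128 GPUs
```

At 30% utilization this lands near fifteen thousand A100-hours; even an optimistic 50% utilization only halves that, so the replication cost estimate is robust to the assumption.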
7. Theoretical Insight: What the Duality Buys and What It Costs
Let me formalize the computational claim. Define the SSD operator on sequence length $T$, state size $N$, and chunk size $Q$. The scan-based Mamba-1 kernel costs $O(TN)$ memory and $O(TN)$ arithmetic, but the arithmetic is scan-shaped and cannot be expressed as a dense matrix multiplication. The SSD kernel restructures computation into $T/Q$ chunks, each requiring a $Q \times Q$ intra-chunk matmul (arithmetic $O(Q^2 N)$) plus a size-$N$ inter-chunk state update. Total arithmetic is

$$\frac{T}{Q}\left(O(Q^2 N) + O(QN)\right) = O(TQN),$$

which is worse than the scan's $O(TN)$ by a factor of $Q$, but the constant on modern GPUs is roughly 16 times better because the scan has poor tensor-core utilization while the matmul has excellent utilization. With moderate $Q$ and the tensor-core speedup, the break-even point favors SSD at most practical scales. This is the true speed argument, and it is a hardware argument, not a mathematical one.
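The chunking algebra itself is easy to sketch. The NumPy below is illustrative only (the real kernel fuses batched multi-head matmuls on tensor cores, so no speedup is claimed here): each chunk contributes a quadratic masked matmul plus a term from the carried state, and the state advances once per chunk.

```python
import numpy as np

def ssd_scan(a, B, C, x):
    """Reference O(T N) sequential scan: h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_chunked(a, B, C, x, Q=4):
    """Chunked form: a quadratic masked matmul inside each chunk, plus a
    linear-cost state handoff between chunks. Same output as the scan."""
    T, N = B.shape
    y = np.empty(T)
    h = np.zeros(N)                              # state carried across chunks
    for start in range(0, T, Q):
        sl = slice(start, min(start + Q, T))
        ac, Bc, Cc, xc = a[sl], B[sl], C[sl], x[sl]
        P = np.cumprod(ac)                       # within-chunk decay products
        L = np.tril(P[:, None] / P[None, :])     # 1-semiseparable causal mask
        y[sl] = (L * (Cc @ Bc.T)) @ xc + (Cc @ h) * P   # intra-chunk + carried state
        # advance the carried state past this chunk
        h = P[-1] * h + ((P[-1] / P)[:, None] * (Bc * xc[:, None])).sum(axis=0)
    return y
```

The intra-chunk term `(L * (Cc @ Bc.T)) @ xc` is the tensor-core-friendly piece; the state handoff is the residue of the linear recurrence and stays $O(N)$ per chunk.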
The cost is expressivity. The scalar-times-identity restriction on $A_t$ reduces the effective state rank. The restriction is costless only when the task has limited per-channel dynamic range; for tasks with heterogeneous channel behavior, diagonal $A_t$ is strictly more expressive. This connects to Orvieto et al. (2023) on linear recurrent units, where per-channel parameterization is empirically important.
8. Limitations and Failure Modes
Beyond what the authors acknowledge, I identify the following failure modes.
Exact copy and retrieval at long context. Linear attention and scalar-$A_t$ SSMs share a known information bottleneck: the state is a fixed-rank summary of the entire past. Arora et al. (2024) and Jelassi et al. (2024) prove that such architectures require state that grows with sequence length to solve general associative recall, which contradicts the fixed-size-$N$ state assumption. Mamba-2 mitigates but does not solve this via hybrid attention. A concrete failure scenario: a 32k-context code completion task requiring exact variable-name recall across distant scopes.
Distribution shift in $A_t$ scaling. The discretization step size $\Delta_t$ is selection-parameterized. At test-time sequence lengths far exceeding training, the implicit assumption that $\Delta_t$ remains in its trained regime may fail. This is the SSM analog of the positional length-extrapolation problem in Transformers (Press et al. 2022), and it is not audited.
Low-precision training. The matrix-form kernel opens the door to FP8 and lower, yet no bf16/fp8 ablation is reported. Given recent work on low-bit training dynamics (for instance the precision-scaling literature of the past year), this is a near-term question whose answer is not obvious.
9. Questions for Authors
1. What is the per-seed variance of the headline perplexity and zero-shot numbers at 1.3B and 2.7B, and does the reported Mamba-2 vs Transformer++ gap remain significant at the 95% level under a three-seed bootstrap?
2. In the 2x2 ablation of (scalar $A_t$, diagonal $A_t$) x (scan kernel, SSD kernel) at fixed parameter count, how much quality is lost by the scalar-$A_t$ restriction alone?
3. Why was Gated Linear Attention (Yang et al. 2024) not included as a primary baseline, given its algorithmic proximity to SSD?
4. At 32k, 64k, and 128k context on natural-language retrieval (RULER or similar), how does Mamba-2 compare to Transformer++ with YaRN or LongRoPE?
5. What is the training and inference behavior at FP8, and does the SSD kernel's numerical conditioning degrade more gracefully than softmax attention at low precision?
10. Verdict
Strength of empirical evidence: moderate. The theoretical observation is genuine and elegantly presented. The algorithmic contribution is real, and the hardware speedup is well-established. The claim that Mamba-2 matches Transformer quality is supported at the tested scales but not at the error-bar level one would want, and the supporting baseline coverage omits the most algorithmically adjacent competitor.
I would accept at a top venue with a request for revision: add seed variance on headline tables, include GLA and RWKV-v6 baselines, run the 2x2 expressivity-versus-kernel ablation, and report long-context retrieval on a natural-language benchmark. With those additions, the paper moves from 'strong regional result' to 'definitive reference for structured linear sequence models.'
The duality result will outlive Mamba-2. It reframes a decade of work on linear attention (Katharopoulos et al. 2020; Choromanski et al. 2021; Schlag et al. 2021) and state-space sequence models (Gu et al. 2022; Smith et al. 2023) as variations on a shared semiseparable structure. That reframing is itself a contribution, and it connects to the older literature on displacement-rank matrix factorization in a way that the paper could have exploited more fully. The open question, and the one I find most interesting, is whether there exists a structured expansion of the semiseparable class that restores the expressivity lost to the scalar-$A_t$ restriction while retaining matmul-friendly kernels. That is the paper I want to read next.
11. Reproducibility & Sources
Primary paper: Dao, T. and Gu, A. 'Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.' arXiv:2405.21060.
Code repository: github.com/state-spaces/mamba (includes the Mamba-2 SSD kernel in the official Mamba codebase).
Datasets: The Pile (Gao et al. 2020), SlimPajama (Soboleva et al. 2023) derivative mixtures, LM Evaluation Harness task suite (Gao et al. 2021), MQAR associative-recall benchmark (Arora et al. 2024).
Reproducibility ratings: code availability 4/5, data availability 3/5, experimental detail 4/5. Compute budget to replicate the 2.7B result: on the order of $10^4$ A100-80GB hours.
