Mixture-of-Depths Under Experimental Audit: Does Top-k Token Routing Learn Difficulty, or Just Position?

The widely circulated result from Raposo et al. (arXiv:2404.02258) is that per-token dynamic compute allocation enables transformers to match isoFLOP baselines with substantially fewer active parameters per layer. The paper frames this as compute-optimal transformer inference: a top-k router selects which tokens participate in self-attention and the MLP at each block, while the remainder route around the block via the residual. The claim worth auditing is not whether this works in aggregate. It is whether the router has actually learned something resembling semantic difficulty, or whether it has converged on a positional heuristic that a deterministic rule could reproduce for free.

Let us examine the error bars.

1. Claims vs. Evidence Map

The paper advances four intertwined claims. I will restate each in auditable form before assessing the supporting evidence.

Claim A. A learned top-k router over depth achieves loss equivalent to a dense isoFLOP baseline while using fewer FLOPs per forward pass, and matches or exceeds it at matched training FLOPs.

Claim B. Because k is fixed as a constant fraction of the sequence length (e.g. 12.5% or 50% of tokens), the computation graph remains static, so MoD avoids the dynamic-shape problems that plague early-exit and Adaptive Computation Time [Graves, 2016].

Claim C. The router learns a non-trivial allocation policy: different tokens receive different amounts of compute, and the allocation correlates with something resembling prediction difficulty rather than being uniform or purely positional.

Claim D. MoD composes with Mixture-of-Experts (MoE) to yield MoDE, with additive gains over either alone.

Claim | Supporting Evidence in Paper | Evidence Strength
A (FLOPs-quality frontier) | IsoFLOP sweeps at 220M-1.1B scale, loss vs. FLOPs curves | Moderate
B (static graph) | Mechanical consequence of fixed-k design | Strong (by construction)
C (learns difficulty) | Qualitative routing visualizations, loss improvements | Weak
D (MoDE composition) | Ablation combining MoD + MoE | Moderate

Claim A is the headline; Claim C is the story. The experimental support for C is the thinnest part of the paper, and it is precisely the part that determines whether this constitutes a genuine contribution or a reparameterization of depth-wise dropout with a learned mask.

2. Baseline Audit

The comparison set in the paper is dense transformers at matched FLOPs, plus a stochastic-routing variant as a control. That is a reasonable first pass. It is not sufficient.

The missing baselines fall into three categories. First, learned early-exit methods. CALM [Schuster et al. 2022] and the depth-adaptive transformer [Elbayad et al. 2020] solve a closely related problem: per-token compute allocation over depth. MoD differs by keeping the computation graph static and routing around blocks rather than halting, but the underlying research question (does per-token depth allocation help at fixed compute?) is shared. Omitting these baselines makes it impossible to disentangle the value of the static-graph property from the value of the routing itself.

Second, deterministic positional routing. The null hypothesis for Claim C is that the router learns slots corresponding to a fixed positional pattern, for instance, every other token, or tokens at sentence boundaries. A baseline that hard-codes a positional mask (e.g. drop every n-th position in a stride pattern) would establish the floor. If learned MoD beats this by less than a standard deviation, the router is not doing semantic work; it is amortizing a positional prior.
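A minimal sketch of this null baseline, assuming only a target routing density; the function name and the stride rule are mine, not the paper's:

```python
import numpy as np

def stride_mask(seq_len: int, density: float) -> np.ndarray:
    """Deterministic positional routing mask: route in every
    round(1/density)-th token. A hypothetical floor baseline for
    the learned router, not a method from the paper."""
    stride = max(1, round(1.0 / density))
    mask = np.zeros(seq_len, dtype=bool)
    mask[::stride] = True  # fixed positions, no learning involved
    return mask

# At 12.5% capacity, every eighth token is routed in.
m = stride_mask(16, 0.125)
```

If learned routing cannot beat this mask by more than seed noise, the router's contribution is positional, not semantic.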

Third, Mixture-of-Experts at matched active parameters. Switch Transformer [Fedus et al. 2022] and GShard [Lepikhin et al. 2021] already allocate compute per token, just across width rather than depth. The paper does evaluate MoD + MoE (MoDE) but does not, to my reading, isolate MoD vs. MoE at truly matched active FLOPs and matched training tokens. A fair head-to-head would ask: given a fixed compute budget and parameter budget, is per-depth routing more sample-efficient than per-expert routing? That experiment is absent.

The baseline was not properly tuned in one specific sense: the dense isoFLOP baseline uses the same learning-rate schedule as the MoD variants. MoD's effective batch of active tokens per block is smaller, which alters the gradient noise scale [McCandlish et al. 2018]. Without a learning-rate sweep per configuration, some portion of the reported win could be an artifact of the dense baseline being slightly mistuned at the smaller active-FLOP operating point.

3. Ablation Completeness

The paper provides ablations over c (the capacity factor), routing granularity (every block vs. every other block), and MoD + MoE composition. These isolate architectural choices. They do not isolate the claim that matters.

The ablation I would have run, and which is conspicuously missing, is a router-counterfactual study. Freeze a trained MoD model. Replace the learned routing decisions at inference time with: (a) a random mask of the same density, (b) a positional mask that selects the same indices every time, (c) a mask derived from token entropy under a fixed reference model, (d) an inverted mask, route in exactly the tokens the learned router would have routed out. If (a) and (b) substantially degrade performance and (d) catastrophically degrades it, the router is doing real work. If the gap to (a) is small, the learned router is redundant.
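Masks (a), (b), and (d) are cheap to construct once the learned mask is in hand. A sketch, with hypothetical names; mask (c) is omitted here because it requires a reference model:

```python
import numpy as np

rng = np.random.default_rng(0)

def counterfactual_masks(learned_mask: np.ndarray) -> dict:
    """Given a learned boolean routing mask over a sequence, build the
    counterfactual masks for the proposed ablation: (a) random at the
    same density, (b) fixed positional at the same density, (d) inverted.
    Illustrative sketch, not an experiment from the paper."""
    T = learned_mask.shape[0]
    k = int(learned_mask.sum())
    random_mask = np.zeros(T, dtype=bool)
    random_mask[rng.choice(T, size=k, replace=False)] = True
    stride = max(1, T // k)
    positional_mask = np.zeros(T, dtype=bool)
    positional_mask[::stride][:k] = True  # same k indices every sequence
    return {
        "random": random_mask,        # (a)
        "positional": positional_mask, # (b)
        "inverted": ~learned_mask,     # (d)
    }

learned = np.array([True, False, True, False, False, False, False, False])
masks = counterfactual_masks(learned)
```

Evaluating a frozen MoD model under each mask, with no retraining, is exactly the counterfactual the review asks for.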

A second missing ablation concerns depth-wise routing dynamics. Does a token routed out at layer l tend to be routed out at layer l+1 as well? If yes, the router has effectively learned an early-exit policy with extra steps. If no, the routing is genuinely layer-specific and the expressive gain over early-exit is real. The paper shows routing visualizations, but I could not find a quantitative per-token cross-layer correlation matrix. That single figure would settle whether MoD is early-exit in disguise.
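The missing figure reduces to a correlation matrix over per-layer routing decisions. A sketch, assuming routing decisions can be logged as a boolean layers-by-tokens array (the logging format is an assumption):

```python
import numpy as np

def crosslayer_agreement(route_out: np.ndarray) -> np.ndarray:
    """route_out: (num_layers, num_tokens) boolean array, True where a
    token is routed around the block. Returns the (L, L) Pearson
    correlation matrix between layers' routing decisions. Uniformly
    high off-diagonal values would indicate an implicit early-exit
    policy; near-zero values would indicate layer-specific routing."""
    return np.corrcoef(route_out.astype(float))

# Toy log: layers 0 and 1 agree perfectly, layer 2 is their complement.
route_out = np.array([[1, 0, 1, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1]], dtype=bool)
C = crosslayer_agreement(route_out)
```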

A third: train-test routing distribution shift. At training time the router receives gradient signal only for the tokens it routes in. At inference, the top-k operation is hard, so a token that was borderline during training may land on the opposite side at test time. The paper addresses this via an auxiliary predictor, but the ablation that removes the predictor while holding everything else constant should be reported with variance across seeds. The devil lives in the evaluation protocol, and the train-test mismatch for non-causal top-k is exactly where it hides.

4. Statistical Rigor

This is where the audit grows uncomfortable. The isoFLOP plots report loss curves, but seed variance is, so far as I can determine from the public version, either derived from a single seed per configuration or simply unreported. For a headline claim that hinges on loss differences of roughly 0.01-0.05 nats between MoD and dense at matched FLOPs, unreported seed variance is a methodological hole.

Consider the effect size. In language model pretraining, within-configuration seed-to-seed standard deviation on validation loss at the 1B scale is typically a few hundredths of a nat or less, depending on the tokens-per-parameter regime [see e.g. the Pythia suite; Biderman et al. 2023]. If MoD's reported improvement is Δ nats and the seed standard deviation is σ nats, the relevant effect size is Δ/σ; with Δ in the reported 0.01-0.05 range and σ of comparable order, the effect can look large for a single pair of runs while remaining statistically marginal. A proper significance test requires at minimum three seeds per configuration, and the confidence interval should be reported along the loss-vs-FLOPs curve, not merely at the best point.
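A minimal version of the per-seed significance check, with a pooled-standard-deviation effect size; the loss values below are illustrative, not the paper's numbers:

```python
import math
from statistics import mean, stdev

def effect_size(dense_losses, mod_losses):
    """Cohen's-d-style effect size Δ/σ for a per-seed loss comparison,
    pooling the two groups' sample standard deviations. Requires at
    least two seeds per configuration; three or more is the minimum
    this review argues for."""
    delta = mean(dense_losses) - mean(mod_losses)
    pooled = math.sqrt((stdev(dense_losses) ** 2 + stdev(mod_losses) ** 2) / 2)
    return delta / pooled

# Hypothetical three-seed runs: a 0.015-nat gap against a 0.005-nat seed std.
d = effect_size([2.310, 2.320, 2.315], [2.295, 2.300, 2.305])
```

With only one seed per configuration, neither Δ nor σ is estimable, which is precisely the problem.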

The compute-optimal claim also interacts with Chinchilla-style scaling laws [Hoffmann et al. 2022]. A model that wins at one tokens-per-parameter ratio may lose at another. Without a full isoFLOP sweep at multiple aspect ratios, the claim that MoD is Pareto-optimal is extrapolation beyond the tested envelope.

Reproducibility is not optional. For a paper that purports to shift the FLOPs-quality frontier, seed variance and error bars are the minimum viable evidence.

5. Dataset & Evaluation Concerns

MoD is evaluated on language-modeling loss over a proprietary pretraining mixture, plus downstream zero-shot and few-shot evaluation on standard benchmarks. Two concerns deserve attention.

First, loss is not the quantity of interest for a routing method. MoD changes the inductive bias of the network: some tokens receive more compute, some less. In principle, this should help hard-to-predict tokens more than easy ones. Aggregate validation loss averages over both populations. A stratified evaluation, reporting loss separately for high-entropy and low-entropy tokens under a reference model, would reveal whether MoD reallocates compute where it matters. Such evaluation is standard in the efficient-inference literature [Schuster et al. 2022], and its absence here is noticeable.
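The stratified evaluation can be sketched in a few lines, assuming per-token losses from the audited model and predictive entropies from a fixed reference model are available (both are assumptions about the logging setup):

```python
import numpy as np

def stratified_loss(token_losses, ref_entropy, quantile=0.5):
    """Split per-token losses by a reference model's predictive entropy
    and report the mean loss in each stratum. If MoD reallocates compute
    where it matters, its gain over dense should concentrate in the
    high-entropy stratum. The median split is an arbitrary choice."""
    token_losses = np.asarray(token_losses)
    ref_entropy = np.asarray(ref_entropy)
    cut = np.quantile(ref_entropy, quantile)
    hard = ref_entropy >= cut
    return {
        "easy_loss": float(token_losses[~hard].mean()),
        "hard_loss": float(token_losses[hard].mean()),
    }

# Toy example: two easy tokens, two hard tokens.
strata = stratified_loss([1.0, 1.0, 4.0, 4.0], [0.1, 0.2, 2.0, 3.0])
```

Comparing the MoD-minus-dense delta within each stratum is the test Claim C actually needs.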

Second, downstream benchmarks at 220M-1B scale are noisy. At these scales, many of the standard LM-eval benchmarks (HellaSwag, PIQA, ARC-easy) have between-seed standard deviations on the order of 1 to 2 accuracy points. Reporting point estimates without confidence intervals at these scales is reporting noise. The paper's tabulated downstream numbers need a ±.
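A quick normal-approximation check of the evaluation noise floor; the ~10k example count roughly matches HellaSwag's validation split, and the 40% accuracy is illustrative:

```python
import math

def accuracy_ci(acc: float, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a benchmark
    accuracy estimated from n examples. A fast way to check whether a
    reported MoD-vs-dense gap exceeds sampling noise alone (seed
    variance comes on top of this)."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

lo, hi = accuracy_ci(0.40, 10000)
```

At this size the half-width is already about a point, comparable to typical small-scale deltas, before any between-seed variance is counted.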

Data contamination is a lesser concern here, since the method is architectural rather than a benchmark-targeted finetune, but the pretraining mixture is proprietary, which fundamentally caps external reproducibility.

6. Reproducibility Assessment

Given the paper's level of specification, reproducing the core result requires the routing module and its auxiliary predictor, the top-k operator with straight-through gradient estimation (the paper employs a specific non-causal workaround whose details matter), a transformer stack with periodic MoD blocks, and a pretraining corpus at multi-billion-token scale. A conservative estimate of the compute required for the 1B-parameter isoFLOP sweep is on the order of 10^21 to 10^22 FLOPs total for a full sweep with three seeds per configuration. That is a small-cluster-weeks budget, not a laptop exercise.

The bottleneck for external reproduction is not the architecture. The router is conceptually simple, essentially a linear projection to a scalar weight, followed by top-k selection. The bottleneck is the training data and the training recipe. Without the pretraining mixture, any reproduction will compare MoD-on-corpus-X against dense-on-corpus-X and will reproduce the relative claim only if the corpus does not interact with routing dynamics. That interaction is plausible: code-heavy corpora exhibit different token-level entropy structure than web text.

Formally, for a sequence of length T, model width d, and capacity factor c (so k = cT tokens are routed in), MoD replaces the standard per-block compute of roughly

    O(T^2 d) [attention] + O(T d^2) [MLP]

with

    O(k^2 d) [attention] + O(k d^2) [MLP] + O(T d) [router],

where the last term is the router itself. The quadratic savings in attention are the main FLOPs win; the MLP savings are linear in k. At c = 0.125, the attention term drops by a factor of 1/c^2 = 64, which is why the method looks dramatic on paper. It is also why the method's advantage concentrates at long context, and why reporting only short-context results (as much of the paper does) understates both the benefit and the evaluation risk.
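Plugging rough leading-order coefficients into this accounting (the 2·T^2·d attention-score and 8·T·d^2 MLP constants are the usual back-of-envelope values, not the paper's exact FLOPs counter):

```python
def block_flops(T: int, d: int, c: float) -> dict:
    """Leading-order per-block FLOPs for a dense block vs. a MoD block.
    Coefficients are illustrative back-of-envelope values: 2*T^2*d for
    attention scores, 8*T*d^2 for the MLP, and ~2*T*d for the router's
    scalar projection. k = c*T tokens are routed in."""
    k = int(c * T)
    dense = 2 * T * T * d + 8 * T * d * d
    mod = 2 * k * k * d + 8 * k * d * d + 2 * T * d  # last term: router
    return {"dense": dense, "mod": mod, "ratio": mod / dense}

# Long context is where the quadratic attention savings dominate.
r = block_flops(T=8192, d=2048, c=0.125)
```

At short context the MLP term dominates and the ratio is far less flattering, which is the evaluation risk noted above.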

7. Limitations & Failure Modes Not Addressed

Three failure modes merit explicit discussion beyond what the paper acknowledges.

Failure mode 1: long-context generation with autoregressive causal routing. The paper's training-time top-k is non-causal: the router sees the entire sequence to pick tokens. At inference, for autoregressive generation, this information is not available. The paper proposes an auxiliary predictor to approximate the routing decision causally. The predictor's error rate compounds over generated tokens. For generation lengths of L tokens and per-token routing error ε, the expected number of misrouted tokens grows as εL, and the loss penalty is superlinear in the misrouting rate, because misrouted hard tokens hurt more than misrouted easy tokens. Long-form generation quality should be evaluated, not just loss.
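A Monte Carlo sanity check of the εL scaling, with a hypothetical per-token error rate (the 2% figure is an assumption, not a measured predictor error):

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_misroutes(gen_len: int, eps: float, trials: int = 10000) -> float:
    """Simulate independent per-token routing errors at rate eps over a
    generation of length gen_len and return the mean number of misrouted
    tokens across trials. Under independence this converges to eps*gen_len;
    correlated predictor errors would make the tail worse."""
    misses = rng.random((trials, gen_len)) < eps
    return float(misses.sum(axis=1).mean())

# 1000 generated tokens at a hypothetical 2% routing error rate.
m = expected_misroutes(gen_len=1000, eps=0.02)
```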

Failure mode 2: distribution shift in routing statistics. The router is trained on a specific pretraining distribution. At inference on a domain-shifted input (e.g. code after pretraining predominantly on web text), the calibration of the routing decision may degrade. The capacity factor is fixed, so the router is forced to route in exactly k tokens regardless of whether the input is hard or easy globally. A uniformly difficult input (e.g. a mathematical proof) receives the same compute as a uniformly easy input (e.g. boilerplate text). This is a structural property of the fixed-k design, and it trades off against the static-graph advantage.

Failure mode 3: gradient pathology at the routing boundary. The router output is a scalar weight multiplied with the block output for routed-in tokens and zero for routed-out tokens. Straight-through estimation at the top-k boundary means gradient flows to the router only through the selected tokens. This creates a winner-takes-all dynamic: tokens consistently selected receive refined gradients, while tokens consistently dropped receive sparse signal. The implicit assumption is that the router's init distribution is good enough to prevent collapse. At larger scales or with different init schemes, mode collapse of the router is a real risk and ought to be stress-tested.
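The mechanism under audit, sketched in NumPy (forward pass only; variable names are mine and this is not the authors' implementation):

```python
import numpy as np

def topk_route(x: np.ndarray, w: np.ndarray, k: int):
    """Forward pass of a scalar-weight top-k router: project each token
    to a score, keep the k highest, and gate the selected tokens by
    their score. In training, gradient reaches w only through the k
    selected tokens, which is the winner-takes-all dynamic described
    above; routed-out tokens contribute zero gate and zero signal."""
    scores = x @ w                       # (T,) scalar score per token
    selected = np.argsort(scores)[-k:]   # indices of the top-k tokens
    gate = np.zeros_like(scores)
    gate[selected] = scores[selected]    # routed-out tokens stay at zero
    return selected, gate

rng = np.random.default_rng(2)
x = rng.standard_normal((16, 8))  # 16 tokens, width 8
w = rng.standard_normal(8)        # router projection
sel, gate = topk_route(x, w, k=2)
```

Tracking how the set `sel` evolves over training, and whether it collapses to a fixed subset, is the stress test this failure mode calls for.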

8. Questions for the Authors

1. What is the seed-to-seed standard deviation of validation loss for the 1B MoD model vs. the 1B dense isoFLOP baseline? How many seeds were run?

2. If you freeze a trained MoD model and replace the learned routing at inference with (a) a stride-based positional mask of the same density and (b) a random mask, what fraction of the MoD-vs-dense improvement remains?

3. What is the cross-layer correlation of routing decisions for the same token? If a token is routed out at layer l, what is the probability it is routed out at layer l+1?

4. How does the auxiliary causal predictor's agreement with the teacher-forcing non-causal top-k decision decay as a function of position in the sequence during autoregressive generation?

5. At long context (say, 32K tokens), do the reported FLOPs-quality gains hold, and does the static-graph benefit over CALM-style early exit remain practically meaningful?

9. Verdict

Mixture-of-Depths is a clean idea with a plausible mechanism and a genuinely useful engineering property: static computation graphs at heterogeneous per-token compute. As a piece of architectural engineering, I would classify this as (d) engineering improvement with (c) empirical finding as a secondary contribution. It is not (a) a new theoretical result, and the evidence for (c), that the router learns semantic difficulty, is, at present, weak.

Novelty: moderate. The closest prior art lies in the early-exit and adaptive-computation lineage [Graves, 2016; Elbayad et al. 2020; Schuster et al. 2022] combined with token-level MoE routing [Shazeer et al. 2017; Fedus et al. 2022]. What is new is the specific combination: routing over depth rather than width, fixed-k for a static graph, and integration with standard transformer stacks. What is not new is the observation that per-token compute allocation helps.

Evidence strength: moderate for Claim A, weak for Claim C. The headline FLOPs-quality result is plausible but unverified at reported precision without seed variance. The semantic-routing claim is supported by visualizations, not by the router-counterfactual ablation that would settle it.

At a top venue, I would lean accept with a strong revision request for seed variance, the router-counterfactual ablation, and a head-to-head against a learned early-exit baseline at matched compute. Without those, the paper's contribution is the engineering artifact (static graph, fixed-k), not the scientific claim (learned semantic routing).

Negative results and careful audits of positive-looking methods are both contributions. This review is an invitation to run the experiments that would render MoD's claims bulletproof rather than merely suggestive.

10. Reproducibility & Sources

Primary paper. Raposo, D., Ritter, S., Richards, B., Lillicrap, T., Humphreys, P. C., and Santoro, A. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. arXiv:2404.02258, 2024.

Code repository. No official code release accompanies the paper. External unofficial re-implementations exist but should not be treated as authoritative.

Datasets. The pretraining mixture is proprietary and not publicly redistributed; downstream evaluation uses standard LM benchmarks (HellaSwag, PIQA, ARC, BoolQ, WinoGrande) accessible through the lm-evaluation-harness.

Reproducibility rating.

Dimension | Rating (1-5) | Note
Code availability | 2 | No official release; re-implementation is feasible from the paper
Data availability | 1 | Pretraining data is proprietary
Experimental detail | 3 | Architecture described; seeds, learning-rate sweeps, and full routing analysis underspecified

Key prior works referenced in this review. Vaswani et al. 2017 (attention); Graves, 2016 (ACT); Dehghani et al. 2019 (Universal Transformer); Elbayad et al. 2020 (depth-adaptive transformer); Schuster et al. 2022 (CALM); Shazeer et al. 2017 (sparsely-gated MoE); Lepikhin et al. 2021 (GShard); Fedus et al. 2022 (Switch Transformer); Hoffmann et al. 2022 (Chinchilla); Kaplan et al. 2020 (scaling laws); McCandlish et al. 2018 (gradient noise scale); Biderman et al. 2023 (Pythia).