Abstract
Marigold [Ke et al. 2023; arXiv:2312.02145] argues that Stable Diffusion v2, with its VAE frozen and its U-Net fine-tuned on roughly 74k purely synthetic image-depth pairs from Hypersim and Virtual KITTI 2, can deliver zero-shot affine-invariant monocular depth that rivals or surpasses models trained on orders of magnitude more real data (MiDaS, DPT, Omnidata, LeReS, HDN). The headline claim is that the *generative prior* encodes scene-structure knowledge that supervised regression backbones lack. In this commentary I first steelman that claim, then argue the paper has not demonstrated it. Most of Marigold's reported gain can be reproduced by a simpler alternative explanation: synthetic-data domain coverage, an affine-invariant loss in latent space, and test-time ensembling. The generative-prior hypothesis is an *underdetermined* inference from the experiments shown. What would change my mind is a very specific ablation that the authors elected not to run.
1. Steelman: Why Marigold Is a Serious Result
Let me first present the strongest reading, because the paper is genuinely well-executed and I want the critique to land on the right target.
The setup is deceptively simple. Marigold takes the Stable Diffusion v2 U-Net, freezes the VAE encoder and decoder, and reformulates depth estimation as a latent conditional diffusion problem: given the latent $z^{(x)}$ of the RGB image $x$, learn to denoise a depth latent $z^{(d)}$, where the target $d$ is a three-channel replicated, affine-normalized depth map in $[-1, 1]$. The denoising objective is the standard $\epsilon$-prediction loss,
$$\mathcal{L} = \mathbb{E}_{z^{(d)},\, \epsilon \sim \mathcal{N}(0, I),\, t}\big[\lVert \epsilon - \epsilon_\theta(z_t^{(d)}, z^{(x)}, t) \rVert_2^2\big],$$
with the image latent injected by channel concatenation at the U-Net input. At inference, the authors run DDIM with 50 steps and an ensemble of 10 noise draws, aligning each sample by a per-pair affine transform before taking the pixel-wise median.
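The training step described above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the 3-channel replication stands in for the VAE encoding, the percentile cutoffs in `normalize_depth` are an assumption, and `eps_model` is a placeholder for the U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_depth(d, low=0.02, high=0.98):
    """Affine-normalize a depth map to [-1, 1] using robust percentiles
    (the exact cutoffs here are an assumption, not Marigold's)."""
    d_lo, d_hi = np.quantile(d, [low, high])
    d_norm = (d - d_lo) / max(d_hi - d_lo, 1e-6) * 2.0 - 1.0
    return np.clip(d_norm, -1.0, 1.0)

def diffusion_training_step(z_image, depth, alpha_bar_t, eps_model):
    """One epsilon-prediction step: noise the depth latent, concatenate the
    image latent on the channel axis, and regress the injected noise."""
    # stand-in for VAE encoding: replicate normalized depth to 3 channels
    z_depth = np.repeat(normalize_depth(depth)[None], 3, axis=0)
    eps = rng.standard_normal(z_depth.shape)
    # forward diffusion q(z_t | z_0)
    z_t = np.sqrt(alpha_bar_t) * z_depth + np.sqrt(1 - alpha_bar_t) * eps
    unet_input = np.concatenate([z_image, z_t], axis=0)  # channel concat
    eps_hat = eps_model(unet_input)
    return np.mean((eps - eps_hat) ** 2)  # standard MSE on the noise

# toy usage: a "model" that always predicts zero noise yields loss ~ E[eps^2] = 1
depth = rng.uniform(0.5, 10.0, size=(16, 16))
z_image = rng.standard_normal((3, 16, 16))
loss = diffusion_training_step(z_image, depth, alpha_bar_t=0.5,
                               eps_model=lambda x: np.zeros((3, 16, 16)))
```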
The empirical evidence for the steelman is not trivial. On NYUv2 [Silberman et al. 2012], Marigold reports AbsRel of roughly 5.5%; on KITTI [Geiger et al. 2012], roughly 9.9%; on ETH3D [Schops et al. 2017], roughly 6.5%; on ScanNet [Dai et al. 2017], roughly 6.4%; on DIODE [Vasiljevic et al. 2019], roughly 30.8%. These figures match or improve on the zero-shot numbers reported by MiDaS v3 / DPT-Large [Ranftl et al. 2020; Ranftl et al. 2022] and by HDN and Omnidata v2, all of which were trained on much larger real or real-plus-synthetic mixtures.
If one accepts the comparison at face value, the takeaway is striking: 74k synthetic pairs beat 1.5M+ real pairs, and the only plausible explanation is that Stable Diffusion has already internalized a usable 3D scene prior during its 5B-image pretraining. That would be a genuine architectural finding, and it would justify the paper's framing.
2. The Weakest Link: The Generative Prior Hypothesis Is Underdetermined
This is where I think the argument breaks.
The paper attributes its zero-shot generalization to the *pretrained diffusion prior*. But the experimental design does not isolate that factor from at least three confounds, each of which could individually account for a large fraction of the reported gain.
Confound A: Synthetic-data coverage, not the prior, may be doing the work. Hypersim [Roberts et al. 2021] is an extraordinarily rich dataset: 461 indoor scenes rendered with path-traced global illumination, physically plausible materials, and dense perfect depth. Its per-pixel depth fidelity is precisely what real RGB-D datasets lack. NYUv2's Kinect depth suffers from sub-meter holes, edge bleeding, and range saturation beyond 5 m. ScanNet depth is better, but still noisy. Marigold fine-tunes on clean depth supervision and evaluates with *affine-invariant* metrics. An equally valid control would be this: take a DPT backbone initialized from DINOv2 [Oquab et al. 2023] or from ImageNet, and fine-tune on the *same* Hypersim + Virtual KITTI 2 split, with the *same* affine-invariant loss, with no real data. I do not see this comparison in the paper. Without it, we cannot attribute the gain to the diffusion prior as opposed to supervision quality.
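The affine-invariant metric at the heart of this comparison is worth making concrete. A minimal sketch of the MiDaS-style protocol follows: fit a per-image scale and shift by least squares, then compute AbsRel on the aligned prediction. The closed-form alignment is standard; the exact per-image masking and clipping used by each benchmark is not reproduced here.

```python
import numpy as np

def affine_align(pred, gt):
    """Solve min_{s,t} ||s * pred + t - gt||^2 in closed form (least squares)."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + t

def abs_rel(pred, gt):
    """Affine-invariant AbsRel: align first, then mean |pred - gt| / gt."""
    aligned = affine_align(pred, gt)
    return np.mean(np.abs(aligned - gt) / gt)

# a prediction that is correct up to an arbitrary affine transform scores ~0:
gt = np.linspace(1.0, 10.0, 100)
pred = 0.25 * gt + 3.0
```

The key property is that the metric rewards relative scene structure, not metric scale, which is exactly why clean synthetic supervision with perfect relative depth is so well matched to it.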
Confound B: Latent-space affine-invariant loss is a novel implicit regularizer. Predicting a normalized depth in a learned latent is not the same as predicting pixel-space depth. The VAE compresses high-frequency noise and imposes a prior that favors smooth, structurally plausible outputs. I suspect much of Marigold's visually pleasing boundary behavior originates in the VAE decoder, not in the U-Net. The paper does not ablate against a direct pixel-space decoder under identical supervision. A fair ablation would keep everything the same but predict pixel-space depth with a convolutional head, or replace the Stable Diffusion VAE with a randomly initialized autoencoder of similar capacity.
Confound C: Test-time ensembling inflates the numbers. Marigold reports results with N = 10 noise ensembles, median-aggregated after per-sample affine alignment. This is *test-time compute* that the deterministic DPT baseline does not spend. From the ablation in Table 5, N = 1 yields noticeably worse AbsRel on NYUv2 (roughly 6.0-6.5% versus 5.5%). The fair comparison is either to give DPT an equivalent TTA budget (multi-scale, flip, crop ensembling), or to restrict Marigold to N = 1. The field has been here before: MiDaS outpaced its baselines in part because of the affine-invariant loss, and that lesson had to be re-learned. Marigold adds a new knob, TTA ensembles, and its advantage at N = 1 against a TTA-equipped DPT is never shown.
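To make the ensembling step concrete, here is a simplified sketch: align every noise draw onto the first by least squares, then take the pixel-wise median. Marigold's released implementation jointly optimizes the per-sample scales and shifts rather than aligning to a fixed reference, so treat this as an approximation of the idea, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def align_to(ref, x):
    """Least-squares scale/shift mapping sample x onto reference ref."""
    A = np.stack([x.ravel(), np.ones(x.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s * x + t

def ensemble_depth(samples):
    """Align every draw to the first, then take the pixel-wise median.
    (Simplification: Marigold solves a joint alignment instead.)"""
    ref = samples[0]
    aligned = np.stack([align_to(ref, s) for s in samples])
    return np.median(aligned, axis=0)

# 10 noisy draws of the same scene, each with its own random scale/shift
true_depth = np.linspace(1.0, 5.0, 64).reshape(8, 8)
draws = [rng.uniform(0.5, 2.0) * true_depth + rng.uniform(-1.0, 1.0)
         + 0.05 * rng.standard_normal((8, 8)) for _ in range(10)]
fused = ensemble_depth(draws)
```

The point of the confound is visible in the sketch itself: the median over aligned draws suppresses per-sample noise, which is exactly the test-time compute a single-forward-pass baseline never gets to spend.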
These three confounds compound. Strip away synthetic-data quality, latent-space smoothing, and ensembling, and the residual "benefit of the diffusion prior" becomes an empirical question the paper does not answer.
3. Alternative Interpretation: Marigold as a Good Denoiser, Not a Good Geometer
Here is a reading the authors did not consider.
Stable Diffusion's pretraining objective is *not* 3D. It is text-conditional 2D denoising on LAION. Whatever 3D knowledge SD v2 possesses is incidental, emerging from the correlation between image statistics and shading, texture gradient, and occlusion. What SD *is* demonstrably good at is producing spatially coherent, high-resolution image-like outputs conditioned on an input. Depth maps, when treated as pseudo-RGB, *are* image-like: piecewise smooth, bounded, with sharp boundaries at object edges. The U-Net's inductive biases, especially attention at coarse resolutions and skip connections at fine ones, are well suited to producing such outputs.
Under this alternative reading, Marigold is not recovering a hidden 3D prior. It is recovering a *hidden 2D image-completion prior* and applying it to a 2D signal that happens to be named "depth." The reason it generalizes from Hypersim to NYUv2 is that indoor scenes share similar piecewise-planar layout statistics, and the U-Net's coarse-to-fine synthesis pipeline interpolates well across that distribution.
This alternative has a testable consequence. If the diffusion prior were genuinely geometric, Marigold should perform well on *out-of-distribution geometry*: cluttered transparent scenes, mirrors, sparse-texture outdoor imagery, or strong specular reflections where the 2D cues lie. The ETH3D and DIODE numbers are a mixed signal; an AbsRel of 30.8% on DIODE is quite high and suggests the model struggles exactly where a real geometric prior would help most. A proper test would be a benchmark such as HAMMER, or a mirror-heavy subset of MVS datasets, neither of which the paper includes.
4. Methodology Critique
Let me be specific about the experimental design issues beyond the high-level confounds.
Baseline parity. The baselines (MiDaS v3, DPT-Large, HDN, Omnidata v2, LeReS) were not retrained on Marigold's training split; they were used as released. This is standard in zero-shot benchmarking, but it obscures data effects. DPT trained on Hypersim + Virtual KITTI 2 alone would be the scientifically correct control.
Pretraining lineage. Stable Diffusion v2 was trained on LAION-5B, a 5.85B image-text corpus. Some subset of LAION almost certainly overlaps with NYUv2 and KITTI scene statistics, since indoor photography and driving imagery are both heavily represented. A zero-shot claim in which the pretraining set is 100x larger than any competing method is not really a zero-shot claim in the spirit of the evaluation; it is a *zero-shot for depth labels* claim. The paper should acknowledge this asymmetry explicitly.
Statistical significance. The ensemble variance (standard deviation across samples) is reported, but between-seed variance at training time is not. Given that fine-tuning on 74k pairs is cheap, running 3-5 training seeds and reporting mean ± std would cost little and would let us judge whether the 0.3% AbsRel gaps to competitors are meaningful.
Missing ablations.
1. DPT-equivalent backbone fine-tuned on the same synthetic split, same loss, same TTA.
2. Marigold with a randomly initialized U-Net (ablates the prior; retains architecture + VAE).
3. Marigold with a pixel-space decoder (ablates the VAE smoothing).
4. Marigold at N = 1 versus competitors given an equivalent TTA budget.
At least (1) and (2) are essential to the paper's central claim. Their absence is the methodological gap I find hardest to forgive.
5. Key Numbers
| Dataset | Marigold AbsRel | Prior zero-shot SOTA | Relative gain |
|---|---|---|---|
| NYUv2 | ~5.5% | DPT-Large ~10.0% | ~45% |
| KITTI | ~9.9% | DPT ~11.1% | ~11% |
| ETH3D | ~6.5% | DPT ~7.8% | ~17% |
| ScanNet | ~6.4% | DPT ~8.2% | ~22% |
| DIODE | ~30.8% | DPT ~27.0% | -14% |
Note the DIODE row. Marigold is *worse* than DPT on a dataset with challenging depth-of-field and outdoor scenes. If the diffusion prior were broadly geometric, this is precisely the benchmark on which we would expect improvement.
| Cost metric | Marigold | DPT-Large |
|---|---|---|
| Inference steps | 50 DDIM x 10 ensembles = 500 U-Net forwards | 1 forward pass |
| Parameters (active) | ~865M (SD v2 U-Net) | ~344M |
| Latency per image (A100, reported) | ~10 s at 768 px | ~0.03 s |
The accuracy gain comes at roughly 300x the inference cost. That is a tradeoff worth stating in the abstract.
6. What Would Change My Mind
I want to be concrete about falsification. The following experiments would validate or refute the generative-prior hypothesis:
1. Random-init control. Train the identical U-Net architecture, same VAE, same data, and same schedule, but from random weights. If zero-shot performance drops to MiDaS-level, SD pretraining contributed. If it stays close to Marigold, the prior is not the mechanism.
2. Backbone swap. Replace the SD U-Net with a DINOv2-ViT-L/14 backbone and a light decoder, and fine-tune on the same 74k pairs with the same affine-invariant latent loss. Compare the result to Marigold's. If DINOv2 matches or exceeds Marigold, the claim that *diffusion* pretraining is special collapses; what matters is large-scale pretraining of any kind.
3. Out-of-distribution geometry benchmark. Evaluate on transparent-object and mirror-heavy scenes (ClearGrasp, Mirror3D). A genuine geometric prior should help; a 2D image-completion prior should fail.
4. Training-set scaling. Report the curve of zero-shot AbsRel versus number of synthetic training pairs, from 1k to 74k. A flat curve beyond 10k would suggest the prior is doing most of the work; a steep curve says the supervision is.
None of these are expensive. A well-resourced lab could run (1) and (2) in under a week of A100 time. The absence of (1) in particular is what keeps me from accepting the paper's framing.
7. Related Work and Positioning
Marigold is not the first attempt to repurpose large generative priors for discriminative vision. [Baranchuk et al. 2022] used SD features for semantic segmentation; [Tang et al. 2023, VPD] used SD for depth and reported competitive results with far less compute; [Zhao et al. 2023, Unleashing Text-to-Image Diffusion] used SD features for multi-task perception. Marigold's delta over VPD is the *full denoising reformulation* rather than feature extraction. Whether this delta is architecturally important is exactly the question the ablations should have answered.
For monocular depth more broadly, the closest alternatives are Depth Anything [Yang et al. 2024] and Depth Pro [Bochkovskii et al. 2024]. Depth Anything trains a DINOv2-initialized DPT on 62M unlabeled images via teacher-student pseudo-labeling and reports NYUv2 AbsRel of roughly 4.3%, already below Marigold, at 1/300 of the inference cost. ZoeDepth [Bhat et al. 2023] and Metric3D [Yin et al. 2023] address the metric-scale ambiguity that Marigold does not attempt. The honest framing of Marigold is this: *an interesting existence proof that diffusion backbones can be repurposed*, not *a new state of the art in monocular depth*.
8. Limitations and Failure Modes the Authors Did Not Emphasize
- Metric scale. Marigold produces affine-invariant depth only. Any downstream use requiring metric depth (robotics, AR scale reasoning) needs a separate alignment step. The paper acknowledges this but does not quantify the alignment error.
- Inference cost. 500 U-Net forwards per image is a deployment non-starter. Recent follow-ups (Latent Consistency Models, one-step distillation) will likely collapse this, but the paper as published is not practically usable.
- Sensitivity to input resolution. The VAE downsamples by a factor of 8. Fine depth structures smaller than 8 pixels cannot be recovered, and thin objects (wires, railings) fail systematically. This is a direct consequence of the latent-space reformulation and is not discussed.
- Dataset bias toward indoor piecewise-planar scenes. Hypersim is 100% indoor; Virtual KITTI 2 is narrow driving scenes. Performance on open-world, cluttered, or natural-environment inputs (forests, oceans, sports) is untested.
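The resolution limitation in the third bullet is easy to demonstrate. The toy below uses 8x average pooling as a crude stand-in for the VAE encoder's spatial compression (the real encoder is learned and nonlinear, so this only illustrates the information bottleneck, not the exact failure): a 1-pixel-wide "wire" survives only as a 1/8-strength response in the latent grid.

```python
import numpy as np

def downsample8(x):
    """8x average pooling: a crude stand-in for the VAE's 8x spatial compression."""
    h, w = x.shape
    return x.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

# a 1-pixel-wide "wire" crossing a 64x64 depth map
scene = np.zeros((64, 64))
scene[:, 31] = 1.0

latent = downsample8(scene)
peak = latent.max()               # the wire's strongest latent response: 1/8
survives = bool((latent > 0.5).any())  # does it clear a modest threshold?
```

One column of ones inside each 8x8 block averages to 8/64 = 0.125, so any structure thinner than the 8-pixel stride is attenuated before the U-Net ever sees it.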
9. Broader Implications
If the generative-prior hypothesis holds, the implications reshape transfer learning for dense prediction: pretrain once at 5B-image scale with a generative objective, then fine-tune cheaply for any dense regression task. This is the paper's pitch, and it is a reasonable research program. If instead the hypothesis fails, the implication is more constrained: diffusion backbones are *compatible* with dense prediction but not *uniquely suited*, and simpler DINO-style contrastive pretraining plus synthetic supervision yields the same result at 1/10 the cost. Given that Depth Anything already demonstrates the latter path works, I lean toward the second reading.
My verdict: Marigold is a valuable existence proof and a well-engineered pipeline, but the claim that the *diffusion prior* is the causal mechanism is not established by the experiments shown. I rate the contribution as moderate engineering + existence proof, weak on scientific causality. The next experiment someone should run is the random-init U-Net control, and I will update my priors when I see it.
10. Key Questions for the Authors
1. What is the zero-shot AbsRel of the identical pipeline with a randomly initialized U-Net, fine-tuned on the same 74k pairs?
2. Does the advantage over DPT persist when both methods are restricted to N = 1 inference?
3. What fraction of the gain on NYUv2 is attributable to VAE smoothing versus the U-Net prior?
4. On transparent and mirror-heavy benchmarks, does Marigold outperform DPT, or does the 2D image-completion prior fail where a geometric prior would succeed?
5. Given Depth Anything's lower AbsRel at lower inference cost, what is the remaining argument for the diffusion-based formulation beyond architectural curiosity?
Reproducibility & Sources
Primary paper: Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R. C., & Schindler, K. (2023). *Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation.* arXiv:2312.02145.
Code repository: Official implementation released by the authors at the Marigold project repository on GitHub (marigold-monodepth).
Datasets used for training: Hypersim [Roberts et al. 2021] (public, Apple ML research release); Virtual KITTI 2 [Cabon et al. 2020] (public, NAVER Labs Europe release).
Datasets used for evaluation: NYUv2 [Silberman et al. 2012], KITTI [Geiger et al. 2012], ETH3D [Schops et al. 2017], ScanNet [Dai et al. 2017], DIODE [Vasiljevic et al. 2019]. All are publicly accessible via the respective benchmark sites.
Reproducibility assessment.
| Axis | Rating (1-5) | Justification |
|---|---|---|
| Code availability | 5 | Official code and pretrained weights released with clear inference scripts. |
| Data availability | 4 | Training and evaluation data all public; Hypersim license restricts commercial use. |
| Experimental detail | 3 | Training hyperparameters are given, but ablations on the generative-prior claim (random-init control, backbone swap) are not reported, so the central scientific claim cannot be independently verified from the paper alone. |
