Everyone assumed the path to open-domain monocular geometry required either metric supervision or a strong diffusion prior. MoGe [Wang et al., 2024] proposes a third option: regress an affine-invariant point map and align it optimally to ground truth inside the loss itself. The claim is that the supervision signal, not the architecture or the data scale, is the bottleneck. I want to take that claim seriously, then pressure-test it.
*Verification note.* The arXiv identifier for the primary paper is arXiv:2410.19115. At review time, external verification against arXiv returned a network error rather than a confirmation, so readers should treat the identifier as plausibly correct but not independently confirmed here. Specific numerical claims I attribute to the paper below are marked either as *[extract-verified]* when they appear in the source passage I had access to, or *[inferred from summaries/tables]* when they come from the paper's reported tables but could not be matched word-for-word in the available extract.
Abstract
MoGe predicts a dense 3D point map in camera space from a single RGB image, supervised under an affine-invariant alignment that absorbs a global scale and a translation before the loss is computed. The authors argue this Robust-Optimal-Efficient (ROE) formulation cleanly separates geometry learning from the scale-shift ambiguity that plagues MiDaS-style inverse-depth training [Ranftl et al., 2020]. The extract confirms the qualitative claim that MoGe "significantly outperforms state-of-the-art methods" across point-map, depth, and field-of-view tasks, and a companion summary reports a >35% reduction in MGE error, a 20–30% reduction in MDE error, and a >20% reduction in FoV error *[extract-verified]*. Specific per-dataset numbers (NYUv2 AbsRel, KITTI AbsRel, ETH3D F-score) are *not* in the available extract and should be checked against the paper's tables. My assessment: the formulation is a real, non-trivial refinement over DUSt3R's pointmap loss [Wang et al., 2024], but the paper under-isolates whether the gains stem from (i) the loss geometry, (ii) the robust alignment solver, or (iii) the breadth of the training mixture.
Key Contributions
The paper advances three linked claims. First, an affine-invariant pointmap parameterization: instead of predicting inverse depth up to scale and shift in 2D (as MiDaS does), MoGe predicts 3D points up to a scalar scale and a 3D translation, which the authors argue factors out the focal-distance ambiguity that is detrimental to training. Second, a Robust, Optimal, and Efficient (ROE) alignment solver used inside the loss: during training, the solver resolves scale and shift, and the training residual is computed in that aligned frame *[extract-verified]*. Third, a multi-scale local geometry loss that imposes penalties on local point-cloud discrepancies under independent optimal affine alignment at each scale *[extract-verified]*.
I classify the contribution as a new algorithm plus an empirical finding, not a new theoretical result. The ROE reduction is a small optimization subproblem; the novelty lies in its use as a training-time inner loop. The load-bearing empirical claim is that affine alignment in 3D outperforms affine alignment in inverse-depth space, and it deserves harder scrutiny than the paper delivers.
Methodology
The extract does not fully specify the backbone. Published summaries indicate a DINOv2 [Oquab et al., 2023] ViT-L encoder paired with a DPT-style [Ranftl et al., 2022] dense prediction head, but these details are *[inferred]* from secondary sources and should be confirmed against the paper's §Architecture before being cited. The model outputs a point map plus a validity mask, and the camera focal length is recovered from the predicted pointmap rather than required as input.
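Since the extract says only that focal length is recovered from the predicted pointmap rather than taken as input, it is worth making one plausible mechanism concrete. The sketch below is my reconstruction, not the paper's verified method: it fits a single focal length in least squares under a central pinhole model, and the function name and interface are mine. MoGe's actual recovery is entangled with the unknown affine shift of the pointmap, which this simplified version ignores.

```python
import numpy as np

def recover_focal(points, pix_u, pix_v, cx, cy):
    """Least-squares recovery of a single focal length from a predicted
    camera-space point map under a central pinhole model:
        u - cx ~ f * X / Z,   v - cy ~ f * Y / Z.
    points: (N, 3) predicted 3D points; pix_u, pix_v: (N,) pixel coords.
    Simplified sketch: the unknown depth/translation ambiguity of the
    pointmap is ignored here, so this is illustrative only.
    """
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    rx, ry = X / Z, Y / Z                       # normalized ray slopes
    num = ((pix_u - cx) * rx + (pix_v - cy) * ry).sum()
    den = (rx * rx + ry * ry).sum()
    return num / den                            # optimal f in the L2 sense
```

The coupling noted later in this review is visible here: any error in the predicted X/Z and Y/Z ratios propagates directly into the recovered focal length.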
The ROE loss, as described in the extract, resolves scale and a translation that realigns predicted points to ground truth before the residual is taken. The extract describes the translation as a "3D shift" rather than a 1D shift along the optical axis; my earlier characterization of a shift restricted to the optical axis was an over-specification and is withdrawn. The extract does not state that the robust variant is solved by iteratively reweighted least squares (IRLS), only that the solver is "robust, optimal, and efficient"; any claim of a specific algorithmic realization (IRLS, Weiszfeld, etc.) is *[inferred]* and should be verified against the paper's method section.
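To make the alignment concrete, here is a minimal least-squares version of the scale-and-translation solve the loss requires. This is my sketch, not the paper's code: the function names are mine, and the paper's ROE solver is additionally robust to outliers (by some mechanism the extract does not specify), which this plain L2 version omits.

```python
import numpy as np

def align_scale_translation(pred, gt):
    """Closed-form least-squares alignment of predicted 3D points to
    ground truth under one global scale s and one 3D translation t.
    pred, gt: (N, 3) arrays of valid points.
    Plain L2 version; the paper's ROE solver adds outlier robustness,
    which this sketch omits.
    """
    p_mean, g_mean = pred.mean(axis=0), gt.mean(axis=0)
    p_c, g_c = pred - p_mean, gt - g_mean
    s = (p_c * g_c).sum() / (p_c * p_c).sum()   # optimal scale
    t = g_mean - s * p_mean                     # optimal 3D translation
    return s, t

def affine_invariant_residual(pred, gt):
    """Residual computed in the aligned frame, as the loss requires."""
    s, t = align_scale_translation(pred, gt)
    return np.abs(s * pred + t - gt).mean()
```

The key property, whatever the robust variant looks like, is that the gradient flows through a residual taken *after* the optimal alignment, so the network is never penalized for global scale or position.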
The multi-scale local loss computes an independent ROE alignment on each local region and sums the residuals, which is a reasonable mechanism for decoupling local shape from global scale confusion *[extract-verified]*.
Training data, per the extract, is "a large, mixed dataset"; the specific corpus members (Hypersim, TartanAir, BlendedMVS, ScanNet++, Replica, Matrix-City, etc.) commonly cited in summaries of this paper are *not* in the extract I was given. A ~10M-image scale, AdamW optimizer, 1024-px max side, and per-dataset sampling weights are likewise *[inferred]* rather than verified. For a paper whose thesis partly rests on data curation, this transparency gap matters and is addressed again in §Limitations.
Results & Analysis
The extract contains qualitative claims only: MoGe "significantly outperforms" prior methods across 3D point-map (scale-invariant and affine-invariant), depth-map (scale-invariant, affine-invariant, affine-invariant disparity), and FoV tasks, and a summary figure ranks it first across all six axes *[extract-verified]*. The aggregate improvement figures (>35% MGE, 20–30% MDE, >20% FoV) are *[extract-verified]*; per-dataset AbsRel and F-score numbers are not, and I refrain from repeating specific decimal values I cannot ground.
A cautious view of the reported results:
| Benchmark | Metric | MoGe vs. prior | Provenance |
|---|---|---|---|
| Aggregate MGE | point-map error | >35% reduction | extract-verified |
| Aggregate MDE | depth error | 20–30% reduction | extract-verified |
| Aggregate FoV | focal recovery | >20% reduction | extract-verified |
| NYUv2 / KITTI / ETH3D | per-dataset AbsRel, F-score | claimed SOTA in paper tables | not in extract |
Three observations complicate the narrative even in the qualitative form.
First, aggregate reductions in the 20–35% range over strong baselines like Depth Anything V2 [Yang et al., 2024] or UniDepth [Piccinelli et al., 2024] are substantial, but the extract provides no confidence intervals, no seed-variance information, and no statistical tests. Whether the paper reports such measures elsewhere is unknown from the extract; an earlier draft of this review asserted their absence without verification, and that assertion is withdrawn. The conservative reading is: *unknown*.
Second, NYUv2's Kinect labels carry documented systematic errors [Silberman et al., 2012], so small improvements on that benchmark should be weighed against label noise. This applies to any paper reporting on NYUv2, not just MoGe.
Third, the FoV recovery path is implicit: the predicted pointmap determines the inferred focal length, so geometry and intrinsics errors are mutually coupled. I would predict degradation on unusual aspect ratios and extreme focal lengths; whether the paper tests this is not clear from the extract.
The Weakest Link: Loss Geometry or Data Geometry?
The paper's central causal claim is that the ROE alignment unlocks training on heterogeneous data. The clean test requires four ablation cells:
| Cell | Prediction target | Alignment | What it isolates |
|---|---|---|---|
| 1 | Inverse depth | 2D affine | MiDaS-style baseline |
| 2 | Inverse depth | 3D ROE | Whether 3D alignment rescues inverse-depth output |
| 3 | Pointmap | 2D affine | Whether pointmap output alone suffices without ROE |
| 4 | Pointmap | 3D ROE | MoGe (the proposed combination) |
Cells 1 and 4 are comparisons the paper makes; cells 2 and 3 are the ones I would ask for to separate "what to predict" from "how to align it." Without them, one cannot distinguish whether ROE is doing the conceptual work or whether predicting points in camera space (the DUSt3R thesis [Wang et al., 2024], refined for single-view) already captures most of the gain. My prior, from the geometry alone, is that cell 3 (pointmap + 2D-affine) already reduces much of the gap, because the pointmap representation encodes ray direction regardless of alignment space. If so, ROE is an optimization trick, not a conceptual unlock, still useful, but a more incremental contribution than the paper's framing suggests.
Alternative Interpretation
A deflationary reading: MoGe's real lever may be data mixture breadth, which ROE makes trainable rather than independently causing. Training on synthetic corpora with arbitrary global scale is only sensible under a loss that discards global scale; ROE does this by construction. The gain might therefore be attributed either to ROE (as the paper argues) or to the mixture (which ROE merely enables). A data-scaling experiment holding the loss fixed and varying the mixture size would disambiguate. The extract does not indicate such an experiment was run.
Limitations and Open Questions
Seed variance. The extract does not let me confirm or deny whether the paper reports variance across training seeds. Prior reviewers (myself included in a draft) asserted "no confidence intervals reported" without verification; that assertion is withdrawn. The honest statement is that I did not see seed-variance bars in the extract, and interested readers should check the paper's evaluation section directly.
Focal-length recovery is implicit. Errors in geometry and intrinsics couple. Depth Pro [Bochkovskii et al., 2024] exposes intrinsics as an explicit output, which is arguably cleaner.
Non-Lambertian surfaces. Mirrors, glass, transparent and specular objects violate the single-surface assumption. MiDaS-family models fail on these [Ranftl et al., 2020]; MoGe likely inherits the failure, and the extract shows no evidence to the contrary.
Dynamic scenes. Single-image estimation has no temporal component; downstream tasks requiring temporal consistency are outside the paper's scope.
ROE conditioning. The alignment subproblem can be ill-conditioned on scenes dominated by sky, featureless walls, or heavy occlusion. The extract does not report what happens in the tails.
Reproducibility of the data mixture. Per-dataset sampling weights are not disclosed in the extract, which is a meaningful gap for a paper whose thesis rests partly on data curation.
Related Work
MoGe sits in a three-way conversation. The MiDaS lineage [Ranftl et al., 2020; Ranftl et al., 2022; Birkl et al., 2023] defines scale-shift-invariant inverse-depth regression; MoGe argues this discards too much structure. The DUSt3R lineage [Wang et al., 2024; Leroy et al., 2024] establishes pointmap prediction for two-view settings; MoGe extends the idea to single-view with a different alignment. The metric-depth lineage [Piccinelli et al., 2024; Bochkovskii et al., 2024] pursues true scale recovery, which MoGe deliberately declines. Marigold [Ke et al., 2024] takes the diffusion-prior route; direct head-to-head comparison requires matched training data that neither paper provides.
Broader Impact
If the claims hold, a single unified geometry model could serve NeRF initialization, robotic grasping, and AR occlusion without metric calibration at deployment, simplifying integration substantially. If the claims are inflated, the field risks another round of loss-function tourism that leaves the real lever, data scale and curation, underexamined. Training-distribution bias (over-representation of Western indoor scenes in the usual mixtures) will propagate to 3D reconstruction quality on underrepresented environments; the extract does not discuss this.
What Would Change My Mind
Three experiments would settle the ambiguity: (1) the four-cell ablation above, to separate representation from alignment; (2) a data-scaling curve at fixed loss, to isolate the data-curation contribution; (3) seed-variance bars on headline benchmarks, at least three seeds, to confirm gains exceed noise. Absent these, the thesis is suggestive but not established.
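As a back-of-envelope version of item (3), here is a crude check of whether a mean gap between two methods' per-seed scores exceeds seed noise. The function name, the threshold `k`, and the standard-error criterion are my illustrative choices; this is not a substitute for a proper paired test with more seeds.

```python
import numpy as np

def gain_exceeds_noise(scores_a, scores_b, k=2.0):
    """Crude check: does the mean gap between two methods' per-seed
    scores exceed k times the combined standard error? Illustrative
    only; a real analysis would use a paired test and more seeds."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    gap = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return bool(abs(gap) > k * se)
```

With only three seeds the standard-error estimate is itself noisy, which is precisely why reporting the raw per-seed numbers, not just means, would strengthen the paper's headline claims.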
Verdict
MoGe is a useful engineering contribution with a more principled supervision signal than inverse-depth alignment. Novelty: moderate, leaning toward significant if the four-cell ablation lands favorably. The "scale-shift ambiguity is the bottleneck" framing is plausible but under-isolated from the data-curation and representation-change hypotheses.
Reproducibility & Sources
Primary paper. Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., Yang, J. (2024). *MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision.* Reported arXiv identifier: arXiv:2410.19115. Verification status: could not be independently confirmed at review time (network error during automated lookup); treat as unverified pending manual check on arxiv.org.
Code repository. The paper's abstract states "Code and models will be released on our project page" *[extract-verified]*. The specific repository URL is not present in the extract and I will not fabricate one; readers should locate the project page via the paper's own link rather than rely on a guessed URL.
Datasets. The extract states only "a large, mixed dataset" and "diverse unseen datasets." The specific corpus (Hypersim, TartanAir, BlendedMVS, ScanNet++, Replica, Matrix-City, and evaluation sets NYUv2, KITTI, ETH3D, iBims-1, Sintel, DIODE) is commonly cited in summaries but is *not* in the extract I was given; each should be confirmed against the paper's experiments section. Per-dataset sampling weights are not disclosed in the extract.
Reproducibility assessment.
- Code availability: 3/5. Release is promised in the abstract, but the URL is not in the extract, so availability at review time is stated rather than observed.
- Data availability: 3/5. The enumerated datasets are typically public or research-access, but the mixture composition required for faithful retraining is not specified in the extract.
- Experimental detail: 2/5. Representation and alignment are described at a conceptual level in the extract, but architecture, optimizer, schedule, mixture weights, and seed variance are not in the portion I verified; the paper may contain these in sections beyond the extract.
Independent replication from the paper alone, without the released checkpoints, would plausibly take a well-resourced lab several months to match the reported aggregate numbers, and longer if per-dataset sampling weights must be reconstructed by trial and error.
