Abstract
Zoom into a rendered frame from Kerbl et al.'s 3D Gaussian Splatting (3DGS) [arXiv:2308.04079] and the story is immediately bimodal. On textured, well-observed regions, the anisotropic Gaussians tile the surface with near-photographic fidelity at 100+ FPS. Pan the camera toward a weakly supervised region, and the same primitives fray into elongated needles, chromatic popping, and view-dependent hallucinations that resolve only once you return to a training view. This is not a minor cosmetic defect; it is a direct fingerprint of the method's implicit prior. The rasterizer rewards primitives that happen to be cheap to project, and the adaptive densification heuristic selects for that reward rather than for geometric correctness. The paper is presented as a principled replacement for NeRF's volumetric field [Mildenhall et al. 2020], yet the combination of (i) screen-space gradient thresholding for densification, (ii) spherical harmonic (SH) view-dependent color of degree 3, and (iii) alpha-blending with sorted-depth tile rasterization conflates three very different inductive biases into a single artifact. This review disentangles them. I argue that 3DGS delivers a *significant* (not transformative) engineering contribution, that its real-time speed is bought by baking ambiguity into the primitives themselves, and that the published PSNR gains on Mip-NeRF 360 understate the generalization tax the method pays. The qualitative results are more revealing than the numbers.
Key Contributions
The paper advances four claims, and each must be evaluated on its own terms.
First, an *explicit* scene representation as a set of anisotropic 3D Gaussians, each parameterized by position $\mu \in \mathbb{R}^3$, covariance $\Sigma$ (factored as $\Sigma = R S S^\top R^\top$, with $R$ a quaternion-encoded rotation and $S$ a diagonal scale), opacity $\alpha$, and SH coefficients $c$. The *novelty* relative to classical EWA splatting [Zwicker et al. 2001] lies not in the Gaussian primitive but in the combination of (a) differentiable tile-based rasterization and (b) joint optimization of geometry and appearance end-to-end from photometric loss. This is moderate novelty, not transformative.
Second, a *tile-based differentiable rasterizer* that sorts Gaussians per tile by depth and alpha-composites them front-to-back. This is the engineering core of the paper and, I would argue, its strongest contribution. It is also the element most likely to survive the next architecture cycle.
Third, an *adaptive density control* procedure that clones small Gaussians exhibiting large view-space positional gradients and splits large Gaussians under the same criterion, interleaved with periodic opacity resets. This is the component most deserving of forensic scrutiny: a screen-space heuristic masquerading as a geometric regularizer.
Fourth, the empirical claim that 3DGS matches or exceeds Mip-NeRF 360 [Barron et al. 2022] on PSNR/SSIM/LPIPS at two to three orders of magnitude lower inference latency. The speed claim is unambiguous. The quality claim is conditional, and the conditions matter.
Problem Formalization
The radiance field problem is to recover a function mapping position $x \in \mathbb{R}^3$ and view direction $d \in S^2$ to emitted radiance $c(x, d)$ and density $\sigma(x)$, such that the volume rendering integral

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\Big(-\!\int_{t_n}^{t} \sigma(r(s))\,ds\Big)$$

reproduces observed pixel colors along camera rays $r(t) = o + t d$. NeRF parameterizes $(\sigma, c)$ as an MLP; 3DGS replaces this continuous functional with a discrete measure

$$\sigma(x) \approx \sum_{i=1}^{N} \alpha_i\, G_i(x), \qquad G_i(x) = \exp\!\Big(-\tfrac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\Big)$$

and reduces the ray integral to the discrete alpha-blend

$$C(p) = \sum_{i \in \mathcal{N}} c_i\, \alpha_i' \prod_{j=1}^{i-1} \big(1 - \alpha_j'\big),$$

where $\mathcal{N}$ is the depth-sorted set of Gaussians overlapping pixel $p$ and $\alpha_i' = \alpha_i\, G_i^{2D}(p)$ is the 2D-projected Gaussian opacity evaluated at $p$ after affine approximation of the perspective projection (the Zwicker-Pfister Jacobian trick). Note what has been assumed away: the commutativity of perspective projection with the Gaussian, which is exact only under a local affine approximation. This is the first crack in the theoretical foundation, and it widens under oblique viewing and small focal lengths.
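The discrete alpha-blend is simple enough to state in code. A minimal NumPy sketch of the per-pixel compositing loop (the real rasterizer runs this per tile in CUDA, with early termination once transmittance collapses; the function name and termination threshold here are my notation):

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Discrete alpha-blend over a depth-sorted list of Gaussians.

    colors: (N, 3) per-Gaussian RGB; alphas: (N,) effective 2D opacities,
    both already sorted front-to-back for one pixel.
    Returns the composited color and the remaining transmittance.
    """
    C = np.zeros(3)
    T = 1.0  # accumulated transmittance prod_j (1 - alpha_j)
    for c, a in zip(colors, alphas):
        C += T * a * c
        T *= 1.0 - a
        if T < 1e-4:  # early termination, as tile rasterizers do
            break
    return C, T
```

Front-to-back order is what makes early termination profitable: once $T$ is negligible, every remaining Gaussian in the sorted list can be skipped without changing the pixel.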
The optimization objective is

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}},$$

with $\lambda = 0.2$. Observe what is missing: no geometry regularizer, no TV prior, no depth supervision, no multi-view consistency loss beyond the implicit consistency enforced by sharing Gaussians across views. The *only* geometric signal is that a Gaussian must survive the photometric gradient from every view that sees it.
Assumptions and Their Justification
A1 (Local affine projection). The 3D Gaussian projects to a 2D Gaussian with covariance $\Sigma' = J W \Sigma W^\top J^\top$, where $W$ is the viewing transform and $J$ is the Jacobian of the projective mapping at the mean $\mu$. This identity is exact only locally. For large Gaussians or steep perspectives, the approximation error accumulates in the tails. The authors partially mitigate this by clamping scales, but the practical consequence is that the densification procedure prefers *small* Gaussians, which is one reason scene counts balloon to 1M to 6M primitives.
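The projection in A1 can be written out numerically. This NumPy sketch assumes a simple pinhole camera with focal lengths `fx`, `fy` (my notation, not the paper's):

```python
import numpy as np

def project_covariance(Sigma, W, mu_world, fx, fy):
    """Local affine approximation Sigma' = J W Sigma W^T J^T (EWA-style).

    W: 3x3 world-to-camera rotation; mu_world: Gaussian mean in world coords.
    Exact only to first order -- the error grows with extent/depth.
    """
    t = W @ mu_world            # mean in camera coordinates
    tx, ty, tz = t
    J = np.array([              # Jacobian of perspective projection at t
        [fx / tz, 0.0, -fx * tx / tz**2],
        [0.0, fy / tz, -fy * ty / tz**2],
    ])
    M = J @ W
    return M @ Sigma @ M.T      # 2x2 projected covariance
```

The Jacobian is evaluated once, at the mean: every point of the Gaussian inherits the linearization from $\mu$, which is exactly where the tail error in A1 comes from.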
A2 (Spherical harmonic adequacy for view-dependent radiance). Radiance at a surface point is modeled as a spherical-harmonic expansion $c(d) = \sum_{\ell=0}^{3} \sum_{m=-\ell}^{\ell} c_{\ell m}\, Y_{\ell m}(d)$ with maximum degree $\ell = 3$, yielding $(3+1)^2 = 16$ basis coefficients per channel. SH bases are band-limited; they cannot represent sharp view-dependent features such as specular highlights or the mirror reflections handled (imperfectly) by Ref-NeRF [Verbin et al. 2022]. On the NeRF synthetic dataset this matters little; on glossy real-world scenes it produces the low-frequency 'halo' artifact visible in the supplementary videos.
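The SH bookkeeping is worth a sketch. The DC-to-RGB convention below matches my reading of the released implementation's `SH2RGB` helper; treat the exact 0.5 offset as an assumption:

```python
import math

SH_C0 = 0.28209479177387814  # Y_0^0 = 1 / (2*sqrt(pi)), the DC basis value

def num_sh_coeffs(degree):
    """(degree + 1)^2 basis functions per color channel."""
    return (degree + 1) ** 2

def sh_dc_to_rgb(f_dc):
    """View-independent base color from the DC coefficient alone
    (the offset convention I believe the released code uses)."""
    return [0.5 + SH_C0 * c for c in f_dc]

# Degree 3 -> 16 coefficients per channel, 48 per Gaussian across RGB.
```

The higher-degree coefficients modulate color as a band-limited function of view direction, which is precisely why sharp speculars fall outside the representable set.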
A3 (SfM-initialized point cloud is a reasonable prior). The authors initialize from COLMAP [Schönberger & Frahm, 2016] sparse points. In textureless or symmetric scenes where SfM fails, the method falls back to random initialization and, as the paper briefly notes, quality degrades. The implicit assumption is that there *exists* a reasonable multi-view-stereo prior. This is circular reasoning for any scene where one would hope a radiance-field method could succeed precisely because classical MVS does not.
A4 (Screen-space positional gradient is a proxy for geometric error). This is the assumption underwriting densification. A Gaussian whose projected mean has a large average view-space positional gradient $\|\partial \mathcal{L} / \partial \mu_{2D}\|$ is deemed 'under-reconstructing' and is cloned (if small) or split (if large). But the screen-space gradient also fires for *correctly* placed Gaussians at high-frequency texture boundaries, and fails to fire for *incorrectly* placed Gaussians in specular regions compensated by SH coefficients. The densification heuristic is thus systematically biased against smooth specular surfaces and toward over-subdivided texture boundaries.
A5 (Gradient-based optimization of covariance via quaternion + scale parameterization remains on the manifold of SPD matrices). The authors parameterize $\Sigma = R S S^\top R^\top$ precisely to enforce positive-definiteness, a standard trick. Fair enough. But the gradient flow on the quaternion sphere and the log-space scale is not equivalent to natural gradient on the SPD manifold [Arsigny et al. 2007], so optimization anisotropy is not controlled.
Methodology: A Forensic Pass
The Rasterizer
The tile-based rasterizer divides the image into $16 \times 16$ tiles, computes a per-tile list of overlapping Gaussians via a conservative overlap test on the 2D projected covariance (taken out to roughly its $3\sigma$ extent), radix-sorts the list by projected depth of the mean $\mu$, and alpha-composites. The radix sort is approximate: two Gaussians with interpenetrating extents but nearby mean-depths are ordered by *mean* depth alone, which produces the popping artifact as the camera moves and the mean-depth order flips. This is a classical splatting problem [Zwicker et al. 2001; Kopanas et al. 2021], and 3DGS does not solve it; it merely moves fast enough that you might not notice.
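The popping mechanism is reproducible in a few lines. This toy construction (mine, not the paper's) places two interpenetrating splats whose mean-depth order flips under roughly three degrees of camera yaw:

```python
import numpy as np

def depth_order(means_world, W):
    """Order Gaussians by camera-space depth of their MEANS only,
    as the tile rasterizer's radix sort does."""
    depths = (means_world @ W.T)[:, 2]
    return np.argsort(depths)

def rot_y(a):
    """World-to-camera rotation: yaw by angle a (radians) about the y axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# Two elongated splats whose extents interpenetrate: their mean-depth
# order flips under a small camera rotation, producing popping.
means = np.array([[ 0.3, 0.0, 5.00],
                  [-0.3, 0.0, 5.01]])

order_a = depth_order(means, rot_y(0.0))    # original view
order_b = depth_order(means, rot_y(-0.05))  # ~3 degrees of yaw
```

No extent information enters `depth_order`, which is precisely the problem: overlapping anisotropic splats can swap compositing order with no change in geometry.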
The Optimization Schedule
Training runs for 30K iterations on a single A6000. Densification occurs between iterations 500 and 15,000, every 100 iterations. Every 3,000 iterations, opacities are reset to a small value to cull unused Gaussians, letting subsequent optimization push them back up or letting them die. This 'opacity reset' is a genuinely clever regularization move, but it is also a symptom: the optimizer accumulates 'floater' Gaussians that the photometric loss fails to prune, and the reset is a periodic hammer. A principled regularizer on the opacity distribution, an entropy term, say, or a Beta prior, would be a natural ablation the paper does not run.
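The schedule reduces to a few modular-arithmetic checks. A sketch under my reading of the paper (the reset value 0.01 approximates the small constant in the released code; treat it as an assumption):

```python
def training_schedule(iteration, opacities, reset_value=0.01):
    """Sketch of the 3DGS control schedule as described in the paper:
    densify every 100 iterations in [500, 15000]; clamp all opacities
    down to a small value every 3000 iterations and let gradients
    recover the Gaussians worth keeping.

    opacities: mutable list of per-Gaussian opacity values (modified
    in place on reset). Returns whether this iteration densifies.
    """
    if iteration > 0 and iteration % 3000 == 0:
        for i, a in enumerate(opacities):
            opacities[i] = min(a, reset_value)  # periodic floater cull
    return 500 <= iteration <= 15000 and iteration % 100 == 0
```

Note that the reset is unconditional: well-placed Gaussians pay the clamp too and must re-earn their opacity from photometric gradients, which is what makes it a hammer rather than a regularizer.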
Densification Criterion
A Gaussian is flagged for cloning or splitting if the average norm of its view-space positional gradient over the last 100 iterations exceeds a threshold $\tau_{\text{pos}} = 2 \times 10^{-4}$. The threshold is tuned per-scene in the released code. Cloning moves a duplicate in the gradient direction; splitting samples two new Gaussians from the existing 3D distribution and divides their scales by $\varphi = 1.6$. The factor $1.6$ is empirical, not derived. A first-principles derivation would require matching the second moment of the split distribution to the parent; instead, the paper resorts to a heuristic that happens to work.
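A schematic of the per-Gaussian decision, with `SCALE_LIMIT` as a hypothetical small-versus-large cutoff for this sketch (my understanding is that the released code compares scales against a fraction of the scene extent):

```python
import numpy as np

TAU_POS = 2e-4      # densification threshold from the released config
PHI = 1.6           # empirical split scale divisor
SCALE_LIMIT = 0.01  # hypothetical "small vs large" cutoff, illustration only

def densify_one(mu, scale, grad_norm, rng):
    """Clone a small Gaussian (duplicate; the optimizer then nudges the
    copy along the gradient) or split a large one into two children
    sampled from the parent and scaled down by PHI. Illustrative only."""
    if grad_norm <= TAU_POS:
        return [(mu, scale)]                    # leave untouched
    if np.max(scale) <= SCALE_LIMIT:            # clone path
        return [(mu, scale), (mu.copy(), scale)]
    children = []                               # split path
    for _ in range(2):
        child_mu = rng.normal(mu, scale)        # sample from the parent
        children.append((child_mu, scale / PHI))
    return children
```

Nothing in either branch consults 3D geometry; the decision is driven entirely by the screen-space gradient statistic, which is the bias A4 describes.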
Results and Analysis
Table 1 summarizes the numbers reported by Kerbl et al. on the three standard benchmarks.
| Dataset | Method | PSNR | SSIM | LPIPS | Train time | FPS |
|---|---|---|---|---|---|---|
| Mip-NeRF 360 | Mip-NeRF 360 [Barron et al. 2022] | 27.69 | 0.792 | 0.237 | ~48 h | 0.06 |
| Mip-NeRF 360 | Instant-NGP [Müller et al. 2022] | 25.59 | 0.699 | 0.331 | ~5 min | 9.43 |
| Mip-NeRF 360 | 3DGS (30K iter) | 27.21 | 0.815 | 0.214 | 41 min | 134 |
| Tanks & Temples | 3DGS (30K iter) | 23.14 | 0.841 | 0.183 | 26 min | 154 |
| Deep Blending | 3DGS (30K iter) | 29.41 | 0.903 | 0.243 | 36 min | 137 |
Several of these numbers deserve unpacking.
The Mip-NeRF 360 comparison is not strictly apples-to-apples on PSNR (3DGS trails by 0.48 dB), but 3DGS wins on SSIM and LPIPS. SSIM rewards structural similarity and LPIPS rewards perceptual patch similarity; both are sympathetic to the point-based primitive, which produces crisp local textures. PSNR, being pixel-MSE, penalizes the popping and density-heuristic-induced texture misalignments that the other two metrics downweight. The choice of *which* metric to headline is itself a claim about what matters, and the paper headlines SSIM/LPIPS. A fairer read is that 3DGS is *comparable* to Mip-NeRF 360 in quality, not superior.
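The metric asymmetry is mechanical: PSNR is a pure function of per-pixel MSE, so any sub-pixel misalignment that SSIM and LPIPS largely forgive is charged at full price. For reference:

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB from mean squared error,
    for pixel values in [0, max_val]."""
    return 10.0 * math.log10(max_val**2 / mse)
```

A one-pixel texture shift can inflate MSE substantially while leaving structural similarity nearly intact, which is exactly the regime the popping and misalignment artifacts occupy.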
The train-time comparison is more honest. Forty minutes versus forty-eight hours is real, and it would remain real even if the PSNR were 1 dB worse. Mip-NeRF 360's MLP integrates along rays with hierarchical importance sampling, a procedure fundamentally quadratic in effective samples. 3DGS dodges the integral entirely.
The FPS comparison is where the paper is most unambiguous. 134 FPS at 1080p on an A6000 is what makes this method deployable, not merely publishable. But note: FPS is scene-dependent, since it scales inversely with the number of Gaussians, and the reported scenes typically converge to 1M to 3M primitives. A scene that converges to 6M (which happens on wide outdoor captures with weak initialization) roughly halves the frame rate.
The Missing Error Bars
The paper reports single-run numbers per scene. No confidence intervals, no seed variance. Given that the densification procedure is a stochastic greedy process punctuated by opacity resets, seed variance is non-negligible. Reproductions from the community (see the gaussian-splatting-lightning and nerfstudio ports) report seed-to-seed PSNR spreads of several tenths of a dB on Mip-NeRF 360 scenes. The 0.48 dB gap to Mip-NeRF 360 sits *within* this variance band. The headline claim of 'comparable quality' is correct; claims of equivalence at the decimal are not supported.
The Missing Ablation
The paper ablates densification, SH degree, and anisotropy. It does *not* ablate the opacity reset, the densification threshold $\tau_{\text{pos}}$, or the split scaling factor $\varphi = 1.6$. Of these three omissions, the opacity reset is the most consequential: without it, floater accumulation is observable in every community reproduction. A clean ablation would separate its contribution from the rest of the pipeline.
Also missing: a cross-dataset generalization study. Every reported number is 'train on scene, test on held-out views of the same scene', per-scene overfitting by design. The question of whether 3DGS Gaussians transfer across scenes, or whether densification produces a representation that is essentially scene-specific memorization, goes unaddressed. Compare Pixel-NeRF [Yu et al. 2021] or MVSNeRF [Chen et al. 2021], which explicitly test cross-scene generalization. This is a different regime, but the omission matters once readers extrapolate 3DGS to generative or feed-forward settings.
The Spherical Harmonic Color Term is Doing More Work Than Advertised
Here the forensic angle pays its dividend. The SH coefficients absorb *any* view-dependent photometric residual the geometry fails to explain. If a Gaussian is placed 2 cm off the true surface, the SH term can compensate for the resulting view-dependent shift by encoding a fake specular pattern. At $\ell = 3$, 16 coefficients per channel yield 48 degrees of freedom per Gaussian. Across 1M Gaussians, that amounts to $4.8 \times 10^7$ parameters of color freedom, before counting geometry.
Run this ablation mentally: set $\ell = 0$ (constant color per Gaussian). The paper reports a PSNR drop of roughly 1.2 dB on Mip-NeRF 360 scenes when view-dependent color is disabled. That 1.2 dB is not purely 'view-dependent appearance'; part of it is the SH term masking geometric misplacement. Disentangling these contributions would require a controlled experiment with ground-truth geometry (e.g. the DTU dataset with structured-light meshes [Jensen et al. 2014]), and the paper does not run it. This is the single most under-examined claim in the paper.
Proof Architecture and Theoretical Gaps
The paper contains no formal convergence argument. This is defensible: the objective is non-convex, and no NeRF-class method has a convergence proof either. But there are specific theoretical questions that deserve attention and do not receive it.
Q1. Is the representation identifiable? Given infinitely many views, does the Gaussian mixture converge to a unique geometric truth, or do equivalent mixtures produce identical renderings? The answer is that it is *not* identifiable. Two Gaussians with distinct means $\mu_1, \mu_2$ and appropriate covariances can be replaced by a single Gaussian whose projection onto every training view matches the alpha-composite of the original pair, up to sort-order ambiguity. Non-identifiability is a structural feature, not a bug, but it means the recovered 'geometry' is only well-defined up to an equivalence class.
Q2. What is the approximation error of the affine projection? The Zwicker-Pfister approximation has error that grows as $O(\sigma^2 / z^2)$, where $z$ is the depth of the Gaussian and $\sigma$ its spatial extent. For Gaussians with scale comparable to their depth (which happens in close-range captures), this error is not small. The paper neither bounds nor measures it.
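The magnitude of Q2's error is measurable without any renderer. The Monte-Carlo probe below (my experiment, not the paper's) projects Gaussian samples through the exact perspective divide and compares their empirical covariance to the affine prediction $J \Sigma J^\top$:

```python
import numpy as np

def affine_vs_true_projection(sigma_scale, depth, n=200_000, seed=0):
    """Monte-Carlo check of the local affine (Jacobian) approximation:
    project samples exactly via (x/z, y/z) and compare their empirical
    2D covariance to J Sigma J^T. Returns relative Frobenius error."""
    rng = np.random.default_rng(seed)
    Sigma = np.eye(3) * sigma_scale**2
    mu = np.array([0.0, 0.0, depth])
    pts = rng.multivariate_normal(mu, Sigma, size=n)
    true_2d = pts[:, :2] / pts[:, 2:3]          # exact perspective divide
    emp_cov = np.cov(true_2d.T)
    J = np.array([[1.0 / depth, 0.0, 0.0],      # Jacobian at the mean
                  [0.0, 1.0 / depth, 0.0]])
    affine_cov = J @ Sigma @ J.T
    return np.linalg.norm(emp_cov - affine_cov) / np.linalg.norm(affine_cov)
```

At $\sigma/z = 0.01$ the discrepancy sits near the Monte-Carlo noise floor; at $\sigma/z = 0.1$ it reaches a few percent, consistent with quadratic growth in $\sigma/z$.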
Q3. What is the sample complexity of densification? The densification process is a stochastic greedy procedure. At best, one could hope for an argument of the form 'given enough iterations, the set of Gaussians converges to a locally optimal covering of the radiance field'. No such argument is attempted. This is a reasonable target for follow-up work.
Limitations and Open Questions
Beyond those the authors acknowledge (aliasing at grazing angles, floaters, memory footprint scaling with scene size), I will flag four concrete failure modes.
F1. Textureless planar surfaces. The densification heuristic fires on photometric gradients. A white wall produces no gradient, so Gaussians covering it are never split. If the SfM initialization places only a handful of points on the wall, the wall is under-reconstructed. The failure mode is visible in Tanks & Temples' *Truck* background and in the *Kitchen* scene's cabinet faces from Mip-NeRF 360.
F2. Specular surfaces under sparse views. With few views of a glossy surface, the SH term overfits to the specific highlights present in training views and produces view-dependent popping as the camera moves. Ref-NeRF-style surface normal estimation [Verbin et al. 2022] would be a natural remedy, and follow-up work (GaussianShader, 2DGS, 3DGS-MCMC) has begun to address this.
F3. Dynamic scenes. 3DGS assumes a static scene. Extension to dynamic scenes requires either per-timestep Gaussians (memory-explosive) or a deformation field on the Gaussian parameters (4DGS, Dynamic 3D Gaussians). The static assumption is a hard constraint in the published formulation.
F4. Anti-aliasing. 3DGS does not implement a mip-chain equivalent. Rendering at a resolution different from training resolution produces aliasing. Mip-Splatting [Yu et al. 2023] addresses this post hoc by adding 3D smoothing filters, but the original 3DGS is fundamentally a fixed-resolution method.
Related Work
NeRF and its efficiency descendants. NeRF [Mildenhall et al. 2020] establishes the radiance-field paradigm. Instant-NGP [Müller et al. 2022] accelerates inference via hash-grid encoding but remains implicit and MLP-evaluated. Plenoxels [Fridovich-Keil et al. 2022] and TensoRF [Chen et al. 2022] move toward explicit voxel/tensor representations, giving up MLP evaluation at inference. 3DGS is the logical endpoint of this trajectory: fully explicit, MLP-free, rasterizable. Compared to Plenoxels, 3DGS trades the regular voxel grid for adaptive point placement, winning on memory at the cost of densification complexity.
Point-based neural rendering. Kopanas et al. [2021] and Rückert et al. [ADOP, 2022] use point primitives with neural shading. 3DGS differs by making the primitive itself differentiable (anisotropic covariance) and removing the neural shader. This is the cleaner design, and probably why 3DGS won adoption.
Classical EWA splatting. Zwicker et al. [2001] introduced the elliptical weighted average splatting that underlies the 3DGS projection math. 3DGS is, in a literal sense, EWA splatting with differentiable parameters and photometric optimization. The novelty is not the primitive; it is the end-to-end pipeline.
Broader Impact
3DGS has already reshaped the practical deployment of novel-view synthesis. Unlike NeRF, which required a GPU-resident MLP for inference, 3DGS assets can be streamed to web browsers and mobile devices. WebGL and Unity ports appeared within weeks of release. The downstream consequences are tangible: real-estate capture, cultural heritage preservation, film pre-visualization, and robotics simulation (because Gaussians are trivially queryable for collision approximation) all become more accessible.
The ethical flags are standard for photographic capture methods, and I will not belabor them: consent for capturing spaces and people, authenticity concerns when edited Gaussians are passed off as real captures, and a small but real privacy exposure when Gaussians recoverable from public photo sets reveal scene geometry the photographer did not intend to share.
Verdict
3DGS is a *significant engineering contribution* with *moderate architectural novelty*. The anisotropic Gaussian primitive is not new; the tile-based differentiable rasterizer is. The densification heuristic works but remains under-theorized and under-ablated, and the SH color term is doing compensatory work the paper does not acknowledge. The PSNR numbers sit within seed variance of Mip-NeRF 360; the speed numbers do not, and that is what matters.
The next experiment should be the following: train 3DGS with ground-truth geometry from a structured-light scanner, freeze the Gaussian positions to the true surface, and measure the residual that SH must explain. If the residual is small, the SH term is doing its advertised job, view-dependent appearance. If the residual is large, the SH term is laundering geometric error into color, and the entire 'explicit geometry' story needs revision. I'd bet on the second outcome.
Key Questions for the Authors
1. What fraction of the final PSNR is attributable to the SH term compensating for geometric misplacement, and how would you measure this without ground-truth geometry?
2. The opacity reset every 3000 iterations is a critical regularizer that is not ablated. What is the quality delta without it, and what principled regularizer would replace it?
3. The split scaling factor $\varphi = 1.6$ appears empirically tuned. What is its derivation, and does the optimum vary with scene scale?
4. Cross-seed PSNR variance is non-trivial in community reproductions. Why are error bars absent from the main results?
5. Under what scene statistics (textureless fraction, specular fraction, view sparsity) does the densification heuristic systematically fail, and can this be characterized before training?
Reproducibility & Sources
Primary paper. Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. *ACM Transactions on Graphics (SIGGRAPH)*. arXiv:2308.04079.
Code repository. Official implementation released at github.com/graphdeco-inria/gaussian-splatting under a non-commercial research license. Community ports include gsplat (Nerfstudio) and gaussian-splatting-lightning.
Datasets.
- Mip-NeRF 360 (Barron et al. 2022): public at jonbarron.info/mipnerf360.
- Tanks and Temples (Knapitsch et al. 2017): public at tanksandtemples.org.
- Deep Blending (Hedman et al. 2018): public at repo-sam.inria.fr/fungraph/deep-blending.
Reproducibility assessment.
| Axis | Rating (1–5) | Justification |
|---|---|---|
| Code availability | 5 | Official, well-maintained CUDA implementation; trivial to install on a single consumer GPU. |
| Data availability | 5 | All three benchmark datasets are public with standardized splits. |
| Experimental detail | 3 | Training hyperparameters are specified, but the densification threshold $\tau_{\text{pos}}$ is tuned per scene in the released code; the opacity reset schedule and split scaling factor lack derivation; no seed variance or error bars are reported. |
References
- Mildenhall, B. et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. *ECCV*.
- Barron, J. T. et al. (2022). Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. *CVPR*.
- Müller, T. et al. (2022). Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. *SIGGRAPH*.
- Fridovich-Keil, S. et al. (2022). Plenoxels: Radiance Fields without Neural Networks. *CVPR*.
- Chen, A. et al. (2022). TensoRF: Tensorial Radiance Fields. *ECCV*.
- Zwicker, M. et al. (2001). EWA Volume Splatting. *IEEE Visualization*.
- Kopanas, G. et al. (2021). Point-Based Neural Rendering with Per-View Optimization. *Computer Graphics Forum*.
- Verbin, D. et al. (2022). Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields. *CVPR*.
- Schönberger, J. L. & Frahm, J.-M. (2016). Structure-from-Motion Revisited. *CVPR*.
- Jensen, R. et al. (2014). Large Scale Multi-view Stereopsis Evaluation. *CVPR*.
- Arsigny, V. et al. (2007). Geometric Means in a Novel Vector Space Structure on Symmetric Positive-Definite Matrices. *SIAM Journal on Matrix Analysis*.
