Abstract

Everyone assumed the bottleneck in multimodal LLMs was the language model. Tong et al. (arXiv:2406.16860) argue the opposite, and they marshal more than twenty vision encoders, a new benchmark (CV-Bench), and a connector design (the Spatial Vision Aggregator, SVA) in pressing that argument. Half of the claim, I think, is right; the other half is an evaluation artifact. The paper is a careful empirical study with the right instinct, namely that MLLM benchmarks have drifted toward language-heavy QA in which a strong LLM can paper over a weak visual encoder. But its corrective instrument, CV-Bench, still carries enough textual and resolution confounders that the headline conclusion (vision representation is *the* bottleneck) demands sharper causal evidence. The qualitative comparisons are more revealing than the numbers.

Core Contribution in Plain Terms

Strip the jargon first. A multimodal LLM takes an image, runs it through a vision encoder (usually CLIP-style), projects the resulting tokens into the LLM's embedding space, and lets the LLM answer questions. If your benchmark is dominated by questions that a blind LLM could largely guess (common-sense VQA, OCR-heavy tasks where the text in the image is the answer, MMLU-style factual questions), then swapping the vision encoder barely moves the needle. You are measuring the LLM.

Tong et al. build CV-Bench to be the opposite: spatial relation, counting, depth ordering, and relative distance tasks where a language prior should not help. They then run a large-scale bakeoff across self-supervised (DINOv2, MAE), language-supervised (CLIP, SigLIP), and hybrid encoders, at multiple resolutions and across five data regimes. On top of that, they propose SVA, a learned spatial cross-attention connector that lets the LLM query a multi-scale feature pyramid rather than a fixed flat token list.

The formal story. Let $F_s$ be the (flattened) vision feature map from the encoder at pyramid scale $s$, and let $q$ be a learned query vector. SVA computes

$$z \;=\; \frac{1}{S}\sum_{s=1}^{S} \operatorname{softmax}\!\left(\frac{q K_s^{\top}}{\sqrt{d}} + B_s\right) V_s,$$

aggregating across the $S$ scales with a spatially-aware positional bias $B_s$, where $K_s$ and $V_s$ are projections of $F_s$ and $d$ is the feature dimension. The output $z$ is then consumed by the LLM as visual tokens. In essence, this is DETR-style cross-attention ([Carion et al. 2020]) slotted into the visual connector that LLaVA ([Liu et al. 2023]) had left as a simple MLP.
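To make the mechanics concrete, here is a minimal numpy sketch of DETR-style cross-attention over a multi-scale pyramid. The shapes, the averaging across scales, and the additive-bias form are my assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sva_cross_attention(queries, feature_pyramid, spatial_bias, d=64):
    """Cross-attention over a multi-scale feature pyramid (sketch).

    queries:         (Q, d) learned query vectors, one per output visual token
    feature_pyramid: list of (N_s, d) flattened feature maps, one per scale
    spatial_bias:    list of (Q, N_s) additive positional biases, one per scale
    Returns (Q, d) aggregated visual tokens for the LLM.
    """
    out = np.zeros_like(queries)
    for feats, bias in zip(feature_pyramid, spatial_bias):
        scores = queries @ feats.T / np.sqrt(d) + bias  # (Q, N_s)
        out += softmax(scores) @ feats                  # attend, then pool
    return out / len(feature_pyramid)                   # average across scales

# Toy pyramid: 24x24 and 12x12 grids, feature dim 64, 144 queries.
rng = np.random.default_rng(0)
d, Q = 64, 144
pyramid = [rng.normal(size=(24 * 24, d)), rng.normal(size=(12 * 12, d))]
biases = [np.zeros((Q, 24 * 24)), np.zeros((Q, 12 * 12))]
queries = rng.normal(size=(Q, d))
tokens = sva_cross_attention(queries, pyramid, biases, d)
print(tokens.shape)  # (144, 64)
```

The LLM-facing token count (144 here) is decoupled from the pyramid sizes, which is the practical appeal of a query-based connector over a flat patch list.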

Key Contributions and Novelty Rating

I would rate the overall contribution as moderate, with one caveat. Three pieces deserve separate scoring.

CV-Bench is a useful instrument, though not a new scientific idea. The MMVP benchmark from the same group ([Tong et al. 2024, "Eyes Wide Shut"]) had already made the case that CLIP-based MLLMs fail on spatially grounded perception. CV-Bench scales that insight up and systematizes it. Incremental-to-moderate.

The encoder bakeoff is the real value. A careful, compute-matched comparison across DINOv2 ([Oquab et al. 2023]), CLIP ([Radford et al. 2021]), SigLIP ([Zhai et al. 2023]), ConvNeXt ([Liu et al. 2022]), and hybrid setups at a consistent data scale is genuinely missing in the literature. Most prior MLLM papers compared their new method against LLaVA with a single encoder and declared victory. Moderate.

SVA is a plausible connector, but architecturally close to Q-Former ([Li et al. 2023] BLIP-2) and Perceiver-style resampling ([Jaegle et al. 2021]). The novelty lies in the spatial bias and multi-scale aggregation. Incremental.

The combined package is valuable because the CV community has been running MLLM benchmarks without auditing whether those benchmarks actually test vision. That audit is what Cambrian-1 delivers.

Methodology and Experimental Design

The setup follows the now-standard MLLM recipe: a frozen or lightly-tuned vision encoder, a connector (MLP or SVA), and a Vicuna/LLaMA-3 backbone, trained in two stages: alignment pretraining, then instruction tuning. The training data mixture, Cambrian-10M and the curated Cambrian-7M subset, is itself a contribution: the authors publish exactly which data sources are included and at what ratio. That transparency matters for reproducibility.

The ablation grid. The authors vary (a) encoder type, (b) input resolution (224, 336, 448, 1024 effective for high-res variants), (c) connector (MLP vs. SVA), (d) data scale, and (e) LLM backbone size (7B, 8B, 13B, 34B). Every cell in this grid matters, because the hardest confounder in MLLM evaluation is resolution. A DINOv2-L at 518px will beat a CLIP-L at 224px on CV-Bench regardless of anything else, because the spatial questions CV-Bench asks, counting and relative position, are resolution-limited. To the authors' credit, they hold resolution fixed in the key comparisons. To their debit, the headline "SSL beats CLIP" narrative still mixes encoder family with effective receptive field in ways that, I think, are not fully disentangled.

Loss functions are the standard next-token prediction cross-entropy during instruction tuning, with the connector and (optionally) the encoder unfrozen. There are no contrastive or auxiliary losses, which is worth noting because some competing approaches (e.g. [Chen et al. 2023] PaLI) use multi-task objectives.

Results & Analysis

The numbers Cambrian-1 reports on its own benchmark suite are approximately as follows.

Model                 Params   Avg on General VQA   CV-Bench 2D   CV-Bench 3D
LLaVA-1.5 (7B)        7B       ~62                  ~55           ~55
Mini-Gemini-HD (7B)   7B       ~66                  ~60           ~60
Cambrian-1 (8B)       8B       ~67                  ~72           ~70
Cambrian-1 (13B)      13B      ~69                  ~73           ~72
Cambrian-1 (34B)      34B      ~72                  ~75           ~73
GPT-4V                ?        ~75                  ~64           ~69

(Approximate values; see the paper's Table 4 and CV-Bench section for exact figures.)

Two things stand out. First, Cambrian-1 outperforms GPT-4V on CV-Bench 2D and 3D, which is the paper's dramatic moment. Second, on general VQA the gap to GPT-4V narrows but does not invert. That asymmetry is consistent with the authors' thesis: on language-biased benchmarks, a larger proprietary LLM wins; on vision-centric ones, a well-designed open 8B model catches up or surpasses.

But here I pause. GPT-4V is tested through an API with unknown preprocessing, unknown resolution handling, and unknown system-prompt behavior. Declaring that an open model "beats GPT-4V" on a benchmark you designed, where the API's preprocessing is opaque, is a claim with known reproducibility issues (see the balanced-evaluation concerns raised in [Zhang et al. 2024]). The correct baseline would be a GPT-4V probe in which you verify that the image reaches the model at full resolution, and that is precisely what the API does not let you control.

The ablations tell the real story. The most informative experiments are internal: (i) hold the LLM fixed and swap encoders; (ii) hold the encoder fixed and swap connectors; (iii) vary resolution. These show that, at matched resolution, SSL and language-supervised encoders converge in quality on most VQA, yet SSL pulls ahead on CV-Bench. That is a meaningful finding, because it suggests CLIP-family encoders have a specific deficit in fine spatial structure, consistent with the register-token analysis of [Darcet et al. 2024] and the CLIP-patch-similarity failures of [Tong et al. 2024, MMVP].

Effect-size honesty. Many of the encoder differences are 1-3 points on VQA benchmarks where the bootstrap 95% CI for a 2000-sample eval set is roughly ±2 points. To my reading, the authors do not report confidence intervals. Some of the rank-orderings in the encoder bakeoff therefore sit inside the noise floor.
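The noise-floor estimate is easy to check with a percentile bootstrap. The 70% accuracy and 2000-item eval set below are hypothetical numbers chosen to match the scale under discussion; the half-width comes out near ±2 points.

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy on a fixed eval set.

    correct: boolean array with one entry per eval question.
    """
    rng = np.random.default_rng(seed)
    n = len(correct)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample question indices
    accs = correct[idx].mean(axis=1)             # accuracy per bootstrap draw
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical 2000-question benchmark, ~70% accuracy.
rng = np.random.default_rng(1)
correct = rng.random(2000) < 0.70
lo, hi = bootstrap_accuracy_ci(correct)
print(f"95% CI half-width: about ±{(hi - lo) / 2 * 100:.1f} points")
```

Any encoder difference smaller than that half-width should not be read as a rank-ordering.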

Connections to Adjacent Fields

This is where the paper becomes interesting for non-CV readers.

Psychophysics and human vision. The CV-Bench tasks (counting, relative position, depth ordering) are descendants of classic visual cognition experiments dating back to [Treisman & Gelade, 1980] on feature integration. Humans do not "see" counting as a parallel operation; we serialize attention over items above a subitizing threshold of roughly four. If you run CV-Bench with humans and plot reaction time as a function of count, you get a piecewise curve: flat within the subitizing range, then linearly increasing. MLLMs do not exhibit this signature: they either solve counting in one forward pass or fail regardless of count. That implies the model is doing something fundamentally different from human counting, and that the benchmark is probably not measuring what it thinks it is measuring for counts >5.
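A toy version of that human reaction-time signature, with assumed parameters (the 400 ms base, 250 ms/item slope, and limit of four are illustrative, not fitted data):

```python
def human_rt_model(count, base_ms=400.0, slope_ms=250.0, subitize_limit=4):
    """Toy piecewise reaction-time model: flat within the subitizing
    range, linear serial scanning beyond it."""
    if count <= subitize_limit:
        return base_ms
    return base_ms + slope_ms * (count - subitize_limit)

rts = [human_rt_model(n) for n in range(1, 9)]
print(rts)  # flat at 400.0 for counts 1-4, then +250 ms per extra item
```

An MLLM with a single fixed-cost forward pass cannot produce this curve, which is exactly why matching human accuracy on counting does not imply a human-like counting mechanism.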

Signal processing. The connector design, SVA, is essentially a learned anisotropic downsampler. Classical multi-scale analysis ([Lindeberg, 1994] on scale-space theory) offers tools the authors do not use. In particular, the question of *how many scales to include, and at what ratio* is treated empirically here, whereas scale-space theory gives principled answers (half-octave spacing, Gaussian scale-parameter choices). Importing that theory might let SVA use fewer scales for the same information content.
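Half-octave spacing, for instance, gives a principled scale schedule in a few lines. The starting sigma and level count here are illustrative defaults, not SVA's actual pyramid:

```python
def scale_levels(sigma0=1.0, n_levels=6, per_octave=2):
    """Gaussian scale-space sigmas sampled geometrically, Lindeberg-style.
    per_octave=2 gives half-octave spacing (a factor of sqrt(2) per level)."""
    k = 2 ** (1 / per_octave)
    return [sigma0 * k ** i for i in range(n_levels)]

sigmas = scale_levels()
print([round(s, 3) for s in sigmas])  # [1.0, 1.414, 2.0, 2.828, 4.0, 5.657]
```

The point is that "how many scales, at what ratio" need not be a hyperparameter sweep; scale-space theory fixes the ratio and leaves only the range to choose.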

Statistics and causal inference. The claim "vision representation is the bottleneck" is a causal claim dressed up in correlational clothing. The proper framing is an intervention: if I could surgically replace the vision representation with an oracle-perfect one while holding the LLM and connector fixed, does performance saturate? The Cambrian-1 ablations approximate this by comparing encoder quality, but they never actually reach a saturation point, and so the bottleneck claim remains speculative. [Pearl, 2009]'s do-calculus framing would sharpen the experimental design: identify the minimal sufficient adjustment set and check that the encoder swap is not confounded by pretraining data overlap.

Neuroscience. The multi-scale feature pyramid in SVA is reminiscent of the dorsal-stream coarse-to-fine hierarchy in primate visual cortex, where V1 provides high-resolution local features and V4/IT provide coarser, semantically aggregated features that converge in PFC for task-directed reasoning ([Felleman & Van Essen, 1991]). The MLLM literature could benefit from explicitly modeling which "stream" the LLM is querying for which task.

What CV Can Learn From Adjacent Fields

Three imports are overdue.

First, statistical power analysis from experimental psychology. CV-Bench subsets contain a few hundred questions each. At that size, detecting a 2-point accuracy difference with 80% power requires an effect-size calculation the paper does not perform. Psychology papers are now routinely required to preregister power analyses; MLLM benchmarks should too.
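The missing power calculation is standard. A sketch using the two-proportion z-test sample-size formula (the 70% vs. 72% accuracies are hypothetical, chosen to represent a 2-point gap):

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Questions needed per model to detect an accuracy gap p1 vs p2
    with a two-sided two-proportion z-test at the given power."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((za + zb) ** 2 * var / (p1 - p2) ** 2)

# A 2-point gap (70% vs 72%) needs roughly 8,000 questions per model --
# an order of magnitude more than a few-hundred-item CV-Bench subset.
print(n_per_group(0.70, 0.72))
```

By this arithmetic, CV-Bench subsets of a few hundred questions are powered to detect gaps of roughly 10 points, not 2.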

Second, item-response theory from psychometrics. Not every CV-Bench question is equally informative. IRT would let the authors identify which questions discriminate between strong and weak MLLMs, and which ones every model passes (rendering them useless) or every model fails (rendering them suspect for annotation error). This would shrink the benchmark and make it more discriminating.
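A cheap IRT-flavored proxy for this audit uses point-biserial discrimination rather than a fitted 2PL model; the response matrix below is invented for illustration:

```python
import numpy as np

def item_discrimination(responses):
    """Point-biserial discrimination per item: correlation between each
    item's 0/1 responses and the rest-score (total minus that item).
    Items everyone passes or fails get 0.0: they carry no information."""
    responses = np.asarray(responses, dtype=float)  # (models, items)
    total = responses.sum(axis=1)
    disc = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item
        if item.std() == 0 or rest.std() == 0:
            disc.append(0.0)
        else:
            disc.append(float(np.corrcoef(item, rest)[0, 1]))
    return disc

# 4 hypothetical models (rows) x 4 items (cols); item 0 tracks overall
# ability, item 3 is passed by every model and is useless for ranking.
resp = [[1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 1]]
disc = item_discrimination(resp)
print([round(d, 2) for d in disc])
```

Run over the real CV-Bench response matrix, this would identify exactly which questions earn their place in the benchmark.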

Third, adversarial evaluation from the robustness literature. The correct test of a vision-centric benchmark is whether a language-only model given the question text alone achieves near-chance accuracy. The authors do a version of this (a blind-LLM baseline), but they do not perturb images adversarially to check that the model is actually attending to image content rather than exploiting image-independent priors correlated with question structure.

Technical Intuition Building: Why the Connector Matters

The hardest concept here is not SVA itself; it is why the connector should be spatially aware at all, given that the LLM has a full attention mechanism that could, in principle, learn spatial relations from a flat token sequence.

Think about it this way. If you hand the LLM a flat sequence of 576 visual tokens (24x24 CLIP patches), the LLM must infer spatial layout from token-index order alone. Positional embeddings help, but the LLM was trained on language, where position denotes sequence, not 2D topology. To answer "is the cat to the left of the dog?" the LLM must internalize that tokens i and i+1 are horizontal neighbors while tokens i and i+24 are vertical neighbors, a fact that must be learned rather than built in.

What SVA does is encode that topology *into the queries themselves*. Each learned query carries a spatial bias, so the query assigned to grid position (i, j) preferentially attends to features near (i, j). The LLM then sees a structured visual input in which 2D topology is preserved in query order. You can think of this as convolutional inductive bias reintroduced at the interface layer.
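One way to sketch such a spatial bias is a negative squared-distance penalty between query positions and feature positions; the penalty form and temperature are my assumptions, not the paper's exact parameterization:

```python
import numpy as np

def spatial_bias(grid=24, temperature=2.0):
    """Additive attention bias making query (i, j) prefer features near
    grid cell (i, j), via a negative squared-distance penalty.
    Returns a (grid*grid, grid*grid) matrix that peaks on the diagonal."""
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)],
                      dtype=float)
    # Pairwise squared distances: query positions vs. feature positions.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return -d2 / temperature

bias = spatial_bias(grid=4)
# Each query's strongest bias points at its own grid cell.
print(bool((bias.argmax(axis=1) == np.arange(16)).all()))  # True
```

Added to the attention logits (as in the SVA equation above the code in spirit), this hands the connector the 2D neighborhood structure that a flat token list forces the LLM to learn from scratch.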

The receptive-field question. What does the model actually attend to? The authors show attention maps confirming that SVA queries localize to semantically meaningful regions. Those maps are informative but also cherry-picked. A stronger analysis would quantify the entropy of attention weights and correlate it with task accuracy on spatial versus non-spatial questions.
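The entropy analysis I am asking for is simple to state: compute the Shannon entropy of each query's attention distribution and correlate it with per-question accuracy. A minimal sketch with made-up attention weights:

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy (nats) of an attention distribution; low entropy
    means the query localizes, high entropy means it spreads out."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    nz = w[w > 0]                      # skip zero weights (0 * log 0 = 0)
    return float(-(nz * np.log(nz)).sum())

focused = attention_entropy([0.97, 0.01, 0.01, 0.01])
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])
print(focused < uniform)  # True: localized attention has lower entropy
```

If spatial questions are answered correctly mainly when query entropy is low, that would turn the cherry-picked attention maps into a quantitative claim.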

Limitations and Failure Modes

Four limitations the paper does not, in my view, adequately address.

1. Resolution-as-hidden-variable. Many of the "SSL wins" results at fixed nominal resolution still differ in effective receptive field because encoders pre-process differently (CLIP often uses aggressive center-cropping, DINOv2 less so). A concrete failure scenario: on a benchmark where the target object sits at the image periphery, CLIP's crop bias causes systematic failure unrelated to representation quality.

2. Benchmark contamination. CV-Bench draws from ADE20K ([Zhou et al. 2017]), COCO, and Omni3D, all of which likely appear in the pretraining image pools of at least some encoders, and possibly in the LLM's text-grounded training. The authors filter by image hash, but scene-level semantic overlap (same photographers, same scene types) is not filtered.

3. LLM-size decoupling is incomplete. The 8B vs. 34B comparison shows that larger LLMs help on VQA but help less on CV-Bench. This is offered as evidence that LLM scale is not the bottleneck. But the 34B model is not trained with more visual tokens or a better connector; the comparison isolates LLM scale *given* a fixed visual pipeline, which is a tautology. A fairer test would scale the visual pathway (more tokens, more encoder parameters) at fixed LLM size and see whether performance scales symmetrically.

4. No failure-case analysis on negatives. The paper shows successful qualitative examples. It does not show, for the best model, the distribution of CV-Bench failures by sub-task. Without that breakdown, we cannot tell whether the remaining 25-28% of CV-Bench errors are uniformly distributed or concentrated on specific subskills such as depth ordering, which would imply a very different diagnosis.
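On limitation 1, the center-crop effect is easy to quantify. This sketch assumes the common resize-shortest-side-then-center-crop preprocessing (which the text attributes to CLIP-style pipelines):

```python
def crop_retained_fraction(w, h, target=224):
    """Fraction of image area surviving resize-shortest-side + center-crop.
    A wide image loses its left/right periphery entirely."""
    scale = target / min(w, h)
    new_w, new_h = w * scale, h * scale
    return (target * target) / (new_w * new_h)

# A 16:9 photo keeps only ~56% of its area; peripheral objects vanish
# before the encoder ever sees them.
print(round(crop_retained_fraction(1920, 1080), 2))
```

Any benchmark question about an object in the discarded 44% measures the preprocessing pipeline, not the representation.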

Related Work

[Liu et al. 2023] LLaVA established the MLLM recipe Cambrian-1 iterates on. [Li et al. 2023] BLIP-2 introduced Q-Former, the closest architectural ancestor to SVA. [Tong et al. 2024, "Eyes Wide Shut"] is the direct precursor paper from the same group that identified CLIP's spatial deficits via MMVP; Cambrian-1 is its larger, more constructive sibling. [Darcet et al. 2024] on register tokens explains part of why ViT-based encoders have artifact tokens that corrupt spatial analysis. [Chen et al. 2023] PaLI and [Alayrac et al. 2022] Flamingo are architecturally different, Perceiver-based and image-text interleaved, respectively, and Cambrian-1 would benefit from direct comparison with them at a matched data scale.

Key Questions for the Authors

1. What is the 95% confidence interval for each row of the encoder bakeoff table, and how many rank-orderings reverse under bootstrap resampling?

2. Does SVA's advantage over a simple MLP connector survive when the MLP is given the same multi-scale feature pyramid as input (flattened)?

3. At fixed 336px resolution and matched pretraining compute, does DINOv2 still beat SigLIP on CV-Bench, or does the gap close?

4. Have you measured CV-Bench performance for a pure language baseline (question-only, no image)? What is the floor?

5. How do Cambrian-1 models perform on benchmarks designed by *other* groups to stress vision (e.g. BLINK, MMVP-equivalent held-out sets)?

Impact Assessment

The most durable contribution here is not the model or the connector. It is the standardized evaluation infrastructure: an open training recipe, open data, a vision-centric benchmark, and a compute-matched encoder comparison. That infrastructure is precisely what the open MLLM community was missing. If CV-Bench-style evaluation becomes standard, the field will stop rewarding papers that win MMLU-MM by having a better LLM, and start rewarding papers that actually improve visual grounding. That is a healthy direction.

The cross-pollination opportunities are clear. Robotics and embodied AI need exactly the spatial perception CV-Bench tests, and a well-characterized MLLM backbone that passes CV-Bench is a better starting point for vision-language-action models than one tuned purely on web VQA. The next experiment I would want to see is CV-Bench-Robotics: real-robot tabletop manipulation queries in which the LLM must ground its answers in geometry it actually has to act on.

Verdict

Cambrian-1 is a good empirical study with an overclaimed headline. The encoder bakeoff and open infrastructure are valuable. The "vision is the bottleneck" framing is partially correct, but needs sharper causal evidence than what is presented. CV-Bench is a welcome corrective, yet it carries its own confounders around resolution and benchmark provenance. Recommend accept with major revisions: fix the confidence intervals, add the pure-language baseline on CV-Bench, decouple resolution from encoder choice explicitly, and temper the GPT-4V comparison until the API's preprocessing can be verified.

Reproducibility & Sources

Primary paper. Tong et al. "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs", arXiv:2406.16860.

Code repository. Official code released at github.com/cambrian-mllm/cambrian (referenced by the paper; verify at time of use).

Datasets. Cambrian-10M and Cambrian-7M (released by the authors). CV-Bench is constructed from ADE20K ([Zhou et al. 2017]), COCO ([Lin et al. 2014]), and Omni3D ([Brazil et al. 2023]). All public.

Reproducibility assessment.

Code availability: 4/5. Official training and evaluation code released under an open license; minor gaps in data-filtering scripts reported by third parties.

Data availability: 4/5. Cambrian-10M/7M and CV-Bench are public; some constituent VQA sources require separate download and licensing.

Experimental detail: 3/5. Hyperparameters and training stages are documented; however, the paper does not report confidence intervals, blind-LLM baselines on all CV-Bench splits, or resolution-matched controls for every encoder, which limits tight reproduction of specific rank-orderings.

Inline citations used. [Liu et al. 2023] LLaVA; [Radford et al. 2021] CLIP; [Oquab et al. 2023] DINOv2; [Zhai et al. 2023] SigLIP; [Li et al. 2023] BLIP-2; [Alayrac et al. 2022] Flamingo; [Chen et al. 2023] PaLI; [Tong et al. 2024] Eyes Wide Shut / MMVP; [Darcet et al. 2024] ViTs need registers; [Carion et al. 2020] DETR; [Jaegle et al. 2021] Perceiver; [Liu et al. 2022] ConvNeXt; [Treisman & Gelade, 1980] feature integration; [Lindeberg, 1994] scale-space; [Pearl, 2009] causality; [Felleman & Van Essen, 1991] cortical hierarchy; [Zhou et al. 2017] ADE20K; [Lin et al. 2014] COCO; [Brazil et al. 2023] Omni3D.