SAM 2 Under Peer Review: Auditing Streaming Memory Attention, Occlusion Recovery, and the Long-Horizon Drift Tax
*Ravi, Gabeur, Hu, Hu, Ryali, Ma, Khedr, Rädle, Rolland, Gustafson, Mintun, Pan, Alwala, Carion, Wu, Girshick, Dollár, Feichtenhofer. "SAM 2: Segment Anything in Images and Videos." arXiv:2408.00714, 2024.*
The prevailing assumption held that promptable video segmentation required a dedicated tracker stacked atop an image model. SAM 2 rejects that decomposition and argues that a single transformer equipped with a streaming memory bank can do both jobs at once. It is a clean architectural claim, and the SA-V benchmark table appears to support it at face value. The question this review poses is not whether the numbers are real. It is whether the numbers *tell us what the authors say they tell us*.
1. Summary
SAM 2 extends the Segment Anything model [Kirillov et al. 2023] from single images to video through a promptable interface. The architecture reuses SAM's prompt encoder and mask decoder, swaps in a Hiera backbone [Ryali et al. 2023] for the image encoder, and inserts a *memory attention* block between encoder and decoder. Memory attention performs self-attention over the current frame's features, followed by cross-attention into a *memory bank* holding recent spatial features and per-object pointer embeddings. At inference, the model operates as a streaming tracker: at frame $t$, the decoder conditions on user prompts (clicks, boxes, masks) together with memory of previous frames; once frame $t$ is decoded, the updated mask is encoded back into memory. Formally, if $F_t$ is the current-frame feature map and $\mathcal{M}_t$ the memory bank of spatial features and object pointers, memory attention computes

$$\tilde{F}_t = \mathrm{CrossAttn}\big(\mathrm{SelfAttn}(F_t),\ \mathcal{M}_t\big),$$

and $\tilde{F}_t$ is passed to SAM's original mask decoder.
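To make the dataflow concrete, here is a minimal numpy sketch of the self-attention-then-cross-attention pattern described above. It is an illustrative single-head, unprojected simplification, not the paper's implementation; the function and variable names (`memory_attention`, `bank`) are mine.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q (Tq, d), k/v (Tk, d) -> (Tq, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def memory_attention(frame_feats, memory_feats):
    """Self-attention over the current frame's tokens, then residual
    cross-attention into the flattened memory bank (a sketch; the real
    block has projections, multiple heads, and pointer tokens)."""
    h = attention(frame_feats, frame_feats, frame_feats)  # self-attn
    if memory_feats:
        m = np.concatenate(memory_feats, axis=0)          # flatten bank
        h = h + attention(h, m, m)                        # cross-attn
    return h

rng = np.random.default_rng(0)
f_t = rng.standard_normal((16, 8))                        # 16 tokens, dim 8
bank = [rng.standard_normal((16, 8)) for _ in range(3)]   # 3 memory frames
out = memory_attention(f_t, bank)
print(out.shape)  # (16, 8)
```

The key property to notice is that the current frame's token count, not the memory size, sets the output shape: memory only enters through keys and values.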
Alongside the model, the authors release SA-V, a dataset of roughly 50.9K videos with 642.6K masklets and 35.5M individual masks, annotated semi-automatically using SAM 2 itself in a three-phase data engine. On image segmentation benchmarks (23 zero-shot suites), SAM 2 matches or exceeds SAM while running 6× faster. On promptable video segmentation (PVS) and semi-supervised VOS (DAVIS 2017 [Pont-Tuset et al. 2017], YouTube-VOS [Xu et al. 2018], MOSE [Ding et al. 2023]), it outperforms XMem [Cheng & Schwing, 2022] and Cutie [Cheng et al. 2024] on aggregate $\mathcal{J}\&\mathcal{F}$. The claim is unification: one model, one training recipe, two modalities.
My assessment. SAM 2 is a substantive engineering contribution with a clean streaming formulation and a genuinely useful dataset. Yet the central scientific claim, that memory attention unifies image and video, rests on an experimental protocol that conflates architecture, training data, and annotation pipeline. The ablations are stronger than most video papers but leave key questions about occlusion recovery and long-horizon drift unanswered. This would be a clear accept at CVPR, but it is not the closed-book solution to video segmentation the framing sometimes implies.
2. Significance and Novelty Assessment
Rating: significant, with qualifications.
Let me separate what is new from what is assembled. The memory attention block is architecturally close to STM [Oh et al. 2019] and its descendants (STCN, AOT [Yang et al. 2021], XMem [Cheng & Schwing, 2022]), all of which maintain a memory of past frames and cross-attend from the current frame. XMem in particular organizes memory into short-term, long-term, and working tiers explicitly to handle long videos. SAM 2's memory bank is simpler: a FIFO of the last $N$ frames plus prompted frames, augmented with per-object pointer tokens that summarize the tracked object over time. The memory mechanism itself, then, is *not* novel in kind, only in integration.
What *is* new, and where I would locate the paper's actual contribution:
1. Promptable video segmentation as a unified task. Treating image segmentation as a one-frame instance of video segmentation, with consistent prompt semantics (clicks, boxes, mask prompts across frames), is a reframing rather than a result. But it is the right reframing, and it is what enables training on mixed image and video data.
2. The SA-V data engine and dataset. 35.5M masks, semi-automatically annotated, make this the single largest video segmentation corpus by an order of magnitude. SA-V is the paper's most durable contribution. Most downstream gains likely flow from this, not from the architecture.
3. Streaming inference with lightweight memory. Unlike STM-family models that recompute attention over ever-growing memories, SAM 2 commits to a bounded memory bank of $N$ recent frames plus pointer tokens. Inference cost is $O(N)$ per frame, and with small $N$ this yields real-time throughput.
What is engineering refinement rather than insight: Hiera as the backbone, the prompt encoder reuse, and the mask decoder with occlusion head. These are good choices, but not contributions.
An honest positioning: SAM 2 is to XMem what SAM was to interactive image segmentation before it, a model whose novelty lies less in the architectural primitive than in the scale-and-interface unification. That remains significant, but reviewers should not be sold on "memory attention" as the load-bearing insight. The data engine is.
3. Technical Correctness Audit
The paper is empirical, so the audit concerns experimental logic rather than proofs.
Loss construction. The mask decoder is trained with a linear combination of focal loss [Lin et al. 2017] and Dice loss, augmented by an IoU regression head and an occlusion prediction head supervised by binary cross-entropy:

$$\mathcal{L} = \lambda_{\text{focal}}\,\mathcal{L}_{\text{focal}} + \lambda_{\text{dice}}\,\mathcal{L}_{\text{dice}} + \lambda_{\text{iou}}\,\mathcal{L}_{\text{iou}} + \lambda_{\text{occ}}\,\mathcal{L}_{\text{bce}}.$$
No red flags here; this is standard. The interesting design choice is that the occlusion head is kept separate from the mask logits, which means the model can output a "valid but empty" mask. That is the mechanism by which an object disappearing behind another does not corrupt future memory entries. Whether this works reliably is an empirical question the paper addresses only partially (see §5).
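As a reference point for the mask-side terms, here is a small numpy sketch of focal plus Dice loss with a BCE occlusion term. The weights are illustrative defaults, not the paper's values, and the IoU regression term is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Lin et al. 2017: down-weight easy examples via (1 - p_t)^gamma.
    p = sigmoid(logits)
    p_t = np.where(targets == 1, p, 1 - p)
    a_t = np.where(targets == 1, alpha, 1 - alpha)
    return np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t + 1e-8))

def dice_loss(logits, targets, eps=1.0):
    # Soft Dice: overlap-based, robust to foreground/background imbalance.
    p = sigmoid(logits)
    inter = (p * targets).sum()
    return 1.0 - (2 * inter + eps) / (p.sum() + targets.sum() + eps)

def bce(logit, target):
    p = sigmoid(logit)
    return float(-(target * np.log(p + 1e-8) + (1 - target) * np.log(1 - p + 1e-8)))

def combined_loss(mask_logits, mask_gt, occ_logit, occ_gt,
                  w_focal=20.0, w_dice=1.0, w_occ=1.0):
    # Illustrative weights only; the occlusion head is supervised
    # separately from the mask logits, as the review notes.
    return (w_focal * focal_loss(mask_logits, mask_gt)
            + w_dice * dice_loss(mask_logits, mask_gt)
            + w_occ * bce(occ_logit, occ_gt))

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 32))
gt = (rng.random((32, 32)) > 0.5).astype(float)
loss = combined_loss(logits, gt, occ_logit=0.3, occ_gt=0.0)
print(float(loss))
```

The separation matters: because `occ_logit` is its own head, an occluded frame can be supervised as "object absent" without forcing the mask logits toward a degenerate empty prediction.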
Memory update dynamics. The paper is light on the *selection policy* for memory entries. The memory bank contains the last $N$ frames plus prompted frames. In crowded scenes with brief re-appearances, the FIFO can evict the most informative past frame. A principled alternative would be XMem's long-term memory consolidation [Cheng & Schwing, 2022] or a learned memory controller. The authors compare against neither. This is a defensible simplification, but also a methodological gap: we do not know how much the simpler memory costs us.
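The selection policy as described, FIFO over recent frames with prompted frames pinned, can be sketched in a few lines. This is my reading of the policy, not Meta's implementation; the class and method names are hypothetical.

```python
from collections import deque

class MemoryBank:
    """FIFO over recent frames, with prompted (user-conditioned) frames
    pinned so they are never evicted. A sketch of the described policy."""
    def __init__(self, max_recent=6):
        self.recent = deque(maxlen=max_recent)  # auto-evicts the oldest
        self.prompted = []                      # kept for the whole video

    def add(self, frame_id, features, prompted=False):
        if prompted:
            self.prompted.append((frame_id, features))
        else:
            self.recent.append((frame_id, features))

    def entries(self):
        # Memory attention cross-attends into all retained entries.
        return self.prompted + list(self.recent)

bank = MemoryBank(max_recent=3)
bank.add(0, "feat0", prompted=True)   # user click on frame 0 is pinned
for t in range(1, 6):
    bank.add(t, f"feat{t}")           # frames 1 and 2 get evicted
ids = [fid for fid, _ in bank.entries()]
print(ids)  # [0, 3, 4, 5]
```

The failure mode flagged above is visible directly: the single best view of a briefly re-appearing object can sit in the evicted range (frames 1 and 2 here) with no consolidation path back into memory.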
Training-inference mismatch. Training uses fixed-length clips (8 frames) sampled from videos, with simulated interactive prompts. Inference is streaming over arbitrarily long videos. The model is never trained on its own rollouts of 100+ frames. This is classical exposure bias, and while the memory attention design partially mitigates it, no curriculum experiment addresses how errors compound. Read the rollouts carefully and the drift is visible on longer MOSE clips.
Data leakage risk. SA-V was annotated with SAM 2 in the loop. Evaluations are conducted on a held-out SA-V val/test split, but any split of a self-annotated corpus is ambiguous: the *error structure* of the annotating model is baked into the labels. A boundary the model cannot see, it cannot teach itself to see. This is not a fatal flaw, and the paper partially addresses it by including expert human correction in Phase 3, but the gap between human-only and model-assisted annotations on the same frames is never quantified. That is the missing control.
4. Experimental Rigor
The experiments are broader than most video segmentation papers. The ablations are where I want to push harder.
4.1 Headline numbers
| Benchmark | Metric | SAM 2 | Prior best | Source (prior) |
|---|---|---|---|---|
| SA-V val | $\mathcal{J}\&\mathcal{F}$ | 76.0 | 60.1 (Cutie-base+) | [Cheng et al. 2024] |
| MOSE val | $\mathcal{J}\&\mathcal{F}$ | 74.1 | 69.9 (Cutie-base+) | [Cheng et al. 2024] |
| DAVIS 2017 val | $\mathcal{J}\&\mathcal{F}$ | 90.7 | 88.8 (Cutie-base) | [Cheng et al. 2024] |
| YT-VOS 2019 val | $\mathcal{J}\&\mathcal{F}$ | 88.6 | 87.0 (Cutie-base+) | [Cheng et al. 2024] |
| 23 image datasets (zero-shot) | mIoU (avg) | +1 to +3 over SAM | SAM | [Kirillov et al. 2023] |
The SA-V jump of roughly +16 points over Cutie is enormous. The DAVIS and YouTube-VOS gains of roughly 2 points are modest. That gap matters: SAM 2 dominates on its own benchmark and only nudges ahead on the traditional ones. A reader should weigh these asymmetrically.
4.2 Baseline adequacy
The comparisons to XMem and Cutie use those models' *released checkpoints*, trained on DAVIS + YouTube-VOS + MOSE. SAM 2 is trained on SA-1B + SA-V + the same VOS data. The data regimes are not matched. A fair comparison requires retraining Cutie on SA-V, which the authors did not do. Without that control, we cannot attribute the SA-V gap to the architecture rather than to the training distribution.
This is the single most important missing experiment. My suspicion, grounded in the modest DAVIS and YT-VOS gains, is that a Cutie-on-SA-V run would close most of the SA-V gap. The authors likely know this and chose not to run it.
4.3 Ablation completeness
The ablations (backbone choice, memory size, pointer tokens, data mixture) are adequate but leave key questions open:
- Memory bank size sweep is shallow. They show 1, 3, and 6 frames. Beyond 6, performance saturates. But saturation on *8-frame training clips* says nothing about 100-frame rollouts. A proper analysis would plot $\mathcal{J}\&\mathcal{F}$ as a function of video length at each memory size.
- No occlusion ablation isolating the occlusion head. They show it helps on SA-V but do not test re-identification after occlusions longer than the memory bank window. This is the drift tax the title of this review flags.
- Prompt-type ablation is missing. Click prompts, box prompts, and mask prompts likely have different robustness profiles. Only aggregate numbers are reported.
- No multi-seed error bars. Running a 200M+ parameter model three times is expensive, but a single bootstrap estimate over evaluation sets would cost nothing and is now standard at NeurIPS.
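The bootstrap estimate suggested in the last bullet is cheap enough to sketch in full. The idea: resample evaluation videos with replacement and take percentiles of the resampled means; no model re-runs are required. The scores below are invented placeholder values.

```python
import random

def bootstrap_ci(per_video_scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score over evaluation videos.
    Resamples videos with replacement; the model is never re-run."""
    rng = random.Random(seed)
    n = len(per_video_scores)
    means = sorted(
        sum(rng.choices(per_video_scores, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Placeholder per-video J&F values, not real SAM 2 numbers.
scores = [0.72, 0.81, 0.66, 0.90, 0.75, 0.78, 0.69, 0.84]
lo, hi = bootstrap_ci(scores, n_boot=2000)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With per-video scores in hand for both SAM 2 and a baseline, the same resampling applied to the paired differences would tell us whether a 2-point DAVIS gap is distinguishable from noise.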
4.4 Efficiency claims
The paper reports 44 FPS for Hiera-B+ on A100 at 1024×1024. This is genuinely fast. The FLOPs accounting omits the memory encoder's contribution at steady state, which is small per frame but non-zero. The claim of "6× faster than SAM" on images is plausible because Hiera is more efficient than ViT-H. Honest, useful numbers.
5. Limitations the Authors Did Not Address
The paper lists some limitations (similar-looking objects, fast motion, small objects). Here are those I would add.
Long-horizon identity drift. The memory bank is FIFO-bounded at 6 non-prompt frames. For an object occluded for more frames than the bank can retain, the only remaining conditioning is the object pointer token, which is a compressed summary. My prediction: on clips exceeding 10-15 seconds of continuous occlusion, identity swap rates rise sharply. The SA-V test set's median clip length is short enough that this regime is underrepresented. A clean experiment would construct synthetic occlusion intervals of 1s, 5s, 15s, and 60s on a DAVIS subset and measure re-identification rate.
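The measurement side of the proposed experiment is simple to pin down. A sketch of a re-identification metric under my own definitions (the function `reid_rate` and its IoU threshold are assumptions, not anything from the paper):

```python
def iou(a, b):
    """IoU of two binary masks represented as sets of pixel coordinates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def reid_rate(gt_masks, pred_masks, occlusion_end, thresh=0.5):
    """Fraction of post-occlusion frames where the prediction still
    matches the ground-truth object (IoU >= thresh). Sweeping this over
    occlusion lengths gives the curve the review asks for."""
    post = range(occlusion_end, len(gt_masks))
    hits = sum(iou(gt_masks[t], pred_masks[t]) >= thresh for t in post)
    return hits / max(1, len(post))

# Toy sequence: the object re-appears at frame 6; the tracker recovers
# it on only 2 of the 4 post-occlusion frames.
obj = {(1, 1), (1, 2), (2, 1)}
gt = [obj] * 10
pred = [obj] * 6 + [set(), set(), obj, obj]
rate = reid_rate(gt, pred, occlusion_end=6)
print(rate)  # 0.5
```

Plotting `reid_rate` against the 1s/5s/15s/60s occlusion lengths, at each memory bank size, would directly expose whether the pointer token alone sustains identity once all visible frames are evicted.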
Shot-boundary robustness. The streaming formulation assumes temporal continuity. A hard cut inserted mid-clip breaks the Markov assumption implicit in memory attention. SA-V is curated to exclude shot boundaries within a single masklet, so the model has never seen this during training. Downstream users editing videos in post will hit this immediately.
Domain transfer. SA-V is natural imagery. Medical, satellite, microscopy, and thermal imagery are out of distribution. The image-side zero-shot numbers on 23 datasets do *not* transfer to the video case, because no comparable video OOD suite exists. A concrete failure scenario: cell tracking in phase-contrast microscopy, where the backbone's feature statistics diverge from SA-V.
Memory attention complexity at high resolution. The memory cross-attention costs $O(K \cdot N \cdot |F_t|^2)$ per frame, where $K$ is the number of objects tracked, $N$ the memory size, and $|F_t|$ the number of spatial tokens. For $K$ objects in MOT-style scenes, the per-frame cost grows linearly in $K$ and multi-object inference is not truly single-pass. The paper reports single-object and few-object results but does not stress-test dense scenes.
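A back-of-envelope check on the linear-in-objects cost. The multiply-accumulate count below ignores projections and pointer tokens, so it understates the true cost; the token and dimension values are illustrative, not taken from the paper.

```python
def memory_attn_macs(hw_tokens, mem_frames, dim, num_objects=1):
    """Rough multiply-accumulate count for one frame of memory
    cross-attention: hw_tokens queries against mem_frames * hw_tokens
    memory keys/values, repeated independently per tracked object."""
    mem_tokens = mem_frames * hw_tokens
    per_object = 2 * hw_tokens * mem_tokens * dim  # QK^T plus attn @ V
    return num_objects * per_object

# Illustrative settings: a 64x64 feature grid, 6 memory frames, dim 256.
base = memory_attn_macs(hw_tokens=64 * 64, mem_frames=6, dim=256)
dense = memory_attn_macs(hw_tokens=64 * 64, mem_frames=6, dim=256,
                         num_objects=20)
print(dense / base)  # 20.0 -- cost scales linearly with tracked objects
```

Twenty objects in an MOT-style scene means twenty independent memory banks and a 20× attention bill, which is why "real-time" claims need a stated object count attached.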
The annotation-model feedback loop. SA-V was built by SAM 2 correcting itself through successive phases. Any systematic error mode of the early-phase model (boundary over-smoothing on hair, transparent objects) is potentially baked into the later-phase ground truth. A held-out human-only annotation subset is the control I would demand.
6. Related Work in Perspective
SAM 2 sits at the confluence of three lines.
*Promptable image segmentation*: SAM [Kirillov et al. 2023] established the prompt-encoder / mask-decoder / ambiguity-aware output paradigm. SAM 2 extends this directly.
*Video object segmentation with memory*: STM [Oh et al. 2019], AOT [Yang et al. 2021], XMem [Cheng & Schwing, 2022], and Cutie [Cheng et al. 2024]. SAM 2's memory attention is a simpler, streaming variant of XMem's tiered memory. The conceptual lineage is clear, even if the paper frames the design as fresh.
*Foundation models for video*: Contrast with InternVideo [Wang et al. 2022] and VideoMAE [Tong et al. 2022], which target representation learning rather than dense prediction. SAM 2 deliberately forgoes the masked-modeling pretraining route and instead leans on SA-V's scale. Whether scale substitutes for self-supervision is an open question this paper cannot answer alone.
One paper is conspicuously absent from the authors' discussion: DEVA [Cheng et al. 2023], which decouples image segmentation from temporal propagation with a learned bi-directional propagator. DEVA's philosophy (reuse strong image models, add a propagator) is the competing paradigm to SAM 2's unification, and the paper does not compare directly.
7. Questions for the Authors
1. Matched-data baseline. If you retrain Cutie or XMem on SA-V with identical data, resolution, and schedule, how much of the SA-V gap remains attributable to memory attention rather than the dataset?
2. Long-occlusion re-ID. On a controlled test (e.g. synthetic occlusions of varying length inserted into DAVIS clips), what is the identity swap rate as a function of occlusion length? Does the object pointer token suffice once memory evicts all visible frames?
3. Shot-boundary behavior. What happens when a clip contains a hard cut mid-track? Does the model track a different object with similar features, or does it correctly reset?
4. Annotation bias audit. How does SA-V ground truth differ from a human-only annotation on a held-out subset, particularly for boundary-difficult categories (hair, glass, motion-blurred edges)?
5. Scaling law for memory. Is there an empirical relationship between memory bank size, object pointer dimensionality, and maximum reliable track length? A scaling plot would help practitioners choose memory budgets.
8. Broader Impact
SA-V is the most consequential release here. Video segmentation datasets have lagged image segmentation by roughly an order of magnitude in mask count, and SA-V closes the gap. Expect a wave of downstream work using SA-V to pretrain non-SAM-2 architectures, which will further clarify what SA-V (the data) buys versus SAM 2 (the model).
On the ethics side: interactive video segmentation lowers the cost of targeted content removal, rotoscoping, and surveillance-style tracking. The authors acknowledge this briefly. The dataset itself was curated with consent considerations, but the model does not distinguish between authorized and unauthorized deployment contexts. The more pointed question is whether SAM 2 enables easier deepfake compositing by providing clean mask tracks, and the answer is yes.
9. Verdict and Recommendation
At a top venue, this is a clear accept. The dataset alone justifies publication. The architectural unification is pragmatically important, even if it is less algorithmically novel than the framing suggests. The experiments are broader than most in the VOS literature.
My conditional criticism: the paper is marketed as "solving" promptable video segmentation, whereas a more honest framing would be "a strong streaming baseline paired with the first truly large video segmentation corpus, whose gains are largely data-driven." Reviewers and readers should calibrate accordingly.
The next experiment I would want to see: a side-by-side ablation in which Cutie, XMem, and SAM 2 are all trained on SA-V from scratch with matched compute. That is the study that would isolate what memory attention actually contributes. Until it is run, the architectural claim rests on a single controlled comparison, and a single comparison is not a conclusion.
10. Reproducibility and Sources
Primary paper. Ravi, Gabeur, Hu, Hu, Ryali, Ma, Khedr, Rädle, Rolland, Gustafson, Mintun, Pan, Alwala, Carion, Wu, Girshick, Dollár, Feichtenhofer. *SAM 2: Segment Anything in Images and Videos.* arXiv:2408.00714, 2024.
Code repository. Official code and checkpoints released at github.com/facebookresearch/sam2.
Datasets.
- SA-V (released with paper): ai.meta.com/datasets/segment-anything-video/
- SA-1B: segment-anything.com
- DAVIS 2017: davischallenge.org
- YouTube-VOS: youtube-vos.org
- MOSE: henghuiding.github.io/MOSE/
Reproducibility assessment (1-5 scale).
| Axis | Rating | Justification |
|---|---|---|
| Code availability | 5/5 | Full training and inference code released, with weights at multiple model sizes. |
| Data availability | 4/5 | SA-V is public but licensing restricts some commercial use; SA-1B access requires agreement. |
| Experimental detail | 4/5 | Architecture, training schedule, and evaluation protocols are documented in the appendix. Missing: precise memory selection policy, ablation seeds, and matched-data baseline retraining protocol. |
The overall reproducibility posture is strong for a FAIR paper. The open question is not whether one can run SAM 2, but whether one can run the *comparison* that would pin down memory attention's independent contribution. That study is left to the community.
Cited Works
- Kirillov et al. *Segment Anything*, ICCV 2023.
- Ryali et al. *Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles*, ICML 2023.
- Oh et al. *Video Object Segmentation using Space-Time Memory Networks (STM)*, ICCV 2019.
- Yang et al. *Associating Objects with Transformers for Video Object Segmentation (AOT)*, NeurIPS 2021.
- Cheng & Schwing, *XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model*, ECCV 2022.
- Cheng et al. *Putting the Object Back into Video Object Segmentation (Cutie)*, CVPR 2024.
- Cheng et al. *Tracking Anything with Decoupled Video Segmentation (DEVA)*, ICCV 2023.
- Pont-Tuset et al. *The 2017 DAVIS Challenge on Video Object Segmentation*, arXiv:1704.00675, 2017.
- Xu et al. *YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark*, ECCV 2018.
- Ding et al. *MOSE: A New Dataset for Video Object Segmentation in Complex Scenes*, ICCV 2023.
- Lin et al. *Focal Loss for Dense Object Detection*, ICCV 2017.
- Tong et al. *VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training*, NeurIPS 2022.
- Wang et al. *InternVideo: General Video Foundation Models via Generative and Discriminative Learning*, arXiv:2212.03191, 2022.
- Vaswani et al. *Attention Is All You Need*, NeurIPS 2017.
