1. Introduction

Serving an LLM is, at its core, a memory management problem dressed up as a compute problem. The KV cache dominates the inference footprint once batch sizes grow past trivial. A 70B model at sequence length 4096 and batch 32 easily consumes tens of gigabytes of KV state, often exceeding the weights themselves. The bottleneck is memory, not FLOPs, and every serving system of the last three years has been a response to that fact.

PagedAttention [Kwon et al. 2023] reframed the problem by borrowing virtual memory's paging abstraction: allocate KV state in small fixed blocks, maintain a per-sequence block table, and let the attention kernel chase pointers. It works. vLLM proved as much at scale. But the cost was a rewrite of every attention kernel to accommodate non-contiguous memory, and that cost keeps compounding as attention variants multiply (MQA, GQA, sliding window, ALiBi, sparse).

vAttention [Prabhu et al. 2024, arXiv:2405.04437] argues this was the wrong abstraction. The authors lean on CUDA's low-level virtual memory APIs, , , , to obtain contiguous virtual addresses backed by physical pages allocated on demand. Kernels see flat tensors. The driver handles fragmentation. The claim: throughput up to 1.97x over vLLM on Llama-3-8B, with no attention kernel modifications.

This survey audits that claim and situates vAttention within the broader landscape of KV memory management. Is this a new architectural primitive, or is it PagedAttention with the indirection pushed one layer further down the stack? Let us begin with the roofline.

2. Background: The KV Cache Memory Problem

For a transformer layer with hidden dimension , heads, and KV heads (with under GQA), the per-token KV state is bytes in FP16, where denotes layer count. For Llama-3-70B with , , , each token costs roughly 320KB. A 4096-token request consumes 1.28GB. Batching a hundred such requests exhausts an H100's 80GB before the weights are even loaded.

The pre-PagedAttention era handled this by reserving contiguously per sequence [Yu et al. 2022, Orca]. Simple, kernel-friendly, catastrophically wasteful. Because average sequences fall far short of the maximum, utilization hovered between 20% and 40%. Internal fragmentation dominated.

PagedAttention reduced fragmentation by allocating in blocks of 16 tokens. Reported memory waste fell below 4%. The trade-off: every attention call performs a gather through a block table. Kernel authors now maintain two code paths, contiguous for prefill-friendly cases and paged for decode, or a unified paged kernel that runs slower on the contiguous path. FlashAttention's paged variant [Dao, 2023] carries measurable overhead relative to the dense version, roughly 5, 10% on long sequences in our profiling experience.

The interesting engineering lies in how that overhead compounds across thousands of kernel launches per request.

3. The vAttention Proposal

vAttention's central observation: CUDA already provides a paging primitive. The driver maintains page tables. The TLB handles translation. All that remains is to decouple virtual address reservation from physical backing.

The algorithm, stated formally. Let denote the set of active sequences. For each , reserve a contiguous virtual address range of size , the maximum sequence length. No physical memory is committed. As tokens arrive, allocate physical pages of size (typically 2MB for CUDA large pages, or 64KB on newer APIs) and map them into on demand:

where is bytes per token of KV. Attention kernels receive a standard contiguous tensor pointer. The gather is performed by the hardware MMU rather than by software.

Two hiding mechanisms matter. First, background page allocation: a separate CUDA stream pre-allocates and maps pages ahead of the decode step so that never appears on the critical path. Second, deferred reclamation: when a sequence terminates, pages return to a per-size free list rather than being unmapped eagerly, since unmapping triggers a TLB flush that stalls all SMs. The authors report hiding more than 99% of allocation cost behind compute.

4. Key Approaches: A Comparative Survey

We position vAttention against seven contemporaneous works, then analyze the pattern that emerges.

SystemFragmentation StrategyKernel ImpactClaimed WasteReported Headline
Orca [Yu et al. 2022]Contiguous reservationNone60, 80%Continuous batching baseline
vLLM / PagedAttention [Kwon et al. 2023]Software paging, block tableFull kernel rewrite<4%2, 4x over Orca
FlashAttention-2 [Dao, 2023]Orthogonal (compute)IO-aware tilingN/A2x over FA-1
Sarathi-Serve [Agrawal et al. 2024]Chunked prefill + pagingBuilds on paged kernels<4%Stall-free decodes
FastGen [Ge et al. 2023]Adaptive KV evictionPer-head policyVariable50% KV reduction
H2O [Zhang et al. 2023]Heavy-hitter retentionKernel-agnosticVariable5, 10x KV compression
LightLLM (TokenAttention)Token-level pagingCustom kernels<1%Comparable to vLLM
vAttention [Prabhu et al. 2024]CUDA VMM, physical-only pagingZero modification<5%1.97x over vLLM

The taxonomy splits cleanly. Orca represents the pre-paging era: simple kernels, catastrophic waste. PagedAttention, LightLLM, and Sarathi-Serve share a software indirection layer that kernels must understand. FlashAttention-2 is orthogonal, optimizing the compute path irrespective of layout. FastGen and H2O attack the problem from the opposite end, shrinking the KV footprint via eviction rather than improving allocation. vAttention stakes out a novel position: preserve the physical-page granularity of PagedAttention while pushing the indirection into hardware.

5. Analytical Audit of vAttention's Claims

5.1 The throughput claim

The headline 1.97x on Llama-3-8B warrants scrutiny. The paper compares against vLLM v0.2.7, a version that predates several vLLM optimizations: CUDA graph capture on decode, prefix caching improvements, and the continuous batching scheduler rewrite. A fairer comparison would include vLLM v0.4+ or TensorRT-LLM's inflight batching. Theory is pleasant, but what does the profiler say once the baseline is tuned?

That said, the mechanism of improvement is real and decomposable. vAttention eliminates the block-table gather inside attention. On a decode step at sequence length 2048, the paged kernel performs roughly 128 block-table lookups per head per layer. Removing them saves per layer, summing to 1, 2ms per decode step on an 80-layer model. At a decode interval of 20ms, that yields a 5, 10% improvement. The remaining gap must originate elsewhere, most likely in better prefill throughput, since contiguous layouts enable the dense FlashAttention-2 path rather than the paged variant.

5.2 Fragmentation: has it actually gone away?

Here is where I want to push hard. PagedAttention fragments at 16-token block granularity. vAttention fragments at physical page granularity, typically 2MB. For an 8B model with per token (FP16, L=32, GQA-8), a 2MB page holds 16 tokens. Identical granularity. The waste has not moved.

The authors acknowledge this and propose a 64KB page variant using newer CUDA APIs, which would yield 0.5-token granularity. But 64KB pages stress the TLB. An H100 has a limited number of L2 TLB entries, and once the working set exceeds TLB coverage, every miss costs 100+ cycles. The paper does not report TLB miss rates. For large-batch, long-context workloads, TLB pressure should emerge as the dominant overhead. This is the missing ablation: fix batch size, sweep context length from 2K to 32K, measure page walks per attention call.

5.3 Driver-level allocation costs

is not free. On current drivers, each call takes 50, 200 microseconds depending on page size and NUMA topology. The authors hide this behind a background stream, and their measurements confirm the tactic works for typical request patterns. But consider a burst scenario: 128 new sequences arrive within a single scheduling window. That potentially yields 128 page allocations clustered in time. If the background stream falls behind demand, allocation surfaces on the critical path and P99 latency spikes.

The paper reports P50 and mean throughput. P99 decode latency under burst load is exactly the metric where vAttention's abstraction would be expected to leak. Absent that data, the evidence for production-readiness is weak.

5.4 The implicit assumption audit

Assumption 1: The CUDA virtual memory API is stable and performant across driver versions. This is a load-bearing assumption. semantics have shifted subtly between CUDA 11, 12, and 12.3. Production deployments often pin older drivers for stability. vAttention's performance on CUDA 11.8 goes unreported and is likely worse.

Assumption 2: Physical page size is a reasonable unit of fragmentation. This is violated when requests exhibit heavily skewed lengths. If 90% of requests are 200 tokens and 10% are 8000, the 2MB regime over-allocates for the short ones. PagedAttention's 16-token blocks handle this gracefully. vAttention's coarser granularity does not.

Assumption 3: The scheduler can predict allocation needs one step ahead. vAttention's background allocation presumes the scheduler knows which sequences will produce tokens on the next decode. For greedy decoding this is trivial. For speculative decoding [Leviathan et al. 2023], where acceptance rates vary per step, prediction becomes harder and pre-allocation may overshoot.

6. Theoretical Connections

vAttention is, architecturally, a rediscovery of the classic virtual memory argument advanced by [Denning, 1970] in the context of operating systems: the right abstraction is a contiguous virtual address space with hardware-assisted translation. PagedAttention was a software TLB. vAttention delegates to the hardware TLB.

This maps onto a known result. The cost of indirection in the software-TLB regime is per access in amortized terms, but the constant is large because it consumes SM registers and instruction slots. The hardware-TLB regime incurs cost with a smaller constant under a TLB hit, and under a miss due to multi-level page tables. The crossover point depends on working set size relative to TLB coverage. For KV caches that fit within TLB-covered memory (roughly bytes), vAttention wins. For larger working sets, the answer is unclear.

A complexity-theoretic statement: let be attention time, be block-table gather time, and be expected TLB translation time. PagedAttention pays . vAttention pays . The claim holds only when the TLB hit rate approaches 1, which requires .

On H100, the L2 TLB holds roughly 512 entries of 2MB each, giving 1GB of coverage. A single long-context batch blows through this. Theory predicts vAttention's advantage degrades at long context. I would wager a coffee that the paper's benchmarks do not extend past 8K tokens.

7. Gap Between Theory and Practice

The theoretical argument for vAttention is clean. The practical case is more nuanced.

Production serving stacks do not run naked PyTorch. They run TensorRT-LLM, DeepSpeed-Inference, or vLLM with layer fusion, CUDA graphs, and custom kernels. CUDA graphs are particularly troublesome for vAttention: a captured graph bakes in memory addresses, and alters what backs those addresses. The interaction is not discussed in the paper. If vAttention disables CUDA graph capture, it forfeits the 20, 30% throughput win that vLLM gets for free on modern versions.

Operational complexity also matters. PagedAttention's failure modes are software bugs: off-by-one in block table, memory leaks in reference counting. vAttention's failure modes include driver bugs, page table corruption, and NUMA misallocation on multi-socket systems. These are harder to debug. For practitioners deploying models today, the trade-off between peak throughput and operational predictability is real.

8. Limitations of Current Work

Across the surveyed papers, three gaps recur.

First, none rigorously characterize the TLB and page-walk behavior of their memory layouts. This is surprising given how central memory is to the problem. Profilers such as Nsight Compute expose these counters. Their absence from the literature suggests the community has not internalized that hardware paging carries nontrivial cost at scale.

Second, the comparison methodology is inconsistent. vAttention compares to vLLM; Sarathi-Serve compares to vLLM; Mooncake compares to vLLM. Each selects a different vLLM version, different kernels, different schedulers. No MLPerf-style controlled baseline exists. Without apples-to-apples measurement, relative numbers are unreliable.

Third, all of these systems optimize single-GPU or single-node behavior. Multi-node KV management, where fragmentation interacts with RDMA transfer granularity and NIC queue pairs, remains barely addressed.

9. Open Problems and Future Directions

Several concrete research questions emerge.

1. TLB-aware KV placement. When working sets exceed TLB coverage, can we reorder KV layout so that high-reuse heads remain in TLB-hot regions? This extends [Dao et al. 2022]'s IO-awareness to the address-translation layer.

2. Unified software/hardware paging. vAttention selects hardware, PagedAttention selects software. The optimal solution is likely hybrid: hardware paging for long-lived sequences, software paging for short bursty ones where driver allocation cost dominates.

3. Formal cost model. Can we derive a closed-form expression for as a function of page size , sequence length distribution, and hardware TLB parameters? Such a model would let practitioners select per workload rather than per vendor default.

4. Interaction with KV compression. FastGen and H2O reduce the KV footprint. How do these interact with vAttention? A smaller footprint implies better TLB coverage, yet smaller pages are wasteful. The composition is nontrivial.

5. Multi-tenant isolation. When multiple models share a GPU, how do their virtual address spaces interact? Page table contention is a known issue on multi-tenant cloud GPUs.

10. Verdict

vAttention is a moderate contribution. The engineering is clean, the insight that CUDA VMM can replace software paging is genuinely useful, and the kernel-unmodified story is compelling for practitioners shipping many attention variants. But the headline 1.97x throughput requires caveats: the baseline is old, the benchmarks are short-context, TLB effects are unmeasured, and the interaction with CUDA graphs is unexplored.

I would classify this as an engineering improvement with novel systems framing, not a new algorithmic contribution. It does not obviate PagedAttention as an abstraction; it relocates the indirection from software to hardware, with different cost trade-offs. For short-context, moderate-batch serving on H100, vAttention likely wins. For long-context, large-batch, or multi-tenant scenarios, the question remains genuinely open.

The deeper lesson practitioners should take: memory is the real constraint, and the abstraction boundary at which fragmentation is handled has first-order impact on throughput. Choose it deliberately, measure it relentlessly, and never assume the driver is free.

Key Questions for the Authors

1. What is the TLB miss rate under 16K-token batched inference, and how does it compare to the block-table gather cost in vLLM?

2. How does vAttention interact with CUDA graph capture, and what is the resulting throughput when graphs are disabled?

3. What is P99 decode latency under synthetic burst arrival (128 concurrent new sequences), with and without pre-allocation warmup?

4. Does the 1.97x gap against vLLM persist when compared against vLLM v0.5+ with full CUDA graph support?

5. What is the fragmentation rate when page size is 2MB and the workload has a bimodal length distribution (short chat plus long document)?

Reproducibility & Sources

Primary paper. Prabhu, R. Nayak, A. Mohan, J. Ramjee, R. Panwar, A. (2024). vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. arXiv:2405.04437.

Code repository. Released by Microsoft Research at github.com/microsoft/vattention (verify current location; API stability not guaranteed).

Datasets. ShareGPT conversation traces (public, sharegpt.com dumps); synthetic workloads derived from the vLLM benchmarking harness. No proprietary data is required for replication.

Reproducibility assessment.

  • Code availability: 4/5. Official repo exists but depends on specific CUDA driver versions.
  • Data availability: 5/5. Standard ShareGPT traces and synthetic generators are fully public.
  • Experimental detail sufficient: 3/5. Baseline vLLM version and CUDA graph state are under-specified; TLB counters and P99 numbers are absent.

Surveyed works cited in this review.

  • Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
  • Yu, G. et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
  • Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
  • Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
  • Agrawal, A. et al. (2024). Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.
  • Ge, S. et al. (2023). Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (FastGen).
  • Zhang, Z. et al. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. NeurIPS 2023.
  • Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
  • Denning, P. J. (1970). Virtual Memory. ACM Computing Surveys.