1. Introduction

LLM inference economics hinge on one uncomfortable fact: prefill and decode are different workloads running on the same silicon. Prefill is compute-bound, saturating tensor cores at arithmetic intensities above the A100's ridge point of roughly 156 FLOPs/byte (312 TFLOPS FP16 over ~2 TB/s of HBM bandwidth). Decode is memory-bandwidth-bound, with per-token arithmetic intensity collapsing to roughly 1 FLOP/byte once the KV-cache dominates the read. Serving both on a single replica guarantees that one phase or the other always leaves the accelerator underutilized.

The field has converged on two architectural responses. *Disaggregation* [Patel et al. 2023; Zhong et al. 2024] runs prefill and decode on separate GPU pools with distinct parallelism strategies, paying a KV-cache transfer cost to decouple the phases. *Hybrid batching*, exemplified by Sarathi-Serve [Agrawal et al. 2024; arXiv:2403.02310], keeps the replica unified but chunks prefill into small pieces that piggyback onto decode iterations, so every forward pass is a mixed batch.

This review audits Sarathi-Serve against the disaggregation literature as a methodology and reproducibility report. The question is not 'which is faster' in isolation, but rather: under what workloads, SLOs, and hardware does stall-free scheduling actually hold, and when does disaggregation's additional operational complexity pay for itself?

2. Background

A decoder-only transformer with $L$ layers, hidden dimension $d$, and sequence length $n$ performs prefill in $O(L n d^2 + L n^2 d)$ FLOPs. Decode at position $t$ executes $O(L d^2 + L t d)$ FLOPs per token but reads a KV-cache of $2 L t d \cdot b$ bytes. At batch size 1, each decode step also reads every weight once while performing about 2 FLOPs per parameter, so the operational intensity of decode is approximately

$$I_{\text{decode}} \approx \frac{2\ \text{FLOPs per parameter}}{b\ \text{bytes per parameter}} = \frac{2}{b},$$

where $b$ is bytes per parameter. For FP16, $b = 2$, giving roughly 1 FLOP/byte, more than two orders of magnitude below the A100 ridge point. No amount of kernel engineering reclaims that gap at batch size 1. Increasing the decode batch size scales intensity roughly linearly until the KV-cache exceeds HBM capacity, the regime vLLM's PagedAttention [Kwon et al. 2023] targets.
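This roofline arithmetic can be checked in a few lines. The peak-throughput and bandwidth figures below are public A100-80GB specs; the intensity model (2 FLOPs per parameter per token, weights read once per batch) is the standard back-of-envelope approximation, not a measurement.

```python
# Roofline sketch: decode arithmetic intensity vs. the A100 ridge point.
PEAK_FP16_FLOPS = 312e12    # A100 dense FP16 Tensor Core peak, FLOPs/s
HBM_BANDWIDTH = 2.0e12      # A100-80GB HBM2e, bytes/s (~2039 GB/s)
RIDGE = PEAK_FP16_FLOPS / HBM_BANDWIDTH  # FLOPs/byte, ~156

def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs/byte of a decode step at a given batch size.

    Weights are read from HBM once and reused across the batch, so
    intensity grows roughly linearly in batch size until KV-cache
    reads start to dominate the traffic.
    """
    flops_per_param = 2.0 * batch_size  # ~2 FLOPs per parameter per token
    return flops_per_param / bytes_per_param
```

At FP16 and batch size 1 this yields 1 FLOP/byte against a ridge near 156; even a batch of 64 only reaches 64 FLOPs/byte, which is why decode stays bandwidth-bound in practice.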

Among the original serving systems, Orca [Yu et al. 2022] introduced iteration-level scheduling with continuous batching atop engines in the FasterTransformer mold, but still issued full prefills as single scheduler events. A long prefill stalls every co-resident decode for the full prefill duration, producing a characteristic sawtooth in the P99 inter-token latency (TBT, time-between-tokens). This is the specific tail that Sarathi-Serve attacks.

3. Key Approaches

3.1 Sarathi-Serve: chunked prefill with stall-free scheduling

The core mechanism is simple. Given a request with prompt length $P$, the scheduler splits the prefill into chunks of size $C$ and co-batches each chunk with ongoing decodes. The forward pass processes $C + B_d$ tokens in total, where $B_d$ is the number of decode-phase requests in the batch. The chunk size is chosen so that the combined batch stays within a *token budget* $T$ calibrated to the hardware's ridge point.

The stated invariant is that every scheduler iteration produces a decode token for every in-flight decode request. No decode is stalled behind a prefill. The claim of up to 5.6× throughput improvement at a fixed TBT SLO on Mistral-7B and Falcon-180B rests entirely on this invariant holding under realistic workload mixes.
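The scheduling rule can be sketched in a few lines. The names (`Request`, `plan_iteration`, `token_budget`) are illustrative, not the paper's API; the point is the invariant, decodes are admitted first and prefill chunks only consume leftover budget.

```python
from dataclasses import dataclass

@dataclass
class Request:
    remaining_prefill: int   # prompt tokens not yet prefilled
    decoding: bool = False   # True once prefill has completed

def plan_iteration(requests, token_budget):
    """Return (decode_ids, prefill_chunks) for one scheduler iteration.

    Invariant: every decoding request contributes exactly one token,
    so no decode is ever stalled behind a prefill. Prefill chunks are
    sized to whatever budget the decodes leave over.
    """
    decode_ids = [i for i, r in enumerate(requests) if r.decoding]
    budget = token_budget - len(decode_ids)  # one token per decode
    prefill_chunks = []
    for i, r in enumerate(requests):
        if r.decoding or r.remaining_prefill == 0 or budget <= 0:
            continue
        chunk = min(r.remaining_prefill, budget)
        prefill_chunks.append((i, chunk))
        budget -= chunk
    return decode_ids, prefill_chunks
```

With a budget of 257 tokens and one in-flight decode, a freshly arrived 1,000-token prompt gets a 256-token chunk this iteration and waits for subsequent iterations to finish, which is exactly where the TTFT pressure discussed below comes from.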

3.2 Splitwise: physical phase splitting

Splitwise [Patel et al. 2023; arXiv:2311.18677] takes the opposite architectural stance. Prefill runs on one GPU pool and decode on another, with the KV-cache transferred over NVLink or InfiniBand after prefill completes. The authors argue that because the two phases have distinct optimal parallelism strategies (prefill benefits from tensor parallelism for latency, decode from pipeline or data parallelism for throughput), a single replica is always a compromise. Their evaluation on Llama2-70B reports 1.4× throughput at the same cost-per-query envelope.

3.3 DistServe: goodput-optimal disaggregation

DistServe [Zhong et al. 2024; arXiv:2401.09670] formalizes the disaggregation argument around *goodput*, the rate of requests meeting both TTFT (time-to-first-token) and TBT SLOs. They show that for any fixed deployment budget, there exists an allocation of GPUs between prefill and decode pools that Pareto-dominates a co-located configuration. The theoretical argument is a bin-packing result: the co-located scheduler cannot simultaneously optimize for TTFT-critical and TBT-critical requests, because they push the batch composition in opposing directions.
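The allocation argument can be made concrete with a toy search over pool splits. The capacity model below is purely illustrative (it is not DistServe's simulator); the structure, enumerate prefill/decode splits and maximize a goodput function, is the essence of the claim.

```python
def best_split(total_gpus, goodput_fn):
    """Enumerate prefill/decode pool splits and return the best one.

    goodput_fn(p, d) -> requests/s meeting both TTFT and TBT SLOs;
    any model of the two pools can be plugged in.
    """
    return max(((p, total_gpus - p) for p in range(1, total_gpus)),
               key=lambda pd: goodput_fn(*pd))

# Toy capacity model: the prefill pool bounds TTFT-compliant admission,
# the decode pool bounds TBT-compliant token generation; system goodput
# is the bottleneck of the two.
toy_goodput = lambda p, d: min(3.0 * p, 1.0 * d)
```

Under this toy model an 8-GPU cluster is best split 2 prefill / 6 decode. A co-located deployment has no such knob: its batch composition must serve both SLOs simultaneously, which is the bin-packing tension the paper formalizes.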

3.4 LoongServe: elastic sequence parallelism

LoongServe [Wu et al. 2024; arXiv:2404.09526] addresses a different axis: long-context requests where a single prefill exceeds what one GPU can hold. It dynamically reshapes sequence parallelism across the request lifecycle, shrinking the parallel group as the KV-cache grows and arithmetic intensity drops. This is orthogonal to the chunked-versus-disaggregated debate, but it exposes an assumption shared by Sarathi-Serve and DistServe alike: that a single replica's resources remain static over a request's lifetime.

3.5 Comparison

| System | Prefill/decode coupling | Key control parameter | Claimed improvement | Baseline | Hardware |
|---|---|---|---|---|---|
| Sarathi-Serve | Co-located, chunked | Token budget $T$, chunk size $C$ | 5.6× throughput at SLO | vLLM, Orca | A100, H100 |
| Splitwise | Disaggregated pools | Pool ratio, transport | 1.4× throughput at equal cost | FasterTransformer | A100, H100 |
| DistServe | Disaggregated, goodput-optimal | Parallelism per pool | 4.48× goodput | vLLM | A100 |
| LoongServe | Elastic sequence parallelism | Parallel group size schedule | 3.85× throughput on long context | DeepSpeed-MII, vLLM | A100 |

These improvements are not directly comparable. Each paper adopts different baselines, different models, different SLO definitions, and different workload traces. This is the first methodological red flag.

4. Analysis

4.1 What Sarathi-Serve actually proves

The paper's core empirical claim holds under the workloads evaluated. Chunked prefill does eliminate the generation stalls that unified continuous batching produces, and the P99 TBT improvements are real. The token budget abstraction is clean and maps directly onto the roofline: one picks $T$ such that the forward pass lands just below the ridge point, leaving no compute on the table.

But the stall-free invariant is conditional, not universal. It assumes:

1. Admission control is willing to queue prefills. If a new request arrives and no chunk can fit into the current batch without exceeding $T$, the prefill waits. Under bursty arrival patterns, this manifests as TTFT inflation that the paper underweights. The TTFT CDFs shown are for steady-state Poisson arrivals.

2. Chunk size is tuned per hardware. The optimal $C$ depends on the ridge point, the attention kernel's efficiency at small sequence lengths, and the KV-cache read cost. A suboptimal $C$ destroys the benefit. The paper reports a sweep, but on A100 only.

3. Attention over the chunk is efficient. Computing attention for a prefill chunk requires attending to all prior tokens in the same request, which demands a mixed-length attention kernel. FlashAttention-3 [Shah et al. 2024; arXiv:2407.08608] handles this well on H100 via warpgroup specialization, but the paper's measurements predate widespread FA3 deployment and rely on FlashAttention-2 equivalents.

4.2 The disaggregation counter-argument

DistServe's goodput-optimality claim rests on the observation that co-located systems cannot independently tune the parallelism of the two phases. A Sarathi-Serve replica uses one tensor-parallel degree for both. If prefill benefits from a high degree (say TP=8) for latency while decode benefits from a lower degree with larger batches for throughput, the unified replica leaves efficiency on the table in each phase. Chunking does not recover this; it merely hides the stall.

The honest position is that Sarathi-Serve and DistServe optimize different objectives:

  • Sarathi-Serve minimizes the TBT tail at fixed replica cost, accepting TTFT degradation.
  • DistServe maximizes goodput across both SLOs at fixed cluster cost, accepting operational complexity and KV-transfer overhead.

A fair comparison requires holding the *cluster* fixed rather than the *replica*, and both papers fail to do this against each other. DistServe benchmarks against vLLM with unified scheduling; Sarathi-Serve benchmarks against vLLM and Orca but omits DistServe as a baseline. This is the ablation gap that matters most.

4.3 KV-cache transfer cost: the elephant

Disaggregation pays a KV-transfer tax. For Llama2-70B at FP16 (80 layers, GQA with 8 KV heads of dimension 128, so roughly 320 KB of KV per token), an 8,192-token prompt produces a cache of roughly 2.6 GB per request. Over a 600 GB/s NVLink link that is a few milliseconds of raw transfer; over a 200 Gb/s cross-node InfiniBand link it is closer to 100 ms, before kernel launch and serialization. Splitwise argues this is amortized over decode duration; for short outputs (say, 50 tokens at roughly 25 ms per token), a cross-node transfer can consume nearly a tenth of the decode wall-clock. For long outputs (beyond roughly 1,000 tokens), it becomes negligible.
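This back-of-envelope generalizes into a small calculator. Llama2-70B's GQA shape (80 layers, 8 KV heads of dimension 128) is public; the link speed and per-token decode latency defaults are assumptions for illustration.

```python
def kv_bytes(prompt_tokens, layers=80, kv_heads=8, head_dim=128,
             bytes_per_el=2):
    """KV-cache size in bytes: K and V tensors per layer per token."""
    return prompt_tokens * layers * kv_heads * head_dim * 2 * bytes_per_el

def transfer_fraction(prompt_tokens, output_tokens,
                      link_bytes_per_s=25e9,      # ~200 Gb/s InfiniBand
                      s_per_decode_token=0.025):  # assumed ~25 ms/token
    """Raw KV-transfer time as a fraction of decode wall-clock."""
    t_transfer = kv_bytes(prompt_tokens) / link_bytes_per_s
    t_decode = output_tokens * s_per_decode_token
    return t_transfer / t_decode
```

With these assumptions, an 8,192-token prompt with a 50-token reply spends close to a tenth of its decode wall-clock on transfer, while a 2,000-token reply amortizes the same transfer below one percent.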

The disaggregation payoff is therefore *output-length-dependent*. Chatbot workloads with median responses of a few hundred tokens sit in the ambiguous middle. Sarathi-Serve's advantage grows as outputs shorten; DistServe's grows as outputs lengthen.

4.4 Interaction with parameter-efficient serving

S-LoRA [Sheng et al. 2023; arXiv:2311.03285] introduces another dimension: serving thousands of adapter variants on a single base model. This favors co-location, because adapter switching carries latency, and adapter state should stay close to whichever phase is executing. A disaggregated system must either replicate adapter state across both pools or pay a second transfer cost. Sarathi-Serve is naturally compatible with S-LoRA; DistServe requires additional engineering. The Sarathi-Serve paper does not evaluate LoRA workloads, which is a missed opportunity given their prevalence in production.

4.5 Long-context interaction

LoongServe's contribution reveals a regime in which both Sarathi-Serve and DistServe struggle. When a single prefill requires more than one GPU's KV budget (think a 128K-token-context Llama3 variant), chunking within a single replica does not help; sequence parallelism is required, and it interacts poorly with chunked prefill's small-chunk attention kernels. Ring Attention [Liu et al. 2023; arXiv:2310.01889] solves the math, but the scheduler must now coordinate chunks across a ring of GPUs, and Sarathi-Serve's token budget analysis no longer applies directly. This is an unaddressed composition problem.

5. Methodology Audit

5.1 Method description completeness

The Sarathi-Serve paper describes the scheduler's token budget mechanism clearly but underspecifies three components:

  • Initialization of the token budget $T$. The paper provides values for Mistral-7B and Falcon-180B but no calibration procedure. A practitioner porting the system to Qwen2-72B on H100 must rediscover $T$ empirically.
  • Chunk-size schedule. Whether $C$ is fixed per request or adapted online is not fully detailed. The implementation uses a fixed $C$ derived from $T$ and the current decode batch, but the adaptation logic under a varying decode load is thin.
  • Preemption semantics. If a high-priority request arrives mid-chunk, the paper does not specify whether the chunk completes or is preempted. This matters for multi-tenant deployments.
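Given the missing calibration procedure, one first-cut heuristic (an assumption of this review, not the authors' method) is a roofline bound: take the smallest per-iteration token count at which the dense layers become compute-bound, then sweep around it.

```python
import math

def token_budget_floor(peak_flops, hbm_bytes_per_s, bytes_per_param=2.0):
    """Smallest per-iteration token count whose GEMMs clear the ridge.

    A batch of T tokens performs ~2*T FLOPs per parameter while reading
    each parameter once, so the intensity 2*T / bytes_per_param must
    exceed the ridge point peak_flops / hbm_bytes_per_s.
    """
    ridge = peak_flops / hbm_bytes_per_s
    return math.ceil(ridge * bytes_per_param / 2.0)

# Public peak numbers: A100-80GB (312 TFLOPS FP16, ~2 TB/s) and
# H100 SXM (~989 TFLOPS FP16, ~3.35 TB/s).
a100_floor = token_budget_floor(312e12, 2.0e12)    # ~156 tokens
h100_floor = token_budget_floor(989e12, 3.35e12)   # ~296 tokens
```

This only bounds where compute-boundedness begins; attention-kernel efficiency and the TBT SLO push the practical optimum elsewhere, so an empirical sweep remains necessary.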

5.2 Computational requirements

The evaluation spans A100-40GB and A100-80GB. Training-free evaluation is cheap: a full workload sweep for Mistral-7B is feasible on a single 8xA100 node in tens of GPU-hours. Falcon-180B experiments require TP=8 and run to hundreds of GPU-hours for a complete sweep. This is accessible to well-resourced academic labs but marginal for smaller groups. No H100 numbers are reported in the original release, which dates the ridge-point calibration.

5.3 Hyperparameter sensitivity

This is where the paper's transparency shines, and also where the risk for practitioners concentrates. The token budget $T$ is the single most consequential knob. Too low and prefill starves, inflating TTFT. Too high and the decode tail returns. The reported sensitivity curves show a reasonably wide window around the optimal $T$ within which degradation stays modest, which is forgiving. Outside that window, the cliff is sharp.

Chunk size $C$ is coupled to $T$ but carries its own interaction with attention kernel efficiency. Below a few hundred tokens, attention becomes launch-overhead-bound; above a couple of thousand, the chunk itself can stall decodes. A practitioner on non-A100 hardware must re-sweep.

5.4 Implementation complexity

The trickiest detail is the mixed-batch attention kernel. Co-batching a prefill chunk with decode tokens requires an attention kernel that handles variable sequence lengths per batch element and different KV-cache states. FlashAttention's varlen API handles this but does not, by default, interleave prefill and decode efficiently. Getting this wrong yields a correct-but-slow system in which the claimed throughput gains evaporate.
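One concrete piece of that plumbing is the cumulative-sequence-length metadata that varlen attention kernels consume. The sketch below follows FlashAttention's `cu_seqlens` convention; the packing order (chunks first, then decodes) is an illustrative choice, not the reference implementation's.

```python
def pack_mixed_batch(prefill_chunk_lens, num_decodes):
    """Build cu_seqlens for a batch mixing prefill chunks and decodes.

    Each prefill chunk contributes chunk_len query tokens; each decode
    contributes exactly one. A varlen kernel reads batch element i as
    tokens[cu_seqlens[i]:cu_seqlens[i+1]], so the boundaries are the
    running sum of per-element lengths.
    """
    lens = list(prefill_chunk_lens) + [1] * num_decodes
    cu_seqlens = [0]
    for n in lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens
```

For one 256-token chunk co-batched with three decodes this yields `[0, 256, 257, 258, 259]`: four batch elements of wildly different lengths in one launch, which is precisely the shape a naive kernel handles correctly but slowly.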

A second subtle failure mode involves KV-cache eviction under paged memory. Chunked prefill increases the number of scheduler iterations per request, which in turn increases page allocation events. If the paging allocator is naive, fragmentation inflates memory pressure and forces preemptions that negate the stall-free property.

5.5 Practical deployment considerations

Sarathi-Serve's operational story is genuinely simpler than disaggregation's. One replica type, one set of health checks, one autoscaling policy. Disaggregation demands managing two pools with potentially divergent scaling rates, handling KV-cache-transfer failures, and ensuring prefill/decode ratios track workload composition. For teams deploying a single model at moderate scale (tens of replicas or fewer), Sarathi-Serve is the lower-ops choice.

At hyperscale with tight goodput SLOs on both TTFT and TBT, disaggregation's independent-parallelism argument likely dominates. DistServe's goodput numbers are unattainable by any co-located scheduler bound to a single parallelism strategy.

5.6 Novelty rating

Moderate-to-significant. The chunked prefill mechanism itself is a natural extension of continuous batching and has precedent in earlier iteration-level scheduling work. What is novel is the token budget formulation that explicitly maps the scheduling decision onto the roofline, together with the empirical demonstration that the stall-free property is achievable without latency regression for realistic workloads. This is the kind of engineering contribution MLSys rewards: a clean primitive, carefully calibrated, with solid measurement. It is not a theoretical breakthrough.

6. Limitations and Open Questions

Limitations the authors do not fully address:

1. Workload distribution assumptions. All evaluations use Poisson arrivals with specific prompt/output length distributions. Real production traffic exhibits heavy-tailed prompt lengths and bursty arrivals that stress the token budget's worst-case behavior.

2. Cross-replica interactions. Single-replica analysis ignores queuing across replicas. A cluster-level comparison with DistServe is absent.

3. Low-precision regimes. FP8 and INT4 inference shift the ridge point dramatically. Token budget calibration is not revisited for these precisions.

4. Speculative decoding interaction. Speculative decoding [Leviathan et al. 2023] alters the effective decode batch composition. Chunked prefill scheduling has not been shown to compose cleanly with speculators.

Key questions

1. Does the 5.6× throughput improvement survive a head-to-head comparison with DistServe on identical cluster budgets and SLO definitions?

2. How does the optimal $T$ scale with HBM bandwidth across generations (A100 → H100 → B200)?

3. Under what prompt-length and output-length distributions does chunked prefill outperform disaggregation, and can this be characterized analytically?

4. Can the token budget $T$ be made adaptive online without destabilizing the stall-free invariant?

5. How does Sarathi-Serve compose with speculative decoding and with LoRA serving, both of which are production-critical?

7. Adoption Recommendation

Use Sarathi-Serve when:

  • TBT SLO is the binding constraint and TTFT SLO is loose.
  • Deployment is moderate scale, single model (possibly with LoRA adapters), and operational simplicity matters.
  • Hardware is A100 or H100 class with well-characterized ridge points.
  • Output lengths are short to moderate (median below a few hundred tokens).

Prefer disaggregation (DistServe, Splitwise) when:

  • Both TTFT and TBT have tight SLOs and goodput is the governing metric.
  • Cluster scale is large enough to amortize operational overhead.
  • Output lengths are long, rendering KV-transfer a small fraction of decode wall-clock.
  • Independent parallelism tuning per phase is worth the added complexity.

The interesting engineering lies in the details. Neither architecture dominates; they optimize different objectives on different workloads. The field's current framing as 'chunked versus disaggregated' obscures the real question: which constraints bind in your deployment?

8. Verdict

Sarathi-Serve is a well-executed systems contribution with a clean primitive and careful measurement. Its throughput claims hold under the evaluated conditions, and the stall-free scheduling invariant is real but conditional on workload and hardware assumptions the paper does not fully stress-test. The absence of a direct DistServe baseline is the most consequential methodological gap. For practitioners, it is a strong default for moderate-scale, TBT-sensitive deployments. For researchers, the open composition questions with speculative decoding, LoRA serving, and elastic sequence parallelism are where the next contributions lie.

Theory is nice, but what does the profiler say? In this case, the profiler says that chunked prefill works, within a bounded envelope. Outside that envelope, disaggregation earns its operational tax.

Reproducibility & Sources

Primary paper: Agrawal et al. *Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve*, arXiv:2403.02310.

Surveyed papers:

  • Agrawal et al. Sarathi-Serve, arXiv:2403.02310
  • Zhong et al. DistServe, arXiv:2401.09670
  • Patel et al. Splitwise, arXiv:2311.18677
  • Wu et al. LoongServe, arXiv:2404.09526
  • Shah et al. FlashAttention-3, arXiv:2407.08608
  • Sheng et al. S-LoRA, arXiv:2311.03285
  • Liu et al. Ring Attention, arXiv:2310.01889

Code repository: The Sarathi-Serve reference implementation is released by Microsoft Research on GitHub under the microsoft/sarathi-serve repository.

Datasets: Evaluations use synthetic Poisson traces derived from published chatbot distributions and ShareGPT-style prompt/output length profiles. ShareGPT is publicly scraped, not an official release.

Reproducibility assessment:

  • Code availability: 4/5. The reference implementation is public and integrates with common serving stacks, though production-grade hardening is left to the user.
  • Data availability: 3/5. Traces are synthetic and procedurally described, but ShareGPT-derived distributions depend on scrape snapshots that drift.
  • Experimental detail: 4/5. Token budget calibration is documented for specific hardware, but extrapolation requires re-sweeping. Preemption and priority semantics remain underspecified.