PagedAttention and vLLM serve Large Language Models faster and cheaper

PagedAttention and vLLM are two technologies that make serving large language models (LLMs) faster and more efficient. PagedAttention, inspired by the classic virtual memory and paging techniques of operating systems, significantly reduces the memory consumption of LLM serving, while vLLM, an open-source library, enables fast and simple serving of LLMs using PagedAttention.

Together, PagedAttention and vLLM address the challenges of LLM deployment, such as inefficient memory use, variable-sized input prompts, sequential token generation, and small batch sizes (the batch size is the number of inputs an LLM processes simultaneously).

The new approach was proposed by a research team from UC Berkeley, Stanford University, and UC San Diego. vLLM’s source code is publicly available on GitHub.

vLLM delivers a remarkable performance boost, achieving a 2-4x improvement in throughput (the number of requests that an LLM can handle per unit of time) compared to state-of-the-art systems like FasterTransformer and Orca, all while maintaining equivalent latency.

The impact of PagedAttention becomes more evident when dealing with longer sequences, larger models, and more complex decoding algorithms.

Improving LLM serving performance: batching techniques and KV cache management

Delivering high-quality LLM services to users, such as summarization, essay writing, or question answering, demands the simultaneous processing of multiple requests. This approach, known as batching, is essential for improving throughput.

  • No batching: each request is handled sequentially, requiring the model weights to be read, the computations to be performed, and the output sequence to be generated for each individual request. This approach is inefficient and time-consuming.
  • Batching: the LLM works on multiple inputs and outputs simultaneously. By loading the model only once and performing computations across multiple requests in parallel, batching improves overall efficiency (see the sketch below this list).
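
As a rough illustration, the snippet below contrasts the two approaches using the Hugging Face Transformers API; the model choice and generation settings are arbitrary, and the point is only the batched call pattern, not how vLLM itself batches requests.

```python
# Rough sketch: sequential vs. batched generation with Hugging Face Transformers.
# The model and settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # small model chosen just for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"    # decoder-only models should be left-padded for generation
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Summarize the following article: ...",
    "Write a short essay about GPUs.",
    "Question: what is a KV cache? Answer:",
]

# No batching: one full generate() call per request; every decoding step
# serves only a single request, leaving the hardware under-utilized.
sequential_outputs = [
    model.generate(**tokenizer(p, return_tensors="pt"), max_new_tokens=32)
    for p in prompts
]

# Batching: pad the prompts to a common length and decode them together, so
# each forward pass serves every request in the batch at once.
batch = tokenizer(prompts, return_tensors="pt", padding=True)
batched_outputs = model.generate(**batch, max_new_tokens=32)
print(tokenizer.batch_decode(batched_outputs, skip_special_tokens=True))
```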

Traditionally, LLMs rely on key-value (KV) caches to store the attention keys and values computed for previously processed tokens during inference. Reusing these cached values lets the model compute the attention scores that determine how much it should focus on different parts of the input and output when predicting each new token, without recomputing them from scratch.
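
The toy example below illustrates the mechanic for a single attention head, with random vectors standing in for real query/key/value projections; it is a conceptual sketch, not production attention code.

```python
# Toy single-head example of decoding with a KV cache: the key/value of each
# generated token is stored once and reused at every later step.
import numpy as np

d = 64                              # head dimension (illustrative)
k_cache, v_cache = [], []           # grows by one entry per decoded token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(q_new, k_new, v_new):
    """Cache this token's key/value, then attend over everything cached so far."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)           # (num_cached_tokens, d)
    V = np.stack(v_cache)
    scores = softmax(K @ q_new / np.sqrt(d))   # attention weights over past tokens
    return scores @ V               # context vector used to predict the next token

for _ in range(5):                  # pretend to decode 5 tokens
    q, k, v = (np.random.randn(d) for _ in range(3))
    context = decode_step(q, k, v)

print(len(k_cache))                 # the cache grows linearly with sequence length
```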

In the next picture we can see the memory layout when serving an LLM with 13B parameters on an NVIDIA A100 GPU. The parameters (gray) persist in GPU memory throughout serving, while the memory for the KV cache (red) is allocated and freed per serving request.

Left: Memory layout when serving an LLM with 13B parameters on NVIDIA A100. Right: vLLM prevents the KV cache memory from increasing rapidly, as seen in existing systems. (source: paper)

These caches often become a memory bottleneck, particularly when handling many requests simultaneously. Inefficient KV cache management, in particular memory fragmentation and redundant duplication, can severely limit the achievable batch size.

As shown in the following profiling results, vLLM puts 96.3% of the KV cache memory to use, whereas existing systems reach only 20.4%–38.2%.

Average percentage of memory waste in different LLM serving systems. (source: paper)

PagedAttention for memory optimization

To optimize the memory usage of KV caches and increase the batch size, PagedAttention applies two techniques: managing the memory at the block level and sharing the memory across different inputs. These techniques are based on the concepts of virtual memory and paging used in operating systems.

1. Block-level memory management: PagedAttention works by dividing the KV cache into smaller blocks of equal size, each containing the keys and values (KV) for a fixed number of tokens (the smallest units of text that LLMs work with). It stores the KV blocks in non-contiguous memory locations (as shown in the next picture).

PagedAttention: The memory space for KV cache is divided into blocks that do not need to be contiguous in memory space. (source: project page)

A block table maps the logical blocks of a sequence (which are next to each other) to the physical blocks in memory (which have different locations). The physical blocks are only assigned when new tokens are generated.

Example generation process for a request with PagedAttention. (source: project page)
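
The snippet below sketches this bookkeeping under simplified assumptions (a hypothetical BlockManager class, a block size of 16 tokens, and a single sequence); it mirrors the idea of on-demand, non-contiguous block allocation rather than reproducing vLLM’s actual code.

```python
# Simplified block-table bookkeeping in the spirit of PagedAttention.
# BLOCK_SIZE, the pool size, and the class are illustrative, not vLLM's code.
BLOCK_SIZE = 16                                   # tokens stored per KV block

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # pool of physical block ids
        self.block_tables = {}                    # sequence id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Return where token `position` of a sequence stores its KV vectors,
        allocating a new physical block only when the previous one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:            # crossed a logical block boundary
            table.append(self.free_blocks.pop())  # physical blocks need not be contiguous
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

manager = BlockManager(num_physical_blocks=1024)
for t in range(40):                               # generate 40 tokens for one request
    physical_block, offset = manager.append_token(seq_id=0, position=t)

print(manager.block_tables[0])                    # only 3 physical blocks used for 40 tokens
```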

2. Memory sharing allows PagedAttention to reuse the same physical blocks across different sequences when they share content, for example multiple output samples generated from the same prompt. This way, PagedAttention avoids storing duplicate copies of shared tokens and further reduces memory consumption. Sharing is especially helpful for decoding algorithms such as parallel sampling and beam search, where several candidate outputs start from the same prompt.

Example of parallel sampling from the same prompt. (source: project page)
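
The sketch below shows the idea with a hypothetical reference-counted block pool and copy-on-write semantics; the real implementation differs in detail.

```python
# Simplified reference-counted block pool with copy-on-write, in the spirit of
# vLLM's memory sharing for parallel sampling. Names and details are illustrative.
class SharedBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}                        # physical block id -> number of users

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        """A new sample from the same prompt shares the prompt's physical blocks."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)                   # same physical blocks, separate table

    def write(self, block_table, logical_idx):
        """Copy-on-write: duplicate a block only if another sequence still uses it."""
        block = block_table[logical_idx]
        if self.ref_count[block] > 1:              # still shared -> copy before modifying
            self.ref_count[block] -= 1
            block_table[logical_idx] = self.allocate()   # KV data would be copied here
        return block_table[logical_idx]

pool = SharedBlockPool(num_blocks=64)
sample_a = [pool.allocate(), pool.allocate()]      # the prompt occupies two blocks
sample_b = pool.fork(sample_a)                     # a second sample shares them for free
pool.write(sample_b, logical_idx=1)                # B generates its own token -> block copied
assert sample_a[0] == sample_b[0] and sample_a[1] != sample_b[1]
```

Here the second sample reuses the prompt’s KV blocks at no extra cost, and a private copy of a block is made only at the moment one sample needs to write into a block that is still shared.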

vLLM for fast and simple LLM deployment

vLLM is a high-throughput LLM serving engine that achieves near-zero waste in KV cache memory. It implements PagedAttention as its core attention algorithm.

The vLLM architecture has two main components:

  • A centralized scheduler to control the execution of distributed GPU workers.
  • A KV cache manager that efficiently handles the KV cache using a paged approach, powered by PagedAttention. It controls the physical KV cache memory on the GPU workers based on instructions received from the centralized scheduler.

vLLM architecture overview. (source: paper)

The scheduler and the KV cache manager work together to make vLLM a fast system for LLM inference and serving.

vLLM allows users to customize the sampling parameters for each request, such as the maximum number of generated tokens, the number of parallel output completions, and the beam width (the number of candidate sequences kept during beam search).
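
As a short usage sketch, request-level sampling parameters look roughly like this in vLLM’s Python API (the model name and parameter values are illustrative, and option names can vary between vLLM versions):

```python
# Rough usage sketch of vLLM's Python API with per-request sampling parameters.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # loads the weights and sets up the paged KV cache

prompts = [
    "Explain PagedAttention in one sentence.",
    "Write a haiku about GPU memory.",
]

sampling_params = SamplingParams(
    n=3,               # parallel completions per prompt (they share the prompt's KV blocks)
    temperature=0.8,
    top_p=0.95,
    max_tokens=128,    # cap on the length of each generated completion
)

for request_output in llm.generate(prompts, sampling_params):
    for completion in request_output.outputs:
        print(completion.text)
```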

Evaluation

The performance of vLLM was measured under different workloads, with a focus on the serving throughput. The results show that vLLM can increase the throughput of LLM serving by 2-4 times with the same latency, compared to the best existing systems, such as FasterTransformer and Orca.

FasterTransformer, Orca, and vLLM: single-sequence generation with OPT models on the ShareGPT and Alpaca datasets. (source: paper)

The throughput of vLLM was measured and compared with HuggingFace Transformers (HF), the most popular LLM library, and HuggingFace Text Generation Inference (TGI), the previous best system. The experiments were done in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). vLLM achieved up to 24x higher throughput than HF and up to 3.5x higher throughput than TGI (see the next pictures).

Serving throughput when each request asks for one output completion. vLLM achieved the highest throughput. (source: project page)
Serving throughput when each request asks for three parallel output completions. vLLM achieved the highest throughput. (source: project page)

The implementation of PagedAttention has yielded remarkable results, enabling near-zero waste in KV cache memory.

Conclusion

PagedAttention and vLLM manage the memory for LLM serving more efficiently, and solve the problems of fragmentation and redundant duplication. They achieve 2-4 times higher throughput compared to the most advanced existing systems.

The attention algorithm uses virtual memory and paging techniques inspired by operating systems to intelligently handle the dynamic memory needs for unpredictable LLM output lengths.

Learn more:

  • Paper: Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023)
  • Project page: https://vllm.ai
  • Source code: https://github.com/vllm-project/vllm
