vllm
Auto-generated from vllm-project/vllm by Mutable.ai Auto Wiki

| vllm | |
|---|---|
| GitHub Repository | |
| Developer | vllm-project |
| Written in | Python |
| Stars | 17k |
| Watchers | 171 |
| Created | 02/09/2023 |
| Last updated | 04/03/2024 |
| License | Apache License 2.0 |
| Homepage | docs.vllm.ai |
| Repository | vllm-project/vllm |

| Auto Wiki | |
|---|---|
| Revision | |
| Software Version | 0.0.4 |
| Generated from | Commit c64cf3 |
| Generated at | 04/03/2024 |
The vLLM repository is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It provides a suite of tools and frameworks that enable efficient execution and management of language models, particularly in resource-intensive scenarios. Engineers can use the repository to deploy and serve LLMs at scale for tasks such as natural language processing, text generation, and language understanding.
The most significant parts of the repository include the attention mechanisms, model execution and management, speculative decoding, transformers utilities, and API entry points. The attention mechanisms, located in `…/attention`, are crucial for performance because they determine how the model selectively focuses on different parts of the input. The repository implements several attention backends and operations, such as `FlashAttentionBackend`, `TorchSDPABackend`, and `XFormersBackend`, each optimized for different hardware and use cases. The attention layer integration is modular, allowing for easy extension and customization.
Model execution and management are handled through a combination of specialized layers, model architectures, parallel processing utilities, guided decoding, and weight management, found in `…/model_executor` and `…/worker`. These components work together to load, execute, and manage large language models across different hardware backends, including CPUs and Neuron devices. Executors such as `GPUExecutor` and `NeuronExecutor` manage the execution strategy, ensuring that models run efficiently on the intended hardware.
Speculative decoding, detailed in `…/spec_decode`, reduces per-token latency by using a smaller model to propose speculative tokens, which are then scored by a larger model. This directory contains the core components, such as `SpecDecodeWorker` and `BatchExpansionTop1Scorer`, that implement this functionality.
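The propose-then-verify loop can be sketched with toy stand-ins for the draft and target models (all names and logic below are illustrative, not vLLM's `SpecDecodeWorker` API):

```python
def propose(prefix, k):
    """Toy draft model: propose k consecutive next tokens."""
    return [prefix[-1] + i + 1 for i in range(k)]

def score(prefix, proposal):
    """Toy target model: its greedy choice never exceeds token value 10."""
    return [min(prefix[-1] + i + 1, 10) for i in range(len(proposal))]

def speculative_step(prefix, k=4):
    """Accept the longest prefix of the proposal the scorer agrees with;
    on the first disagreement, take the scorer's token and stop."""
    proposal = propose(prefix, k)
    target = score(prefix, proposal)
    out = list(prefix)
    for p, t in zip(proposal, target):
        out.append(t)
        if p != t:
            break
    return out

low = speculative_step([3], k=4)    # all 4 draft tokens accepted
high = speculative_step([9], k=4)   # disagreement after the first token
```

When the draft agrees with the target, a single step emits several tokens for one large-model pass; on disagreement, generation falls back to the target model's choice, which is the source of the latency savings.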
Transformers utilities, located in `…/transformers_utils`, provide essential support for managing transformers-based language models, including configuration management, tokenizer management, and incremental detokenization. The `Detokenizer` class, for instance, is key to decoding model outputs into human-readable text.
The API entry points, found in `…/entrypoints`, offer a way to interact with the vLLM system. The OpenAI-compatible API server, implemented in the `api_server.py` file, sets up a FastAPI application with endpoints for handling language model requests. The `LLM` class serves as the primary interface for generating text with the vLLM engine.
Key technologies the repository relies on include CUDA for GPU acceleration, PyTorch for model operations, and the Triton language for custom kernel development. The design emphasizes modularity, extensibility, and efficiency: the attention mechanism allows easy integration of different backends, and the model executor supports various model architectures and parallel processing strategies.
In summary, the vLLM repository is a comprehensive solution for deploying and serving large language models, with a focus on performance and resource efficiency. It combines advanced computing techniques with careful design to provide a robust engine for LLM inference and serving.
Attention Mechanisms
References: vllm/attention
The `…/attention` directory implements the attention mechanisms used by vLLM, supporting variants such as multi-head and grouped-query attention. Attention is a critical component of language models, enabling the model to focus on different parts of the input sequence when making predictions.
Attention Backends
References: vllm/attention/backends
The `…/backends` directory hosts the implementations of the various attention mechanisms, central to the operation of large language models. It includes abstract base classes that define a common interface for attention operations and concrete classes that provide specific attention computation strategies.
Attention Operations
References: vllm/attention/ops
The `PagedAttention` class in `…/paged_attn.py` manages the paged attention mechanism, which is crucial for handling large key-value caches in transformer models. The class offers static methods for cache shape computation, cache splitting, and forward attention operations. Notably, `get_kv_cache_shape()` calculates the cache's dimensions, while `split_kv_cache()` divides the cache into key and value components. The `forward_decode()` method orchestrates the attention computation, choosing between two custom CUDA kernels based on the model's context length and other factors.
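The paged-KV-cache idea underlying `PagedAttention` can be sketched as a block table that maps a sequence's logical token positions to physical cache blocks (block size, allocator, and class names below are illustrative, not vLLM's implementation):

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.blocks = []        # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full (or first token)
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, pos):
        """Physical slot index of logical position `pos`."""
        block = self.blocks[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

class Allocator:
    """Hands out free physical blocks (here: from the top of the pool)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def allocate(self):
        return self.free.pop()

alloc = Allocator(num_blocks=8)
table = BlockTable(alloc)
for _ in range(20):             # 20 tokens span two 16-token blocks
    table.append_token()
```

Because physical blocks need not be contiguous, the cache can grow one block at a time and be shared or freed at block granularity, which is what makes the scheme memory-efficient.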
Attention Layer Integration
References: vllm/attention/layer.py, vllm/attention/selector.py
The `Attention` class in `…/layer.py` serves as a flexible attention layer within PyTorch models, accommodating mechanisms such as multi-head and grouped-query attention. It is designed to integrate seamlessly with different backends, optimizing performance for the hardware and data types in use.
Attention Mechanism Extensibility
References: vllm/attention/__init__.py
The `__init__.py` in the `…/attention` directory establishes a modular framework for integrating various attention mechanisms into the vLLM system. It exposes the essential classes and functions that make the attention mechanism extensible.
Model Execution and Management
References: vllm/executor, vllm/model_executor, vllm/worker
Execution of large language models within the vLLM system is managed through a series of executor classes that interface with various hardware backends. These executors handle the complexities of model execution, including asynchronous operations and different execution strategies.
Specialized Layers and Utilities
References: vllm/model_executor/layers
The vLLM project uses a variety of specialized layers and utilities to construct and manage large language models. These components serve as essential building blocks and are key to optimizing performance.
Model Architectures
References: vllm/model_executor/models
The vLLM project supports a variety of large language model architectures, each tailored with specific features and design choices to address different aspects of language processing and generation.
Parallel Processing and Distribution
References: vllm/model_executor/parallel_utils
The `…/parallel_utils` directory contains utilities that facilitate parallel processing and distribution of computations across multiple GPUs. These utilities are essential for efficient operation of the vLLM system, particularly for large-scale models that must be distributed over several hardware units.
Guided Decoding
Guided decoding in the vLLM system is implemented through logits processors that constrain text generation using regular expressions, JSON schemas, or context-free grammars. The processors bias the logits output by the language model, steering generation towards text that adheres to the specified pattern or structure.
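The core trick can be sketched as a logits processor that masks tokens that would break the constraint (a deliberately tiny vocabulary and a toy alternating letter/digit "grammar" here, standing in for the regex/JSON/grammar processors):

```python
import math

VOCAB = {0: "a", 1: "b", 2: "1", 3: "2"}   # toy token id -> text mapping

def allowed(prev_text, piece):
    """Toy constraint: letters and digits must alternate, starting with a letter."""
    want_letter = len(prev_text) % 2 == 0
    return piece.isalpha() if want_letter else piece.isdigit()

def process_logits(generated_text, logits):
    """Bias logits: set disallowed tokens to -inf so sampling cannot pick them."""
    return [
        logit if allowed(generated_text, VOCAB[tid]) else -math.inf
        for tid, logit in enumerate(logits)
    ]

# After generating "a", only digit tokens remain viable.
out = process_logits("a", [0.5, 0.1, 0.2, 0.3])
```

Because the bias is applied before sampling, any decoding strategy (greedy, top-p, beam search) automatically respects the constraint.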
Model Loading and Weight Management
References: vllm/model_executor/model_loader.py, vllm/model_executor/neuron_model_loader.py, vllm/model_executor/sampling_metadata.py, vllm/model_executor/weight_utils.py
The `get_model()` function in `…/model_loader.py` is the central mechanism for loading models with the appropriate configuration. It determines the model architecture from a `ModelConfig` object and initializes the model with either LoRA configurations or standard parameters, depending on the model's capabilities. For vision-language models, specific configurations are passed to the constructor. The function also handles quantization by setting the default PyTorch data type before instantiation. If a "dummy" load format is specified, the model weights are initialized with random values for performance benchmarking. Once the model is created and its weights are loaded, it is set to evaluation mode before being returned.
Hardware Backend Execution
References: vllm/worker
The execution of models across different hardware backends is managed by a set of classes in the `…/worker` directory. These classes handle the intricacies of each backend, such as CPUs and Neuron devices, ensuring that models execute efficiently regardless of the underlying infrastructure.
Executors and Strategy Management
References: vllm/executor
Executors in `…/executor` manage the execution of models across various hardware backends. The directory includes abstract base classes and concrete implementations for CPU, GPU, and Neuron devices, as well as a Ray-based executor for distributed environments.
Speculative Decoding
References: vllm/spec_decode
Speculative decoding in the vLLM project uses a smaller "draft" model, the proposer, to generate preliminary token predictions. These speculative tokens are then evaluated by a larger language model, the scorer, to preserve the quality of the generated text. The approach reduces per-token latency by quickly proposing tokens that are likely to be correct and spending the larger model's compute only on scoring those proposals.
Core Speculative Decoding Components
The `SpecDecodeWorker` orchestrates speculative decoding by coordinating the draft model (the proposer) and the larger language model (the scorer). The proposer generates speculative tokens, which the scorer evaluates to determine the most probable tokens. A rejection sampler then filters the proposed tokens based on their acceptance probability.
Batch Expansion Scoring
References: vllm/spec_decode/batch_expansion.py
`BatchExpansionTop1Scorer` implements the `SpeculativeScorer` interface and scores the speculative tokens generated during speculative decoding. Scoring is done via batch expansion, a technique used when a multi-query attention (MQA) kernel is not available: multiple query positions are transformed into a format that can be processed as a single query position per sequence.
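The data-layout transformation can be sketched as flattening each sequence's k draft tokens into k single-query scoring entries (purely illustrative lists, not vLLM's tensors):

```python
def expand_batch(sequences, drafts):
    """For each sequence, emit one scoring entry per draft token.

    Without an MQA kernel the scorer handles one query position per
    sequence, so a sequence with k draft tokens becomes k expanded
    sequences: entry i is the context for scoring draft token i.
    """
    expanded = []
    for seq, draft in zip(sequences, drafts):
        for i in range(len(draft)):
            expanded.append(seq + draft[:i])   # prefix + first i draft tokens
    return expanded

# One sequence [1, 2] with three draft tokens [7, 8, 9]:
batch = expand_batch([[1, 2]], [[7, 8, 9]])
```

After the scorer runs over the expanded batch, the per-entry outputs are folded back into per-sequence acceptance decisions.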
Metrics and Performance Tracking
References: vllm/spec_decode/metrics.py
The `SpecDecodeWorkerMetrics` dataclass and the `AsyncMetricsCollector` class provide a structured approach to tracking the performance of speculative decoding.
Multi-Step Decoding Workflow
References: vllm/spec_decode/multi_step_worker.py
The `MultiStepWorker` class extends the `Worker` class to make model inference more efficient by running multiple forward passes within a single invocation, minimizing the scheduling overhead of issuing each forward pass separately.
Utility Functions for Speculative Decoding
References: vllm/spec_decode/util.py
The `util.py` module in `…/spec_decode` provides utility functions that support the speculative decoding process, including handling sequence data, converting sampling outputs to PyTorch tensors, and profiling CUDA operations.
Transformers Utilities
References: vllm/transformers_utils
The `…/transformers_utils` directory is a hub for utilities that support interaction with transformer-based language models. A key part of this suite is tokenizer management, which converts text into a format the models can process. The `BaichuanTokenizer` class, for example, provides methods for text tokenization and vocabulary management, fundamental for preparing input data for language models (see Tokenization Utilities).
Configuration Management
References: vllm/transformers_utils/configs
The `…/configs` directory houses classes for configuring various language models. Each class encapsulates the parameters that define a model's architecture and behavior.
Tokenizer Management
References: vllm/transformers_utils/tokenizer_group
Tokenizers are managed through the `BaseTokenizerGroup` class, which defines an interface for encoding text prompts and handling LoRA adapters. The `TokenizerGroup` class provides the default implementation, maintaining a cache of LoRA-enabled tokenizers and encoding text prompts. For distributed environments, `RayTokenizerGroupPool` uses the Ray framework to pool tokenizer instances for asynchronous tokenization.
Tokenization Utilities
References: vllm/transformers_utils/tokenizers
The `BaichuanTokenizer` class, defined in `…/baichuan.py`, handles text tokenization and vocabulary management. It is the primary interface for converting raw text into a format suitable for language model processing and is initialized with parameters for special tokens such as `unk_token`, `bos_token`, `eos_token`, and `pad_token`.
Model Configuration Loading
References: vllm/transformers_utils/config.py
Model configuration loading is managed by `…/config.py`, which provides a registry and a set of functions for handling the configuration of various language models.
Incremental Detokenization
References: vllm/transformers_utils/detokenizer.py
The `Detokenizer` class transforms the output of language models into human-readable text. It uses a tokenizer group to decode sequences incrementally, which is crucial for real-time text generation and streaming outputs. The class is designed to work with tokenizers that may have an expanded vocabulary, ensuring that custom tokens are handled correctly during detokenization.
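A common way to detokenize incrementally is to decode the full sequence and the previously decoded prefix, then emit only the textual delta, so multi-byte characters never appear half-formed. A sketch with a toy byte-level "tokenizer" (not the vLLM class):

```python
# Toy "tokenizer": each token id is one byte; decoding is UTF-8.
def decode(token_ids):
    return bytes(token_ids).decode("utf-8", errors="ignore")

def detokenize_incrementally(all_ids, prev_len):
    """Return the newly readable text since the first `prev_len` tokens."""
    prev_text = decode(all_ids[:prev_len])
    full_text = decode(all_ids)
    return full_text[len(prev_text):]

# "é" is two UTF-8 bytes (0xC3 0xA9): after the first byte nothing is
# emitted; the character appears only once both bytes are available.
first = detokenize_incrementally([0xC3], 0)
second = detokenize_incrementally([0xC3, 0xA9], 1)
```

Streaming APIs rely on exactly this property: each chunk sent to the client is guaranteed to be valid text even when a character spans token boundaries.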
Tokenizer Caching and Retrieval
References: vllm/transformers_utils/tokenizer.py
Tokenizers convert raw text into a format that models can consume. In the vLLM project, efficient retrieval and caching of tokenizers is an important performance optimization, especially for large-scale language models.
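The caching pattern can be sketched with `functools.lru_cache`, so repeated requests for the same tokenizer return the same instance instead of reloading it (illustrative, not vLLM's exact code):

```python
from functools import lru_cache

class ToyTokenizer:
    """Stand-in for an expensive-to-construct tokenizer."""
    def __init__(self, name):
        self.name = name

@lru_cache(maxsize=None)
def get_tokenizer(name, revision=None):
    """Construct a tokenizer once per (name, revision) and reuse it."""
    return ToyTokenizer(name)

a = get_tokenizer("gpt2")
b = get_tokenizer("gpt2")    # cache hit: same object
c = get_tokenizer("llama")   # different key: new object
```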
Usage Statistics Collection
References: vllm/usage
The `…/usage` directory handles the collection and reporting of usage statistics for the vLLM system. Its primary class, `UsageMessage` in `…/usage_lib.py`, aggregates platform-specific and vLLM-related data and sends it to a centralized server for analysis.
API and Entry Points
References: vllm/entrypoints
Routes in `…/api_server.py` are managed by the `FastAPI` application, which exposes endpoints for health checks, chat completions, and text completions. The application is augmented with middleware for authentication and Prometheus metrics. An `AsyncLLMEngine` instance, initialized from command-line arguments, manages model execution.
OpenAI-Compatible API Server
References: vllm/entrypoints/openai/api_server.py, vllm/entrypoints/openai/cli_args.py, vllm/entrypoints/openai/protocol.py, vllm/entrypoints/openai/serving_chat.py, vllm/entrypoints/openai/serving_completion.py, vllm/entrypoints/openai/serving_engine.py
The OpenAI-compatible API server is established in `api_server.py`, which configures a FastAPI application to serve chat and text completion requests. The server is customizable via command-line arguments defined in `cli_args.py`, covering settings such as host, port, and log level. Authentication is handled by middleware that validates an API key on incoming requests, ensuring secure access.
LLM Interface
References: vllm/entrypoints/llm.py
The `LLM` class, located in `…/llm.py`, is the primary interface for text generation in the vLLM project, encapsulating the components necessary for generating text from prompts.
Command-Line Interface
References: vllm/entrypoints/openai/cli_args.py
The command-line interface for the vLLM OpenAI-compatible API server is configured in `…/cli_args.py`. It uses the `argparse` library to define and parse command-line options that control the server's runtime behavior, allowing customization of network settings, logging, security, and model-specific parameters.
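An `argparse` surface of the kind described might look like the following sketch (only `--host` and `--port` are standard server flags; the other flag names are assumptions for illustration):

```python
import argparse

def make_parser():
    parser = argparse.ArgumentParser(description="OpenAI-compatible API server")
    parser.add_argument("--host", type=str, default=None, help="bind address")
    parser.add_argument("--port", type=int, default=8000, help="bind port")
    parser.add_argument("--api-key", type=str, default=None,
                        help="require this key in the Authorization header")
    parser.add_argument("--log-level", type=str, default="info",
                        choices=["debug", "info", "warning", "error"])
    return parser

# Parse a sample command line instead of sys.argv for demonstration.
args = make_parser().parse_args(["--port", "9000", "--log-level", "debug"])
```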
API Protocols and Data Models
References: vllm/entrypoints/openai/protocol.py
In `…/protocol.py`, a set of data models defines the structure and protocols for API interactions. These models handle requests and generate responses that conform to the OpenAI API specification.
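The request shape can be sketched with dataclasses (the real file uses pydantic models; the field names below follow the OpenAI chat-completion wire format, and the subset shown is illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChatMessage:
    role: str        # "system" | "user" | "assistant"
    content: str

@dataclass
class ChatCompletionRequest:
    model: str
    messages: List[ChatMessage]
    max_tokens: Optional[int] = None
    temperature: float = 1.0
    stream: bool = False

req = ChatCompletionRequest(
    model="example-model",
    messages=[ChatMessage(role="user", content="Hello")],
    max_tokens=32,
)
```

Using typed models like these lets the server validate incoming JSON and serialize responses without hand-written parsing code.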
API Serving Classes
References: vllm/entrypoints/openai/serving_chat.py, vllm/entrypoints/openai/serving_completion.py, vllm/entrypoints/openai/serving_engine.py
The `OpenAIServingChat` class extends `OpenAIServing` to manage chat completion requests. It supports streaming and non-streaming responses, using `AsyncLLMEngine` to interact with the language model. The class validates requests, applies chat templates, and orchestrates response generation.
Testing and Benchmarking
References: tests, benchmarks
The vLLM system is validated through unit, integration, and end-to-end tests that verify the functionality and performance of its components and features.
Core Testing Framework
References: tests/async_engine, tests/core, tests/tokenization
The vLLM project's core testing framework validates the asynchronous engine, the block management system, and the tokenization processes. Key components under test include `AsyncLLMEngine`, `RequestTracker`, and `Detokenizer`.
Distributed Systems Testing
References: tests/distributed
The `…/distributed` directory validates the distributed capabilities of the vLLM project, focusing on tensor-parallel operations and the `NCCLCommunicator` class. The tests ensure that distributed communication operations such as all-reduce, all-gather, and broadcast function correctly across multiple GPUs, which is essential for scalability.
Speculative Decoding Tests
References: tests/spec_decode, tests/spec_decode/e2e
The speculative decoding feature is tested through several classes within the `…/` directory, each focusing on a specific aspect of the functionality.
Sampling and Generation Tests
References: tests/samplers
The `…/samplers` directory validates the sampling techniques used for text generation in the vLLM library, including beam search, logprob retrieval, token ranking, rejection sampling, and seeded random sampling. The tests ensure that the library's sampling methods produce consistent and correct results.
Model Output Comparison Tests
References: tests/models
In `…/models`, test suites validate output consistency between Hugging Face (HF) and vLLM models across different language models. These tests verify that vLLM's implementations align with the established HF outputs, using greedy sampling as the basis for comparison.
Quantization and Configuration Tests
References: tests/quantization
In `…/quantization`, the test suite validates quantization-type identification for Marlin models loaded from autogptq configurations. The primary file, `…/test_autogptq_marlin_configs.py`, contains a `ModelPair` dataclass holding model IDs for the Marlin and GPTQ versions, and a `MODELS_QUANT_TYPE` list mapping model IDs to their expected quantization types.
Block Management End-to-End Tests
References: tests/core/block/e2e
End-to-end tests in `…/e2e` validate `BlockSpaceManagerV2` by comparing its behavior to the previous `BlockSpaceManagerV1`. These tests ensure that memory block management is handled correctly, especially when transitioning to the newer block manager.
Performance Benchmarking Scripts
References: benchmarks/kernels
The `…/kernels` directory hosts scripts that benchmark key kernels of the vLLM system, letting developers assess and optimize performance across different configurations.
API Server and Entry Points Tests
References: tests/entrypoints
In `…/entrypoints`, the `test_openai_server.py` file runs a series of tests against the OpenAI-compatible API server, validating the server's handling of the operations central to the vLLM system. The `ServerRunner` class, built on `ray.remote`, initializes and manages the server process for testing.
Metrics and Usage Statistics Tests
References: tests/metrics
In `…/test_metrics.py`, two test cases validate the accuracy of metrics that track token counts during generation: `counter_prompt_tokens` counts the tokens provided as prompts to the model, while `counter_generation_tokens` counts the tokens the model generates in response.
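The two counters can be sketched as monotonically increasing totals updated per request (a minimal stand-in for the Prometheus counters, not vLLM's metrics code):

```python
class Counter:
    """Minimal monotonic counter in the spirit of a Prometheus Counter."""
    def __init__(self, name):
        self.name = name
        self.value = 0
    def inc(self, amount):
        assert amount >= 0, "counters only go up"
        self.value += amount

counter_prompt_tokens = Counter("prompt_tokens_total")
counter_generation_tokens = Counter("generation_tokens_total")

def record_request(prompt_ids, output_ids):
    """Account one request's token usage against both counters."""
    counter_prompt_tokens.inc(len(prompt_ids))
    counter_generation_tokens.inc(len(output_ids))

record_request([1, 2, 3], [4, 5])
record_request([6], [7, 8, 9])
```

Tests of this kind simply run a generation and assert that the counter deltas equal the known prompt and output lengths.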
Prefix Caching and Block Allocation Tests
References: tests/prefix_caching
The `…/prefix_caching` directory validates the `CachedBlockAllocator` class from the `vllm.core.block_manager_v1` module, which is integral to efficient memory management of `PhysicalTokenBlock` objects during model execution. The tests ensure the caching system operates correctly, balancing memory usage against performance.
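The reuse idea being tested can be sketched as an allocator keyed by a hash of a block's token contents, with refcounts deciding when a block is truly free (an illustrative sketch, not the `CachedBlockAllocator` implementation):

```python
class PrefixCachingAllocator:
    """Hand out blocks; identical token contents share one physical block."""
    def __init__(self):
        self.next_id = 0
        self.cache = {}      # content hash -> physical block id
        self.refcount = {}   # block id -> number of sequences using it

    def allocate(self, token_ids):
        key = hash(tuple(token_ids))
        if key in self.cache:                 # prefix hit: reuse the block
            block = self.cache[key]
            self.refcount[block] += 1
        else:                                 # miss: allocate a fresh block
            block = self.next_id
            self.next_id += 1
            self.cache[key] = block
            self.refcount[block] = 1
        return block

    def free(self, block):
        self.refcount[block] -= 1             # block reclaimable at refcount 0

alloc = PrefixCachingAllocator()
a = alloc.allocate([1, 2, 3])
b = alloc.allocate([1, 2, 3])   # same contents: cache hit, shared block
c = alloc.allocate([9, 9, 9])   # different contents: new block
```

Tests then assert exactly these properties: identical prefixes map to the same block, distinct prefixes do not, and refcounts track the number of sharers.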
Basic Correctness and Compatibility Tests
References: tests/basic_correctness
The `…/basic_correctness` directory contains a suite that validates vLLM model outputs against Hugging Face models using greedy sampling. The primary script, `…/test_basic_correctness.py`, compares the generated text and token IDs from both model types across a range of configurations.