Auto-generated from vllm-project/vllm by Mutable.ai Auto Wiki
Apache License 2.0
The vLLM repository provides a library and components for serving large language models (LLMs) to generate text. It implements model loading, efficient parallel execution on GPUs, scheduling algorithms, key-value caching, and serving interfaces.
The key components include:
…/model_executor provides model loading from checkpoints, execution on GPUs with optimizations like quantization and tensor parallelism, and sampling of text outputs. It leverages optimized CUDA kernels from the csrc directory.
…/core handles scheduling sequences of tokens using policies like first-come-first-served. It manages logical token blocks and maps them to physical blocks in a key-value cache.
…/engine enables asynchronous request processing by running models concurrently and tracking requests.
…/entrypoints implements REST APIs for text generation by initializing models and handling requests.
examples contains examples for model interaction, such as web UIs, offline use, and API clients.
The library achieves high throughput via continuous batching, which keeps GPU batches full by admitting new sequences as others finish, amortizing per-step overhead. Quantization reduces model size. Configuration classes encapsulate settings. Tensor parallelism allows distributed execution. Asynchronous engines manage concurrent requests.
The APIs provide text completion and conversational modeling. Clients interact via streaming or non-streaming modes. Scheduling maximizes resource utilization across sequences. Caching avoids redundant computations. Kernels optimize critical model operations on GPUs.
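The effect of continuous batching can be illustrated with a toy simulation. All names here are hypothetical simplifications; the real scheduler operates on sequence groups and token blocks rather than bare step counts.

```python
import collections

def run_continuous_batching(requests, max_batch=4):
    """Toy simulation: each request needs some number of decode steps.
    Finished requests free their batch slot immediately, and waiting
    requests join on the next step, keeping the batch full."""
    waiting = collections.deque(requests)   # (request_id, steps_needed)
    running = {}                            # request_id -> steps remaining
    forward_passes = 0
    while waiting or running:
        # Admit waiting requests into free slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        # One forward pass decodes one token for every running request.
        forward_passes += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]            # slot freed mid-batch
    return forward_passes

# Six requests of mixed lengths share four slots in only three passes.
passes = run_continuous_batching(
    [("r0", 3), ("r1", 1), ("r2", 2), ("r3", 1), ("r4", 2), ("r5", 1)])
print(passes)
```

Because short requests vacate their slots mid-run, the ten total decode steps fit into three batched forward passes instead of two separate static batches.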
The core functionality covered under the Text Generation section is generating text from pretrained language models through various components. Key aspects include interfaces for text generation through API endpoints and Python clients, as well as offline generation capabilities without a server.
…/entrypoints directory provides REST API endpoints for generating text with vLLM. The
…/api_server.py file implements a FastAPI server defining an endpoint at
/generate that handles generation requests. It supports both streaming and non-streaming response modes. The
…/llm.py file defines the LLM class that handles the core model execution, taking prompts and sampling parameters as input and returning the generated outputs.
examples directory contains Python clients that interface with the model APIs to perform common natural language tasks. The
…/api_client.py file provides a client that can generate text by sending prompts to the vLLM API server endpoint. It handles both streaming and non-streaming response modes. The
…/offline_inference.py example generates text from prompts offline without a server by initializing a model and calling its generation method.
Some key implementation details:
The LLM class encapsulates the core generation logic: its generate method queues prompts with the engine, then runs the engine until all pending requests complete and returns their outputs.
The API server in
…/api_server.py delegates generation to the engine. It yields responses from a generator for streaming requests, or collects all outputs for non-streaming ones.
The API client abstracts request construction and response parsing behind helper methods that handle the API interaction.
…/entrypoints directory provides a REST API for generating text completions from prompts. At its core, the
…/api_server.py file implements endpoints that support the OpenAI text completion and chat completion APIs. These endpoints leverage several important classes and functions defined in other files.
…/api_server.py file handles generation requests for these endpoints. It first validates the request, then calls a generation method to produce the completions.
For streaming responses, it yields formatted responses continuously. For non-streaming, it collects all outputs and returns the full response.
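The streaming versus non-streaming split can be sketched as follows, with a stub token generator standing in for the engine; all function names here are illustrative, not the server's actual API.

```python
import json

def generate_tokens(prompt):
    """Stub standing in for the engine's per-token output stream."""
    for word in ["Hello", ",", " world"]:
        yield {"text": word}

def stream_response(prompt):
    """Streaming mode: yield one JSON-formatted chunk per token,
    mirroring how the server yields from a generator."""
    for chunk in generate_tokens(prompt):
        yield json.dumps(chunk) + "\n"

def full_response(prompt):
    """Non-streaming mode: drain the generator and return one payload."""
    text = "".join(chunk["text"] for chunk in generate_tokens(prompt))
    return json.dumps({"text": text})

chunks = list(stream_response("Hi"))
final = full_response("Hi")
print(chunks, final)
```

The same generator backs both modes; only the response shaping differs, which is why the server can share one generation path.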
…/openai_completion_client.py file demonstrates interacting with these endpoints through a client to generate completions from prompts.
This section focuses on components for conversational text generation. The key functionality is implemented in two files.
…/api_server.py file implements a FastAPI application to serve the API endpoints. It contains a serving class used to generate responses for chat completions; this class loads the language model and defines a method that takes a chat request and returns responses.
…/openai_chatcompletion_client.py file defines a client that can interact with the chat completion endpoint. It initializes a client with the API URL and key. The client can list available models and perform chat completions by passing the conversation and the selected model.
This serving class is the main way of interacting with the language model to generate conversational responses. Its handler takes the chat request containing an initial conversation and returns responses by:
- Loading the requested model
- Iterating through each message in the conversation
- Generating a response for each message using the model
- Formatting the responses
…/gradio_webserver.py shows how to build a basic web interface for interacting with a text completion model using Gradio. It defines a function that takes a prompt string as input, makes an HTTP request to the specified model API endpoint, and returns the completed text output.
This function is used with Gradio's context manager to create two widgets - a text input box and output box. The input box is wired to call the function, passing the result to the output box. This allows live interactions with the model via the web interface.
When run, the file parses command line arguments for the host, port, and model URL. It builds the Gradio demo with the function and launches a web server at the given URL, enabling sharing of the demo. This provides an easy way to deploy a web frontend for a text completion model without additional server programming.
This code enables running models for text generation without requiring an online server. The
…/offline_inference.py file demonstrates how to initialize a pretrained model locally and generate text offline. Sample prompts are first defined as a list of strings. A SamplingParams object then specifies parameters for text generation. An LLM is initialized with the pretrained "opt-125m" model from Facebook. Calling its generate method with the prompts and parameters returns output objects containing the prompt, generated text, and other metadata for each prompt. Finally, the script loops through the outputs and prints the prompt and generated text for each one.
The main classes used are:
SamplingParams, an object for specifying text generation parameters
LLM, the model class that handles loading models and generating text
By initializing the model with "opt-125m", it loads pretrained weights for that specific OPT variant released by Facebook, enabling offline text generation with the model's pretrained capabilities.
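A runnable sketch of the offline-inference flow, with a stub model standing in for vllm.LLM so the example needs no GPU or downloaded weights. The SamplingParams fields shown are common knobs, not an exhaustive list, and StubLLM simply echoes prompts back in upper case.

```python
from dataclasses import dataclass

@dataclass
class SamplingParams:
    """Stand-in for vllm.SamplingParams with a few common knobs."""
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 16

@dataclass
class CompletionOutput:
    text: str

@dataclass
class RequestOutput:
    """Mirrors the prompt + generated-text shape described above."""
    prompt: str
    outputs: list

class StubLLM:
    """Echo model standing in for vllm.LLM(model='facebook/opt-125m')."""
    def generate(self, prompts, params):
        return [RequestOutput(p, [CompletionOutput(p.upper())])
                for p in prompts]

prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(temperature=0.8, top_p=0.95)
llm = StubLLM()
outputs = llm.generate(prompts, params)
for out in outputs:
    print(f"Prompt: {out.prompt!r}, Generated: {out.outputs[0].text!r}")
```

With the real library, replacing StubLLM with vllm.LLM keeps the loop over outputs unchanged, since each result exposes the prompt and its generated completions.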
The core functionality provided by the code under
…/model_executor is executing pretrained models efficiently in parallel across multiple devices like GPUs. This is achieved through the use of various classes, utilities, and functions defined throughout the codebase.
A key aspect is the
…/parallel_utils directory, which contains utilities for parallelism and distributed training ported from Megatron-LM. These utilities provide functionality for distributed communication using NCCL and for executing steps in parallel across devices while handling initialization and cleanup.
…/communication_op.py file contains utilities for distributed communication across model parallel groups in PyTorch. It utilizes PyTorch distributed communication APIs and relies on utilities from
…/parallel_state.py to get properties of the model parallel group like rank and world size.
…/parallel_state.py file contains utilities for initializing and working with tensor and pipeline model parallel groups in a distributed setting. It initializes the groups and stores the initialized groups and ranks in global variables, ensuring consistency between world size, ranks and group sizes during initialization.
…/utils.py file contains various utility functions used to support tensor parallelism, such as functions for checking divisibility and splitting tensors.
Overall, these parallelization utilities provide functionality for distributed communication, initializing model parallel groups, and common utilities - enabling efficient parallel execution of models across multiple devices.
Models in vLLM are loaded from pretrained checkpoints and executed to generate predictions or text. The core components for handling model loading and execution are defined as follows.
…/models directory contains Python files that define various machine learning model architectures. Files like
…/llama.py contain classes that implement the different components of each model, including attention layers, feedforward layers, and full model definitions. These files also contain functions for model initialization from pretrained weights.
…/layers directory defines common neural network layer classes that can be composed to build models. It includes classes for attention layers, normalization layers, and linear layers with different parallelization schemes. These layers provide optimized CUDA kernel implementations when possible.
Model loading is typically handled by the model classes themselves, which inherit from a base model class and add weight-loading functionality: each defines a method that maps pretrained checkpoint weights onto the model parameters.
Model execution is performed by calling the model class's forward method on input sequences. This runs the inputs through embedding, attention, and feedforward layers to generate outputs. A sampler class is used to sample from model predictions during inference.
…/parallel_utils directory contains utilities enabling parallel execution of models. It implements common patterns for distributing work across multiple devices.
Key algorithms implemented include:
Distributed communication using NCCL for efficient GPU-GPU communication
Executing steps in parallel across devices while handling initialization and finalization
…/parallel_state.py file initializes tensor and pipeline model parallel groups using PyTorch distributed primitives. It stores the initialized groups and ranks in global variables, ensuring consistency.
Utilities for collective operations are provided in
…/communication_op.py. Functions perform operations across the model parallel group.
Additional utilities in
…/utils.py include checking tensor divisibility and integer tensor splitting. These help enable common parallelism patterns.
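The divisibility and splitting helpers can be sketched like this; plain Python lists stand in for tensors, and the function names mirror common Megatron-style utilities but are illustrative rather than the repository's exact signatures.

```python
def ensure_divisibility(numerator, denominator):
    """Raise if numerator is not evenly divisible, as checked before
    partitioning a tensor dimension across ranks."""
    if numerator % denominator != 0:
        raise ValueError(f"{numerator} is not divisible by {denominator}")

def divide(numerator, denominator):
    """Integer division that insists on an exact split."""
    ensure_divisibility(numerator, denominator)
    return numerator // denominator

def split_last_dim(row, num_partitions):
    """Split a flat weight row into equal contiguous partitions,
    one per tensor-parallel rank (plain lists stand in for tensors)."""
    chunk = divide(len(row), num_partitions)
    return [row[i * chunk:(i + 1) * chunk] for i in range(num_partitions)]

parts = split_last_dim(list(range(8)), num_partitions=4)
print(parts)
```

Failing fast on uneven splits keeps every rank's shard the same shape, which the collective operations assume.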
…/layers directory contains Python modules that define core neural network layer classes. These layers form the basic building blocks used to construct complex vLLM models.
Key layer classes include:
A linear layer that handles regular linear transformations.
ColumnParallelLinear, which performs linear transformations with weights partitioned across GPUs along the column dimension.
PagedAttention, which handles multi-head attention with optimized caching of keys and values.
RMSNorm, which applies normalization, providing both PyTorch and optimized CUDA kernel implementations.
Activation modules, which provide both PyTorch and optimized CUDA versions of functions like GELU.
Specialized subclasses optimize performance for techniques like quantization, model parallelism, and custom hardware. Configuration classes centralize hyperparameters. Utility functions aid layer initialization and selection.
By composing these basic building blocks, the layers directory enables flexible and efficient construction of complex vLLM models tailored to different hardware backends and use cases.
The core functionality for applying quantization to model layers is handled through quantization configuration classes and linear layer classes defined in
…/__init__.py provides a registry for quantization configuration classes. It defines an abstract base class that all quantization config classes must inherit from, with specific configuration classes defined in per-scheme files.
Quantization is applied within linear layer classes defined in files like
…/awq.py. The layer class there handles quantizing weights into groups and packing them into lower-precision tensors. It stores the quantized weights, scales, and zero points, and defines methods that perform the linear operation using the quantized weight tensors.
Similarly, files like
…/squeezellm.py define classes that implement different quantization schemes, representing quantized weights as integers packed into tensors and applying quantization parameters.
These linear layer classes apply the key steps of different quantization schemes to reduce model sizes and accelerate inference through lower precision computations.
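The group-and-pack step can be illustrated with plain integers. This is only a sketch of the idea: real AWQ/SqueezeLLM kernels choose their own bit layouts and additionally store per-group scales and zero points for dequantization.

```python
def pack_int4(weights):
    """Pack eight 4-bit weights into one 32-bit integer, the kind of
    layout quantized linear layers use to shrink weight storage."""
    assert len(weights) % 8 == 0
    packed = []
    for i in range(0, len(weights), 8):
        word = 0
        for j, w in enumerate(weights[i:i + 8]):
            assert 0 <= w < 16          # 4-bit unsigned range
            word |= w << (4 * j)        # place each nibble in the word
        packed.append(word)
    return packed

def unpack_int4(packed, count):
    """Recover the original 4-bit values from the packed words."""
    out = []
    for word in packed:
        for j in range(8):
            out.append((word >> (4 * j)) & 0xF)
    return out[:count]

ws = [3, 7, 15, 0, 1, 2, 4, 8]
packed = pack_int4(ws)
print(packed, unpack_int4(packed, 8))
```

Packing eight weights per 32-bit word cuts weight memory roughly 4x versus fp16, which is the size reduction the section describes.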
…/attention code implements different attention mechanisms in a generic way that works across numeric types via abstraction. Common attention layers are defined to leverage these generic operations. This allows attention functionality to work transparently with different types while exposing a clean interface for building attention models.
Some key aspects of the attention implementation:
Generic operations like matrix multiplications are implemented for different data types in type-specific files
Attention layers call the generic operations, avoiding duplicate code for each type
…/attention_dtypes.h file provides a common interface and collects the necessary type-specific dependencies.
…/ops.h header declares optimized CUDA kernel implementations of functions used throughout the model. These leverage the generic attention code and handle caching keys/values for efficient attention computation.
…/attention directory contains implementations of different attention mechanisms for transformer models. It provides generic attention functionality that can work with different numeric types like float and half via abstraction.
Key aspects are implemented using templates that allow operations to work generically with different data types. Concrete types implement interfaces to provide attention functionality in a way that is transparent to numeric type.
Base interfaces are used to apply attention while keeping numeric type implementation separated, enabling attention operations to work with both float and half precision types.
…/core directory provides important classes and functions for managing sequences of tokens across devices and scheduling them for efficient model execution. A scheduler class tracks sequence groups in waiting, running, and swapped states, with methods for adding and scheduling them.
The scheduler's core logic uses the scheduling policy to determine priority and admits waiting sequences when possible. It then reserves slots for running sequences, and may preempt lower-priority sequences. This can result in token blocks needing to be swapped between CPU and GPU memory, which is handled by code in
…/block_manager.py. Code tracks physical token blocks and allows sequences to be mapped to these blocks, appending new tokens or swapping sequences between devices.
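A minimal sketch of the logical-to-physical block mapping idea, with hypothetical BlockAllocator and BlockTable classes; the real block manager additionally handles reference counting, forking, and CPU/GPU swapping.

```python
class BlockAllocator:
    """Toy physical-block allocator: a free list of fixed-size
    KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise RuntimeError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, block):
        self.free.append(block)

class BlockTable:
    """Maps a sequence's logical blocks to physical block numbers,
    appending a new physical block whenever the last one fills up."""
    def __init__(self, allocator, block_size=4):
        self.allocator = allocator
        self.block_size = block_size
        self.blocks = []        # logical index -> physical block number
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=8)
table = BlockTable(alloc)
for _ in range(9):              # 9 tokens with block_size 4 -> 3 blocks
    table.append_token()
print(table.blocks, len(alloc.free))
```

Because physical blocks are only claimed as tokens arrive, memory is allocated on demand rather than reserved for a sequence's maximum length up front.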
The abstract class and default FCFS implementation in
…/policy.py provide the interface for scheduling algorithms. Classes that inherit from the abstract class must implement a method that orders sequences by priority. This allows different scheduling strategies.
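The policy interface can be sketched as follows. The names Policy, FCFS, get_priority, and sort_by_priority mirror the shape described above, but this is a simplification in which sequence groups are plain dicts.

```python
from abc import ABC, abstractmethod

class Policy(ABC):
    """Scheduling policy interface: subclasses assign priorities and
    sequences are ordered from highest priority down."""
    @abstractmethod
    def get_priority(self, now, seq_group):
        ...

    def sort_by_priority(self, now, seq_groups):
        return sorted(seq_groups,
                      key=lambda group: self.get_priority(now, group),
                      reverse=True)

class FCFS(Policy):
    """First-come-first-served: older arrival time means higher priority."""
    def get_priority(self, now, seq_group):
        return now - seq_group["arrival_time"]

groups = [{"id": "b", "arrival_time": 5.0},
          {"id": "a", "arrival_time": 1.0},
          {"id": "c", "arrival_time": 3.0}]
ordered = FCFS().sort_by_priority(now=10.0, seq_groups=groups)
print([g["id"] for g in ordered])
```

Swapping in a different subclass (for example, one that prioritizes short sequences) changes the scheduling strategy without touching the scheduler itself.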
The core functionality enabling asynchronous request handling in the vLLM system is provided by several key components defined in
…/async_llm_engine.py. This file initializes classes that coordinate concurrent processing.
The RequestTracker class, exercised by the tests in
…/test_request_tracker.py, allows adding, retrieving, and updating requests and handling completion.
…/api_server_async_engine.py runs an API server using the engine for concurrent processing. It adds logging and exposes a statistics endpoint. Configuration and initialization code sets the engine as the global object.
…/async_llm_engine.py file contains the core code for enabling asynchronous request processing. It defines the AsyncLLMEngine class, which provides an asynchronous wrapper for the base engine class, overriding key methods to make them async and handling request processing concurrently.
…/test_async_llm_engine.py creates mock classes to simulate engine behavior without dependencies. This allows thoroughly testing asynchronous functionality in isolation.
The RequestTracker class, covered by
…/test_request_tracker.py, is critical. Adding a new request returns a stream instance with a property for checking completion status. New requests are signaled via an event flag. The tracker also handles retrieving pending and completed requests, aborting requests, and updating status after processing outputs.
Requests are handled asynchronously via FastAPI by subclassing the AsyncLLMEngine class defined in
…/async_llm_engine.py. The subclass AsyncEngine defined in
…/api_server_async_engine.py overrides the abort handling to increment an internal counter each time a request is aborted. It also defines a method that returns this counter value.
/stats endpoint exposes this functionality, using that method to return the number of aborted requests. When initializing the API server, command line arguments configure the engine, including settings like the number of worker processes and threads to use. The engine instance is then set as the global object for the
/stats endpoint. Uvicorn is configured to serve the FastAPI application, with options like the host, port, and logging level determined by the command line arguments.
…/test_request_tracker.py file contains tests for concurrently tracking asynchronous requests made to a model and their responses. A tracker class is used to monitor outstanding requests. It allows adding new requests via a unique ID, checking request status, and retrieving pending or completed requests.
When a new request is added, the tracker detects it via an event flag. Each request is represented by an instance containing the request ID and status, which can be checked directly on the instance. New and finished requests can be retrieved before and after completion.
Requests can be aborted by calling a method that updates their status. The tracker also supports finishing requests by processing response outputs, internally marking the corresponding request as completed.
The tests cover key usage patterns. They add requests to exercise detecting new requests and retrieving requests by status. Requests are aborted and finished through different means to validate status updating. Unique IDs and concurrent access are also tested. This provides thorough testing of tracking and synchronizing asynchronous requests and responses.
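The tracking behavior these tests exercise can be sketched with asyncio primitives. This toy RequestTracker is a deliberate simplification: it uses a Future as the per-request handle and an Event to flag new arrivals, and its method names are illustrative.

```python
import asyncio

class RequestTracker:
    """Minimal async request tracker: adding a request returns a
    future-like handle, an event flags new arrivals, and finishing or
    aborting resolves the handle."""
    def __init__(self):
        self.requests = {}                 # request_id -> Future
        self.new_requests = asyncio.Event()

    def add_request(self, request_id):
        fut = asyncio.get_running_loop().create_future()
        self.requests[request_id] = fut
        self.new_requests.set()            # signal the engine loop
        return fut

    def finish_request(self, request_id, output):
        self.requests.pop(request_id).set_result(output)

    def abort_request(self, request_id):
        self.requests.pop(request_id).cancel()

async def main():
    tracker = RequestTracker()
    handle = tracker.add_request("req-0")
    assert tracker.new_requests.is_set()   # arrival was flagged
    tracker.finish_request("req-0", "generated text")
    return await handle

result = asyncio.run(main())
print(result)
```

A caller awaits its handle while a separate engine loop, woken by the event, processes pending requests and resolves them, which is the decoupling the tests validate.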
This section discusses the functionality in the code for text tokenization and encoding. The
…/tokenizers directory contains the core tokenization logic.
…/baichuan.py file handles the main tokenization work. It loads a vocabulary file during initialization. The tokenize method splits input text into tokens using the vocabulary, and companion methods convert between tokens and IDs.
When building inputs for a sequence, a method adds special tokens like CLS and SEP and returns a mask array indicating which tokens are special. The class can also generate token type IDs for sequence-pair tasks.
A method saves the vocabulary back to disk, allowing different instances to share the same vocabulary.
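The special-token handling can be sketched as follows; the CLS/SEP IDs and function names here are hypothetical stand-ins for whatever the tokenizer's vocabulary defines.

```python
def build_inputs_with_special_tokens(ids_a, ids_b=None,
                                     cls_id=101, sep_id=102):
    """Wrap token IDs with CLS/SEP markers; a second segment, when
    present, is appended with its own trailing SEP."""
    out = [cls_id] + ids_a + [sep_id]
    if ids_b is not None:
        out += ids_b + [sep_id]
    return out

def special_tokens_mask(ids_a, ids_b=None):
    """1 marks a special token, 0 a regular token."""
    mask = [1] + [0] * len(ids_a) + [1]
    if ids_b is not None:
        mask += [0] * len(ids_b) + [1]
    return mask

def token_type_ids(ids_a, ids_b):
    """Segment IDs for sequence-pair tasks: 0 for the first segment
    (including its CLS/SEP), 1 for the second."""
    return [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)

ids = build_inputs_with_special_tokens([7, 8], [9])
print(ids, special_tokens_mask([7, 8], [9]), token_type_ids([7, 8], [9]))
```

Returning the mask alongside the IDs lets downstream code skip special positions when, for example, computing losses or alignments.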
…/__init__.py exports functionality for import.
…/configs directory contains configuration functionality for various transformer models used in the library. It provides a standardized interface for defining and accessing hyperparameters of each model type.
Individual configuration classes encapsulate hyperparameters like vocabulary size, hidden dimensions, and number of layers. They initialize these attributes either with default values or values passed during instantiation.
…/__init__.py file collects the various configuration classes in one namespace. It imports classes defined in files, allowing other code to easily import configurations without needing to know where each class is defined.
…/configs directory contains configuration classes for various transformer models used by the vLLM library. The file
…/aquila.py defines hyperparameters for configuring Aquila models. The file
…/baichuan.py contains a configuration class that inherits common functionality from a base class.
…/chatglm.py defines a configuration class that initializes hyperparameters to configure ChatGLM models. The file
…/falcon.py contains a configuration class that supports configuring settings.
…/mpt.py defines a configuration class that inherits from a base class and configures MPT (MosaicML Pretrained Transformer) models. The file
…/qwen.py contains a configuration class that initializes hyperparameters for the Qwen architecture.
Each configuration class provides an interface to configure its respective model while reusing functionality from base classes. They encapsulate hyperparameters and validate configurations.
The configuration classes in
…/configs define default values for hyperparameters used by different models. The classes initialize hyperparameters as class attributes, setting sensible defaults if values are not passed. This allows configuring models without specifying every parameter.
…/aquila.py class defines hyperparameters like vocabulary size, hidden size, number of layers, and attention heads for the Aquila model. It inherits common functionality and doesn't implement the model directly.
…/baichuan.py class inherits from a base class and forwards its attributes to the base initializer, storing them in a structured way for initialization.
…/falcon.py class initializes parameters like vocab_size, hidden_size, number of layers, attention heads, dropout values, and other settings. It overrides initialization to set default values and check backward compatibility. Properties calculate dependent values.
…/qwen.py class stores hyperparameters for the QWen architecture like vocabulary size, hidden size, number of layers and attention heads. It initializes parameters as properties with default values.
…/yi.py class defines hyperparameters for the Yi model such as vocabulary size and layer dimensions, initializing them.
The configuration class
…/falcon.py checks for backward-compatible keyword arguments and sets default values. It also verifies that a required flag is set when certain conditions are met. The class
…/baichuan.py validates hyperparameters by calling a method while initializing attributes.
The base configuration functionality is defined in the file
…/__init__.py. This file collects the various model-specific configuration classes into one namespace.
Configuration classes inherit from a common base class to define model hyperparameters and defaults. The base class provides functionality like model saving and loading that is reused.
By inheriting from the base class, model configuration classes gain common functionality while allowing customization. This enables code reuse and a consistent interface for all model types.
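The inheritance pattern can be sketched like this; PretrainedConfigBase and ToyModelConfig are illustrative names, not classes from the repository, and the validation shown is one plausible example of the checks described above.

```python
class PretrainedConfigBase:
    """Sketch of a shared config base class: stores extra kwargs and
    offers serialization reused by every model config."""
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_dict(self):
        return dict(vars(self))

class ToyModelConfig(PretrainedConfigBase):
    """Model-specific config: sensible defaults, overridable at init,
    with a validation check on dependent hyperparameters."""
    def __init__(self, vocab_size=32000, hidden_size=4096,
                 num_hidden_layers=32, num_attention_heads=32, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        if hidden_size % num_attention_heads != 0:
            raise ValueError("hidden_size must divide evenly across heads")

cfg = ToyModelConfig(hidden_size=2048, num_attention_heads=16)
print(cfg.vocab_size, cfg.hidden_size, cfg.to_dict()["num_hidden_layers"])
```

Defaults mean a model can be configured by overriding only the parameters that differ, while the base class keeps serialization uniform across model types.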
…/layers directory contains Python modules that define core neural network layer classes for building vLLM models. These layers leverage optimized CUDA kernels to enable efficient execution on GPUs when possible.
Key layers include linear layers and normalization layers.
LayerNorm provides normalization with both PyTorch and CUDA kernel implementations.
Specialized layers meet the needs of transformer architectures. The
RotaryEmbedding class in
…/rotary_embedding.py handles positional encodings, optionally applying scaling techniques.
Configurations ensure layers work together seamlessly. Quantization classes define hyperparameters.
…/layers directory contains implementations of core neural network layers used in transformer models. It defines several classes for different types of layers, including normalization and linear layers.
…/layernorm.py file implements normalization layers. It contains normalization classes that hold learned parameters and delegate the computation either to a PyTorch implementation or to an optimized CUDA kernel.
…/linear.py file implements various linear layer classes. It contains abstract base classes defining the interface for applying linear transformations. Classes in this file provide linear transformations by initializing weights and applying them. One class implements column parallelism by partitioning the weight matrix and loading partitioned slices during initialization based on partition rank.
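Column parallelism can be illustrated with plain lists: each rank multiplies by its contiguous column slice of the weight, and concatenating the partial results (an all-gather in the real distributed setting) reproduces the full output. All names here are illustrative, and small nested lists stand in for tensors.

```python
def matmul(x, w):
    """Tiny dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def column_partition(w, world_size, rank):
    """Each rank loads its contiguous slice of the output columns,
    mirroring how a column-parallel linear layer shards its weight."""
    n = len(w[0])
    assert n % world_size == 0
    chunk = n // world_size
    lo = rank * chunk
    return [row[lo:lo + chunk] for row in w]

x = [[1.0, 2.0]]                       # one input row, k = 2
w = [[1.0, 2.0, 3.0, 4.0],             # k x n weight, n = 4
     [5.0, 6.0, 7.0, 8.0]]
world_size = 2
# Each "rank" computes its column slice; concatenating the partial
# rows reproduces the unsharded result.
partials = [matmul(x, column_partition(w, world_size, rank))
            for rank in range(world_size)]
gathered = [partials[0][0] + partials[1][0]]
print(gathered, matmul(x, w))
```

Because each rank holds only 1/world_size of the weight columns, memory per GPU shrinks proportionally while the math remains exact.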
The core functionality for applying quantization to layers is handled in the
…/quantization directory. This directory contains subclasses that implement different quantization schemes for linear layers in vLLM models.
The main classes that apply quantization are defined in
…/squeezellm.py. Each class is responsible for quantizing the weights for a linear layer according to the specified quantization scheme.
Configuration classes define the hyperparameters for each quantization scheme, such as bitwidth and group size.
…/__init__.py file registers specific config classes and provides the function to dynamically retrieve the proper config class based on a quantization method name.
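The registry pattern can be sketched as follows; the decorator, function, and class names here are illustrative, not the repository's actual API.

```python
_QUANTIZATION_REGISTRY = {}

def register_quantization_config(name):
    """Decorator registering a config class under a method name."""
    def wrap(cls):
        _QUANTIZATION_REGISTRY[name] = cls
        return cls
    return wrap

def get_quantization_config(name):
    """Dynamically look up the config class for a quantization method."""
    try:
        return _QUANTIZATION_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown quantization method: {name}")

@register_quantization_config("awq")
class AWQConfig:
    weight_bits = 4
    group_size = 128

@register_quantization_config("squeezellm")
class SqueezeLLMConfig:
    weight_bits = 4

cfg_cls = get_quantization_config("awq")
print(cfg_cls.__name__, cfg_cls.weight_bits)
```

A registry lets callers select a scheme by its string name (for example, from a command line flag) without importing every scheme-specific module directly.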
The csrc directory contains optimized GPU operations implemented as CUDA kernels, along with attention mechanism implementations in the
…/attention subdirectory. This implements different attention mechanisms in a generic way that works with various numeric types through abstraction. Common attention layers follow best practices and are defined using generic operations.
…/ops.h header file declares several optimized C++ functions for GPU execution, including functions for attention and normalization. The
…/attention_dtypes.h header abstracts away data type details and provides a common interface for attention code to generically use various numeric types via templates and typedefs.
…/pybind.cpp module exposes optimized CUDA kernels to Python via Pybind11 by grouping related operators such as attention and normalization together in submodules.
…/attention directory contains implementations of different attention mechanisms for transformer models. It provides generic functionality that works across different numeric types like float and quantized integer values.
The key aspects are defined in files under the
…/attention directory. Files define the data types used.
Classes inherit from an abstract base class and implement type-specific computation. For example, subclasses define computations using the appropriate operations for that type.
Other classes handle variable sequence lengths efficiently via packing sequences into fixed-size tensors.
Generic operations are implemented via templates and abstract base classes, keeping the interface consistent across types. This allows different attention implementations to be used interchangeably while reusing most of the codebase.
Optimization is provided by leveraging CUDA kernels to parallelize attention computations on GPUs.
docs directory contains all the files and organization necessary to collaboratively develop documentation for the vLLM project using Sphinx. The source code, requirements, and build instructions allow contributors to build and preview documentation during development without needing to deploy the docs separately.
The main documentation source files are contained in
…/source. The file
…/conf.py configures Sphinx for building the documentation. It sets basic project metadata and configures Sphinx extensions, templates, and HTML output settings. Key functionality includes setting the HTML theme and configuring options for it like the logo image path.
…/README.md file provides instructions for building and viewing documentation locally. It contains steps to install requirements from
…/requirements-docs.txt, build the HTML docs using Sphinx commands, and start a local HTTP server to preview the documentation in a browser.
At its core, Sphinx builds the documentation from reStructuredText (.rst) source files located in …/source.
The main configuration file is
…/conf.py, which defines Sphinx settings like the HTML theme, extensions to use, and output options. It configures the Sphinx HTML builder to generate documentation pages from the .rst source files.
The Sphinx documentation build process is initiated from instructions specified in the
…/README.md file. This file provides steps for installing dependencies and viewing documentation locally during development.
…/requirements-docs.txt file contains dependencies needed to build documentation. It lists three requirements needed by the documentation build process. By specifying these, all documentation dependencies can be automatically installed to ensure reproducibility across environments. This file provides a simple way to coordinate package dependencies for compiling documentation.