Auto Wiki by Mutable.ai

xformers

Auto-generated from facebookresearch/xformers by Mutable.ai Auto Wiki

GitHub Repository
Developer: facebookresearch
Written in: Python
Stars: 6.8k
Watchers: 71
Created: 2021-10-13
Last updated: 2024-01-07
License: Other
Homepage: facebookresearch.github.io/xformers
Repository: facebookresearch/xformers

Auto Wiki
Generated at: 2024-01-07
Generated from: Commit 660000
Version: 0.0.4

XFormers is a Python library focused on building, optimizing and evaluating Transformer neural network models. It provides a comprehensive set of modular components, optimizations, tools and framework integrations to simplify Transformer construction, accelerate training/inference, and rigorously benchmark model performance.

At the core, XFormers enables configurable model architecture construction from reusable components like attention mechanisms and feedforward networks. The …/factory module provides a model factory using classes and utilities to instantiate Transformer encoder-decoder models from configuration files in a flexible way. Components like attention are registered and selected based on the config.

A key focus is providing bleeding-edge optimizations to push efficiency of Transformers. The …/ops module contains optimized operators for memory-efficient multi-head attention computations on GPUs. Techniques like split key attention are used to save memory. Low-level CUDA kernels in …/cuda provide additional optimizations.

The library integrates well with PyTorch for ease-of-use while enabling custom kernels. Utilities handle operator registration, conversion and dispatching between frameworks. Autograd support is provided for custom kernels.

The …/benchmarks module provides utilities to rigorously benchmark model performance. Classes generate different attention patterns and transformer configurations to evaluate across tasks defined in …/LRA. Profilers measure runtime, memory usage, and hardware metrics.

Overall XFormers focuses on providing an extensible, optimized Transformer toolkit with modular components, efficient kernels, benchmarking, and deep framework integration. The design centers around configurability, performance and cutting-edge techniques.

Transformer Model Construction

References: xformers/factory, xformers/components, examples/build_model

The …/factory directory implements a factory pattern to construct modular Transformer blocks and models from configurable components in an extensible way. Key functionality includes classes that take configuration objects and compose attention, feedforward, and other modules into reusable blocks.

The ModelFactory class handles building the overall transformer model module by stacking blocks along with embedding and output layers. It parses a configuration and initializes blocks.

The …/block_configs.py file defines configuration classes that are used to configure different components of a transformer block.

The …/__init__.py file exposes functionality for building blocks and models. It imports classes for composing modules into reusable blocks.

The …/my_model.py example script demonstrates how to instantiate a transformer model using the factories. It loads a configuration at /path/to/config.yaml and builds the model architecture specified in the config.
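The sketch below shows what such a script might look like, assuming the xFormer and xFormerConfig classes exposed by the model factory and a hypothetical config.yaml path; argument names and config layout may differ between library versions.

import torch
import yaml

from xformers.factory.model_factory import xFormer, xFormerConfig

# Load the list of stack configurations describing the architecture
# ("config.yaml" is a placeholder path).
with open("config.yaml", "r") as fd:
    stack_configs = yaml.safe_load(fd)

# Build the model from the parsed configuration.
model = xFormer.from_config(xFormerConfig(stack_configs))

# A dummy forward pass to sanity-check the assembled architecture;
# the expected input format depends on the configured embeddings.
tokens = torch.randint(0, 32000, (1, 128))
outputs = model(tokens)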

Modular Components

References: xformers/components/attention, xformers/components/feedforward, xformers/components/positional_embedding

The XFormers library provides modular components that enable flexible construction of transformer models. Base classes defined in the components' …/base.py files provide the reusable building blocks that can be combined into models.

The …/attention directory contains various attention mechanism implementations that inherit from the base class defined in …/base.py.

The …/feature_maps framework allows experimenting with different query/key projection strategies.

Feedforward networks are implemented in …/feedforward.

Positional encodings are defined in …/positional_embedding. Classes in …/param.py and …/sine.py provide different encoding schemes, such as learned and sinusoidal embeddings.

These modular, reusable components can be assembled to build a wide variety of transformer architectures. The library aims to balance flexibility and performance via optimizations like sparse operations and framework integrations.
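As an illustration of how these registered components are assembled, the sketch below builds an attention mechanism by name and wraps it in a multi-head dispatcher. It assumes the build_attention helper and MultiHeadDispatch wrapper exported from xformers.components; exact argument names may vary by version.

import torch

from xformers.components import MultiHeadDispatch
from xformers.components.attention import build_attention

# Select an attention mechanism from the registry by name.
attention = build_attention(
    {"name": "scaled_dot_product", "dropout": 0.0, "causal": False}
)

# Wrap it with multi-head query/key/value projections.
mha = MultiHeadDispatch(dim_model=256, num_heads=4, attention=attention)

x = torch.randn(2, 64, 256)  # (batch, sequence, embedding)
y = mha(query=x, key=x, value=x)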

Model Factory

References: xformers/factory, examples/build_model

The model factory class builds the full Transformer model. It handles parsing the configuration, constructing encoder and decoder blocks, and implementing the overall forward pass.

The class takes a configuration dataclass object as input. This configuration specifies properties like the number of encoder/decoder stacks, their dimension sizes, and other hyperparameters.

The class first parses the configuration to extract these parameters. It then uses block classes to construct the encoder and decoder stacks. These block classes compose attention, feedforward, and other modules into reusable blocks.

The encoder blocks are constructed from their block configurations and collected into a list; the decoder blocks are built the same way and collected into a separate list.

Some validation of the configuration is also performed.

The weights of the entire model are initialized by a weight-initialization routine that supports different initialization schemes and visits each module in the model recursively.

The main forward pass handles encoding, decoding if necessary, and returning outputs based on the model type. This implements the overall transformer computation.

The configuration dataclass defines the configuration structure for the full model, and its parsing method validates the configuration and ensures the correct block types are used.
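For reference, a block configuration of the kind the factory parses might look like the following; the field names mirror the examples shipped with the repository but should be treated as illustrative rather than authoritative.

# One entry per stack; a decoder stack would use "block_type": "decoder".
encoder_stack = {
    "block_type": "encoder",
    "num_layers": 2,
    "dim_model": 384,
    "residual_norm_style": "pre",
    "multi_head_config": {
        "num_heads": 6,
        "residual_dropout": 0.0,
        "attention": {"name": "scaled_dot_product", "dropout": 0.0, "causal": False},
    },
    "feedforward_config": {
        "name": "MLP",
        "dropout": 0.0,
        "activation": "gelu",
        "hidden_layer_multiplier": 4,
    },
}

stack_configs = [encoder_stack]  # passed to xFormerConfig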

Configurable Architectures

References: xformers

The …/build_model directory contains examples of building transformer models from configuration files. The configuration specifies the overall model architecture, and the model factory class contains the core logic for building each component of the model from it.

Utilities extract the nested block configs from the overall model config.

This demonstrates how to build a model programmatically without code changes.

Optimized Operators

References: xformers/ops, xformers/triton, xformers/csrc

The optimized operators in XFormers focus on efficiently implementing attention mechanisms and reducing memory usage. A key aspect is leveraging optimized linear algebra primitives for operations central to attention. The …/ops directory contains several important operator implementations.

The …/fmha module provides memory-efficient multi-head attention operators. A base class defines the core attention computation across frameworks, and concrete subclasses implement it using libraries such as CUTLASS, Flash-Attention, and Triton.

The …/_triton module contains additional optimized Triton kernels, such as rotary positional embedding application in _triton/rope_padded_kernels.py and indexed operations in _triton/k_index_select_cat.py and _triton/k_scaled_index_add.py. These kernels parallelize work across GPU cores for performance.

The …/cuda directory contains highly optimized CUDA implementations of multi-head attention and related linear algebra. The kernels for the forward and backward passes are defined in kernel_forward.h and kernel_backward.h. General matrix multiplication (GEMM), which dominates the attention computation, is implemented in threadblock-scoped kernels in the gemm subdirectory, with iterator classes loading data tiles efficiently through shared memory. Compiler-generated kernels in the autogen subdirectory allow the best variant to be selected at runtime.

Row-wise normalization is also optimized. The class in …/rmsnorm.py wraps the optimized kernels in rmsnorm_kernels.py, which accumulate statistics incrementally in blocks using techniques like shuffling.

Quantization helps reduce memory usage: the Triton-based attention operators support int4 quantization of keys and values, and …/triton_splitk.py implements split key attention with Triton to further reduce memory.

Autograd is implemented for the optimized attention operators. The …/autograd directory contains classes that define the forward and backward passes for operations such as masked matrix multiplication, which allows the optimized operators to be differentiated.

Optimized Attention Kernels

References: xformers/ops/_triton, xformers/csrc/attention/cuda

The XFormers library provides highly optimized CUDA implementations of key algorithms for efficient self-attention and related linear algebra operations on NVIDIA GPUs. The core building blocks leverage CUDA, shared memory, and compiler-generated kernels to efficiently map these algorithms to GPU hardware.

The …/fmha directory contains optimized implementations of fused multi-head attention (FMHA) and related linear algebra. The …/kernel_forward.h and …/kernel_backward.h files define CUDA kernels for the forward and backward passes of multi-head attention. These kernels are organized to process queries, keys, and values in parallel across thread blocks.

General matrix multiplication (GEMM) operations are implemented efficiently via threadblock-scoped kernels in …/gemm. The …/custom_mma.h file contains templates to generate optimized GEMM kernels for different problem sizes at compile time. Kernels in this directory leverage CUDA cores, shared memory, and SIMT instructions to efficiently perform GEMM.

The …/iterators directory contains classes that efficiently load tiles of data from shared memory for use in kernels. Iterators like those defined in …/default_warp_iterator_from_smem.h select the optimal way to read data based on hardware and data type.

Epilogue operations such as rescaling outputs are implemented in …/epilogue. The …/epilogue_rescale_output.h file contains a class that applies rescaling in an optimized way.

The …/autogen directory contains code generated at compile-time via …/generate_kernels.py to produce optimized kernel variants indexed for runtime selection, allowing flexible dispatch of attention computations.

Memory-Efficient Attention

References: xformers/ops/fmha

The …/fmha directory contains the classes that implement memory-efficient multi-head attention. Concrete operator subclasses defined in …/cutlass.py, …/triton.py, and …/flash.py implement the forward and backward passes by overriding the base operator's methods.

These operator classes leverage optimized libraries and kernels to efficiently perform attention on GPUs. For example, the classes in …/cutlass.py use the NVIDIA CUTLASS library, while those in …/triton.py leverage the OpenAI Triton compiler. Small key attention is supported via classes defined in …/small_k.py.

Input validation and preprocessing are handled by common utilities in …/common.py, including dataclasses that describe operator inputs uniformly. Attention biases are defined by the class hierarchy in …/attn_bias.py.

Operator dispatch is handled by …/dispatch.py, which defines priority lists to select the most performant operator based on input properties. Reduced-precision and quantized execution are supported by the Triton operators: …/triton.py runs attention in half and bfloat16 precision, while split key attention in …/triton_splitk.py defines a Triton kernel that parallelizes the computation across GPU cores and can additionally use int4-quantized keys and values.
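A typical way to invoke these operators from Python is the memory_efficient_attention entry point, sketched below with a causal bias; tensor layout is (batch, sequence, heads, head dimension), and the exact set of exported bias classes may vary by version.

import torch
import xformers.ops as xops

B, M, H, K = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# The dispatcher picks the most suitable backend (Cutlass, Flash, Triton, ...)
# for these inputs; a lower-triangular bias requests causal attention.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())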

Framework Integrations

References: xformers

The …/csrc directory contains implementations that integrate core XFormers operators and computations with PyTorch. It provides registrations, dispatching logic, and conversions between the C++/CUDA backend and the Python frontend through PyTorch.

The …/swiglu subdirectory handles the interface between C++ and PyTorch. The …/swiglu_op.cpp file registers the operators with PyTorch through its operator registration macros, which allows them to be called from Python.

The …/attention subdirectory leverages frameworks like CUDA to efficiently map attention and related linear algebra to CPUs and GPUs. It provides CPU, CUDA, and autograd implementations of operations. The …/cuda directory contains highly optimized CUDA kernels for multi-head attention and primitives on NVIDIA GPUs.

Conversions between C++ and Python objects are handled by …/boxing_unboxing.cpp. It defines functions to convert between objects and extract Python objects from C++ using type conversion and pointer extraction.

Overall, this directory provides critical optimized building blocks, operator registrations, and interface functionality that underpin efficient XFormers models across hardware through integrations with frameworks like PyTorch, CUDA and Triton.

Normalization Kernels

References: xformers/ops/_triton/rmsnorm_kernels.py

The file …/rmsnorm_kernels.py contains optimized Triton kernels for performing row-wise root mean square (RMS) normalization. RMS normalization is commonly used in neural networks as a form of normalization before or after linear transformations. The kernels in this file enable efficiently performing RMS normalization on GPUs in a memory-efficient blocked manner.

The kernels implement both the forward pass of RMS normalization where each element is divided by the root mean square of its row, as well as a version that supports adding two tensors together before normalization. This allows normalizing the sum of activations from multiple sources. The kernels operate on the input data in blocks to improve memory locality and coalesced access.

Key aspects of the implementation include processing each row in blocks to incrementally accumulate mean-square statistics, computing the reciprocal square root of the mean to perform the division, and selecting an optimal block size. Python wrapper functions handle preprocessing inputs, launching the kernels with the chosen block sizes, and returning outputs. Performance is optimized by ensuring inputs are contiguous and by transferring data between GPU memory and registers in blocks.
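The computation these kernels implement can be summarized with the following plain PyTorch reference, which is a sketch of the math rather than the blocked Triton code; the learned scale and the fused-add variant are assumptions based on the description above.

import torch

def rms_norm_reference(x, weight, x_add=None, eps: float = 1e-6):
    # Optionally fuse an addition of a second tensor before normalizing,
    # mirroring the "add then normalize" variant described above.
    if x_add is not None:
        x = x + x_add
    # Divide each row by the root mean square of its elements.
    inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * inv_rms * weight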

Quantization

References: xformers/ops/fmha/triton.py

The …/triton.py file implements reduced-precision support for the fused multi-head attention operator built on Triton, allowing it to run with half and bfloat16 data types and thereby reducing memory usage compared to float32.

The file contains classes which implement the forward and backward passes of Triton attention. They both support half and bfloat16 dtypes and only work on CUDA devices with SM 8.0 or higher, as required by Triton.

After the Triton ops complete, the results are cast back to the original input dtypes. This allows quantizing inputs to Triton while preserving different dtypes in the rest of the model.

Split Key Attention

References: xformers/ops/fmha/triton_splitk.py

This code in …/triton_splitk.py implements split key attention to reduce memory usage during self-attention. It defines a class that handles the overall forward pass of split attention.

During the forward pass, the class first reshapes and transposes the input tensors as needed. It then defines a Triton kernel that performs a single chunk of the split attention computation on blocks of the input. This kernel is JIT compiled to run efficiently on the GPU. The class calls this kernel independently on each split chunk of the key in parallel, with the queries and values passed unchanged. Another Triton kernel is used to reduce and merge the outputs from each split chunk.

The class supports int4 quantization of keys and values if the input dtypes allow for it, providing an additional memory savings. Overall, this allows self-attention to be computed efficiently in a data-parallel manner across GPU cores, while using less memory than a single computation.
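The split-K idea can be illustrated with the following plain PyTorch sketch, which computes partial attention over key/value chunks and then merges them with a numerically stable rescaling. It mirrors the reduction performed by the Triton kernels but is not the kernel code itself.

import torch

def attention_split_k(q, k, v, num_splits: int = 4):
    # q: (B, M, D); k, v: (B, N, D). softmax(q k^T / sqrt(D)) v is computed
    # chunk by chunk over the key dimension, then merged.
    scale = q.shape[-1] ** -0.5
    partial_out, partial_max, partial_sumexp = [], [], []
    for k_chunk, v_chunk in zip(k.chunk(num_splits, dim=1),
                                v.chunk(num_splits, dim=1)):
        scores = torch.einsum("bmd,bnd->bmn", q, k_chunk) * scale
        chunk_max = scores.max(dim=-1, keepdim=True).values
        exp_scores = (scores - chunk_max).exp()
        partial_out.append(torch.einsum("bmn,bnd->bmd", exp_scores, v_chunk))
        partial_max.append(chunk_max)
        partial_sumexp.append(exp_scores.sum(dim=-1, keepdim=True))

    # Merge step: rescale every chunk to a common maximum before summing,
    # which is the log-sum-exp style reduction the kernels perform.
    global_max = torch.cat(partial_max, dim=-1).max(dim=-1, keepdim=True).values
    out = sum(o * (m - global_max).exp() for o, m in zip(partial_out, partial_max))
    denom = sum(s * (m - global_max).exp() for s, m in zip(partial_sumexp, partial_max))
    return out / denom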

Autograd Support

References: xformers/csrc/attention/autograd

The …/autograd directory implements autograd support for optimized attention operations. It contains a C++ implementation of the forward and backward passes for matrix multiplication with an optional mask in the file …/matmul.cpp.

The core functionality is implemented via saving inputs and computing gradients. In the forward pass, it saves the input tensors and optional mask for later use in the backward pass. It then calls a function to perform the actual masked matrix multiplication.

In the backward pass, it retrieves the saved inputs. If a mask was provided, it first calls a function to mask the gradient before computing gradients. It then calls another function to calculate the gradients of the multiplication with respect to the input tensors.

This implementation is exposed to PyTorch via a wrapper function, which allows it to be differentiated from Python. The function simply wraps the forward pass. It is registered with PyTorch's autograd engine so it can be differentiated. This integrates it with PyTorch's autograd functionality for end-to-end model training.
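A Python analogue of this save-inputs/masked-gradient pattern, written with torch.autograd.Function instead of the C++ implementation, might look like the following; the zero-fill masking behaviour is an assumption made for illustration.

import torch

class MaskedMatmul(torch.autograd.Function):
    """Sketch of the save-inputs / masked-gradient pattern described above
    (the library implements this in C++/CUDA)."""

    @staticmethod
    def forward(ctx, a, b, mask=None):
        # Save inputs and the optional mask for the backward pass.
        ctx.save_for_backward(a, b)
        ctx.mask = mask
        out = a @ b
        if mask is not None:
            out = out.masked_fill(~mask, 0.0)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        # Mask the incoming gradient before propagating it, as described above.
        if ctx.mask is not None:
            grad_out = grad_out.masked_fill(~ctx.mask, 0.0)
        grad_a = grad_out @ b.transpose(-2, -1)
        grad_b = a.transpose(-2, -1) @ grad_out
        return grad_a, grad_b, None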

Benchmarking

References: xformers/benchmarks, xformers/profiler

The core functionality of the …/benchmarks code is to provide tools and frameworks for evaluating attention patterns, models, and hardware accelerators. This is done through several main components:

The …/benchmark_blocksparse_transformers.py file contains code to benchmark different attention mask patterns and block sparse matrix multiplication implementations. Experiments are defined in classes to test performance of different mask patterns, sparsity levels, and block sizes. Utilities are defined in the file for operations on blocked sparse tensors. Metrics like mask sparsity and FLOP counts are also calculated.

The …/benchmark_triton_fused_linear.py file benchmarks the performance of fused linear layer modules on different shapes and data types. It runs forward and backward passes on random input data and measures the achieved bandwidth to report throughput in GB/s.

The …/benchmark_triton_blocksparse.py file benchmarks different matrix multiplication modes using PyTorch operations. It creates input tensors and defines the computation graph for each backend. It runs each operation and measures time and TFLOPs. Results are printed and plotted for analysis.

Benchmark Utilities

References: xformers/benchmarks/utils.py

The …/utils.py file contains various utility functions for running and analyzing benchmarks of PyTorch models. It contains functions for benchmarking models and timing iterations to gather results. Results are returned as a list which can then be passed to further processing functions. One function handles the core benchmarking workflow, accepting flags like the number of warmups and iterations to run. It loads the model and runs warmups, then times iterations to collect performance data. Another utility function takes the raw timing data and modifies it when multiple algorithms are compared, for example by computing relative performance metrics. It supports reporting both average time and samples processed per second.
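A minimal version of this warmup-then-time workflow is sketched below; the helper name and returned fields are hypothetical and not the library's actual API.

import time
import torch

def benchmark(fn, *inputs, warmup: int = 10, iters: int = 100):
    # Warmup runs let CUDA kernels compile and caches fill before timing.
    for _ in range(warmup):
        fn(*inputs)
    torch.cuda.synchronize()

    # Timed iterations, synchronized so GPU work is fully accounted for.
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return {
        "avg_time_ms": 1e3 * elapsed / iters,
        "iters_per_s": iters / elapsed,
    }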

Long Range Arena

References: xformers/benchmarks/LRA

The Long Range Arena (LRA) benchmark suite contains code and scripts for evaluating Transformer models on a standardized set of tasks. The suite implements several benchmark tasks designed to test long-range dependencies, including sequence modeling, question answering, and language understanding. Models are evaluated based on their ability to capture dependencies between elements in an input that are distant from each other.

The core functionality for running benchmarks is contained in Python scripts. Separate Python scripts preprocess different datasets into an efficient format for the benchmarks.

The main logic for running benchmarks on different tasks is contained in the …/run_tasks.py file. It handles loading configurations, building models, setting up data loading, training models, and evaluating models.

The benchmarks are designed to be run via the command line or programmatically submitted to a cluster manager. Configuration files specify hyperparameters, tasks, and attention types to benchmark. Results are saved in a structured format for analysis.

Dataset classes handle loading and preprocessing examples for each task.

Attention Benchmarking

References: xformers/benchmarks/benchmark_blocksparse_transformers.py

This section focuses on benchmarking different attention patterns. The …/benchmark_blocksparse_transformers.py file contains relevant code.

Utilities in the file help with operations on blocked sparse tensors, like calculating mask sparsity and FLOP counts. The file also contains functions for measuring the performance of sparse dot product calculations; it orchestrates running the benchmarks and reports metrics such as throughput.

Model Benchmarking

References: xformers/benchmarks/benchmark_transformer.py

The …/benchmark_transformer.py module contains classes and functions that enable integrating Transformer models with benchmarks for performance testing. It replaces the standard attention and MLP modules in a model with benchmarking modules when running performance tests. Model configurations, optional precision changes, and test cases are generated by functions in this module. Benchmarking is done by running models with different inputs and modifiers, measuring the forward and backward pass times, and returning the results. The module contains efficient operator implementations from other parts of the library that are used to modify the existing modules when benchmarking.

Profiling

References: xformers/profiler

The profilers in xformers provide tools to measure model performance and hardware usage during training and inference. The …/profiler directory contains several profiler implementations that can operate sequentially or individually.

The main profiler orchestrates the individual stages and handles inserting hooks into the model.

One stage interfaces with PyTorch's native profiler APIs to trace execution time and identify bottlenecks.

Another captures CUDA memory allocation traces and dumps the trace data to files so that memory usage can be analyzed over time.

A module-level profiler constructs records with timing, FLOP, and I/O statistics for each call. These are aggregated, FLOPs are estimated from shape information, and the results are saved to JSON.

A further stage interfaces with NVIDIA's Nsight Compute profiler to analyze CUDA and GPU operations for performance profiling when run on supported hardware.

Functionality like estimating FLOPs for common ops is implemented in helper functions. Device specifications and limits are defined in dataclasses to provide hardware context.
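The stage that wraps PyTorch's native profiler ultimately rests on torch.profiler; a minimal direct use of that API, rather than the xformers wrapper, looks like this.

import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    model(x).sum().backward()

# Per-operator time and memory, similar to what the profiler stages
# aggregate into their reports.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))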

Sparse Attention

References: xformers/sparse, xformers/components/attention

The core functionality provided by the code under Sparse Attention is enabling efficient computations and representations for sparse attention patterns in transformer models. This is achieved through classes, utilities, and algorithms for working with sparse tensors and sparse attention computations.

The …/__init__.py file collects the main sparse tensor functionality in one place. It imports classes representing tensors with different sparse formats.

The class in …/blocksparse_tensor.py overloads operators like @ to efficiently operate on the sparse blocks. It utilizes optimized sparse backends when available.

The class in …/csr_tensor.py represents tensors in compressed sparse row format, storing only non-zero values and indices. It overloads common operations like multiplication and softmax to efficiently operate on the sparse tensor values and indices via calls to custom operators.

The file …/_csr_ops.py contains utilities for performing sparse linear algebra on sparse matrices via optimized backends. It dispatches operations based on sparsity and shape. Functions implement operations like sparse softmax.

The file …/attention_patterns.py contains implementations of various attention patterns and functions for converting between patterns and sparse layouts. Classes define how to generate fixed and variable sparse patterns.

The file …/sparsity_config.py represents the sparse attention map as a block-level layout tensor. Configuration classes populate this tensor with different block patterns.

Sparse Tensor Classes

References: xformers/sparse, xformers/sparse/blocksparse_tensor.py, xformers/sparse/csr_tensor.py

The xformers library provides classes for working with sparse tensors.

A class represents a tensor with a sparse block structure, where certain blocks of values are all zeros. This allows more efficient storage and operations by skipping the all-zero blocks. Operators like multiplication and softmax are overloaded to efficiently operate on the sparse blocks. It leverages optimized operators from Triton if available, and implements custom PyTorch kernels otherwise. Methods are provided to convert to and from dense formats.

A class uses the Compressed Sparse Row (CSR) format to store a sparse tensor compactly. Only the non-zero values and their indices are stored. Common operations and element-wise functions are overloaded to perform the operations directly on the stored values and indices. The class stores the tensor data as attributes, and functions extract and reconstruct the data between dense and sparse representations. Conversion methods are provided.

The …/sparse package collects the main interface for sparse tensors: its __init__.py imports the classes from the other files without defining any classes or functions itself. The …/_csr_ops.py file contains utilities for performing sparse linear algebra in CSR format. The …/utils.py file offers functions for sorting indices and extracting CSR data from masks.
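The CSR layout itself can be illustrated with PyTorch's built-in sparse CSR support; this is a format illustration rather than the xformers classes.

import torch

dense = torch.tensor([[0., 2., 0.],
                      [1., 0., 3.],
                      [0., 0., 0.]])

# Compressed sparse row: row pointers, column indices, non-zero values.
csr = dense.to_sparse_csr()
print(csr.crow_indices())  # tensor([0, 1, 3, 3])
print(csr.col_indices())   # tensor([1, 0, 2])
print(csr.values())        # tensor([2., 1., 3.])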

Sparse Linear Algebra

References: xformers/sparse/_csr_ops.py

The …/_csr_ops.py file contains utilities and autograd functions for performing sparse linear algebra operations on CSR (compressed sparse row) matrices in an efficient manner. It handles dispatching between sparse tensor formats like COO and CSR depending on the sparsity and shape of input matrices. Key functionality includes:

  • The file determines whether to use COO or CSR format for a matrix multiplication based on the sparsity and device, and dispatches the operation accordingly.

  • For highly sparse matrices, it uses CSR format which enables more efficient computations.

  • It implements the forward and backward passes for softmax in a way that leverages optimized sparse linear algebra routines.

  • Autograd functions allow defining sparse linear algebra operations involving CSR matrices and computing gradients, enabling end-to-end training with sparse tensors.

Sparse Utilities

References: xformers/sparse/utils.py

The utilities in the …/utils.py file provide sparse tensor manipulation functions, with a focus on the compressed sparse row (CSR) format used throughout the library. The file contains functions for sorting indices and converting between sparse formats.

Block Sparse Attention

References: xformers/components/attention/attention_patterns.py, xformers/components/attention/sparsity_config.py

The file …/attention_patterns.py contains implementations for various attention patterns. It provides functions for block-sparse attention patterns and converting between patterns and block-sparse layouts.

The file …/sparsity_config.py defines configurations for different sparsity patterns. The main functions generate the sparsity layout tensor, and populate the layout with different block patterns.

The classes encapsulate the different algorithms for configuring the sparse patterns by populating this tensor. They provide a consistent interface to work with various block-sparse patterns in a modular way.
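As an example of working with these patterns, the sketch below generates and composes boolean layouts using helpers from attention_patterns.py; the function names and signatures are assumed from the file and may differ between versions.

import torch
from xformers.components.attention.attention_patterns import (
    causal_1d_pattern,
    local_1d_pattern,
)

seq_len = 256

# Boolean (seq_len, seq_len) masks; True marks positions allowed to attend.
causal = causal_1d_pattern(seq_len)
local = local_1d_pattern(seq_len, window_size=17)

# Patterns compose with plain boolean algebra, e.g. causal local attention.
combined = causal & local
print(combined.shape, combined.float().mean())  # density of the pattern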

Long Range Attention

References: xformers/components/attention/lambda_layer.py, xformers/components/attention/linformer.py

The XFormers library includes two attention mechanisms that enable modeling long-range dependencies without quadratic attention costs: the attention mechanism defined in …/lambda_layer.py and the class defined in …/linformer.py.

The attention mechanism in …/lambda_layer.py implements the Lambda Networks approach. It uses learned relative positional embeddings to model positional interactions between tokens. These embeddings are multiplied by the value vectors to calculate position-based attention. Separately, it calculates content-based attention using softmax over the key vectors and multiplying by values. The position-based and content-based attentions are summed to obtain the final attention values. By decomposing attention into content and position components, and using relative positional encodings, this attention mechanism can model long-range interactions efficiently.

The class in …/linformer.py inherits from the base attention class and overrides the forward pass. In the forward pass, it first projects the keys and the values along the sequence dimension into a lower-dimensional space using two separate learned projections, then computes standard attention in the reduced k-dimensional space before returning the attended values. By attending over a projected space, this class sidesteps the quadratic attention cost and enables efficient long-range modeling.
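The projection trick can be sketched conceptually in a few lines of PyTorch; this is an illustration of the idea rather than the library's Linformer class.

import torch
import torch.nn as nn

class LinformerStyleAttention(nn.Module):
    """Conceptual sketch: project keys/values from length N down to k
    so attention costs O(N * k) instead of O(N^2)."""

    def __init__(self, seq_len: int, dim: int, k: int = 64):
        super().__init__()
        self.proj_k = nn.Linear(seq_len, k, bias=False)  # projects the sequence axis
        self.proj_v = nn.Linear(seq_len, k, bias=False)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, dim)
        k = self.proj_k(k.transpose(1, 2)).transpose(1, 2)   # (batch, k, dim)
        v = self.proj_v(v.transpose(1, 2)).transpose(1, 2)   # (batch, k, dim)
        scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5  # (batch, seq_len, k)
        return scores.softmax(dim=-1) @ v                    # (batch, seq_len, dim)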

Visual Attention

References: xformers/components/attention/visual.py

This section covers attention mechanisms designed for visual inputs like images. The file …/visual.py contains implementations of visual attention in transformers.

The key class defined is a module to compute the attention map. This module contains several convolutional layers with different configurations.

It directly exposes the 2D structure of the input by reshaping to and from the flattened sequence representation, allowing the attention to preserve spatial structure.

The module inherits from the base attention class and overrides flags. It contains convolutional and nonlinearity layers, as well as a module to apply the attention mechanism. This allows the attention to be computed over the 2D structure of visual inputs like images, while the rest of the transformer preserves the remaining spatial relationships.

The file registers visual attention in the attention registry via a decorator, allowing it to be used as a drop-in replacement for standard attention in transformers. This provides a way to apply visual attention to images or other spatially structured inputs within the transformer framework.

Integrations

References: xformers/csrc, third_party/sputnik

This section discusses integrations between XFormers and other frameworks like PyTorch, Triton, CUDA, as well as libraries like Sputnik.

XFormers provides several different levels of integration. At the lowest level, it implements optimized CUDA kernels for attention and linear algebra primitives in the …/cuda directory. These kernels leverage CUDA features for performance on Nvidia GPUs.

The …/autograd directory contains autograd implementations for the CUDA kernels. This allows them to be differentiated when using frameworks like PyTorch. The kernels are registered with the PyTorch autograd engine via functions defined here.

Another integration pathway is through the Sputnik GPU library, located at …/sputnik. Sputnik contains highly optimized CUDA primitives for operations like convolution, matrix multiplication, and activation functions. XFormers leverages these primitives by calling Sputnik functions for pieces like sparse softmax in …/sparse_softmax.cpp. This allows benefiting from Sputnik's low-level kernel optimizations without reimplementing the functionality.

Triton integration is provided through GPU kernels written with the OpenAI Triton compiler, which back several of the optimized operators described in the Optimized Operators section.

PyTorch Integration

References: xformers/csrc/swiglu

The …/swiglu directory implements the SwiGLU feedforward operators in C++ and exposes them to PyTorch. It contains functionality in the …/swiglu_op.cpp file to register the operators with PyTorch.

The …/swiglu_packedw.cpp file implements packed-weight matrix multiplication using a class whose methods perform the computation on CPU or GPU by calling kernel functions defined in the backend namespaces. Utilities add shape-checking assertions to these methods.

CUDA Kernels

References: xformers/csrc/attention/cuda

The …/cuda directory contains highly optimized CUDA implementations of multi-head attention and related linear algebra operations for NVIDIA GPUs. This includes kernels for both the forward and backward passes.

The …/fmha subdirectory implements fused multi-head attention (FMHA). The …/kernel_forward.h and …/kernel_backward.h files contain kernels for the forward and backward passes respectively. These kernels perform the core attention computations in parallel using CUDA threads organized into blocks and warps.

General matrix multiplication (GEMM) operations are crucial to the attention computations. These are implemented efficiently using threadblock-scoped kernels in the …/gemm subdirectory. The kernels leverage several techniques to map the algorithms efficiently to GPU hardware. Shared memory is used extensively to store intermediate results between operations. Iterators defined in files like …/iterators handle efficiently loading tiles of data from global and shared memory. The …/tile_smem_loader.h class provides a way for threads to cooperatively load tiles from global to shared memory.

The …/autogen subdirectory contains code generated at compile-time via …/generate_kernels.py to produce optimized kernel variants for different data types, problem sizes, and GPU architectures. Functions are included to dispatch the correct kernel variant at runtime.

Epilogue operations after the matrix multiplications are also implemented, such as rescaling outputs in the …/epilogue directory.

Autograd Support

References: xformers/csrc/attention/autograd

The …/autograd directory implements autograd support for optimized CUDA attention operations. It contains a C++ implementation of the forward and backward passes for matrix multiplication with an optional mask.

The core functionality is defined in …/matmul.cpp. In the forward pass, it saves the input tensors and optional mask. It then calls a CUDA kernel to perform the masked matrix multiplication operation.

In the backward pass, it retrieves the saved inputs. If a mask was provided, it first masks the gradient before computing gradients. It then calculates the gradients for the input tensors.

A wrapper function exposes the functionality to PyTorch's autograd. This allows it to be differentiated from Python.

The code registers this operation with the PyTorch autograd engine. This makes the optimized CUDA attention operation available from Python and allows it to be differentiated automatically during training.

This implementation provides optimized CUDA kernels for attention while still allowing end-to-end differentiation of models using these kernels. This is crucial for training Transformer models with sparse or optimized attention patterns.

Sputnik Integration

References: third_party/sputnik

The Sputnik library provides optimized GPU primitives through functions that leverage low-level CUDA features for performance and portability across Nvidia GPU hardware. The main components of Sputnik integration are:

The …/sputnik directory contains the Sputnik GPU linear algebra library. Within this directory, the …/sputnik subdirectory contains the core implementation of Sputnik. It provides efficient GPU kernels and algorithms for common deep learning primitives like convolution, matrix multiplication, and activation functions.

The …/CMakeLists.txt file defines a CMake project for building the Sputnik library. It sets minimum required CMake version and defines the project name. It includes dependencies and configures build options. This file handles compiling the Sputnik source files in the …/sputnik directory into a shared library.

Depthwise convolution operations are implemented in …/cuda_depthwise.cu.cc. This file contains efficient CUDA kernels for depthwise convolution that leverage techniques like tiling, padding, and vectorization.

Matrix multiplication primitives are provided via kernels for sampled dense-dense matrix multiplication (SDDMM) implemented in …/cuda_sddmm.cu.cc. This file contains CUDA kernel implementations that use tiling and optimizations like shared memory.

Activation functions are implemented in …/bias_relu.cu.cc which contains a highly optimized kernel for applying bias and ReLU activation.

The Sputnik primitives can be integrated via function declarations in header files like …/cuda_depthwise.h and …/bias_relu.h. These provide the interface to call kernels from other code.

Model Training & Inference

References: examples, xformers/components

The examples directory contains several examples that demonstrate how to use the XFormers library for model training, evaluation, and inference.

The …/build_model subdirectory focuses on programmatically building models from configurations without code changes. The …/my_model.py script loads a configuration file and builds the corresponding model.

The …/llama_inference subdirectory handles efficient inference for large pretrained models. It loads a PyTorch checkpoint and SentencePiece tokenizer for text generation.

Other examples include …/cifar_MetaFormer.py and …/cifar_ViT.py which implement and train hierarchical and Vision Transformer models.

Model Training

References: examples/cifar_MetaFormer.py, examples/cifar_ViT.py, examples/microGPT.py

This section covers code and examples for training transformer models. The key functionality is:

  • The …/cifar_MetaFormer.py file defines a model for image classification on CIFAR-10 using a MetaFormer architecture. It programmatically defines the model configuration. It adds a classification head on top of the trunk and implements the forward pass. In the init method it saves hyperparameters and initializes the model, criterion, and metrics.

  • The …/cifar_ViT.py file defines a Vision Transformer class for classifying CIFAR-10 images.

  • The …/microGPT.py file loads and preprocesses Shakespeare text data. In the main section, it trains the model on this data, and samples new text from the trained model.

Inference

References: examples/llama_inference

This section details how trained models can be used for text generation and prediction through inference. The core functionality is handled by the code in …/generate.py.

Upon initialization, it loads a pretrained model checkpoint and tokenizer from the specified directory. It handles encoding input prompts into token IDs using the tokenizer defined in …/tokenizer.py. An attention bias is created to indicate the length of each prompt during self-attention.

It runs the model to efficiently generate each next token in parallel. It conditions generation on the encoded prompts using the attention bias. The model output logits are then used to sample or take the argmax with functions from …/sample_utils.py to predict the next tokens.

After each step of generation, the attention bias state is updated to increase the self-attention lengths. This allows autoregressive generation of the output sequence. The predicted tokens are decoded back to text with the tokenizer. Generation is complete when the end-of-sequence token is encountered.

This provides a simple interface for text generation that handles encoding, model conditioning, decoding, and updating the attention bias state during generation. Key aspects include efficiently running the model and updating the attention bias for self-attention at each step.

Utilities

References: examples/llama_inference/mp_utils.py, examples/llama_inference/sample_utils.py, examples/llama_inference/stats.py, examples/llama_inference/tokenizer.py

The utilities in this section are used for common tasks during model training and inference.

The …/mp_utils.py file contains utilities to support model parallelism in PyTorch. The main functionality includes initializing the model parallel environment and providing functions to get the world size and local rank of the current process. It also includes functions for gathering and reducing tensors across processes, allowing communication during distributed training or inference.

The …/sample_utils.py file contains the function for top-p sampling. It takes a probability distribution and threshold as input, sorts the probabilities in descending order and calculates the cumulative sum to create a mask for probabilities below the threshold. The masked probabilities are set to 0 and the distribution is renormalized before sampling.
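A sketch of that top-p (nucleus) sampling procedure, following the description above, is shown below; variable names are illustrative.

import torch

def sample_top_p(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    # probs: (batch, vocab), already softmax-normalized.
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Drop tokens once the cumulative mass (excluding the current token)
    # exceeds the threshold, then renormalize and sample.
    mask = cumulative - sorted_probs > p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)

    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return torch.gather(sorted_idx, -1, next_sorted)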

The …/stats.py file tracks timing and statistics for the different phases of a process. It uses a dataclass to represent statistics such as the name, token count, and elapsed time of individual phases. The tracking class initializes with no phases and has methods to terminate the current phase and store its stats, start a new phase, and access all phase stats.

Testing

References: tests

The xformers library provides a comprehensive suite of tests for validating components using parameterization and utilities that support writing robust and reusable tests. The tests are organized into directories under tests that match the component structure under test. This allows grouping related tests together in an intuitive way.

The tests rely on shared utilities defined in the tests directory. Key aspects of the testing approach include:

  • Components are thoroughly tested across different configurations, devices, and dtypes using utilities located in …/__init__.py to generate test data.

  • Correctness is validated by comparing results to reference implementations located directly in test files or to other components.

  • Functionality is tested on CPU and CUDA compatible devices to ensure robustness across hardware.

  • Gradient correctness is validated in addition to forward pass results using automatic differentiation.


Test Organization

References: tests, tests/__init__.py

Tests are organized into subdirectories by component under the tests folder. Tests for attention functionality can be found in …/test_core_attention.py.

The …/__init__.py file defines utilities used across tests.

…/test_core_attention.py contains tests for attention modules.

…/test_attention_patterns.py exercises classes implementing attention patterns. It validates layouts generated under different parameters.

Parameterized tests exercise components under various configurations, such as different numbers of heads and dropout rates.

Utilities generate test data and validate expected functionality.

Functionality Testing

References: tests/test_feedforward.py

The …/test_feedforward.py file contains tests for feedforward neural network components in Xformers. These tests validate that feedforward layers function correctly across different device configurations by constructing each layer with random parameters, passing dummy data through, and checking the outputs.

The tests first define common batch size, sequence length, and embedding dimension parameters. All registered feedforward layers in Xformers are then checked by constructing each layer by name with a chosen activation function and device. Dummy data is passed through each layer and the output dimensions are validated.

Specific tests requiring CUDA are also included. These tests construct layers with different expert configurations. Dummy data is passed through and the outputs are checked.

A key part of the tests is validating the basic forward pass without testing internal implementation details. This ensures the expected functionality of feedforward components across different model configurations and devices. Refer to the Test Organization section for more details on how tests are structured.
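A simplified version of such a shape test might look like the following pytest sketch; it assumes the build_feedforward registry helper and the MLP config fields, while the real tests iterate over every registered feedforward and additional configurations.

import pytest
import torch

from xformers.components.feedforward import build_feedforward

BATCH, SEQ, EMB = 2, 64, 128

@pytest.mark.parametrize("activation", ["relu", "gelu"])
def test_mlp_feedforward_shape(activation):
    # Build a registered feedforward block from a config dict and check
    # that the forward pass preserves the (batch, seq, embedding) shape.
    ff = build_feedforward(
        {
            "name": "MLP",
            "dim_model": EMB,
            "dropout": 0.0,
            "activation": activation,
            "hidden_layer_multiplier": 4,
        }
    )
    x = torch.randn(BATCH, SEQ, EMB)
    assert ff(x).shape == (BATCH, SEQ, EMB)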

Gradient Checking

References: tests/test_feedforward.py

The …/test_feedforward.py file also contains tests that validate gradient correctness for feedforward layers in addition to checking forward pass results. These gradient checks use finite difference approximations: the tests construct feedforward layers registered in the component registry with different configurations, including varying numbers of experts, run a forward pass on dummy data, and compare finite difference estimates against the gradients produced by the backward pass. Any discrepancies would indicate issues with the backward implementation. This process is repeated for different feedforward layers and configurations to ensure their gradients are computed correctly during training.

Correctness Testing

References: tests/test_core_attention.py

The unit tests in the …/test_core_attention.py file validate the correctness of the sparse attention kernels by comparing their results to the dense implementation under different conditions. Tests are run with different attention masks and input shapes specified in the test functions.

Key tests check that results between a sparse CSR implementation and dense masks are identical. Mixed precision tests ensure the correct output dtype of float16 or float32 is used depending on if autocasting is enabled. Blocksparse tests verify the blocksparse kernels are selected appropriately based on the mask type.

The tests validate that the sparse kernels produce numerically equivalent results to the dense reference implementation under different masking conditions and input configurations. This confirms the correctness of the optimized sparse kernels.

Performance Testing

References: xformers

The XFormers library contains comprehensive tests to ensure components achieve their expected speed and memory usage goals. The tests directory contains rigorous performance tests that benchmark key components across a variety of configurations.

The …/test_feedforward.py file runs performance tests on feedforward neural network components in Xformers. It constructs components with random hyperparameters, prepares input data, and measures runtime. Benchmarks are run across different devices, data types, and batch sizes to test performance robustness. Any regressions in throughput or memory efficiency would be caught.

Model Testing

References: xformers

The tests directory contains tests for complete models constructed from Xformers components in different configurations. Tests validate models by passing data through and checking outputs and gradients.

Configurations specify options like attention patterns or number of layers. This validates models constructed from components function correctly across architectures.