Apex

License: BSD 3-Clause "New" or "Revised"

The Apex repository provides utilities to accelerate deep learning workloads in PyTorch using techniques like mixed precision training and distributed data/model parallelism. It contains optimized CUDA/C++ implementations of performance-critical model components and training functions to improve throughput and resource utilization during training.

Some of the key functionality includes:

  • Mixed precision training using FP16 formats to improve performance while maintaining accuracy. This works by wrapping models and optimizers and automatically handling loss scaling, overflow detection, and related bookkeeping.

  • Distributed data parallel training using model wrapping and gradient synchronization utilities. Tests validate correctness under stress, including deliberately provoked race conditions.

  • Optimized CUDA kernels for operations like attention, convolution, normalization and more. These fuse operations such as GEMMs with surrounding elementwise computations for faster training on GPUs.

  • Optimized deep learning modules, such as normalization layers, attention layers, and fused optimizers, that improve throughput, with tests validating their correctness.

  • Specialized utilities like optimized kernels and model/tensor parallelism support accelerate transformer training.

  • Building blocks for recurrent models like LSTM/GRU cells and utilities to construct RNNs.

  • Standardized implementations of multi-layer perceptrons with a common interface.

Extensive test suites validate functionality across test levels, devices, and workflows. Helper scripts also verify that Apex installs correctly across a range of PyTorch Docker images.

In summary, Apex accelerates deep learning workloads by providing optimized CUDA kernels, model components like attention and normalization layers, mixed precision and distributed training utilities, and comprehensive testing, improving the performance, throughput, and scalability of PyTorch models.

Mixed Precision

References: apex/amp, apex/fp16_utils

Apex provides functionality for enabling mixed precision training using both Automatic Mixed Precision (AMP) and manual FP16 utilities. The main components are the …/amp module and …/fp16_utils module.

Mixed Precision Utilities

References: apex/amp, apex/fp16_utils

The core functionality for enabling mixed precision training using FP16 formats is provided by utilities in the …/fp16_utils directory. These utilities implement important algorithms and components for FP16 training.
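
As a concrete illustration of the FP16 master-weights idea these utilities implement, the sketch below (plain Python, using the `struct` module's half-precision format to emulate FP16 rounding; not Apex code) shows why an FP32 master copy matters: small updates that vanish entirely in pure FP16 still accumulate in the master copy.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE half precision (struct format 'e')."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Pure-FP16 weight: near 1.0 the FP16 spacing is ~0.00098, so a 1e-4 update
# rounds away every single step and the weight never moves.
w16 = to_fp16(1.0)
for _ in range(100):
    w16 = to_fp16(w16 + to_fp16(1e-4))

# Master-weight pattern: update an FP32 copy in full precision, then cast
# down to FP16 for the forward pass. The updates accumulate as expected.
master = 1.0
for _ in range(100):
    master += 1e-4
    w16_from_master = to_fp16(master)
```

Here the pure-FP16 weight stays exactly at 1.0, while the FP16 view of the master copy ends up near the mathematically correct 1.01.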

Function Wrapping and Registration

References: apex/amp/, apex/amp/lists

The …/ file handles function wrapping and registration to enable mixed precision training. It contains the main functionality for determining which functions require special handling, e.g. which operations can safely run in FP16 and which must stay in FP32.

Optimizer Handling

References: apex/amp/, apex/amp/

The …/ file defines a central communication object and provides helper functions to interface with it, following a typical pattern for sharing state across modules in a thread-safe way.
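
A minimal sketch of that pattern, with hypothetical names (`_SharedState`, `set_state`, and `get_state` are illustrative, not Apex's actual identifiers): a module-level singleton guarded by a lock, with small helper functions that other modules import.

```python
import threading

class _SharedState:
    """Central process-wide state container (illustrative only)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

# Module-level singleton: every importer sees the same object.
_state = _SharedState()

def set_state(key, value):
    """Set a shared value under the lock."""
    with _state._lock:
        _state._values[key] = value

def get_state(key, default=None):
    """Read a shared value under the lock."""
    with _state._lock:
        return _state._values.get(key, default)
```

Because the singleton lives at module scope, all modules that import these helpers read and write the same state without passing it around explicitly.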

Loss Scaling

References: apex/amp/, apex/fp16_utils/

Loss scaling functionality is handled in the …/ file.
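
Dynamic loss scaling typically grows the scale after a run of overflow-free steps and backs it off when an overflow is detected. A plain-Python sketch of that policy (class and parameter names here are illustrative, not Apex's exact API):

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaling policy (not Apex's actual class)."""
    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            # Clean step: after enough of them, try a larger scale again.
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0

# Demo with a short growth interval so the behavior is visible.
scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
scaler.update(False)
scaler.update(False)          # two clean steps -> scale doubles
after_growth = scaler.scale
scaler.update(True)           # overflow -> scale halves
after_backoff = scaler.scale
```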

Initialization and Configuration

References: apex/amp/, apex/amp/

This section covers the entry points and configuration options for initializing mixed precision training in Apex.
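
The `opt_level` presets can be summarized as a plain dictionary; treat this as a hedged summary following the vocabulary of the Apex documentation, not a normative spec.

```python
# Summary of amp.initialize's documented opt_level presets (illustrative).
# Typical call: model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
OPT_LEVELS = {
    "O0": {  # FP32 baseline
        "cast_model_type": "float32", "patch_torch_functions": False,
        "keep_batchnorm_fp32": None, "master_weights": False, "loss_scale": 1.0,
    },
    "O1": {  # mixed precision via function patching
        "cast_model_type": None, "patch_torch_functions": True,
        "keep_batchnorm_fp32": None, "master_weights": False, "loss_scale": "dynamic",
    },
    "O2": {  # "almost FP16": FP16 model with FP32 master weights
        "cast_model_type": "float16", "patch_torch_functions": False,
        "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic",
    },
    "O3": {  # pure FP16, mainly a speed baseline
        "cast_model_type": "float16", "patch_torch_functions": False,
        "keep_batchnorm_fp32": False, "master_weights": False, "loss_scale": 1.0,
    },
}
```

Each property can also be overridden explicitly when calling `amp.initialize`, so an opt level is a starting point rather than a fixed contract.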

Distributed Training

References: apex/parallel, tests/distributed

The main utilities for distributed training in Apex are handled in the …/parallel directory. This directory contains several important files and classes for distributed data and model parallelism.

Distributed Data Parallel

References: apex/parallel/, tests/distributed/DDP

Apex's distributed data parallel wrapper allreduces gradients across processes using bucketing. In the tests, during each iteration the input tensor is filled with unique values on each device to deliberately provoke potential race conditions, verifying that gradients accumulate correctly. Configurations such as message size and the number of allreduce streams are varied.
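
The bucketing idea can be sketched in plain Python: per-parameter gradients are grouped into fixed-capacity buckets with identical boundaries on every rank, and each bucket is reduced as a unit. Here `allreduce_sum` is a stand-in for the real NCCL allreduce, and the "ranks" are just lists.

```python
def bucket_grads(grads, bucket_cap):
    """Group per-parameter gradient lists into buckets of ~bucket_cap elements."""
    buckets, current, size = [], [], 0
    for g in grads:
        current.append(g)
        size += len(g)
        if size >= bucket_cap:
            buckets.append(current)
            current, size = [], 0
    if current:
        buckets.append(current)
    return buckets

def allreduce_sum(flat_per_rank):
    """Stand-in for allreduce: every rank receives the elementwise sum."""
    summed = [sum(vals) for vals in zip(*flat_per_rank)]
    return [list(summed) for _ in flat_per_rank]

# Two simulated ranks, three parameter gradients each.
rank_grads = [
    [[1.0, 2.0], [3.0], [4.0, 5.0]],
    [[10.0, 20.0], [30.0], [40.0, 50.0]],
]
buckets = [bucket_grads(g, bucket_cap=3) for g in rank_grads]
reduced = []
for per_rank in zip(*buckets):   # identical bucket boundaries on every rank
    flat = [[x for g in b for x in g] for b in per_rank]
    reduced.append(allreduce_sum(flat)[0])
total = [x for b in reduced for x in b]
```

Grouping many small gradients into one message is what makes the reduction efficient; the capacity plays the role of the message-size knob the tests vary.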

Automatic Mixed Precision

References: tests/distributed/amp_master_params

Apex provides functionality for mixed precision in distributed training through Automatic Mixed Precision (AMP). AMP allows training with smaller datatypes like half-precision floats to accelerate training, while maintaining the accuracy achieved with full precision. It handles operations like loss scaling, optimizer wrapping, and synchronization of parameters across devices.

Optimized Building Blocks

References: apex/contrib, csrc

The …/csrc directory contains optimized C++/CUDA implementations of core deep learning operations and components. It provides low-level building blocks that can be used to accelerate models and training via the PyTorch C++ interface.

Comprehensive Testing

References: apex/contrib/test, tests

The Apex test suites provide comprehensive validation of the optimized implementations through extensive unit testing across test levels, devices, and workflows.

Transformer Utilities

References: apex/transformer

The …/transformer directory contains specialized functionality for efficiently training large Transformer models at scale. This includes training utilities based on NVIDIA's Megatron-LM, using techniques such as tensor and pipeline model parallelism, optimized kernels, batch sampling, and mixed precision training.

Transformer Model Utilities

References: apex/transformer, apex/transformer/amp, apex/transformer/functional, apex/transformer/layers, apex/transformer/pipeline_parallel, apex/transformer/tensor_parallel, apex/transformer/_data

The …/transformer directory contains utilities for efficiently training large Transformer models using techniques like tensor and pipeline model parallelism, optimized kernels, and batch sampling. This allows models to effectively scale to larger sizes and batch sizes during pretraining.

Optimized Kernels

References: apex/transformer/amp, apex/transformer/functional

The …/functional directory contains optimized kernel implementations of operations commonly used in transformer models, including applying positional encodings, computing attention, and applying normalization layers such as layer normalization.
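
As an illustration of what one such fused kernel computes, here is a plain-Python sketch of a scaled, masked softmax; the real CUDA kernel performs the scale, mask, and softmax in a single pass over the data instead of three separate ones.

```python
import math

def scaled_masked_softmax(scores, mask, scale):
    """Scale attention scores, mask disallowed positions, then softmax per row.

    mask entries are True where attention is allowed (illustrative convention).
    """
    out = []
    for row, mrow in zip(scores, mask):
        # Masked positions get -inf so they contribute zero probability.
        logits = [s * scale if keep else float("-inf")
                  for s, keep in zip(row, mrow)]
        m = max(logits)                     # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

probs = scaled_masked_softmax([[1.0, 2.0, 3.0]], [[True, True, False]], 0.5)
```

Fusing these steps avoids materializing the scaled and masked intermediates in GPU memory, which is where the speedup comes from.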

Pipeline Model Parallelism

References: apex/transformer/pipeline_parallel

Pipeline model parallelism partitions the model across multiple GPUs such that each GPU processes a subset of layers sequentially. This allows training much deeper models than would fit on a single device. Apex provides utilities for implementing pipeline parallelism during transformer training.
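
The fill-and-drain structure of a GPipe-style forward schedule can be simulated in a few lines (a simplification: Apex's pipeline schedules also interleave backward passes, which this sketch omits).

```python
def gpipe_forward_schedule(n_stages, n_microbatches):
    """At clock tick t, stage s works on microbatch t - s, if that is valid.

    Returns one {stage: microbatch} dict per tick, showing the pipeline
    filling up, running at full occupancy, and draining.
    """
    schedule = []
    for t in range(n_stages + n_microbatches - 1):
        step = {}
        for s in range(n_stages):
            m = t - s
            if 0 <= m < n_microbatches:
                step[s] = m
        schedule.append(step)
    return schedule

sched = gpipe_forward_schedule(3, 4)   # 3 stages, 4 microbatches
```

The bubbles at the start and end (ticks where some stages are idle) are why more microbatches per global batch improve pipeline utilization.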

Tensor Model Parallelism

References: apex/transformer/tensor_parallel

The Apex library provides several utilities to support tensor model parallelism for efficiently training large Transformer models across multiple GPUs. Tensor model parallelism involves splitting model weights, activations, and gradients across GPUs along the tensor (model) parallel dimension.
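
A toy sketch of one common tensor-parallel split, a column-parallel linear layer: each rank holds a slice of the weight's output columns, computes a partial result, and the partials are concatenated (the concatenation stands in for the all-gather across tensor-parallel ranks).

```python
def matmul(x, w):
    """x: [n][k] inputs, w: [k][m] weights -> [n][m] outputs."""
    return [[sum(xi[j] * w[j][o] for j in range(len(w)))
             for o in range(len(w[0]))] for xi in x]

def split_columns(w, parts):
    """Give each of `parts` ranks a contiguous slice of w's columns."""
    per = len(w[0]) // parts
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(parts)]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, 2)                 # one weight shard per "rank"
partials = [matmul(x, shard) for shard in shards]
# Concatenate each row's partial outputs, as an all-gather would.
out = [sum((p[i] for p in partials), []) for i in range(len(x))]
full = matmul(x, w)
```

Splitting along columns means each rank needs the full input but only its own weight slice, and no reduction is needed for the forward output, only a gather.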

Batch Sampling

References: apex/transformer/_data

The …/_data directory provides functionality for sampling batches of data during pretraining of transformer models in a data-parallel manner, so that each rank consumes its own shard of every global batch.
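
The core idea can be sketched as follows; `shard_batch_indices` is a hypothetical helper illustrating an even split of each global batch across data-parallel ranks, not Apex's actual API.

```python
def shard_batch_indices(global_batch_size, rank, world_size, start=0):
    """Indices of one global batch consumed by this data-parallel rank.

    Assumes global_batch_size divides evenly by world_size, so the shards
    are disjoint and together cover the whole global batch.
    """
    per_rank = global_batch_size // world_size
    lo = start + rank * per_rank
    return list(range(lo, lo + per_rank))

# Global batch of 8 samples split across 2 data-parallel ranks.
shards = [shard_batch_indices(8, r, 2) for r in range(2)]
```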

Layer Normalization

References: apex/transformer/layers

The layer normalization implementations provided in Apex are optimized for efficiently training transformer models. This functionality is contained within the …/layers directory and its submodules.
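
For reference, here is the computation a layer-norm kernel performs, written out in plain Python; the optimized implementations fuse these steps and parallelize across rows.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector to zero mean / unit variance, then
    apply the learned elementwise scale (gamma) and shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gamma, beta)]

y = layer_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
```

With unit gamma and zero beta, the output has (approximately) zero mean and unit variance, which is what the assertions below check.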

Testing Utilities

References: apex/transformer/testing

The testing utilities in …/testing provide functionality for validating Transformer models and components. Centralized argument parsing and validation is handled by …/, which checks that hyperparameters are compatible and returns a validated namespace. Global state such as timers and batch-size tracking is managed by …/.

Recurrent Neural Networks

References: apex/RNN

The …/RNN directory provides the core building blocks for implementing recurrent neural network (RNN) models in PyTorch. It contains utilities for constructing RNN models by stacking cells together into deeper networks.

RNN Cells

References: apex/RNN/

The file …/ implements the core RNN cell types. For the LSTM cell, linear layers compute the input, forget, output, and candidate cell gates from the input and hidden state; the new cell state is then formed from the forget and input gates, and the new hidden state is computed from it. This provides the core LSTM cell logic used to build LSTM models.
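
The gate arithmetic can be written out for a single scalar unit as follows (a didactic sketch, not Apex's implementation, which operates on batched tensors with weight matrices).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step for a single scalar unit.

    W, U, b hold the input/recurrent weights and biases for the four gates,
    in the order [input, forget, candidate, output].
    """
    i = sigmoid(W[0] * x + U[0] * h + b[0])    # input gate
    f = sigmoid(W[1] * x + U[1] * h + b[1])    # forget gate
    g = math.tanh(W[2] * x + U[2] * h + b[2])  # candidate cell value
    o = sigmoid(W[3] * x + U[3] * h + b[3])    # output gate
    c_new = f * c + i * g                       # update the cell state
    h_new = o * math.tanh(c_new)                # expose it through the output gate
    return h_new, c_new

# With all-zero weights the gates sit at 0.5 and the candidate at 0,
# so the cell state simply halves each step.
h1, c1 = lstm_cell(0.0, 0.0, 2.0, [0.0] * 4, [0.0] * 4, [0.0] * 4)
```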

RNN Utilities

References: apex/RNN/, apex/RNN/

The …/ file provides high-level RNN model classes that handle creating modules with the appropriate RNN cell types.

RNN Initialization

References: apex/RNN/

This section covers how the core RNN functionality defined in Apex is initialized and re-exported. The …/ file imports several common RNN cell classes and activation functions from the …/ submodule.

Multi-Layer Perceptrons

References: apex/mlp

Apex provides standardized implementations of multi-layer perceptron (MLP) models through functionality in the …/mlp directory. The core class is defined in …/ and represents an MLP model, taking hyperparameters such as the layer sizes, bias, and activation. Weights and biases are initialized as parameters, and methods run the forward and backward passes of the network.
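
A minimal plain-Python sketch of the forward pass such an MLP runs (the helper name and layout here are illustrative, not Apex's API): each layer is a weight matrix and bias vector, with a ReLU between layers and no activation after the last.

```python
def mlp_forward(x, layers):
    """Apply a stack of (weights, bias) layers with ReLU between them.

    layers: list of (W, b) where W is [out][in] and b is [out].
    """
    relu = lambda v: max(0.0, v)
    for i, (W, b) in enumerate(layers):
        # Affine transform: out_j = sum_k W[j][k] * x[k] + b[j]
        x = [sum(wk * xk for wk, xk in zip(row, x)) + bj
             for row, bj in zip(W, b)]
        if i < len(layers) - 1:        # no activation on the output layer
            x = [relu(v) for v in x]
    return x

# 2 -> 2 -> 1 network: identity first layer, then a summing output layer.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
y = mlp_forward([2.0, -3.0], layers)
```

The fused MLP idea is to run this whole stack in one kernel instead of one launch per layer, which is where the throughput gain comes from.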


Documentation

References: docs

The documentation build is defined in the docs directory and its subdirectories. This includes configuring the Sphinx documentation build and customizing the documentation theme and styling.

Sphinx Documentation Configuration

References: docs/source/

The …/ file contains the configuration needed to build the Sphinx documentation for Apex, covering tasks such as project settings, enabled extensions, and HTML theme options.
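
An illustrative `conf.py` fragment showing the kind of settings such a file typically contains (these are common Sphinx choices, not necessarily Apex's exact configuration):

```python
# Illustrative Sphinx conf.py fragment -- typical settings, not Apex's exact file.
project = "Apex"

extensions = [
    "sphinx.ext.autodoc",      # pull API docs from Python docstrings
    "sphinx.ext.napoleon",     # Google/NumPy-style docstring support
    "sphinx.ext.intersphinx",  # cross-link to external docs, e.g. PyTorch
]

intersphinx_mapping = {
    "torch": ("https://pytorch.org/docs/stable/", None),
}

html_theme = "sphinx_rtd_theme"
templates_path = ["_templates"]   # theme overrides such as layout.html
html_static_path = ["_static"]    # custom CSS lives here
```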

Documentation Theme Customization

References: docs/source/_templates, docs/source/_static/css

The customization of the default Sphinx theme is implemented through templates and CSS files. The main template file is …/layout.html. This file uses Jinja templating to extend the base Sphinx layout template. It defines blocks that are rendered after calling the parent implementation.


Tests

References: tests

The tests directory contains automated test suites that validate the core functionality of Apex. It includes lower-level unit tests in …/L0, higher-level integration tests across components in …/L1, and tests for distributed and parallel functionality in …/distributed.

L0 Tests

References: tests/L0

The …/run_amp subdirectory contains extensive unit tests for Apex's Automatic Mixed Precision (AMP) functionality in PyTorch. Key functionality tested includes type promotion behavior, casting between data types, caching behavior during training/evaluation, checkpointing models, the behavior of optimizers used with AMP, handling of multiple models/losses/optimizers, and dynamic loss scaling.

L1 Tests

References: tests/L1

The …/L1 directory contains higher-level integration tests across Apex components. These tests ensure different optimizations and functionality work together correctly.

Distributed Tests

References: tests/distributed

The …/distributed directory contains automated tests for distributed training functionality in Apex. This includes testing distributed data parallelism, automatic mixed precision, and synchronized batch normalization.

Docker Extension Build Tests

References: tests/docker_extension_builds

This section covers tests of Apex installation across multiple PyTorch Docker images. The script in …/docker_extension_builds loops through an array of image names; for each image it prints a banner, pulls the image, runs a Docker container that installs Apex, and checks the exit code, recording a pass/fail result per image.
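
The control flow of such a harness can be sketched in Python with an injectable command runner, so the logic is testable without Docker present; the image tags and install command below are placeholders, not the script's actual values.

```python
import subprocess

def check_images(images, run=None):
    """Pull each image, install Apex inside it, and record pass/fail.

    `run` takes a command list and returns its exit code; by default it
    shells out via subprocess, but tests can inject a fake runner.
    """
    run = run or (lambda cmd: subprocess.call(cmd))
    results = {}
    for image in images:
        print(f"=== {image} ===")                       # banner
        ok = (run(["docker", "pull", image]) == 0
              and run(["docker", "run", "--rm", image,
                       "pip", "install", "-v", "--no-cache-dir",
                       "./apex"]) == 0)                 # placeholder install cmd
        results[image] = "pass" if ok else "fail"
    return results
```

Separating the loop from the command execution mirrors how such scripts are usually tested: the pass/fail bookkeeping can be verified with a stub runner.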
