
apex

Auto-generated from NVIDIA/apex by Mutable.ai Auto Wiki

apex (GitHub Repository)
Developer: NVIDIA
Written in: Python
Stars: 7.8k
Watchers: 102
Created: 2018-04-23
Last updated: 2024-01-05
License: BSD 3-Clause "New" or "Revised"
Repository: NVIDIA/apex

Auto Wiki
Generated at: 2024-01-07
Generated from: Commit 87c4de
Version: 0.0.4

The Apex repository provides utilities to accelerate deep learning workloads in PyTorch using techniques like mixed precision training and distributed data/model parallelism. It contains optimized CUDA/C++ implementations of performance-critical model components and training functions to improve throughput and resource utilization during training.

Some of the key functionality includes:

  • Mixed precision training using FP16 formats to improve performance while maintaining accuracy. This works by wrapping models and optimizers and automatically handling loss scaling, overflow detection, and related bookkeeping.

  • Distributed data parallel training using model wrapping and gradient synchronization utilities. Tests validate behavior such as race-free gradient accumulation.

  • Optimized CUDA kernels for operations like attention, convolution, normalization and more. These fuse computations like GEMM for faster training on GPUs.

  • Optimized deep learning modules like normalization layers, attention layers, and optimizers improve throughput. Tests validate correctness.

  • Specialized utilities like optimized kernels and model/tensor parallelism support accelerate transformer training.

  • Building blocks for recurrent models like LSTM/GRU cells and utilities to construct RNNs.

  • Standardized implementations of multi-layer perceptrons with a common interface.

Extensive test suites validate functionality across test levels, devices, and workflows, and build scripts verify that Apex installs correctly across a range of PyTorch Docker images.

In summary, Apex accelerates deep learning workloads by providing optimized CUDA kernels, model components like attention and normalization layers, mixed precision and distributed training utilities, and comprehensive testing. Together these improve the performance, throughput, and scalability of PyTorch models.

Mixed Precision

References: apex/amp, apex/fp16_utils

Apex provides functionality for enabling mixed precision training using both Automatic Mixed Precision (AMP) and manual FP16 utilities. The main components are the …/amp module and …/fp16_utils module.

The …/amp module implements mixed precision using automatic function wrapping and registration. The …/amp.py file contains the core logic. It handles casting and registering functions to different precisions during execution.

The main entry point is the initialization function in …/_initialize.py, which backs the public amp.initialize() call. It initializes models and optimizers for AMP, casting parameters and processing optimizers.

The …/fp16_utils module provides manual mixed precision control. Utilities in …/fp16util.py allow model and tensor conversion between precisions.

Both AMP and FP16 utilities leverage techniques like loss scaling, master weights, and selective casting to enable stable mixed precision training. AMP handles much of this automatically while the FP16 utilities provide lower-level manual control.
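
The typical workflow documented for apex.amp wraps the model and optimizer once and scales each loss inside a context manager. The sketch below assumes a CUDA-capable setup and an existing DataLoader named loader:

```python
import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Patch the model/optimizer for mixed precision; "O1" casts selected ops to FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for data, target in loader:  # `loader` is assumed to be an existing DataLoader
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(data), target)
    # Scale the loss so small FP16 gradients do not underflow; AMP unscales
    # the gradients again before optimizer.step().
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```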

Mixed Precision Utilities

References: apex/amp, apex/fp16_utils

The core functionality for enabling mixed precision training using FP16 formats is provided by utilities in the …/fp16_utils directory. These utilities implement important algorithms and components for FP16 training.

The …/__init__.py file re-exports the key utilities so they can be imported directly from the fp16_utils package.

Function Wrapping and Registration

References: apex/amp/amp.py, apex/amp/lists

The …/amp.py file handles function wrapping and registration to enable mixed precision training. It contains the main functionality for determining which functions require special handling.

It manually wraps any functions listed in override lists to implement the desired precision rules. It also handles functions that are banned from mixed precision or require special handling due to being in-place.

The functionality allows user-defined functions to be transparently optimized during training by casting them to different precisions based on the active AMP context.

Special care is taken for functions imported from modules that require non-trivial wrapping due to internal optimizations or restrictions.

RNN handling patches the RNN cell attribute to make it mutable. This allows directly wrapping RNN classes and registered cell functions.

Optimizer Handling

References: apex/amp/_amp_state.py, apex/amp/_process_optimizer.py

The …/_amp_state.py file defines a central communication object along with helper functions to interface with it, following a typical pattern for sharing state across modules in a thread-safe way.

The …/_process_optimizer.py file handles optimizer processing during mixed precision training. It supports different modes.

Loss Scaling

References: apex/amp/_amp_state.py, apex/fp16_utils/loss_scaler.py

Loss scaling functionality is handled in the file …/loss_scaler.py.

The LossScaler class represents a static scale factor that is applied to the loss during the backward pass; it simply multiplies the loss by a fixed constant.

The DynamicLossScaler class implements dynamic loss scaling. It tracks the scale factor and automatically adjusts it up or down based on whether overflow is detected during backward passes, aiming to use the highest scale possible without causing overflow.

After each backward pass, it checks for infinity or NaN values in the gradients using a utility function. If overflow is detected, it decreases the scale factor; if several iterations pass without overflow, it increases the scale. This allows it to "ride the edge" of the highest stable scale.

The scale factor is applied to the loss before the backward pass, which scales all resulting gradients; the gradients are then unscaled before the optimizer step. This keeps small gradient values representable in FP16 and improves stability during training.
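
The adjustment logic can be illustrated with a simplified, hypothetical scaler (not the actual classes in …/loss_scaler.py):

```python
import torch

class SimpleDynamicScaler:
    """Illustrative dynamic loss scaler: back off on overflow, grow after a stable window."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=1000):
        self.scale = init_scale
        self.factor = factor      # multiplicative step used when adjusting the scale
        self.window = window      # iterations without overflow before growing the scale
        self.good_steps = 0

    def has_overflow(self, params):
        # Any inf/NaN gradient means the current scale was too aggressive.
        return any(p.grad is not None and not torch.isfinite(p.grad).all() for p in params)

    def update(self, params):
        if self.has_overflow(params):
            self.scale /= self.factor   # back off; the caller should skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            self.scale *= self.factor   # "ride the edge" by trying a higher scale
        return True
```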

Initialization and Configuration

References: apex/amp/_initialize.py, apex/amp/__init__.py

This section covers the entry points and configuration options for initializing mixed precision training in Apex.

The main entry point for initializing AMP is the initialization function in …/_initialize.py, which backs the public amp.initialize() call. This function takes the models, optimizers, and AMP properties as arguments and then performs several key initialization steps.

First, it checks that the models and optimizers are in the expected format and have not been previously wrapped for mixed precision.

It casts the model parameters and buffers to a mixed precision type if specified in the AMP properties.

The optimizers are then processed to prepare them for mixed precision training.

For each loss, an instance of a loss scaler class is initialized; this class manages the loss scaling value separately for each loss.

Finally, it returns the initialized models and optimizers.

Configuration options for AMP are handled in …/__init__.py. This file serves as an initialization point, importing the core functions from the AMP module without containing significant logic itself. Specifically, it imports functions that handle casting PyTorch modules to different precisions from …/amp. It also imports the global AMP state which stores shared AMP state across the process.

Distributed Training

References: apex/parallel, tests/distributed

The main utilities for distributed training in Apex are handled in the …/parallel directory. This directory contains several important files and classes for distributed data and model parallelism.

The …/distributed.py file contains utilities for data parallel distributed training in PyTorch.

The …/multiproc.py file contains functions for launching multiple processes locally, with each process assigned to a separate GPU. This allows efficiently training models across all available GPUs on a single node.

The …/LARC.py file implements an adaptive learning rate optimizer.

The …/sync_batchnorm.py and …/sync_batchnorm_kernel.py files contain a class for synchronized batch normalization.
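
A minimal sketch of how these pieces are typically combined, assuming one process per GPU launched with the usual torch.distributed environment variables:

```python
import torch
import torch.distributed as dist
from apex import parallel

# Assumes an env:// rendezvous (RANK, WORLD_SIZE, MASTER_ADDR/PORT set by the launcher).
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# Swap torch.nn.BatchNorm layers for Apex synchronized batch norm.
model = parallel.convert_syncbn_model(model)

# Wrap the model for bucketed gradient allreduce across ranks.
model = parallel.DistributedDataParallel(model)
```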

Distributed Data Parallel

References: apex/parallel/distributed.py, tests/distributed/DDP

Gradients are allreduced across processes using bucketing. During each iteration of the test, the input tensor is filled with unique values on each device to intentionally provoke race conditions, while configurations such as message size and the number of allreduce streams are varied.

The expected gradient values based on the input are computed and compared to the actual summed values. This verifies gradients are accumulating correctly without race conditions under different configurations.

The …/ddp_race_condition_test.py file contains the main test logic. It defines a simple model class with two parameters, wraps it for data parallelism across devices, computes gradients on the rank-unique inputs, and checks the gradients after each iteration.

The …/run_race_test.sh shell script launches the test in a distributed manner across multiple devices.
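
In spirit, the check works like the following sketch (hypothetical shapes and values rather than the actual test code; assumes the process group is initialized as above):

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

rank, world = dist.get_rank(), dist.get_world_size()
model = DDP(torch.nn.Linear(8, 1, bias=False).cuda())

# Fill the input with rank-unique values so a race in the bucketed allreduce
# would show up as a gradient that differs from the analytically expected value.
x = torch.full((4, 8), float(rank + 1), device="cuda")
model(x).sum().backward()

# Apex DDP averages gradients across ranks by default, so each weight gradient
# entry should equal the batch size times the mean of the rank-unique inputs.
expected = 4.0 * sum(r + 1 for r in range(world)) / world
assert torch.allclose(model.module.weight.grad,
                      torch.full_like(model.module.weight.grad, expected))
```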

Automatic Mixed Precision

References: tests/distributed/amp_master_params

Apex provides functionality for mixed precision in distributed training through Automatic Mixed Precision (AMP). AMP allows training with smaller datatypes like half-precision floats to accelerate training, while maintaining the accuracy achieved with full precision. It handles operations like loss scaling, optimizer wrapping, and synchronization of parameters across devices.

The …/amp_master_params directory contains tests of distributed training using AMP. The …/amp_master_params.py file sets up a simple linear model for distributed data parallel training across multiple GPUs. It initializes the model and optimizer for AMP with an opt level of "O2", which enables mixed precision operations and loss scaling. The model is then wrapped for distributed training. A training loop runs for 500 steps using fake data, scaling the loss before the backward pass. After training, the model and master parameters are saved separately for each rank.

The …/compare.py file loads models from separate checkpoints for two ranks. It asserts that the model parameters match and that the master parameters match across ranks after casting the masters to half precision. This verifies synchronization of parameters under AMP.

These tests cover the full workflow of distributed training using AMP, from launching the job to validating final results. AMP handles operations like loss scaling, optimizer wrapping, and synchronization of parameters to enable mixed precision training in a distributed setting.
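
The cross-rank comparison in …/compare.py can be pictured roughly as follows (hypothetical checkpoint file names and layout):

```python
import torch

# Hypothetical file names; the real test saves model and master params per rank.
model0, master0 = torch.load("rank0_model.pt"), torch.load("rank0_master.pt")
model1, master1 = torch.load("rank1_model.pt"), torch.load("rank1_master.pt")

# FP16 model parameters must match across ranks after training.
for p0, p1 in zip(model0, model1):
    assert torch.equal(p0, p1), "model params diverged across ranks"

# Master weights are kept in FP32; cast them to half before comparing so the
# check matches the precision actually used by the model.
for m0, m1 in zip(master0, master1):
    assert torch.equal(m0.half(), m1.half()), "master params diverged across ranks"
```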

Optimized Building Blocks

References: apex/contrib, csrc

The …/csrc directory contains optimized C++/CUDA implementations of core deep learning operations and components. It provides low-level building blocks that can be used to accelerate models and training via the PyTorch C++ interface.

Some key functionality includes:

  • The …/conv_bias_relu.cpp file contains functions to efficiently fuse common CNN operations like convolution, bias addition, and activation functions using the cuDNN backend.

  • Files like …/group_norm_nhwc.cpp contain CUDA kernel implementations for the forward and backward passes of group normalization in NHWC layout.

Key aspects of the C++/CUDA implementations include:

  • Leveraging low-level CUDA primitives and parallelism to accelerate computations
  • Optimizing for half-precision via templates to support mixed precision
  • Providing a clean C++ interface that is exposed to Python via bindings
  • Implementing common deep learning primitives as reusable building blocks

Optimized CUDA Kernels

References: apex/contrib/csrc/conv_bias_relu, apex/contrib/csrc/focal_loss, apex/contrib/csrc/fmha/src, apex/contrib/csrc/group_norm, apex/contrib/csrc/index_mul_2d, apex/contrib/csrc/layer_norm, apex/contrib/csrc/multihead_attn, apex/contrib/csrc/optimizers, apex/contrib/csrc/transducer, apex/contrib/csrc/xentropy

The optimized CUDA kernels are implemented in several key files and directories:

  • …/conv_bias_relu contains implementations of common convolutional neural network operations.

  • …/focal_loss provides an optimized CUDA implementation of the focal loss calculation for object detection.

  • …/src contains CUDA kernel and C++ implementations for efficiently performing multi-head attention using half-precision matrices.

  • …/group_norm provides CUDA implementations of group normalization.

  • …/index_mul_2d_cuda.cpp defines CUDA kernels.

  • …/layer_norm contains CUDA kernels.

Comprehensive Testing

References: apex/contrib/test, tests

The Apex test suites provide comprehensive validation of optimized implementations through extensive unit testing. Key aspects include:

  • Thoroughly testing optimized CUDA kernels and modules across different configurations and data types. This ensures correctness and equivalence to reference implementations.

  • Unit tests defined in test classes validate core functionality.

  • Tests initialize random inputs and partition batches across devices if needed. They run optimized and reference implementations, comparing outputs and gradients to validate that the optimized modules match expected behavior (see the sketch after this list).

  • Benchmarking tests measure performance improvements.

  • Tests cover different configurations by varying shapes, types like float and half, activation functions, and other options. This exercises all variations.

  • Files like …/test_conv_bias_relu.py contain classes to test fused convolutional, bias, and ReLU operations. Tests initialize data and models, and compare results.

  • Tests inject errors to validate exception handling.

  • Distributed tests in directories like …/synced_batchnorm validate functionality across devices and configurations.
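
The common pattern behind these tests looks roughly like the following sketch; torch.nn.LayerNorm stands in for the reference path, and the Apex fused module under test would be swapped in for the placeholder:

```python
import unittest
import torch

class TestOptimizedVsReference(unittest.TestCase):
    def test_matches_reference(self):
        torch.manual_seed(0)
        x_ref = torch.randn(8, 64, device="cuda", dtype=torch.half, requires_grad=True)
        x_opt = x_ref.detach().clone().requires_grad_(True)

        reference = torch.nn.LayerNorm(64).half().cuda()
        optimized = reference  # placeholder: swap in the Apex fused module under test

        # Forward outputs must agree within FP16-appropriate tolerances.
        y_ref, y_opt = reference(x_ref), optimized(x_opt)
        torch.testing.assert_close(y_opt, y_ref, rtol=1e-3, atol=1e-3)

        # Gradients with respect to the inputs must agree as well.
        y_ref.sum().backward()
        y_opt.sum().backward()
        torch.testing.assert_close(x_opt.grad, x_ref.grad, rtol=1e-3, atol=1e-3)

if __name__ == "__main__":
    unittest.main()
```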

Transformer Utilities

References: apex/transformer

The …/transformer directory contains specialized functionality for efficiently training large Transformer models at scale. This includes utilities for both training based on NVIDIA's Megatron-LM using techniques like tensor and pipeline model parallelism, optimized kernels, batch sampling, and mixed precision training.

The …/tensor_parallel functionality handles splitting and scattering weights, activations, and gradients across GPUs during the forward and backward passes using functions defined in …/utils.py.

The …/schedules subdirectory implements scheduling strategies for running forward and backward passes across multiple GPUs, with utilities defined in …/common.py for building models and handling activations/gradients exchange between stages.

The …/functional directory contains utilities for performing operations commonly used in transformer models using functions defined in …/__init__.py.

The …/layers directory provides different implementations of layer normalization optimized for both CPU and GPU.

The …/_data module handles sampling batches of data during pretraining in a data parallel manner using functionality defined in …/__init__.py.

Transformer Model Utilities

References: apex/transformer, apex/transformer/amp, apex/transformer/functional, apex/transformer/layers, apex/transformer/pipeline_parallel, apex/transformer/tensor_parallel, apex/transformer/_data

The …/transformer directory contains utilities for efficiently training large Transformer models using techniques like tensor and pipeline model parallelism, optimized kernels, and batch sampling. This allows models to effectively scale to larger sizes and batch sizes during pretraining.

Some key functionality includes:

  • The …/tensor_parallel subdirectory handles splitting and scattering weights, activations, and gradients across GPUs during forward and backward passes using classes in …/layers.py. It also manages communications, random number generation, and checkpointing activations.

  • The …/pipeline_parallel subdirectory implements pipeline parallelism for training partitioned models across multiple GPUs. Schedules defined in …/schedules handle running the model through warmup, steady state, and cooldown phases with activation exchange between stages.

  • The …/_data subdirectory contains implementations for sampling batches of data in a data parallel manner.

  • Optimized kernels for transformer operations are implemented in …/functional.

  • Utilities are provided at the top level in files like …/__init__.py, …/parallel_state.py, and …/microbatches.py.

Key functionality:

  • Classes in …/layers.py parallelize embeddings and linear layers, handling initializing weights and supporting gradient accumulation during the forward pass.

  • Classes in …/mappings.py implement splitting, gathering, reducing and scattering tensors using PyTorch distributed APIs.

  • Schedules contain classes that build models for pipeline parallelism and handle running forward and backward passes between stages.

Optimized Kernels

References: apex/transformer/amp, apex/transformer/functional

The …/functional directory contains utilities for performing optimized kernel implementations of operations commonly used in transformer models. This includes applying positional encodings, performing attention, and applying normalization layers like layer normalization.

The …/fused_rope.py file defines PyTorch autograd functions and wrapper functions for applying fused positional embeddings to input tensors via kernel fusion.

The …/fused_softmax.py file contains several classes for performing fused operations involving scaling, masking, and softmax. This includes classes for scaled masked softmax and attention softmax. The classes implement the operations by fusing the scaling, masking, and softmax calculation into efficient CUDA kernels.
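
The semantics of the fused operation can be written out in plain PyTorch as a reference (a sketch of what the kernel computes, not the kernel itself):

```python
import torch

def scaled_masked_softmax_reference(scores, mask, scale):
    """Reference semantics: scale attention scores, mask out positions, then softmax."""
    scores = scores * scale
    # `mask` is assumed to be True at positions that must be excluded from attention.
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

# Example: batch of 2, 4 heads, sequence length 8.
scores = torch.randn(2, 4, 8, 8)
mask = torch.zeros(2, 1, 8, 8, dtype=torch.bool)
probs = scaled_masked_softmax_reference(scores, mask, scale=1.0 / 8 ** 0.5)
```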

The …/__init__.py file imports utilities for applying positional encodings.

Pipeline Model Parallelism

References: apex/transformer/pipeline_parallel

Pipeline model parallelism partitions the model across multiple GPUs such that each GPU processes a subset of layers sequentially. This allows training much deeper models than would fit on a single device. Apex provides utilities for implementing pipeline parallelism during transformer training.

The core logic is implemented in the …/schedules directory. This contains scheduling strategies for running the forward and backward passes across stages in a pipelined manner. The …/common.py file contains utilities for building partitioned models.

Scheduling strategies are implemented in files like …/fwd_bwd_pipelining_without_interleaving.py. This file defines a function that runs the model through three phases: warmup, steady state, and cooldown. In the steady state phase, it runs interleaved forward and backward passes on microbatches while exchanging activations and gradients between stages via point-to-point communication functions.

Lower level communication functions are contained in …/p2p_communication.py. This includes functions for sending and receiving tensors asynchronously between previous and next pipeline stages. Optimizations are provided, such as scatter-gather and batched communication.

Utilities for building partitioned models and handling activations/gradients are provided in …/common.py.

An entry point for scheduling functions is provided via …/__init__.py. Based on the model parallel configuration, this file returns the appropriate scheduling function to run forward and backward passes in a pipelined manner across stages.
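
The three-phase structure can be sketched as follows, with hypothetical helper callables standing in for the actual Apex forward/backward steps and point-to-point communication functions:

```python
def pipeline_schedule(microbatches, num_warmup, forward_step, backward_step,
                      send_forward, recv_forward, send_backward, recv_backward):
    """Simplified non-interleaved schedule: warmup forwards, 1F1B steady state, cooldown backwards.

    The first and last pipeline stages would skip the corresponding recv/send calls.
    """
    pending = []  # activations waiting for their backward pass

    # Warmup: forward passes only, filling the pipeline.
    for i in range(num_warmup):
        y = forward_step(microbatches[i], recv_forward())
        send_forward(y)
        pending.append(y)

    # Steady state: one forward and one backward pass per remaining microbatch.
    for i in range(num_warmup, len(microbatches)):
        y = forward_step(microbatches[i], recv_forward())
        send_forward(y)
        pending.append(y)

        grad_in = backward_step(pending.pop(0), recv_backward())
        send_backward(grad_in)

    # Cooldown: drain the backward passes left over from warmup.
    for _ in range(num_warmup):
        grad_in = backward_step(pending.pop(0), recv_backward())
        send_backward(grad_in)
```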

Tensor Model Parallelism

References: apex/transformer/tensor_parallel

The Apex library provides several utilities to support tensor model parallelism for efficiently training large Transformer models across multiple GPUs. Tensor model parallelism involves splitting model weights, activations, and gradients across GPUs along the tensor (model) parallel dimension.

The …/tensor_parallel directory contains the primary code implementing tensor model parallelism in Apex. It includes functionality for splitting and scattering weights and data, communicating between GPUs, and managing randomness.

Key components include:

  • The …/__init__.py file contains general utilities used across files, including for communications, random number generation, and checkpointing activations between GPUs.

  • Classes in …/layers.py handle parallelizing embeddings and linear layers. They initialize shared weights and scatter subsets to each GPU, mapping tensors during the forward and backward passes.

  • …/mappings.py defines classes that implement splitting, gathering, reducing and scattering tensors across dimensions using PyTorch distributed APIs.

  • …/memory.py contains classes for managing memory buffers in a reusable, non-fragmented way across GPUs.

  • …/random.py provides utilities for independently seeding and tracking random number generator states on each GPU.

  • Other files implement specific functionality like calculating loss, broadcasting data, and splitting vocabularies and tensors for parallelism.
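
To make the splitting concrete, here is a simplified column-parallel linear layer; it assumes torch.distributed is initialized and is only an illustration of the idea, not the class in …/layers.py:

```python
import torch
import torch.distributed as dist

class SimpleColumnParallelLinear(torch.nn.Module):
    """Each rank owns a slice of the output features; outputs are gathered at the end."""

    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "output features must divide evenly across ranks"
        # Each rank stores only its shard of the full weight matrix.
        self.weight = torch.nn.Parameter(torch.randn(out_features // world, in_features) * 0.02)

    def forward(self, x):
        local_out = torch.nn.functional.linear(x, self.weight)
        # Gather the per-rank output slices along the feature dimension.
        # (The real implementation uses custom autograd functions so that
        # gradients flow correctly back through the gather.)
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```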

Batch Sampling

References: apex/transformer/_data

The …/_data directory provides the batch sampler implementations used for sampling batches of data during pretraining of transformer models in a data parallel manner.

The sampler classes are defined in the …/_batchsampler.py file. Both implement an abstract base class defined in the same file, which requires them to provide common sampling methods and thereby enforces a shared interface for batch sampling.

One implementation handles randomly sampling batches of indices during pretraining. It takes arguments like the total sample count, samples already consumed, local batch size, data parallel rank and size. It samples batches by iterating randomly from the consumed sample index to the end of the data.

Another implementation takes the same arguments as the random sampler but applies a non-random strategy; for example, it may draw samples from the bucket of the dataset assigned to each data parallel process to help balance samples across processes.

The …/__init__.py file imports and re-exports these sampling implementations under the namespace of this submodule, making them available for use in pretraining transformer models.
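
The data-parallel slicing can be illustrated with a minimal sampler; this is a hypothetical sketch rather than the classes in …/_batchsampler.py:

```python
import torch

class SimpleDataParallelRandomSampler:
    """Each data-parallel rank yields its contiguous slice of every global batch of indices."""

    def __init__(self, total_samples, consumed_samples, micro_batch_size,
                 data_parallel_rank, data_parallel_size, seed=1234):
        self.total_samples = total_samples
        self.consumed_samples = consumed_samples
        self.micro_batch_size = micro_batch_size
        self.rank = data_parallel_rank
        self.global_batch = micro_batch_size * data_parallel_size
        self.seed = seed

    def __iter__(self):
        # All ranks shuffle identically so their slices never overlap.
        g = torch.Generator().manual_seed(self.seed)
        order = torch.randperm(self.total_samples, generator=g).tolist()
        for start in range(self.consumed_samples, self.total_samples, self.global_batch):
            global_batch = order[start:start + self.global_batch]
            if len(global_batch) < self.global_batch:
                break  # drop the final incomplete global batch
            lo = self.rank * self.micro_batch_size
            yield global_batch[lo:lo + self.micro_batch_size]
```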

Layer Normalization

References: apex/transformer/layers

The layer normalization implementations provided in Apex are optimized for efficiently training transformer models. This functionality is contained within the …/layers directory and its submodules.

The core layer normalization functionality is provided by classes defined in …/layer_norm.py. This file handles layer normalization in a way that is compatible with Megatron-LM.

The …/__init__.py file imports layer normalization functionality from the layers submodule. This provides a clean interface.

Testing Utilities

References: apex/transformer/testing

The testing utilities in …/testing provide functionality for validating Transformer models and components. Centralized argument parsing and validation is handled by …/arguments.py, which checks hyperparameters are compatible and returns a validated namespace. Global state like timers and batch size tracking is managed by …/global_vars.py.

Utilities for initialization, preprocessing, and model handling are in …/commons.py. This includes functions that return instances of classes to pass to test functions. Base classes for writing distributed tests are defined in …/distributed_test_base.py. These classes initialize the process group and select the communication backend.

Standalone BERT models can be tested using functionality in …/standalone_bert.py. This file contains functions for initializing models, creating attention masks, and processing outputs.

Testing GPT models is supported by code in …/standalone_gpt.py. The main class handles preprocessing, forwarding the language model, and postprocessing. It retrieves the GPT module and supports distributed functionality.

Generic transformer language models can also be built and run standalone using …/standalone_transformer_lm.py.

Recurrent Neural Networks

References: apex/RNN

The …/RNN directory provides the core building blocks for implementing recurrent neural network (RNN) models in PyTorch. It contains utilities for constructing RNN models by stacking cells together into deeper networks.

The …/RNNBackend.py file defines important classes for building RNNs.

The …/cells.py file focuses on implementing individual RNN cells. It contains a cell class providing the core LSTM computation, which checks whether the input is on CUDA and selects the appropriate implementation. The LSTM logic computes the gates from the inputs and updates the new cell and hidden states, and the cell serves as a basic building block that can be stacked into deeper networks.

The …/models.py file defines common high-level RNN models by creating cell objects of the appropriate type and passing them to the backend constructors, which build RNN models with the desired properties such as the number of layers and layer types.

RNN Cells

References: apex/RNN/cells.py

The file …/cells.py implements core RNN cell types. It utilizes functions to handle the LSTM cell computation. Linear layers compute the input, forget, output, and cell gates from the input and hidden state. It updates the new cell state and calculates the new hidden state. This provides the core LSTM cell logic that can be used to build LSTM models.
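
The gate arithmetic corresponds to the standard LSTM cell update, sketched here in plain PyTorch for reference:

```python
import torch

def lstm_cell_reference(x, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    """Standard LSTM cell: compute the four gates, then the new cell and hidden states."""
    h_prev, c_prev = hidden
    gates = (torch.nn.functional.linear(x, w_ih, b_ih) +
             torch.nn.functional.linear(h_prev, w_hh, b_hh))
    i, f, g, o = gates.chunk(4, dim=1)

    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input/forget/output gates
    g = torch.tanh(g)                                               # candidate cell values

    c_new = f * c_prev + i * g          # update the cell state
    h_new = o * torch.tanh(c_new)       # compute the new hidden state
    return h_new, c_new
```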

RNN Utilities

References: apex/RNN/models.py, apex/RNN/RNNBackend.py

The …/models.py file provides high-level RNN model classes that handle creating modules with the appropriate RNN cell types.

The …/RNNBackend.py file implements the core RNN functionality.

RNN Initialization

References: apex/RNN/__init__.py

This section handles initializing and re-exporting the core RNN functionality defined in Apex. The …/__init__.py file imports several common RNN cell classes and activation functions from the …/models.py submodule.

These RNN-related objects are then re-exported at the top-level namespace for external usage. This allows code importing from apex.RNN to access the objects directly without specifying the submodule, making the API cleaner and more intuitive. Under the hood, …/__init__.py acts as a facade, abstracting away the submodule structure and providing a single entry point for the RNN functionality.

The actual implementations of these objects are defined in the …/models.py submodule. This lower-level module contains the logic, while …/__init__.py focuses only on importing and re-exporting for external API usage.

Multi-Layer Perceptrons

References: apex/mlp

Apex provides standardized implementations of multi-layer perceptron (MLP) models through functionality in the …/mlp directory. The core functionality is defined in …/mlp.py, which implements an MLP model class. It takes hyperparameters such as the layer sizes, bias, and activation; weights and biases are initialized as parameters, and methods run the forward and backward passes of the network.

The …/__init__.py file acts as an entry point, importing all public classes, functions, and objects from the other files and re-exporting them. This simplifies imports for users.
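
A plain PyTorch equivalent of the interface described above might look like the following; this is an illustrative sketch, not the Apex class backed by the C++ extension:

```python
import torch

class ReferenceMLP(torch.nn.Module):
    """Stack of linear layers defined by a list of sizes, with optional bias and activation."""

    def __init__(self, sizes, bias=True, activation=torch.nn.ReLU):
        super().__init__()
        layers = []
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            layers.append(torch.nn.Linear(n_in, n_out, bias=bias))
            layers.append(activation())
        self.net = torch.nn.Sequential(*layers[:-1])  # no activation after the last layer

    def forward(self, x):
        return self.net(x)

mlp = ReferenceMLP([512, 256, 128, 10])
out = mlp(torch.randn(32, 512))
```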

Documentation

References: docs

The core documentation functionality is defined in the docs directory and subdirectories. This includes configuring Sphinx documentation builds and customizing the documentation theme and styling.

The main Sphinx configuration is defined in …/conf.py. This file imports the apex module and sets the path to include the current directory for imports during documentation builds. It configures Sphinx to build the documentation.

Templates for customizing Sphinx's default HTML layout are in …/_templates. The key template is …/layout.html, which uses Jinja templating to extend Sphinx's base layout. It defines blocks that are rendered after calling the parent template. In the blocks, it adds custom CSS styles to change colors and branding for Apex. It also styles definition terms in documentation content.

CSS styles for the Apex documentation theme are in …/pytorch_theme.css. This file contains styles like fonts, colors, and responsive layouts. It sets a white background and styles code elements with different colors.

No custom classes, functions, or algorithms are defined for the documentation functionality - it focuses on configuring Sphinx builds and customizing output through templates and CSS. The main implementation choices are using Jinja templating and CSS injection to modify Sphinx's default theme without JavaScript.

Sphinx Documentation Configuration

References: docs/source/conf.py

The …/conf.py file contains the configuration needed to build the Sphinx documentation for Apex. This file handles important tasks like:

  • Importing the apex module, which allows the documentation to reference classes and functions from the Apex codebase.

  • Setting the Python module path to include the current directory …/source, so that Sphinx can locate imports of Apex modules during the documentation build process.

  • Configuring Sphinx settings like the project name, copyright, version, and other metadata.

  • Setting directives that control how elements are displayed in the documentation.

  • Configuring the intersphinx mapping, which allows linking from Apex documentation to other documentation sets.

  • Adding custom CSS and JavaScript that is used to modify the documentation styling and themes.

  • Configuring the HTML builder and output settings like the theme and title.

By handling tasks like importing modules, configuring directives, setting paths, and customizing the output, the configuration file centralizes all of the settings needed to build coherent Sphinx documentation from the code comments and docstrings in Apex. This allows developers to focus on writing documentation without worrying about the build configuration.
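
A representative excerpt of the kind of settings involved (illustrative values, not the exact contents of …/conf.py):

```python
# Illustrative Sphinx configuration; names and values are examples only.
import os
import sys

sys.path.insert(0, os.path.abspath('.'))   # let Sphinx resolve local imports
import apex                                 # so autodoc can reference Apex objects

project = 'Apex'
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx',
    'sphinx.ext.napoleon',
]
intersphinx_mapping = {
    'python': ('https://docs.python.org/3', None),
    'torch': ('https://pytorch.org/docs/stable', None),
}
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
```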

Documentation Theme Customization

References: docs/source/_templates, docs/source/_static/css

The customization of the default Sphinx theme is implemented through templates and CSS files. The main template file is …/layout.html. This file uses Jinja templating to extend the base Sphinx layout template. It defines blocks that are rendered after calling the parent implementation.

In one block, custom CSS styles are added to change colors of elements like the sidebar search box and top navigation bar for the Apex branding. Definition term elements in documentation content are also styled. Another block injects additional CSS into the page footer.

The CSS files are located in …/css. The main file is …/pytorch_theme.css which contains styling for PyTorch documentation pages. It sets fonts, colors, layout adjustments and customizes how different elements are displayed across screen sizes.

Testing

References: tests

The tests directory contains automated test suites that validate the core functionality of Apex. It includes lower-level unit tests in …/L0, higher-level integration tests across components in …/L1, and tests for distributed and parallel functionality in …/distributed.

The …/L0 directory contains unit tests that validate that individual components function as expected, including tests for the MLP modules in …/run_mlp. The optimizer tests create a reference implementation, initialize both the reference and Apex optimizers, and assert that the parameters are equivalent after multiple iterations.

The …/common directory provides shared utilities and data structures for other test files.

The …/distributed directory contains tests for distributed functionality. The …/DDP directory tests race conditions in PyTorch's distributed data parallel implementation. It defines a simple model and inputs with unique values to intentionally cause races, validating gradients still accumulate correctly. The …/synced_batchnorm directory contains tests that partition batches across devices and use statistical tests to check equivalence between a reference module and the synchronized batch norm module.

L0 Tests

References: tests/L0

The …/run_amp subdirectory contains extensive unit tests for Apex's Automatic Mixed Precision (AMP) functionality in PyTorch. It tests many aspects of using AMP for mixed precision training, including type promotion behavior, casting between data types, caching behavior during training and evaluation, checkpointing models, fused optimizers such as fused SGD under AMP, handling multiple models/losses/optimizers, and dynamic loss scaling.

Tests are implemented as Python unit test classes defined in files. Common utilities and constants are defined in …/utils.py. The file …/test_fused_sgd.py contains important tests for the fused SGD optimizer under AMP.

The …/run_mlp subdirectory contains tests for a multi-layer perceptron (MLP) implementation in C++. The main test file is …/test_mlp.py, which defines a class containing methods to test creation of the MLP, compare outputs and gradients to a PyTorch reference implementation, test various configurations, test without gradients, and benchmark performance.

L1 Tests

References: tests/L1

The …/L1 directory contains higher-level integration tests across Apex components. These tests ensure different optimizations and functionality work together correctly.

The main subdirectories covered under L1 tests are:

  • …/cross_product contains tests for the cross product operation. Random vectors are generated and the tested result is compared against a manual computation.

  • …/cross_product_distributed includes distributed tests for cross product. The …/run.sh script runs a test, passing arguments to indicate distributed testing.

  • …/transformer focuses on testing transformer models using model parallelism. It profiles different partitioning strategies and asynchronous communication performance.

These tests cover a range of integration scenarios, from basic operations like cross product, to distributed functionality, to optimized transformer training.

The …/common directory holds shared utilities for L1 tests. This includes functions for validating results across configurations.

Tests rely on these common modules to implement functionality like data loading, loss computation, metric tracking, and optimization required for robust model evaluation. Classes pass different arguments to comprehensively test Apex over a range of settings.

Distributed Tests

References: tests/distributed

The …/distributed directory contains automated tests for distributed training functionality in Apex. This includes testing distributed data parallelism, automatic mixed precision, and synchronized batch normalization.

The key subdirectories contain important tests:

  • …/DDP tests race conditions that could occur in distributed data parallel training.

  • …/amp_master_params contains end-to-end tests for distributed training with Apex AMP.

  • …/synced_batchnorm contains unit tests for synchronized batch normalization.

The …/ddp_race_condition_test.py file defines a simple model and intentionally causes races during training to validate the DDP implementation is race-free.

The …/amp_master_params.py file sets up a linear model and runs distributed training across GPUs with AMP enabled. It validates synchronization of parameters and consistency across ranks.

Tests generate random inputs, partition batches across GPUs, and use statistical tests to check equivalence between modules and a reference implementation.

Docker Extension Build Tests

References: tests/docker_extension_builds

This section tests installation of Apex across multiple PyTorch Docker images. The code in …/docker_extension_builds loops through an array of image names. For each image, it prints a banner, pulls the image, runs a Docker container to install Apex, and checks the exit code. It records pass/fail results for each image.

After testing all images, it loops through the recorded codes and prints overall results. The main logic is:

  1. Declare array of images to test
  2. Loop through images
  3. Pull image using Docker
  4. Run Docker container, install Apex
  5. Check exit code
  6. Record exit code
  7. Loop through codes, print results
  8. Print overall success or failure

This allows systematically testing Apex builds across multiple PyTorch environments.