Auto-generated from mosaicml/composer by Mutable.ai Auto Wiki

composer
GitHub Repository
Developer: mosaicml
Written in: Python
Stars: 4.8k
Watchers: 49
Created: 2021-10-12
Last updated: 2024-01-07
License: Apache License 2.0
Homepage: docs.mosaicml.com
Repository: mosaicml/composer
Auto Wiki
Generated at: 2024-01-08
Generated from: Commit 5592e4
Version: 0.0.4

Composer is a Python library for training deep learning models. It provides a modular framework for composing datasets, models, algorithms, and other utilities to build end-to-end training pipelines.

Some of the key functionality provided includes:

  • Models - Implementations of full models like ResNet and BERT as well as reusable model components. Models integrate with the training loop and handle tasks like forward passes, loss calculation, and metrics. (…/models, …/tasks)

  • Algorithms - Modules implementing techniques like optimization (e.g. …/sam), regularization (e.g. …/label_smoothing), and architecture modifications (e.g. …/squeeze_excite). Algorithms modify models and training loops.

  • Datasets - Utilities for loading (…/datasets), caching, distributing, and augmenting samples from common datasets like ImageNet and CIFAR-10.

  • Trainer - The …/trainer module handles the core training loop and model optimization. Callbacks and events hook into this loop to implement functionality like distributed training and profiling.

  • Testing - Comprehensive test suites (tests) validate functionality using mocks, fakes, and parameterization to cover many variants.

The modular architecture allows flexible composition of these components into custom training pipelines. Key frameworks like PyTorch handle lower-level model operations. The higher-level abstractions provide structure and simplify training loop orchestration.

For example, the …/models directory contains full implementations that handle integration details like loss functions and metrics. The …/algorithms directory provides modular blocks that modify models and loops. Together these allow users to easily build models and modify training.

The design choices focus on abstraction, composition, and testing. Components have narrow scopes that work together. Extensive testing handles validation across configurations and provides examples.

Models

References: composer/models, composer/models/tasks, composer/models/resnet_cifar

The …/models directory contains implementations of full model architectures as well as reusable model components. Key model implementations include:

  • The …/resnet_cifar subdirectory defines ResNet architectures adapted for CIFAR image classification tasks. The class in …/resnets.py implements the core ResNet model, containing residual blocks and classification layers defined in other files.

  • The …/efficientnetb0 subdirectory implements EfficientNet-B0, a convolutional neural network designed for image classification. The class in …/efficientnets.py represents the overall model architecture as sequential bottleneck blocks defined in other files.

  • The …/classify_mnist subdirectory contains a CNN for classifying MNIST digits defined by the class in …/model.py.

  • The …/tasks subdirectory provides reusable abstractions like the class defined in …/classification.py.

Convolutional Networks

References: composer/models/classify_mnist, composer/models/efficientnetb0

The core convolutional network implementations focus on classifying images from common computer vision datasets. The …/classify_mnist directory contains a convolutional neural network for classifying MNIST digit images.

The …/model.py file defines the network architecture as a series of convolutional and pooling layers and implements the forward pass through the model. A helper function creates instances of the model class for training and inference via the API.

The …/efficientnetb0 directory contains an implementation of the EfficientNet-B0 architecture. The model class represents the overall network as a sequence of blocks. These blocks alternate between pointwise convolutions and depthwise convolutions, implementing the inverted bottleneck structure. A helper decodes block string specifications to construct the model from hyperparameters. A wrapper class allows using EfficientNet-B0 for classification tasks.

Vision Transformers

References: composer/models/vit_small_patch16

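Vision transformer models apply self-attention over patches of an image rather than using convolutions. Because every patch can attend to every other patch, long-range dependencies can be captured within a single layer, whereas a CNN builds them up gradually through stacked local receptive fields.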

The …/vit_small_patch16 directory contains an implementation of the Vision Transformer Small Patch 16 model for image classification. The …/__init__.py file imports and exports the model initialization function.

The key file is …/model.py. It contains a function that initializes the model from hyperparameters, loads pretrained weights, and returns an instance with the model attached.

The instance provides a simple API for tasks like predicting, calculating loss and metrics. This implementation allows easily loading the Vision Transformer Small Patch 16 model for transfer learning on new datasets. The model focuses on the core self-attention architecture, while the initialization function and instance handle common tasks.

ResNet Models

References: composer/models/resnet, composer/models/resnet_cifar

The …/resnet directory implements various ResNet architectures for image classification tasks. It provides configurable ResNet models of different depths.

Metadata on the various pretrained ResNet models is defined in …/__init__.py, including the expected performance on common datasets. This file also exports symbols for import.

Additional ResNet subclasses adapted specifically for the CIFAR datasets can be found in …/model.py.

Segmentation Models

References: composer/models/deeplabv3, composer/models/unet

This section covers models used for semantic and instance segmentation of images. Semantic segmentation involves assigning each pixel in an image to a class, while instance segmentation additionally distinguishes between individual objects of the same class.

The …/deeplabv3 directory implements models for semantic segmentation. The …/model.py file defines the overall architecture, which builds on a ResNet backbone and applies spatial pyramid pooling and a decoder head.

The …/unet directory provides an implementation of the U-Net architecture for biomedical image segmentation tasks. Key components include:

  • The class in …/model.py defines the encoder-decoder architecture.

  • The file …/unet.py ties together the model, losses, and metrics for training and inference on 3D volumes.

  • The file …/__init__.py provides a pretrained model for segmenting brain tumors, along with metadata.

Language Models

References: composer/models/bert, composer/models/gpt2

This section covers functionality for loading and using transformer models for language tasks in Composer. The main functionality is provided by the …/bert and …/gpt2 directories.

The …/bert directory contains code for building BERT models using HuggingFace Transformers. The …/model.py file defines functions to initialize BERT models for masked language modeling and classification tasks. These functions handle loading pretrained models, initializing the tokenizer, and setting up appropriate metrics. They return an object containing the model, tokenizer, and metrics.

The …/gpt2 directory provides an interface for loading GPT-2 models. The …/__init__.py file imports a function to load GPT-2 models and defines a dictionary storing metadata about different model sizes. The …/model.py file defines a function to initialize a GPT-2 object. This loads the model and tokenizer, enables gradient checkpointing if specified, and sets up losses.

Both the BERT and GPT-2 initialization functions:

  • Handle loading pretrained or non-pretrained models
  • Initialize the tokenizer
  • Set appropriate metrics based on the task

Task Abstractions

References: composer/models/tasks

The …/tasks directory provides reusable functionality for easily training and evaluating common deep learning tasks with PyTorch. The key components it contains are the functionality provided in …/__init__.py and the functionality defined in …/classification.py.

The functionality in …/__init__.py handles the overall training loop and adds methods for tasks like predicting outputs and logging evaluation metrics to track progress. It allows users to train a model in a standardized way without having to manually implement components like metric calculation. During training, it calculates validation metrics by calling methods defined in other files.

The functionality in …/classification.py provides an interface for turning a PyTorch model into a module suitable for classification tasks. It standardizes loss computation and metric calculation. Its methods return metric dictionaries and update tracked metrics by calling the given metric functions.

Both allow for pluggable metric calculation and logging by allowing custom metrics. This flexibility supports different tasks beyond the default setup.

Algorithms

References: composer/algorithms, composer/algorithms/cutmix, composer/algorithms/sam

The …/algorithms directory contains implementations of various algorithms that can be composed together using the Composer framework. These algorithms span domains like optimization, model modifications, and other techniques to improve model training or performance.

Many algorithms are implemented as classes that inherit from PyTorch modules and override methods. For example, the …/sam.py file contains a class that wraps a base PyTorch optimizer to apply the Sharpness-Aware Minimization (SAM) algorithm during training.

Other algorithms work by modifying the model itself. For example, the …/gated_linear_unit_layers.py file contains a class implementing Gated Linear Units, which is swapped into the model in place of existing layers.

The …/ema.py file contains a class that maintains an exponential moving average (EMA) of the model weights during training.

The …/seq_length_warmup.py file contains a class that implements the sequence length warmup algorithm.

The …/stochastic_layers.py file contains functions for implementing stochastic layers during training.

Data Augmentation

References: composer/algorithms/cutmix, composer/algorithms/randaugment, composer/algorithms/augmix

The Composer library provides several techniques for augmenting training data during neural network training.

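CutMix is defined in the …/cutmix directory. The core functionality in …/cutmix.py operates on image batches: it replaces a rectangular region of each image with the corresponding region from another image in the batch and mixes the two labels in proportion to the area each image contributes.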
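The geometry behind CutMix can be sketched without any framework. The following is a minimal illustration, not Composer's implementation: the function names are hypothetical, and the box center is passed in explicitly (in practice it would be sampled at random) so that the result is reproducible.

```python
import math


def cutmix_box(width, height, lam, cx, cy):
    """Return (x1, y1, x2, y2) for a box covering about (1 - lam) of the image.

    cx, cy is the box center; the box is clipped to the image borders.
    """
    cut_ratio = math.sqrt(1.0 - lam)
    cut_w = int(width * cut_ratio)
    cut_h = int(height * cut_ratio)
    x1 = max(cx - cut_w // 2, 0)
    y1 = max(cy - cut_h // 2, 0)
    x2 = min(cx + cut_w // 2, width)
    y2 = min(cy + cut_h // 2, height)
    return x1, y1, x2, y2


def adjusted_lambda(width, height, box):
    """Recompute lam from the clipped box so labels match the pasted area."""
    x1, y1, x2, y2 = box
    return 1.0 - ((x2 - x1) * (y2 - y1)) / (width * height)


def mix_labels(label_a, label_b, lam):
    """Blend two one-hot (or soft) label vectors with weight lam on label_a."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


box = cutmix_box(32, 32, lam=0.75, cx=16, cy=16)
lam = adjusted_lambda(32, 32, box)
mixed = mix_labels([1.0, 0.0], [0.0, 1.0], lam)
```

Recomputing the mixing weight from the clipped box matters near image borders, where the pasted region is smaller than requested.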

RandAugment is defined in the …/randaugment directory. It applies a randomly selected sequence of image transformations to each sample, with its entry points exposed in …/__init__.py.

AugMix is defined in the …/augmix directory. The code in …/augmix.py generates several augmented versions of each image and mixes them together.

Regularization

References: composer/algorithms/label_smoothing, composer/algorithms/gyro_dropout

Regularization techniques help prevent overfitting by adding noise or constraints during training. The Composer library includes implementations of dropout and label smoothing for regularization.

The …/label_smoothing directory contains code for label smoothing. It smooths one-hot targets by combining them with a uniform distribution. When added as a Composer algorithm, it applies this smoothing before loss and restores the original targets afterwards. This encourages the model to be less confident in its predictions.
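The smoothing step itself is a one-line formula. The sketch below is framework-free and only illustrative: the function name and the default smoothing value are assumptions, not Composer's API.

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Blend a one-hot target with a uniform distribution over the classes."""
    num_classes = len(one_hot)
    uniform = 1.0 / num_classes
    return [(1.0 - smoothing) * t + smoothing * uniform for t in one_hot]


original = [0.0, 0.0, 1.0, 0.0]
smoothed = smooth_labels(original, smoothing=0.1)
# The original list is left untouched, so it can be restored after the loss.
```

With four classes and a smoothing value of 0.1, the true class keeps a weight of 0.925 and every other class receives 0.025, so the targets still sum to 1.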

The …/gyro_dropout directory provides a variant of dropout called Gyro Dropout. It masks the input using a set of preselected masks. A function replaces the standard Dropout layers in a model with this variant during training. Training a fixed set of subnetworks in this focused way is intended to increase diversity compared to standard dropout.

Optimization

References: composer/algorithms/sam, composer/algorithms/swa

The …/algorithms directory contains implementations of techniques to improve model training. Two important algorithms implemented here are Sharpness-Aware Minimization (SAM) and Stochastic Weight Averaging (SWA).

SAM is implemented in …/sam. The core component is defined in …/sam.py. Hyperparameters such as the size of the perturbation neighborhood and the interval at which SAM is applied control its behavior.

Stochastic Weight Averaging is implemented in …/swa. The key class is defined in …/swa.py and can be added to a trainer. It keeps a running average of the model weights: during training, the average is periodically updated with the current weights, and at the end of training the model's weights are replaced with the averaged weights. The update interval trades off quality for throughput. Both SAM and SWA aim to improve generalization rather than only minimizing the training loss.
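The weight-averaging idea behind SWA can be sketched in plain Python. This is an illustration under simplifying assumptions (weights are a flat list of floats, and the class and variable names are hypothetical), not the code in …/swa.py.

```python
class RunningWeightAverage:
    """Keeps an incremental mean of every weight vector it is shown."""

    def __init__(self):
        self.average = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.average is None:
            self.average = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.average = [
                a + (w - a) / self.count for a, w in zip(self.average, weights)
            ]


# Simulated weights after each training step (stand-in for a real model).
trajectory = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0], [9.0, 9.0], [5.0, 6.0]]
update_interval = 2  # only every second step contributes to the average

swa = RunningWeightAverage()
for step, weights in enumerate(trajectory):
    if step % update_interval == 0:
        swa.update(weights)

final_weights = swa.average  # would replace the model weights at the end
```

A larger update interval means fewer averaging passes over the weights, which is the throughput side of the trade-off mentioned above.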

Model Modifications

References: composer/algorithms/squeeze_excite, composer/algorithms/gated_linear_units

The …/squeeze_excite and …/gated_linear_units directories contain implementations of model modifications that can be applied during training.

The …/squeeze_excite directory implements blocks for adding channel-wise attention to convolutional neural networks.

The …/gated_linear_units directory replaces the linear layers in feed-forward networks with gated projections. The class in …/gated_linear_unit_layers.py implements the gated projection, including its forward pass. A companion algorithm class applies the modification during training using module surgery.

Both directories leverage Composer's module surgery functionality to modify models during training. This allows easily adding blocks and replacing layers to improve model architectures.
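To make the gated projection concrete, here is a small framework-free sketch of a gated feed-forward block. The weight names, the choice of GELU as the default activation, and the absence of bias terms are assumptions for illustration; the real layer operates on PyTorch tensors.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gated_feed_forward(x, w_gate, w_up, w_down, act=gelu):
    """Compute w_down @ (act(w_gate @ x) * (w_up @ x))."""
    gate = [act(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Identity activation makes the arithmetic easy to follow by hand.
out = gated_feed_forward(
    x=[1.0, 2.0],
    w_gate=[[1.0, 0.0], [0.0, 1.0]],
    w_up=[[2.0, 0.0], [0.0, 2.0]],
    w_down=[[1.0, 1.0]],
    act=lambda v: v,
)
```

Compared with a plain feed-forward block, the extra projection lets one branch modulate the other element by element before the final down-projection.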

Efficiency Improvements

References: composer/algorithms/channels_last, composer/algorithms/fused_layernorm

The …/channels_last directory contains utilities for applying the "Channels Last" memory format to models. This improves performance for convolutional networks used in computer vision tasks by storing tensors in NHWC format rather than the default NCHW. The key algorithm is implemented in …/channels_last.py and …/__init__.py.
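The difference between the two layouts is only in how a logical index (n, c, h, w) maps to a position in memory. The sketch below computes both offsets in plain Python; the function names are hypothetical. In PyTorch itself, the conversion is requested with `model.to(memory_format=torch.channels_last)`.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat memory offset of element (n, c, h, w) in NCHW layout."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat memory offset of the same element in NHWC (channels last) layout."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 2, 2
# Offsets of the three channel values belonging to pixel (h=0, w=0):
nchw = [nchw_offset(0, c, 0, 0, C, H, W) for c in range(C)]
nhwc = [nhwc_offset(0, c, 0, 0, C, H, W) for c in range(C)]
```

In NHWC the channel values of one pixel sit next to each other in memory, while in NCHW they are separated by a full H×W plane.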

The …/fused_layernorm directory replaces instances of layer normalization with a faster implementation from the Apex library.

The function in …/fused_layernorm.py contains the core logic. It defines a replacement policy that maps layer normalization modules to a function that builds their fused replacements, then recursively searches the model and swaps each matching instance.

Utilities

References: composer/algorithms/utils

The utilities under …/utils provide common tools for working with data across frameworks and representations. The …/__init__.py file imports and re-exports an object containing predefined augmentation scheme objects. Each scheme object has an attribute containing a list of augmentation functions defined in …/augmentation_primitives.py, which defines various image augmentation functions that operate on PIL Images. Another important utility is …/augmentation_common.py, which contains functions for converting between representations and applying functions to batched data.

The object interface makes it easy for algorithms to apply common augmentation practices. New schemes can be added by defining new objects. The augmentation functions cover various transformations and are categorized. Intensity levels are normalized by helpers to suitable ranges. These utilities provide common tools across frameworks and representations that encourage data augmentation best practices.

Datasets

References: composer/datasets, scripts/ffcv

The …/datasets directory provides comprehensive functionality for loading, preprocessing, and transforming many common datasets used in machine learning research and applications. It contains code to handle the full data loading pipeline from source data through preprocessing, transforms, and into PyTorch data loaders.

Key aspects of the functionality include:

  • Support for many common image, text, medical, and other datasets through classes defined for each specific dataset.

  • Utilities for building PyTorch data loaders with different backends like standard PyTorch, optimized I/O through FFCV, or streaming data over a network. These include functions to load data, apply transforms, normalize inputs, collate samples, and distribute data across devices.

  • Preprocessing options like applying random augmentations during training to make models robust, and efficiently writing datasets to FFCV format.

  • Flexible data loading pipelines through composition of dataset classes, transforms, and data loader builders that can be customized for different use cases.

  • Modular implementation where each dataset is defined in its own file and common utilities are in shared files like …/utils.py for reusability.

The core …/__init__.py file provides the main entry point by importing all dataset functionality and re-exporting for easy access.

Some key classes and functions that power the functionality include:

  • Classes defined for each specific dataset.

  • Utilities in …/utils.py for tasks like collating samples.

Image Datasets

References: composer/datasets/cifar.py, composer/datasets/imagenet.py, composer/datasets/mnist.py

This section covers functionality for loading and preprocessing common image datasets. The code provides several options for loading popular image datasets like CIFAR-10, CIFAR-100, and ImageNet.

For CIFAR-10, …/cifar.py includes utilities to build PyTorch dataloaders using different backends. For example, it provides functions to build a dataloader from the standard PyTorch CIFAR-10 dataset, as well as from files stored in the more efficient FFCV format.

The code handles common preprocessing steps like normalization.

For ImageNet, the …/imagenet.py file builds a standard PyTorch dataloader for ImageNet stored locally. It handles data transformations and normalization. The file also contains utilities for converting ImageNet to the efficient FFCV format.

For MNIST, …/mnist.py provides functions to build data loading pipelines from both the standard dataset and a class that generates synthetic data. The synthetic data class takes parameters that allow creating infinite datasets.

Segmentation Datasets

References: composer/datasets/ade20k.py

The code in the …/ade20k.py file handles loading and preprocessing of the ADE20K semantic segmentation dataset. It defines the functionality for working with ADE20K images and annotations.

Each sample consists of an image and its target mask. The dataset's sample-access method returns the image and target after applying any defined transformations, which allows building flexible data loading pipelines.

The key aspects are:

  • It loads the images and annotations, and encapsulates each sample
  • The method applies transformations and returns image and target
  • Functions return an object containing the transformed dataset, sampler, and batching parameters
  • Functions handle the dataset splits correctly

The dataset class is the primary interface for loading ADE20K data. It loads each sample and applies any defined transformations. The dataloader-building functions construct the full data loading pipeline by wrapping the dataset instance with sampling and batching. This provides a clean interface for constructing semantic segmentation dataloaders.

Text Datasets

References: composer/datasets/lm_dataset.py, composer/datasets/c4.py

The …/lm_dataset.py file handles loading language modeling datasets from their native format. It supports optionally subsampling the data and builds a PyTorch DataLoader with options like shuffling and collating for tasks like masked language modeling.

The …/c4.py file defines functionality for working with the large-scale C4 text dataset. The class iterates over samples, supporting shuffling and preprocessing with a given tokenizer to group text into fixed-length sequences. A preprocessing function collates samples into batches, masking tokens for masked language modeling and padding/truncating sequences to a maximum length. Together this provides an interface for easily constructing a distributed data loading pipeline to iteratively process the C4 dataset.
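Grouping tokenized text into fixed-length sequences and padding or truncating to a maximum length can be sketched as follows. This is a simplified stand-in with hypothetical function names: tokens are plain integers, and the trailing partial group is dropped, which is one common choice rather than a statement about what …/c4.py does.

```python
def group_tokens(tokenized_docs, seq_len):
    """Concatenate token lists and yield consecutive chunks of seq_len tokens."""
    buffer = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover tokens shorter than seq_len are dropped in this sketch.


def pad_or_truncate(tokens, max_len, pad_id=0):
    """Force a token list to exactly max_len entries."""
    clipped = list(tokens[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
groups = list(group_tokens(docs, seq_len=4))
padded = pad_or_truncate([1, 2], max_len=4)
truncated = pad_or_truncate([1, 2, 3, 4, 5], max_len=4)
```

Because grouping is written as a generator, it works on a stream of documents without holding the whole dataset in memory.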

Medical Imaging Datasets

References: composer/datasets/brats.py

The …/brats.py file contains functionality for loading and preprocessing the BraTS brain tumor segmentation dataset. It defines datasets for the training and validation splits, with the training dataset applying random augmentations. These transformations help train models to be robust by generating varied examples.

There is one dataset for the training split and one for the validation split. The training dataset applies random augmentations such as random crop, random horizontal flip, and random blur, while the validation dataset applies none. Another component distributes batches across devices for distributed training.

The transformations are implemented as reusable classes to modularize the code. Each transformation class has a method to apply the transformation to an input sample. The transformations work together to generate a diverse set of augmented examples during training.

Utilities

References: composer/datasets/utils.py, scripts/ffcv

The …/utils.py file contains utility functions for working with datasets in Composer. It includes a class that can be used to normalize data after it has been loaded onto a device. It supports normalizing to a mean/std and optionally ignoring the background class.

A function that collates samples containing PIL Images into batches of tensors is also defined. This can be used as a collate_fn for DataLoaders.

Additionally, the file contains a function that adds a transform to the transforms of a VisionDataset, inserting it before or after ToTensor based on a flag.

Training Utilities

References: composer/trainer, tests/trainer

The …/trainer directory contains the core functionality for model training in Composer. Training is orchestrated by initializing an object defined in …/__init__.py. This object coordinates the full training loop, executing algorithms on each batch.

The trainer is configured with properties of the training data such as batch size, along with a model, an optimizer, and other settings. During training, it runs a forward pass on each batch to get predictions, computes the loss, and updates the model parameters. At each stage of this loop, the registered algorithms are invoked in sequence so they can modify the batch, the loss, or the model.
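The interplay between the loop and the algorithms can be sketched in a few lines. Everything here is a simplified stand-in: the state is a plain dictionary, the event names are illustrative strings, and `HalveLoss` is a hypothetical algorithm invented for this example.

```python
class HalveLoss:
    """Hypothetical algorithm that rescales the loss once it is computed."""

    def match(self, event, state):
        return event == "after_loss"

    def apply(self, event, state):
        state["loss"] *= 0.5


def run_event(event, state, algorithms):
    """Record the event, then run every algorithm that matches it, in order."""
    state["trace"].append(event)
    for algorithm in algorithms:
        if algorithm.match(event, state):
            algorithm.apply(event, state)


def train_one_batch(state, batch, algorithms):
    inputs, target = batch
    run_event("before_forward", state, algorithms)
    state["outputs"] = state["model"](inputs)
    run_event("after_forward", state, algorithms)
    state["loss"] = state["loss_fn"](state["outputs"], target)
    run_event("after_loss", state, algorithms)
    return state["loss"]


state = {
    "model": lambda x: 2.0 * x,                      # toy model
    "loss_fn": lambda out, tgt: (out - tgt) ** 2,    # squared error
    "trace": [],
}
loss = train_one_batch(state, (3.0, 5.0), [HalveLoss()])
```

Separating "does this algorithm care about this event" from "what does it do" keeps the loop itself unaware of any particular technique.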

Distributed training functionality is provided by strategies defined in …/dist_strategy.py.

Framework integrations are handled by files like …/_deepspeed.py. This file parses DeepSpeed configs, validates parameters, and sets batch and precision settings.

Testing of training functionality is done via test suites in …/trainer. Tests like …/test_trainer.py validate the core training loop, while …/test_ddp.py focuses on distributed training. Other files test checkpointing, evaluation, and functionality specific to algorithms.

Distributed Training

References: composer/trainer/dist_strategy.py, tests/trainer/test_ddp.py

Distributed training functionality is handled in the …/dist_strategy.py module. This module contains functionality for handling gradient synchronization strategies during training like synchronous, asynchronous, and hybrid synchronization. For distributed data parallel (DDP) training, models are wrapped.

The main functionality is in a method that recursively wraps model submodules and sets configuration options such as the sharding strategy and mixed precision. It also rebuilds the optimizer param groups after sharding the model, and handles activation checkpointing if enabled. Process groups are cached for reuse across modules.

Testing of distributed functionality is covered in the Testing section. The …/test_ddp.py file contains tests for DDP training. It trains a model on synthetic data across devices using a custom dataset that increments an atomic counter each time a sample is accessed, which validates that the distributed sampler partitions the data properly. A separate helper saves batches to files, and the test asserts that these batches differ across processes. Tests are run under configurations such as CPU, GPU, and FSDP to validate distributed functionality.
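The property such a test checks, that each sample is visited exactly once across all ranks, can be illustrated with a strided partition. This is a sketch with a hypothetical function name; PyTorch's `DistributedSampler` additionally pads or drops samples so that every rank receives the same number.

```python
from collections import Counter


def shard_indices(num_samples, world_size, rank):
    """Indices assigned to one rank under a simple strided partition."""
    return list(range(rank, num_samples, world_size))


world_size = 4
shards = [shard_indices(10, world_size, rank) for rank in range(world_size)]

# Stand-in for the access counter: count how often each index is touched.
access_counts = Counter()
for shard in shards:
    for index in shard:
        access_counts[index] += 1
```

If the partition is correct, every index appears in `access_counts` with a count of exactly one, and no two ranks share a sample.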

Mixed Precision

References: composer/trainer/_scaler.py

The …/_scaler.py file provides utilities for mixed precision training and gradient scaling. Gradient scaling is a technique used to prevent underflow that can occur when training with lower-precision formats like float16. It works by multiplying the loss by a scale factor before the backward pass, so that the resulting gradients are scaled up, and then dividing the gradients by the same factor before the optimizer step to restore their true magnitudes.

Gradient scaling is handled during training. The scale factor is adjusted dynamically: in the standard PyTorch scheme, it is reduced when a step produces infinite or NaN gradients, in which case that optimizer step is skipped, and increased after a run of steps without overflow. For distributed training, the scaling factor is coordinated across devices so all gradients use the same scale. This allows mixed precision training while supporting common closure-based optimizers and algorithms.
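A minimal sketch of dynamic loss scaling, in the style of PyTorch's `GradScaler`, is shown below. It is not the code in …/_scaler.py: the class name is hypothetical, gradients are a plain list of floats, and the default values mirror the PyTorch defaults.

```python
import math


class DynamicLossScaler:
    """Backs off the scale on overflow and grows it after clean steps."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def scale_loss(self, loss):
        """Scale the loss before backward so small gradients do not underflow."""
        return loss * self.scale

    def unscale(self, grads):
        """Divide gradients by the scale to recover their true magnitudes."""
        return [g / self.scale for g in grads]

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should run."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
true_grads = scaler.unscale([16.0])            # gradients computed on a scaled loss
history = [scaler.update([1.0]), scaler.update([2.0])]
scale_after_growth = scaler.scale
stepped = scaler.update([float("inf")])        # overflow: skip step, back off
```

In a distributed setting, the overflow flag would need to be combined across ranks before calling `update`, so that every process skips or takes the step together.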

Framework Integrations

References: composer/trainer/_deepspeed.py, composer/trainer/meta_safe_apply.py

The …/_deepspeed.py file handles integrating Composer's training functionality with the DeepSpeed framework. It parses a DeepSpeed configuration, validates it against the trainer state, and fills in default values. This allows DeepSpeed to be configured consistently with the trainer.

Key functionality includes validating the batch size and related configurations, checking that the config does not contain an optimizer or scheduler since these will be managed by Composer, and setting the appropriate precision based on the trainer's configuration. Tensors in batches are converted to the correct precision format like FP16 if mixed precision is enabled.
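The kind of validation described above can be sketched on a plain dictionary. The function name and its arguments are hypothetical; the config keys used (`optimizer`, `scheduler`, `train_batch_size`, `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`) are standard DeepSpeed config keys, and the batch-size relationship checked is the one DeepSpeed itself requires.

```python
def validate_deepspeed_config(config, global_batch_size, grad_accum, world_size):
    """Reject conflicting keys and fill in batch settings from trainer values."""
    for key in ("optimizer", "scheduler"):
        if key in config:
            raise ValueError(f"'{key}' must not be set; the trainer manages it")

    per_step = grad_accum * world_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must divide evenly across ranks")

    expected = {
        "train_batch_size": global_batch_size,
        "train_micro_batch_size_per_gpu": global_batch_size // per_step,
        "gradient_accumulation_steps": grad_accum,
    }
    filled = dict(config)
    for key, value in expected.items():
        # Keep user-supplied values only if they agree with the trainer.
        if filled.setdefault(key, value) != value:
            raise ValueError(f"'{key}' conflicts with the trainer configuration")
    return filled


filled = validate_deepspeed_config({"fp16": {"enabled": True}}, 64, 2, 4)
```

Returning a new dictionary rather than mutating the input keeps the user's original configuration available for error messages or logging.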

The …/meta_safe_apply.py file contains the functionality for safely initializing meta tensors in PyTorch during application of a given function to a module. It recursively applies the function to both modules and parameters while ignoring certain modules specified in a set.

For each parameter, it checks if it should be ignored or is None before applying the function. It handles applying the function to gradients and replacing parameters and buffers inplace or with new tensors as needed based on their types. Module names are concatenated for identification during recursion.

This functionality allows initializing meta tensors across modules in a safe way by properly handling parameters, gradients, replacement, and name identification during the recursive application of a function to a module. It plays a key role in integrating Composer's training loops with frameworks like PyTorch that utilize meta tensors.

Testing

References: tests/trainer, tests/trainer/test_trainer.py, tests/trainer/test_scale_schedule.py

The …/trainer directory contains comprehensive unit tests for validating the core training functionality in Composer. Tests are organized into focused test files that target specific functionality or configurations. This allows rigorous validation of equivalent behavior across different scenarios.

Distributed training tests in files like …/test_ddp.py ensure the distributed training logic works properly when using data parallelism with the PyTorch DDP API.

Files like …/test_trainer.py validate initialization and usage of core functionality. Equivalence across configurations is checked by training models and asserting results match.

Files like …/test_fsdp.py contain tests for initializing models with the Fully Sharded Data Parallel algorithm. Properties are checked after training to validate the initialization.

Checkpointing tests in …/test_checkpoint.py verify weights, metrics, and callbacks are correctly restored from checkpoints under different configurations such as distributed settings.

Comprehensive parameterization and mocking techniques are used to test equivalent behavior across a wide range of configurations. This rigorous approach validates the core training components function as intended.

Testing

References: tests, tests/datasets, tests/models

The tests directory contains comprehensive unit and integration tests for the Composer library. It validates the core functionality is implemented correctly and works as expected across different configurations.

Some key subdirectories include:

  • …/algorithms: Contains tests for algorithms implemented in Composer. It tests algorithm training loops, resuming from checkpoints, and unit tests for individual algorithms.

  • …/callbacks: Tests callback functionality like loggers and profilers. It simulates training runs and ensures callbacks behave correctly by constructing trainers with callbacks.

  • …/common: Provides common testing utilities like datasets that generate fake data, and models for testing.

  • …/datasets: Validates dataset loading, preprocessing, transforms, and dataloading functionality. It generates synthetic data and loads real datasets to test preprocessing and transforms.

  • …/fixtures: Defines fixtures and utilities for configuring the test environment and common test objects. Fixtures isolate tests and avoid side effects.

Unit Tests

References: tests/algorithms, tests/callbacks, tests/common

Integration Tests

References: tests/datasets, tests/models, tests/trainer

The integration tests in Composer aim to validate end-to-end functionality by combining different components in a realistic way. These tests exercise more complex scenarios than unit tests by integrating multiple pieces like models, datasets, and training loops.

Some key integration tests combine functionality from the Test Data, Models, and Trainer sections. The …/test_mmdet_model.py file defines fixtures to generate sample detection data and loads configs from external repositories to test with real configurations. It builds detection models, runs forward passes to check output properties and ensures the wrapper properly forwards data and calls to the underlying models during training and inference.

Tests in …/datasets integrate functionality by building data loaders from functions in the datasets module, loading batches, and asserting properties about the returned data match expectations. For example, …/test_mnist.py loads the MNIST dataset, iterates over the dataloader, and validates sample counts and shapes are correct. …/test_streaming_datasets_train.py initializes distributed training when needed, loads the appropriate streaming dataset class based on the dataset name, creates a dataloader from the streaming dataset, loads a model with metrics, and trains for batches to test end-to-end functionality on streaming data.

The …/trainer directory contains tests that combine models, datasets, and validate full training loops. For example, …/test_ddp.py trains a model for epochs across devices to test distributed functionality. …/test_trainer.py initializes the trainer with different configurations, runs training on sample data, and checks model/optimizer state to validate the core training logic.

Test Utilities

References: tests/fixtures, tests/utils

The …/__init__.py file contains common testing utilities that are used across different test files in the composer package.

It defines a class for writing test cases. Testing functionality involving temporary files or directories requires isolating each test run.

When tests require random data, a function generates a random string.

In summary, this file centralizes utilities for writing robust isolated tests, including base classes, temporary file handling, and data generation. These utilities form strong foundations for implementing tests interacting with files or networks.

Profiling and Analysis

References: composer/profiler, docs/source/trainer

The …/profiler directory contains tools for profiling model performance during training. It includes classes and functions for collecting different types of profiling data and integrating profiling into the training process.

The core profiler class coordinates the overall profiling process. It schedules profiling through a callback function, and a companion module implements a cyclic profiling schedule.

Two callbacks collect host system metrics and PyTorch trace data, respectively. The system profiling callback runs a background thread that periodically collects CPU, memory, disk, and network usage metrics using psutil. These metrics are recorded in the profiler using markers and counters.

The PyTorch profiling callback initializes the PyTorch profiler with a scheduler and a handler function. The scheduler controls when traces are recorded based on the profiling cycle. The handler saves trace files in the /torch_profiler folder with filenames based on the batch and epoch. It can also upload traces to remote storage.

Profiling data is passed to trace handlers. The base handler interface allows concrete classes to handle specific event types.

The utilities module contains functions for visualizing memory profiles as HTML. It generates plots from memory data and saves figures.

In summary, these components provide tools for collecting system, model, and training loop metrics during training runs. The traces can be analyzed using the tutorials in Performance Analysis to help optimize model performance.

Performance Analysis

References: composer/profiler, docs/source/trainer/performance_tutorials

The …/profiler module contains tools for analyzing model performance and identifying bottlenecks during training. It provides functionality to collect profiling data using the PyTorch profiler as well as custom event timing for trainer loops, data loading, and system metrics.

The …/system_profiler.py file contains functionality to periodically collect host metrics like CPU and memory usage using psutil. Metrics are recorded using markers and counters in the profiler.
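The periodic collection described here can be sketched as a daemon thread that polls a metrics function at a fixed interval. The class below is a hypothetical illustration, not the code in …/system_profiler.py. To stay self-contained it reads two standard-library values instead of calling psutil, and it stores samples in a list rather than recording them as profiler counters.

```python
import threading
import time
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


class BackgroundSampler:
    """Hypothetical sketch: poll a metrics function on a daemon thread."""

    def __init__(self, read_metrics: Callable[[], Metrics], interval_s: float = 0.05) -> None:
        self._read_metrics = read_metrics
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[Tuple[float, Metrics]] = []

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop() is called, which ends the loop.
        while not self._stop.wait(self._interval_s):
            self.samples.append((time.time(), self._read_metrics()))

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def read_process_metrics() -> Metrics:
    """Stand-in for psutil calls, using only the standard library."""
    return {
        "process_cpu_seconds": time.process_time(),
        "active_threads": float(threading.active_count()),
    }
```

Running the sampler on a daemon thread keeps the polling off the training loop's critical path, and the stop event lets it shut down promptly when training ends.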

Trace handlers like the one defined in …/json_trace_handler.py handle different event types. The JSON trace handler saves traces in Chrome JSON format and starts and stops tracing according to the schedule. Traces from distributed training are merged into a single timeline for analysis using …/json_trace_merger.py.
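For readers unfamiliar with the Chrome JSON trace format, the sketch below builds "complete" events (`"ph": "X"`, with timestamps and durations in microseconds) and merges per-rank event lists into one `traceEvents` timeline. The function names are hypothetical, and the merge is deliberately simplified: it assumes timestamps from different ranks are already comparable, whereas a real merger would also need to account for clock differences between machines.

```python
import json
from typing import Dict, List

Event = Dict[str, object]


def duration_event(name: str, start_us: int, dur_us: int, pid: int = 0, tid: int = 0) -> Event:
    """Build a 'complete' event ("ph": "X") in the Chrome Trace Event format."""
    return {"name": name, "ph": "X", "ts": start_us, "dur": dur_us, "pid": pid, "tid": tid}


def merge_traces(traces_by_rank: Dict[int, List[Event]]) -> str:
    """Merge per-rank event lists into a single timeline.

    The rank is written into "pid" so that each rank appears as its own
    process row in the trace viewer.
    """
    merged: List[Event] = []
    for rank, events in sorted(traces_by_rank.items()):
        for event in events:
            merged.append({**event, "pid": rank})
    merged.sort(key=lambda e: e["ts"])
    return json.dumps({"traceEvents": merged}, indent=2)
```

The returned string can be written to a `.json` file and opened in a Chrome-compatible trace viewer.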

Tutorials in …/performance_tutorials demonstrate using the profiler on MNIST. The …/analyzing_traces.md file shows an example of a trace identifying a dataloader bottleneck. Traces can be explored interactively in the Chrome Trace Viewer or TensorBoard.

This functionality allows identifying performance bottlenecks by visualizing event durations throughout training with tools like the Chrome Trace Viewer. The tutorials teach using the profiler and analyzing resulting traces to optimize models.

Profiling

References: composer/profiler, docs/source/trainer

The …/profiler directory contains functionality for profiling system metrics and model events during training. It implements a class which coordinates various profiling components.

The class handles scheduling profiling using a callback function returned by the …/profiler_schedule.py module. This implements a cyclic profiling schedule that skips batches initially, then cycles through waiting, warming up, and actively profiling batches each epoch.
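The cyclic schedule can be pictured as a function that maps a batch index within the epoch to a profiling action. The sketch below is a hypothetical reimplementation of that idea, not the code in …/profiler_schedule.py. The names `Action` and `make_cyclic_schedule`, and the choice to pass a plain batch index, are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    SKIP = "skip"
    WARMUP = "warmup"
    ACTIVE = "active"


def make_cyclic_schedule(skip_first: int, wait: int, warmup: int, active: int) -> Callable[[int], Action]:
    """Return a schedule: batch index within the epoch -> profiling action."""
    cycle_len = wait + warmup + active
    if cycle_len <= 0:
        raise ValueError("wait + warmup + active must be positive")

    def schedule(batch_in_epoch: int) -> Action:
        # Batches before skip_first are never profiled.
        if batch_in_epoch < skip_first:
            return Action.SKIP
        # After that, repeat the wait -> warmup -> active cycle.
        position = (batch_in_epoch - skip_first) % cycle_len
        if position < wait:
            return Action.SKIP
        if position < wait + warmup:
            return Action.WARMUP
        return Action.ACTIVE

    return schedule


schedule = make_cyclic_schedule(skip_first=1, wait=1, warmup=1, active=2)
print([schedule(b).value for b in range(6)])
# prints ['skip', 'skip', 'warmup', 'active', 'active', 'skip']
```

Because the index is the batch position within the epoch, the same pattern restarts at the beginning of every epoch.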

A system profiling callback runs a background thread to periodically collect host metrics like CPU, memory, disk, and network usage using psutil. These metrics are recorded in the profiler using markers and counters.

Model events are profiled using a PyTorch profiling callback. It initializes the PyTorch profiler with a scheduler and handler to control when traces are recorded. Traces are saved to files and optionally uploaded.

Events are recorded centrally through markers, which capture timing and metadata and send them to trace handlers. The JSON trace handler saves traces in Chrome JSON format for visualization.

Utilities include exporting memory profiles to HTML. The profiler class coordinates these components and exposes an API to integrate profiling into user code.

Environment Configuration

References: docker, docs/source/tutorials

The docker directory contains functionality for reproducing environments through Docker configurations. This allows users to run Composer models and code within standardized container environments.

The …/README.md file documents pre-built Docker images available on DockerHub. These images simplify setting up environments by containing pre-installed versions of Composer, PyTorch, and their dependencies. Specific tags are provided to pull image configurations for different versions, operating systems, and Python environments.

The …/generate_build_matrix.py file handles generating build matrices that are documented in …/README.md. It defines all combinations of input variables like Python and PyTorch versions. Helper functions then map each combination to generate values for fields in the build matrix. The main logic writes the results to files and Markdown tables for documentation. This encapsulates the mapping of inputs to outputs used to populate the build matrices.

The …/pillow_stub directory provides a minimal Python package configuration for testing purposes and contains no functionality beyond that.

Docker Configurations

References: docker, docker/README.md, docker/generate_build_matrix.py

The docker directory contains Docker configurations for building and distributing Composer environments. This includes Dockerfiles that define the build environment and dependencies. Scripts automate building Docker images for different configuration combinations.

Pre-built Docker images for Composer are available on DockerHub to simplify setting up environments. The …/README.md file documents these images, including the available versions, dependencies, and specific tags that can be used to pull each image. Images contain Composer pre-installed along with common deep learning dependencies.

The …/generate_build_matrix.py script handles generating Docker image build matrices. It defines all possible input combinations, like Python and PyTorch versions. Helper functions then generate values for each field in the build matrix given an input combination. For example, helper functions determine the Docker tag or base image to use. These generated values are written to files to populate the build matrices.

By encapsulating the mapping of inputs to outputs, this script can automatically generate comprehensive build matrices that cover many configuration variants. This simplifies maintenance by defining the matrix generation logic in one place. The build matrices help users find the appropriate pre-built Docker image for their specific Composer or PyTorch version, OS, and other requirements.
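The mapping from input combinations to build matrix rows can be sketched with `itertools.product`. Everything specific in this example is an assumption for illustration: the version lists, the field names, and the tag format are placeholders, not the values used by …/generate_build_matrix.py.

```python
import itertools
from typing import Dict, List

# Hypothetical inputs; the real script defines its own versions.
PYTHON_VERSIONS = ["3.10", "3.11"]
PYTORCH_VERSIONS = ["2.0.1", "2.1.0"]


def image_tag(python_version: str, pytorch_version: str) -> str:
    """Derive a tag from one input combination (placeholder format)."""
    return f"example/pytorch:{pytorch_version}_py{python_version}"


def build_matrix() -> List[Dict[str, str]]:
    """Expand every input combination into one build matrix row."""
    return [
        {"PYTHON_VERSION": py, "PYTORCH_VERSION": pt, "TAG": image_tag(py, pt)}
        for py, pt in itertools.product(PYTHON_VERSIONS, PYTORCH_VERSIONS)
    ]


def to_markdown(rows: List[Dict[str, str]]) -> str:
    """Render the rows as a Markdown table for documentation."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(row[h] for h in headers) + " |" for row in rows]
    return "\n".join(lines)
```

Keeping the per-field logic in small helpers such as `image_tag` means a new Python or PyTorch version only needs to be added to the input lists.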

Cloud Utilities

References: docs/source/tutorials/train_resnet50_on_aws.md

This section covers tutorials for training Composer models on AWS. The tutorial in …/train_resnet50_on_aws.md demonstrates how to train a ResNet-50 model on ImageNet using AWS resources. It uses one of MosaicML's pre-built ResNet-50 recipes located in the /recipes folder. This recipe file contains the model architecture and training hyperparameters. Composer is able to load this recipe file and configure the training job accordingly.

The ImageNet dataset is mounted on an EBS or instance store volume at /datasets/ImageNet. The dataset is preprocessed unless the preprocessed files already exist. The training job runs inside a Docker container for reproducibility. This container contains pre-installed Composer, PyTorch, dependencies, and the recipe code. Shared memory is increased to avoid out-of-memory errors when working with large datasets like ImageNet inside the container.

Documentation

References: docs/source, docs/source/method_cards, docs/source/_templates

The …/tables directory contains Python files and Markdown files for generating documentation tables about algorithms, models, and update methods in the Composer library.

The …/conf.py file contains Sphinx configuration.
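For orientation, a Sphinx `conf.py` is an ordinary Python module whose top-level variables configure the build. The fragment below is a generic sketch using standard Sphinx option names. The values are assumptions and do not reproduce the settings in Composer's own configuration file.

```python
# Generic Sphinx configuration sketch; every value here is illustrative.

project = "example-project"  # assumed project name

# Standard Sphinx extensions for API docs and Google/NumPy-style docstrings.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]

# Directories, relative to this file, holding page templates and static
# assets such as CSS, JavaScript, and images.
templates_path = ["_templates"]
html_static_path = ["_static"]

# Paths to ignore when looking for source files.
exclude_patterns = ["_build"]
```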

Source Files

References: docs/source, docs/source/method_cards, docs/source/model_cards

The …/model_cards directory contains documentation on several popular machine learning models in the form of Markdown files. Each file provides details on the architecture, training procedures, and hyperparameters of a different model implemented in Composer.

Some of the important model card files include:

  • …/cifar_resnet.md: Documents ResNet models for image classification on CIFAR datasets. It describes the overall ResNet architecture of stacked convolutional blocks with shortcut connections.

  • …/efficientnet.md: Covers the EfficientNet model family, explaining the empirical scaling law used in their design. It also details the architectures of specific models like EfficientNet-B0.

  • …/deeplabv3.md: Provides information on the DeepLabV3 model for semantic segmentation, discussing its use of atrous convolutions and decoder modules.

  • …/unet.md: Describes a UNet architecture for medical image segmentation, explaining its contracting and expansive paths.

Each of these model card files discusses the key aspects of the model architecture at a high level without code details. They also explain important training procedures and hyperparameters used in the Composer implementation of each model. In addition, files like …/efficientnet.md include tables summarizing properties of different models in the family.

Templates

References: docs/source/_templates, docs/source/_templates/sidebar

The …/_templates directory contains templates that define documentation pages.

Templates provide consistency through common interfaces, while allowing customization.

Static Assets

References: docs/source/_static

The …/_static directory contains static assets like CSS, JavaScript, and image files that are used to style and enrich the Composer documentation site.

The …/css directory holds CSS files that define the visual styling of documentation pages. The main stylesheet is …/custom.css, which contains rules for customizing elements like related pages, card components, and the sidebar container.

The …/js directory contains client-side JavaScript code that enhances the user experience on documentation pages. Syntax highlighting is applied to code blocks through inline styles, with detection for over 20 languages.

The file …/posthog.js initializes the PostHog analytics library for tracking usage. It defines a global object. People data is stored in an array property. The file loads code from an API host and wraps API calls for logging.