Auto-generated from karpathy/nanoGPT by Mutable.ai Auto Wiki
nanoGPT repository provides an efficient PyTorch implementation of Generative Pre-trained Transformer (GPT) models for natural language processing. It includes tools for training, evaluating, sampling from, and benchmarking GPT models like GPT-2.
The key functionality focuses on:
- Defining the GPT model architecture in model.py, including components like self-attention, MLP layers, embeddings, and sampling logic. It allows initializing from pretrained GPT checkpoints.
- Distributed multi-GPU training of GPTs, as shown in train.py, using optimization techniques like mixed precision and gradient accumulation.
- Utilities for sampling text continuations from trained GPT models via sample.py.
- Benchmarking throughput and speed of GPT models using bench.py.
The library aims to provide building blocks for efficiently training and deploying GPTs from scratch or from existing pretrained checkpoints. The model architecture directly follows GPT papers. Key design choices focus on computational performance and ease-of-use.
This section covers the functionality for distributed training of GPT language models like GPT-2 and a miniature Shakespeare model. The
train.py file implements training of GPT models in plain PyTorch. It supports both single-GPU debugging as well as distributed data parallel training across multiple GPUs/nodes.
The main classes used for training are the
GPT model class and the
GPTConfig class for model configuration. The
GPT class defines the model architecture, while
GPTConfig handles setting hyperparameters.
Training is performed using the AdamW optimizer. Mixed precision is supported using PyTorch's
GradScaler to help train larger models. Distributed data parallel training is implemented using PyTorch's DistributedDataParallel module.
The training loop iterates over minibatches, computes the forward and backward pass, calls the optimizer step, and evaluates loss at intervals. Learning rate scheduling is implemented via the
get_lr() function, which uses a cosine schedule. Checkpoints are saved periodically using the checkpointing functionality.
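The warmup-plus-cosine schedule described above can be sketched as follows. This is a minimal stand-alone version; the hyperparameter values shown are illustrative stand-ins for the ones set in the config files.

```python
import math

# Illustrative hyperparameters; the real values come from the config files.
learning_rate = 6e-4   # peak learning rate
min_lr = 6e-5          # floor reached after decay
warmup_iters = 2000
lr_decay_iters = 600000

def get_lr(it):
    # 1) Linear warmup from 0 up to the peak learning rate.
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) Past the decay horizon, hold at the minimum learning rate.
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, cosine-decay from the peak down to the minimum.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The returned value is applied to every parameter group of the optimizer at the start of each iteration.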
…/train_gpt2.py file sets hyperparameters for distributed training of the GPT-2 124M model across 8 GPUs. It configures a total batch size of around 0.5 million tokens per step by sharding across GPUs. Training runs for 600,000 iterations with learning rate decay. Model evaluation runs every 1,000 steps.
…/train_shakespeare_char.py file contains configuration for training a character-level Shakespeare language model. It defines the model architecture using hyperparameters and sets training hyperparameters like batch size, learning rate schedule, and evaluation interval tailored for the smaller dataset.
Distributed training allows training a model across multiple GPUs or nodes to speed up the training process. The
train.py file implements distributed data parallel training for GPT models using PyTorch DistributedDataParallel.
When running with multiple GPUs, the model is replicated across each GPU. During the forward and backward passes, each GPU works on a subset of the mini-batch. Gradients are averaged across GPUs after each step. This parallelizes the computation and allows using much larger effective batch sizes.
The training loop in
train.py handles distributing the data and model. It uses PyTorch's
DistributedDataParallel module to distribute the model across GPUs. At each step, it calls
get_batch() to retrieve data splits for each GPU. The forward and backward passes are performed, then gradients are averaged and the optimizer step is run.
Periodically the loss is estimated on the full batch to track progress. Checkpoints are saved at intervals to the distributed storage, allowing resuming training from any GPU. This allows efficient multi-GPU and multi-node training of GPT models.
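The per-GPU data retrieval mentioned above can be illustrated with a NumPy sketch of a `get_batch`-style sampler: it draws random offsets into one long token array and returns inputs alongside targets shifted by one position. This is a simplified stand-in, not the actual implementation, which works on memory-mapped token files and PyTorch tensors.

```python
import numpy as np

def get_batch(data, batch_size, block_size, rng):
    """Sample a batch of (input, target) token windows from one long array.

    `data` is a 1-D array of token ids; targets are the inputs shifted by
    one, mirroring how next-token-prediction batches are assembled.
    """
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix])
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y

rng = np.random.default_rng(0)
data = np.arange(1000, dtype=np.uint16)  # stand-in for a tokenized corpus
x, y = get_batch(data, batch_size=4, block_size=8, rng=rng)
```

In the distributed setting, each rank draws its own random batches, so the GPUs naturally work on disjoint subsets of data each step.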
…/train_shakespeare_char.py handle configuring training hyperparameters and settings for different models. These files set values for hyperparameters to control the training process, without implementing any training logic.
…/train_gpt2.py file configures distributed training of the GPT-2 124M model across 8 GPUs. It sets hyperparameters like batch size, block size, and gradient accumulation steps to target a total batch size of around 0.5 million tokens per training step. The number of training iterations and evaluation interval are also configured.
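The ~0.5M-token target decomposes into per-GPU batch size, sequence length, gradient accumulation steps, and GPU count. The values below are illustrative stand-ins for the numbers set in the config file:

```python
# Illustrative values showing how a ~0.5M-token batch is assembled;
# the exact numbers live in the configuration file.
batch_size = 12        # sequences per GPU per micro-step
block_size = 1024      # tokens per sequence
grad_accum_steps = 5   # micro-steps accumulated before one optimizer step
num_gpus = 8

tokens_per_step = batch_size * block_size * grad_accum_steps * num_gpus
# 12 * 1024 * 5 * 8 = 491,520 tokens per optimizer step, ~0.5M
```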
…/train_shakespeare_char.py file handles configuration for training a miniature character-level Shakespeare language model. It defines the model architecture as a 6-layer, 6-head Transformer with 384 hidden size. Training hyperparameters like the batch size, number of iterations, and evaluation interval are set.
Both files focus only on configuration and do not implement any training logic. Key settings include the dataset name used to select the training data.
Checkpointing is handled by the
train.py script. During training, checkpoints are periodically saved to disk so the model can be recovered later if training is interrupted.
train.py script saves checkpoints with torch.save. At intervals determined by the training configuration, a checkpoint containing the model state_dict, the optimizer state, the model arguments, and the current iteration count is written to the output directory.
Because each checkpoint records the iteration count and the best validation loss so far, training can be recovered at the exact point where it left off.
To load a checkpoint, the checkpoint file is read from disk and its state_dict is loaded into the GPT model (and optimizer), resuming training from the saved iteration count.
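The checkpoint round trip can be sketched as a dictionary that bundles model state with the bookkeeping needed to resume. This sketch uses pickle as a stand-in for torch.save/torch.load so it is self-contained; the field names are illustrative of the bundling pattern, not an exact copy of the real checkpoint layout.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, optimizer_state, iter_num, best_val_loss):
    # Bundle everything needed to resume into one dict and serialize it.
    checkpoint = {
        "model": model_state,            # model state_dict
        "optimizer": optimizer_state,    # optimizer state_dict
        "iter_num": iter_num,            # where to resume counting from
        "best_val_loss": best_val_loss,  # used to decide when to overwrite
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, {"w": [1.0, 2.0]}, {"step": 3},
                iter_num=5000, best_val_loss=1.47)
ckpt = load_checkpoint(path)
```

On resume, the saved iteration count seeds the training loop counter and the saved states are loaded back into the model and optimizer.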
train.py script handles logging of training metrics and results. During training, the values of important metrics like loss, learning rate, and timing information are tracked at intervals.
The main metrics logged include:
- Loss value on training batches, plus estimated train/validation loss at evaluation intervals
- Learning rate after each update
- Timings per iteration and estimated model flops utilization (MFU)
These metrics are logged to multiple destinations including:
- Standard output - printed to the terminal
- Weights & Biases (wandb) - for visualizing metrics over time, when wandb logging is enabled
- Checkpoints - the best validation loss and iteration count are saved within checkpoint files
This allows monitoring training progress in real time via the terminal as well as visualizing changes over time in the wandb dashboard. Checkpoints also record enough state to resume training with the metric history intact.
data directory contains scripts and code for loading and preprocessing various text datasets into an efficient format suitable for training natural language processing models. It includes subdirectories for different datasets:
prepare.py scripts in each subdirectory implement the key steps for loading and preprocessing their respective datasets:
For OpenWebText, prepare.py loads the raw data using
load_dataset(), splits it into train/val subsets, tokenizes the text with BPE using
tiktoken, and saves the tokenized ids via NumPy
memmap for efficient writing.
For Shakespeare characters,
prepare.py downloads the text, calculates the character vocabulary, encodes the character strings as integers, saves the encoded train and val data to binary files, and exports the encoding metadata.
sample.py handles sampling generations from a pretrained model. It loads a model either from a local checkpoint on disk or from a pretrained GPT-2 variant via
GPT.from_pretrained(), sets the model to eval mode, and moves it to the specified device. The file defines encoding and decoding functions to preprocess input text and generate continuations. Key sampling parameters like temperature and top-k can be configured.
The most important part is calling
model.generate() inside inference context managers (torch.no_grad() and, optionally, autocast) to sample continuations from the model efficiently. This allows generating multiple samples to evaluate model quality and diversity.
The configuration files under
config initialize models from various pretrained checkpoints and set hyperparameters for evaluation like batch size and number of iterations. Files like
…/eval_gpt2.py configure evaluation of the standard GPT-2 124M model at a batch size of 8 and 500 iterations. Larger models have their own configuration files.
These configuration files do not implement any classes or complex logic, they simply provide the initialization and hyperparameters needed to run evaluation of different pretrained GPT variants. The core sampling functionality is contained in
Computing metrics like perplexity to measure a language model's performance is an important part of evaluating how well the model learns. The
train.py file includes functionality to estimate perplexity and other metrics on held-out validation data during training.
estimate_loss() function runs the model on batches of held-out validation examples, calculates the loss on each, and averages the loss across the batches. This gives an estimate of the model's performance on new data.
The loss computed is cross entropy loss between the model's predicted token probabilities and the ground truth target tokens. This directly measures how surprised the model is by the next tokens in the validation sequences, with a lower loss corresponding to a better performing model.
Cross entropy loss can be converted to perplexity, which is a more interpretable metric for language models. Perplexity is the exponential of the cross entropy loss, and represents how confused the model is on average for each token - a lower perplexity means the model is less perplexed.
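The conversion is a one-liner; a useful sanity check is that a model assigning uniform probability over a vocabulary of size V has cross entropy ln(V) and therefore perplexity exactly V:

```python
import math

def perplexity(cross_entropy_loss):
    # Perplexity is the exponential of the per-token cross entropy (in nats).
    return math.exp(cross_entropy_loss)

# Sanity check: a uniform model over a GPT-2-sized vocabulary (50,257 tokens)
# has cross entropy ln(50257), so its perplexity equals the vocabulary size.
vocab_size = 50257
uniform_loss = math.log(vocab_size)
```

A perfectly confident, always-correct model would have loss 0 and perplexity 1, the lower bound.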
The estimate_loss() function runs at the specified evaluation interval, such as every 1,000 training steps. This allows monitoring validation performance throughout training to check for overfitting. The estimated losses (and derived metrics like perplexity) are logged, which helps evaluate how well the model is learning the training distribution.
sample.py handles generating text samples from pretrained GPT models. It loads a model checkpoint from either a previous training run or a pretrained variant using
GPT.from_pretrained(). The model is set to evaluation mode and moved to the specified device.
Encoding and decoding functions are defined to preprocess input text and decode generated output. The core sampling functionality is contained in the
model.generate() method. It takes an encoded prompt as input and repeatedly samples the next token from the model's predicted distribution to generate a continuation. Temperature and top-k parameters can be adjusted to control the generation process.
Some key details:
- Hyperparameters like init_from, num_samples, temperature, and device are set via arguments.
- The encode() function handles encoding the input prompt, defaulting to GPT-2's BPE encoding.
- model.generate() produces each output token inside an inference context manager. It accepts a maximum number of new tokens and a top-k cutoff.
- The decoded output is printed for each sample to display the model's continuation of the prompt.
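The temperature and top-k controls described above can be sketched in NumPy for a single sampling step. This is an illustrative stand-in; the real logic operates on PyTorch tensors inside model.generate().

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """One step of temperature + top-k sampling over a logits vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Mask out everything below the k-th largest logit.
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Numerically stable softmax, then sample from the distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(logits, temperature=0.8, top_k=2, rng=rng)
```

Lowering the temperature sharpens the distribution toward the argmax, while top-k zeroes out the tail so low-probability tokens can never be drawn.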
The configuration files in the
config directory handle setting hyperparameters and parameters for evaluating GPT models. These files focus on initializing the model, preprocessing data, and running the model in evaluation mode to compute metrics without any training or fine-tuning.
…/eval_gpt2.py file configures evaluation of the GPT-2 small model. It sets the
batch_size to 8 and number of
eval_iters to 500. This ensures a large enough batch size for stable performance estimates, and runs evaluation over many batches to average results for an accurate overall metric. It also initializes the model weights from the pretrained GPT-2 checkpoint via the init_from setting.
…/eval_gpt2_medium.py files serve similar purposes, but configure the large and medium GPT-2 models respectively. They set the
batch_size and number of
eval_iters, and initialize each model variant from its own pretrained checkpoint.
…/eval_gpt2_xl.py handles evaluation of GPT-2 XL, the largest model variant. It sets the same evaluation hyperparameters and initializes from the pretrained GPT-2 XL weights.
All of these configuration files focus only on initializing models, preprocessing data, and setting hyperparameters for evaluation runs. They do not implement any custom classes, functions, or algorithms. The core evaluation logic is implemented in other code modules imported by the training and evaluation scripts.
model.py file provides functionality for loading pretrained GPT models and generating text. The
GPT class represents the full Transformer model and can be initialized from pretrained weights using the
from_pretrained() classmethod. This method matches up parameter names between the model definition and the pretrained weights, handling any necessary transposition of weights.
Once initialized, the model can be used for text generation by passing encoded prefixes to the
generate() method, which implements text sampling by taking the model's prediction at each step and feeding it back as additional context for the next step. It handles iterating the model over long-form text generation.
The key aspects that enable loading pretrained models and generation are:
- The GPT class, which defines the overall model architecture and handles embeddings, attention blocks, and the projection head.
- The from_pretrained() method, which populates the model parameters from pretrained weights files.
- The generate() function, which implements autoregressive sampling from the model by feeding previous predictions back in as additional context.
This section details the scripts and code used to preprocess various text datasets into an efficient format for training natural language processing models with the
nanoGPT library. There are several datasets that can be preprocessed, including OpenWebText and various Shakespeare corpora.
The main preprocessing tasks include tokenization, encoding text into integer IDs, splitting data into train and validation portions, and saving the encoded sequences to binary files for efficient loading during training. This section will cover the key classes, functions and algorithms used to implement these preprocessing steps for different types of data.
data directory contains subdirectories for each dataset. For OpenWebText, the
…/prepare.py script handles the preprocessing. It first loads the raw text using
load_dataset() from HuggingFace. It then splits the data into train and validation subsets with
train_test_split(). The main tokenization step uses
tiktoken's encode_ordinary() function to encode the text with byte-pair encoding. The tokenized IDs and lengths are saved as new dataset features. NumPy's efficient
memmap is used to write the concatenated IDs to the
train.bin and val.bin binary files in batches.
For the Shakespeare datasets, preprocessing is implemented in scripts located in each dataset subdirectory. The
…/prepare.py script downloads the raw text if needed. It splits the text into train and validation portions, then uses
encode_ordinary() to encode the splits with Byte-Pair Encoding. The encoded IDs are converted to NumPy arrays and saved to
.bin files using
tofile(). The character-level Shakespeare data in
…/prepare.py extracts the character vocabulary, creates the stoi and
itos mappings for encoding/decoding, then encodes and saves the data in a similar manner.
…/shakespeare directory contains tools for preprocessing the "tiny Shakespeare" text dataset, a small corpus of snippets from Shakespeare's plays that has long served as a benchmark for character-level language modeling. The
…/prepare.py script downloads the raw text if needed, then splits it into training and validation sets.
It uses the
tiktoken library to encode the text into byte-pair encodings (BPE) using the GPT-2 vocabulary. The
get_encoding() function returns an Encoding object for the GPT-2 vocabulary, which is used by
encode_ordinary() to encode the plain text into integer token IDs. These encoded sequences are converted to NumPy arrays with
np.array() and saved as
.bin files in the local directory using tofile(). The resulting
train.bin file contains 301,966 tokens for model training, while
val.bin contains 36,059 tokens for evaluation. These preprocessed data files can then be loaded directly into the language models during training and evaluation. In summary, this directory provides a preprocessed "tiny Shakespeare" dataset ready for use in natural language processing tasks.
…/openwebtext directory contains scripts and data for preprocessing the OpenWebText web crawl dataset. OpenWebText is an open-source reproduction of OpenAI's WebText corpus, the dataset used to train GPT-2, built from the text of web pages linked from Reddit submissions with sufficient karma. This makes it broadly representative of the text people share and read online.
The main script for preprocessing the data is
…/prepare.py. This script takes the raw OpenWebText dataset and prepares binary files suitable for efficient training of NLP models. It first loads the dataset using HuggingFace's
load_dataset() function. It then splits the dataset into train and validation subsets using train_test_split().
The text in each example is tokenized using GPT-2's byte-pair encoding with the
tiktoken library and
enc.encode_ordinary() function. The tokenized ids and length are saved to new features for each example using a map function. The examples are then concatenated into large binary files,
train.bin and val.bin, where the ids are stored as uint16 for efficiency.
memmap is used to efficiently write the concatenated ids to the binary files in batches, without needing the entire dataset in memory at once. A tqdm progress bar visualizes the batch writing process. The final binary files can then be efficiently loaded and streamed for model training.
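The batched memmap-writing idea can be sketched with NumPy alone. The corpus below is a stand-in array; the pattern of pre-sizing the file and filling it one batch at a time is what keeps memory usage flat. Token ids fit in uint16 because the GPT-2 vocabulary (50,257 entries) is below 2**16.

```python
import os
import tempfile
import numpy as np

ids = np.arange(10_000, dtype=np.uint16)  # stand-in for the tokenized corpus
path = os.path.join(tempfile.mkdtemp(), "train.bin")

# Pre-size the on-disk array, then write in batches so the whole dataset
# never needs to sit in memory at once.
arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(len(ids),))
batch_size = 1024
for start in range(0, len(ids), batch_size):
    batch = ids[start : start + batch_size]
    arr[start : start + len(batch)] = batch  # one batch per write
arr.flush()

# Reading back: memory-map the file instead of loading it wholesale.
loaded = np.memmap(path, dtype=np.uint16, mode="r")
```

Training code can then index `loaded` directly, and the OS pages in only the slices that are actually touched.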
…/shakespeare_char directory contains a preprocessed character-level version of the Shakespeare dataset for language modeling. The
prepare.py script in this directory downloads the raw Shakespeare text and implements preprocessing steps to prepare character-level training and validation data.
It first downloads the tiny Shakespeare text file from a GitHub URL if not already present using the
requests module. It then extracts all unique characters from the text by converting it to a
set and sorting the list.
This prepares an efficient character-level dataset without the need for BPE tokens. The encoding mappings allow easy encoding and decoding with the trained language model.
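The character-level scheme above reduces to a few lines: collect the unique characters, assign each an integer, and build the inverse mapping. A minimal sketch (the text here is a short stand-in for the full corpus):

```python
# Stand-in for the downloaded Shakespeare corpus.
text = "First Citizen: Before we proceed any further"

chars = sorted(set(text))                      # the character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer id
itos = {i: ch for i, ch in enumerate(chars)}   # integer id -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

vocab_size = len(chars)
ids = encode("Before")
```

The stoi/itos mappings are exactly what gets exported as encoding metadata, so samples generated later can be decoded back into readable text.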
The model.py file implements the core GPT model architecture using PyTorch. It defines the main structural components of the Transformer, described below. The
CausalSelfAttention class performs self-attention on the input sequence, using either PyTorch's "flash" attention for efficiency or a manual implementation with causal masking. It computes the attention in parallel across all heads.
MLP class contains a simple feedforward network, applying two linear transformations with a GELU activation in between.
Block class composes the attention and feedforward sublayers along with residual connections and layer normalization. This forms the basic building block of the Transformer model.
GPT class represents the full model, handling the embeddings, and stacking multiple
Block layers. It supports initialization from a pretrained model checkpoint using the from_pretrained() classmethod.
Key functionality includes:
- Causal self-attention using CausalSelfAttention, with masking to prevent attention to subsequent positions
- Feedforward sublayers with MLP
- Residual connections and layer normalization in each Block
- Embedding tokens and generating predictions with a final head in GPT
- Loading pretrained weights from files with from_pretrained()
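The causal masking idea can be sketched in NumPy for a single attention head. This is an illustrative stand-in for the manual (non-flash) code path: future positions are hidden by setting their scores to -inf before the softmax.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal attention over a (T, C) sequence, NumPy sketch."""
    T, C = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])         # (T, T) scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)          # hide future positions
    # Row-wise numerically stable softmax; masked entries become weight 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, C = 5, 8
x = rng.standard_normal((T, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
out, weights = causal_self_attention(x, Wq, Wk, Wv)
```

The real module additionally splits the channels into multiple heads and, where available, dispatches to PyTorch's fused flash-attention kernel with `is_causal=True` instead of building the mask explicitly.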
config directory contains configuration files for initializing models and setting hyperparameters for tasks like training, evaluation, and fine-tuning. These configuration files control important aspects of experiments like data loading, model initialization, training hyperparameters, logging, and saving checkpoints without defining complex classes or algorithms.
The configuration files make it easy to run different tasks by simply changing which configuration file is passed on the command line. For example, the various GPT language models and datasets can be evaluated by swapping in the corresponding config file.
Key configuration files:
- …/train_gpt2.py sets hyperparameters for distributed training of GPT-2 across multiple GPUs. It controls batch size, learning rate schedule, and evaluation frequency.
- …/train_shakespeare_char.py initializes a small Transformer model and sets hyperparameters for character-level language modeling on Shakespeare.
- …/finetune_shakespeare.py controls finetuning the Shakespeare model on a new task, setting hyperparameters like learning rate, checkpoint saving, and data loading.
These files focus only on configuration and do not define complex classes or functions. They make experimentation simple by controlling different model/task configurations without code changes. The actual training/evaluation logic is contained elsewhere and invoked via the parameters in these files.
bench.py file contains tools for benchmarking GPT models to measure performance and throughput. It initializes a GPT model using the
GPTConfig class to define the model configuration and the
GPT class which implements the Transformer architecture.
Training data is loaded and preprocessed into batches using
get_batch(). The model is then run on these batches to calculate loss via
model() and optimize weights using the optimizer created by the model's configure_optimizers() method.
Performance is measured in two ways. The PyTorch profiler
torch.profiler.profile() provides detailed analysis of operations. For simpler benchmarking, batches are run and timed with iteration counting, and throughput is estimated with the model's estimate_mfu() method.
Both real training data from files and fixed random data via
torch.randint() can be used as input. The model, optimizer, and mixed precision via
autocast are initialized upfront. On supported hardware, the model is compiled with torch.compile.
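The simple-timing approach reduces to wall-clocking a fixed number of iterations and dividing through. The sketch below uses a placeholder workload in place of a model step, so it runs anywhere; `tokens_per_iter` is an illustrative batch-size-times-sequence-length product.

```python
import time

def run_step():
    # Placeholder workload; a real benchmark would run the model's
    # forward/backward pass on a batch here.
    return sum(i * i for i in range(10_000))

num_iters = 20
tokens_per_iter = 12 * 1024  # e.g. batch_size * block_size

t0 = time.perf_counter()
for _ in range(num_iters):
    run_step()
dt = time.perf_counter() - t0

iters_per_sec = num_iters / dt
tokens_per_sec = iters_per_sec * tokens_per_iter
```

Dividing the achieved flops per second implied by such timings by the hardware's peak flops is what yields the MFU (model flops utilization) metric.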
GPTConfig class holds the model hyperparameters, controlling aspects like sequence length (block size), the number of layers and attention heads, and the embedding size; training settings such as batch size and learning rate are defined in the benchmark script itself.
GPT class implements the Transformer architecture. Its
forward() method runs batches of input through attention and MLP blocks. The
estimate_mfu() method calculates a throughput metric from iteration timings.
The benchmark loop calls
get_batch() to retrieve data and
model() to run batches forward and calculate loss. Calls to
loss.backward() perform backpropagation, and weights are updated with optimizer.step().
The code in
configurator.py handles command line and configuration file argument parsing. It uses a simple but effective approach - iterating through
sys.argv and checking each argument for the presence of an equals sign
= to determine if it is a configuration file path or a key-value override.
For configuration file paths, it executes the file contents with
exec() to import configuration defaults from an external file without needing to import or parse the file. This provides an easy way to manage different configuration presets.
For key-value arguments like
--key=value, it splits on the equals sign and attempts to evaluate the value as a Python object using
literal_eval() to support types like numbers, booleans, etc. If evaluation fails, it keeps the value as a string.
It performs a type check between the given value and the existing global variable to ensure type compatibility before overriding. This safety check avoids potential type errors.
The globals namespace is used to store and override the configuration defaults, so any code using this just has access to configuration variables without needing a special configuration class or parameter prefixing. This provides a clean interface.
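The override scheme can be sketched with a dict standing in for the globals namespace. This is a simplified illustration: the real configurator.py mutates module globals via exec() and globals() rather than a dict, but the literal_eval parsing and type check work the same way.

```python
from ast import literal_eval

# Defaults, standing in for the module-level config globals.
config = {"batch_size": 12, "learning_rate": 6e-4, "dataset": "openwebtext"}

def apply_overrides(config, argv):
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue  # configurator treats no-'=' args as config file paths
        key, val = arg[2:].split("=", 1)
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            attempt = literal_eval(val)  # numbers, booleans, tuples, ...
        except (SyntaxError, ValueError):
            attempt = val                # fall back to a plain string
        # Type check against the default to catch accidental mismatches.
        if type(attempt) is not type(config[key]):
            raise TypeError(f"type mismatch for {key}")
        config[key] = attempt
    return config

apply_overrides(config, ["--batch_size=32", "--dataset=shakespeare"])
```

Note how `--dataset=shakespeare` survives: literal_eval rejects the bare word, so the value stays a string, which matches the default's type and is accepted.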