Auto-generated from karpathy/nanoGPT by Auto Wiki

GitHub Repository
Written in: Python
Watchers: 317
Last updated: 2024-01-08
Generated at: 2024-01-08, from commit eba36e

The nanoGPT repository provides an efficient PyTorch implementation of Generative Pre-trained Transformer (GPT) models for natural language processing. It includes tools for training, evaluating, sampling from, and benchmarking GPT models like GPT-2.

The key functionality focuses on:

  • Defining the GPT model architecture, including components like self-attention, MLP layers, embeddings, and sampling logic, with support for initializing from pretrained GPT checkpoints.

  • Distributed multi-GPU training of GPT models, using optimization techniques like mixed precision and gradient accumulation.

  • Utilities for sampling text continuations from trained GPT models.

  • Data preprocessing scripts in data for datasets like OpenWebText and Shakespeare, which tokenize text and efficiently write it to disk.

  • Configuration files in config for setting hyperparameters and controlling training, evaluation, and inference.

  • Benchmarking the throughput and speed of GPT models.

The library aims to provide building blocks for efficiently training and deploying GPTs from scratch or from existing pretrained checkpoints. The model architecture directly follows the GPT papers. Key design choices focus on computational performance and ease of use.

Training GPT Models

References: config

This section covers the functionality for distributed training of GPT language models like GPT-2 and a miniature Shakespeare model. The training script implements training of GPT models in plain PyTorch. It supports both single-GPU debugging as well as distributed data parallel training across multiple GPUs/nodes.

The main classes used for training are the GPT model class and the GPTConfig class for model configuration. The GPT class defines the model architecture, while GPTConfig handles setting hyperparameters.

Training is performed using the AdamW optimizer. Mixed precision is supported using PyTorch's autocast and GradScaler to help train larger models. Distributed data parallel training is implemented using PyTorch's DistributedDataParallel.

The training loop iterates over minibatches, computes the forward and backward pass, calls the optimizer step, and evaluates loss at intervals. Learning rate scheduling is implemented via the get_lr() function, which uses a cosine schedule. Checkpoints are saved periodically using the checkpointing functionality.
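The cosine schedule with linear warmup can be sketched as follows (the hyperparameter values here are illustrative, not the repository's defaults):

```python
import math

# Illustrative hyperparameters; the real script reads these from its config.
warmup_iters = 100
lr_decay_iters = 1000
learning_rate = 6e-4
min_lr = 6e-5

def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, return the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Calling get_lr(it) each iteration and writing the result into the optimizer's parameter groups keeps the schedule decoupled from the optimizer itself.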

The …/ file sets hyperparameters for distributed training of the GPT-2 124M model across 8 GPUs. It configures a total batch size of around 0.5 million tokens per step by sharding across GPUs with gradient accumulation. Training runs for 600,000 iterations with learning rate decay. Model evaluation runs every 1,000 steps.

The …/ file contains configuration for training a character-level Shakespeare language model. It defines the model architecture using hyperparameters and sets training hyperparameters like batch size, learning rate schedule, and evaluation interval tailored for the smaller dataset.

Distributed Training


Distributed training allows training a model across multiple GPUs or nodes to speed up the training process. The training script implements distributed data parallel training for GPT models using PyTorch DistributedDataParallel.

When running with multiple GPUs, the model is replicated across each GPU. During the forward and backward passes, each GPU works on a subset of the mini-batch. Gradients are averaged across GPUs after each step. This parallelizes the computation and allows using much larger effective batch sizes.

The training loop handles distributing the data and model. It uses PyTorch's DistributedDataParallel module to replicate the model across GPUs. At each step, it calls get_batch() to retrieve a data batch for each GPU. The forward and backward passes are performed, then gradients are averaged and the optimizer step is run.

Periodically the loss is estimated on the validation split to track progress, and checkpoints are saved at intervals by the master process, allowing training to be resumed later. This enables efficient multi-GPU and multi-node training of GPT models.
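Why gradient averaging gives the same update as one big batch can be illustrated with a small NumPy simulation (a toy stand-in for DDP's all-reduce, not the actual distributed code):

```python
import numpy as np

# Toy stand-in for DDP's gradient all-reduce: each "GPU" computes a
# gradient on its own shard of the mini-batch, then the gradients are
# averaged so every replica applies the identical update.
rng = np.random.default_rng(0)
shards = [rng.normal(size=(8, 4)) for _ in range(4)]    # one shard per GPU

local_grads = [shard.mean(axis=0) for shard in shards]  # per-replica grads
avg_grad = sum(local_grads) / len(local_grads)          # "all-reduce" mean

# With equal shard sizes this matches the gradient of the full batch:
full_grad = np.concatenate(shards, axis=0).mean(axis=0)
assert np.allclose(avg_grad, full_grad)
```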

Training Configuration
References: config/, config/

The files …/ and …/ handle configuring training hyperparameters and settings for different models. These files set values for hyperparameters to control the training process, without implementing any training logic.

The …/ file configures distributed training of the GPT-2 124M model across 8 GPUs. It sets hyperparameters like batch size, block size, and gradient accumulation steps to target a total batch size of around 0.5 million tokens per training step. The number of training iterations and evaluation interval are also configured.

The …/ file handles configuration for training a miniature character-level Shakespeare language model. It defines the model architecture as a 6-layer, 6-head Transformer with 384 hidden size. Training hyperparameters like the batch size, number of iterations, and evaluation interval are set.

Both files focus only on configuration and do not implement any training logic. Key settings referenced include the dataset used to supply the training data.


Checkpointing
Checkpointing is handled by the training script. During training, checkpoints are periodically saved to disk so the model can be recovered later if training is interrupted.

The script saves checkpoints using torch.save(). At intervals determined by the training configuration, a checkpoint containing the full model state, optimizer state, and current iteration count is saved.

Each checkpoint records the iteration count and best validation loss alongside the model and optimizer state. This allows recovering training at the exact point where it left off.

To load a checkpoint, the checkpoint path can be passed when initializing the GPT model, which will load the state_dict from the checkpoint file. This resumes training from the saved iteration count.
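A minimal sketch of the save/resume round trip, using pickle as a stand-in so the example runs without PyTorch (the actual script uses torch.save() and torch.load() on a dict of roughly this shape):

```python
import os
import pickle
import tempfile

# A resumable checkpoint bundles everything needed to pick training back up.
checkpoint = {
    "model": {"lm_head.weight": [0.1, 0.2]},  # stand-in for a state_dict
    "optimizer": {"step": 500},               # stand-in for optimizer state
    "iter_num": 500,
    "best_val_loss": 3.21,
}

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)

with open(path, "rb") as f:
    resumed = pickle.load(f)
assert resumed["iter_num"] == 500  # resume exactly where training stopped
```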


Logging
The training script handles logging of training metrics and results. During training, the values of important metrics like loss, learning rate, and timing information are tracked at intervals.

The main metrics logged include:

  • Loss on training batches and on the held-out validation split
  • Learning rate after each update
  • Per-iteration timings and estimated model flops utilization (MFU)

These metrics are logged to multiple destinations including:

  • Standard output - printed to the terminal
  • Weights & Biases (wandb) - optionally, for visualizing metrics over time
  • Checkpoints - the best validation loss and iteration count are saved within checkpoint files

This allows monitoring training progress in real time via the terminal as well as visualizing changes over time in wandb. Checkpoints also record the iteration count and best validation loss, enabling training to resume with its history.

Data Loading

References: data

The data directory contains scripts and code for loading and preprocessing various text datasets into an efficient format suitable for training natural language processing models. It includes subdirectories for the different datasets: openwebtext, shakespeare, and shakespeare_char.

The scripts in each subdirectory implement the key steps for loading and preprocessing their respective datasets:

  • For OpenWebText, the preparation script loads the raw data using load_dataset(), splits it into train/val subsets, tokenizes the text with BPE using tiktoken, and saves the tokenized ids to train.bin and val.bin files using memmap for efficient writing.

  • For Shakespeare, the script downloads the raw text if needed, splits it into train/val sets, encodes the text into BPE tokens using tiktoken, and saves the encoded ids as NumPy arrays to train.bin and val.bin.

  • For character-level Shakespeare, the script downloads the text, calculates the character vocabulary, encodes the character strings as integers, saves the encoded train and val data to binary files, and exports the encoding metadata.
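The common pattern across these scripts - encode text to integer ids and write them to disk as uint16 - can be sketched with a toy character vocabulary standing in for the BPE tokenizer:

```python
import os
import tempfile
import numpy as np

text = "to be or not to be"
# toy character vocabulary standing in for tiktoken's BPE tokenizer
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}

n = int(0.9 * len(text))                 # 90/10 train/val split
splits = {"train": text[:n], "val": text[n:]}

outdir = tempfile.mkdtemp()
for name, chunk in splits.items():
    ids = np.array([stoi[c] for c in chunk], dtype=np.uint16)
    ids.tofile(os.path.join(outdir, f"{name}.bin"))  # raw binary dump

# the .bin files round-trip losslessly at load time
train_ids = np.fromfile(os.path.join(outdir, "train.bin"), dtype=np.uint16)
assert len(train_ids) == n
```

Storing ids as uint16 halves the file size relative to int32 and is sufficient for the GPT-2 vocabulary of 50,257 tokens.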

Evaluating GPT Models

References: config

This section details the tools and processes for evaluating pretrained GPT language models on datasets. The main functionality is contained in the sampling script and the configuration files under config. The sampling script handles drawing generations from a pretrained model. It loads a model checkpoint from disk using either torch.load() or GPT.from_pretrained(), sets the model to eval mode, and moves it to the specified device. The script defines encoding and decoding functions to preprocess input text and generate continuations. Key sampling parameters like temperature and top-k can be configured.

The most important part is using model.generate() within a torch.no_grad() context to sample continuations from the model efficiently. This allows generating multiple samples to evaluate model quality and diversity.

The configuration files under config initialize models from various pretrained checkpoints and set hyperparameters for evaluation like batch size and number of iterations. Files like …/ configure evaluation of the standard GPT-2 124M model at a batch size of 8 and 500 iterations. Larger models have their own configuration files.

These configuration files do not implement any classes or complex logic; they simply provide the initialization and hyperparameters needed to run evaluation of different pretrained GPT variants. The core sampling functionality is contained in the sampling script.

Evaluation Metrics

References: nanoGPT

Computing metrics like perplexity to measure a language model's performance is an important part of evaluating how well the model learns. The training script includes functionality to estimate loss, from which perplexity can be derived, on held-out validation data during training.

The estimate_loss() function runs the model on batches of validation examples, calculates the loss, and averages it across the sampled batches. This gives an estimate of the model's performance on new data.

The loss computed is cross entropy loss between the model's predicted token probabilities and the ground truth target tokens. This directly measures how surprised the model is by the next tokens in the validation sequences, with a lower loss corresponding to a better performing model.

Cross-entropy loss can be converted to perplexity, a more interpretable metric for language models. Perplexity is the exponential of the cross-entropy loss and represents the effective number of tokens the model is choosing among at each step; a lower perplexity means the model is less uncertain.

The estimate_loss() function runs at the specified evaluation interval, such as every 1000 training steps. This allows monitoring validation performance throughout training to check for overfitting. The estimated metrics like perplexity are logged, which helps evaluate how well the model is learning the training distribution.
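The loss-to-perplexity conversion is a one-liner; the validation loss below is a hypothetical value used only for illustration:

```python
import math

val_loss = 1.47  # hypothetical average cross-entropy, in nats per token
perplexity = math.exp(val_loss)

# a perplexity of ~4.35 means the model is, on average, about as
# uncertain as a uniform choice among ~4.35 tokens at each step
print(f"{perplexity:.2f}")
```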


Text Generation
The sampling script handles generating text samples from pretrained GPT models. It loads a model checkpoint from either a previous training run or a pretrained variant using GPT.from_pretrained(). The model is set to evaluation mode and moved to the specified device.

Encoding and decoding functions are defined to preprocess input text and decode generated output. The core sampling functionality is contained in the model.generate() method. It takes an encoded prompt as input and repeatedly samples from the predicted distribution over next tokens to produce a continuation. Temperature and top-k parameters can be adjusted to control the generation process.

Some key details:

  • Hyperparameters like init_from, num_samples, temperature, and device are set via arguments.

  • The encode() function handles encoding the input prompt, defaulting to GPT-2 encoding.

  • model.generate() produces each output token inside a torch.no_grad() context. It has options to set the number of new tokens, the sampling temperature, and top-k filtering.

  • The decoded output is printed for each sample to display the model's continuation of the prompt.
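One sampling step with temperature and top-k can be sketched in NumPy - a simplified stand-in for what model.generate() does per token, not the repository's code:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """One autoregressive step: temperature-scale the logits, optionally
    keep only the top-k, softmax, then draw a token id."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = logits / temperature
    if top_k is not None:
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0, -5.0])
token = sample_next(logits, temperature=0.8, top_k=3,
                    rng=np.random.default_rng(0))
assert token in (0, 1, 2)  # everything outside the top 3 has zero probability
```

Lower temperatures sharpen the distribution toward the highest-probability tokens, while top-k zeroes out the long tail entirely.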

Evaluation Configuration
References: config/, config/, config/, config/

The configuration files in the config directory handle setting hyperparameters and parameters for evaluating GPT models. These files focus on initializing the model, preprocessing data, and running the model in evaluation mode to compute metrics without any training or fine-tuning.

The …/ file configures evaluation of the GPT-2 small model. It sets the batch_size to 8 and number of eval_iters to 500. This ensures a large enough batch size for stable performance estimates, and runs evaluation over many batches to average results for an accurate overall metric. It also initializes the model weights from the pretrained GPT-2 checkpoint using init_from.

The …/ and …/ files serve similar purposes, but configure the large and medium GPT-2 models respectively. They set the batch_size and number of eval_iters, and initialize each model variant from its own pretrained checkpoint.

The …/ file handles evaluation of the GPT-2 XL model, the largest model variant. It sets the same evaluation hyperparameters and initializes from the pretrained GPT-2 XL weights.

All of these configuration files focus only on initializing models, preprocessing data, and setting hyperparameters for evaluation runs. They do not implement any custom classes, functions, or algorithms. The core evaluation logic is implemented in other code modules imported by the training and evaluation scripts.

Using Pretrained GPT Checkpoints


The model file provides functionality for loading pretrained GPT models and generating text. The GPT class represents the full Transformer model and can be initialized from pretrained weights using the from_pretrained() classmethod. This method matches up parameter names between the model definition and the pretrained weights, handling any necessary transposition of weights.

Once initialized, the model can be used for text generation by passing prefixes to the generate() function. generate() implements text sampling by taking the model predictions at each step of the sequence and feeding them back as additional context for the next step. It handles iterating the model over long-form text generation.

The key aspects that enable loading pretrained models and generation are:

  • The GPT class, which defines the overall model architecture and handling of embeddings, attention blocks, and projection head.

  • The from_pretrained() method, which populates the model parameters from pretrained weights files.

  • The generate() function, which implements autoregressive sampling from the model by feeding previous predictions back in as additional context.

  • The Block class, which composes the core attention and feedforward sublayers along with residual connections and layer normalization. The stacked Block layers make up the body of the GPT model.
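The weight-mapping idea in from_pretrained() can be illustrated with toy arrays: the original GPT-2 checkpoints use a Conv1D layout for certain weight matrices, which must be transposed to match nn.Linear. The parameter names and shapes below are illustrative, not an exact copy of the checkpoint keys:

```python
import numpy as np

# Hypothetical source checkpoint: Conv1D stores (in_features, out_features),
# while nn.Linear expects the transpose, (out_features, in_features).
hf_weights = {
    "h.0.attn.c_attn.weight": np.ones((4, 12)),  # Conv1D layout
    "h.0.ln_1.weight": np.ones(4),               # copied as-is
}
transposed_suffixes = ("attn.c_attn.weight", "attn.c_proj.weight",
                       "mlp.c_fc.weight", "mlp.c_proj.weight")

our_state = {}
for name, w in hf_weights.items():
    if name.endswith(transposed_suffixes):
        our_state[name] = w.T  # flip into nn.Linear's (out, in) layout
    else:
        our_state[name] = w

assert our_state["h.0.attn.c_attn.weight"].shape == (12, 4)
```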

Data Preprocessing

References: data

This section details the scripts and code used to preprocess various text datasets into an efficient format for training natural language processing models with the nanoGPT library. There are several datasets that can be preprocessed, including OpenWebText and various Shakespeare corpora.

The main preprocessing tasks include tokenization, encoding text into integer IDs, splitting data into train and validation portions, and saving the encoded sequences to binary files for efficient loading during training. This section will cover the key classes, functions and algorithms used to implement these preprocessing steps for different types of data.

The data directory contains subdirectories for each dataset. For OpenWebText, the …/ script handles the preprocessing. It first loads the raw text using load_dataset() from HuggingFace. It then splits the data into train and validation subsets with train_test_split(). The main tokenization step uses tiktoken's get_encoding() and encode_ordinary() functions to encode the text with Byte-Pair Encoding. The tokenized IDs and lengths are saved as new dataset features. NumPy's efficient memmap is used to write the concatenated IDs to the train.bin and val.bin binary files in batches.

For the Shakespeare datasets, preprocessing is implemented in scripts located in each dataset subdirectory. The …/ script downloads the raw text if needed. It splits the text into train and validation portions, then uses tiktoken's get_encoding() and encode_ordinary() to encode the splits with Byte-Pair Encoding. The encoded IDs are converted to NumPy arrays and saved to .bin files using tofile(). The character-level Shakespeare data in …/ calculates character frequencies, creates stoi and itos mappings for encoding/decoding, then encodes and saves the data in a similar manner.

Shakespeare Dataset

References: data/shakespeare, data/shakespeare/

The …/shakespeare directory contains tools for preprocessing the "tiny Shakespeare" text dataset, a small corpus of text drawn from Shakespeare's plays. The …/ script downloads the raw text if needed, then splits it into training and validation sets.

It uses the tiktoken library to encode the text into byte-pair encodings (BPE) using the GPT-2 vocabulary. The get_encoding() function returns a BPE object containing the vocabulary, which is used by encode_ordinary() to encode the plain text into integer token IDs. These encoded sequences are converted to NumPy arrays with np.array() and saved as .bin files in the local directory using tofile().

The train.bin file contains 301,966 tokens for model training, while val.bin contains 36,059 tokens for evaluation. These preprocessed data files can then be loaded directly into the language models during training and evaluation. In summary, this directory provides a preprocessed "tiny Shakespeare" dataset ready for use in natural language processing tasks.

OpenWebText Dataset

References: data/openwebtext, data/openwebtext/

The …/openwebtext directory contains scripts and data for preprocessing the OpenWebText web crawl dataset. OpenWebText is an open-source reproduction of OpenAI's unreleased WebText dataset, which was used to train GPT-2; it consists of text scraped from web pages that were shared and upvoted on Reddit.

The main script for preprocessing the data is …/ This script takes the raw OpenWebText dataset, which consists of plain text documents, and prepares binary files suitable for efficient training of NLP models. It first loads the dataset using HuggingFace's load_dataset() function, then splits it into train and validation subsets using train_test_split().

The text in each example is tokenized using GPT-2's byte-pair encoding with the tiktoken library and enc.encode_ordinary() function. The tokenized ids and length are saved to new features for each example using a map function. The examples are then concatenated into large binary files, train.bin and val.bin, where the ids are stored as uint16 for efficiency.

NumPy's memmap is used to efficiently write the concatenated ids to the binary files in batches, without needing the entire dataset in memory at once. A tqdm progress bar visualizes the batch writing process. The final binary files can then be efficiently loaded and streamed for model training.
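The batched memmap write can be sketched as follows (toy sizes here; the real script streams billions of tokens without holding them in memory):

```python
import os
import tempfile
import numpy as np

# Pretend these batches come from a tokenizer; the full corpus never
# needs to be resident in memory at once.
rng = np.random.default_rng(0)
batches = [rng.integers(0, 50257, size=1000).astype(np.uint16)
           for _ in range(10)]
total = sum(len(b) for b in batches)

path = os.path.join(tempfile.mkdtemp(), "train.bin")
arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(total,))
idx = 0
for b in batches:
    arr[idx:idx + len(b)] = b   # write one batch at a time
    idx += len(b)
arr.flush()

assert os.path.getsize(path) == total * 2  # uint16 = 2 bytes per token
```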

Character-Level Shakespeare Dataset

References: data/shakespeare_char, data/shakespeare_char/

The …/shakespeare_char directory contains a preprocessed character-level version of the Shakespeare dataset for language modeling. The script in this directory downloads the raw Shakespeare text and implements preprocessing steps to prepare character-level training and validation data.

It first downloads the tiny Shakespeare text file from a GitHub URL if not already present using the requests module. It then extracts all unique characters from the text by converting it to a set and sorting the list.

Two mappings are created - stoi maps each character to a unique integer, while itos maps integers back to characters. This allows the text to be encoded and decoded.

The text is split into a 90% train and 10% validation set. The encode function uses the stoi mapping to encode each text sample into a list of integers.

These encoded train and validation integer lists are saved as numpy arrays with dtype uint16 to the train.bin and val.bin binary files respectively, for efficient reading during training.

The stoi, itos mappings and vocabulary size are saved to the meta.pkl file. This metadata can be loaded later to restore the encoding and decode predictions from the model back to characters.

This prepares an efficient character-level dataset without the need for BPE tokens. The encoding mappings allow easy encoding and decoding with the trained language model.
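A compact sketch of this encode/decode pipeline and the meta.pkl round trip (the file names mirror the description above; the text is illustrative):

```python
import os
import pickle
import tempfile

text = "First Citizen: Speak."
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# save the metadata needed to decode model output later
meta_path = os.path.join(tempfile.mkdtemp(), "meta.pkl")
with open(meta_path, "wb") as f:
    pickle.dump({"vocab_size": len(chars), "stoi": stoi, "itos": itos}, f)

with open(meta_path, "rb") as f:
    meta = pickle.load(f)
assert decode(encode("Speak")) == "Speak"   # lossless round trip
assert meta["vocab_size"] == len(chars)
```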

GPT Model Architecture


The model file implements the core GPT architecture using PyTorch. It defines the main structural components of the Transformer model including LayerNorm, CausalSelfAttention, MLP, and Block.

The CausalSelfAttention class performs self-attention on the input sequence, using either PyTorch's "flash" attention for efficiency or a manual implementation with causal masking. It computes the attention in parallel across all heads.

The MLP class contains a simple feedforward network, applying two linear transformations with a GELU activation in between.

The Block class composes the attention and feedforward sublayers along with residual connections and layer normalization. This forms the basic building block of the Transformer model.

The GPT class represents the full model, handling the embeddings, and stacking multiple Block layers. It supports initialization from a pretrained model checkpoint using the from_pretrained classmethod.

Key functionality includes:

  • Causal self-attention using CausalSelfAttention with masking to prevent attention to subsequent positions
  • Feedforward sublayers with MLP
  • Residual connections and layer normalization in each Block
  • Embedding tokens and generating predictions with a final head in GPT
  • Loading pretrained weights from files with from_pretrained
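A single-head NumPy sketch of the causal masking (the real CausalSelfAttention is multi-head, batched, and can use flash attention):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head attention over a (T, C) sequence with a causal mask."""
    T, _ = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = (q @ k.T) / np.sqrt(k.shape[-1])          # scaled dot products
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    att[mask] = -np.inf                             # block future positions
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)          # row-wise softmax
    return att @ v

rng = np.random.default_rng(0)
T, C = 5, 8
x = rng.normal(size=(T, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

y1 = causal_self_attention(x, Wq, Wk, Wv)
x2 = x.copy()
x2[-1] += 10.0                                      # perturb only the last token
y2 = causal_self_attention(x2, Wq, Wk, Wv)
assert np.allclose(y1[:-1], y2[:-1])                # earlier outputs unchanged
```

The final assertion demonstrates causality: perturbing the last token cannot change the outputs at earlier positions, because the mask zeroes out their attention to it.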

Configuration Files
References: config

The config directory contains configuration files for initializing models and setting hyperparameters for tasks like training, evaluation, and fine-tuning. These configuration files control important aspects of experiments like data loading, model initialization, training hyperparameters, logging, and saving checkpoints without defining complex classes or algorithms.

The configuration files make it easy to run different tasks by just changing the configuration file used. For example, various GPT language models and datasets can be evaluated by changing the config file to …/, …/, or …/

These files focus only on configuration and do not define complex classes or functions. They make experimentation simple by controlling different model/task configurations without code changes. The actual training/evaluation logic is contained elsewhere and invoked via the parameters in these files.


Benchmarking
The benchmarking script contains tools for measuring the performance and throughput of GPT models. It initializes a GPT model using the GPTConfig class to define the model configuration and the GPT class, which implements the Transformer architecture.

Training data is loaded and preprocessed into batches using get_batch(). The model is then run on these batches to calculate loss via model(), and weights are updated using an AdamW optimizer created through the model's configure_optimizers() method.

Performance is measured in two ways. The PyTorch profiler torch.profiler.profile() provides detailed analysis of operations. For benchmarking, batches are run and timed with simple iteration counting to estimate throughput, and model.estimate_mfu() converts the achieved speed into a model flops utilization (MFU) figure relative to hardware peak.

Both real training data from files and fixed random data via torch.randint() can be used as input. The model, optimizer, and mixed precision via autocast are initialized upfront. For supported hardware, the model is compiled with torch.compile().

The GPTConfig class is a simple configuration dataclass for the model, defining hyperparameters like the number of layers and heads, embedding size, block size, and dropout. Benchmark settings such as batch size and sequence length are set separately in the script.

The GPT class implements the Transformer architecture. Its forward() method runs batches of input through attention and MLP blocks. The estimate_mfu() method calculates a throughput metric from iteration timings.

Training uses get_batch() to retrieve data and model() to run batches forward and calculate loss. optimizer.zero_grad() and loss.backward() perform backpropagation. Weights are updated with optimizer.step().
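The simple timing pattern used for throughput estimation looks like this; a dummy workload stands in for the model's forward/backward pass:

```python
import time

def dummy_step(n):
    """Stand-in workload for one forward/backward pass."""
    s = 0
    for i in range(n):
        s += i * i
    return s

iters, tokens_per_iter = 20, 50_000
t0 = time.time()
for _ in range(iters):
    dummy_step(tokens_per_iter)
dt = time.time() - t0

throughput = iters * tokens_per_iter / dt
print(f"{dt / iters * 1000:.2f} ms/iter, {throughput:,.0f} tokens/s")
```

Dividing the achieved FLOPs per second by the hardware's peak FLOPs gives the MFU figure that estimate_mfu() reports.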

Argument Parsing


The configurator code handles command line and configuration file argument parsing. It uses a simple but effective approach - iterating through sys.argv and checking each argument for the presence of an equals sign = to determine if it is a configuration file path or a key-value override.

For configuration file paths, it executes the file contents with exec() to import configuration defaults from an external file without needing to import or parse the file. This provides an easy way to manage different configuration presets.

For key-value arguments like --key=value, it splits on the equals sign and attempts to evaluate the value as a Python object using literal_eval() to support types like numbers, booleans, etc. If evaluation fails, it keeps the value as a string.

It performs a type check between the given value and the existing global variable to ensure type compatibility before overriding. This safety check avoids potential type errors.

The globals namespace is used to store and override the configuration defaults, so any code using this just has access to configuration variables without needing a special configuration class or parameter prefixing. This provides a clean interface.
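The override logic can be sketched as follows (a plain dict stands in for the script's globals(); the key names are illustrative config variables):

```python
from ast import literal_eval

config = {"batch_size": 12, "learning_rate": 6e-4, "dataset": "openwebtext"}

def apply_overrides(argv, config):
    for arg in argv:
        if "=" not in arg:
            continue  # the real script exec()s such args as config files
        key, val = arg.lstrip("-").split("=", 1)
        try:
            attempt = literal_eval(val)      # numbers, booleans, tuples...
        except (SyntaxError, ValueError):
            attempt = val                    # fall back to a plain string
        if key in config and type(attempt) is type(config[key]):
            config[key] = attempt            # type-checked override
        else:
            raise ValueError(f"unknown or mistyped config key: {key}")

apply_overrides(["--batch_size=32", "--dataset=shakespeare"], config)
assert config["batch_size"] == 32
assert config["dataset"] == "shakespeare"
```

Because literal_eval() fails on bare words like shakespeare, such values fall through as strings, which is exactly the forgiving behavior described above.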