Auto Wiki by Mutable.ai

FastChat

Auto-generated from lm-sys/FastChat by Mutable.ai Auto Wiki

GitHub Repository
Developer: lm-sys
Written in: Python
Stars: 31k
Watchers: 325
Created: 2023-03-19
Last updated: 2024-01-01
License: Apache License 2.0
Repository: lm-sys/FastChat

Auto Wiki
Generated at: 2024-01-01
Generated from: Commit 722ab0
Version: 0.0.4

FastChat implements an end-to-end conversational AI platform with capabilities for data processing, model training, distributed serving, and evaluation. The key components are:

The …/data module contains scripts for cleaning and preparing raw conversation datasets for training chatbot models. This includes steps like removing noise, converting between formats, splitting long conversations, and creating train/test splits. The full pipeline is orchestrated by …/prepare_all.py.

The …/train module implements training workflows for large conversational models such as Vicuna, a chat model fine-tuned from LLaMA. It builds on PyTorch and the Hugging Face Transformers Trainer to train efficiently on multiple GPUs with optimization techniques like gradient checkpointing. Specialized workflows target specific architectures, for example Flan-T5 in …/train_flant5.py.

The …/serve module contains a distributed system for serving trained models at scale. The …/controller.py dispatches requests across workers. Workers like …/model_worker.py load models and handle requests. The system can be launched via …/launch_all_serve.py.

The …/model module provides functionality for loading models from checkpoints, running inference to generate text, and compressing models. The registry in …/model_registry.py tracks metadata.

The tests module contains tests for core components like the CLI, APIs, and serving infrastructure. Tests validate functionality and can benchmark performance.

The playground module provides examples applying NLP techniques like classification and semantic search using pretrained embeddings for common tasks.

The documentation in docs covers architecture, models, commands, and more.

Training Conversational Models

References: fastchat/data, fastchat/train

The FastChat codebase contains several implementations for training conversational models such as Vicuna (fine-tuned from LLaMA) using supervised learning techniques. Key aspects include preprocessing conversation data, creating PyTorch datasets, loading pretrained models, masking targets, and training models to predict the next response.

The …/train directory houses the main training implementations. The …/train.py file handles the core training logic: it applies conversation templates from …/conversation.py to the raw conversation data, tokenizes it with a pretrained tokenizer, masks targets so the loss only covers model responses, and creates datasets for training. It then loads a pretrained causal language model and runs the training process.
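The exact masking code lives in …/train.py's preprocessing; the sketch below only illustrates the general technique, assuming a hypothetical list of assistant-turn token spans and the conventional -100 ignore index used by PyTorch's cross-entropy loss.

```python
import torch

IGNORE_TOKEN_ID = -100  # tokens with this label are skipped by CrossEntropyLoss

def mask_targets(input_ids: torch.Tensor, assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Build training labels: copy input_ids, then hide everything that is not
    part of an assistant response so the loss is only computed on model replies.

    `assistant_spans` is a hypothetical list of (start, end) token offsets
    marking the assistant turns; the real code derives these from its
    conversation templates."""
    labels = torch.full_like(input_ids, IGNORE_TOKEN_ID)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```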

The …/train_baichuan.py file applies conversation templates, tokenizes data in parallel for efficiency, and handles question-answering training via masking. The …/train_flant5.py file forms question-answer pairs from conversations before training.

The …/train_lora.py and …/train_lora_t5.py files apply LoRA and optional quantization to fine-tune models like LLaMA and T5 with a small number of trainable adapter parameters.

Preprocessing Data

References: fastchat/data

The main scripts for preprocessing conversation data are located in the …/data directory. These scripts clean raw conversation logs, convert between data formats, filter and select subsets, and analyze the data. The overall goal is to prepare datasets for training chatbots.

Key preprocessing steps include:

  • Cleaning noise and markup from raw conversation logs

  • Converting between dataset formats

  • Splitting long conversations into shorter samples

  • Filtering out malformed or unwanted conversations

  • Creating train/test splits

The scripts can be run sequentially with …/prepare_all.py or individually. This fully processes raw data into clean datasets for training.
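As a rough illustration of how the pipeline is orchestrated, the sketch below runs a few of the data scripts as subprocesses, mirroring the pattern used by …/prepare_all.py; the exact script flags shown are illustrative and may be incomplete.

```python
import subprocess

# Hypothetical subset of the pipeline; the real command list is built inside
# fastchat/data/prepare_all.py from command line arguments.
commands = [
    "python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json",
    "python3 -m fastchat.data.split_long_conversation --in sharegpt_clean.json --out sharegpt_split.json",
    "python3 -m fastchat.data.split_train_test --in sharegpt_split.json",
]

for cmd in commands:
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        raise RuntimeError(f"Pipeline step failed: {cmd}")
```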

Training Workflows

References: fastchat/train, fastchat/train/train.py

The training workflows implemented in FastChat provide end-to-end pipelines for training conversational models. Key files that define training workflows include:

  • …/train.py contains the core training logic, applying conversation templates from …/conversation.py, tokenizing the data, masking targets, and creating PyTorch datasets for training. It loads a pretrained causal language model and runs the training process.

  • …/train_baichuan.py implements a similar workflow, preprocessing conversations by adding speaker tags and tokenizing into a single sequence.

  • …/train_flant5.py also preprocesses conversations by adding speaker tags and tokenizing.

  • …/train_lora.py and …/train_lora_t5.py apply LoRA and optional QLoRA quantization to models before training for increased efficiency.

The key components of these workflows are:

  • Preprocessing, implemented in …/train_flant5.py, which takes raw conversations and applies steps like tokenization and speaker tagging.

  • Dataset loading, also in …/train_flant5.py, which wraps the preprocessed data in PyTorch datasets.

  • The trainer, initialized in files like …/train_flant5.py, which sets up the model, loads the prepared datasets, and runs the training loop.

These training workflows provide full end-to-end pipelines for applying techniques like LoRA and preprocessing data into supervised learning tasks to train conversational models at scale.

Attention Mechanisms

References: fastchat/train/llama_flash_attn_monkey_patch.py, fastchat/train/llama_xformers_attn_monkey_patch.py

The files …/llama_flash_attn_monkey_patch.py and …/llama_xformers_attn_monkey_patch.py replace the standard attention mechanism in LLaMA models with more efficient alternatives. Both files directly replace methods on the LLaMA attention classes to modify how attention is calculated.

The …/llama_flash_attn_monkey_patch.py file defines a method that handles projecting hidden states into query, key, and value vectors. It applies positional embeddings and concatenates past states if provided. The states are stacked and transposed into the required format before passing to the attention calculation.

The …/llama_xformers_attn_monkey_patch.py file patches the LLaMA attention module by overwriting its forward method with a function defined in the file. This function implements attention using XFormers optimizations. It projects hidden states, applies position embeddings, and caches key and value states if they are passed in.

The main attention calculation branches on whether attention weights need to be returned. When they are not, it uses an optimized function to compute attention in a memory-efficient way, passing a mask if provided. Otherwise, it multiplies query and key states directly, applies the mask, and multiplies with the value states to compute the output.

Both files improve efficiency in LLaMA models by replacing the standard attention calculation with more optimized implementations.
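As a simplified sketch of the monkey-patching technique (omitting the rotary position embeddings and key/value caching that the real patches handle), one could overwrite the LLaMA attention forward method with a function that delegates to XFormers' memory-efficient kernel:

```python
import xformers.ops as xops
from transformers.models.llama import modeling_llama

def _xformers_forward(self, hidden_states, attention_mask=None, **kwargs):
    # Project hidden states to query/key/value and reshape to (batch, seq, heads, head_dim).
    bsz, q_len, _ = hidden_states.size()
    q = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    k = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    v = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    # Memory-efficient causal attention: never materializes the full seq x seq matrix.
    out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
    out = self.o_proj(out.reshape(bsz, q_len, -1))
    return out, None, None  # (output, attention weights, past key/value)

def replace_llama_attn_with_xformers_attn():
    # Monkey patch: every LlamaAttention instance now uses the function above.
    modeling_llama.LlamaAttention.forward = _xformers_forward
```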

Memory Optimization

References: fastchat/train/train_lora.py, fastchat/train/train_lora_t5.py

The …/train_lora.py and …/train_lora_t5.py files implement training workflows that apply LoRA so that only a small set of adapter parameters has to be trained and stored. They take command line arguments to specify LoRA hyperparameters such as rank, dropout, and target modules.

Instead of updating full weight matrices, LoRA freezes the pretrained weights and injects trainable low-rank update matrices into the targeted linear layers. Optional quantization further reduces memory by loading the frozen base weights in reduced precision.

The files prepare the training data and model, then initialize a trainer class and train the model with LoRA applied. After training, they save the final adapter state dictionary. For distributed training, parameters must be detached, cloned, and moved to CPU before saving.

Specialized Training

References: fastchat/train/train_baichuan.py, fastchat/train/train_flant5.py

The files …/train_baichuan.py and …/train_flant5.py contain code for specialized training workflows optimized for models like Flan-T5 and Baichuan.

…/train_baichuan.py implements supervised conversational modeling using conversation data in JSON/JSONL format. It preprocesses the data by applying conversation templates and tokenizing with a pretrained tokenizer. It defines dataclasses to configure the model, data, and training parameters. The main function initializes the model, tokenizer, and performs the supervised training, saving the final trained model.

…/train_flant5.py also implements supervised conversational modeling. It defines classes to configure the model, data, and training parameters. The dataset class loads preprocessed data and creates PyTorch datasets. The collator collates question-answer pairs into batches with padding and attention masks. The trainer class is initialized to train the model end-to-end, resuming from checkpoints. It saves the final trained model.
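A minimal sketch of such a collator, assuming each dataset item already carries tokenized "input_ids" and "labels" lists (field names are illustrative), is shown below; it pads variable-length pairs into rectangular batches and builds attention masks.

```python
import torch
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class QADataCollator:
    """Pad variable-length question/answer token lists into batches.
    Illustrative only; the real collator in train_flant5.py has its own fields."""
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        input_ids = [torch.tensor(f["input_ids"]) for f in features]
        labels = [torch.tensor(f["labels"]) for f in features]
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=-100)  # -100 is ignored by the loss
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": input_ids.ne(self.tokenizer.pad_token_id),
        }
```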

Serving Models

References: fastchat/serve

The core functionality of serving trained models involves loading models from checkpoints, handling requests to generate responses via APIs, and presenting interactive interfaces for users. This is implemented through a distributed system with …/controller.py, worker, and API components.

The …/controller.py file defines a class that manages registration and dispatching of AI workers. Workers are defined by classes like …/model_worker.py which handle loading models, preprocessing inputs, executing models to generate outputs, and returning predictions.

The FastAPI apps defined in worker files expose APIs for generation and status queries. Request handlers acquire semaphores using …/base_model_worker.py to synchronize access. Specific workers are implemented, such as models loaded via …/huggingface_api.py.

Scripts like …/launch_all_serve.py automate launching the components, starting the controller and workers as subprocesses while setting environment variables. This ensures correct startup of the distributed serving system.

Serving Infrastructure

References: fastchat/serve, fastchat/serve/controller.py, fastchat/serve/model_worker.py

The FastChat serving infrastructure implements a distributed system for serving models through controllers, workers, and APIs. The …/controller.py file defines a class that manages distributed AI workers. It keeps track of registered workers in a dictionary and can dispatch requests to workers using either a "lottery" or "shortest queue" method.
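A condensed sketch of those two dispatch strategies, assuming a hypothetical worker-info dict with "speed" and "queue_length" fields:

```python
import random

def dispatch(workers: dict, method: str = "shortest_queue") -> str:
    """Pick a worker address. `workers` maps address -> info dict; the field
    names here are illustrative stand-ins for the controller's worker status."""
    if method == "lottery":
        # Weighted random choice, proportional to each worker's reported speed.
        addrs = list(workers)
        weights = [workers[a]["speed"] for a in addrs]
        return random.choices(addrs, weights=weights, k=1)[0]
    # Shortest queue, normalized by speed, so faster workers absorb more load.
    return min(workers, key=lambda a: workers[a]["queue_length"] / workers[a]["speed"])
```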

The class runs a heartbeat thread to periodically check workers and remove any that have stopped responding. It provides APIs for workers to register, get status updates, and receive requests via FastAPI endpoints defined in the file. This allows clients to get models, worker addresses, and streams from the controller.

Workers are implemented by subclasses of classes defined in …/base_model_worker.py. This handles model loading and execution, registering with the controller, sending heartbeats, and acquiring/releasing semaphores for concurrency control. Specific worker implementations like those in …/huggingface_api_worker.py override methods to execute models through different frameworks.

The FastAPI apps defined in worker files expose APIs for tasks such as generation, embedding retrieval, and status checks through functions. Request handlers acquire semaphores to synchronize access to workers. This provides a standardized way to load and run models across different frameworks.

Model APIs

References: fastchat/serve/huggingface_api_worker.py

The worker class handles API calls to externally hosted models. It overrides the base worker's generation methods; these methods call the remote API through an inference client to generate or stream responses.

The constructor takes configuration parameters such as the model and API endpoint. The request-handling methods set generation parameters, yield streamed responses, and return error information when a call fails.

Workers are initialized by loading a configuration file, and an instance is created for each model, storing its attributes.

The FastAPI app defines endpoints, including a generation endpoint that acquires a lock for the worker to limit concurrency. The endpoint calls the generation method and returns the response, and background tasks release the lock after the response completes.
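A minimal sketch of that semaphore-limited endpoint pattern, assuming a hypothetical generate() coroutine and an illustrative concurrency limit:

```python
import asyncio
from fastapi import FastAPI, Request, BackgroundTasks
from fastapi.responses import JSONResponse

app = FastAPI()
semaphore = asyncio.Semaphore(5)  # assumption: the worker allows 5 concurrent requests

async def generate(params: dict) -> dict:
    # Placeholder for the actual call to the remote inference API.
    return {"text": "...", "error_code": 0}

@app.post("/worker_generate")
async def worker_generate(request: Request, background_tasks: BackgroundTasks):
    params = await request.json()
    await semaphore.acquire()                     # block while too many requests are in flight
    output = await generate(params)
    background_tasks.add_task(semaphore.release)  # free the slot once the response is sent
    return JSONResponse(output)
```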

Load Balancing

References: fastchat/serve/gateway

The Nginx gateway balances load across multiple Gradio chatbot servers. It uses Nginx's reverse proxy functionality to route incoming requests among backends based on their configured weights. This provides scalability by allowing additional Gradio servers to be added easily as traffic increases.

The key aspects of the load balancing functionality include:

  • The Nginx configuration file /etc/nginx/nginx.conf controls the gateway behavior. It defines a listen port that receives all incoming requests.

  • The backends section configures the backends - in this case the Gradio chatbot servers. Each backend entry specifies a server name, IP/host, port, and weight.

  • Weights determine the percentage of requests that will be routed to each backend. For example, a backend with weight "3" will receive 3 times as many requests as one with weight "1".

  • Nginx hashes requests so that the same client is consistently mapped to the same backend, which improves cache hit rates and keeps a session on the server that handled its previous requests when that server is available.

  • Requests are forwarded to the backends defined in the backends section.

  • Admins can modify the backends to add, remove, or change weights of Gradio servers without code changes or restarts, enabling dynamic scaling.

Monitoring

References: fastchat/serve/monitor

This section details the tools and code used for monitoring models, conversations, and battles from the FastChat chatbot serving system. Key functionality includes tracking model and user performance over time, analyzing conversation logs, and visualizing metrics in a dashboard.

The …/monitor directory contains the main monitoring code. It includes tools for cleaning and processing log data from FastChat servers stored in JSON format. Cleaning functions convert raw logs into a usable format, standardizing fields and removing invalid records.

Statistics about events, users, and models are calculated from the cleaned logs. The …/basic_stats.py file contains functions for aggregating counts by timeframe and merging results. Visualizations of trends over time are generated.

Model performance is tracked using the Elo rating algorithm. It computes ratings after each battle and performs a bootstrap analysis to estimate uncertainty. Leaderboards and heatmaps of win rates are created.
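The rating computation follows the standard Elo update rule; a generic single-battle update (the monitor code additionally bootstraps over battle orderings to estimate uncertainty) looks roughly like this:

```python
def update_elo(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """One Elo update for a battle between model A and model B.
    `winner` is "A", "B", or "tie"; K controls how fast ratings move."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```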

Conversation logs are loaded and filtered and can be intersected between files. The …/clean_battle_data.py file analyzes user prompts and tags conversations.

User messages from logs are clustered by topic. Representative texts for each cluster are identified and summaries are generated.

The monitoring dashboard loads components generated by the above code using …/monitor.py. It runs a background thread to periodically update live data from logs.

Web Interfaces

References: fastchat/serve/gradio_web_server.py

The gradio web server handles routing conversations through models hosted on the backend. A class stores information about each conversation like the ID, messages, and model name.

The /chat route takes user input, runs it through filtering, and appends it to the conversation state object before returning.

The key function takes the conversation state and messages, determines the appropriate API to call based on the model name, calls the API to get a response, appends it to the conversation, and returns the updated state.

The APIs are implemented in separate files like …/huggingface_api_worker.py. This allows hosting different model types by mapping model names to the correct file. The response is returned to the frontend to continue the conversation.

Automation

References: fastchat/serve/launch_all_serve.py

The …/launch_all_serve.py script automates launching a full FastChat server configuration from the command line. It handles initializing and starting all required processes in the correct order.

The script first parses any command line arguments to configure options like the controller address and model paths. It then launches the controller process by executing a shell command with the arguments.

For each model path, a worker is launched by extracting the host/port from the path. This is used to set environment variables and insert arguments into the launch command string. Each worker is launched in its own process using its customized command.

After all workers start, the OpenAI API server is launched similarly. Functions are defined to launch each component and check logs to ensure successful startup before exiting.

Environment variables can be set on the controller and worker processes based on arguments. This provides an easy, automated way to bootstrap a distributed FastChat server from a single script call, without needing to manually configure and start each individual process.
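A rough sketch of that launch pattern, starting the controller, a model worker, and the OpenAI API server as subprocesses; the module names are FastChat's serving entry points, while the ports, model path, and the fixed sleep are illustrative (the real script checks the log files for a ready message instead).

```python
import subprocess
import time

def launch(cmd: str, log_path: str) -> subprocess.Popen:
    """Start one component as a subprocess, redirecting its output to a log file."""
    log = open(log_path, "w")
    proc = subprocess.Popen(cmd, shell=True, stdout=log, stderr=subprocess.STDOUT)
    time.sleep(10)  # crude wait for the process to come up
    return proc

controller = launch(
    "python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001",
    "controller.log")
worker = launch(
    "python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 "
    "--controller-address http://localhost:21001 "
    "--worker-address http://localhost:31000 --port 31000",
    "model_worker.log")
api_server = launch(
    "python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000",
    "openai_api_server.log")
```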

Dataset Release

References: fastchat/serve/monitor/dataset_release_scripts

The scripts in …/dataset_release_scripts handle processing and releasing raw conversation datasets. This involves several steps:

First, scripts like /filter_bad_conv.py clean the data by removing sensitive content and formatting fields.

Next, scripts such as /merge_oai_tag.py and /final_post_processing.py enrich the data by joining fields from other sources and removing unnecessary fields.

Analysis is also performed, with /compute_stats.py generating statistics, counts, lengths and visualizations.

Once fully processed, the clean datasets can be uploaded to repositories with /upload_hf_dataset.py. This loads the conversations, converts them to objects, and pushes them to repositories.

The full processing pipeline is automated via scripts like /process_all.sh which runs the Python scripts in order on the raw data stored in environment variables. Instructions in /instructions.md explain how to execute the end-to-end workflow.

Model Compression

References: fastchat/model

Model compression techniques in FastChat aim to reduce model size and memory usage. The …/compression.py file implements quantization to compress models. Quantization functions in this file compress and decompress weights during loading and inference.

The …/model_adapter.py file handles loading adapters. It defines a base class for loading models. Model-specific subclasses customize adapter loading.

The …/apply_lora.py file applies a LoRA adapter to a base model. It loads the base model and LoRA adapter, then combines them into a new merged model.

Quantization

References: fastchat/model/compression.py

The file …/compression.py implements model compression functionality for FastChat models through quantization. Quantization compresses models by reducing the number of bits used to represent each weight, allowing the weights to be stored with less memory.

The code defines a way to configure parameters for group-wise quantization like the number of bits and group size. It contains functions for compressing and decompressing weights during loading and inference. A function performs the inverse operation of compressing weights during inference, while another loads a model configuration and tokenizer, then loads and compresses the weights based on the names of linear layers. It returns the compressed model and tokenizer.

A class forwards calls to the underlying decompressed weights during inference, allowing the model to use the decompressed values while storing the compressed values. This avoids permanently storing weights in their compressed form.

The key aspects of quantization implemented are:

  • Group-wise quantization of weights to shared values, reducing the number of bits needed to represent each weight value. This provides the bulk of the compression.

  • On-the-fly decompression of weights during model load and inference rather than permanently storing weights in compressed form. This allows using the standard model while optimizing storage size.
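A toy illustration of group-wise quantization and its inverse, independent of the actual defaults in …/compression.py: each group of weights is stored as low-bit integers plus a per-group scale and minimum.

```python
import torch

def compress_groupwise(w: torch.Tensor, num_bits: int = 8, group_size: int = 256):
    """Flatten the weight, split it into groups, and store each group as
    integers plus a per-group scale and minimum (illustrative only)."""
    flat = w.flatten().float()
    pad = (-flat.numel()) % group_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, group_size)
    mn = groups.min(dim=1, keepdim=True).values
    mx = groups.max(dim=1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / (2 ** num_bits - 1)
    q = ((groups - mn) / scale).round().to(torch.uint8)
    return q, scale, mn, w.shape, pad

def decompress_groupwise(q, scale, mn, shape, pad):
    """Inverse operation: reconstruct approximate float weights from the groups."""
    groups = q.float() * scale + mn
    flat = groups.flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)
```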

Pruning

References: fastchat/model/compression.py

The …/compression.py file implements model compression techniques. It contains functions for compressing and decompressing model weights during loading and inference.

The key functions are:

  • decompress(): performs the inverse operation, reconstructing weights from their compressed form during inference.

  • load_compress_model(): the main loading function. It loads a model config and tokenizer, then loads and compresses the weights based on the names of linear layers.

It handles loading compressed models, checking for a ".pruned" file extension to identify pruned models. These models have sparse weight matrices because removed weights are zeroed out, and the loading function handles sparse weight matrices efficiently during inference.

Knowledge Distillation

References: FastChat

In FastChat, knowledge distillation functionality is implemented in the …/compression.py file. This file contains methods for distilling knowledge from a teacher model to a student model.

The main distillation process is handled by loading the teacher model from a checkpoint. The teacher's weights are frozen so it is not updated during training. A student model is then trained to match the teacher's outputs or activations on a dataset.

For output-based distillation, the loss is calculated as the Kullback-Leibler divergence between the teacher and student model outputs on a batch of data.

For activation-based distillation, the loss is calculated as the mean squared error between the teacher and student hidden state activations for each layer.

The calculated loss is used to optimize the student model weights through backpropagation. This process continues for multiple epochs to train the student model.
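As a generic illustration of the output-based loss described above (not tied to any specific FastChat code), the KL-divergence distillation objective can be written as:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Output-based distillation: KL divergence between the teacher's and the
    student's softened output distributions on the same batch."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 as is conventional for distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```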

The key advantage is transferring knowledge from a large teacher to a more efficient student model, allowing it to achieve high performance using fewer parameters. In FastChat, distillation methods support output-based and activation-based approaches through modular loss computation.

Parameter Sharing

References: fastchat/model/compression.py

Sharing parameters across layers reduces model size by tying the weights of different layers together. The file …/compression.py implements parameter sharing during model compression.

It contains functions for compressing and decompressing model weights during loading and inference. The key function is load_compress_model(), the main loading routine: it loads a model config and tokenizer, then loads and compresses the weights based on the names of linear layers, returning the compressed model and tokenizer.

By tying weights between layers, the total number of unique parameters is reduced. This allows compressing models further while retaining most of the model capacity. The Model Compression section covers other techniques used in combination with parameter sharing to compress FastChat models.

Low-Rank Approximation

References: fastchat/model/compression.py

The …/compression.py file implements low-rank approximation to reduce model size. Low-rank approximation works by decomposing weight matrices into a product of two smaller matrices.

The …/compression.py file contains functionality for compressing model weights during loading. It implements low-rank approximation using singular value decomposition (SVD) to factorize weight matrices. SVD decomposes each matrix W into three smaller matrices U, S, V such that W ≈ USV^T. Only the top k singular values and their corresponding left and right singular vectors are retained, approximating W with a rank k matrix.

The file handles loading and compressing models. It first loads the model configuration and tokenizer. It then iterates over the linear layers, applies SVD to decompose their weight matrices, and replaces the original weights. This compresses the model in-place during loading.
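A generic sketch of rank-k truncation via SVD, which is the standard way to realize this decomposition (illustrative, not the exact code path):

```python
import torch

def low_rank_approx(w: torch.Tensor, rank: int):
    """Rank-k approximation of a weight matrix W (m x n) via truncated SVD:
    W is replaced by (U_k * S_k) @ V_k^T, storing k*(m + n) numbers instead of m*n."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    u_k = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    v_k = vh[:rank, :]
    return u_k, v_k                # approximate W with u_k @ v_k

# usage: a_k, b_k = low_rank_approx(layer.weight.data, rank=64)
#        w_approx = a_k @ b_k
```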

Model Adapters

References: fastchat/model/model_adapter.py

The …/model_adapter.py file contains code for loading pretrained models and adding task-specific adapters on top. It handles models whose paths contain "peft" by first loading the base model, then loading the PEFT adapter weights on top, allowing the base model weights to be shared between PEFT models. Adapter classes are registered in priority order, with earlier ones taking precedence.

Efficient Attention

References: fastchat/train/llama_flash_attn_monkey_patch.py, fastchat/train/llama_xformers_attn_monkey_patch.py

The files …/llama_flash_attn_monkey_patch.py and …/llama_xformers_attn_monkey_patch.py implement efficient attention mechanisms to reduce compute during training. Both files directly replace the standard attention calculation in LLaMA models.

The …/llama_xformers_attn_monkey_patch.py file patches the LLaMA attention module to replace its attention calculation with XFormers attention. It overwrites the attention method by directly replacing the method of the attention class from the LLaMA modeling code with a new function defined in the file.

This function implements the attention calculation using XFormers optimizations such as projecting the hidden states into query, key, and value vectors. It applies position embeddings to the query and key states, and handles caching past key and value states if provided.

The main attention calculation branches on whether attention weights need to be returned. When they do not, it uses a memory-efficient function to compute attention, passing either an attention mask or no mask. When they do, it calculates attention weights directly as a multiplication between query and key states, applies the mask, and computes the output as a multiplication with the value states.

The …/llama_flash_attn_monkey_patch.py file similarly replaces the standard attention with an optimized calculation. It defines a method that transforms the hidden states into the required format, applying positional embeddings and concatenating past states. It projects the states and transposes them before passing to the attention calculation function.

Both files aim to reduce compute costs during training by directly replacing the standard attention calculation with more efficient implementations. This allows optimizing models like LLaMA for scale.

LoRA

References: fastchat/train/train_lora.py, fastchat/train/train_lora_t5.py

LoRA reduces the cost of fine-tuning by freezing the pretrained weights and injecting small trainable low-rank update matrices into selected layers. The FastChat code implements LoRA for efficient training of large language models.

The …/train_lora.py file trains LLaMA-family models using LoRA. It loads a pretrained model, builds a LoRA configuration (rank, scaling factor, dropout, and target modules), and wraps the model so that only the adapter parameters remain trainable. Optional quantization loads the frozen base weights in reduced precision to save memory. The code then makes a supervised data module for training and initializes a Trainer to train the model. After training, it saves the final adapter state dictionary.

The …/train_lora_t5.py file applies the same approach to T5 models. It handles applying LoRA whether FSDP or DeepSpeed ZeRO-3 is used for distributed training, prepares the training data and model, initializes a Trainer, and trains the model, saving the final state dictionary at the end.

Both files specify LoRA hyperparameters like rank, dropout, and target modules in a configuration dataclass. They detach, clone, and move parameters to CPU if using ZeRO-3 so the partitioned parameters are gathered properly, and they handle retrieving the LoRA adapter state dictionary after training.
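A minimal sketch of this setup using the peft library, assuming LLaMA-style attention projection names as the target modules (the model name and hyperparameter values are placeholders):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # placeholder model

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which linear layers receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # base weights frozen, adapters trainable
model.print_trainable_parameters()
```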

Evaluating Models

References: fastchat/llm_judge, tests

The core functionality for evaluating models is contained in the …/llm_judge directory. This directory contains tools for analyzing model performance on conversational tasks through automated benchmarking and interactive evaluation.

The …/gen_judgment.py module implements generating judgments by comparing model responses. It creates matches between responses and loads data to pass to the judge model. This allows programmatically evaluating models by generating judgments on their responses.

The …/common.py module contains functions for loading necessary data.

The …/qa_browser.py module builds an interactive browser for viewing questions, model responses, and judgments. It loads necessary data and defines functions to display it in a Gradio UI, allowing manual inspection of model outputs.

The tests directory contains tests for automatically evaluating models on datasets through scripts. This allows benchmarking models programmatically and tracking performance over time.

Interactive Evaluation

References: fastchat/llm_judge/qa_browser.py

The …/qa_browser.py file provides an interactive browser for manually inspecting question-answer pairs from evaluation datasets along with model judgments on the answers. At its core, the browser loads evaluation data from files and converts the data into Markdown format for display in Gradio widgets.

This allows a user to manually step through question-answer pairs, view all the models' responses to a question, and inspect the provided judgments - gaining valuable insights into model performance beyond just aggregate metrics.

Automated Benchmarking

References: tests

The tests directory contains automated tests for programmatically evaluating models on datasets. The file …/test_cli.py tests different functionality of the command line interface, including running models with single and multiple GPUs as well as 8-bit quantization.

The key functionality is contained in classes and functions. A function is used to execute CLI commands as strings and check for errors. Tests pass arguments to the CLI via Python modules to configure options like multiple GPU usage or loading quantized models.

The file …/test_openai_api.py contains tests for the OpenAI API compatible server. It runs functions like test_list_models(), test_completion(), and test_chat_completion(), which make API calls against the server's endpoints. These functions loop through the models returned by test_list_models() and run every test on each model. test_completion_stream() and test_chat_completion_stream() exercise streaming by iterating over response chunks, and test_openai_curl() calls the APIs directly using curl commands.
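For reference, the same endpoints can be exercised by pointing the (pre-1.0) openai Python SDK at the local FastChat server; the address and model name below are assumptions that depend on how the server and workers were launched.

```python
import openai

# Point the OpenAI SDK at the local FastChat OpenAI-compatible server.
openai.api_key = "EMPTY"                      # the local server does not check the key
openai.api_base = "http://localhost:8000/v1"  # assumed API server address

model = "vicuna-7b-v1.5"                      # any model name returned by the list endpoint

completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
print(completion.choices[0].message.content)
```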

Model Inference

References: fastchat/model, fastchat/modules

The FastChat codebase contains functionality for loading various pretrained models and running inference with them. This is implemented primarily through model-specific loading code in the …/modules directory and common inference logic in …/model.

The …/modules directory contains code for loading different model types including quantized models in …/awq.py, Exllama models in …/exllama.py, GPTQ quantized models using code in …/gptq.py, and XFasterTransformer models in …/xfastertransformer.py.

For example, …/exllama.py contains functionality for loading Exllama models.

Common inference functionality is defined in …/model. Functions contain the core generation logic.

The …/model_adapter.py module plays an important role in loading models. It defines model-specific subclasses that customize loading.

In summary, the FastChat code implements a modular approach to model loading and inference through model-specific loading code and common interfaces in …/model for generation.

Loading Models

References: fastchat/model/__init__.py, fastchat/modules

The …/__init__.py file provides a simple interface for loading models. It imports functionality from …/model_adapter.py, which handles the actual loading process.

Model loading is further abstracted by subclasses defined in files under …/modules. For example, …/exllama.py defines a class that encapsulates an Exllama model, cache, and tokenizer. The function handles loading these components based on configuration in a class, which specifies properties like maximum sequence length.

Similarly, …/gptq.py defines a class to configure quantization settings, and a loading function loads a GPTQ-quantized model, importing the GPTQ library and locating the checkpoint file.

Other files provide analogous classes and functions for their respective model types, such as AWQ-quantized models and XFasterTransformer models. By abstracting model loading in this way, different model types can be loaded through a common interface while handling type-specific loading procedures.

Running Inference

References: fastchat/model/model_chatglm.py

This section discusses the core code for running model inference and generating responses. The file …/model_chatglm.py contains the key functionality. It includes handling generation when invalid log probabilities occur.

The code checks for invalid (NaN or infinite) scores during response generation, resets them to a default, and assigns a high score to a designated token so the model avoids generating invalid sequences. This handling is applied throughout the generation process.

A function initializes generation with a given prompt. It then yields responses by calling the model's generation method. Each response is processed to clean punctuation before returning. It tracks token usage and returns finish information with the last response. The generation parameters include options like temperature, repetition penalty, and max length. This allows streaming generation while processing responses.

Model Adapters

References: fastchat/model/model_adapter.py

The …/model_adapter.py file handles customizing model loading behavior for different model families using adapters. It defines a base adapter class for loading models, and subclasses can override its functions.

For PEFT models, the adapter first loads the base model, then loads the PEFT adapter weights on top, allowing the base model weights to be shared between PEFT models.

Adapter classes are registered in priority order, with earlier ones taking precedence. A lookup method chooses the correct subclass based on the model path, and the loading function uses the chosen adapter to load models and tokenizers.

PEFT loading first checks for a cached base model, otherwise loading it fresh. It then loads the PEFT adapter weights, using the model path as the adapter name. This allows each PEFT model to have its own task-specific weights loaded on top of the shared base model.

Compressing Models

References: fastchat/model/compression.py

This section details techniques for compressing models implemented in the …/compression.py file. The file implements model compression functionality for FastChat models. It contains functions for compressing and decompressing model weights during loading and inference.

Group-wise quantization reduces model size by quantizing weights to use fewer bits. Weights are grouped and each group is quantized to the same number of bits. This is configured using a dataclass to specify quantization parameters.

Key pieces include:

  • A wrapped linear layer class forwards calls to the underlying decompressed weights during inference, allowing compressed models to be used normally without storing compressed weights permanently.

Modifying Models

References: fastchat/model/apply_delta.py, fastchat/model/make_delta.py

The …/apply_delta.py file contains functions for applying changes or deltas to base models to create target models.

The …/make_delta.py file contains a function to calculate the delta weights between a base model and a target model. It loads the base and target models using functions. It then iterates through the target model dictionary, calculating the difference between corresponding parameters in the base and target models using subtraction operators. This delta is then saved.

These files allow modifications to models in several ways. The …/apply_delta.py file can apply a pre-calculated delta to a base model, creating a new target model. The …/make_delta.py file calculates the delta directly. Both support loading very large models via batch processing functions. The delta format provides a standardized way to represent and apply small focused changes to models.

Model Metadata

References: fastchat/model/model_registry.py

The …/model_registry.py file contains a centralized registry for tracking metadata about models. It defines a named tuple to store the name, link, and description for each model. The file provides a function for registering new models by constructing an instance and adding it to a mapping of full names to objects. It also contains a function to retrieve an object by name from the registry.

By populating the registry through calls to register, all model info is organized in one place. Storing objects by full name allows flexible retrieval. This centralizes all model metadata while providing APIs for working with registered models. It is critical for any code interacting with different models, providing a single source of truth for metadata.
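A minimal sketch of such a registry (field, function, and example values are illustrative):

```python
from collections import namedtuple

ModelInfo = namedtuple("ModelInfo", ["simple_name", "link", "description"])

model_info: dict[str, ModelInfo] = {}

def register_model_info(full_names, simple_name, link, description):
    """Register one ModelInfo under every full name it is known by."""
    info = ModelInfo(simple_name, link, description)
    for name in full_names:
        model_info[name] = info

def get_model_info(name: str) -> ModelInfo:
    """Look up a model by full name, falling back to a stub for unknown models."""
    return model_info.get(name, ModelInfo(name, "", "Unknown model"))

register_model_info(
    ["vicuna-7b-v1.5", "vicuna-13b-v1.5"],
    "Vicuna",
    "https://lmsys.org/blog/2023-03-30-vicuna/",
    "A chat assistant fine-tuned on user-shared conversations",
)
```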

Processing Conversation Data

References: fastchat/data

This section details the Python scripts used to process raw conversation data into clean datasets suitable for training conversational models. The main steps include:

  • Cleaning raw conversation data to remove noise and format the text consistently. This is done in scripts like …/clean_sharegpt.py which cleans HTML data and converts it to Markdown format.

  • Files like …/prepare_all.py orchestrate running the full preprocessing pipeline by defining a list of commands to run various cleaning, filtering and splitting scripts.

  • Splitting long conversations into shorter samples of around 400 tokens is done in …/split_long_conversation.py. This splits conversations into new samples whenever the cumulative length exceeds the threshold.

  • Statistics about the data like length distributions are calculated in …/get_stats.py. This computes metrics in parallel and prints histograms to analyze the data distribution.

  • Filtering data for the correct format is implemented in …/filter_wrong_format.py using regular expressions defined to robustly match invalid patterns without hardcoding specific cases.

  • Loading, filtering and writing JSON conversation data uses consistent interfaces for modularity. Common functions handle tasks like loading from and dumping to files to reduce duplicative code.

Cleaning Data

References: fastchat/data/clean_sharegpt.py

The …/clean_sharegpt.py file contains code to clean HTML conversation data from platforms like ShareGPT and convert it to Markdown format for model training. This cleaning process removes unnecessary HTML tags and formatting from the raw data, standardizes the format to Markdown, and filters out low-quality samples based on criteria like alternating between 'human' and 'gpt' responses, not containing blocked words, and successfully converting to Markdown without errors.

The file runs the cleaning pipeline, loading the raw HTML data, running cleaning functions, and dumping the cleaned Markdown output. It handles converting code blocks from the original ShareGPT HTML format (a language label followed by a "Copy code" button) into standard fenced Markdown code blocks. Samples are filtered based on quality criteria, returning an error code to indicate why a sample was filtered out.

The cleaning process standardizes the data into a uniform Markdown format so it can be easily used for model training. It also removes noisy or low-quality samples that could hurt model performance.

Converting Datasets

References: fastchat/data/convert_alpaca.py

This section covers converting between different conversation data formats like Alpaca and ShareGPT. The file …/convert_alpaca.py handles converting data from the Alpaca format to the ShareGPT format. It takes an Alpaca JSON file as input and loops through each example, formatting it before appending it to a list. The list is then written out to a new JSON file in the ShareGPT format.

The file contains no classes or functions, instead using a main block to handle the conversion process. It first parses command line arguments to get the input and output file paths. Then it loads the Alpaca data from the input file. Each example is looped through and formatted before being appended to the list. After printing the number of examples, the list is written to the output file. This direct conversion approach processes each Alpaca example without any additional cleaning or filtering.

Filtering Data

References: fastchat/data/extract_gpt4_only.py

This section focuses on extracting subsets of conversations from raw chat logs. The …/extract_gpt4_only.py file filters conversations to only those generated by GPT-4. It takes command line arguments specifying the input and output files. The main steps are:

  1. Load the JSON from the input file
  2. Iterate through each conversation
  3. Check the "model" field to see if it matches "GPT-4"
  4. Append matching conversations to a new list
  5. Write the filtered conversations to the output file

The script filters the conversations using several key pieces of logic:

  • It loads the full JSON chat log data from the input file.

  • It then iterates through each conversation.

  • Inside the loop, it checks the "model" field of each conversation to see if it equals "GPT-4".

  • If it matches, it appends the conversation to a new list.

  • After filtering all conversations, it writes the filtered list to the output JSON file.

This allows extracting only a subset of conversations for focused analysis or training, by filtering on the field that identifies which model generated each conversation.

Inspecting Data

References: fastchat/data/inspect_data.py

This section details functionality in …/inspect_data.py for debugging conversation data samples during the data processing pipeline. The file contains code to parse command line arguments, load JSON data from a file path, iterate over selected sample indices, and print information about each sample including the index, ID, and conversations. The conversations are printed one by one with pauses between each to allow for manual inspection. This provides a simple interface to visualize samples from the dataset and ensure they are as expected before continuing with the pipeline. Key functionality includes:

  • Code to parse arguments passed via the command line.

  • Code to load the JSON data from the specified file path.

  • A for loop to iterate over the selected sample indices.

  • Code to print information about each sample such as the index and conversations.

  • Code to display the conversations and pause between each to allow inspection.

This utility file implements the important task of manually debugging raw conversation samples to catch any errors or unexpected patterns before training models. The printing and pause functionality allow verifying data quality through visual inspection, while selecting samples via indices facilitates focused debugging.

Splitting Data

References: fastchat/data/split_long_conversation.py, fastchat/data/split_train_test.py

This section details code for dividing long conversations into multiple samples and creating training and test datasets from raw conversation data.

The file …/split_long_conversation.py contains logic for splitting long conversations. A helper constructs a new sample object from an existing one, slicing out a subset of the conversations. A function post-processes the results to filter out any samples that don't properly alternate roles in each conversation.

The file …/split_train_test.py contains code to split a dataset into training and test sets. It parses command line arguments to specify the input file path and split ratio, loads the dataset from the input JSON file, shuffles it with a fixed seed, calculates the split index from the split ratio and dataset length, splits the shuffled data into train and test sets, writes the two sets to new JSON files, and prints their sizes.
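A compact sketch of that train/test split logic (flag names and the output naming scheme are illustrative):

```python
import argparse
import json
import random

parser = argparse.ArgumentParser()
parser.add_argument("--in-file", type=str, required=True)
parser.add_argument("--ratio", type=float, default=0.9)
args = parser.parse_args()

with open(args.in_file) as f:
    data = json.load(f)

random.seed(42)        # fixed seed so the split is reproducible
random.shuffle(data)

split = int(len(data) * args.ratio)
train, test = data[:split], data[split:]

with open(args.in_file.replace(".json", "_train.json"), "w") as f:
    json.dump(train, f, indent=2)
with open(args.in_file.replace(".json", "_test.json"), "w") as f:
    json.dump(test, f, indent=2)

print(f"train: {len(train)}, test: {len(test)}")
```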

Synthetic Data

References: fastchat/data/hardcoded_questions.py

This section details how the …/hardcoded_questions.py file adds synthetic conversations for robustness during data processing. The file contains hardcoded questions and answers that define an identity conversation with a model. It includes common questions and predefined answers containing the model's name and organization.

The …/hardcoded_questions.py file contains:

  • The name and organization of the model.
  • Lists of common identity questions and predefined answers.
  • A function that loops through the questions and answers, pairing each question with each answer to build dictionaries representing conversations.
  • Multiple calls to this function with different question/answer pairs to generate responses about the model identity and clarify it is not other models.
  • The conversations are appended to a list and then dumped to a JSON file.

This allows including basic identity conversations when processing data to improve the model's ability to have natural discussions about itself. The predefined questions and answers provide more robust examples during training without requiring additional data collection.

Merging Data

References: fastchat/data/merge.py

The …/merge.py script allows combining multiple conversation data files into a single consolidated dataset. It takes in multiple JSON files containing conversations as input, loads the content from each file into a list, then concatenates the lists and outputs the merged list to a single output file.

The main steps are:

  1. It uses an argument parser to define and parse the command line arguments for the input and output file paths.

  2. It loads the JSON content from each input file path and extends the list with the loaded content.

  3. It prints the length of the merged list to confirm the number of conversations.

  4. It dumps the concatenated list to the JSON output file path specified.

This provides a simple way to programmatically combine conversation data from multiple sources with different formats into a single consolidated file, preparing the data for downstream tasks like model training.
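A minimal sketch of the merge logic described above:

```python
import json

def merge_conversation_files(in_files: list[str], out_file: str) -> None:
    """Concatenate the conversation lists from several JSON files into one."""
    merged = []
    for path in in_files:
        with open(path) as f:
            merged.extend(json.load(f))
    print(f"total conversations: {len(merged)}")
    with open(out_file, "w") as f:
        json.dump(merged, f, indent=2)
```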

Analyzing Data

References: fastchat/data/get_stats.py

This section analyzes conversation data to generate statistics about the data distributions and properties. The …/get_stats.py script computes various metrics over tokenized conversation samples to understand the characteristics of the data.

The script first tokenizes each conversation sample in parallel over samples for efficiency.

Statistics are then computed. It calculates the total number of tokens, average number of turns per conversation, and length distributions by summing token and turn counts, and calculating average lengths.

Finally, the results are printed. The total statistics are output, and a histogram of length distributions is generated. This provides a high-level view of the data properties.

By tokenizing and computing metrics in a modular and parallelized way, large datasets can be efficiently analyzed. The statistics help inform data processing and model training decisions.
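A simplified sketch of that statistics pass, assuming ShareGPT-style samples with a "conversations" list of turns; the model name and file path are placeholders.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5", use_fast=True)

def count_sample(sample: dict) -> tuple[int, int]:
    """Return (token_count, turn_count) for one conversation sample."""
    tokens = sum(len(tokenizer(turn["value"]).input_ids)
                 for turn in sample["conversations"])
    return tokens, len(sample["conversations"])

if __name__ == "__main__":
    data = json.load(open("sharegpt_clean.json"))
    with ProcessPoolExecutor() as pool:
        counts = list(pool.map(count_sample, data))
    total_tokens = sum(t for t, _ in counts)
    avg_turns = sum(n for _, n in counts) / len(counts)
    print(f"samples: {len(data)}  total tokens: {total_tokens}  avg turns: {avg_turns:.2f}")
```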

Full Pipeline

References: fastchat/data/prepare_all.py

The …/prepare_all.py script provides an end-to-end workflow for preparing various conversation datasets needed for training and evaluating chatbots. It takes command line arguments to specify input and output directories and filenames. The script then defines a list of commands to run Python scripts from the …/data package. These scripts clean, preprocess, split, filter and merge the raw conversation data.

Some of the key processing steps include:

  • Cleaning the raw ShareGPT HTML dataset with …/clean_sharegpt.py
  • Optional cleaning like language filtering
  • Splitting long conversations
  • Filtering for correctly formatted conversations
  • Splitting into train and test sets with …/split_train_test.py
  • Adding hardcoded questions with …/hardcoded_questions.py
  • Merging with identity data
  • Extracting only GPT-4 conversations

The script runs each step by executing Python scripts in the …/data directory. This provides a consistent way to orchestrate the end-to-end preprocessing workflow.

Documentation

References: docs

The docs directory contains documentation files that describe the various components, models, and functionality within the FastChat platform. This includes documentation on commands, models, datasets, servers, and more.

The most important subdirectory is …/commands, which contains markdown files documenting commands for tasks like managing conversation data and launching local clusters. These files provide code snippets and instructions for using important scripts.

Other key files include:

  • …/server_arch.md which contains a diagram illustrating the server architecture and describes components like the database, API, and websockets at a high level.

  • …/openai_api.md which documents the OpenAI compatible REST API and shows how to interact with the API server via the OpenAI Python SDK and cURL requests.

  • …/model_support.md which discusses supporting new models.

  • …/arena.md which describes the benchmarking platform and processes for adding new models by contributing code or hosting models on third party APIs.

Model Training

References: fastchat/data, fastchat/train

Training conversational AI models like Vicuna involves loading conversation data, preprocessing it, and using supervised learning techniques. The …/train directory contains implementations of training workflows that can be composed for different models.

Key files implement various aspects of the training process. …/train.py applies conversation templates from …/conversation.py, tokenizes the data with a pretrained tokenizer, masks targets for prediction, and creates PyTorch datasets for training. It loads a pretrained causal language model and runs the training process.

The …/train_flant5.py file takes raw conversations, adds speaker tags, and tokenizes into a single sequence to form question-answer pairs by identifying the question as prior context and answer as subsequent speaker tokens. This prepares the data for supervised learning of question-answering.

…/train_baichuan.py similarly applies conversation templates, tokenizes data, and masks targets for supervised learning. It uses multiprocessing to parallelize preprocessing for large datasets. The main training loop initializes the model, tokenizer, and trainer to train the model end-to-end.

Model Serving

References: fastchat/serve

The FastChat system provides functionality for serving trained models via web APIs and user interfaces. Key components involved in this are defined in files under the …/serve directory.

The …/model_worker.py file handles loading models and executing them. It defines a class that loads a model and handles predictions.

The …/controller.py file defines a class that manages distributed AI workers. It provides APIs for workers to register and receive requests via FastAPI endpoints.

API implementations are defined in files such as …/huggingface_api_worker.py. Its worker class inherits from the base class in …/base_model_worker.py and overrides methods to call external APIs and handle generation.

Web interfaces are provided via UIs implemented in files like …/gradio_web_server.py. This file builds a Gradio interface and defines functions to handle user input and generation.

Scripts such as …/launch_all_serve.py automate launching the full distributed serving stack, including required processes.

Model Compression

References: fastchat/model

The …/compression.py file implements model compression functionality. A class wraps compressed linear layers.

The LoRA logic in …/apply_lora.py applies a low-rank adapter to a base model, merging them into a single combined model. …/make_delta.py calculates weight deltas between models.

Model Evaluation

References: fastchat/llm_judge, tests

This section analyzes tools used for evaluating conversational models. The core functionality involves generating judgments of model responses and calculating metrics based on these judgments.

The …/llm_judge directory contains several important modules for model evaluation. The …/gen_judgment.py module implements generating judgments by comparing model responses. It creates matches between responses and loads data. This allows automated evaluation of many model responses. The …/compute_agreement.py module calculates agreement metrics between different judges, quantifying how consistent their judgments are.

The …/qa_browser.py module builds an interactive browser for viewing evaluation results. It loads necessary data and defines functions to display questions, answers, and judgments in an interactive UI.

Model Inference

References: fastchat/model, fastchat/modules

The core functionality covered under this section is loading various pretrained models into memory and running inference with them to generate predictions or text.

The …/__init__.py file provides an interface for loading models and getting metadata.

…/exllama.py contains a class representing a loaded Exllama model with its cache.

…/model_chatglm.py contains code for ChatGLM model inference. A class handles invalid scores during generation.

…/model_xfastertransformer.py defines classes to encapsulate an XFasterTransformer model's configuration and loaded object.

Data Processing

References: fastchat/data

The …/data directory contains Python scripts for preparing raw conversation data for model training. The scripts clean, preprocess, filter and split conversation data into clean datasets suitable for training chatbots.

Key functionality includes:

  • Cleaning raw conversation logs and converting them to Markdown

  • Converting between dataset formats

  • Splitting long conversations and creating train/test splits

  • Filtering out incorrectly formatted conversations

  • Computing statistics over the processed data

Documentation

References: docs

The docs directory contains documentation files that describe the various components, models, and functionality within the FastChat platform. This includes documentation on commands, models, datasets, servers, and more.

The …/commands subdirectory contains files documenting commands for tasks like managing conversation data, running a leaderboard, launching local clusters, uploading packages to PyPI, and running web servers. The …/conv_release.md file documents the process for gathering, cleaning, and sampling chatbot conversation data for release. It describes using scripts to gather battles, add moderation tags, and filter conversations.

The …/data_cleaning.md file describes steps for cleaning raw conversation data. It provides instructions for installing libraries and shows examples of calling functions to perform tasks like format conversion and filtering conversations.

The …/server_arch.md contains a diagram of the server architecture.

The …/openai_api.md documents the OpenAI-compatible RESTful APIs.

The …/training.md provides examples for fine-tuning models.

Examples & Tests

References: playground, tests

This section covers example tasks and component tests implemented in the code. The playground directory contains Jupyter notebooks that demonstrate applying common natural language processing techniques to tasks like text classification, semantic search, and evaluating similarity with pretrained embeddings.

The …/test_embedding subdirectory focuses specifically on examples using embeddings. The file …/test_sentence_similarity.py contains functions for retrieving sentence embeddings from different models. It also includes functions for calculating cosine similarity between embeddings. This allows comparing how similarly sentences are embedded in various language models.
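The similarity measure itself is plain cosine similarity between embedding vectors, for example:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# usage with any two sentence embeddings, however they were produced:
# sim = cosine_similarity(embed("The cat sat"), embed("A cat was sitting"))
```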

The file …/test_classification.py demonstrates building text classifiers on embedded data. It uses functions to generate embeddings, then splits the data and trains a classifier to classify texts. The accuracy is evaluated and printed.

For semantic searches, the file …/test_semantic_search.py contains functions that perform a nearest neighbor search on embedded review texts to find reviews similar to a query. It loads review data before searching.

Component tests are in the tests directory. The file …/test_openai_api.py tests the OpenAI API server functionality like listing models, text completion, and chat completion with functions defined in the file. The file …/test_cli.py runs CLI commands via the Python module to test single and multi-GPU usage, 8-bit quantization, and the HuggingFace API.

Examples & Tests

References: playground, tests

This section covers example tasks that demonstrate applying machine learning techniques to common natural language processing problems using pretrained word embeddings. It also discusses tests for core FastChat components.

The …/test_embedding directory contains Jupyter notebooks that showcase tasks like text similarity evaluation, text classification, and semantic searches. These examples leverage functions defined in code files under the directory to work with pretrained embeddings from various models.

The …/test_sentence_similarity.py file contains code for working with sentence embeddings from different models.

The …/test_classification.py file contains code demonstrating how to build text classifiers on embedded data.

For semantic search, …/test_semantic_search.py contains functions for working with embedded data.

The tests directory contains unit tests for core components. …/test_openai_api.py tests the OpenAI API server functionality like model listing and text completion. It calls functions on different models. …/test_cli.py runs CLI commands as strings to test functionality.

Examples

References: playground/test_embedding

The examples in …/test_embedding demonstrate how to apply pretrained embeddings to common natural language processing tasks. The code shows how to evaluate text similarity, build text classifiers, and perform semantic searches on text data using sentence embeddings.

The main tasks demonstrated include:

  • Evaluating text similarity by calculating the cosine distance between embedding vectors retrieved from different models.

  • Building text classifiers by cleaning and preprocessing text data, generating embeddings for the data, and training classifiers on the embedded data.

  • Performing semantic searches on text by creating an embedded data frame, calculating distances, and returning the nearest neighbors.

These examples demonstrate common NLP techniques of calculating embedding distances, using embeddings for machine learning, and searching embedded spaces. The modular functions allow comparing models.

Component Tests

References: tests

This section covers the unit tests located in the tests directory. These tests validate the core functionality of the FastChat application including the command line interface (CLI), OpenAI API server, GUI serving, and Peft serving.

The …/test_cli.py file contains tests for the CLI.

The …/test_openai_api.py file contains tests for the OpenAI API compatible server.

The …/README.md file describes how to run unit tests for CLI inference, the OpenAI API server, GUI serving, and Peft serving functionality.