llama.cpp

Auto-generated from ggerganov/llama.cpp by Mutable.ai Auto Wiki

GitHub Repository
Developer: ggerganov
Written in: C
Stars: 48k
Watchers: 465
Created: 2023-03-10
Last updated: 2024-01-09
License: MIT
Repository: ggerganov/llama.cpp

Auto Wiki
Generated at: 2024-01-09
Generated from: Commit 1fc2f2
Version: 0.0.4

The llama.cpp repository provides a framework and tools for building, training, evaluating, and deploying large language models. At its core is an efficient C++ library that implements the Transformer architecture with optimizations such as key-value (KV) caching for fast inference.

The framework allows defining custom LLama model architectures through classes like llama_model and llama_layer as shown in …/baby-llama. It provides an API for building computation graphs that run on CPU or GPU, enabling model training from scratch as in …/train-text-from-scratch. Checkpoints can be saved during training and converted to formats like GGUF.

For generation, the framework offers sampling functions in …/main that leverage caching and parallel decoding for high throughput. Techniques like beam search in …/beam-search, infill completion in …/infill, and interactive decoding support advanced text generation use cases.

The library encapsulates models and execution state in classes like llama_context, providing APIs to deploy models via servers in …/server or generate embeddings as in …/embedding. Tools assist in model conversion between frameworks, evaluation via perplexity, and compression through quantization as in …/quantize.

Test suites in tests and continuous integration workflows in ci validate functionality and correctness. The framework aims to provide reusable components for building, training, evaluating and deploying large language models.

Model Definition and Training

References: examples/baby-llama, examples/train-text-from-scratch

The core functionality implemented in the code relates to defining model architectures and training models from scratch using the LLama framework. This is handled primarily through the llama_model and llama_layer classes defined in …/baby-llama.cpp.

The llama_model class represents the overall model architecture and stores the core model parameters and weights. It contains attributes like the hyperparameters in hparams and tensors holding values like the input embeddings and output projections. This class defines the overall model structure.

The llama_layer class defines the components of each layer in the model. It contains the weights used for self-attention, such as queries, keys, values, and output projections. During training, the weights of each llama_layer are learned.

Model training is implemented in …/train-text-from-scratch.cpp. This file defines a my_llama_model struct to represent the specific model being trained. It contains the core tensors and hyperparameters.

The llama_build_train_graphs function builds the computation graphs for the forward and backward passes by chaining together core LLama operations like attention and feedforward defined in the llama_layer classes.

The main training loop is handled by ggml_opt_resume_g, which runs the model through forward/backward passes on batches of training data. It calculates loss and uses the Adam optimizer to update the weights stored in the llama_model and llama_layer classes based on the gradients.
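
The cycle of forward pass, loss, backward pass, and Adam update can be pictured with a small self-contained sketch. It deliberately avoids the ggml API, so a single toy weight stands in for the model tensors and every name below is illustrative rather than taken from the repository:

  // Minimal, self-contained sketch of the forward/loss/backward/Adam cycle the
  // training loop performs. Illustrative only: the real loop builds ggml
  // computation graphs and drives them via ggml_opt_resume_g.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  int main() {
      // Toy "model": a single weight fit so that y = w * x approximates y = 2x.
      float w = 0.0f;
      float m = 0.0f, v = 0.0f;                  // Adam first/second moments
      const float lr = 0.1f, b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;

      const std::vector<float> xs = {1, 2, 3, 4};
      const std::vector<float> ys = {2, 4, 6, 8};

      for (int t = 1; t <= 200; ++t) {
          // Forward pass + loss (mean squared error) + gradient of the loss w.r.t. w.
          float grad = 0.0f, loss = 0.0f;
          for (size_t i = 0; i < xs.size(); ++i) {
              const float pred = w * xs[i];
              const float err  = pred - ys[i];
              loss += err * err;
              grad += 2.0f * err * xs[i];
          }
          loss /= xs.size();
          grad /= xs.size();

          // Adam update, the same rule the optimizer applies to every weight tensor.
          m = b1 * m + (1.0f - b1) * grad;
          v = b2 * v + (1.0f - b2) * grad * grad;
          const float mhat = m / (1.0f - std::pow(b1, t));
          const float vhat = v / (1.0f - std::pow(b2, t));
          w -= lr * mhat / (std::sqrt(vhat) + eps);

          if (t % 50 == 0) std::printf("iter %3d  loss %.6f  w %.4f\n", t, loss, w);
      }
      return 0;
  }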

Periodic checkpointing saves the model to files that can be loaded later to resume training. This allows training to continue from the latest checkpointed state.

Model Architecture

References: examples/baby-llama

The LLama model architecture is defined through the llama_model and llama_layer classes in the …/baby-llama.cpp file.

The llama_model class represents the overall model. It contains attributes like the hyperparameters in its hparams member, as well as tensors that hold the model weights, including input embeddings, output projections, and layer weights.

The llama_layer class represents a single transformer layer. It contains the weight tensors for self-attention (queries, keys, values, and the output projection) along with the feed-forward and normalization weights. These weights are used during inference.

The core forward computation is implemented in the forward() and forward_batch() functions. They take the input and pass it through each llama_layer sequentially, applying self-attention and feedforward transformations using the weights stored in the layer objects.

A key optimization is the llama_kv_cache, which stores the attention keys and values computed for earlier token positions. During self-attention for later tokens, retrieving keys and values from the cache avoids recomputing them, speeding up incremental inference.
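
A minimal, standalone sketch of that idea is shown below; the structure names are hypothetical, and the real llama_kv_cache holds ggml tensors per layer and head rather than plain vectors:

  // Conceptual sketch of a key-value cache for a single attention head.
  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  struct KVCache {
      std::vector<std::vector<float>> keys;    // one key vector per cached token position
      std::vector<std::vector<float>> values;  // one value vector per cached token position
  };

  // Process one new token: append only its key/value to the cache, then attend
  // over everything cached so far, so older keys/values are reused, not recomputed.
  std::vector<float> attend_incremental(KVCache & cache,
                                        const std::vector<float> & new_key,
                                        const std::vector<float> & new_value,
                                        const std::vector<float> & query) {
      cache.keys.push_back(new_key);
      cache.values.push_back(new_value);

      // Unnormalized attention scores of the query against every cached key.
      std::vector<float> scores;
      for (const auto & k : cache.keys) {
          float dot = 0.0f;
          for (size_t i = 0; i < k.size(); ++i) dot += query[i] * k[i];
          scores.push_back(dot);
      }
      // Softmax over the scores.
      float maxs = scores[0], sum = 0.0f;
      for (float s : scores) maxs = std::max(maxs, s);
      for (float & s : scores) { s = std::exp(s - maxs); sum += s; }

      // Output is the probability-weighted sum of the cached values.
      std::vector<float> out(query.size(), 0.0f);
      for (size_t p = 0; p < cache.values.size(); ++p)
          for (size_t i = 0; i < out.size(); ++i)
              out[i] += (scores[p] / sum) * cache.values[p][i];
      return out;
  }

  int main() {
      KVCache cache;
      // Two decoding steps: the second step reuses the first token's cached key/value.
      attend_incremental(cache, {1.0f, 0.0f}, {0.5f, 0.5f}, {1.0f, 0.0f});
      const auto out = attend_incremental(cache, {0.0f, 1.0f}, {0.2f, 0.8f}, {1.0f, 1.0f});
      std::printf("cached positions: %zu, out[0] = %.3f\n", cache.keys.size(), out[0]);
  }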

Model initialization is handled by functions like init_model() and randomize_model(). They set up the model weights and parameters by initializing the tensors in the llama_model class.

Training Loop Implementation

References: examples/train-text-from-scratch

The core functionality implemented in …/train-text-from-scratch.cpp is the training loop for optimizing the LLama text generation model. This is handled by the ggml_opt_resume_g() function.

ggml_opt_resume_g() runs the main training loop, where for each iteration it:

  • Runs the forward pass by calling forward() on the computation graph to get the logits
  • Calculates the loss by calling loss() on the logits and labels
  • Runs the backward pass to get gradients by calling backward()
  • Updates the model weights using the Adam optimizer

Some key aspects of the implementation:

  • The my_llama_model struct represents the core model being trained, containing hyperparameters, tensors for embeddings/layers etc
  • The llama_build_train_graphs() function builds the forward and backward pass graphs by chaining together LLama operations like attention and feedforward
  • forward() runs the forward pass on the graph to get logits
  • loss() calculates loss from logits and labels
  • backward() runs backpropagation to get gradients

Callback functions allow periodically saving checkpoints and printing progress. The training data is efficiently tokenized into integer IDs using llama_context.

Checkpointing

References: examples/train-text-from-scratch

The code saves model snapshots during training to disk for resuming training later. This is handled by the ggml_opt_resume_g function in …/train-text-from-scratch.cpp.

This function contains the main training loop. During each iteration, it runs the forward and backward passes on a batch of training data. It then calls the optimizer to update the model weights.

Periodically during training, ggml_opt_resume_g calls a callback function. This function saves the current state of the my_llama_model struct to a checkpoint file on disk.

The my_llama_model struct represents the core model being trained. It contains the model hyperparameters, weights, biases, and other tensors as members. The callback function serializes the contents of my_llama_model to a file.

When training is resumed by running the program again, ggml_opt_resume_g checks for existing checkpoint files. If one is found, it calls a function to deserialize the saved my_llama_model struct from the file. This loads the model from the previous training state, allowing training to pick up where it left off.

The checkpoint files are saved periodically during training, with placeholders in the filename replaced with the current iteration number. This allows incrementing the checkpoint filenames without overwriting previous ones.
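
A minimal sketch of that placeholder substitution, assuming a pattern of the form checkpoint-ITERATION.gguf (the actual pattern string is configurable), might look like this:

  // Sketch of substituting the iteration number into a checkpoint filename pattern.
  #include <cstdio>
  #include <string>

  std::string checkpoint_filename(std::string pattern, int iteration) {
      const std::string placeholder = "ITERATION";
      const size_t pos = pattern.find(placeholder);
      if (pos != std::string::npos) {
          pattern.replace(pos, placeholder.size(), std::to_string(iteration));
      }
      return pattern;
  }

  int main() {
      // Each save produces a distinct file, so earlier checkpoints are not overwritten.
      std::printf("%s\n", checkpoint_filename("checkpoint-ITERATION.gguf", 10).c_str());
      std::printf("%s\n", checkpoint_filename("checkpoint-ITERATION.gguf", 20).c_str());
  }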

GGUF Conversion

References: examples/train-text-from-scratch/convert-train-checkpoint-to-gguf.py

The …/convert-train-checkpoint-to-gguf.py script handles converting models trained with …/train-text-from-scratch to the GGUF format. It defines several classes to represent the different components of the model and training state being converted.

The Tensor class represents tensor data and handles loading tensors from the checkpoint. The OptimizationContext class contains optimization state like Adam moments and uses Tensor objects to load this data from checkpoints.

The Model class represents the model parameters. It contains a ModelParams object to define the model architecture and stores Layer objects for each layer. The Layer class contains the per-layer parameters as Tensor objects.

The Checkpoint class ties together the full checkpoint conversion. Its load() method deserializes the checkpoint data and populates the Model and OptimizationContext objects. Its save_gguf() method writes out the equivalent data in GGUF format by calling save_gguf() on the loaded Model and OptimizationContext.

The main functionality converts a trained model checkpoint to GGUF format in the following steps:

  1. The Checkpoint loads the checkpoint data using its load() method
  2. The loaded Model and OptimizationContext represent the checkpoint data in-memory
  3. The Checkpoint calls save_gguf() to write out the Model and OptimizationContext in the GGUF format
  4. Tensor handles loading/saving tensor data between the two formats
  5. The classes work together to represent the checkpoint and convert it to the new GGUF format

Text Generation

References: examples/main, examples/lookup, examples/infill, examples/beam-search

The LLaMa library provides a number of tools and examples for generating text from trained models. Key functionality is demonstrated in the …/main and …/lookup directories.

…/main contains an executable called main that performs basic text generation. The main() function loads a LLama model and creates a llama_context for it. It then enters a decoding loop that calls llama_sampling_sample() on a llama_sampling_context to build up an output sequence token by token.

…/lookup demonstrates prompt-guided generation. It contains an executable called lookup defined in lookup.cpp. The main work is done in a decoding loop similar to main. It interfaces with the LLama model through functions like llama_sampling_init(), llama_sampling_sample(), llama_decode(), and llama_tokenize() to perform initialization, sampling, decoding, and tokenization, while the llama_context manages the generation context window. Together these provide the core framework and components for text generation applications.

Sampling Functions

References: examples/main, examples/lookup

The core sampling functionality is implemented in C++ functions and classes from the LLama.cpp library. Key components include the llama_sampling_context struct, which manages the sampling state and cached logits. Functions like llama_sampling_sample() and llama_sampling_accept() implement the sampling algorithms over this cached data.

Group attention allows extending the context size by virtually dividing the cache into windows and shifting them using functions like llama_kv_cache_seq_shift(). Context shifting physically moves cached data when the context fills up using llama_kv_cache_seq_rm() and llama_kv_cache_seq_shift() to replace old tokens.
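
The following standalone sketch illustrates the shifting idea with hypothetical structures: when the window fills up, a slice of tokens after the kept prefix is discarded and the positions of the remaining tokens are shifted down:

  // Conceptual sketch of context shifting; the real code operates on the KV cache
  // via llama_kv_cache_seq_rm() and llama_kv_cache_seq_shift().
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  struct CachedToken { int pos; int id; };

  // Drop n_discard tokens after the kept prefix and shift the positions of the rest,
  // freeing room at the end of the window for newly generated tokens.
  void context_shift(std::vector<CachedToken> & cache, int n_keep, int n_discard) {
      cache.erase(cache.begin() + n_keep, cache.begin() + n_keep + n_discard);
      for (size_t i = n_keep; i < cache.size(); ++i) {
          cache[i].pos -= n_discard;
      }
  }

  int main() {
      std::vector<CachedToken> cache;
      for (int i = 0; i < 8; ++i) cache.push_back({i, 100 + i});

      context_shift(cache, /*n_keep=*/2, /*n_discard=*/3);

      for (const auto & t : cache) std::printf("pos=%d id=%d\n", t.pos, t.id);
  }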

The llama_model object represents the loaded LLama model weights and vocabulary.

Sampling is performed by initializing a llama_sampling_context with llama_sampling_init() and then calling llama_sampling_sample() in a loop. The sampled tokens can be accepted with llama_sampling_accept() or discarded to continue sampling freely. Between iterations, the KV cache is cleaned with llama_kv_cache_seq_rm() to remove unused tokens.

The main program interfaces with these sampling functions and classes, performing generation by iteratively calling llama_sampling_sample() and llama_sampling_accept() on the context to build up an output sequence token-by-token.
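
A self-contained sketch of such a token-by-token loop is shown below. It samples from toy logits using top-k filtering and temperature; in the real program the logits come from llama_decode() and the sampling and acceptance steps are performed by llama_sampling_sample() and llama_sampling_accept():

  // Standalone sketch of a sampling loop with top-k filtering and temperature.
  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <random>
  #include <utility>
  #include <vector>

  int sample_top_k(const std::vector<float> & logits, int top_k, float temp, std::mt19937 & rng) {
      // Pair each logit with its token id and keep only the k largest.
      std::vector<std::pair<float, int>> cand;
      for (int id = 0; id < (int) logits.size(); ++id) cand.push_back({logits[id], id});
      std::partial_sort(cand.begin(), cand.begin() + top_k, cand.end(),
                        [](const auto & a, const auto & b) { return a.first > b.first; });
      cand.resize(top_k);

      // Softmax over the surviving candidates, with temperature.
      std::vector<double> probs;
      double sum = 0.0;
      for (const auto & c : cand) { double p = std::exp(c.first / temp); probs.push_back(p); sum += p; }
      for (double & p : probs) p /= sum;

      std::discrete_distribution<int> dist(probs.begin(), probs.end());
      return cand[dist(rng)].second;
  }

  int main() {
      std::mt19937 rng(42);
      std::vector<int> output;                      // the growing token sequence
      for (int step = 0; step < 8; ++step) {
          // In the real loop these logits come from evaluating the model on the context.
          std::vector<float> logits = {0.1f, 2.0f, 1.5f, 0.3f, 1.0f};
          const int id = sample_top_k(logits, /*top_k=*/3, /*temp=*/0.8f, rng);
          output.push_back(id);                     // "accept" the token into the sequence
      }
      for (int id : output) std::printf("%d ", id);
      std::printf("\n");
  }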

Beam Search Decoding

References: examples/beam-search

Beam search decoding allows generating text via an iterative, pruned search over the model's predicted token distributions. The …/beam-search directory implements beam search decoding for a large language model using the LLama library.

The beam_search.cpp source file defines the core beam search functionality. It contains the beam_search_callback which tracks progress and collects candidate tokens during decoding. This callback uses a beam_search_callback_data struct to store the context and growing response.

Beam search is performed with llama_beam_search, which handles pruning and iterating over predictions. The encoded context and callback are passed to this function to generate candidate responses.

The callback is critical, as it is called on each step to check for sentence ends, log progress, and collect matching tokens into the response vector using the callback data struct. This couples the growing response to the decoding process.
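
The pruning logic at the heart of beam search can be sketched independently of the LLama API: expand every beam with each candidate token, accumulate sequence log-probabilities, and keep only the best n_beams candidates. The toy scorer below stands in for the model:

  // Conceptual beam search sketch; the real implementation drives llama_beam_search
  // with a callback, but the expand/score/prune loop is the same idea.
  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <vector>

  struct Beam { std::vector<int> tokens; double logprob = 0.0; };

  // Toy scorer: log-probabilities over a vocabulary of 4 tokens, independent of history.
  std::vector<double> next_logprobs(const std::vector<int> &) {
      return {std::log(0.5), std::log(0.2), std::log(0.2), std::log(0.1)};
  }

  int main() {
      const int n_beams = 2, n_steps = 3;
      std::vector<Beam> beams(1);                       // start with one empty beam

      for (int step = 0; step < n_steps; ++step) {
          std::vector<Beam> candidates;
          for (const Beam & b : beams) {
              const auto lp = next_logprobs(b.tokens);
              for (int id = 0; id < (int) lp.size(); ++id) {
                  Beam nb = b;
                  nb.tokens.push_back(id);
                  nb.logprob += lp[id];                 // accumulate sequence log-probability
                  candidates.push_back(nb);
              }
          }
          // Prune: keep only the n_beams highest-scoring candidates.
          std::sort(candidates.begin(), candidates.end(),
                    [](const Beam & a, const Beam & b) { return a.logprob > b.logprob; });
          candidates.resize(n_beams);
          beams = candidates;
      }

      for (const Beam & b : beams) {
          std::printf("logprob %.3f tokens:", b.logprob);
          for (int id : b.tokens) std::printf(" %d", id);
          std::printf("\n");
      }
  }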

The CMakeLists.txt configures a build target for the beam-search executable. It links this target to the required common, llama, and CMake thread libraries.

Infill Completion

References: examples/infill

The …/infill directory contains tools for completing partial code snippets by filling in missing lines. The infill executable allows generating text to complete code contexts provided by the user. It handles command line parsing and initializes the LLama backend and model loading.

A key class is llama_sampling_context, which manages the sampling state and probabilities during generation. Its main methods are llama_sampling_init() to initialize a new context from sampling parameters, llama_sampling_sample() to sample the next token ID, and llama_sampling_accept() to accept a token into the context and update probabilities. The context helps ensure coherent generation across many tokens by tracking dependencies and metrics in the full input history.

The main loop in infill.cpp performs token generation, sampling the next token with llama_sampling_sample() and accepting it with llama_sampling_accept(). It displays and logs the output while tracking the remaining budget. Context resets are triggered with llama_sampling_reset() to avoid overflow. Interactive mode allows prompting the user for new input from the console.

The README.md provides documentation on infill usage, like specifying code context prefixes and suffixes with --in-prefix and --in-suffix. Interactive mode via -i receives suggestions in real-time. This allows interactively completing partial code snippets by filling in missing lines between the provided code context.

Interactive Decoding

References: examples/main, examples/infill

The …/main directory supports interactive text generation through its interactive mode. When the -i or --interactive flags are passed to the main executable, it enters a generation loop that responds to user input in real-time.

The core interactive logic is handled by llama_sampling_sample() and llama_sampling_accept(), which operate on a llama_sampling_context to sample over the cached logits and update the sampling state as new tokens arrive. User input is tokenized with llama_tokenize() in …/main.cpp before being fed to the model.

The llama_context manages the context window during generation, as described in …/README.md.

Generation continues until termination conditions are reached, reading new user input from the console and passing it to the context before continuing.

Output Formatting

References: examples/main

The …/main.cpp source file contains the core logic for formatting generated text. Sampled token IDs are converted to readable text with llama_token_to_piece(), which handles byte-encoded tokens, while llama_decode() evaluates the model to produce the logits for the next token.

Console colors are applied to the output using ANSI escape codes, for example to distinguish the original prompt and user input from newly generated text. The colors are added by wrapping token text in escape code strings before printing.
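
A small sketch of that wrapping, with illustrative color choices:

  // Wrap a piece of text in an ANSI color escape code, then reset the color.
  #include <cstdio>
  #include <string>

  std::string colorize(const std::string & text, int ansi_code) {
      return "\033[" + std::to_string(ansi_code) + "m" + text + "\033[0m";
  }

  int main() {
      // e.g. print the prompt in green and the generated text in the default color
      std::printf("%s", colorize("Once upon a time", 32).c_str());
      std::printf("%s\n", " there was a llama.");
  }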

Logging functionality is provided by writing the generated text to files. If a log file path is specified during execution, this text will be written to that file on disk.

A llama_context manages the cached state (such as the KV cache) and the generation context. As new tokens are sampled during generation, they are added to the context. Context shifting replaces old tokens as the window moves across the generated text.

The model object exposes the functions used to iteratively generate new tokens.

Model Conversion

References: examples/convert-llama2c-to-ggml, requirements

This section covers converting models between the formats used by different frameworks and tools, such as PyTorch checkpoints, llama2.c binaries, and the GGML/GGUF formats used by llama.cpp. The core functionality involves mapping models between these different representation formats.

Key implementation details include defining structs or classes to represent models in each format. For example, the my_llama_model struct in …/convert-llama2c-to-ggml.cpp defines the model architecture as a collection of tensors. Conversion functions initialize these structs or classes based on metadata like hyperparameters. They then load weights from the source model into the target struct, before serializing out the full model.

The …/convert-llama2c-to-ggml directory contains important conversion functionality. The convert-llama2c-to-ggml executable is defined in …/CMakeLists.txt and builds from convert-llama2c-to-ggml.cpp. This file contains the llama_vocab struct representing the vocabulary, and the my_llama_model struct defining the target model architecture as tensors. The init_model function allocates these tensors, while checkpoint_init_weights loads weights into the TransformerWeights struct. save_as_llama_model then copies weights to the target tensors and saves the final model.

The requirements directory specifies dependencies and versions for various converters. The requirements-convert.txt file defines common requirements like NumPy and SentencePiece versions. Files like requirements-convert-hf-to-gguf.txt and requirements-convert-persimmon-to-gguf.txt specify PyTorch versions for those workflows.

Converting from LLama2.c

References: examples/convert-llama2c-to-ggml

This section details the process of converting models trained with the llama2.c framework to the GGUF format used by llama.cpp. This allows models like stories42M.bin, trained with llama2.c, to be loaded by the rest of the llama.cpp tooling that consumes GGUF files.

The conversion is handled by the …/convert-llama2c-to-ggml directory. It contains an executable called convert-llama2c-to-ggml that performs the core conversion. The executable is built from convert-llama2c-to-ggml.cpp using CMake rules in CMakeLists.txt.

convert-llama2c-to-ggml.cpp contains important structs like llama_vocab to represent the vocabulary, my_llama_model to define the model architecture, and TransformerWeights to store the pretrained weights. It also includes functions such as load_vocab() to populate the vocabulary struct, init_model() to initialize the model struct, checkpoint_init_weights() to load weights, and save_as_llama_model() to save the converted model.

The executable takes a LLama2.c model, loads the vocabulary using load_vocab(), then initializes my_llama_model with init_model(). It loads the weights into TransformerWeights with checkpoint_init_weights() and transfers them to the model struct by calling save_as_llama_model(). This converts the weights to GGUF's tensor format. The converted model can then be used by GGUF tools.
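
The raw-weight loading can be sketched as follows. The llama2.c checkpoint begins with a small header of int32 hyperparameters followed by contiguous float32 tensors; the sketch reads the header and the token embedding table, while the real converter continues through all remaining tensors and handles extra details such as shared output weights:

  // Sketch of reading a llama2.c-style checkpoint: a fixed int32 header followed
  // by flat float32 tensor data in a known order. Simplified and illustrative.
  #include <cstdio>
  #include <vector>

  static bool read_floats(std::FILE * f, std::vector<float> & dst, size_t count) {
      dst.resize(count);
      return std::fread(dst.data(), sizeof(float), count, f) == count;
  }

  int main(int argc, char ** argv) {
      if (argc < 2) { std::fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
      std::FILE * f = std::fopen(argv[1], "rb");
      if (!f) { std::perror("fopen"); return 1; }

      // Header: 7 int32 hyperparameters (dim, hidden_dim, n_layers, n_heads,
      // n_kv_heads, vocab_size, seq_len). The real converter also interprets a
      // negative vocab_size as a flag for shared output weights.
      int hparams[7];
      if (std::fread(hparams, sizeof(int), 7, f) != 7) { std::fclose(f); return 1; }
      const int dim = hparams[0], vocab_size = hparams[5];

      // First tensor after the header: the token embedding table (vocab_size x dim).
      std::vector<float> tok_embeddings;
      if (!read_floats(f, tok_embeddings, (size_t) vocab_size * dim)) { std::fclose(f); return 1; }
      std::printf("read %zu embedding values\n", tok_embeddings.size());

      std::fclose(f);
      return 0;
  }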

Converting from Hugging Face

References: requirements/requirements-convert-hf-to-gguf.txt

The …/requirements-convert-hf-to-gguf.txt file specifies the requirements for converting models from the Hugging Face format to the GGUF format. It pins a compatible PyTorch version (>= 2.1.1 and < 2.2) through the requirement torch~=2.1.1. It also imports additional requirements from the …/requirements-convert.txt file that are needed for the conversion process.

The primary functionality covered under this section is converting pretrained models available in the Hugging Face format to the GGUF format used by LLama. The conversion relies on the PyTorch version pinned in requirements-convert-hf-to-gguf.txt for compatibility, together with the shared requirements in requirements-convert.txt. The conversion itself is performed by the convert-hf-to-gguf.py script, which loads Hugging Face checkpoints and exports them to GGUF.

Converting GGML

References: requirements/requirements-convert-llama-ggml-to-gguf.txt

The …/requirements-convert-llama-ggml-to-gguf.txt file specifies the dependencies for the convert-llama-ggml-to-gguf.py script, which migrates models stored in the older GGML file format to the newer GGUF format.

The conversion begins by reading the legacy GGML file and parsing its header, hyperparameters, vocabulary, and tensor data into in-memory structures.

Next, this data is mapped onto the corresponding GGUF representation, translating hyperparameters into GGUF metadata keys and adjusting tensor names and layouts as needed.

Finally, the converted model is serialized out as a GGUF file that current llama.cpp tools can load.

Converting LoRa

References: requirements/requirements-convert-lora-to-ggml.txt

The …/requirements-convert-lora-to-ggml.txt file contains the requirements for the convert-lora-to-ggml.py script, which converts LoRA (Low-Rank Adaptation) adapters, typically produced by PyTorch/PEFT fine-tuning, into a GGML binary that llama.cpp can apply on top of a base model.

The conversion loads the adapter checkpoint, extracts the per-layer low-rank A and B matrices along with the LoRA rank and scaling parameters, and writes them out in the GGML adapter format.

Converting Persimmon

References: requirements/requirements-convert-persimmon-to-gguf.txt

This section covers converting Persimmon model checkpoints to the GGUF format. The core requirements are listed in …/requirements-convert-persimmon-to-gguf.txt.

This file pins the PyTorch version needed to load the original Persimmon checkpoints, alongside the shared conversion requirements.

The conversion itself is performed by the convert-persimmon-to-gguf.py script, which loads the Persimmon checkpoint and maps its hyperparameters and tensors onto the corresponding GGUF metadata and tensor layout before writing out a GGUF file.

Model Evaluation

References: examples/perplexity

The …/perplexity directory provides tools for analyzing model quality through metrics like perplexity. Perplexity is a standard way to evaluate language models, with lower perplexity indicating better performance.

The perplexity() function calculates perplexity over chunks of tokenized text. It first splits the tokens into batches based on the context size, then encodes each batch and passes it to the model to get logits via llama_decode() and llama_get_logits(). The logits are used to calculate perplexity across all batches. It supports multi-threaded processing of batches using std::thread to parallelize the workload.
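
Since perplexity is the exponential of the average negative log-likelihood of the true next token, the core arithmetic can be shown in a few lines. The probabilities below are toy values; the real code derives them from the logits returned by llama_get_logits():

  // Self-contained sketch of the perplexity computation:
  // perplexity = exp(mean negative log-likelihood of the true next token).
  #include <cmath>
  #include <cstdio>
  #include <vector>

  double perplexity(const std::vector<double> & p_true_next_token) {
      double nll = 0.0;
      for (double p : p_true_next_token) nll += -std::log(p);
      return std::exp(nll / p_true_next_token.size());
  }

  int main() {
      // Probability the model assigned to the actual next token at each position.
      const std::vector<double> probs = {0.25, 0.10, 0.60, 0.05, 0.30};
      std::printf("perplexity = %.3f\n", perplexity(probs));
  }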

An important class is llama_context, which manages the model state and handles passing batches to the model. Functions like llama_decode() and llama_get_logits() use it to interact with the model.

The perplexity_v2() function improves on perplexity() by calculating perplexity in strided chunks for better performance on long texts. It splits the work of processing batches within each chunk and accumulates results across threads using the process_logits() function.

The hellaswag_score() function evaluates models on the HellaSwag task by extracting contexts and endings, tokenizing them separately, encoding to get logits, and calculating log probabilities of endings to get a normalized accuracy score.

The CMakeLists.txt configures building the perplexity executable, linking it to dependencies like common and llama while setting C++11 compilation.

The README.md contains results quantizing a 70B model to different levels, measuring perplexity and size tradeoffs to evaluate compression quality loss.

Model Deployment

References: examples/server, .devops

This section covers deploying LLama models via servers and containers. The server deployment functionality is implemented in the …/server directory and subdirectories.

The …/server directory contains an example HTTP server that runs a LLama model. It includes a C++ backend API defined in server.cpp and a JavaScript frontend client that communicates with the backend over HTTP. The frontend code is located in …/public.

The server.cpp file handles HTTP requests and responses. It contains handlers like completion, which serves the "/completion" endpoint by calling the LLama model to generate responses. These handlers provide the core API that the frontend communicates with.

The frontend code in …/public implements a reactive chat interface using the Signals pattern. The MessageInput component handles user input and calls chat() on submit, which asynchronously runs completion on the prompt using llama() defined in completion.js. This updates the conversation stored in the ChatLog component.

Containerization for deployment is implemented using Dockerfiles and scripts in .devops. Dockerfiles like main.Dockerfile use a multi-stage build to produce optimized images. The tools.sh script provides a simple CLI to execute tools by calling the proper executables or scripts. This abstracts away tool commands for users.

Server Deployment

References: examples/server, examples/server/public

The conversational model server is deployed through the code in …/server. The C++ backend API server is implemented in server.cpp, which is built into an executable using the server target defined in …/CMakeLists.txt. This executable runs the HTTP server that serves the API endpoints for tasks like completion and tokenization.

The JavaScript frontend code for the chat interface is contained in …/public. It uses reactive programming principles, with the Signal class from index.js representing reactive state. Components like App and ChatLog in index.html declaratively render the UI using this reactive framework. completion.js contains key functions like llama() for asynchronously completing prompts with the backend server.

Requests are handled by the Flask application defined in api_like_OAI.py, which passes data to and from the C++ server. Functions such as make_postData() and make_resData() format the request and response bodies. The /chat/completions and /completions routes handle the main API endpoints.

The dependencies for the frontend are downloaded and compiled into header files using deps.sh. This bundles them into the server executable. Documentation for building, running and using the deployment is provided in …/README.md. Examples of interacting with the deployed model via the API are shown in scripts like chat.sh.

Containerization

References: .devops

The Dockerfiles in .devops handle containerization of the LLama model tools by building reproducible Docker images. This allows deploying LLama models to different hardware environments.

The Dockerfiles leverage official base images from NVIDIA, AMD, and Ubuntu to provide hardware acceleration libraries and toolchains. They install dependencies, copy code, configure builds, and define entrypoints.

The …/full-cuda.Dockerfile builds LLama with CUDA support using the nvidia/cuda image. It sets the CUDA_DOCKER_ARCH environment variable to target specific GPU architectures and enables cuBLAS for linear algebra.

The …/full-rocm.Dockerfile builds LLama with ROCm support using the official ROCm dev container. It sets variables like GPU_TARGETS, LLAMA_HIPBLAS, CC, and CXX to configure the build to target ROCm hardware and libraries.

The …/main.Dockerfile implements a multi-stage build, separating the build dependencies from the runtime dependencies. The build stage compiles the code, and the runtime stage copies the binary for a minimal final image size.

The …/tools.sh script provides a simple command line interface to execute LLama tools like convert.py, quantize, and server by parsing arguments and calling the appropriate executable or script.

Model Compression

References: examples/quantize, awq-py

The llama.cpp codebase provides techniques for compressing large language models through quantization and pruning. Quantization refers to reducing the numeric precision of model weights and activations, typically from 32-bit floats to 8-bit integers. This shrinks model size substantially with minimal impact on performance.

The …/quantize directory implements full-model post-training quantization. It includes a command line tool for quantizing models to different bitwidths specified by an enum. Quantization works by splitting weights into small blocks and storing each block as low-bit integers together with per-block scale factors. The tool runs this process on a model and reports the resulting bits per weight.
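
A conceptual sketch of this block-based scheme, similar in spirit to ggml's 8-bit Q8_0 type (one float scale plus 32 signed bytes per block), is shown below; the block size and rounding details are illustrative:

  // Conceptual block quantization: 32 float weights become one float scale + 32 int8.
  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct BlockQ8 {
      float  scale;
      int8_t q[32];
  };

  BlockQ8 quantize_block(const float * x) {
      // Pick the scale so the largest-magnitude weight maps to +/-127.
      float amax = 0.0f;
      for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
      BlockQ8 b;
      b.scale = amax / 127.0f;
      for (int i = 0; i < 32; ++i) {
          b.q[i] = (int8_t) std::lround(b.scale > 0.0f ? x[i] / b.scale : 0.0f);
      }
      return b;
  }

  float dequantize(const BlockQ8 & b, int i) { return b.scale * b.q[i]; }

  int main() {
      std::vector<float> w(32);
      for (int i = 0; i < 32; ++i) w[i] = std::sin(0.1f * i);   // toy weights
      const BlockQ8 b = quantize_block(w.data());
      std::printf("w[5]=%.4f  reconstructed=%.4f  (scale %.5f)\n",
                  w[5], dequantize(b, 5), b.scale);
  }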

A more sophisticated technique is Activation Weight Quantization (AWQ) implemented in awq-py. AWQ quantizes weights while preserving accuracy through "scaling" of activations. The ScaledActivation class in …/apply_awq.py wraps activations and scales their outputs. Key functions apply different scaling strategies based on layer types, such as scale_ln_fcs() for LayerNorm + FC stacks. apply_scale() handles combining these strategies. apply_clip() clips weights post-scaling, and add_scale_weights() loads pre-computed quantized weights into a model. This allows efficient post-training quantization while maintaining performance.

The …/README.md file documents applying AWQ to various models. It describes installing dependencies, obtaining pre-computed AWQ results, converting models to GGUF format, quantizing to different bitwidths, testing quantized models, and reporting compression results. AWQ achieves significant model size reductions with minimal drops in perplexity.

Quantization

References: examples/quantize, awq-py

The core functionality of quantization is to reduce model sizes by representing weights and activations with lower precision numeric formats. This is done through techniques like post-training quantization, where models are trained at full precision but converted to lower precision after training is complete.

The …/quantize directory provides a simple command line interface for post-training quantization of LLama models. The …/quantize.cpp file implements the core quantization logic. It defines an enum of different quantization methods.

The file also contains functions for encapsulating quantization parameters like bitwidth and setting parameters based on the selected quantization method. The main function gets a model path from arguments, configures quantization, and calls the LLama API function to perform quantization.

The awq-py directory implements Activation Weight Quantization (AWQ), a technique for post-training quantization focusing on activations and weights. The ScaledActivation class in …/apply_awq.py wraps activations and scales their outputs to preserve accuracy during quantization.

Key functions include apply_scale() which applies different scaling strategies like scale_ln_fcs() based on layer types, apply_clip() for clipping weights, and add_scale_weights() to load pre-computed quantized weights. These provide an efficient way to quantize models while maintaining accuracy through the AWQ process.

Activation Weight Quantization

References: awq-py/awq, awq-py

The ScaledActivation class handles activations and weights in the Activation Weight Quantization (AWQ) process. Defined in the …/apply_awq.py file, ScaledActivation wraps an activation function and scales its output. This allows preserving accuracy when quantizing weights.

The key functions that utilize ScaledActivation are:

  • apply_scale(): Applies different scaling strategies to layers based on their type. It makes use of functions like scale_ln_fcs(), scale_fc_fc(), and scale_gelu_fc() which implement specific scaling patterns for common patterns like LayerNorm + FC stacks.
  • scale_ln_fcs(): Scales a LayerNorm and list of FullyConnected layers proportionally
  • scale_fc_fc(): Scales weights of two FullyConnected layers in a specific pattern
  • scale_gelu_fc(): Scales a GELU activation and FullyConnected layer proportionally

These scaling functions allow efficiently quantizing models post-training by learning the scaling in a data-dependent manner. Another important function is apply_clip(), which applies clipping bounds to layer weights post-scaling.

The final piece is add_scale_weights(), which loads pre-computed AWQ results from quantization and directly modifies the model weights. By combining ScaledActivation, scaling strategies, clipping, and loading quantized weights, the AWQ implementation is able to quantize models while preserving accuracy.
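
The core invariance AWQ relies on can be demonstrated in a few lines: scaling a weight column up by s while dividing the corresponding activation channel by s leaves the layer output unchanged, but gives the quantizer better-conditioned weights for the salient channels. This toy check is not the awq-py implementation:

  // Toy demonstration that per-channel scaling of weights and activations cancels out.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  float dot(const std::vector<float> & w, const std::vector<float> & x) {
      float acc = 0.0f;
      for (size_t i = 0; i < w.size(); ++i) acc += w[i] * x[i];
      return acc;
  }

  int main() {
      std::vector<float> w = {0.8f, -1.2f, 0.05f};  // one output neuron's weights
      std::vector<float> x = {1.0f,  0.5f, 8.0f };  // activations (last channel is "salient")
      std::vector<float> s = {1.0f,  1.0f, 4.0f };  // per-channel scales

      std::vector<float> w_scaled = w, x_scaled = x;
      for (size_t i = 0; i < w.size(); ++i) {
          w_scaled[i] *= s[i];      // fold the scale into the weight column...
          x_scaled[i] /= s[i];      // ...and divide it out of the activation channel
      }

      std::printf("original output: %.6f\n", dot(w, x));
      std::printf("scaled   output: %.6f\n", dot(w_scaled, x_scaled));
      return 0;
  }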

Multi-Task Models

References: examples/llava

The LLava framework implements multimodal language generation using both text and image inputs. LLava leverages both LLaMA for the language modeling component as well as CLIP for the image encoding component.

The llava_context struct, which is central to representing the state of a LLava session, ties together the clip_ctx for image encoding, llama_context for language modeling, and llama_model for the pre-trained parameters. Storing these together allows passing the relevant context throughout processing.

The load_image function handles loading an image from either a base64 string embedded in the prompt or a separate file path. It uses llava_image_embed_make_with_prompt_base64 and llava_image_embed_make_with_filename to prepare the image representation.

process_prompt initializes the LLama model context by tokenizing the prompt, feeding the image and text embeddings, and setting the starting context pointer. This allows the model to consider both modalities when generating responses.

sample drives response generation by sampling from the LLama model conditioned on the initialized context, using llama_sampling_sample to get tokens and eval_id to feed them back and assemble the response.

The main function ties everything together by parsing arguments, initializing LLava, loading inputs, generating a response, and cleaning up, providing an end-to-end workflow.

LLava Framework

References: examples/llava, examples/llava/clip.cpp, examples/llava/clip.h, examples/llava/llava.cpp, examples/llava/llava.h

The LLava framework provides the core capabilities for building multimodal language models that can condition their text generation on image inputs. It combines the LLaMA language model with a CLIP vision model to enable encoding images into embeddings and incorporating those representations into the text generation process.

The key components of the LLava codebase are the clip_ctx class, llava_image_embed struct, and llava_context struct. The clip_ctx class, defined in clip.cpp and clip.h, represents a loaded CLIP vision model and contains functions for image encoding. The llava_image_embed struct, defined in llava.h, represents an image embedding that will be passed to the language model. It contains the embedding vector and metadata.

The llava_context struct ties together the CLIP and LLaMA contexts to track the state of a LLava session. Defined in llava-cli.cpp, it is used throughout the different processing steps. The encode_image_with_clip() function in llava.cpp handles encoding images to embeddings using CLIP, while llava_eval_image_embed() decodes the embeddings for the LLaMA model. These functions provide the core encoding and decoding logic.

The Python scripts in the examples directory help prepare models for LLava. llava-surgery.py splits a pretrained LLaVA model into its LLaMA and projection components. convert-image-encoder-to-gguf.py handles converting the CLIP image encoder to the required format. CMake builds the shared library and CLI executable from the C++ source files, allowing LLava to be built and run from source.

The llava-cli.cpp file implements the main workflow via the llava_context struct. It loads models and images, initializes the LLaMA context via process_prompt(), and generates responses by sampling the model with sample(). This provides an end-to-end CLI for text generation with LLava.

Model Preprocessing

References: examples/llava/llava-surgery.py

The Python script …/llava-surgery.py is used to split and convert pretrained LLava models. It takes a pretrained LLava model checkpoint as input and extracts the multimodal projection weights into a separate file called llava.projector. It also checks if the checkpoint contains CLIP weights and extracts those into a separate file called llava.clip.

The extraction is performed using the torch.load() and torch.save() functions to load and save tensors. All tensor names starting with "model.mm_projector" are retrieved using a list comprehension and stored in a new dictionary with the tensor names as keys and values loaded as floats. This dictionary is saved as the llava.projector file.

Similarly, tensors starting with "model.vision_tower" are checked for, saved to llava.clip if present, and removed from the checkpoint. Some additional cleanup of an "added_tokens" field is performed using for loops and del if the field exists.

The cleaned checkpoint, stripped of the projection and CLIP weights, is then saved back to the original path using torch.save(), and helpful messages are printed about the next steps for converting the model.

CLI Interface

References: examples/llava/llava-cli.cpp

The llava_context struct centralizes the state of a LLava session by tying together the clip_ctx for image encoding, llama_context for language modeling, and llama_model for the pre-trained parameters. Initializing a llava_context loads the necessary model components and prepares them to consider both text and image inputs.

The process_prompt function takes a prompt, image, and parameters and initializes the LLama model context. It tokenizes the prompt, feeds the image and text embeddings, and sets up the starting context pointer. This allows the model to consider both modalities when generating responses.

sample drives the response generation by sampling from the LLama model conditioned on the initialized context. It uses llama_sampling_sample to get a token, feeds it back to the model with eval_id, and assembles the text response.

The main function parses arguments, initializes a llava_context by loading the necessary model components, loads inputs, generates a response by calling process_prompt and sample, and cleans up resources. This provides an end-to-end workflow for initializing contexts and sampling responses from the command line.

Testing and CI

References: tests, ci

The ci directory implements continuous integration and testing capabilities for the llama.cpp project. At the core is the …/run.sh script, which executes various automated testing tasks on each code change. This script is run by both a custom CI system and GitHub Actions to ensure code quality.

The …/run.sh script contains several important functions for continuous integration. The gg_run function runs individual "test cases", executing commands and collecting outputs. This allows running discrete tests like building and running unit tests. The ctest_debug and ctest_release functions handle running cmake/make/ctest to build and test the project in both debug and release configurations. Larger functions like open_llama_3b_v2 download pretrained models, run benchmarks, and test quantization. Helper functions such as gg_sum_ctest_debug summarize output from different test cases.

By encapsulating key tasks in reusable functions, …/run.sh provides an extensible way to continuously validate the codebase. As the custom CI framework scales to different hardware, this script can easily integrate new testing capabilities. Developers are instructed to run it locally first via …/run.sh before publishing changes to validate passing automated checks. Overall the ci directory implements rigorous yet flexible continuous integration for llama.cpp.

Unit Testing

References: tests/test-tokenizer-0-falcon.cpp, tests/test-tokenizer-1-llama.cpp, tests/test-grammar-parser.cpp, tests/test-sampling.cpp, tests/test-opt.cpp, tests/test-quantize-fns.cpp, tests/test-quantize-perf.cpp, tests/test-backend-ops.cpp, tests/test-grad0.cpp, tests/test-rope.cpp

The test suites in the LLama library validate that key functionality operates as expected across different configurations and inputs. They thoroughly test core components like tokenization, sampling algorithms, quantization functions, optimization routines, and more.

The tests directory contains comprehensive unit test code for many critical parts of the framework. Test files like test-tokenizer-0-falcon.cpp and test-tokenizer-1-llama.cpp ensure the tokenizers handle different languages and edge cases correctly. test-sampling.cpp rigorously validates sampling algorithms like top-k and top-p produce expected probability distributions. test-quantize-fns.cpp and test-quantize-perf.cpp check quantization functions preserve accuracy and optimize for throughput.

Key classes tested include the llama_token_data class which represents tokens, and llama_token_data_array for passing token batches to sampling functions. The tests initialize random input data, build graph fragments, extract outputs, and compare results to baseline implementations or expected values within tight tolerances. They validate functionality across data types, dimensions, parameters, and hardware. Auxiliary functions like DUMP() provide debugging aids.

The integration tests in files like test-grammar-parser.cpp and test-backend-ops.cpp check full workflows and components interacting as intended. test-grammar-parser.cpp parses sample grammars and validates the output against well-defined symbol IDs and rules. test-backend-ops.cpp constructs graphs for operations, runs them on different backends, and ensures the outputs match closely between implementations.

Integration Testing

References: tests, CMakeLists.txt

The tests directory contains integration tests that validate different components of the LLama library work together correctly. These tests focus on end-to-end functionality rather than testing individual units in isolation.

One such test is in …/test-sampling.cpp, which validates the sampling algorithms by running them on tokenization outputs. Strings are tokenized into llama_token_data_array objects representing tokens and their probabilities with llama_tokenize(). Sampling algorithms like llama_sample_top_k() and llama_sample_repetition_penalties() are run on the tokenized outputs. The sampled token IDs are compared to expected outputs to validate the full pipeline from text to sampled tokens.

Another integration test is in …/test-tokenizer-1-llama.cpp. It checks that tokenizing strings with llama_tokenize() and detokenizing the results with llama_detokenize_spm() properly round-trips all text. This validates the tokenizer and tokenization/detokenization work end-to-end as expected.
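
The round-trip property being asserted can be sketched with a toy whitespace tokenizer standing in for llama_tokenize() and llama_detokenize_spm():

  // Sketch of the tokenize/detokenize round-trip check, using a toy tokenizer.
  #include <cassert>
  #include <cstddef>
  #include <sstream>
  #include <string>
  #include <vector>

  std::vector<std::string> tokenize(const std::string & text) {
      std::istringstream in(text);
      std::vector<std::string> tokens;
      std::string tok;
      while (in >> tok) tokens.push_back(tok);
      return tokens;
  }

  std::string detokenize(const std::vector<std::string> & tokens) {
      std::string out;
      for (size_t i = 0; i < tokens.size(); ++i) {
          if (i) out += ' ';
          out += tokens[i];
      }
      return out;
  }

  int main() {
      const std::string text = "hello world from llama.cpp";
      // The real test asserts this property for every string in its test set.
      assert(detokenize(tokenize(text)) == text);
      return 0;
  }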

The …/CMakeLists.txt file defines functions like llama_build_and_test_executable() that build test executables, link them to the LLama library, and run the integrated tests defined within. This validates different components like the tokenizer and sampler are linked and function correctly together at runtime.

CI Framework

References: ci, ci/run.sh

The …/run.sh script encapsulates key CI tasks in reusable functions. The gg_run function executes "CI test cases" by running commands via pipes and aggregating outputs and return statuses.

The ctest_debug function handles building and testing via cmake/make/ctest in debug mode. The ctest_release function does the same in release mode.

The open_llama_3b_v2 function downloads a 3B parameter LLM, runs inference on samples, quantizes the model, and executes performance benchmarks. The open_llama_7b_v2 function performs similar tasks for a 7B parameter LLM using CUDA.

By encapsulating tasks in reusable functions, …/run.sh provides an extensible way to run CI at scale. The custom CI framework described in …/README.md monitors the GitHub repository for commits. When it detects a commit, it provisions a cloud instance and runs …/run.sh to execute the full test suite on that hardware. Over time it will incorporate different machine types to test across architectures.

GitHub Actions CI

References: ci

The GitHub Actions CI system runs tests on every commit to the main branch. Workflows defined in the .github/workflows directory use GitHub Actions to build, test, and deploy the code.

The main workflow runs the …/run.sh script to execute the full CI process. This script's gg_run function executes the defined "CI test cases", running commands via pipes/tee and aggregating their outputs and return statuses. This allows running unit tests, integration tests, and benchmark tests to validate changes.

The ctest_debug and ctest_release functions in …/run.sh are particularly important as they build and run the unit test suites with CMake in both debug and release modes. Key files like …/test-sampling.cpp and …/test-quantize-fns.cpp contain unit tests for critical sampling and quantization code. Maintaining a comprehensive suite of unit tests in files like these is crucial for validating changes to the core codebase.