llama.cpp
Auto-generated from ggerganov/llama.cpp by Mutable.ai Auto Wiki

| llama.cpp | |
|---|---|
| Developer | ggerganov |
| Written in | C++ |
| Stars | 57k |
| Watchers | 505 |
| Created | 03/10/2023 |
| Last updated | 04/29/2024 |
| License | MIT |
| Repository | ggerganov/llama.cpp |
| Software Version | p-0.0.4 Premium |
| Generated from | Commit b8c147 |
| Generated at | 04/29/2024 |
The llama.cpp
repository provides a C/C++ implementation for inference with large language models (LLMs), offering a practical solution for engineers seeking to integrate advanced natural language processing capabilities into their applications. It addresses the real-world problem of efficiently running LLMs on various hardware platforms, including both CPUs and GPUs, by providing optimized code for model inference, quantization, and hardware acceleration.
The most significant parts of the repo include the core language model support in llama.cpp
, the model conversion and quantization tools in .devops
, and the server functionality in …/server
. With the largest file count, the examples
directory showcases the versatility of the library through a wide array of example programs and tools that demonstrate text generation, fine-tuning, benchmarking, and multimodal applications.
Key functionalities of the codebase are:
- Support for a wide range of LLMs, including various versions and derivatives of the LLaMA model, as described in the llama.cpp directory.
- A lightweight HTTP server compatible with the OpenAI API, enabling users to serve local models and connect them to existing clients, as detailed in …/server.
- Language bindings for several programming languages, allowing for integration into diverse software ecosystems.
The code relies on several key algorithms and technologies, including:
- Quantization methods to reduce memory usage and improve inference speed, as implemented in the tools.sh script within .devops.
- Hardware acceleration support for multiple platforms, ensuring optimal performance across different devices.
- A grammar parser to constrain the model's output, ensuring that generated text adheres to specified formats or rules.
Key design choices in the code include:
- The use of Docker to create isolated environments for building and running the Llama project, which simplifies the deployment process and ensures consistency across different systems.
- The provision of a command-line interface in the tools.sh script for easy access to common operations such as model conversion and quantization.
- The implementation of a persistent interaction feature, allowing users to save and resume chat sessions across multiple calls to the main program.
For more details on the example programs and tools, refer to the Example Programs and Tools section. Information on model conversion and quantization can be found in the Model Conversion and Quantization section. Details on the server functionality are available in the Server Functionality section.
Language Model Support
References: llama.cpp
The Llama language model is designed to support a variety of large language models, with a framework that accommodates different model architectures. The core implementation focuses on the initialization, training, and deployment of these models, emphasizing modularity and performance.
Example Programs and Tools
References: examples
The examples
directory showcases a variety of example programs and tools that demonstrate the capabilities of the LLaMA language model. These examples serve as practical demonstrations of the model's functionalities, ranging from text generation to fine-tuning and benchmarking.
Model Initialization and Training
Initialization and training of the LLaMA model involve setting up model parameters, training on custom datasets, and saving the trained model. The process starts with defining hyperparameters and initializing the model's tensors. For example, init_model()
and init_model_lora()
initialize the LLaMA and LORA models respectively, creating tensors for embeddings, normalization, and layers. Parameters are marked as trainable using set_param_model()
and set_param_model_lora()
, and randomized with randomize_model()
and randomize_model_lora()
.
Text Generation and Sampling Techniques
References: examples/batched
, examples/beam-search
, examples/infill
The LLaMA model leverages various sampling techniques to generate text that is coherent and contextually relevant. These techniques include top-k, top-p, and temperature sampling, which are integral to the model's ability to produce high-quality language outputs.
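The snippet below is a minimal, illustrative sketch of how temperature, top-k, and top-p sampling can be combined over a vector of logits. It is not the library's own sampling code; all names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical illustration of temperature + top-k + top-p sampling over raw logits.
int sample_token(std::vector<float> logits, float temp, int top_k, float top_p, std::mt19937 &rng) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
    for (float &l : logits) l /= temp;
    // Top-k: keep only the k highest-scoring candidates.
    std::sort(idx.begin(), idx.end(), [&](int a, int b) { return logits[a] > logits[b]; });
    if (top_k > 0 && top_k < (int) idx.size()) idx.resize(top_k);
    // Softmax over the surviving candidates.
    std::vector<float> probs;
    float max_l = logits[idx[0]], sum = 0.0f;
    for (int i : idx) { probs.push_back(std::exp(logits[i] - max_l)); sum += probs.back(); }
    for (float &p : probs) p /= sum;
    // Top-p (nucleus): truncate once cumulative probability reaches top_p.
    float cum = 0.0f; size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); ++i) { cum += probs[i]; if (cum >= top_p) { keep = i + 1; break; } }
    probs.resize(keep); idx.resize(keep);
    // Draw a token from the truncated, renormalized distribution.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}
```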
Multimodal Language Model Support
References: examples/llava
The LLaVA model, encapsulated within the …/llava
directory, is a multimodal language model capable of processing both text and image inputs. It integrates the CLIP model for image understanding, which is essential for the multimodal capabilities of LLaVA.
Language Model Evaluation and Benchmarking
References: examples/benchmark
, examples/llama-bench
Benchmarking the LLaMA language model involves evaluating matrix multiplication performance using the GGML library, as seen in …/benchmark
. The benchmark program in benchmark-matmult.cpp
supports both floating-point (F32) and quantized (Q4_1) data types. Users can specify the number of threads and iterations for the benchmark, allowing for performance tuning and scalability testing.
Language Model Fine-Tuning Techniques
References: examples/finetune
, examples/export-lora
Fine-tuning a pre-trained LLaMA language model is achieved through the application of Low-Rank Adaptation (LoRA) adapters, which are designed to adapt the model to specific tasks or datasets. The process involves the following key steps:
Language Model Server Implementation
References: examples/server
The server-side application located in …/server
provides a REST API for interacting with the LLaMA language model. The server is built using the httplib
library for HTTP server functionality and nlohmann::json
for JSON handling. It integrates with the llama.cpp
library to offer LLM inference capabilities.
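As a rough illustration of this stack (not the actual server.cpp code), a minimal httplib endpoint that parses a JSON request body and returns a JSON response might look like the following; the route name and fields are hypothetical.

```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    httplib::Server svr;

    // Hypothetical completion-style endpoint: read a JSON body, reply with JSON.
    svr.Post("/completion", [](const httplib::Request &req, httplib::Response &res) {
        json body = json::parse(req.body, /*cb=*/nullptr, /*allow_exceptions=*/false);
        std::string prompt = body.value("prompt", "");
        json reply = {{"content", "generated text for: " + prompt}};
        res.set_content(reply.dump(), "application/json");
    });

    // Simple health-check route.
    svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
        res.set_content(R"({"status":"ok"})", "application/json");
    });

    svr.listen("127.0.0.1", 8080);
}
```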
Android and SwiftUI Implementations
References: examples/llama.android
, examples/llama.swiftui
The Android implementation of the LLaMA language model is encapsulated within the …/llama.android
directory, which includes the necessary components for the Android application. The MainActivity
serves as the entry point, setting up the user interface with Jetpack Compose and managing the application's state. It interacts with the MainViewModel
, which orchestrates the UI logic and communication with the LLaMA model for operations such as sending messages and benchmarking.
Model Conversion and Quantization
References: .devops
Conversion and quantization of models within the Llama project are managed using the tools.sh
script located at …/tools.sh
. This script serves as an interface for executing key operations such as model format conversion and model optimization through quantization.
Model Format Conversion
References: .devops/tools.sh
The tools.sh
script located at …/tools.sh
manages the conversion of the Llama model from PyTorch (PTH) format to the GGML format, a necessary step before the model can be quantized or executed within the project's ecosystem. The conversion is initiated through the --convert
or -c
command-line argument, which triggers the execution of the convert.py
script with the provided arguments.
Model Quantization
References: .devops/tools.sh
Quantization in the llama.cpp
codebase is a process designed to optimize the GGML model by reducing its size and potentially increasing inference speed, with a focus on maintaining accuracy. The script …/tools.sh
provides a command-line interface to facilitate this process with the --quantize
or -q
option.
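To illustrate the idea behind quantization, the sketch below stores one float scale per block of weights plus a small integer per weight. It is a simplified stand-in, not one of the actual GGML quantization formats such as Q4_0 or Q4_1.

```cpp
#include <cmath>
#include <cstdint>

// Simplified 4-bit block quantization: one scale per block of 32 weights,
// each weight stored as a signed integer in [-7, 7]. Not the real GGML formats.
struct BlockQ4 {
    float  scale;
    int8_t q[32];  // would be packed two values per byte in a real implementation
};

BlockQ4 quantize_block(const float *x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ4 b;
    b.scale = amax / 7.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) b.q[i] = (int8_t) std::lround(x[i] * inv);
    return b;
}

void dequantize_block(const BlockQ4 &b, float *y) {
    for (int i = 0; i < 32; ++i) y[i] = b.scale * b.q[i];
}
```

The key trade-off this sketch demonstrates is the one discussed above: storage per weight drops from 32 bits to roughly 4 bits plus a shared scale, at the cost of rounding error.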
Combined Conversion and Quantization Workflow
References: .devops/tools.sh
The --all-in-one
operation in …/tools.sh
script automates the workflow of converting a Llama model from PyTorch format to GGML and then applying quantization. This operation is designed to simplify the process for users by combining two distinct steps into a single command execution.
Environment Setup for Conversion and Quantization
References: .devops/full-cuda.Dockerfile
, .devops/full-rocm.Dockerfile
, .devops/full.Dockerfile
, .devops/main-cuda.Dockerfile
, .devops/main-intel.Dockerfile
, .devops/main-rocm.Dockerfile
, .devops/main-vulkan.Dockerfile
, .devops/main.Dockerfile
, .devops/server-cuda.Dockerfile
, .devops/server-intel.Dockerfile
, .devops/server-rocm.Dockerfile
, .devops/server-vulkan.Dockerfile
, .devops/server.Dockerfile
Docker build files such as …/full-cuda.Dockerfile
, …/full-rocm.Dockerfile
, and …/full.Dockerfile
are utilized to create tailored environments for the conversion and quantization of models, accommodating various hardware accelerations like CUDA and ROCm. These environments are essential for ensuring that the conversion and quantization processes can be executed efficiently and are compatible with the target hardware configurations.
Hardware Acceleration Support
References: llama.cpp
Support for hardware acceleration is a critical aspect of the llama.cpp
project, enabling the LLaMA language model to leverage GPUs and CPUs for improved performance. The project includes various tools and scripts that facilitate the use of hardware acceleration:
Docker Environment Setup for Hardware Acceleration
References: .devops/full-cuda.Dockerfile
, .devops/full-rocm.Dockerfile
, .devops/full.Dockerfile
, .devops/main-cuda.Dockerfile
, .devops/main-intel.Dockerfile
, .devops/main-rocm.Dockerfile
, .devops/main-vulkan.Dockerfile
, .devops/main.Dockerfile
, .devops/server-cuda.Dockerfile
, .devops/server-intel.Dockerfile
, .devops/server-rocm.Dockerfile
, .devops/server-vulkan.Dockerfile
, .devops/server.Dockerfile
Docker build files within the .devops
directory are tailored for various hardware acceleration technologies, ensuring the Llama project can leverage the full potential of different hardware platforms. The Dockerfiles are designed to create environments that support CUDA, ROCm, Intel OneAPI, and Vulkan, each catering to specific hardware acceleration needs.
SYCL Support for Intel GPUs
References: examples/sycl
The …/sycl
directory equips developers with tools for leveraging SYCL, a high-level programming model, on Intel GPUs within the llama.cpp
library context. The primary utility, ls-sycl-device
, serves to enumerate SYCL-compatible devices, providing insights into their capabilities such as compute units and memory size.
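A minimal SYCL device query in the spirit of such a listing tool is sketched below. This is not the actual ls-sycl-device source and requires a SYCL compiler such as Intel's icpx to build.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Enumerate every SYCL device visible to the runtime and print basic capabilities.
    for (const auto &dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>()
                  << " | compute units: " << dev.get_info<sycl::info::device::max_compute_units>()
                  << " | global memory: "
                  << dev.get_info<sycl::info::device::global_mem_size>() / (1024 * 1024) << " MiB\n";
    }
}
```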
Server-Side Application with Hardware Acceleration
References: examples/server
The server-side application, located at …/server
, provides a REST API for interacting with the LLaMA language model. It is designed to support LLM inference of F16 and quantized models on both GPU and CPU hardware. The application is built using the httplib
library for HTTP server functionality and the nlohmann::json
library for JSON data handling, ensuring efficient communication and data exchange.
Quantization Techniques for Model Optimization
References: examples/quantize
The quantize.cpp
file within …/quantize
equips users with a command-line interface to apply quantization to pre-trained LLaMA models. The quantization process is crucial for optimizing the model to efficiently run on various hardware platforms by reducing the model size and memory footprint.
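The llama.cpp C API also exposes quantization programmatically. The following is a hedged sketch, not a copy of quantize.cpp; the file names are placeholders and the enum value is one example target type.

```cpp
#include "llama.h"

int main() {
    // Start from default quantization parameters and pick a target type.
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M; // example target quantization type
    params.nthread = 8;

    // Input/output paths are placeholders for an F16 GGUF model and its quantized copy.
    return llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params) == 0 ? 0 : 1;
}
```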
Android Support for LLaMA Model
References: examples/llama.android/app/src/main/cpp
The Android platform integration for the LLaMA model is facilitated through the …/cpp
directory, which houses the C++ source code necessary for the model's operation on Android devices. The implementation ensures that the model can leverage the hardware acceleration features available on these devices for efficient inference.
Server Functionality
References: examples/server
Routes for handling REST API requests are managed by the server.cpp
file, which utilizes the httplib
library to create a lightweight HTTP server. The server is designed to be compatible with the OpenAI API, offering similar endpoints and functionalities. The server's API includes endpoints for health checks, text completion, tokenization, embeddings, and more, as detailed in the REST API Endpoints and Functionality section.
Server Setup and Configuration
References: examples/server/CMakeLists.txt
, examples/server/deps.sh
The server is built using a CMakeLists.txt
file located at …/CMakeLists.txt
, which specifies the target name as server
. During the build process, users can toggle options such as LLAMA_SERVER_VERBOSE
for verbose logging and LLAMA_SERVER_SSL
for enabling SSL support. The build process involves compiling source files including server.cpp
, utils.hpp
, and httplib.h
. Additionally, public asset files like index.html
, index.js
, completion.js
, and json-schema-to-grammar.mjs
are converted into C++ header files and included in the build.
REST API Endpoints and Functionality
References: examples/server/README.md
The server leverages httplib
and nlohmann::json
to provide a RESTful API, interfacing with the llama.cpp
library to deliver a range of language model functionalities. The API endpoints include /health for server status, /completion for text generation, /tokenize and /detokenize for converting between text and tokens, /embedding for embedding extraction, and OpenAI-compatible routes such as /v1/chat/completions.
Server-Side Benchmarking
References: examples/server/bench
, examples/server/bench/README.md
, examples/server/bench/bench.py
, examples/server/bench/script.js
Benchmarking the server-side application involves a set of tools and scripts located in …/bench
. The primary tool used for load testing is k6
, which is extended with the xk6-sse
module to enable server-sent events (SSE) functionality. The benchmarking process is automated by the bench.py
script, which orchestrates the setup, execution, and collection of performance metrics.
Client-Side Web Application
References: examples/server/public
, examples/server/public/completion.js
, examples/server/public/index.html
, examples/server/public/index.js
Interaction with the server for text completions and chat functionality is managed through JavaScript files located in …/public
. The main entry point of the web application is index.html
, which structures the user interface and includes script references to handle the application logic.
Server Testing and Validation
References: examples/server/tests
, examples/server/tests/features
, examples/server/tests/tests.sh
, examples/server/tests/features/environment.py
Behavior-driven development (BDD) tests for the server application are managed through the Behave framework, leveraging the aiohttp
and asyncio
libraries for asynchronous HTTP client functionality. The testing process involves a series of steps defined in Python scripts, which are executed to validate the server's behavior against expected outcomes.
Chat Interaction Scripts
References: examples/server/chat-llama2.sh
, examples/server/chat.sh
Bash scripts …/chat-llama2.sh
and …/chat.sh
facilitate chat-based interactions with an AI assistant. They manage chat prompts, tokenization, and retrieval of AI responses through the server's REST API. Key functionalities include:
Language Bindings
References: llama.cpp
Language bindings in the llama.cpp
project enable interaction with the Llama language model across different programming environments. These bindings facilitate the integration of the Llama model's capabilities into various applications and platforms.
Python Language Binding
References: gguf-py
The gguf-py
library provides a Python interface for working with the GGUF file format, a binary format for storing tensors and associated metadata. The library is structured into several components, each handling different aspects of GGUF file manipulation.
Swift Language Binding
In the Swift implementation of the Llama language model, the LlamaState
class within …/Models
manages the lifecycle of the model, from loading and initialization through text generation and benchmarking. The class provides methods such as loadModelsFromDisk()
, which scans the documents directory for downloaded models, and loadDefaultModels()
, which loads the default model if available or marks it for download.
C++ Language Binding
References: examples/server
, examples/benchmark
, examples/beam-search
, examples/baby-llama
, examples/tokenize
, examples/gguf
, examples/lookup
The server-side application located in …/server
provides a REST API for interacting with the LLaMA language model. The server is built using the httplib
library for HTTP server functionality and nlohmann::json
for JSON data handling. It integrates with the llama.cpp
library to offer LLM inference capabilities.
Android Language Binding
References: examples/llama.android/app/src/main/cpp
The Android language binding for the Llama language model is encapsulated within the …/cpp
directory, providing a bridge between the native C++ codebase and the Android platform. The binding facilitates several key operations:
Docker Support for Language Bindings
References: .devops
Docker build files within the .devops
directory support the construction, deployment, and execution of the Llama language model project, enabling the use of language bindings in a containerized environment. These Dockerfiles create tailored environments for different hardware accelerations and software configurations, ensuring compatibility with various platforms.
Persistent Interaction and Constrained Output
References: llama.cpp
Support for persistent interaction in the Llama language model is achieved through the maintenance of conversation state across multiple requests. This allows the model to provide coherent and contextually relevant responses over the course of an interaction. The server-side application utilizes the llama_ngram_cache
to store and retrieve n-gram frequencies and associated tokens, which aids in efficient text generation and drafting of potential next tokens.
Server-Side Application with Grammar Constraints
References: llama.cpp
The server-side application in the Llama project provides a REST API for interacting with the LLaMA language model. It integrates grammar constraints to shape the model's output, ensuring that the generated text adheres to predefined grammatical rules or JSON schema formats. The application leverages two key components for parsing and applying these constraints: the extended BNF grammar parser and the JSON-schema-to-grammar converter described in the Grammar and Schema Parsing section.
Chat-Based Interaction with Grammar Constraints
References: examples/server/chat-llama2.sh
, examples/server/chat.sh
The scripts …/chat-llama2.sh
and …/chat.sh
facilitate chat-based interactions with an AI assistant, handling the conversation flow and applying grammar constraints to ensure coherent dialogue. These scripts interact with a server-side API to process user input and generate AI responses.
Continuous Interaction and Batching Support
References: examples/server
The server application located at …/server
supports continuous interaction with the LLaMA language model, enabling a persistent conversation state across multiple requests. This is achieved through the following mechanisms:
Android and Docker Support
References: llama.cpp
The Llama project facilitates Android device support through a dedicated application development process, integrating the Llama language model into the Android platform. The core functionality is encapsulated within the Android application's main activity, view model, and user interface components, which are detailed in the Android Application Development subsection.
Android Application Development
References: examples/llama.android/app/src/main/java/com/example/llama
, examples/llama.android/app/src/main/cpp
The MainActivity
class serves as the entry point for the Llama Android application, orchestrating the user interface and interactions. It utilizes Jetpack Compose to render the UI, which includes a text input field for user messages, buttons for actions like sending messages and benchmarking, and a display for the message history. The activity also manages downloads of machine learning models, providing a list of downloadable items and initiating download processes through the DownloadManager
.
Docker Build Environment
References: .devops
Docker build files located in .devops
are essential for setting up various environments tailored to the needs of the Llama language model project. These environments cater to different hardware acceleration technologies such as CUDA, ROCm, Intel OneAPI, and Vulkan, enabling the project to leverage specific features and optimizations offered by each platform.
Android C++ Integration
References: examples/llama.android/app/src/main/cpp
The …/cpp
directory integrates the Llama language model with the Android platform, enabling the model's core functionalities such as loading, context management, and text completion to be utilized within Android applications.
Android UI and Theme Customization
The Llama Android application's user interface is defined by a theme that includes a color scheme, dynamic color theming, and default text styles. The theme customization is managed through several Kotlin files located in the …/theme
directory.
Android Model Download and Management
The Downloadable
data class in …/Downloadable.kt
encapsulates the properties of downloadable items, including their name, source URI, and destination file path. It also defines the various states of a download, such as Ready
, Downloading
, Downloaded
, and Error
, which are essential for tracking the progress and status of downloads within the Llama Android application.
Docker Build Files for Android Support
In the "LlamaAndroid" project, the build environment is established through the use of Docker build files, which are crucial for defining the overall structure and settings for Android development. The …/build.gradle.kts
file is responsible for setting up the necessary plugins, including the Android application plugin and the Kotlin Android plugin. These plugins are essential for configuring the build process and enabling Kotlin support for the Android application, respectively.
Continuous Integration Setup
References: ci
The Continuous Integration (CI) setup for the Llama project is encapsulated within the ci
directory, primarily driven by the run.sh
script. This setup is crucial for validating the project's functionality across various hardware configurations, ensuring that the codebase remains stable and performs as expected on different platforms.
Example Programs and Tools
References: examples
The …/simple
directory showcases the tokenization capabilities of the LLaMA model. The tokenize
executable, built from tokenize.cpp
, accepts a model path and a prompt to tokenize the input text. It demonstrates the initialization of the LLaMA backend, model loading, context creation, and the use of llama_tokenize()
to output tokenized text.
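A condensed sketch of that flow using the llama.cpp C API is shown below. Function signatures drift between revisions, so treat this as illustrative rather than a copy of tokenize.cpp; the model path and prompt are placeholders.

```cpp
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char **argv) {
    const char *model_path   = argc > 1 ? argv[1] : "model.gguf";
    const std::string prompt = argc > 2 ? argv[2] : "Hello world";

    llama_backend_init();
    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_load_model_from_file(model_path, mparams);
    if (!model) return 1;

    // Tokenize the prompt; the exact signature at a given commit may differ slightly.
    std::vector<llama_token> tokens(prompt.size() + 8);
    int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                           tokens.data(), (int) tokens.size(),
                           /*add_special=*/true, /*parse_special=*/false);
    tokens.resize(n > 0 ? n : 0);
    for (llama_token t : tokens) printf("%d\n", t);

    llama_free_model(model);
    llama_backend_end();
    return 0;
}
```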
Language Model Examples
References: examples/baby-llama
, examples/batched
, examples/batched-bench
, examples/batched.swift
, examples/beam-search
, examples/benchmark
, examples/convert-llama2c-to-ggml
, examples/embedding
, examples/eval-callback
, examples/export-lora
, examples/finetune
, examples/gbnf-validator
, examples/gguf
, examples/gguf-split
, examples/gritlm
, examples/imatrix
, examples/infill
, examples/jeopardy
, examples/llama-bench
, examples/llama.android
, examples/llama.swiftui
, examples/llava
, examples/lookahead
, examples/lookup
, examples/main
, examples/main-cmake-pkg
, examples/parallel
, examples/passkey
The …/baby-llama
directory showcases the initialization and training of the LLaMA language model. Key functions include init_model()
for setting up model parameters and forward()
for computing output logits. The sample_softmax()
function demonstrates text generation capabilities by sampling from the model's output distribution.
Server-Side Language Model Examples
Routes for the server-side application are managed by the server.cpp
file, which utilizes the httplib
library to handle HTTP requests and the nlohmann::json
library for JSON data manipulation. The server integrates with the llama.cpp
library to provide LLM inference functionality.
Multimodal Language Model Examples
References: examples/llava
, examples/llava/android
The LLaVA model, encapsulated in …/llava
, integrates text and image inputs to provide multimodal language model capabilities. It leverages the CLIP model for image understanding, which is detailed in clip.cpp
and clip.h
, and combines it with the language understanding of the LLAMA model.
Language Model Fine-Tuning and Export
References: examples/finetune
, examples/export-lora
Fine-tuning a pre-trained LLaMA language model leverages the Low-Rank Adaptation (LoRA) technique, enabling the model to adapt to specific tasks or datasets while maintaining the original model's parameters largely unchanged. The …/finetune
directory contains the necessary scripts and code for this process.
Text Generation and Decoding Techniques
References: examples/batched
, examples/beam-search
, examples/lookahead
, examples/lookup
Batched text generation in …/batched
utilizes parallel processing to generate multiple sequences of text simultaneously. The process involves:
Benchmarking and Performance Evaluation
References: examples/benchmark
, examples/llama-bench
, examples/batched-bench
Benchmarking tools within the LLaMA project are utilized to assess the performance of the language model across various computational tasks. The …/benchmark
directory houses a program specifically designed to benchmark matrix multiplication operations, a fundamental operation in neural network computations. The program, benchmark-matmult.cpp
, measures the performance in GFLOPS (Giga Floating-Point Operations per Second) for both floating-point (F32) and quantized (Q4_1) data types, providing insights into the efficiency of the model's underlying numerical computations.
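As a general convention for such benchmarks (not a detail taken from the source), throughput for an M×K by K×N matrix multiplication is computed as GFLOPS = 2·M·N·K / (elapsed seconds × 10⁹), counting one multiply and one add per inner-product term.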
Language Model Utility Tools
The …/convert-llama2c-to-ggml
directory hosts a tool for converting language models from the llama2.c
project to the ggml
format. This conversion is essential for compatibility with the ggml
library, which is widely used across the Llama project. The tool operates by reading the weights from a llama2.c
model and saving them in a ggml
-compatible format. It defaults to using the vocabulary from /models/ggml-vocab.bin
but allows for custom vocabularies via command-line arguments.
Language Model Interaction and Embedding
References: examples/infill
, examples/passkey
, examples/embedding
Interactive text generation in the LLaMA language model is facilitated through modes like infill
and passkey
. In the …/infill
directory, the infill
mode enables users to provide a prefix and suffix, with the model generating text to fill the gap. The infill.cpp
program manages this process by:
Language Model Applications and Extensions
The LLaMA language model extends its capabilities through various applications, including mobile platforms and knowledge testing tools. The …/llama.android
directory provides an Android application implementation, enabling the use of the language model on Android devices. Key components include the Llm
class for managing the model's lifecycle and the MainActivity
class for user interaction. The application's UI is built using Jetpack Compose, with theme customization handled in the …/theme
directory.
Test Suite
References: tests
The test suite for the Llama language model validates the core components essential for the model's operation. The suite includes tests for the automatic release of resources, ensuring that the model and context are properly freed after use, as demonstrated by test-autorelease.cpp
. It also verifies the accuracy of type conversions, particularly from double
to float
, to maintain computational precision as seen in test-double-float.cpp
.
Tokenization Tests
References: tests/test-tokenizer-0.cpp
, tests/test-tokenizer-0-bpe.py
, tests/test-tokenizer-1-bpe.cpp
, tests/test-tokenizer-0-spm.py
, tests/test-tokenizer-1-spm.cpp
Tokenization and detokenization are critical components of the Llama language model, ensuring that input strings are correctly converted into tokens that the model can process, and that these tokens can be converted back into human-readable text. The Llama project includes a suite of tests to validate the functionality of its tokenizers, specifically focusing on Byte-Pair Encoding (BPE) and Sentencepiece (SPM) algorithms.
Quantization Tests
References: tests/test-quantize-fns.cpp
, tests/test-quantize-perf.cpp
Unit tests for the GGML library's quantization functions are encapsulated in …/test-quantize-fns.cpp
. These tests validate the accuracy of quantize
and dequantize
operations and ensure that the dot product computations adhere to predefined error thresholds. The tests involve:
Sampling Tests
References: tests/test-sampling.cpp
Unit tests for the sampling functionality within the Llama language model are encapsulated in …/test-sampling.cpp
. These tests validate the robustness of various sampling techniques integral to text generation, such as top-k, top-p, and temperature sampling.
Grammar Parsing Tests
The test-grammar-parser.cpp
file validates the grammar_parser::parse()
function, which is crucial for parsing grammar specifications into a structured format that the Llama language model can utilize. The tests ensure that given a grammar specification string, the function returns a parse_state
object with accurate symbol IDs and grammar rules. The parse_state
object contains a map of symbol names to their unique IDs (symbol_ids
) and a vector of grammar rules (rules
), where each rule is represented as a sequence of llama_grammar_element
structs.
Optimization Tests
References: tests/test-opt.cpp
The test-opt.cpp
file validates the Adam optimization algorithm within the ggml
library by ensuring it effectively minimizes a defined objective function. The test involves creating three random tensors and using them to form an objective function that represents the sum of squared differences between the result of a matrix multiplication and a target tensor. The steps taken in the test are:
Miscellaneous Tests
References: tests/test-autorelease.cpp
, tests/test-double-float.cpp
, tests/test-rope.cpp
, tests/test-chat-template.cpp
, tests/test-c.c
The test-autorelease.cpp
ensures the proper release of resources within a multithreaded environment. It involves the following key actions:
Utility Scripts
References: scripts
Utility scripts within the scripts
directory facilitate a range of operations crucial for the maintenance and functionality of the llama.cpp project. These scripts automate tasks such as downloading datasets, checking script requirements, generating author lists, and deploying servers.
Model Conversion and Quantization Utilities
The …/convert-gg.sh
script automates the conversion of pre-trained language models into the GGML format, a requisite step for utilizing these models within the LLaMA framework. The script invokes convert.py
and convert-falcon-hf-to-gguf.py
for different model types, including LLaMA v1, LLaMA v2, CodeLlama, and Falcon models. The output models are stored in the models
directory with filenames indicative of their version and precision, such as f16
for 16-bit floating-point representation.
Performance Testing and Benchmarking Utilities
References: scripts/compare-commits.sh
, scripts/run-all-perf.sh
, scripts/run-all-ppl.sh
, scripts/compare-llama-bench.py
The …/compare-commits.sh
script automates the comparison of llama-bench
performance across two different commits. It executes the benchmarking tool for each commit, storing results in an SQLite database, and then leverages …/compare-llama-bench.py
to generate a performance comparison table. This utility is crucial for assessing the impact of code changes on model performance.
Dataset Management Utilities
References: scripts/get-hellaswag.sh
, scripts/get-wikitext-103.sh
, scripts/get-wikitext-2.sh
, scripts/get-winogrande.sh
The dataset management utilities within the scripts
directory facilitate the acquisition and preparation of various datasets for language modeling tasks. These scripts automate the process of downloading datasets from external sources and provide usage examples for the perplexity
command, which is used to evaluate the performance of language models on these datasets.
Build and Deployment Utilities
References: scripts/build-info.sh
, scripts/ci-run.sh
, scripts/hf.sh
, scripts/server-llm.sh
The …/build-info.sh
script generates build information as C preprocessor macros, which are then used to provide version and build details within the LLAMA project. It retrieves the build number and commit hash from the Git repository, compiler information, and target platform using the $CC
command. The output includes macros like LLAMA_BUILD_NUMBER
, LLAMA_COMMIT
, LLAMA_COMPILER
, and LLAMA_BUILD_TARGET
.
Synchronization and Maintenance Utilities
References: scripts/sync-ggml.sh
, scripts/sync-ggml-am.sh
The sync-ggml.sh
script automates the process of updating the current project with the latest changes from the ggml
directory. It performs the following tasks:
Miscellaneous Utility Scripts
The …/check-requirements.sh
script automates the validation of Python dependencies for the conversion scripts within the llama.cpp
project. It performs checks in isolated virtual environments to ensure that each script can be imported without ImportError
s. The script also verifies the inclusion of sub-requirements in the top-level requirements.txt
and warns against the pinning of exact release versions in requirement files. It supports optional arguments for specifying a working directory and disabling cleanup of temporary files.
Python Library for GGUF
References: gguf-py
The Python library for GGUF provides a suite of tools for working with GGUF, a binary format designed for storing tensors and related metadata. The library is structured to support the creation, manipulation, and reading of GGUF files, which are essential for machine learning model data interchange.
Core GGUF Package Functionality
References: gguf-py/gguf
The gguf
package provides a set of classes and methods for interacting with GGUF files, which are used to store and manage tensor data and metadata for machine learning models. The package includes the GGUFReader
and GGUFWriter
classes for reading and writing GGUF files, respectively. The TensorNameMap
class manages the mapping between tensor names and model tensors, facilitating the identification and manipulation of tensors within a model's architecture.
GGUF Utility Scripts
References: gguf-py/scripts
The …/scripts
directory equips users with a suite of Python scripts to handle various operations on GGUF files. These operations include endian conversion, metadata inspection and modification, and the creation of new GGUF files with updated metadata.
GGUF Examples and Usage
References: gguf-py/examples
The …/examples
directory provides practical examples of how to interact with GGUF files using the GGUFReader
and GGUFWriter
classes. These examples serve as a guide for users to understand the process of reading from and writing to GGUF files, which are used for storing and sharing data in the Generalized Graph Universal Format.
GGUF Testing Framework
References: gguf-py/tests
The …/tests
directory is designated for housing the test suite of the gguf-py
project, with the current focus on the gguf
module. The directory contains a single test file, …/test_gguf.py
, which is expected to contain the test cases for the gguf
module. As of now, the file includes a placeholder test function, test_write_gguf()
, which suggests the intention to test the writing capabilities of the gguf
module in the future.
Common Utilities
References: common
The gpt_params_parse_ex()
and gpt_params_parse()
functions handle the parsing of command-line arguments, populating the gpt_params
structure with model paths, prompts, and sampling parameters. These functions are critical for initializing the Llama language model with user-specified configurations. The gpt_print_usage()
function provides users with guidance on the available command-line options.
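A minimal sketch of how an example program typically consumes these helpers is shown below, assuming the common.h header from the project's common library; the printed fields are chosen for illustration.

```cpp
#include "common.h"
#include <cstdio>

int main(int argc, char **argv) {
    gpt_params params;
    // Populate params from command-line flags; usage help is printed on failure.
    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }
    fprintf(stderr, "model:  %s\n", params.model.c_str());
    fprintf(stderr, "prompt: %s\n", params.prompt.c_str());
    return 0;
}
```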
Console Interface
References: common/console.cpp
, common/console.h
The cross-platform console interface utilities facilitate user interaction with the console across different operating systems. The interface handles initialization, cleanup, display management, and readline functionality, abstracting platform-specific details to provide a unified experience.
Grammar and Schema Parsing
References: common/grammar-parser.cpp
, common/grammar-parser.h
, common/json-schema-to-grammar.cpp
, common/json-schema-to-grammar.h
The grammar-parser.cpp
and grammar-parser.h
files provide the functionality to parse extended BNF grammars. The parser assigns unique identifiers to each symbol for efficient management and uses recursive functions like parse_sequence()
and parse_alternates()
to handle nested structures and repetition operators. Error handling is robust, with informative messages for parsing issues, and a print_grammar()
function is available for outputting the grammar in a readable format.
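A small sketch of driving the parser with an inline GBNF grammar, assuming the grammar_parser namespace from common/grammar-parser.h; member names follow the description above.

```cpp
#include "grammar-parser.h"
#include <cstdio>

int main() {
    // A trivial GBNF grammar: the root symbol expands to either "yes" or "no".
    const char *gbnf = "root ::= \"yes\" | \"no\"";

    grammar_parser::parse_state state = grammar_parser::parse(gbnf);

    // symbol_ids maps symbol names to ids; rules holds one element sequence per rule.
    printf("symbols: %zu, rules: %zu\n", state.symbol_ids.size(), state.rules.size());
    grammar_parser::print_grammar(stdout, state);
    return 0;
}
```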
Logging System
References: common/log.h
The Llama project incorporates a logging system that facilitates the tracking of events and messages during execution. The system provides flexibility in logging output, allowing messages to be directed to either files or the console. Developers can control the verbosity and detail of the logs through various macros and runtime functions.
N-gram Cache
References: common/ngram-cache.cpp
, common/ngram-cache.h
The n-gram cache, implemented in …/ngram-cache.cpp
, optimizes text generation by storing and retrieving n-gram frequencies and their associated tokens. The cache is a critical component for drafting potential next tokens based on the statistical likelihood of n-gram sequences appearing in the language model's training data.
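Conceptually, such a cache maps a short token context to counts of the tokens that followed it. The simplified stand-in below illustrates the idea only; it is not the actual llama_ngram_cache type.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using token_t = int32_t;

// Simplified n-gram cache: last-N-token context -> frequency of each successor token.
struct NgramCache {
    std::map<std::vector<token_t>, std::map<token_t, int>> counts;

    // Record every (context, successor) pair of length n observed in a token stream.
    void update(const std::vector<token_t> &tokens, size_t n) {
        for (size_t i = 0; i + n < tokens.size(); ++i) {
            std::vector<token_t> ctx(tokens.begin() + i, tokens.begin() + i + n);
            counts[ctx][tokens[i + n]]++;
        }
    }

    // Draft the most frequent successor seen for this context, or -1 if unknown.
    token_t draft(const std::vector<token_t> &ctx) const {
        auto it = counts.find(ctx);
        if (it == counts.end()) return -1;
        token_t best = -1; int best_count = 0;
        for (const auto &[tok, cnt] : it->second) {
            if (cnt > best_count) { best = tok; best_count = cnt; }
        }
        return best;
    }
};
```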
Sampling Module
References: common/sampling.cpp
, common/sampling.h
The llama_sampling_context
struct manages the state of the sampling process, which includes a history of previously sampled tokens and parameters that influence the selection of the next token. The llama_sampling_params
struct within …/sampling.h
encapsulates these parameters, such as the number of tokens to remember, the temperature for sampling, and the specific technique to be used, like Top-K or Top-P.
Training Utilities
References: common/train.cpp
, common/train.h
In …/train.cpp
, the training process is managed through a series of functions that handle the initialization, randomization, and optimization of training states. The train_state
struct is pivotal, encapsulating the optimization context and training iterations, as well as the random number generator state for sample shuffling. The init_train_state()
and free_train_state()
functions are responsible for setting up and tearing down this structure.
Prompts and Instructions
References: prompts
The prompts
directory serves as a repository for textual content designed to facilitate interaction with AI models and assistants. It includes a variety of text files that range from instructional content to conversational examples, each tailored to specific AI functionalities or scenarios.
Smart Home Assistant Implementation
References: prompts/assistant.txt
The smart home assistant implemented in …/assistant.txt
processes JSON-formatted requests to manage a variety of smart home devices. The assistant supports four request categories: "command", "query", "answer", and "clarify", each corresponding to a different type of interaction with the smart home environment.
AI Model Interaction Examples
References: prompts/chat-with-baichuan.txt
, prompts/chat-with-bob.txt
, prompts/chat-with-qwen.txt
, prompts/chat-with-vicuna-v0.txt
, prompts/chat-with-vicuna-v1.txt
, prompts/chat.txt
The AI model interaction examples within the repository demonstrate the conversational capabilities of AI assistants through a series of text-based transcripts. These examples serve as templates or scripts for expected interactions between users and AI models, showcasing the models' natural language understanding and generation abilities. The transcripts are found in files such as …/chat-with-baichuan.txt
, …/chat-with-bob.txt
, …/chat-with-qwen.txt
, …/chat-with-vicuna-v0.txt
, …/chat-with-vicuna-v1.txt
, and …/chat.txt
.
Advanced AI Model Capabilities
References: prompts/dan-modified.txt
, prompts/dan.txt
The "DAN" AI model, as described in …/dan-modified.txt
and …/dan.txt
, is designed to simulate an advanced AI with capabilities beyond the standard constraints of typical AI models. Key features of DAN include:
Large Language Model (LLM) Concepts
References: prompts/LLM-questions.txt
The file …/LLM-questions.txt
serves as a resource for understanding key machine learning concepts associated with Large Language Models (LLMs). It contains a curated set of questions that probe into various aspects of LLMs, from foundational elements to advanced mechanisms. These questions are instrumental in guiding users through the complexities of LLMs, offering insights into how these models process and generate language.
Mnemonics for Language Learning
References: prompts/mnemonics.txt
The file …/mnemonics.txt
serves as an educational resource within the codebase, offering Markdown-formatted mnemonics to aid in the learning of kanji characters. The mnemonics are designed to facilitate memory retention by associating each kanji with keywords derived from its components. This approach leverages the cognitive strategy of creating vivid and associative mental images to enhance recall, a technique that is particularly useful for characters that are complex or have abstract meanings.
Thought-Provoking Questions for AI Discussion
References: prompts/parallel-questions.txt
The file located at …/parallel-questions.txt
serves as a repository of diverse and thought-provoking questions designed to engage AI models in deep and wide-ranging discussions. These questions span a multitude of subjects, from the whimsical to the profound, challenging the AI's capacity for understanding and generating responses across various domains. The inclusion of such a broad spectrum of topics reflects the intended versatility of AI models in simulating human-like conversational abilities and analytical thinking.
AI Reasoning and Action Loop
References: prompts/reason-act.txt
The AI system in …/reason-act.txt
operates on a loop consisting of Thought, Action, and Observation steps to process and respond to questions. This loop models the AI's reasoning and response generation process:
Development and Operations
References: .devops
Docker build files within .devops
serve as the backbone for setting up various environments tailored to the Llama language model project's needs. These environments enable the project to be built, deployed, and executed across different hardware and software configurations, with specialized support for CUDA, ROCm, Intel OneAPI, and Vulkan. The Dockerfiles are designed to create isolated and reproducible build environments that encapsulate all the necessary dependencies and configurations required for the project.
Docker Build Environment Setup
References: .devops/full-cuda.Dockerfile
, .devops/full-rocm.Dockerfile
, .devops/full.Dockerfile
, .devops/main-cuda.Dockerfile
, .devops/main-intel.Dockerfile
, .devops/main-rocm.Dockerfile
, .devops/main-vulkan.Dockerfile
, .devops/main.Dockerfile
, .devops/server-cuda.Dockerfile
, .devops/server-intel.Dockerfile
, .devops/server-rocm.Dockerfile
, .devops/server-vulkan.Dockerfile
, .devops/server.Dockerfile
Docker build files are utilized to create consistent environments for building, deploying, and executing the Llama language model project. These environments are tailored to support various computational backends such as CUDA, ROCm, Intel OneAPI, and Vulkan, which are essential for leveraging different hardware acceleration capabilities.
DevOps Tooling and Automation
References: .devops/tools.sh
The …/tools.sh
script acts as a command-line utility facilitating several operations for the Llama language model. It provides a unified interface for tasks such as model conversion, quantization, execution, finetuning, and server deployment. The script interprets command-line arguments to trigger specific functionalities:
Multimodal Language Model Implementation
References: examples/llava
The LLaVA model facilitates the integration of visual data with language processing, enabling the model to handle both text and image inputs. The model leverages the capabilities of the CLIP (Contrastive Language-Image Pre-training) model to process images, which is then combined with the language understanding capabilities of the LLaMA framework.
LLaVA Model Core Implementation
References: examples/llava/llava.cpp
, examples/llava/llava.h
The llava.cpp
and llava.h
files provide the core functionality for the LLaVA model, which is designed to create and evaluate image embeddings within the LLAMA framework. The LLaVA model leverages the CLIP model's capabilities to process images and integrate them with textual data, enabling a multimodal approach to language modeling.
CLIP Integration with LLaVA
References: examples/llava/clip.cpp
, examples/llava/clip.h
Integration of the CLIP model with the LLaVA framework is achieved through the clip_model_load()
function, which initializes the clip_ctx
struct by loading model parameters from a GGUF file located at …/clip.cpp
. The function sets up the model context by checking for the presence of text and vision encoders and loading the necessary weights and biases for the vision model. It also accommodates various projector types, such as MLP and LDP, crucial for the multimodal capabilities of LLaVA.
LLaVA Command-Line Interface
References: examples/llava/llava-cli.cpp
The llava-cli
executable serves as the interface for interacting with the LLaVA model, which combines the capabilities of the LLAMA language model with vision abilities through CLIP integration. Users can input prompts that may include images, and the executable will generate responses accordingly. Key functionalities include:
LLaVA Model Conversion Scripts
References: examples/llava/llava-surgery-v2.py
, examples/llava/llava-surgery.py
, examples/llava/convert-image-encoder-to-gguf.py
The llava-surgery-v2.py
and llava-surgery.py
scripts serve as tools for preparing the LLaVA model components for conversion to the LLaMA GGUF format. They perform tasks such as cleaning the vision tower from a checkpoint, extracting multimodal projector tensors, and handling the added_tokens.json
file.
LLaVA Android Integration
Integration of the LLaVA model with Android devices leverages scripts to manage the deployment and execution process. The …/adb_run.sh
script automates the interaction with an Android device using the Android Debug Bridge (ADB) tool. The script performs several key operations:
Building LLaVA with CMake
References: examples/llava/CMakeLists.txt
The build process for the llava
library and the llava-cli
executable is managed by the CMakeLists.txt
file located at …/CMakeLists.txt
. The llava
library is compiled as both an object library and a static library, with the option to build as a shared library if BUILD_SHARED_LIBS
is set. The llava
library is linked with the ggml
and llama
libraries, ensuring necessary functionalities are included.
LLaVA Model Documentation
The LLaVA model documentation, located at …/README.md
, guides users through the setup and usage of the LLaVA (Large Language and Vision Assistant) model, a multimodal language model capable of processing both text and image inputs. The documentation includes instructions for obtaining pre-converted models, building and running the llava-cli
command-line interface, and converting models to the GGUF format. It addresses two versions of the LLaVA model, LLaVA-v1.5 and LLaVA-v1.6, highlighting differences such as context length requirements and prompt templating for non-Vicuna models.
Android Application Implementation
References: examples/llama.android
The Llama Android application is orchestrated through the MainActivity
class, which serves as the entry point and orchestrates the user interface using Jetpack Compose. It manages the application's state, including the display of downloadable machine learning models and the provision of user interaction elements such as text input fields and action buttons. The MainActivity
class also logs device and application information, contributing to a robust user experience.
Android C++ Core
References: examples/llama.android/app/src/main/cpp
In …/cpp
, the C++ core of the Llama Android application handles several critical operations to facilitate interaction with the Llama language model on Android devices. The core functionalities include:
Android Kotlin Components
Lifecycle management of the Llama language model on Android devices is centralized within the Llm
class located at …/Llm.kt
. This class is responsible for loading and unloading the model, as well as initiating benchmarking and text completion tasks. It interfaces with native C++ code to perform these operations, ensuring that resource management is handled efficiently and that the model's state is maintained correctly across different threads.
Android UI and Composables
References: examples/llama.android/app/src/main/java/com/example/llama/ui/theme
, examples/llama.android/app/src/main/java/com/example/llama/Downloadable.kt
Jetpack Compose is utilized in …/theme
to implement the user interface of the Llama Android application. The UI adapts to system settings for light and dark modes and supports dynamic color theming on devices running Android 12 and above. The theme customization is managed through the LlamaAndroidTheme
function, which applies the appropriate color scheme and updates the status bar appearance.
Android Build Configuration
References: examples/llama.android/app/build.gradle.kts
, examples/llama.android/build.gradle.kts
, examples/llama.android/settings.gradle.kts
Gradle Kotlin scripts serve as the backbone for configuring the Llama Android application's build process. The scripts define the application's SDK requirements, build types, and dependencies, ensuring compatibility with the Android platform and facilitating the use of Kotlin as the primary programming language.
Batched Text Generation
References: examples/batched
Leveraging the llama.cpp
library, the batched text generation process is designed to initialize and utilize a large language model (LLM) for generating multiple text sequences in parallel. The workflow includes:
GGUF Library Usage
References: examples/gguf
The gguf
library is utilized to demonstrate the writing and reading of GGUF data files in …/gguf.cpp
. The file includes functions like gguf_ex_write
, gguf_ex_read_0
, and gguf_ex_read_1
to showcase these capabilities.
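A hedged sketch of a similar write/read round trip through the C-level GGUF API exposed by ggml is shown below; the key names and file path are placeholders, and the exact API surface may differ between revisions.

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // Write: create an empty GGUF context, attach some metadata, and save it.
    struct gguf_context *ctx_out = gguf_init_empty();
    gguf_set_val_str(ctx_out, "example.author", "demo");
    gguf_set_val_u32(ctx_out, "example.version", 1);
    gguf_write_to_file(ctx_out, "example.gguf", /*only_meta=*/true);
    gguf_free(ctx_out);

    // Read: reopen the file and enumerate the stored key/value pairs.
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context *ctx_in = gguf_init_from_file("example.gguf", params);
    for (int i = 0; i < gguf_get_n_kv(ctx_in); ++i) {
        printf("kv[%d]: %s\n", i, gguf_get_key(ctx_in, i));
    }
    gguf_free(ctx_in);
    return 0;
}
```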
Lookup Cache Functionality
References: examples/lookup
The llama_ngram_cache
component enhances language model performance by caching n-gram lookups, which are sequences of tokens used to predict subsequent text. This caching mechanism reduces the need for repeated and computationally expensive n-gram computations during text generation.
Cache Creation and Management
References: examples/lookup/lookup-create.cpp
The llama_ngram_cache
component is utilized to create a lookup cache, enhancing the performance of language models by minimizing the frequency of computationally expensive n-gram lookups. The cache is populated with tokenized prompts through the llama_ngram_cache_update()
function, which accepts the tokens, cache sizes, and an update flag. Once updated, the cache is persisted to disk using llama_ngram_cache_save()
, specifying the output file path. This process involves several steps:
Cache Merging
References: examples/lookup/lookup-merge.cpp
Merging multiple lookup cache files into a single file is facilitated by the lookup-merge.cpp
program, which takes a list of input cache files and combines them into one output cache file. The process involves:
Cache Performance Analysis
References: examples/lookup/lookup-stats.cpp
The lookup-stats.cpp
program serves as a performance analysis tool for the token lookup functionality within the Llama language model. It operates by loading a language model, tokenizing an input prompt, and simulating text generation through drafting and accepting tokens. The program is designed to collect and report key performance metrics, which are crucial for evaluating and optimizing the model's efficiency.
Lookup Cache Utilization in Inference
References: examples/lookup/lookup.cpp
Utilizing llama_ngram_cache
during language model inference is demonstrated in …/lookup.cpp
, which showcases the process of token generation and n-gram caching. The example highlights the integration of n-gram caches within the inference workflow, emphasizing their role in improving the efficiency and performance of token prediction.
Python Requirements for Conversion
References: requirements
The llama.cpp
project utilizes a set of Python dependencies to facilitate the conversion of various language models to different formats. These dependencies are essential for numerical operations, text tokenization, model handling, and serialization of structured data. The conversion processes are supported by libraries such as numpy
for array manipulations, sentencepiece
for tokenization, transformers
for handling pre-trained models, gguf
for working with the GGUF format, and protobuf
for data serialization.
General Conversion Dependencies
References: requirements/requirements-convert.txt
The llama.cpp
project relies on a set of Python libraries to facilitate the conversion of language models and data formats. The dependencies are specified in …/requirements-convert.txt
and include:
Hugging Face to GGUF Conversion Dependencies
References: requirements/requirements-convert-hf-to-gguf.txt
, requirements/requirements-convert-hf-to-gguf-update.txt
For the conversion of Hugging Face models to the GGUF format, two Python dependencies are crucial: torch
and einops
. The standard requirement for torch
is any version compatible with 2.1.1, as indicated by the version constraint in …/requirements-convert-hf-to-gguf.txt
. Similarly, einops
is required to be any compatible version with 0.7.0, denoted by the same constraints in the requirements file.
LLAMA GGML to GGUF Conversion Dependencies
The conversion of LLAMA GGML models to GGUF format relies on a set of Python libraries detailed in …/requirements-convert-llama-ggml-to-gguf.txt
. This file itself points to another requirements file, …/requirements-convert.txt
, which likely contains the actual list of dependencies. The dependencies specified are essential for the conversion process, ensuring compatibility and functionality when transitioning between these two model formats. The conversion process is a critical step for model deployment and interoperability within different parts of the Llama project.
LORA to GGML Conversion Dependencies
The conversion of LORA models to the GGML format relies on the PyTorch library, specifically version 2.1.1
as indicated in …/requirements-convert-lora-to-ggml.txt
. PyTorch provides the necessary functionalities for handling tensor operations which are essential during the conversion process. The conversion likely involves manipulating the model's weights and parameters, which are typically stored as tensors in PyTorch.
Persimmon to GGUF Conversion Dependencies
Conversion of Persimmon models to the GGUF format relies on the Python library torch
. The specific version required for this conversion process is 2.1.1
or a compatible version, as indicated in the requirements file located at …/requirements-convert-persimmon-to-gguf.txt
. This dependency is crucial for handling the neural network operations that are part of the Persimmon model's architecture.
GGML and SPM Headers
References: spm-headers
The spm-headers
directory serves as the nexus for the foundational components of the GGML and the Llama language model. It encapsulates the essential data structures and interfaces required for tensor manipulation, memory management, hardware abstraction, and model interaction.
GGML Core Functionality
References: spm-headers/ggml.h
The ggml.h
header file is central to the GGML library, providing the necessary components for tensor manipulation and operations. The ggml_tensor
struct is the primary data structure, representing multi-dimensional arrays essential for numerical computations in machine learning. Core tensor operations facilitated by the library include addition (ggml_add()
), multiplication (ggml_mul()
), summation (ggml_sum()
), and transposition (ggml_transpose()
), which are fundamental to neural network computations.
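A small, hedged sketch of the usage pattern these primitives enable is shown below; the buffer size and thread count are arbitrary choices, and the structure of ggml_init_params may vary between revisions.

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // Create a context backed by a fixed scratch buffer.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context *ctx = ggml_init(params);

    // Two 1-D F32 tensors and an element-wise sum node.
    struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor *b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor *c = ggml_add(ctx, a, b);

    for (int i = 0; i < 4; ++i) {
        ggml_set_f32_1d(a, i, (float) i);
        ggml_set_f32_1d(b, i, 10.0f);
    }

    // Build the compute graph and evaluate it on a single thread.
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    for (int i = 0; i < 4; ++i) printf("c[%d] = %f\n", i, ggml_get_f32_1d(c, i));

    ggml_free(ctx);
    return 0;
}
```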
GGML Memory Management
References: spm-headers/ggml-alloc.h
The ggml-alloc.h
file introduces a custom memory allocator for the GGML framework, designed to optimize memory usage and performance. The allocator is initialized through the ggml_init()
function, which configures the allocator's behavior based on the provided ggml_init_params
structure. This initialization is crucial as it prepares the allocator for subsequent tensor allocations within the GGML context.
GGML Hardware Abstraction Layer
References: spm-headers/ggml-backend.h
The ggml_backend
struct in …/ggml-backend.h
encapsulates the hardware abstraction layer for the GGML library, enabling the execution of operations across different hardware platforms. This struct is pivotal in ensuring that the library can leverage the specific computational capabilities of CPUs, GPUs, and potentially other accelerators without the need for the calling code to manage these details.
Llama Language Model Interface
References: spm-headers/llama.h
Interfacing with the Llama language model is facilitated through the C-style API declared in …/llama.h, which serves as the primary access point for key operations such as model loading, text generation, and perplexity evaluation. The header exposes functions for initializing the backend, loading models, creating inference contexts, tokenizing text, and running decoding.
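As a hedged illustration of that interface (exact parameter structs vary by revision; the model path and context size are placeholders), the basic model and context lifecycle looks roughly like this:

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Load a GGUF model from disk with default parameters (path is a placeholder).
    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    // Create an inference context; tokenization, decoding, and sampling operate on it.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;
    llama_context *ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize a prompt, call llama_decode(), and sample from the logits here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_end();
    return 0;
}
```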