
xgboost

Auto-generated from dmlc/xgboost by Mutable.ai Auto Wiki

xgboost
GitHub Repository
Developer: dmlc
Written in: C++
Stars: 25k
Watchers: 911
Created: 2014-02-06
Last updated: 2024-01-05
License: Apache License 2.0
Homepage: xgboost.readthedocs.io/en/stable
Repository: dmlc/xgboost
Auto Wiki
Generated at: 2024-01-06
Generated from: Commit 38dd91
Version: 0.0.4

XGBoost is an optimized, distributed gradient boosting library designed for speed and performance. It implements machine learning algorithms under the gradient boosting framework and provides a scalable, portable solution for large-scale tree boosting.

The core functionality for training and evaluating gradient boosted tree models is implemented in C++ code under …/gbm. This handles important tasks like calculating gradients, training new trees, and adding them to the model. The trees themselves are defined in …/tree_model.h.

XGBoost supports distributed training leveraging multiple machines with its Rabit framework implemented in rabit. This provides collective communication operations and synchronization to coordinate gradient boosting across nodes.

The Python package in python-package wraps the core C++ functionality and provides user-friendly APIs for training, evaluation, prediction, and analysis. It includes integrations with NumPy, Pandas, Scikit-Learn, Dask, and Spark. Comprehensive testing validates the functionality.
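The snippet below is a minimal sketch of this Python API in use; the synthetic data and parameter values are illustrative rather than taken from the repository.

```python
# Minimal sketch of the Python wrapper: build DMatrix objects, train, predict.
# Data and hyperparameters are synthetic placeholders.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=50,
    evals=[(dtrain, "train"), (dvalid, "valid")],
)
preds = booster.predict(dvalid)  # probabilities for the positive class
```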

XGBoost implements several machine learning tasks via objective functions like classification, regression, and ranking. New objectives can be added through the modular interface in …/objective.

XGBoost provides GPU support to accelerate both training and inference by leveraging CUDA libraries and GPU-enabled algorithms. In addition, a SYCL plugin in …/sycl enables heterogeneous computing on CPUs and GPUs.

Key design choices include its high performance C++ core that leverages system optimization libraries, use of efficient in-memory sparse matrix representations, and innovations like approximate algorithms and block structure for parallel and distributed computation.

Together, these components provide an optimized, versatile gradient boosting library that scales effectively to large datasets across a cluster of machines.

Distributed Training

References: rabit, src/collective, tests/test_distributed

XGBoost's distributed training functionality leverages multiple machines to train models on large datasets in parallel. The core abstraction that enables this is the communicator interface defined in the …/communicator.h header, which is implemented by classes defined in the …/comm.cc file.

The …/collective directory contains other important components. The class defined in the …/coll.h header provides a common interface for collective operations and dispatches them to different backend implementations. The …/allgather.h header contains implementations of collectives such as all-gather that are optimized for efficient communication patterns like the ring algorithm.

The classes defined in the …/tracker.h header coordinate the overall distributed training job.

Distributed Engine and Coordination

References: rabit, rabit/include, rabit/include/rabit, rabit/include/rabit/internal, src/collective, tests/test_distributed

The core distributed engine functionality in XGBoost is handled by the Rabit distributed training framework. Rabit provides collective communication operations and synchronization primitives that enable distributed training.

The rabit directory contains the implementation of Rabit. The …/include directory defines the core interfaces and functionality. This includes interfaces for the distributed training engine in …/engine.h, serialization utilities in …/io.h, common operations in …/rabit-inl.h, and utilities in …/utils.h.

The main interfaces are defined in …/engine.h.

The Rabit distributed engine functionality is implemented in …/engine.cc.

The …/allreduce_base.cc file implements collective communication operations using asynchronous non-blocking sockets.

The …/engine_mock.cc file implements a mock synchronization engine that injects failures to test the engine's fault tolerance.

The …/rabit_c_api.cc file exposes the Rabit functionality through C bindings.

In summary, Rabit provides the core distributed engine and coordination primitives for XGBoost through its collective communication abstractions and implementations of common patterns like allreduce. The interfaces it defines allow different backends to be supported while the implementations optimize performance.

Integration with Frameworks

References: tests/test_distributed/test_gpu_with_dask, tests/test_distributed/test_gpu_with_spark, tests/test_distributed/test_with_dask, tests/test_distributed/test_with_spark

The code in …/test_distributed provides integration between XGBoost and distributed frameworks like Spark and Dask to enable distributed training and prediction.

When using Dask, the tests in …/test_with_dask simulate a distributed environment to validate behavior without a real cluster.

For Spark integration, the estimator classes are tested in …/test_spark_local_cluster.py. This file validates that parameters, predictions, and other outputs match the core XGBoost API. The …/test_data.py file contains tests for loading data from Spark, including stacking Pandas data and creating matrices from partitioned DataFrames.
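These tests exercise the PySpark estimator interface. The following hedged sketch shows roughly what such usage looks like; the tiny local DataFrame stands in for the partitioned datasets used in the tests, and the parameter values are illustrative.

```python
# Sketch of the xgboost.spark estimator API, assuming pyspark is installed
# and XGBoost was built with Spark support.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Tiny illustrative dataset; real tests use partitioned DataFrames.
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.5, 1), (0.2, 0.8, 0), (0.9, 0.1, 1)],
    ["f0", "f1", "label"],
)
train_df = VectorAssembler(inputCols=["f0", "f1"], outputCol="features").transform(df)

classifier = SparkXGBClassifier(
    features_col="features",  # assembled feature vector column
    label_col="label",
    num_workers=1,            # number of distributed XGBoost workers
)
model = classifier.fit(train_df)
predictions = model.transform(train_df)  # adds prediction columns
```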

Federated Learning

References: tests/test_distributed/test_federated

The XGBoost tests in …/test_federated provide an example implementation of federated learning using XGBoost's distributed training functionality. Federated learning allows a central process to coordinate distributed training across many clients, each with their own local dataset, without requiring the clients to share their private data.

The …/runtests-federated.sh script sets up a simulated federated environment by splitting the datasets in …/data into multiple smaller files, with one partition per available GPU. This simulates a real-world scenario where each client (GPU) holds its own local dataset.

The main test logic is contained in …/test_federated.py. This file implements functions for:

  • Each client process: loading its local data partition, initializing the distributed training context, training an XGBoost model on the local data, evaluating it on local validation data, and saving results.

  • Orchestrating a full federated training job by starting processes and waiting for completion.

It also contains tests of different configurations by calling the orchestration function with options like SSL encryption enabled.

This provides a simple demonstration of implementing federated learning using XGBoost's distributed functionality. The key aspects are the distributed training context initialized in each client, and the central coordination of the distributed training job.

Gradient Boosted Trees

References: src/gbm, src/tree, include/xgboost

The core gradient boosted trees functionality in XGBoost is implemented in the …/gbm and …/tree directories. These directories contain the key algorithms and data structures for gradient boosted trees (GBTs).

The …/gbm directory contains implementations of the standard gradient boosted tree and linear booster algorithms. Classes related to GBTs are defined in files like …/gbtree.cc.

Functions in files like …/gbtree.cc handle configuring models, training trees via calls to functions in …/tree, making predictions, and model slicing. Core training logic is implemented in methods that calculate gradients, train trees, and commit results.

Classes defined in files like …/gbtree_model.h represent trained GBT models and store trees using vectors. Methods allow loading/saving models and adding trees.

The …/tree directory contains the data structures representing individual trees, defined in files like …/tree_model.cc. Tree construction algorithms are implemented in subdirectories like …/hist.

Core Gradient Boosting Algorithm

References: src/gbm, src/tree

The core gradient boosting algorithm and tree construction logic is implemented in the …/gbm and …/tree directories. The main training logic in the …/gbtree.cc file first calculates gradients on the training data, then trains new trees to fit the gradients using tree updaters like …/updater_colmaker.cc and …/updater_approx.cc. The new trees are constructed in parallel and then committed to the model.

Trees are represented using classes defined in …/tree_model.cc. This file stores the tree structure in vectors and provides methods for expanding nodes during training. The …/gbtree_model.h file represents the full GBT model, storing hyperparameters and a group of trees.

The …/hist directory contains implementations of histogram-based algorithms for tree construction. Files like …/evaluate_splits.h handle accumulating statistics for nodes, enumerating and evaluating splits, and applying the best split. The …/histogram.cc file contains parallel algorithms to efficiently build histograms during each training iteration.
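As a concrete illustration, the histogram-based algorithm can be selected from the Python API via the `hist` tree method; the `max_bin` value below controls histogram granularity, and the data is synthetic.

```python
# Selecting the histogram-based tree construction (src/tree/hist) from Python.
# Parameter values are illustrative only.
import numpy as np
import xgboost as xgb

X = np.random.rand(5000, 20)
y = np.random.rand(5000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "tree_method": "hist",          # histogram-based split finding
    "max_bin": 256,                 # number of histogram bins per feature
    "objective": "reg:squarederror",
}
booster = xgb.train(params, dtrain, num_boost_round=20)
```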

Objective Functions

References: src/objective

Objective functions in XGBoost are implemented to support a variety of machine learning tasks like regression, classification, ranking and custom tasks. Objective functions calculate the loss function and its gradients during model training. This allows XGBoost to optimize different loss functions for the given learning problem.

Objective functions are registered with XGBoost using DMLC's registry mechanism. This allows them to be loaded by name. The base class defines the common interface for objectives.

Regression objectives calculate loss functions for regression tasks. The …/regression_obj.cc file registers regression objectives. It includes CPU or CUDA implementations conditionally.

Classification objectives handle multi-class and binary classification. The …/multiclass_obj.cc file registers the multiclass objective and conditionally includes the CPU or CUDA implementation depending on the build configuration.

Ranking objectives optimize ranking metrics. The …/lambdarank_obj.cc file implements LambdaRank, a pairwise learning-to-rank algorithm. It defines a base class and subclasses that implement specific ranking objectives by overriding its virtual methods.

Custom objectives can be implemented by inheriting from the base class and overriding its methods. The …/objective.cc file contains a registry and factory for loading objectives by name.

Histogram Algorithms

References: src/tree/hist

The histogram algorithms in XGBoost provide an efficient way to find the best splits during tree construction. Classes handle building histograms for single target problems, enumerating and evaluating all possible split points for each feature.

Histograms are built in parallel during each boosting iteration using classes defined in …/histogram.h. The histogram builder for single-target problems caches histograms to improve performance.

Sampling of gradients is done using functions in …/sampler.h to reduce data and improve training speed.

Node assignment is performed by functions in …/histogram.cc, which calculate the Hessian weight on each side of a split and decide which side will be the build node and which will be obtained by histogram subtraction.

Distributed Training

References: src/collective

Distributed training in XGBoost is handled through the collective communication functionality defined in …/comm.h. The core components provide abstractions and implementations for common collective patterns needed in distributed gradient boosting algorithms.

XGBoost uses the interface defined in …/comm.h to represent communicators between processes.

Distributed algorithms are implemented using collective primitives like the all-reduce function defined in …/allreduce.h, which performs an efficient ring all-reduce operation on data across workers.

Files under …/tree contain the tree learning algorithms. Distributed tree learning leverages collective primitives to parallelize operations across workers.

Model Formats

References: include/xgboost

Serializing and deserializing GBT models involves saving trained gradient boosted tree (GBT) models to file and loading them back later. This allows entire trained models to be persisted to disk. The …/c_api.h file defines the C API for XGBoost, which includes important functions for model I/O.

The core tree data structure is defined in …/tree_model.h. It represents a single regression or classification tree as a vector of node objects, where each node stores child pointers, split criteria, and leaf values. Serializing a model involves traversing the tree structures and writing out this node information in a binary format.

When loading a model, the node information is read from the file and used to reconstruct the tree structures that represent the trained boosted trees. This fully recreates the trained model.
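As an illustration of model persistence from the Python side (a thin wrapper over the C API functions mentioned above), the following sketch round-trips a model through the JSON format; the data is synthetic.

```python
# Round-tripping a trained model through the JSON model format.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(200, 5), np.random.rand(200)
booster = xgb.train({"objective": "reg:squarederror"},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

booster.save_model("model.json")   # serialize trees and learner configuration

loaded = xgb.Booster()
loaded.load_model("model.json")    # reconstruct the trained trees

assert np.allclose(loaded.predict(xgb.DMatrix(X)),
                   booster.predict(xgb.DMatrix(X)))
```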

Python Integration

References: python-package/xgboost

The XGBoost Python package provides a seamless interface for training, predicting, and analyzing gradient boosted trees in Python. Key classes represent training and test data in XGBoost's internal matrix format, and the package implements estimators for classification and regression tasks.

Distributed training is enabled via Dask integration classes, which distribute data and computations across a Dask cluster. Asynchronous training is implemented using context managers.

Core functionality is contained in …/__init__.py, which defines the main data structures and training functions. Callbacks are implemented in …/callback.py. Data preprocessing occurs in …/data.py through classes that represent the different input data formats.

Visualization is provided in …/plotting.py. Configuration is managed via …/config.py.
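A small sketch of these helpers, assuming matplotlib is installed; the trained model here is a throwaway example.

```python
# Using the plotting and configuration helpers from the Python package.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(300, 6), np.random.rand(300)
booster = xgb.train({"objective": "reg:squarederror"},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

with xgb.config_context(verbosity=1):     # config.py: scoped global settings
    ax = xgb.plot_importance(booster)     # plotting.py: feature importance chart
    ax.figure.savefig("importance.png")
```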

Testing covers functionality like data handling, metrics, and distributed computation.

Objective Functions

References: src/objective, src/metric

Objectives in XGBoost allow specifying the loss function used during model training. This guides the learning process towards optimizing for specific machine learning tasks like regression, classification, ranking, and more.

The main implementation of objectives in XGBoost is contained in the …/objective directory. Here, objectives are registered with XGBoost's objective registry using DMLC. The registry provides a way to lookup objectives by name. This allows objectives to be loaded during training when specified.

The base class for objectives is defined in …/objective.cc. This class declares the interface that all objective implementations must follow. A key method computes the initial base score tensor used for predictions.

Individual objectives inherit from the base class and implement the gradient and hessian computation methods. For example, the regression objectives are implemented in …/regression_obj.cc. This file conditions inclusion of the CPU or GPU versions depending on build configuration.

The quantile regression objective is handled in …/quantile_obj.cc. This file registers the objective and includes the implementation conditionally. The implementation likely defines a class encapsulating the quantile loss computation. Methods within approximate the true quantile loss during training.

Parameter classes allow tuning objectives. For example, …/regression_param.h defines parameters for regression losses. Structs store parameter values specified during training.

Initialization of base scores is implemented in …/init_estimation.cc. This file fits a simple model to generate initial predictions before the boosting process begins.

Regression Objectives

References: src/objective/regression_obj.cc

The …/regression_obj.cc file implements regression objectives. It conditionally includes CUDA code when available to allow efficient parallel gradient calculation on GPUs.

The gradient calculation takes the prediction and label as input and returns the gradient for each example. When CUDA is enabled, the calculation is delegated to device functions for efficient parallelism.

Otherwise, the CPU fallback implementations calculate the gradient directly on the host. Once calculated, the gradients are returned to the XGBoost core for model updating.

This allows XGBoost to support common regression loss functions both on CPU and GPU hardware, enabling flexible and efficient regression analysis. Programmers can select the loss function and hardware accordingly for their task.
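For example, a regression objective (and the device used for gradient computation) is selected through training parameters; the values below are illustrative, and "cuda" assumes a GPU-enabled build.

```python
# Choosing a regression objective and the device for gradient computation.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(1000, 8), np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",  # squared-error gradients (regression_obj.cc)
    "device": "cpu",                  # or "cuda" to use the CUDA implementation
}
booster = xgb.train(params, dtrain, num_boost_round=30)
```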

Classification Objectives

References: xgboost

Objectives for binary classification are contained in the …/regression_obj.cc file. This file defines classes for logistic regression problems.

Multiclass classification objectives are in …/multiclass_obj.cc. This file contains classes that support loss functions for multiclass problems.

Both files register the objective classes. This allows loading objectives by name.
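For illustration, the multiclass objective is selected by name along with the required num_class parameter; the data below is synthetic.

```python
# Selecting the multiclass objective registered in multiclass_obj.cc.
import numpy as np
import xgboost as xgb

X = np.random.rand(600, 5)
y_multi = np.random.randint(0, 3, size=600)

dtrain = xgb.DMatrix(X, label=y_multi)
params = {
    "objective": "multi:softprob",  # per-class probabilities
    "num_class": 3,                 # required for multiclass objectives
}
booster = xgb.train(params, dtrain, num_boost_round=20)
probs = booster.predict(dtrain)     # shape (n_rows, num_class)

# For binary classification the analogous choice is "binary:logistic",
# which is registered in regression_obj.cc as noted above.
```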

Ranking Objectives

References: src/metric/rank_metric.cc

The …/rank_metric.cc file implements objectives and metrics for ranking tasks. It contains classes that calculate various ranking metrics.

These classes override a method to calculate the metric given predictions and labels. Caching is supported using the base class. The file also registers these metrics with XGBoost so they can be used during training and evaluation.

A class implements discounted cumulative gain by first sorting predictions and labels, then calculating the result. Caching of results is supported to speed up calculations for subsequent iterations. Another class similarly implements its metric by sorting and caching intermediate results. A third class calculates precision at standard cutoffs.

Parallelization is used to distribute the metric calculation work across multiple threads for improved performance on large datasets. Metric configuration is also supported via JSON.
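The sketch below pairs the LambdaRank-based ranking objective with the NDCG metric implemented here; the query groups and data are synthetic placeholders.

```python
# Learning-to-rank setup: rank:ndcg objective with an NDCG@5 evaluation metric.
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 10)
y = np.random.randint(0, 4, size=300)   # graded relevance labels

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([100, 100, 100])       # three queries of 100 documents each

params = {
    "objective": "rank:ndcg",           # LambdaRank-based objective
    "eval_metric": "ndcg@5",            # ranking metric with a cutoff
}
booster = xgb.train(params, dtrain, num_boost_round=20,
                    evals=[(dtrain, "train")])
```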

Custom Objectives

References: src/objective/objective.cc

To implement a custom objective, a developer creates a new class that inherits from the base class defined in the file, overrides the configuration method to initialize any state the objective needs, and registers the class with the registry using the registration macro defined in the file, which associates the class with a unique name. Objectives can then be loaded by name at runtime.

Some key aspects of implementing a custom objective include:

  • Inheriting from the base class and implementing the required interface
  • Initializing any state needed for the objective in the configuration method
  • Registering the class with the registry using the registration macro
  • Implementing the gradient and Hessian computation methods
  • Transforming and returning the predicted scores

The registry and factory pattern defined in the file make objectives pluggable and loadable by name, abstracting away the implementation details. This provides a clean, extensible way for developers to implement new objectives tailored to their machine learning tasks.
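The registry described above is the C++ extension path. At the Python level, a custom objective can also be supplied as a callable that returns per-row gradients and Hessians, mirroring the interface the C++ objectives implement; the squared-error loss below is just an example.

```python
# Python-level custom objective passed directly to xgb.train via `obj`.
import numpy as np
import xgboost as xgb

def squared_error(predt: np.ndarray, dtrain: xgb.DMatrix):
    """Gradient and Hessian of 0.5 * (pred - label)^2."""
    y = dtrain.get_label()
    grad = predt - y             # first-order gradient
    hess = np.ones_like(predt)   # second-order gradient (constant here)
    return grad, hess

X, y = np.random.rand(500, 6), np.random.rand(500)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"tree_method": "hist"}, dtrain,
                    num_boost_round=20, obj=squared_error)
```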

Data Loading and Feature Engineering

References: src/data, demo/data

XGBoost provides functionality for efficiently loading datasets into memory from various sources such as files, arrays, or databases and representing the data in sparse matrix formats suitable for gradient boosted tree training. This functionality is implemented primarily in the …/data directory.

The …/adapter.h and …/adapter.cc files define adapter classes that provide a common interface for loading data from different sources. These adapter classes handle sparsity, metadata extraction, and providing efficient access patterns. They load data incrementally in batches through an iterator interface, avoiding unnecessary data copying and enabling out-of-core loading.

File iterators defined in …/file_iterator.h and implemented in …/file_iterator.cc handle incrementally loading data from text files.

Validation logic in …/validation.h checks for invalid data values.
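A short sketch of constructing the in-memory matrix from dense and sparse inputs through the Python wrapper, which exercises the adapter interface described above; the data and missing-value marker are illustrative.

```python
# Building DMatrix objects from dense NumPy and sparse SciPy inputs.
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

dense = np.random.rand(100, 20)
dense[dense < 0.1] = np.nan                      # values treated as missing
labels = np.random.rand(100)

dmat_dense = xgb.DMatrix(dense, label=labels, missing=np.nan)

sparse = sp.random(100, 20, density=0.2, format="csr")
dmat_sparse = xgb.DMatrix(sparse, label=labels)  # CSR adapter keeps sparsity
```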

Model Evaluation

References: src/metric, python-package/xgboost/callback.py

This section covers model evaluation, overfitting detection, and early stopping functionality in XGBoost. Key aspects include metrics for evaluating model performance, using validation data to detect overfitting, and stopping training early based on validation results.

The core functionality is implemented through metric objects defined in …/metric and callback classes in …/callback.py. Metrics calculate error, accuracy, and other scores on training and validation data to quantify model performance.

Several important metric classes are defined for specific tasks. For survival analysis, a base class defines the core API for survival metrics, and a derived class implements the Cox proportional hazards metric, which is commonly used in survival analysis.

Metrics are registered with XGBoost via the DMLC registry to make them available for use. They are instantiated by name and implement a common interface to calculate scores on batches of data. Some optimizations include caching intermediate results for efficiency.

Evaluation Metrics

References: src/metric

The XGBoost library provides several metrics for evaluating model performance that are implemented in the …/metric directory. Key metrics include accuracy and log loss for classification as well as regression metrics. These metrics are used during training to measure how well a model fits the data and helps guide the boosting process, and are also used for final evaluation on test data.

The main classes that implement evaluation metrics include those in …/multiclass_metric.cc for multiclass classification. Caches defined in …/rank_metric.h store intermediate results for ranking metrics to improve efficiency.

Validation Datasets

References: python-package/xgboost/callback.py

XGBoost provides functionality for monitoring metrics during training, saving checkpoints, and performing early stopping based on validation results via callback functions defined in the …/callback.py file. The key callback computes metrics on validation data during training, aggregates results across workers, and handles the specifics of distributed execution. The callback receives the model and evaluation logs in its methods, allowing it to monitor progress. Metrics are all-reduced so the early-stopping criterion is identical on every worker. This makes it easy to detect overfitting and stop training when validation performance stops improving.
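A hedged sketch of how validation monitoring and early stopping are typically wired up from the Python API; the dataset, metric, and round counts are illustrative.

```python
# Early stopping driven by a validation set, using the callback interface.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

early_stop = xgb.callback.EarlyStopping(
    rounds=10,              # stop if the metric fails to improve for 10 rounds
    save_best=True,         # keep the best iteration's model
    data_name="valid",
    metric_name="logloss",
)
booster = xgb.train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    callbacks=[early_stop],
)
```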

Model Analysis

References: demo/CLI

The …/binary_classification subdirectory contains examples that demonstrate analyzing model predictions to diagnose problems, showing how to inspect a trained XGBoost model and its predictions.

The …/README.md file documents the process of model inspection.

The …/runexp.sh script runs an end-to-end experiment and allows inspecting a trained model.

By analyzing model outputs, users can evaluate performance, detect problems, and refine training. The …/CLI directory likely contains other examples demonstrating model inspection for regression and distributed tasks.

GPU Support

References: tests/python-gpu, plugin/sycl

XGBoost's GPU support allows leveraging GPU acceleration for training models and making predictions more efficiently. The extensive tests in the …/python-gpu directory validate aspects of XGBoost's GPU functionality by training models with different objectives, datasets, and frameworks like CuPy and CuDF.

The data structure defined in …/data.h represents input data for device kernels. It constructs buffers for row pointers and feature values from batches by storing offsets into a single contiguous feature-values buffer, allowing efficient row-wise iteration.

Objectives implemented in …/objective compute gradients in parallel, such as the multi-class classification objective which calculates probabilities, loss, and gradients concurrently.

The class in …/predictor runs predictions efficiently on devices by initializing buffers from the CPU model and traversing trees in parallel across rows, summing results on the device.

The plugin in …/sycl provides an interface to select resources.

GPU Training

References: tests/python-gpu

XGBoost supports training models directly on GPUs using CUDA for significantly faster training times on large datasets. The …/python-gpu directory contains an extensive test suite validating the GPU training functionality.

Key aspects of GPU training include representing input data in a compressed sparse format suited for the GPU, and ensuring results match CPU training. The tests generate different data types, train models with various parameters, and carefully validate predictions are equivalent between CPU and GPU.

The test classes inherit shared test logic from other files so the same cases can be reused between CPU and GPU.

The tests in …/test_gpu_basic_models.py train models with different configurations to test robustness. …/test_gpu_data_iterator.py ensures the iterator can handle batches efficiently on the GPU. …/test_gpu_prediction.py trains models and validates predictions match between devices. …/test_device_quantile_dmatrix.py focuses on representing input data and determinism of models trained on CPU and GPU.

…/test_gpu_updaters.py has detailed tests for updating different data types on the GPU, including sparse, categorical, and external memory data. It leverages property testing with Hypothesis and various test datasets to validate different scenarios. Overall, the GPU tests aim to be comprehensive in validating all aspects of the implementation are functioning as expected for reliable and performant GPU training.
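A minimal sketch of enabling GPU training from the Python API, assuming a CUDA-enabled build and a visible GPU; parameter values are illustrative.

```python
# GPU-accelerated training: run histogram tree construction on the GPU.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(10000, 30), np.random.rand(10000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "device": "cuda",           # run tree construction on the GPU
    "tree_method": "hist",      # histogram algorithm, GPU-accelerated
    "objective": "reg:squarederror",
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```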

GPU Inference

References: tests/python-gpu, plugin/sycl/predictor

The main focus of GPU Inference is running predictions and scoring models on GPUs for faster inference. This is implemented through several key components:

The …/test_gpu_prediction.py file contains tests for prediction functionality on the GPU. It trains models on different datasets and devices, then makes predictions and verifies the results match between CPU and GPU implementations.

The …/predictor subdirectory implements efficient parallel tree traversal for predictions on SYCL/OpenCL devices, with the core prediction logic contained in the …/predictor.cc file.

Some key aspects covered are parallelizing tree traversal, efficient device-side computation, and minimizing data movement between host and device. This provides fast, scalable inference by leveraging modern GPUs.

GPU Data Representations

References: tests/python-gpu/test_device_quantile_dmatrix.py

The …/test_device_quantile_dmatrix.py file contains tests for representing input data for efficient GPU processing and training.

It focuses on models trained on DMatrix instances initialized from different CPU and GPU data sources, such as NumPy arrays, and checks that predictions are equal after training on CPU and GPU.

The tests cover a range of algorithms, objectives, data formats, and regularization settings to support diverse training scenarios, helping ensure that the GPU data representations and training paths work as expected.
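A hedged sketch of the kind of device-resident, quantized input these tests exercise, assuming CuPy and a CUDA-enabled build are available; the data and bin count are illustrative.

```python
# QuantileDMatrix built from CuPy arrays: pre-binned, GPU-friendly input.
import cupy as cp
import xgboost as xgb

X = cp.random.rand(5000, 16)
y = cp.random.rand(5000)

# QuantileDMatrix quantizes the data into histogram cuts up front,
# avoiding a second full copy during histogram-based training.
dtrain = xgb.QuantileDMatrix(X, label=y, max_bin=256)

booster = xgb.train(
    {"device": "cuda", "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=20,
)
```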

SYCL Plugin

References: plugin/sycl, plugin/sycl/objective, plugin/sycl/predictor

The …/sycl directory implements a SYCL plugin for XGBoost that adds support for the SYCL programming model. This allows parts of the XGBoost algorithm, like model training and inference, to be offloaded to SYCL devices such as GPUs and CPUs, enabling parallelism on these devices.

The plugin can be built from the XGBoost directory by running cmake and make. It supports tree construction for both training and inference using SYCL. Key dependencies include the Intel oneAPI DPC++/C++ Compiler.

The …/objective subdirectory contains objectives that can be used with XGBoost when training models using SYCL. This includes the …/multiclass_obj.cc file, which defines objectives for multi-class classification problems. The …/regression_obj.cc file defines regression objectives using a base class.

The …/predictor subdirectory contains functionality for running predictions efficiently on SYCL devices. The …/predictor.cc file defines a class to represent the tree structure, and contains SYCL buffers for the tree metadata. It implements a core prediction function that traverses trees in parallel across rows using these classes.

The …/device_manager.cc file manages SYCL devices and queues.

Objective Functions

References: plugin/sycl/objective

The …/objective directory contains objectives that can be used for GPU training in XGBoost. It implements objectives using the SYCL parallel programming model, which allows taking advantage of GPUs and other accelerators.

The …/multiclass_obj.cc file contains functionality for multi-class classification objectives.

The …/regression_obj.cc file contains a base regression objective class that takes a template parameter for the loss function. The base class handles common tasks for any regression objective. A key method computes the gradient via a SYCL kernel, taking the predictions, labels, and weights as buffers to run in parallel on GPUs.

Python Package Integration

References: tests/python-gpu

The Python package integration focuses on enabling GPU usage through extensive tests. Files like …/test_device_quantile_dmatrix.py contain tests for representing input data on the GPU.

…/test_gpu_basic_models.py includes tests for basic models using different objectives and configurations. Files like …/test_gpu_updaters.py validate histogram and approximation algorithms.

…/test_gpu_data_iterator.py contains tests for the data iterator, and …/test_gpu_demos.py ensures demo scripts run correctly. Metric evaluations are compared in …/test_gpu_eval_metrics.py.

…/test_from_cudf.py and …/test_from_cupy.py focus on initializing data from cuDF and CuPy inputs and checking the results.

Classes in files like …/test_gpu_prediction.py validate that results match between devices. Together, these unit tests ensure GPU usage integrates seamlessly with the Python package.

Python Package

References: python-package/xgboost, tests/python

The core functionality of the XGBoost Python package is contained within the …/xgboost directory. This directory provides Python classes, functions, and modules that serve as wrappers around the underlying C++ XGBoost library.

The main data structure handles loading training/test data from NumPy arrays, Pandas DataFrames, and other sources into an internal matrix format suitable for XGBoost training and inference.

Callbacks defined in …/callback.py allow monitoring metrics during training, early stopping, and checkpointing models; a learning-rate scheduler callback can modify the learning rate over boosting rounds.

Additional functionality includes distributed training wrappers in …/dask that allow using Dask for large-scale training. The …/spark directory provides a Spark MLlib interface.

The core training logic is contained in …/training.py, which handles initializing models, running callbacks, and updating models over boosting rounds. Cross-validation functionality splits data into folds and trains on each fold.

Comprehensive tests for the Python package are in …/testing. Tests cover functionality like models, data representations, preprocessing, plotting, and distributed training.

Distributed Training

References: python-package/xgboost/dask, python-package/xgboost/spark

The XGBoost Python package provides integrations for distributed training using Dask and Spark. For Dask, the …/dask directory contains utilities that allow training models on large datasets across a Dask cluster, with the public interface defined in the …/__init__.py file.
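A minimal sketch of the Dask interface, assuming dask and distributed are installed; the LocalCluster and synthetic arrays stand in for a real cluster and dataset.

```python
# Distributed training with the Dask interface: data stays on the workers.
from dask import array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

with Client(LocalCluster(n_workers=2)) as client:
    X = da.random.random((10000, 10), chunks=(1000, 10))
    y = da.random.random(10000, chunks=1000)

    dtrain = dxgb.DaskDMatrix(client, X, y)   # references worker-local data
    output = dxgb.train(
        client,
        {"objective": "reg:squarederror", "tree_method": "hist"},
        dtrain,
        num_boost_round=20,
    )
    booster = output["booster"]    # trained model
    history = output["history"]    # evaluation history (empty here, no evals passed)
```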

For Spark, the …/spark directory provides distributed training functionality using PySpark. The …/estimator.py file defines the estimator classes, utilities for processing Spark partitions are provided in …/data.py, and base functionality is implemented in …/core.py.

Testing

References: python-package/xgboost/testing

The Python package provides a comprehensive test suite covering the core functionality of XGBoost. The test suite, located in …/testing, rigorously validates all aspects of model training, prediction, evaluation and I/O.

The suite contains tests for updaters, data loading, preprocessing, model evaluation, learning-to-rank tasks, and distributed training using Dask. Tests are organized into modules based on the functionality or component they validate.

The …/updater.py module lies at the core of testing. It provides thorough validation of updaters under different configurations by setting up data and parameters, running training, and making assertions on the results. Tests cover a wide range of data types, storage methods and parameters to exercise diverse code paths.

Other important modules include:

  • …/dask.py tests distributed training functionality by initializing models from sample data on Dask and validating scores.

  • …/metrics.py validates evaluation metrics.

  • …/ranking.py performs end-to-end tests of learning-to-rank with different data formats and feature types.

Sample data generation and utilities central to many tests are located in …/data.py and …/shared.py. Parameter strategies in …/params.py help test a wide range of configurations.

Utilities

References: python-package/xgboost/_typing.py, python-package/xgboost/sklearn.py

The utilities provided in the XGBoost Python package allow for flexible model training, evaluation, and analysis. Type annotations in the …/_typing.py module define common data types like features, parameters, and models to standardize inputs and outputs. This includes defining types for callbacks and preprocessing functions to integrate user-defined extensions.

Callbacks provide a way to execute custom code at different stages of training. This allows monitoring training metrics, early stopping, and parameter optimization.

The …/sklearn.py module builds on these utilities to provide a consistent Scikit-Learn interface. It handles storing the underlying booster object and exposing methods like model fitting, prediction, and persistence.
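A short sketch of this Scikit-Learn style interface; the dataset and hyperparameters are illustrative.

```python
# Scikit-Learn wrapper from sklearn.py: fit/predict plus access to the Booster.
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(500, 8)
y = np.random.randint(0, 2, size=500)

clf = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(X[:400], y[:400], eval_set=[(X[400:], y[400:])], verbose=False)

proba = clf.predict_proba(X[400:])   # class probabilities
labels = clf.predict(X[400:])        # hard class predictions
booster = clf.get_booster()          # the underlying Booster object
```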

Command Line Interface

References: demo/CLI

The …/CLI directory contains demonstrations of using XGBoost from the command line interface (CLI) for common machine learning tasks. The CLI provides a simple and flexible way to train, predict, and analyze models without writing any custom code.

Key functionality includes:

  • …/binary_classification and …/regression demonstrate end-to-end workflows for binary classification and regression tasks using datasets like mushrooms and computer hardware specifications. They showcase data preprocessing through Python scripts like …/mapfeat.py, splitting data into folds with …/mknfold.py, and running experiments through shell scripts like …/runexp.sh.

  • Shell scripts like …/runexp.sh execute tasks by calling the xgboost executable.

  • Configuration files specify hyperparameters, loss functions, and other settings.

  • Monitoring training progress is supported by writing evaluation metrics to standard error which can be logged.

  • …/distributed-training demonstrates distributed XGBoost training on AWS using the YARN resource manager through the …/run_aws.sh shell script.

Binary Classification

References: demo/CLI/binary_classification

The …/binary_classification directory contains code demonstrations for performing binary classification tasks from the XGBoost command line interface (CLI). It focuses on a binary classification problem using the mushroom dataset to classify mushrooms as poisonous or edible.

The data is preprocessed using the …/mapfeat.py script to encode categorical features as indicators and produce a feature map file. The …/mknfold.py script is used to split the data into training and test folds for cross validation.

Models are trained using the xgboost CLI tool, specifying parameters like the algorithm and loss function in a configuration file. Training progress can be monitored by writing evaluation metrics to standard error. Models can be periodically saved and training resumed from existing models. The trained model's boosters can be dumped to a file.

The …/runexp.sh shell script runs end-to-end experiments by preprocessing data with the Python scripts, training a model, making predictions on test data, and dumping models. It demonstrates the standard workflow for binary classification using the XGBoost CLI.

Regression

References: demo/CLI/regression

The XGBoost demo code provides examples of using XGBoost for regression tasks on a machine hardware dataset. The code preprocesses the raw CSV data into a format suitable for XGBoost, then performs common regression modeling steps like training a model on a training set and making predictions on a held-out test set.

The …/mapfeat.py file takes the raw machine CSV and creates preprocessed features. It maps each unique vendor value to an integer ID.

The …/mknfold.py script performs a k-fold cross validation split of the data. It writes each example line to either a training or test file based on a random integer assignment, randomly allocating examples to create the train/test folds.

The …/runexp.sh script runs the full workflow: preprocessing, data splitting, training an XGBoost model on the training data, making predictions on the held-out test set, and dumping/inspecting the trained model. This provides an end-to-end demonstration of a regression predictive modeling workflow with XGBoost.

The preprocessing file maps vendor values by iterating through the CSV lines and using a dictionary to store the mappings from vendor values to integer IDs. It writes out the preprocessed data line-by-line.

Distributed Training

References: demo/CLI/distributed-training

Distributed XGBoost training using YARN and S3 allows running distributed training jobs from the command line without needing to write custom code. The …/distributed-training directory contains examples for running distributed XGBoost from the CLI on AWS.

XGBoost is built with support for distributed filesystems like HDFS and S3. A shell script in this directory launches distributed training jobs on an AWS YARN cluster using these filesystems. It exports an environment variable to specify the S3 bucket for data and model storage, and submits the job to YARN with a command that configures properties like the number of workers and cores. The configuration file likely specifies hyperparameters and the data format, and is passed to the training job. This allows training to run in a distributed fashion across the YARN cluster. Once complete, the trained model is stored back in the S3 bucket.

The README provides documentation on building XGBoost with distributed filesystem support and running distributed jobs on AWS YARN. It discusses how the trained model can be analyzed across platforms using bindings like Python and R after downloading from S3.