Auto Wiki by Mutable.ai

scikit-learn

Auto-generated from scikit-learn/scikit-learn by Mutable.ai Auto Wiki

GitHub Repository
Developer: scikit-learn
Written in: Python
Stars: 58k
Watchers: 2.1k
Created: 08/17/2010
Last updated: 04/03/2024
License: BSD 3-Clause "New" or "Revised"
Homepage: scikit-learn.org
Repository: scikit-learn/scikit-learn

Auto Wiki Revision
Software Version: 0.0.8 Basic
Generated from: Commit 30f4d9
Generated at: 04/04/2024

The scikit-learn library is a machine learning toolkit written in Python. It provides tools and algorithms for data preprocessing, model training, and performance evaluation.

The core functionality of the library is implemented across several key directories, each focusing on a specific aspect of the machine learning workflow:

  • The …/preprocessing directory contains a comprehensive set of tools for data preprocessing, including scaling, normalization, encoding, and feature engineering. This includes classes like StandardScaler, OneHotEncoder, and PolynomialFeatures, which allow users to transform their data into a format suitable for machine learning models.

  • The …/linear_model directory provides a variety of linear models, such as Generalized Linear Models, Bayesian Linear Models, Logistic Regression, and Stochastic Gradient Descent-based models. These models are widely used for classification, regression, and other tasks, and the directory includes over 20 different classes and functions implementing these algorithms.

  • The …/model_selection directory is a crucial component of the scikit-learn library, offering tools for model selection, hyperparameter tuning, and performance evaluation. This includes cross-validation techniques, grid search, randomized search, and learning curve visualization. The GridSearchCV and RandomizedSearchCV classes, for example, allow users to efficiently tune the hyperparameters of their models.

  • The …/manifold directory contains implementations of dimensionality reduction and data embedding techniques, such as Isomap, Locally Linear Embedding, Multidimensional Scaling, Spectral Embedding, and t-SNE. These algorithms can be used to visualize and analyze high-dimensional data by projecting it into a lower-dimensional space.

  • The …/neighbors directory provides functionality for nearest neighbors-related algorithms, including k-Nearest Neighbors classification and regression, radius-based nearest neighbors, kernel density estimation, and outlier detection using the Local Outlier Factor (LOF) algorithm.

  • The …/feature_selection directory offers a range of feature selection techniques, such as univariate feature selection, recursive feature elimination, and feature selection based on model importance. These tools can be used to identify the most relevant features in a dataset, which is crucial for improving model performance and interpretability.

  • The …/inspection directory contains functionality for inspecting and understanding machine learning models, including partial dependence plots, permutation importance, and decision boundary visualization. These tools can help users gain insights into the behavior and performance of their models.

  • The …/impute directory provides various imputation methods for handling missing values in datasets, including simple imputation, iterative imputation, and k-Nearest Neighbors-based imputation.

The scikit-learn library is designed for flexibility, efficiency, and ease of use. Its modular codebase, with a specialized submodule for each machine learning task, lets users reach the functionality they need for a specific use case. The library's test suite and documented code support the reliability of the implemented algorithms.

Data Preprocessing

The scikit-learn library provides a comprehensive set of tools for data preprocessing, including scaling, normalization, encoding, and feature engineering. These tools are implemented across several directories and modules, allowing users to easily prepare their data for machine learning models.

Scaling and Normalization

The scikit-learn library provides several classes and functions for scaling and normalizing data, including StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, and Normalizer.
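
A minimal sketch of two of these scalers, assuming a small made-up array (the data values are illustrative and do not come from the repository):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two features on very different scales (illustrative values).
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# StandardScaler centers each column to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler rescales each column to the [0, 1] range by default.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```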

Binarization and Discretization

The Binarizer class in the scikit-learn/sklearn/preprocessing module is used to binarize data by applying a threshold to the input values. It can be useful for converting continuous features into binary (0/1) features.
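
The thresholding behavior can be sketched as follows. The input array and the threshold of 0.5 are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical continuous features.
X = np.array([[0.2, 0.7], [0.5, 1.5]])

# Values strictly greater than the threshold map to 1, all others to 0.
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)
```

Note that a value equal to the threshold (0.5 here) maps to 0.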

Encoding

The scikit-learn library provides two main classes for encoding categorical features: OneHotEncoder and OrdinalEncoder. These classes are implemented in the …/_encoders.py file.
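
As a rough illustration of how the two encoders differ, the sketch below encodes one hypothetical color column. The category names are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single categorical column.
X = np.array([["red"], ["green"], ["blue"]])

# Categories are sorted, so blue -> 0, green -> 1, red -> 2.
X_ordinal = OrdinalEncoder().fit_transform(X)

# The default output is a SciPy sparse matrix; toarray() densifies it.
# Columns follow the sorted categories: blue, green, red.
X_onehot = OneHotEncoder().fit_transform(X).toarray()

print(X_ordinal.ravel())
print(X_onehot)
```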

Transformations

The PowerTransformer and QuantileTransformer classes in the scikit-learn/sklearn/preprocessing module provide functionality for applying power and quantile transformations to the input data, respectively. Additionally, the FunctionTransformer class allows users to apply arbitrary functions to the input data as part of a preprocessing pipeline.

Polynomial and Spline Features

The PolynomialFeatures class in the scikit-learn/sklearn/preprocessing directory is used to generate polynomial and interaction features from input data. It can create features up to a specified degree, including interaction terms and an optional bias term. The transform() method of this class uses an efficient implementation for sparse input data, leveraging the _csr_polynomial_expansion() function.

Target Encoding

The TargetEncoder class in the …/_target_encoder.py file provides a way to encode categorical features based on the target variable. This can be useful for improving the performance of machine learning models, especially when dealing with high-cardinality categorical features.

Model Training

The scikit-learn library offers a wide range of machine learning models for classification, regression, and clustering tasks, implemented across several directories. The subsections below cover the linear model families.

Generalized Linear Models (GLMs)

The …/_glm directory contains the implementation of Generalized Linear Models (GLMs) in the scikit-learn library. It provides several classes that allow for fitting and predicting using GLMs with different underlying distributions, such as Poisson, Gamma, and Tweedie distributions.

Bayesian Linear Models

The _bayes.py file in the scikit-learn library's linear_model module contains two main classes: BayesianRidge and ARDRegression. These classes implement Bayesian regression techniques, specifically Bayesian Ridge Regression and Automatic Relevance Determination (ARD) Regression.

Robust Linear Models

The HuberRegressor class in …/_huber.py implements a robust linear regression model that is less sensitive to outliers in the data. The Huber Regressor optimizes a loss function that is quadratic for small residuals (the difference between the predicted and actual values) and linear for large residuals, allowing the model to be more robust to outliers.
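
To illustrate the robustness claim, the sketch below fits HuberRegressor and ordinary least squares to synthetic data with a few corrupted targets. The data, the true slope of 2, and the size of the corruption are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)

# Synthetic line y = 2x + 1 with mild Gaussian noise.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=20)

# Corrupt the last three targets to act as outliers.
y[-3:] += 100.0

huber = HuberRegressor(max_iter=1000).fit(X, y)
ols = LinearRegression().fit(X, y)

# The Huber slope is expected to stay much closer to 2 than the OLS slope.
print(huber.coef_[0], ols.coef_[0])
```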

Least Angle Regression

The scikit-learn/sklearn/linear_model/_least_angle.py file contains the implementation of the Least Angle Regression (LARS) algorithm and its variants, including the Lasso variant (LassoLars) and the cross-validated models LarsCV and LassoLarsCV.

Logistic Regression

The LogisticRegression class in the …/_logistic.py file implements the Logistic Regression algorithm for binary and multiclass classification problems. It supports various regularization penalties ('l1', 'l2', 'elasticnet') and optimization solvers ('liblinear', 'lbfgs', 'newton-cg', 'newton-cholesky', 'sag', 'saga').
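
A minimal usage sketch on synthetic data, assuming the 'lbfgs' solver and otherwise default settings. The dataset is generated only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

# Each row of predict_proba holds one probability per class and sums to 1.
proba = clf.predict_proba(X[:3])
print(proba)
```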

Orthogonal Matching Pursuit

The scikit-learn library provides an implementation of the Orthogonal Matching Pursuit (OMP) algorithm, which is a greedy algorithm for solving sparse linear regression problems. The core functionality of the OMP algorithm is implemented in the …/_omp.py file.

Passive Aggressive Algorithms

The PassiveAggressiveClassifier and PassiveAggressiveRegressor classes in the …/_passive_aggressive.py file implement the Passive Aggressive algorithm for classification and regression tasks, respectively.

Perceptron

The Perceptron class in …/_perceptron.py provides a simple and efficient implementation of the perceptron algorithm, a type of linear classifier. The Perceptron class is a wrapper around the BaseSGDClassifier class, with the loss parameter set to "perceptron" and the learning_rate parameter set to "constant".

Quantile Regression

The QuantileRegressor class in the …/_quantile.py file implements a linear regression model that predicts conditional quantiles, rather than the mean, which is the typical target of linear regression. This can be useful in applications where the mean may not be the most informative statistic.

RANSAC Regression

The RANSACRegressor class in the …/_ransac.py file implements the RANSAC (Random Sample Consensus) algorithm for robust regression. RANSAC is an iterative method for estimating parameters from a data set containing outliers.

Ridge Regression

The Ridge and RidgeClassifier classes in the …/_ridge.py file provide an implementation of Ridge regression and Ridge classification, respectively.
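
As a sanity check of what Ridge computes, the sketch below compares the fitted coefficients with the textbook closed-form solution w = (XᵀX + αI)⁻¹Xᵀy. It assumes fit_intercept=False so that the closed form applies directly, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Synthetic regression data with known coefficients.
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

# Closed-form ridge solution for comparison.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

print(ridge.coef_)
print(w_closed)
```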

Stochastic Gradient Descent

The scikit-learn library provides efficient and scalable stochastic gradient-based linear models. The _stochastic_gradient.py file defines the Stochastic Gradient Descent (SGD) estimators SGDClassifier, SGDRegressor, and SGDOneClassSVM (a linear One-Class SVM), while the _sag.py file implements the Stochastic Average Gradient (SAG and SAGA) solvers used by Ridge Regression and Logistic Regression. Both files live in the linear_model module.

Theil-Sen Regression

The TheilSenRegressor class in the …/_theil_sen.py file implements the Theil-Sen Estimator, a robust multivariate regression algorithm. The Theil-Sen Estimator is known for its high breakdown point, making it resistant to outliers in the data.

Model Selection and Evaluation

The scikit-learn library provides a comprehensive set of tools for model selection, hyperparameter tuning, and performance evaluation, implemented in the …/model_selection directory.

Cross-Validation

The scikit-learn library provides a comprehensive set of tools for performing cross-validation, which is a crucial technique for evaluating the performance of machine learning models. The core functionality is implemented in the …/_validation.py module.
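
A minimal cross-validation sketch using cross_val_score, assuming a logistic regression model and a synthetic dataset chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 splits the data into five folds and returns one score per fold
# (accuracy by default for classifiers).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```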

Hyperparameter Optimization

The scikit-learn library provides two main classes for performing hyperparameter optimization: GridSearchCV and RandomizedSearchCV. These classes implement grid search and randomized search, respectively, which are two common techniques for tuning the hyperparameters of machine learning models.
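
The two search strategies can be sketched side by side. The candidate values for C, the estimator, and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
candidates = {"C": [0.1, 1.0, 10.0]}

# Grid search evaluates every candidate with 3-fold cross-validation.
grid = GridSearchCV(LogisticRegression(max_iter=1000), candidates, cv=3)
grid.fit(X, y)

# Randomized search samples n_iter candidates instead of trying them all.
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), candidates, n_iter=2, cv=3, random_state=0
)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```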

Model Evaluation

The scikit-learn library provides a comprehensive set of tools for evaluating the performance of machine learning models, implemented in the sklearn.model_selection._validation module.

Visualization

The scikit-learn library provides tools for visualizing model selection and evaluation, including the LearningCurveDisplay and ValidationCurveDisplay classes. These classes offer a convenient way to generate and customize visualizations of the learning curve and validation curve for machine learning models.

Dimensionality Reduction and Manifold Learning

References: sklearn/manifold

The scikit-learn library provides algorithms for dimensionality reduction and manifold learning, implemented in the …/manifold directory. This directory contains the implementation of various techniques, including Isomap, Locally Linear Embedding, Multidimensional Scaling, Spectral Embedding, and t-SNE.

Isomap

The Isomap class in the …/_isomap.py file provides the core functionality for the Isomap algorithm, a non-linear dimensionality reduction technique. Isomap is a manifold learning algorithm that preserves the geodesic distances between data points, allowing it to effectively capture the underlying non-linear structure of high-dimensional data.

Locally Linear Embedding (LLE)

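Locally Linear Embedding (LLE) is one of the dimensionality reduction and data embedding techniques implemented in the …/manifold directory.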

Multidimensional Scaling (MDS)

The MDS class in the …/_mds.py file implements Multidimensional Scaling (MDS), a technique for embedding high-dimensional data into a lower-dimensional space. The core functionality is provided by the smacof() function, which computes the MDS solution using the SMACOF (Scaling by MAjorizing a COmplicated Function) algorithm.

Spectral Embedding

Spectral Embedding is a non-linear dimensionality reduction algorithm based on the spectral decomposition of the graph Laplacian (Laplacian Eigenmaps), implemented in the _spectral_embedding.py file of the scikit-learn library. The main entry point is the spectral_embedding() function, which takes an adjacency matrix as input and projects the samples onto the first eigenvectors of the graph Laplacian.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

The TSNE class in the scikit-learn/sklearn/manifold/_t_sne.py file provides an implementation of the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm, a popular nonlinear dimensionality reduction technique for embedding high-dimensional data into a low-dimensional space.

Nearest Neighbors and Density Estimation

References: sklearn/neighbors

The scikit-learn library includes functionality for nearest neighbors search, kernel density estimation, and outlier detection, implemented in the …/neighbors directory.

k-Nearest Neighbors (k-NN)

The k-Nearest Neighbors (k-NN) algorithm is a non-parametric method used for classification and regression. The library implements k-nearest neighbors search together with k-NN classification (KNeighborsClassifier) and regression (KNeighborsRegressor).
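
A minimal k-NN classification sketch on a one-dimensional toy problem. The points, labels, and choice of three neighbors are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each query is labeled by majority vote among its 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict(np.array([[1.5], [10.5]]))
print(pred)
```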

Radius-Based Nearest Neighbors

The RadiusNeighborsRegressor and RadiusNeighborsClassifier classes in the sklearn.neighbors module of the scikit-learn library implement the radius-based nearest neighbors algorithm. This is a variant of the k-nearest neighbors algorithm that uses a fixed radius to determine the neighbors, instead of a fixed number of neighbors.

Nearest Neighbor Graph

The nearest neighbor graph functionality in the scikit-learn library provides tools for computing the weighted graph of k-nearest neighbors and neighbors within a given radius for a set of data points. This functionality is implemented in the …/_graph.py file.

Kernel Density Estimation

The sklearn/neighbors/_kde.py file provides the implementation of Kernel Density Estimation (KDE) in the scikit-learn library. The main component is the KernelDensity class, which allows users to fit a KDE model on a dataset and perform various operations such as scoring samples, computing the total log-likelihood, and generating random samples from the model.
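
The operations listed above can be sketched as follows, assuming a Gaussian kernel with a bandwidth of 0.5 and samples drawn from a standard normal distribution (all illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns the log-density at each query point.
log_dens = kde.score_samples(np.array([[0.0], [5.0]]))

# score returns the total log-likelihood of the given data under the model.
total_log_likelihood = kde.score(X)

# sample draws new points from the fitted density.
samples = kde.sample(n_samples=10, random_state=0)

print(log_dens, total_log_likelihood, samples.shape)
```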

Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) algorithm is an unsupervised outlier detection method that identifies outliers based on the local density of the data points. The core functionality of the LOF algorithm is implemented in the LocalOutlierFactor class, located in the …/_lof.py file.
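
A small sketch of LOF-based outlier detection, assuming a tight synthetic cluster with one distant point appended. The data and n_neighbors=5 are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Thirty clustered points plus one point far from the cluster.
X_inliers = rng.normal(scale=0.5, size=(30, 2))
X = np.vstack([X_inliers, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=5)

# fit_predict returns 1 for inliers and -1 for outliers.
labels = lof.fit_predict(X)

# More negative scores indicate lower local density relative to neighbors.
print(labels[-1], lof.negative_outlier_factor_[-1])
```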

Neighborhood Components Analysis (NCA)

Neighborhood Components Analysis (NCA) is one of the nearest neighbors-based methods provided by the sklearn.neighbors module.

Nearest Centroid Classifier

The Nearest Centroid Classifier is a simple and efficient classification algorithm that assigns a label to a new input based on the nearest centroid of the training data. The implementation of this algorithm is provided in the NearestCentroid class in the …/_nearest_centroid.py file.

Unsupervised Nearest Neighbors

The NearestNeighbors class provides an unsupervised learner for implementing nearest neighbors searches, supporting various algorithms and distance metrics. It is the primary component in the …/_unsupervised.py file.
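
A minimal neighbor search sketch, assuming four one-dimensional points and a single query (all values are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [1.0], [3.0], [7.0]])

nn = NearestNeighbors(n_neighbors=2).fit(X)

# kneighbors returns distances and indices, sorted from nearest to farthest.
distances, indices = nn.kneighbors(np.array([[2.6]]))
print(distances, indices)
```

Here the query 2.6 is closest to the point 3.0 (index 2), followed by 1.0 (index 1).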

Feature Selection

The scikit-learn library offers a range of feature selection techniques, implemented in the …/feature_selection directory. This directory includes various tools and algorithms for feature selection, which is the process of identifying and selecting the most relevant features from a dataset for use in model training.

Univariate Feature Selection

The _univariate_selection.py file in the scikit-learn library provides a comprehensive set of tools for performing univariate feature selection. The main classes in this file are SelectKBest, SelectPercentile, SelectFpr, SelectFdr, SelectFwe, and GenericUnivariateSelect.

Recursive Feature Elimination (RFE)

The Recursive Feature Elimination (RFE) algorithm is a feature selection technique implemented in the scikit-learn library. It recursively removes features and builds a model on the remaining features, allowing you to select the most important features for your machine learning task.
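
The elimination loop can be sketched with the RFE class on synthetic data. The example assumes a noise-free regression problem where only the first three of six features are informative, and uses LinearRegression as the ranking model.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# shuffle=False keeps the three informative features in columns 0-2.
X, y = make_regression(
    n_samples=100, n_features=6, n_informative=3,
    noise=0.0, shuffle=False, random_state=0,
)

# Each round drops the feature with the smallest squared coefficient
# until three features remain.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 marks a selected feature
```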

Sequential Feature Selection

The SequentialFeatureSelector class in the …/_sequential.py file is responsible for implementing a sequential feature selection algorithm. This algorithm iteratively adds or removes features from the input data based on a specified scoring function and stopping criterion.

Feature Selection Based on Model Importance

The SelectFromModel class in the …/_from_model.py file is a meta-transformer that allows for feature selection based on the importance weights of an underlying estimator. This class can be used with any estimator that has a feature_importances_ or coef_ attribute after fitting, or with a custom importance getter function.

Mutual Information

The scikit-learn library provides functionality for estimating the mutual information between input features and the target variable, for both classification and regression problems. This is implemented in the …/_mutual_info.py file.

Variance Threshold

The VarianceThreshold class, which is part of the scikit-learn/sklearn/feature_selection/ directory, is a feature selection algorithm that removes low-variance features from the input data. This algorithm is useful for unsupervised learning tasks, as it only considers the features (X) and not the desired outputs (y).
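
A minimal sketch with the default threshold of 0.0, which removes only constant features. The small matrix is an illustrative assumption.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The first column is constant, so its variance is zero.
X = np.array([[0, 2, 0], [0, 1, 4], [0, 1, 1]])

selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)  # no y is needed

print(selector.get_support())
print(X_reduced)
```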

Model Inspection and Interpretation

References: sklearn/inspection

The scikit-learn library provides tools for inspecting and interpreting machine learning models, implemented in the …/inspection directory. This directory includes functionality for computing and visualizing partial dependence plots (PDPs), calculating the permutation importance of features, and visualizing the decision boundaries of machine learning models.

Partial Dependence Plots

The scikit-learn library provides functionality for computing and visualizing partial dependence plots (PDPs) for regression and classification models. This functionality is implemented in the …/_partial_dependence.py and …/partial_dependence.py files.

Permutation Importance

The _permutation_importance.py file in the scikit-learn library provides functionality for computing the permutation importance of features in a trained estimator. Permutation importance is a technique for evaluating the importance of individual features by measuring the decrease in a model's performance when a feature is randomly shuffled.
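
A short sketch using the public permutation_importance function, assuming synthetic data in which only the first feature drives the target:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# The target depends on feature 0 only; feature 1 is pure noise.
X = rng.normal(size=(200, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Each feature is shuffled n_repeats times; the drop in score is recorded.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

print(result.importances_mean)
```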

Decision Boundary Visualization

The scikit-learn/sklearn/inspection/_plot/decision_boundary.py file provides functionality for visualizing the decision boundaries of machine learning models. The main entry point is the DecisionBoundaryDisplay.from_estimator() class method, which allows creating a DecisionBoundaryDisplay object directly from a fitted estimator.

Utility Functions

The important functionality in the file …/_pd_utils.py is as follows:

Handling Missing Data

References: sklearn/impute

The scikit-learn library includes various imputation methods for handling missing values in datasets, implemented in the …/impute directory. The main components in this directory are the SimpleImputer, IterativeImputer, and KNNImputer classes.

Simple Imputation

The SimpleImputer class in the scikit-learn/sklearn/impute/_base.py file provides a simple and efficient way to handle missing values in data using common imputation strategies. The SimpleImputer class is a concrete implementation of a univariate imputer that replaces missing values using strategies like mean, median, or most frequent value.
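
A minimal sketch of mean imputation, assuming a tiny array with one missing value per column (illustrative values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])

# Each missing entry is replaced by the mean of its own column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column means used for filling
print(X_filled)
```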

Iterative Imputation

The IterativeImputer class in the …/_iterative.py file provides a multivariate imputation approach for handling missing values in a dataset. The key functionality of this class is to iteratively impute the missing values by estimating each feature from all the others in a round-robin fashion.

k-Nearest Neighbors Imputation

The KNNImputer class in the scikit-learn/sklearn/impute/_knn.py file provides a way to impute missing values in a dataset using a k-Nearest Neighbors (kNN) based approach. The KNNImputer class inherits from the _BaseImputer class and offers several parameters to control the behavior of the imputation process.
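
A small sketch of kNN-based imputation with two neighbors. The array is an illustrative assumption; each missing entry is filled with the average of that feature over the two nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances ignore missing coordinates (NaN-aware Euclidean metric).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```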

Utility Functions

References: sklearn/utils

The scikit-learn library includes a variety of utility functions and classes that are used throughout the library, implemented in the …/utils directory.

Utility Functions

References: sklearn/utils

The …/utils directory contains a wide range of utility functions and classes that are used throughout the scikit-learn library. This directory provides functionality for data manipulation and validation, parallel processing, hashing, and various other tasks that are common in machine learning applications.

Estimator Utilities

The _pprint.py file in the scikit-learn utils module contains the _EstimatorPrettyPrinter class, which is used to provide custom printing functionality for estimator objects in the BaseEstimator.__repr__ method. This class extends the built-in pprint.PrettyPrinter class and overrides several methods to handle the printing of estimators, their parameters, and related data structures.

Testing Utilities

The assert_allclose() and assert_allclose_dense_sparse() functions in the …/_testing.py file are utility functions used for testing in the scikit-learn library.

Class and Sample Weighting

The sklearn.utils.class_weight module in scikit-learn provides utility functions for handling class weights and sample weights for unbalanced datasets. The two main functions in this module are compute_class_weight() and compute_sample_weight().
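
A minimal sketch of both functions with the "balanced" heuristic, which weights each class by n_samples / (n_classes * class_count). The label vector is an illustrative assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Imbalanced labels: three samples of class 0, one of class 1.
y = np.array([0, 0, 0, 1])

# One weight per class: 4 / (2 * 3) for class 0 and 4 / (2 * 1) for class 1.
class_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

# One weight per sample, looked up from its class.
sample_weights = compute_sample_weight("balanced", y)

print(class_weights, sample_weights)
```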

Deprecation Handling

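The deprecated decorator class, located in the …/deprecation.py file, is a utility provided by the scikit-learn library to handle deprecation of functions and classes. This decorator serves two main purposes: it issues a warning when the decorated function is called or the decorated class is instantiated, and it adds a deprecation notice to the docstring.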

Object Discovery

The sklearn.utils.discovery module provides utility functions for discovering various types of objects within the scikit-learn package, including estimators, displays, and functions.

Mathematical and Data Manipulation Utilities

The …/extmath.py file provides a variety of mathematical and data manipulation utility functions that are used throughout the scikit-learn library.

Example Scripts and Notebooks

References: examples

The scikit-learn library includes a directory with example scripts and Jupyter notebooks that demonstrate the usage of various features and functionalities. These examples cover a wide range of topics, including:

Linear Models

The …/linear_model directory contains a collection of example scripts that demonstrate the usage of various linear models and regression techniques from the scikit-learn library. The examples cover a wide range of functionality, including:

Clustering

References: examples/cluster

The …/ directory contains a collection of Python scripts that demonstrate the usage of various clustering algorithms from the scikit-learn library. The examples cover a wide range of clustering techniques, including:

Ensemble Methods

References: examples/ensemble

The …/ensemble directory contains a collection of example scripts that demonstrate the usage and functionality of various ensemble methods in the scikit-learn library. The examples cover a wide range of ensemble techniques, including:

Model Selection and Evaluation

The …/model_selection directory contains a collection of Python scripts that demonstrate various aspects of model selection and evaluation in the scikit-learn library.

Support Vector Machines

References: examples/svm

The file …/plot_custom_kernel.py demonstrates how to use a custom kernel with a Support Vector Machine (SVM) classifier to perform a 3-class classification task on the Iris dataset. The key aspects of the implementation are:

Nearest Neighbors

References: examples/neighbors

The …/neighbors directory contains a collection of example files that demonstrate the usage of the sklearn.neighbors module in the scikit-learn library. This module provides various nearest neighbors-based methods, such as k-Nearest Neighbors (kNN) classification and regression, Kernel Density Estimation (KDE), and Neighborhood Components Analysis (NCA).

Applications

The …/applications directory contains a collection of example scripts that demonstrate the application of various machine learning techniques to real-world problems and datasets. These examples cover a wide range of topics, including:

Dimensionality Reduction

The …/decomposition directory contains a collection of example scripts that demonstrate the usage of various dimensionality reduction and matrix decomposition techniques from the sklearn.decomposition module in the scikit-learn library.

Gaussian Processes

The …/ directory contains a set of example files that demonstrate the usage of the sklearn.gaussian_process module in the scikit-learn library. This module provides functionality for Gaussian Process Regression (GPR) and Gaussian Process Classification (GPC), which are powerful tools for regression and classification tasks.

Data Preprocessing

The …/preprocessing directory contains a set of example scripts that demonstrate various data preprocessing techniques from the sklearn.preprocessing module in the scikit-learn library. These examples cover a wide range of preprocessing tasks, including feature scaling, discretization, target encoding, and mapping data to a normal distribution.

Feature Selection

The …/feature_selection directory contains a set of example files that demonstrate various feature selection techniques available in the scikit-learn library. The examples cover topics such as:

Model Inspection and Interpretation

The …/inspection directory contains a set of example files that demonstrate the usage of the sklearn.inspection module in the scikit-learn library. This module provides tools for model inspection and interpretation, which are crucial for understanding the behavior and performance of machine learning models.

Neural Networks

The …/neural_networks directory contains several example scripts that demonstrate the usage of the sklearn.neural_network module in the scikit-learn library. These examples cover various aspects of neural network models, including:

Datasets

References: examples/datasets

The …/datasets directory contains several example scripts that demonstrate the usage of the sklearn.datasets module in the scikit-learn library. This module provides access to various datasets that can be used for machine learning tasks.

Text Processing

References: examples/text

The …/text directory contains several example scripts that demonstrate the usage of text processing techniques in the scikit-learn library.

Handling Missing Data

References: examples/impute

The scikit-learn library includes various imputation methods for handling missing values in datasets, implemented in the …/impute directory. The examples in the …/impute directory demonstrate the usage of these imputation techniques.

Multi-Output Problems

The …/multioutput directory contains an example demonstrating the usage of the sklearn.multioutput module in the scikit-learn library. The sklearn.multioutput module is used for handling multiple output problems, where several target variables are predicted for each sample; meta-estimators such as MultiOutputRegressor do this by fitting one estimator per target.
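
As a rough sketch of the module in use, the example below wraps a Ridge regressor in MultiOutputRegressor for a two-target problem. The synthetic data and the choice of Ridge are assumptions, and this is not the example script from the directory.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)

# Three input features and two targets per sample.
X = rng.normal(size=(50, 3))
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2]])

# The wrapper fits one independent Ridge model per target column.
model = MultiOutputRegressor(Ridge()).fit(X, Y)
Y_pred = model.predict(X)

print(len(model.estimators_), Y_pred.shape)
```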

Kernel Approximation

The scikit-learn/examples/kernel_approximation directory contains an example that demonstrates the use of the PolynomialCountSketch class from the sklearn.kernel_approximation module. This class is used to efficiently generate an approximation of the polynomial kernel feature space, which can then be used to train a linear classifier that approximates the accuracy of a kernelized classifier.

Developing Custom Estimators

The scikit-learn/examples/developing_estimators directory contains examples and guidance on developing custom estimators for the scikit-learn library. Estimators are the core components of machine learning models in scikit-learn, and the ability to create custom estimators is an important feature of the library.

Miscellaneous Examples

The …/miscellaneous directory contains a collection of example scripts and notebooks that demonstrate various features and functionalities of the scikit-learn library. The examples cover a wide range of topics, including anomaly detection, kernel approximation, multi-label classification, outlier detection, and more.
