pandas
The pandas library is a powerful and flexible data analysis and manipulation tool for Python, providing labeled data structures similar to R's data.frame objects, as well as a wide range of statistical functions and utilities. The library is primarily focused on handling tabular and time-series data, with a strong emphasis on efficient and intuitive data processing.
At the core of the pandas library are the Series and DataFrame classes, which represent one-dimensional and two-dimensional labeled data structures, respectively. These data structures are designed to work seamlessly with a variety of data types, including numeric, string, datetime, and custom data types. The …/core directory contains the implementation of these core data structures, as well as a rich set of utility functions and algorithms for working with the data.
The …/dtypes directory is a crucial component of the library, as it provides functionality for managing data types in Pandas. This includes type checking, type conversion, missing value handling, and type promotion. The ExtensionDtype class and related utilities in this directory allow users to define and work with custom data types, extending the capabilities of the library.
Another important aspect of the pandas library is its support for datetime, timedelta, and period data. The …/arrays directory contains the implementation of specialized array types for handling these data types, such as DatetimeArray, TimedeltaArray, and PeriodArray. These array types provide efficient storage and operations for time-series data, including support for time zone handling and various time-based operations.
The …/groupby directory is responsible for the core functionality of performing groupby operations on Pandas data structures. This includes the GroupBy class, which provides methods for applying aggregation, transformation, and filtering operations on grouped data. The directory also includes utilities for handling grouping criteria, such as the Grouper class, as well as Numba-based optimizations for improved performance.
The …/io directory is dedicated to the input/output functionality of the pandas library, providing tools for reading and writing data in various formats, including Excel, HTML, JSON, and SQL databases. This directory includes specialized classes and functions for handling the parsing, formatting, and conversion of data between Pandas data structures and external data sources.
Overall, the pandas library is designed to provide a comprehensive and efficient set of tools for working with tabular and time-series data in Python. The codebase is well-structured, with a clear separation of concerns between the various components, and a focus on performance, flexibility, and ease of use.
Data Structures
References: pandas/core
The Pandas library provides two core data structures: Series and DataFrame.
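As a rough illustration of the two structures, the sketch below builds a small Series and a DataFrame through the public API. The labels, column names, and values are made up for the example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], name="price")

# A DataFrame is a two-dimensional labeled table; each column behaves like a Series.
table = pd.DataFrame({"price": prices, "qty": [10, 20, 30]})

print(prices["b"])             # look up a value by its label
print(table.loc["c", "qty"])   # look up a cell by row label and column label
print(table["price"].dtype)    # each column carries its own dtype
```

Passing the Series as a column lets the DataFrame reuse its labels as the row index.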
Series
References: pandas/core
The Series class is the main one-dimensional data structure in the pandas library, representing a labeled array of data. Its core functionality is implemented across several files and directories in the pandas codebase.
DataFrame
References: pandas/core
The DataFrame class is the main two-dimensional data structure in the pandas library, representing a labeled table of data. Its core functionality is implemented across several key components of the pandas codebase.
Indexes
References: pandas/core/indexes
The Index and MultiIndex classes are used to represent the row and column labels in Series and DataFrame objects, providing a unified interface for working with labeled data in Pandas.
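The following sketch shows how a flat Index and a MultiIndex can be used for label-based lookup. The level names and values are illustrative assumptions only.

```python
import pandas as pd

# A flat Index labels one axis with a single level.
idx = pd.Index(["x", "y", "z"], name="key")
position = idx.get_loc("y")   # integer position of a label

# A MultiIndex labels one axis with several levels.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])
s = pd.Series([0, 1, 2, 3], index=mi)

one_value = s.loc[("b", 1)]   # select with a full key
sub_series = s.loc["a"]       # select with a partial key on the first level
```

Selecting with a partial key drops the matched level, so sub_series is indexed by the remaining "number" level.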
Array Types
References: pandas/core/arrays, pandas/core/arrays/string_.py
Pandas utilizes a variety of array classes to manage and store data efficiently, each tailored to specific data types and use cases. One such class is StringArray, which is specifically designed for handling string data. This class is capable of managing missing values, a common occurrence in real-world datasets, and allows for type conversions, which are essential for data cleaning and preparation processes.
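A minimal sketch of missing-value handling and type conversion with StringArray might look like the following. It assumes the Python-backed string storage is requested explicitly; the sample values are invented.

```python
import pandas as pd

# Request the Python-object-backed storage explicitly so the result is a
# StringArray regardless of whether pyarrow is installed.
dtype = pd.StringDtype(storage="python")
fruit = pd.array(["apple", None, "cherry"], dtype=dtype)

# Missing entries are stored as pd.NA and reported by isna().
missing = fruit.isna()

# Numeric strings can be converted to a nullable integer array;
# the missing entry stays missing after the conversion.
digits = pd.array(["1", None, "3"], dtype=dtype)
numbers = digits.astype("Int64")
```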
Extension Arrays
References: pandas/core/arrays/base.py, pandas/core/dtypes/common.py
The ExtensionArray abstract base class, defined in …/base.py, serves as the interface for custom 1-D array types in pandas. It defines the set of methods and attributes that subclasses must implement.
Block Manager
References: pandas/core/internals/managers.py, pandas/core/internals/blocks.py
The BlockManager class orchestrates the internal storage of pandas data structures, particularly DataFrame and Series. It manages a collection of blocks, which are 2D arrays that store data in a contiguous manner according to their respective dtypes. The class is defined in …/managers.py and is a concrete implementation of the BaseBlockManager abstract base class.
Data Types Casting and Promotion
References: pandas/core/dtypes/cast.py
In …/cast.py, a collection of functions is dedicated to the casting and promotion of data types within Pandas data structures. These functions play a critical role in ensuring that data types are compatible and can handle the introduction of missing values.
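The effect of type promotion can be observed from the public API without calling the internal casting functions directly. The sketch below is only an illustration of that observable behavior: introducing a missing value into an int64 Series promotes it to float64, while the nullable Int64 dtype keeps its type.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                 # NumPy-backed int64

# Reindexing to a label that does not exist introduces a missing value.
# int64 cannot hold NaN, so the result is promoted to float64.
widened = ints.reindex([0, 1, 2, 3])

# The nullable Int64 extension dtype can represent missing values itself,
# so no promotion is needed.
nullable = ints.astype("Int64").reindex([0, 1, 2, 3])

print(ints.dtype, widened.dtype, nullable.dtype)
```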
Data Structure Construction
References: pandas/core/internals/construction.py
When pandas data structures such as DataFrame are constructed, the …/construction.py module provides the functions that transform various kinds of input data into the internal storage format. One key function, arrays_to_mgr(), accepts arrays, columns, and an optional index to create a BlockManager, which is the backbone of pandas data structures and handles the actual data storage.
Concatenation and Merging
References: pandas/core/reshape/concat.py
The concat() function in …/concat.py is the primary mechanism for combining pandas objects along a particular axis. It is equipped to handle a variety of scenarios, including different object types and dimensions, with a focus on maintaining the integrity of the data structures involved. The function's flexibility is evident in its ability to manage both Series and DataFrame objects, applying logic to sort and align the data as needed.
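To make the alignment behavior concrete, here is a small sketch that concatenates along both axes. The frame contents and labels are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]}, index=["r1", "r2"])
right = pd.DataFrame({"a": [3], "b": [9.0]}, index=["r3"])

# Along axis 0 (the default) rows are stacked and columns are aligned;
# cells with no source value become NaN.
rows = pd.concat([left, right])

# Along axis 1 objects are placed side by side and aligned on the row labels,
# even when the labels arrive in a different order.
extra = pd.Series([10, 20], index=["r2", "r1"], name="c")
cols = pd.concat([left["a"], extra], axis=1)

print(rows)
print(cols)
```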
Data Types
References: pandas/core/dtypes
The pandas library provides a set of tools and utilities for working with data types, including extension data types and custom data types. This functionality is implemented across several modules in the …/dtypes directory.
Extension Data Types
References: pandas/core/dtypes/base.py, pandas/core/dtypes/dtypes.py
The ExtensionDtype class provides the base implementation for defining custom data types in Pandas. This class defines the required methods and properties that must be implemented by subclasses, such as type, name, and construct_array_type(). These methods allow the custom data type to be properly integrated with the Pandas ecosystem.
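Rather than defining a new dtype, the sketch below inspects a built-in one, Int64Dtype, to show what the members named above look like on a concrete ExtensionDtype subclass. The choice of Int64Dtype is an assumption made for illustration.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

dtype = pd.Int64Dtype()

is_extension = isinstance(dtype, ExtensionDtype)
dtype_name = dtype.name                      # string identifier of the dtype
scalar_type = dtype.type                     # scalar type of individual elements
array_cls = dtype.construct_array_type()     # the matching ExtensionArray subclass

# The array class returned above is the one pandas uses for this dtype.
values = pd.array([1, None, 3], dtype=dtype)

print(dtype_name, scalar_type, array_cls.__name__)
```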
Data Type Conversion and Casting
References: pandas/core/dtypes/astype.py, pandas/core/dtypes/cast.py
The astype.py file provides functions that implement the astype method according to pandas conventions, particularly where the behavior differs from NumPy.
Type Checking and Inference
References: pandas/core/dtypes/api.py, pandas/core/dtypes/common.py, pandas/core/dtypes/inference.py, pandas/core/dtypes/missing.py
The pandas.core.dtypes.api module provides a collection of utility functions and type checks related to data types in the Pandas library. These functions are used throughout the Pandas codebase to ensure data integrity and perform type-specific operations.
Error Handling in Data Type Operations
References: pandas/core/dtypes/common.py
In the context of data type operations within the pandas library, the pandas_dtype function from …/common.py plays a key role in error handling by raising more informative error messages. This function is designed to convert an input into a Pandas-specific data type object or a NumPy data type object. When the input provided to pandas_dtype is not recognizable as a valid data type, the function raises a TypeError with a message that clarifies the nature of the issue.
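The behavior described above can be sketched with the publicly exported pandas_dtype from pandas.api.types. The invalid string used here is an arbitrary example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

numpy_result = pandas_dtype("int64")        # resolves to a NumPy dtype
pandas_result = pandas_dtype("category")    # resolves to a pandas extension dtype

# An unrecognized specification raises TypeError.
try:
    pandas_dtype("not-a-dtype")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(numpy_result, pandas_result, error_message)
```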
Datetime and Timedelta Arrays
References: pandas/core/arrays/datetimes.py, pandas/core/arrays/timedeltas.py
The DatetimeArray and TimedeltaArray classes in the …/datetimes.py and …/timedeltas.py files provide functionality for working with datetime and timedelta data in the pandas library.
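As a small illustration, the sketch below obtains a DatetimeArray from a DatetimeIndex, derives a TimedeltaArray by subtraction, and attaches a time zone. The dates are made up.

```python
import pandas as pd

# The .array attribute exposes the DatetimeArray that backs a DatetimeIndex.
stamps = pd.to_datetime(["2024-01-01", "2024-01-03"]).array

# Subtracting a timestamp from a DatetimeArray yields a TimedeltaArray.
deltas = stamps - stamps[0]

# Naive timestamps can be given a time zone.
localized = stamps.tz_localize("UTC")

print(type(stamps).__name__, type(deltas).__name__, localized.tz)
```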
Period Arrays
References: pandas/core/arrays/period.py
The PeriodArray class provides the core functionality for working with period data in the pandas library. This class is responsible for constructing, converting, and performing operations on period-based data structures.
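A brief sketch of construction, arithmetic, and conversion on a PeriodArray follows. The monthly frequency and start date are assumptions chosen for the example.

```python
import pandas as pd

# Three consecutive monthly periods, exposed as a PeriodArray.
periods = pd.period_range("2024-01", periods=3, freq="M").array

# Integer arithmetic shifts each period by whole units of its frequency.
shifted = periods + 1

# Periods can be converted to timestamps marking the start of each period.
starts = periods.to_timestamp()

print(periods, shifted, starts, sep="\n")
```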
Arrow-backed Arrays
References: pandas/core/arrays/arrow
The ArrowExtensionArray class is the core component for working with Apache Arrow data types within the Pandas library. It is a subclass of ExtensionArray and implements the required methods and properties to integrate with Pandas.
Sparse Arrays
References: pandas/core/arrays/sparse
The SparseArray class is a core component for working with sparse data within the Pandas library. It is a subclass of the Pandas ExtensionArray and is designed for efficient storage and manipulation of sparse data.
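The storage saving can be seen in a minimal sketch like the one below, which assumes zero as the fill value. The input values are invented.

```python
import numpy as np
import pandas as pd

dense = [0, 0, 5, 0, 0, 0, 7, 0]

# Only values that differ from fill_value are stored explicitly.
sparse = pd.arrays.SparseArray(dense, fill_value=0)

stored = sparse.sp_values        # the explicitly stored values
density = sparse.density         # fraction of entries that are stored
round_trip = np.asarray(sparse)  # materialize back into a dense NumPy array

print(stored, density, round_trip)
```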
Grouping and Aggregation
References: pandas/core/groupby
This section covers the functionality for performing groupby operations in the pandas library.
Grouping Objects
References: pandas/core/groupby
The GroupBy class is the main entry point for creating grouping objects in the Pandas library. It provides a flexible and powerful interface for grouping data based on various criteria, such as keys, levels, and time-based grouping.
Grouping Functionality
References: pandas/core/groupby/groupby.py
The DataFrameGroupBy and SeriesGroupBy classes provide the core functionality for performing groupby aggregation, transformation, and filtering operations on Pandas DataFrame and Series objects, respectively.
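The three kinds of operation named above can be sketched on a small frame as follows. The column names and data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [1, 2, 3, 4]})

grouped = df.groupby("team")      # a DataFrameGroupBy
scores = grouped["score"]         # a SeriesGroupBy

# Aggregation: one value per group.
totals = scores.sum()

# Transformation: a result aligned with the original rows.
centered = scores.transform(lambda s: s - s.mean())

# Filtering: keep only the rows of groups that satisfy a condition.
big = grouped.filter(lambda g: g["score"].sum() > 4)

print(totals, centered, big, sep="\n")
```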
Grouper Handling
References: pandas/core/groupby/grouper.py, pandas/core/groupby/indexing.py
The Grouper class is the main entry point for specifying a grouping instruction for a Pandas object (Series or DataFrame). It supports various parameters to control the grouping behavior, such as key, level, freq, sort, closed, label, convention, origin, offset, and dropna.
GroupBy Plotting
References: pandas/core/groupby/groupby.py
The GroupByPlot class in …/groupby.py equips GroupBy objects with the capability to visualize grouped data through plotting. This class is a part of the larger GroupBy functionality that enables users to perform various operations on data that has been grouped according to certain criteria.
Numba-based Optimizations
References: pandas/core/groupby/groupby.py
The generate_numba_agg_func() function in …/numba_.py is responsible for generating Numba-jitted functions that can be used to optimize groupby aggregation operations in pandas.
Groupby Operations
References: pandas/core/groupby/ops.py
The BinGrouper and BaseGrouper classes in …/ops.py are responsible for performing the actual groupby operations in the Pandas library. These classes handle the data splitting and Cython-based implementations that power the groupby functionality.
Algorithms and Utility Functions
References: pandas/core
The Pandas library provides a wide range of algorithms and utility functions for data manipulation and analysis. These functions enable efficient operations on arrays, handling of missing data, and common data transformations.
Algorithms and Utility Functions
References: pandas/core/algorithms.py, pandas/core/apply.py, pandas/core/ops, pandas/core/reshape
The Algorithms and Utility Functions section covers the various algorithms and utility functions provided by the pandas library for data manipulation and analysis. This includes functions for applying operations to arrays, handling missing data, and performing common data transformations.
Array Operations
References: pandas/core/array_algos/datetimelike_accumulations.py, pandas/core/array_algos/masked_accumulations.py, pandas/core/array_algos/masked_reductions.py
The core functionality for performing efficient array operations, such as cumulative sums, minimums, maximums, and variances, is provided by the array_algos files listed in the references above.
Data Transformation
References: pandas/core/reshape/api.py, pandas/core/reshape/concat.py, pandas/core/reshape/encoding.py, pandas/core/reshape/melt.py, pandas/core/reshape/pivot.py, pandas/core/reshape/reshape.py, pandas/core/reshape/tile.py
The pandas.core.reshape module provides a collection of functions and utilities for performing common data transformations, such as pivoting, melting, and binning, as well as utilities for concatenating and reshaping data.
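The pivoting, melting, and binning transformations mentioned above can be sketched through the public API as follows. The column names, bin edges, and labels are assumptions made for the example.

```python
import pandas as pd

long = pd.DataFrame(
    {
        "day": ["mon", "mon", "tue", "tue"],
        "city": ["x", "y", "x", "y"],
        "temp": [10, 12, 11, 15],
    }
)

# Pivoting: one row per day, one column per city.
wide = long.pivot(index="day", columns="city", values="temp")

# Melting: back to one row per (day, city) pair.
back = wide.reset_index().melt(id_vars="day", value_name="temp")

# Binning: assign each temperature to a labeled interval, (0, 11] or (11, 20].
bands = pd.cut(long["temp"], bins=[0, 11, 20], labels=["cool", "warm"])

print(wide, back, bands, sep="\n")
```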
Missing Data Handling
References: pandas/core/dtypes/missing.py
The …/missing.py module provides functions for detecting and handling missing values in pandas data structures.
Computation and Expression Evaluation
References: pandas/core/computation
The core functionality for evaluating expressions on Pandas objects is provided in the …/computation directory. This includes parsing and evaluating expressions, aligning the operands, and providing support for different computation engines.
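From the user's side, expression evaluation is reachable through DataFrame.eval() and DataFrame.query(). The sketch below selects the pure-Python engine explicitly so that it does not depend on the optional numexpr package; the frame contents are invented.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Evaluate an arithmetic expression over columns, referenced by name.
total = df.eval("a + b", engine="python")

# Evaluate a boolean expression and keep the matching rows.
subset = df.query("a > 1 and b < 30", engine="python")

print(total, subset, sep="\n")
```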
Rolling Window Calculations
References: pandas/core/indexers/objects.py
Rolling window calculations in pandas are facilitated by a suite of indexers located in …/objects.py. These indexers determine the window boundaries for rolling operations such as rolling means and sums, which are essential for time series analysis, and different indexers cater to different windowing scenarios.
Resampling
References: pandas/core/resample.py
The Resampler class in …/resample.py serves as the primary interface for resampling time series data. It provides methods like aggregate(), transform(), and apply() for aggregating resampled data.
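A minimal sketch of the three methods on a daily series resampled into two-day bins is given below. The dates, values, and bin width are illustrative assumptions.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# aggregate(): reduce each two-day bin to a single value.
sums = s.resample("2D").aggregate("sum")

# apply(): run an arbitrary function on each bin.
spreads = s.resample("2D").apply(lambda x: x.max() - x.min())

# transform(): return a result aligned with the original index.
demeaned = s.resample("2D").transform(lambda x: x - x.mean())

print(sums, spreads, demeaned, sep="\n")
```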
Sparse Data Handling
References: pandas/core/sparse
The …/sparse directory in the pandas library provides the core functionality for working with sparse data.
String Manipulation
References: pandas/core/strings
The …/strings directory contains the core functionality for working with string data in the pandas library.
String Methods
References: pandas/core/strings/accessor.py
The StringMethods class in …/accessor.py provides vectorized string operations for Series and Index objects.
Regular Expression Operations
References: pandas/core/strings/accessor.py
The StringMethods class in …/accessor.py provides methods for performing regular expression operations on string data in Pandas Series and Index objects. These methods utilize Python's re module to offer vectorized string manipulation capabilities.
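The .str accessor exposes these operations on a Series. The sketch below shows matching, capture-group extraction, and substitution; the patterns and sample strings are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a-1", "b-22", None])

# Test each element against a pattern; na=False treats missing values as no match.
has_two_digits = s.str.contains(r"\d{2}", regex=True, na=False)

# Extract named capture groups into the columns of a DataFrame.
parts = s.str.extract(r"(?P<letter>[a-z])-(?P<num>\d+)")

# Replace every run of digits; missing values stay missing.
masked = s.str.replace(r"\d+", "#", regex=True)

print(has_two_digits, parts, masked, sep="\n")
```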
Input/Output
References: pandas/io
The …/io directory in the Pandas library provides functionality for reading and writing data in various formats, including Excel, HTML, JSON, and SQL databases. The directory contains several sub-directories and files that handle the specific implementation details for each data format.
Clipboard Functionality
References: pandas/io/clipboard
The pandas.io.clipboard module provides functionality for interacting with the system clipboard, allowing users to read from and write to the clipboard. The core functionality is provided by the copy() and paste() functions, which are responsible for copying data to and retrieving data from the clipboard, respectively.
Excel File Handling
References: pandas/io/formats/excel.py, pandas/io/formats/style.py
The …/excel directory in the pandas library provides functionality for reading and writing Excel files using various backend libraries.
JSON Data Handling
References: pandas/io/json, pandas/io/json/_json.py
The …/json directory in the pandas library provides functionality for reading and writing JSON data to and from pandas data structures, such as DataFrame and Series.
CSV and Tabular Data Parsing
References: pandas/io/parsers/readers.py, pandas/io/parsers/base_parser.py
The core functionality for parsing CSV, fixed-width, and other tabular data formats into Pandas DataFrames is implemented in the …/parsers directory. This directory contains several key components that work together to provide a robust and flexible data parsing pipeline.
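From the caller's perspective, this pipeline is reached through functions such as read_csv() and read_fwf(). The sketch below parses small in-memory texts; the column names, widths, and contents are assumptions for the example.

```python
import io

import pandas as pd

# Delimited text: the first line is used as the header.
csv_text = "name,score\nann,1\nbob,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Fixed-width text: columns are defined by character widths, not delimiters.
fwf_text = "ann  1\nbob 22\n"
from_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[3, 3], names=["name", "score"])

print(from_csv, from_fwf, sep="\n")
```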
SAS File Handling
References: pandas/io/sas
The pandas library provides functionality for reading SAS data files in both the SAS7BDAT and XPORT formats into Pandas DataFrames. This functionality is primarily implemented in the …/sas directory.
HTML Data Handling
References: pandas/io/html.py
Pandas provides the capability to parse HTML tables into DataFrame objects through the …/html.py file. The read_html() function serves as the primary interface for this task, accommodating various types of HTML sources, including non-seekable io objects. It either returns a list of DataFrame objects or fails with an error; when no tables are found it raises an error rather than returning an empty list.
Stata File Handling
References: pandas/io/stata.py
Handling Stata files within the pandas library is facilitated by the …/stata.py file, which includes classes designed for both reading and writing Stata data files. The StataReader class provides the capability to read Stata files into pandas DataFrames, accommodating various Stata file versions. It also offers methods to handle metadata such as variable labels and value labels, which are essential for understanding the data's structure and meaning.
SPSS File Handling
References: pandas/io/spss.py
The read_spss() function in …/spss.py enables the conversion of SPSS file data into a pandas DataFrame. It leverages the pyreadstat external library to facilitate the reading process. The function is designed to handle additional keyword arguments which are passed directly to pyreadstat.read_sav, the underlying function responsible for parsing SPSS files.
LaTeX Output
References: pandas/io/formats/style.py
The Styler class in …/style.py serves as the interface for styling DataFrames and Series, and offers a to_latex() method for generating LaTeX representations. This method is useful for users who need to carry DataFrame styling into LaTeX documents, a typesetting system commonly used for scientific publications.
HDF5 File Handling
References: pandas/io/pytables.py
Interfacing with HDF5 file formats in the pandas library is facilitated by …/pytables.py, which leverages the PyTables library to provide a high-level API for data storage and retrieval. The HDFStore class serves as a central component, offering a dictionary-like interface for these operations. It supports various file opening modes and compression options, enhancing flexibility in data handling.