pandas
pandas is a Python library for data analysis that has become very popular in recent years. The project website describes it as follows:
"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
More specifically, pandas is an in-memory analysis tool that offers SQL-like constructs as well as statistical and analytical tools. pandas builds on Cython and NumPy, which makes it less memory-intensive and faster than pure Python code. pandas is mostly used to
- implement an ETL process
- prepare data for machine learning
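The following is a minimal sketch of how the SQL-like constructs can be combined into a small ETL step. The sample data, the column names and the `run_etl` helper are hypothetical and only serve as illustration; the load step is left out, so the function simply returns the transformed DataFrame.

```python
import io

import pandas as pd

# Hypothetical raw data; in practice this would come from a file or a database.
RAW_ORDERS = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.50
4,carol,20.00
"""

# Hypothetical lookup table used for the join.
CUSTOMERS = pd.DataFrame(
    {"customer": ["alice", "bob", "carol"], "country": ["DE", "DE", "FR"]}
)


def run_etl(raw_csv: str, customers: pd.DataFrame) -> pd.DataFrame:
    # Extract: parse the CSV text into a DataFrame.
    orders = pd.read_csv(io.StringIO(raw_csv))
    # Transform: drop rows without an amount (similar to WHERE amount IS NOT NULL).
    orders = orders.dropna(subset=["amount"])
    # Join with the customer data (similar to a SQL LEFT JOIN).
    joined = orders.merge(customers, on="customer", how="left")
    # Aggregate per country (similar to GROUP BY country).
    totals = joined.groupby("country", as_index=False)["amount"].sum()
    return totals.sort_values("country").reset_index(drop=True)


result = run_etl(RAW_ORDERS, CUSTOMERS)
print(result)
```

In a real pipeline, the returned DataFrame would typically be written to a target system, for example with `DataFrame.to_csv` or `DataFrame.to_sql`.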
pandas vs. Polars vs. Dask and DuckDB
The choice between pandas, Polars, Dask, and DuckDB depends on the type of workload:
- pandas
is the canonical Python DataFrame library for analysis on a single machine.
- Polars
is written in Rust and suits high-performance analysis on a single node, especially when lazy evaluation and the expressions API are important.
- Dask
is a Python library for parallel computing that scales familiar APIs, including pandas and Scikit-Learn, to clusters.
- DuckDB
is an in-process OLAP database for analytical SQL queries over local files and often complements pandas DataFrames.