pandas

pandas is a Python library for data analysis that has become very popular in recent years. The project website describes it as follows:

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

More specifically, pandas is an in-memory analysis tool that offers SQL-like constructs as well as statistical and analytical tools. pandas builds on Cython and NumPy, which makes it faster and less memory-intensive than equivalent pure Python code. pandas is mostly used to

replace Excel and Power BI
implement an ETL process
process CSV or JSON data
prepare data for machine learning
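
To make the SQL-like constructs and statistical tools mentioned above more concrete, here is a minimal sketch. The sample data, the column names (`city`, `product`, `amount`) and the variable names are made up for illustration; in practice the CSV would usually come from a file rather than an in-memory string.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_data = io.StringIO(
    "city,product,amount\n"
    "Berlin,book,12.5\n"
    "Berlin,pen,2.0\n"
    "Hamburg,book,20.0\n"
    "Hamburg,pen,3.5\n"
    "Berlin,book,7.5\n"
)

# Load the CSV data into an in-memory DataFrame.
sales = pd.read_csv(csv_data)

# Roughly equivalent to the SQL query:
#   SELECT city, SUM(amount) AS total
#   FROM sales WHERE product = 'book' GROUP BY city
book_totals = (
    sales[sales["product"] == "book"]
    .groupby("city", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)

# Basic descriptive statistics (count, mean, std, quartiles, ...) for one column.
stats = sales["amount"].describe()

print(book_totals)
print(stats)
```

Filtering with a boolean mask plays the role of `WHERE`, and `groupby` followed by an aggregation plays the role of `GROUP BY`; the whole computation runs in memory on a single machine.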

Tip

Analysing data with pandas

pandas vs. Polars vs. Dask and DuckDB

The choice between pandas, Polars, Dask, and DuckDB depends on the type of workload:

pandas: the canonical Python DataFrame library for analysis on a single machine.
Polars: a DataFrame library written in Rust, suited to high-performance analysis on a single node, especially when lazy evaluation and the expression API are important.
Dask: a Python library for parallel computing that scales familiar APIs, including pandas and scikit-learn, to clusters.
DuckDB: an in-process OLAP database for analysis and SQL queries over local files, which often complements pandas DataFrames.