{ "cells": [ { "cell_type": "markdown", "id": "2065a177", "metadata": {}, "source": [ "# Gruppenoperationen\n", "\n", "Mit `groupby` ist ein Prozess gemeint, der einen oder mehrere der folgenden Schritte umfasst:\n", "\n", "* **Split** teilt die Daten in Gruppen nach bestimmten Kriterien auf\n", "* **Apply** wendet eine Funktion unabhängig auf jede Gruppe an\n", "* **Combine** kombiniert die Ergebnisse in einer Datenstruktur\n", "\n", "In der ersten Phase des Prozesses werden die in einem pandas-Objekt enthaltenen Daten, sei es eine Series, ein DataFrame oder etwas anderes, in Gruppen aufgeteilt, die auf einem oder mehreren Schlüsseln basieren. Die Aufteilung wird auf einer bestimmten Achse eines Objekts durchgeführt. Ein DataFrame kann zum Beispiel nach seinen Zeilen oder seinen Spalten gruppiert werden. Danach wird auf jede Gruppe eine Funktion angewendet, die einen neuen Wert erzeugt. Schließlich werden die Ergebnisse all dieser Funktionsanwendungen in einem Ergebnisobjekt kombiniert. Die Form des Ergebnisobjekts hängt normalerweise davon ab, was mit den Daten gemacht wird.\n", "\n", "Jeder Gruppierungsschlüssel kann viele Formen annehmen, und die Schlüssel müssen nicht alle vom gleichen Typ sein:\n", "* Eine Liste oder ein Array von Werten, die die gleiche Länge wie die zu gruppierende Achse haben\n", "* Ein Wert, der einen Spaltennamen in einem DataFrame angibt\n", "* Ein Dict oder eine Series, die eine Entsprechung zwischen den Werten auf der Achse, die gruppiert wird, und den Gruppennamen darstellt\n", "* Eine Funktion, die auf dem Achsenindex oder den einzelnen Beschriftungen im Index aufgerufen wird\n", "\n", "
\n", "\n", "**Hinweis:**\n", "\n", "Die drei letztgenannten Methoden sind Abkürzungen, um ein Array von Werten zu erzeugen, die für die Aufteilung des Objekts verwendet werden.\n", "
\n", "\n", "Macht euch keine Sorgen, wenn dies alles abstrakt erscheint. Im Laufe dieses Kapitels werde ich viele Beispiele für all diese Methoden geben. Für den Anfang hier ein kleiner Tabellendatensatz als DataFrame:" ] }, { "cell_type": "code", "execution_count": 1, "id": "26872ed8", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "id": "920e4738", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleLanguage2021-122022-012022-02
0Jupyter Tutorialde19651.030134.033295.0
1Jupyter Tutorialen4722.03497.04009.0
2PyViz Tutorialde2573.04873.03930.0
3NoneNoneNaNNaNNaN
4Python Basicsde525.0427.0276.0
5Python Basicsen157.085.0226.0
\n", "
" ], "text/plain": [ " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "3 None None NaN NaN NaN\n", "4 Python Basics de 525.0 427.0 276.0\n", "5 Python Basics en 157.0 85.0 226.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"Title\": [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " None,\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " \"Language\": [\"de\", \"en\", \"de\", None, \"de\", \"en\"],\n", " \"2021-12\": [19651, 4722, 2573, None, 525, 157],\n", " \"2022-01\": [30134, 3497, 4873, None, 427, 85],\n", " \"2022-02\": [33295, 4009, 3930, None, 276, 226],\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "647e625c", "metadata": {}, "source": [ "Angenommen, ihr möchtet den Summe der Spalte `02/2022` anhand der Beschriftungen von `Title` berechnen. Es gibt mehrere Möglichkeiten, dies zu tun. Eine davon ist der Zugriff auf `02/2022` und der Aufruf von `groupby` mit der Spalte (einer Series) in `Title`:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d2f62153", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df[\"2022-02\"].groupby(df[\"Title\"])\n", "\n", "grouped" ] }, { "cell_type": "markdown", "id": "e0f2e825", "metadata": {}, "source": [ "Diese `grouped`-Variable ist nun ein spezielles `SeriesGroupBy`-Objekt. Es hat noch nichts berechnet, außer einigen Zwischendaten über den Gruppenschlüssel `df['Title']`. Die Idee ist, dass dieses Objekt über alle Informationen verfügt, die benötigt werden, um eine Operation auf jede der Gruppen anzuwenden. Zur Berechnung der Gruppenmittelwerte können wir beispielsweise die Methode `sum` des `GroupBy`-Objekts aufrufen:" ] }, { "cell_type": "code", "execution_count": 4, "id": "2939a291", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title\n", "Jupyter Tutorial 37304.0\n", "PyViz Tutorial 3930.0\n", "Python Basics 502.0\n", "Name: 2022-02, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.sum()" ] }, { "cell_type": "markdown", "id": "19bcb625", "metadata": {}, "source": [ "Später werde ich mehr darüber erklären, was passiert, wenn ihr `.sum()` aufruft. Wichtig ist hier, dass die Daten (eine Reihe) durch Aufteilung der Daten auf den Gruppenschlüssel aggregiert wurden, wodurch eine neue Reihe entsteht, die nun durch die eindeutigen Werte in der Spalte `Title` indiziert ist. Der resultierende Index ist `Title`, weil `groupby(df['Title']` dies tat." ] }, { "cell_type": "markdown", "id": "424fdc8a", "metadata": {}, "source": [ "Hätten wir stattdessen mehrere Arrays als Liste übergeben, würden wir etwas anderes erhalten:" ] }, { "cell_type": "code", "execution_count": 5, "id": "4ceeb0cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language Title \n", "de Jupyter Tutorial 19651.0\n", " PyViz Tutorial 2573.0\n", " Python Basics 525.0\n", "en Jupyter Tutorial 4722.0\n", " Python Basics 157.0\n", "Name: 2021-12, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sums = df[\"2021-12\"].groupby([df[\"Language\"], df[\"Title\"]]).sum()\n", "\n", "sums" ] }, { "cell_type": "markdown", "id": "969abdd0", "metadata": {}, "source": [ "Hier haben wir die Daten anhand von zwei Schlüsseln gruppiert, und die resultierende Reihe hat nun einen hierarchischen Index, der aus den beobachteten eindeutigen Schlüsselpaaren besteht:" ] }, { "cell_type": "code", "execution_count": 6, "id": "c72eb2c6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleJupyter TutorialPyViz TutorialPython Basics
Language
de19651.02573.0525.0
en4722.0NaN157.0
\n", "
" ], "text/plain": [ "Title Jupyter Tutorial PyViz Tutorial Python Basics\n", "Language \n", "de 19651.0 2573.0 525.0\n", "en 4722.0 NaN 157.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sums.unstack()" ] }, { "cell_type": "markdown", "id": "59025fe7", "metadata": {}, "source": [ "Häufig befinden sich die Gruppierungsinformationen in demselben DataFrame wie die Daten, die ihr bearbeiten möchtet. In diesem Fall könnt ihr Spaltennamen (egal ob es sich um Zeichenketten, Zahlen oder andere Python-Objekte handelt) als Gruppenschlüssel übergeben:" ] }, { "cell_type": "code", "execution_count": 7, "id": "4d7fab97", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Language2021-122022-012022-02
Title
Jupyter Tutorialdeen24373.033631.037304.0
PyViz Tutorialde2573.04873.03930.0
Python Basicsdeen682.0512.0502.0
\n", "
" ], "text/plain": [ " Language 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial deen 24373.0 33631.0 37304.0\n", "PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics deen 682.0 512.0 502.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Title\").sum()" ] }, { "cell_type": "markdown", "id": "ae165388", "metadata": {}, "source": [ "Hierbei fällt auf, dass das Ergebnis keine Spalte `Language` enthält. Da es sich bei `df['Language']` nicht um numerische Daten handelt, stört sie im Tabellenlayout und wird daher automatisch aus dem Ergebnis ausgeschlossen. Standardmäßig werden alle numerischen Spalten aggregiert." ] }, { "cell_type": "code", "execution_count": 8, "id": "1a4f155e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde19651.030134.033295.0
en4722.03497.04009.0
PyViz Tutorialde2573.04873.03930.0
Python Basicsde525.0427.0276.0
en157.085.0226.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 19651.0 30134.0 33295.0\n", " en 4722.0 3497.0 4009.0\n", "PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics de 525.0 427.0 276.0\n", " en 157.0 85.0 226.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\", \"Language\"]).sum()" ] }, { "cell_type": "markdown", "id": "dd25c6d2", "metadata": {}, "source": [ "Unabhängig vom Ziel der Verwendung von `groupby` ist eine allgemein nützliche `groupby`-Methode `size`, die eine Serie mit den Gruppengrößen zurückgibt:" ] }, { "cell_type": "code", "execution_count": 9, "id": "8ec18d81", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 3\n", "en 2\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Language\"]).size()" ] }, { "cell_type": "markdown", "id": "2b2b057e", "metadata": {}, "source": [ "
\n", "\n", "**Hinweis:**\n", "\n", "Alle fehlenden Werte in einem Gruppenschlüssel werden standardmäßig aus dem Ergebnis ausgeschlossen. Dieses Verhalten kann deaktiviert werden, indem `dropna=False` an `groupby` übergeben wird.\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "id": "fcbd7160", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 3\n", "en 2\n", "NaN 1\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Language\", dropna=False).size()" ] }, { "cell_type": "code", "execution_count": 11, "id": "06598c88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title Language\n", "Jupyter Tutorial de 1\n", " en 1\n", "PyViz Tutorial de 1\n", "Python Basics de 1\n", " en 1\n", "NaN NaN 1\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\", \"Language\"], dropna=False).size()" ] }, { "cell_type": "markdown", "id": "7e3d1d0b", "metadata": {}, "source": [ "## Iteration über Gruppen\n", "\n", "Das von `groupby` zurückgegebene Objekt unterstützt Iteration und erzeugt eine Folge von 2-Tupeln, die den Gruppennamen zusammen mit dem Datenpaket enthalten. Betrachten wir das Folgende:" ] }, { "cell_type": "code", "execution_count": 12, "id": "3d297d82", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jupyter Tutorial\n", " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "PyViz Tutorial\n", " Title Language 2021-12 2022-01 2022-02\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics\n", " Title Language 2021-12 2022-01 2022-02\n", "4 Python Basics de 525.0 427.0 276.0\n", "5 Python Basics en 157.0 85.0 226.0\n" ] } ], "source": [ "for name, group in df.groupby(\"Title\"):\n", " print(name)\n", " print(group)" ] }, { "cell_type": "markdown", "id": "aaa800a6", "metadata": {}, "source": [ "Bei mehreren Schlüsseln ist das erste Element des Tupels ein Tupel von Schlüsselwerten:" ] }, { "cell_type": "code", "execution_count": 13, "id": "33bb6b4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Jupyter Tutorial', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "('Jupyter Tutorial', 'en')\n", " Title Language 2021-12 2022-01 2022-02\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "('PyViz Tutorial', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "('Python Basics', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "4 Python Basics de 525.0 427.0 276.0\n", "('Python Basics', 'en')\n", " Title Language 2021-12 2022-01 2022-02\n", "5 Python Basics en 157.0 85.0 226.0\n" ] } ], "source": [ "for (i1, i2), group in df.groupby([\"Title\", \"Language\"]):\n", " print((i1, i2))\n", " print(group)" ] }, { "cell_type": "markdown", "id": "aa7ac14c", "metadata": {}, "source": [ "Als nächstes wollen wir ein `dict` der Daten als Einzeiler ausgeben:" ] }, { "cell_type": "code", "execution_count": 14, "id": "7080879c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02\n", " 0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", " 1 Jupyter Tutorial en 4722.0 3497.0 4009.0,\n", " 'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02\n", " 2 PyViz Tutorial de 2573.0 4873.0 3930.0,\n", " 'Python Basics': Title Language 2021-12 2022-01 2022-02\n", " 4 Python Basics de 525.0 427.0 276.0\n", " 5 Python Basics en 157.0 85.0 226.0}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "books = dict(list(df.groupby(\"Title\")))\n", "\n", "books" ] }, { "cell_type": "markdown", "id": "b1614d9d", "metadata": {}, "source": [ "Standardmäßig gruppiert `groupby` auf `axis=0`, aber ihr könnt auch auf jeder der anderen Achsen gruppieren. Zum Beispiel könnten wir die Spalten unseres Beispiels `df` hier nach `dtype` gruppieren wie folgt:" ] }, { "cell_type": "code", "execution_count": 15, "id": "d5974c77", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title object\n", "Language object\n", "2021-12 float64\n", "2022-01 float64\n", "2022-02 float64\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 16, "id": "01dc9b39", "metadata": {}, "outputs": [], "source": [ "grouped = df.T.groupby(df.dtypes)" ] }, { "cell_type": "code", "execution_count": 17, "id": "bd81bddb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "float64\n", " 0 1 2 3 4 5\n", "2021-12 19651.0 4722.0 2573.0 NaN 525.0 157.0\n", "2022-01 30134.0 3497.0 4873.0 NaN 427.0 85.0\n", "2022-02 33295.0 4009.0 3930.0 NaN 276.0 226.0\n", "object\n", " 0 1 2 3 \\\n", "Title Jupyter Tutorial Jupyter Tutorial PyViz Tutorial None \n", "Language de en de None \n", "\n", " 4 5 \n", "Title Python Basics Python Basics \n", "Language de en \n" ] } ], "source": [ "for dtype, group in grouped:\n", " print(dtype)\n", " print(group)" ] }, { "cell_type": "markdown", "id": "2baec4a6", "metadata": {}, "source": [ "## Auswählen einer Spalte oder Untergruppe von Spalten\n", "\n", "Die Indizierung eines `GroupBy`-Objekts, das aus einem DataFrame mit einem Spaltennamen oder einem Array von Spaltennamen erstellt wurde, hat den Effekt einer Spaltenunterteilung für die Aggregation. Dies bedeutet, dass:" ] }, { "cell_type": "code", "execution_count": 18, "id": "e5e8fd94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Title\")[\"2021-12\"]\n", "df.groupby(\"Title\")[[\"2022-01\"]]" ] }, { "cell_type": "markdown", "id": "ce99850b", "metadata": {}, "source": [ "sind vereinfachte Schreibweisen für:" ] }, { "cell_type": "code", "execution_count": 19, "id": "96205ee8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"2021-12\"].groupby(df[\"Title\"])\n", "df[[\"2022-01\"]].groupby(df[\"Title\"])" ] }, { "cell_type": "markdown", "id": "6865f7b5", "metadata": {}, "source": [ "Insbesondere bei großen Datensätzen kann es wünschenswert sein, nur einige Spalten zu aggregieren. Um zum Beispiel im vorhergehenden Datensatz die Summe nur für die Spalte `01/2022` zu berechnen und das Ergebnis als DataFrame zu erhalten, könnten wir schreiben:" ] }, { "cell_type": "code", "execution_count": 20, "id": "46d4b885", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2022-01
TitleLanguage
Jupyter Tutorialde30134.0
en3497.0
PyViz Tutorialde4873.0
Python Basicsde427.0
en85.0
\n", "
" ], "text/plain": [ " 2022-01\n", "Title Language \n", "Jupyter Tutorial de 30134.0\n", " en 3497.0\n", "PyViz Tutorial de 4873.0\n", "Python Basics de 427.0\n", " en 85.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\", \"Language\"])[[\"2022-01\"]].sum()" ] }, { "cell_type": "markdown", "id": "52345b61", "metadata": {}, "source": [ "Das von dieser Indizierungsoperation zurückgegebene Objekt ist ein gruppierter DataFrame, wenn eine Liste oder ein Array übergeben wird, oder eine gruppierte Serie, wenn nur ein einzelner Spaltenname als Skalar übergeben wird:" ] }, { "cell_type": "code", "execution_count": 21, "id": "ca221529", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series_grouped = df.groupby([\"Title\", \"Language\"])[\"2022-01\"]\n", "\n", "series_grouped" ] }, { "cell_type": "code", "execution_count": 22, "id": "9167f78a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title Language\n", "Jupyter Tutorial de 30134.0\n", " en 3497.0\n", "PyViz Tutorial de 4873.0\n", "Python Basics de 427.0\n", " en 85.0\n", "Name: 2022-01, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series_grouped.sum()" ] }, { "cell_type": "markdown", "id": "d3d46fdc", "metadata": {}, "source": [ "## Gruppierung mit `dicts` und `Series`\n", "\n", "Gruppierungsinformationen können auch in einer anderen Form als einem Array vorliegen:" ] }, { "cell_type": "code", "execution_count": 23, "id": "bc2df07e", "metadata": {}, "outputs": [], "source": [ "df.iloc[2:3, [2, 3]] = np.nan" ] }, { "cell_type": "markdown", "id": "ec42307f", "metadata": {}, "source": [ "Angenommen, ich habe eine Gruppenkorrespondenz für die Spalten und möchte die Spalten nach Gruppen zusammenfassen:" ] }, { "cell_type": "code", "execution_count": 24, "id": "fff0698e", "metadata": {}, "outputs": [], "source": [ "mapping = {\"2021-12\": \"Dec 2021\", \"2022-01\": \"Jan 2022\", \"2022-02\": \"Feb 2022\"}" ] }, { "cell_type": "markdown", "id": "29f128ac", "metadata": {}, "source": [ "Nun könnte aus diesem `dict` ein Array konstruiert werden, um es an `groupby` zu übergeben, aber stattdessen können wir auch einfach das `dict` übergeben:" ] }, { "cell_type": "code", "execution_count": 25, "id": "a35f838f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012345
Dec 202119651.04722.000525.0157.0
Feb 202233295.04009.03930.00276.0226.0
Jan 202230134.03497.000427.085.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5\n", "Dec 2021 19651.0 4722.0 0 0 525.0 157.0\n", "Feb 2022 33295.0 4009.0 3930.0 0 276.0 226.0\n", "Jan 2022 30134.0 3497.0 0 0 427.0 85.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "by_column = df.T.groupby(mapping)\n", "\n", "by_column.sum()" ] }, { "cell_type": "markdown", "id": "acc8d67c", "metadata": {}, "source": [ "Die gleiche Funktionalität gilt für `Series`, die als eine Abbildung mit fester Größe betrachtet werden können:" ] }, { "cell_type": "code", "execution_count": 26, "id": "adad743a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-12 Dec 2021\n", "2022-01 Jan 2022\n", "2022-02 Feb 2022\n", "dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_series = pd.Series(mapping)\n", "\n", "map_series" ] }, { "cell_type": "code", "execution_count": 27, "id": "5be77438", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012345
Dec 2021110011
Feb 2022111011
Jan 2022110011
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5\n", "Dec 2021 1 1 0 0 1 1\n", "Feb 2022 1 1 1 0 1 1\n", "Jan 2022 1 1 0 0 1 1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.T.groupby(map_series).count()" ] }, { "cell_type": "markdown", "id": "ee329ff2", "metadata": {}, "source": [ "## Gruppieren mit Funktionen\n", "\n", "Die Verwendung von Python-Funktionen ist im Vergleich zu einem `Dict` oder einer `Series` eine allgemeinere Methode zur Definition einer Gruppenzuordnung. Jede Funktion, die als Gruppenschlüssel übergeben wird, wird einmal pro Indexwert aufgerufen, wobei die Rückgabewerte als Gruppennamen verwendet werden. Betrachtet konkret den Beispiel-DataFrame aus dem vorherigen Abschnitt, das die Titel als Indexwerte enthält. Angenommen, Wenn ihr nach der Länge der Namen gruppieren wollt, könnt ihr zwar ein Array mit den Längen der Strings berechnen, aber es ist einfacher, die Funktion `len` zu übergeben:" ] }, { "cell_type": "code", "execution_count": 28, "id": "48553e76", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", " [\n", " [19651, 30134, 33295],\n", " [4722, 3497, 4009],\n", " [2573, 4873, 3930],\n", " [525, 427, 276],\n", " [157, 85, 226],\n", " ],\n", " index=[\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " columns=[\"2021-12\", \"2022-01\", \"2022-02\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 29, "id": "737dfa93", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
13222
14111
16222
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "13 2 2 2\n", "14 1 1 1\n", "16 2 2 2" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(len).count()" ] }, { "cell_type": "markdown", "id": "fc755a53", "metadata": {}, "source": [ "Das Mischen von Funktionen mit Arrays, Dicts oder Series ist kein Problem, da alles intern in Arrays umgewandelt wird:" ] }, { "cell_type": "code", "execution_count": 30, "id": "5976d024", "metadata": {}, "outputs": [], "source": [ "languages = [\"de\", \"en\", \"de\", \"de\", \"en\"]" ] }, { "cell_type": "code", "execution_count": 31, "id": "8a49da12", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
13de111
en111
14de111
16de111
en111
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "13 de 1 1 1\n", " en 1 1 1\n", "14 de 1 1 1\n", "16 de 1 1 1\n", " en 1 1 1" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([len, languages]).count()" ] }, { "cell_type": "markdown", "id": "9e2622a9", "metadata": {}, "source": [ "## Gruppierung nach Indexebenen\n", "\n", "Eine letzte praktische Funktion für hierarchisch indizierte Datensätze ist die Möglichkeit der Aggregation anhand einer der Indexebenen einer Achse. Schauen wir uns ein Beispiel an:" ] }, { "cell_type": "code", "execution_count": 32, "id": "ce3a50f6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Month2021-122022-012022-02
Versionlateststablelateststablelateststable
Jupyter Tutorialde19651.00.030134.00.033295.00.0
en4722.01825.03497.02576.04009.03707.0
PyViz Tutorialde2573.00.04873.00.03930.00.0
NaNNaNNaNNaNNaNNaNNaNNaN
Python Basicsde525.00.0427.00.0276.00.0
en157.00.085.00.0226.00.0
\n", "
" ], "text/plain": [ "Month 2021-12 2022-01 2022-02 \n", "Version latest stable latest stable latest stable\n", "Jupyter Tutorial de 19651.0 0.0 30134.0 0.0 33295.0 0.0\n", " en 4722.0 1825.0 3497.0 2576.0 4009.0 3707.0\n", "PyViz Tutorial de 2573.0 0.0 4873.0 0.0 3930.0 0.0\n", "NaN NaN NaN NaN NaN NaN NaN NaN\n", "Python Basics de 525.0 0.0 427.0 0.0 276.0 0.0\n", " en 157.0 0.0 85.0 0.0 226.0 0.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "version_hits = [\n", " [19651, 0, 30134, 0, 33295, 0],\n", " [4722, 1825, 3497, 2576, 4009, 3707],\n", " [2573, 0, 4873, 0, 3930, 0],\n", " [None, None, None, None, None, None],\n", " [525, 0, 427, 0, 276, 0],\n", " [157, 0, 85, 0, 226, 0],\n", "]\n", "df = pd.DataFrame(\n", " version_hits,\n", " index=[\n", " [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " None,\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " [\"de\", \"en\", \"de\", None, \"de\", \"en\"],\n", " ],\n", " columns=[\n", " [\"2021-12\", \"2021-12\", \"2022-01\", \"2022-01\", \"2022-02\", \"2022-02\"],\n", " [\"latest\", \"stable\", \"latest\", \"stable\", \"latest\", \"stable\"],\n", " ],\n", ")\n", "df.columns.names = [\"Month\", \"Version\"]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 33, "id": "1ab217b9", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Jupyter TutorialPyViz TutorialNaNPython Basics
deendeNaNdeen
Month
2021-1219651.06547.02573.00.0525.0157.0
2022-0130134.06073.04873.00.0427.085.0
2022-0233295.07716.03930.00.0276.0226.0
\n", "
" ], "text/plain": [ " Jupyter Tutorial PyViz Tutorial NaN Python Basics \n", " de en de NaN de en\n", "Month \n", "2021-12 19651.0 6547.0 2573.0 0.0 525.0 157.0\n", "2022-01 30134.0 6073.0 4873.0 0.0 427.0 85.0\n", "2022-02 33295.0 7716.0 3930.0 0.0 276.0 226.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.T.groupby(level=\"Month\").sum()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }