{ "cells": [ { "cell_type": "markdown", "id": "28f76326", "metadata": {}, "source": [ "# Apply\n", "\n", "The most general-purpose `GroupBy` method is `apply`. It splits the object being manipulated into pieces, calls the passed function on each piece, and then attempts to concatenate the pieces." ] }, { "cell_type": "markdown", "id": "bd02d5f5", "metadata": {}, "source": [ "Suppose we want to select the five largest `hit` values by group. To do this, we first write a function that selects the rows with the largest values in a particular column:" ] }, { "cell_type": "code", "execution_count": 1, "id": "a5bbe4fa", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.103478Z", "iopub.status.busy": "2026-05-21T16:34:54.103283Z", "iopub.status.idle": "2026-05-21T16:34:54.315510Z", "shell.execute_reply": "2026-05-21T16:34:54.315072Z", "shell.execute_reply.started": "2026-05-21T16:34:54.103460Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "1b99d975", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.316044Z", "iopub.status.busy": "2026-05-21T16:34:54.315919Z", "iopub.status.idle": "2026-05-21T16:34:54.323503Z", "shell.execute_reply": "2026-05-21T16:34:54.323199Z", "shell.execute_reply.started": "2026-05-21T16:34:54.316034Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz Tutorialde4873.03930.02573.0
enNaNNaNNaN
Python Basicsde427.0276.0525.0
en95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0\n", " en NaN NaN NaN\n", "Python Basics de 427.0 276.0 525.0\n", " en 95.0 226.0 157.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"2021-12\": [30134, 6073, 4873, None, 427, 95],\n", " \"2022-01\": [33295, 7716, 3930, None, 276, 226],\n", " \"2022-02\": [19651, 6547, 2573, None, 525, 157],\n", " },\n", " index=[\n", " [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " [\"de\", \"en\", \"de\", \"en\", \"de\", \"en\"],\n", " ],\n", ")\n", "df.index.names = [\"Title\", \"Language\"]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "id": "da3c09a8", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.323850Z", "iopub.status.busy": "2026-05-21T16:34:54.323774Z", "iopub.status.idle": "2026-05-21T16:34:54.327983Z", "shell.execute_reply": "2026-05-21T16:34:54.327660Z", "shell.execute_reply.started": "2026-05-21T16:34:54.323844Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz Tutorialde4873.03930.02573.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def top(df, n=5, column=\"2021-12\"):\n", " return df.sort_values(by=column, ascending=False)[:n]\n", "\n", "\n", "top(df, n=3)" ] }, { "cell_type": "markdown", "id": "45248446", "metadata": {}, "source": [ "Wenn wir nun z.B. nach Titeln gruppieren und `apply` mit dieser Funktion aufrufen, erhalten wir Folgendes:" ] }, { "cell_type": "code", "execution_count": 4, "id": "deaf5428", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.328473Z", "iopub.status.busy": "2026-05-21T16:34:54.328382Z", "iopub.status.idle": "2026-05-21T16:34:54.332851Z", "shell.execute_reply": "2026-05-21T16:34:54.332651Z", "shell.execute_reply.started": "2026-05-21T16:34:54.328465Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleTitleLanguage
Jupyter TutorialJupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz TutorialPyViz Tutorialde4873.03930.02573.0
enNaNNaNNaN
Python BasicsPython Basicsde427.0276.0525.0
en95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Title Language \n", "Jupyter Tutorial Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial PyViz Tutorial de 4873.0 3930.0 2573.0\n", " en NaN NaN NaN\n", "Python Basics Python Basics de 427.0 276.0 525.0\n", " en 95.0 226.0 157.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_titles = df.groupby(\"Title\")\n", "\n", "grouped_titles.apply(top)" ] }, { "cell_type": "markdown", "id": "08227af2", "metadata": {}, "source": [ "Was ist hier passiert? Die obere Funktion wird für jede Zeilengruppe des DataFrame aufgerufen, und dann werden die Ergebnisse mit [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) zusammengefügt, wobei die Teile mit den Gruppennamen gekennzeichnet werden. Das Ergebnis hat daher einen hierarchischen Index, dessen innere Ebene Indexwerte aus dem ursprünglichen DataFrame enthält." ] }, { "cell_type": "markdown", "id": "b37bee0a", "metadata": {}, "source": [ "Wenn ihr eine Funktion an `apply` übergebt, die andere Argumente oder Schlüsselwörter benötigt, könnt ihr diese nach der Funktion übergeben:" ] }, { "cell_type": "code", "execution_count": 5, "id": "0c89d8e6", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.333233Z", "iopub.status.busy": "2026-05-21T16:34:54.333176Z", "iopub.status.idle": "2026-05-21T16:34:54.337450Z", "shell.execute_reply": "2026-05-21T16:34:54.337154Z", "shell.execute_reply.started": "2026-05-21T16:34:54.333226Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleTitleLanguage
Jupyter TutorialJupyter Tutorialde30134.033295.019651.0
PyViz TutorialPyViz Tutorialde4873.03930.02573.0
Python BasicsPython Basicsde427.0276.0525.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Title Language \n", "Jupyter Tutorial Jupyter Tutorial de 30134.0 33295.0 19651.0\n", "PyViz Tutorial PyViz Tutorial de 4873.0 3930.0 2573.0\n", "Python Basics Python Basics de 427.0 276.0 525.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_titles.apply(top, n=1)" ] }, { "cell_type": "markdown", "id": "96cc6370", "metadata": {}, "source": [ "Wir haben nun die grundlegende Verwendungsweise von `apply` gesehen. Was innerhalb der übergebenen Funktion geschieht, ist sehr vielseitig und bleibt euch überlassen; sie muss nur ein pandas-Objekt oder einen Einzelwert zurückgeben. Im Folgend werden wir daher hauptsächlich Beispielen zeigen, die euch Anregungen geben können, wie ihr verschiedene Probleme mit `groupby` lösen könnt." ] }, { "cell_type": "markdown", "id": "8db08c56", "metadata": {}, "source": [ "Zunächst vergegenwärtigen wir uns nochmal an `describe`, aufgerufen über dem `GroupBy`-Objekt:" ] }, { "cell_type": "code", "execution_count": 6, "id": "a0ebb337", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.339233Z", "iopub.status.busy": "2026-05-21T16:34:54.339102Z", "iopub.status.idle": "2026-05-21T16:34:54.352732Z", "shell.execute_reply": "2026-05-21T16:34:54.352489Z", "shell.execute_reply.started": "2026-05-21T16:34:54.339222Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
countmeanstdmin25%50%75%maxcountmean...75%maxcountmeanstdmin25%50%75%max
Title
Jupyter Tutorial2.018103.517013.6962626073.012088.2518103.524118.7530134.02.020505.5...26900.2533295.02.013099.09265.9272616547.09823.013099.016375.019651.0
PyViz Tutorial1.04873.0NaN4873.04873.004873.04873.004873.01.03930.0...3930.003930.01.02573.0NaN2573.02573.02573.02573.02573.0
Python Basics2.0261.0234.75945195.0178.00261.0344.00427.02.0251.0...263.50276.02.0341.0260.215295157.0249.0341.0433.0525.0
\n", "

3 rows × 24 columns

\n", "
" ], "text/plain": [ " 2021-12 \\\n", " count mean std min 25% 50% \n", "Title \n", "Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5 \n", "PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0 \n", "Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0 \n", "\n", " 2022-01 ... \\\n", " 75% max count mean ... 75% max \n", "Title ... \n", "Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0 \n", "PyViz Tutorial 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0 \n", "Python Basics 344.00 427.0 2.0 251.0 ... 263.50 276.0 \n", "\n", " 2022-02 \\\n", " count mean std min 25% 50% \n", "Title \n", "Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0 \n", "PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0 \n", "Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0 \n", "\n", " \n", " 75% max \n", "Title \n", "Jupyter Tutorial 16375.0 19651.0 \n", "PyViz Tutorial 2573.0 2573.0 \n", "Python Basics 433.0 525.0 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = grouped_titles.describe()\n", "\n", "result" ] }, { "cell_type": "markdown", "id": "6ccdb4d5", "metadata": {}, "source": [ "Wenn ihr innerhalb von `GroupBy` eine Methode wie `describe` aufruft, ist dies eigentlich nur eine Abkürzung für:" ] }, { "cell_type": "code", "execution_count": 7, "id": "967709f7", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.353165Z", "iopub.status.busy": "2026-05-21T16:34:54.353077Z", "iopub.status.idle": "2026-05-21T16:34:54.364029Z", "shell.execute_reply": "2026-05-21T16:34:54.363422Z", "shell.execute_reply.started": "2026-05-21T16:34:54.353158Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
Title
Jupyter Tutorialcount2.0000002.0000002.000000
mean18103.50000020505.50000013099.000000
std17013.69626218087.0843569265.927261
min6073.0000007716.0000006547.000000
25%12088.25000014110.7500009823.000000
50%18103.50000020505.50000013099.000000
75%24118.75000026900.25000016375.000000
max30134.00000033295.00000019651.000000
PyViz Tutorialcount1.0000001.0000001.000000
mean4873.0000003930.0000002573.000000
stdNaNNaNNaN
min4873.0000003930.0000002573.000000
25%4873.0000003930.0000002573.000000
50%4873.0000003930.0000002573.000000
75%4873.0000003930.0000002573.000000
max4873.0000003930.0000002573.000000
Python Basicscount2.0000002.0000002.000000
mean261.000000251.000000341.000000
std234.75945135.355339260.215295
min95.000000226.000000157.000000
25%178.000000238.500000249.000000
50%261.000000251.000000341.000000
75%344.000000263.500000433.000000
max427.000000276.000000525.000000
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial count 2.000000 2.000000 2.000000\n", " mean 18103.500000 20505.500000 13099.000000\n", " std 17013.696262 18087.084356 9265.927261\n", " min 6073.000000 7716.000000 6547.000000\n", " 25% 12088.250000 14110.750000 9823.000000\n", " 50% 18103.500000 20505.500000 13099.000000\n", " 75% 24118.750000 26900.250000 16375.000000\n", " max 30134.000000 33295.000000 19651.000000\n", "PyViz Tutorial count 1.000000 1.000000 1.000000\n", " mean 4873.000000 3930.000000 2573.000000\n", " std NaN NaN NaN\n", " min 4873.000000 3930.000000 2573.000000\n", " 25% 4873.000000 3930.000000 2573.000000\n", " 50% 4873.000000 3930.000000 2573.000000\n", " 75% 4873.000000 3930.000000 2573.000000\n", " max 4873.000000 3930.000000 2573.000000\n", "Python Basics count 2.000000 2.000000 2.000000\n", " mean 261.000000 251.000000 341.000000\n", " std 234.759451 35.355339 260.215295\n", " min 95.000000 226.000000 157.000000\n", " 25% 178.000000 238.500000 249.000000\n", " 50% 261.000000 251.000000 341.000000\n", " 75% 344.000000 263.500000 433.000000\n", " max 427.000000 276.000000 525.000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def desc(x):\n", " return x.describe()\n", "\n", "\n", "grouped_titles.apply(desc)" ] }, { "cell_type": "markdown", "id": "fd2395bc", "metadata": {}, "source": [ "## Unterdrückung der Gruppenschlüssel\n", "\n", "In den vorangegangenen Beispielen habr ihr gesehen, dass das resultierende Objekt einen hierarchischen Index hat, der aus den Gruppenschlüsseln zusammen mit den Indizes der einzelnen Teile des ursprünglichen Objekts gebildet wird. 
You can disable this by passing `group_keys=False` to `groupby`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "4f3cbc07", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.364522Z", "iopub.status.busy": "2026-05-21T16:34:54.364423Z", "iopub.status.idle": "2026-05-21T16:34:54.368679Z", "shell.execute_reply": "2026-05-21T16:34:54.368408Z", "shell.execute_reply.started": "2026-05-21T16:34:54.364514Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
Jupyter Tutorialen6073.07716.06547.0
Python Basicsen95.0226.0157.0
PyViz TutorialenNaNNaNNaN
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0\n", "Python Basics de 427.0 276.0 525.0\n", "Jupyter Tutorial en 6073.0 7716.0 6547.0\n", "Python Basics en 95.0 226.0 157.0\n", "PyViz Tutorial en NaN NaN NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_lang = df.groupby(\"Language\", group_keys=False)\n", "\n", "grouped_lang.apply(top)" ] }, { "cell_type": "markdown", "id": "92c83367", "metadata": {}, "source": [ "## Quantil- und Bucket-Analyse\n", "\n", "Wie bereits in [Diskretisierung und Gruppierung](discretisation.ipynb) beschrieben, verfügt pandas über einige Werkzeuge, insbesondere `cut` und `qcut`, um Daten in Buckets mit Bins eurer Wahl oder nach Stichprobenquantilen aufzuteilen. Kombiniert man diese Funktionen mit `groupby`, kann man bequem eine Bucket- oder Quantilanalyse für einen Datensatz durchführen. Betrachtet einen einfachen Zufallsdatensatz und eine gleich lange Bucket-Kategorisierung mit `cut`:" ] }, { "cell_type": "code", "execution_count": 9, "id": "796c9936", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.369049Z", "iopub.status.busy": "2026-05-21T16:34:54.368987Z", "iopub.status.idle": "2026-05-21T16:34:54.372834Z", "shell.execute_reply": "2026-05-21T16:34:54.372464Z", "shell.execute_reply.started": "2026-05-21T16:34:54.369043Z" } }, "outputs": [ { "data": { "text/plain": [ "0 (-0.462, 1.407]\n", "1 (-0.462, 1.407]\n", "2 (-0.462, 1.407]\n", "3 (-0.462, 1.407]\n", "4 (-0.462, 1.407]\n", "5 (-2.331, -0.462]\n", "6 (-2.331, -0.462]\n", "7 (-0.462, 1.407]\n", "8 (-2.331, -0.462]\n", "9 (1.407, 3.275]\n", "Name: data1, dtype: category\n", "Categories (4, interval[float64, right]): [(-4.208, -2.331] < (-2.331, -0.462] < (-0.462, 1.407] < (1.407, 3.275]]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = 
np.random.default_rng()\n", "df2 = pd.DataFrame(\n", " {\n", " \"data1\": rng.normal(size=1000),\n", " \"data2\": rng.normal(size=1000),\n", " },\n", ")\n", "\n", "quartiles = pd.cut(df2.data1, 4)\n", "\n", "quartiles[:10]" ] }, { "cell_type": "markdown", "id": "c28aa4f8", "metadata": {}, "source": [ "The categorical `Series` returned by `cut` can be passed directly to `groupby`. We can therefore compute a set of group statistics for the four bins as follows:" ] }, { "cell_type": "code", "execution_count": 10, "id": "f1517742", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.373288Z", "iopub.status.busy": "2026-05-21T16:34:54.373215Z", "iopub.status.idle": "2026-05-21T16:34:54.380545Z", "shell.execute_reply": "2026-05-21T16:34:54.380143Z", "shell.execute_reply.started": "2026-05-21T16:34:54.373282Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_40670/157931318.py:12: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n", " grouped_quart = df2.groupby(quartiles)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxcountmean
data1
(-4.208, -2.331]data1-4.200342-2.36617222-2.697025
data2-1.5056251.04916422-0.019913
(-2.331, -0.462]data1-2.329520-0.466649296-1.064579
data2-2.8865523.427402296-0.029285
(-0.462, 1.407]data1-0.4594331.4062235890.405802
data2-2.8404333.120917589-0.041455
(1.407, 3.275]data11.4113153.275479931.938677
data2-2.1421002.717809930.140287
\n", "
" ], "text/plain": [ " min max count mean\n", "data1 \n", "(-4.208, -2.331] data1 -4.200342 -2.366172 22 -2.697025\n", " data2 -1.505625 1.049164 22 -0.019913\n", "(-2.331, -0.462] data1 -2.329520 -0.466649 296 -1.064579\n", " data2 -2.886552 3.427402 296 -0.029285\n", "(-0.462, 1.407] data1 -0.459433 1.406223 589 0.405802\n", " data2 -2.840433 3.120917 589 -0.041455\n", "(1.407, 3.275] data1 1.411315 3.275479 93 1.938677\n", " data2 -2.142100 2.717809 93 0.140287" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def stats(group):\n", " return pd.DataFrame(\n", " {\n", " \"min\": group.min(),\n", " \"max\": group.max(),\n", " \"count\": group.count(),\n", " \"mean\": group.mean(),\n", " }\n", " )\n", "\n", "\n", "grouped_quart = df2.groupby(quartiles)\n", "\n", "grouped_quart.apply(stats)" ] }, { "cell_type": "markdown", "id": "b006fddf", "metadata": {}, "source": [ "Dies waren Buckets gleicher Länge; um Buckets gleicher Größe auf der Grundlage von Stichprobenquantilen zu berechnen, können wir `qcut` verwenden. Ich übergebe `labels=False`, um nur Quantilzahlen zu erhalten:" ] }, { "cell_type": "code", "execution_count": 11, "id": "70589dd6", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.381068Z", "iopub.status.busy": "2026-05-21T16:34:54.380990Z", "iopub.status.idle": "2026-05-21T16:34:54.388435Z", "shell.execute_reply": "2026-05-21T16:34:54.388115Z", "shell.execute_reply.started": "2026-05-21T16:34:54.381061Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxcountmean
data1
0data1-4.200342-0.681646250-1.342533
data2-2.8865523.427402250-0.010124
1data1-0.6800270.076445250-0.273673
data2-2.5522693.120917250-0.030725
2data10.0836430.7614842500.403456
data2-2.8404332.677849250-0.029197
3data10.7621383.2754792501.392207
data2-2.8157442.725009250-0.011862
\n", "
" ], "text/plain": [ " min max count mean\n", "data1 \n", "0 data1 -4.200342 -0.681646 250 -1.342533\n", " data2 -2.886552 3.427402 250 -0.010124\n", "1 data1 -0.680027 0.076445 250 -0.273673\n", " data2 -2.552269 3.120917 250 -0.030725\n", "2 data1 0.083643 0.761484 250 0.403456\n", " data2 -2.840433 2.677849 250 -0.029197\n", "3 data1 0.762138 3.275479 250 1.392207\n", " data2 -2.815744 2.725009 250 -0.011862" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quartiles_samp = pd.qcut(df2.data1, 4, labels=False)\n", "grouped_quart_samp = df2.groupby(quartiles_samp)\n", "\n", "grouped_quart_samp.apply(stats)" ] }, { "cell_type": "markdown", "id": "b8043d46", "metadata": {}, "source": [ "## Daten mit gruppenspezifischen Werten auffüllen\n", "\n", "Wenn ihr fehlende Daten bereinigt, werdet ihr in einigen Fällen Datenbeobachtungen mit `dropna` ersetzen, aber in anderen Fällen möchtet ihr vielleicht die Nullwerte (`NA`) mit einem festen Wert oder einem aus den Daten abgeleiteten Wert auffüllen. 
`fillna` is the right tool for this; here, for example, I fill the null values with the mean:" ] }, { "cell_type": "code", "execution_count": 12, "id": "417345b1", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.389493Z", "iopub.status.busy": "2026-05-21T16:34:54.389114Z", "iopub.status.idle": "2026-05-21T16:34:54.392105Z", "shell.execute_reply": "2026-05-21T16:34:54.391805Z", "shell.execute_reply.started": "2026-05-21T16:34:54.389478Z" } }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 0.411457\n", "2 0.122992\n", "3 NaN\n", "4 -0.110075\n", "5 -0.494890\n", "6 NaN\n", "7 0.124568\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series(rng.normal(size=8))\n", "s[::3] = np.nan\n", "\n", "s" ] }, { "cell_type": "code", "execution_count": 13, "id": "f2e0d0eb", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.392615Z", "iopub.status.busy": "2026-05-21T16:34:54.392546Z", "iopub.status.idle": "2026-05-21T16:34:54.394891Z", "shell.execute_reply": "2026-05-21T16:34:54.394565Z", "shell.execute_reply.started": "2026-05-21T16:34:54.392609Z" } }, "outputs": [ { "data": { "text/plain": [ "0 0.010811\n", "1 0.411457\n", "2 0.122992\n", "3 0.010811\n", "4 -0.110075\n", "5 -0.494890\n", "6 0.010811\n", "7 0.124568\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.fillna(s.mean())" ] }, { "cell_type": "markdown", "id": "ac4979f5", "metadata": {}, "source": [ "The sample data `df` from the beginning of this notebook covers my tutorials, split into German-language and English-language editions." ] }, { "cell_type": "markdown", "id": "07edf8ca", "metadata": {}, "source": [ "Suppose you want the fill value to vary by group. 
These values can be predefined, and since each group has an internal `name` attribute, you can use it with `apply`:" ] }, { "cell_type": "code", "execution_count": 14, "id": "eb7970fb", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.395437Z", "iopub.status.busy": "2026-05-21T16:34:54.395373Z", "iopub.status.idle": "2026-05-21T16:34:54.399900Z", "shell.execute_reply": "2026-05-21T16:34:54.399698Z", "shell.execute_reply.started": "2026-05-21T16:34:54.395431Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
LanguageTitleLanguage
deJupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
enJupyter Tutorialen6073.07716.06547.0
PyViz Tutorialen3469.03469.03469.0
Python Basicsen95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language Title Language \n", "de Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " PyViz Tutorial de 4873.0 3930.0 2573.0\n", " Python Basics de 427.0 276.0 525.0\n", "en Jupyter Tutorial en 6073.0 7716.0 6547.0\n", " PyViz Tutorial en 3469.0 3469.0 3469.0\n", " Python Basics en 95.0 226.0 157.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_values = {\"de\": 10632, \"en\": 3469}\n", "\n", "\n", "def fill(g):\n", " return g.fillna(fill_values[g.name])\n", "\n", "\n", "df.groupby(\"Language\").apply(fill)" ] }, { "cell_type": "markdown", "id": "0514aa28", "metadata": {}, "source": [ "Ihr könnt auch die Daten gruppieren und `apply` mit einer Funktion zu verwenden, die `fillna` für jedes Datenpaket aufruft:" ] }, { "cell_type": "code", "execution_count": 15, "id": "b32b4aea", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.400213Z", "iopub.status.busy": "2026-05-21T16:34:54.400143Z", "iopub.status.idle": "2026-05-21T16:34:54.405198Z", "shell.execute_reply": "2026-05-21T16:34:54.404905Z", "shell.execute_reply.started": "2026-05-21T16:34:54.400207Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
LanguageTitleLanguage
deJupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
enJupyter Tutorialen6073.07716.06547.0
PyViz Tutorialen3084.03971.03352.0
Python Basicsen95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language Title Language \n", "de Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " PyViz Tutorial de 4873.0 3930.0 2573.0\n", " Python Basics de 427.0 276.0 525.0\n", "en Jupyter Tutorial en 6073.0 7716.0 6547.0\n", " PyViz Tutorial en 3084.0 3971.0 3352.0\n", " Python Basics en 95.0 226.0 157.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def fill_mean(g):\n", " return g.fillna(g.mean())\n", "\n", "\n", "df.groupby(\"Language\").apply(fill_mean)" ] }, { "cell_type": "markdown", "id": "857cdfe4", "metadata": {}, "source": [ "## Gruppierter gewichteter Durchschnitt\n", "\n", "Da Operationen zwischen Spalten in einem `DataFrame` oder zwei `Series` möglich sind, können wir z.B. den gruppengewichteten Durchschnitt berechnen:" ] }, { "cell_type": "code", "execution_count": 16, "id": "b925bb61", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.405681Z", "iopub.status.busy": "2026-05-21T16:34:54.405612Z", "iopub.status.idle": "2026-05-21T16:34:54.409227Z", "shell.execute_reply": "2026-05-21T16:34:54.408936Z", "shell.execute_reply.started": "2026-05-21T16:34:54.405675Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
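<table><thead><tr><th></th><th>category</th><th>data</th><th>weights</th></tr></thead><tbody>
<tr><th>0</th><td>de</td><td>81741</td><td>0.105997</td></tr>
<tr><th>1</th><td>de</td><td>25669</td><td>0.509308</td></tr>
<tr><th>2</th><td>de</td><td>13488</td><td>0.283457</td></tr>
<tr><th>3</th><td>de</td><td>32126</td><td>0.587351</td></tr>
<tr><th>4</th><td>en</td><td>41678</td><td>0.284316</td></tr>
<tr><th>5</th><td>en</td><td>92022</td><td>0.661866</td></tr>
<tr><th>6</th><td>en</td><td>74278</td><td>0.869102</td></tr>
<tr><th>7</th><td>en</td><td>43758</td><td>0.871160</td></tr></tbody></table>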
\n", "
" ], "text/plain": [ " category data weights\n", "0 de 81741 0.105997\n", "1 de 25669 0.509308\n", "2 de 13488 0.283457\n", "3 de 32126 0.587351\n", "4 en 41678 0.284316\n", "5 en 92022 0.661866\n", "6 en 74278 0.869102\n", "7 en 43758 0.871160" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.default_rng()\n", "df3 = pd.DataFrame(\n", " {\n", " \"category\": [\"de\", \"de\", \"de\", \"de\", \"en\", \"en\", \"en\", \"en\"],\n", " \"data\": rng.integers(100000, size=8),\n", " \"weights\": rng.random(8),\n", " },\n", ")\n", "\n", "df3" ] }, { "cell_type": "markdown", "id": "bfd945a2", "metadata": {}, "source": [ "The weighted average per category is then:" ] }, { "cell_type": "code", "execution_count": 17, "id": "de364549", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.409930Z", "iopub.status.busy": "2026-05-21T16:34:54.409675Z", "iopub.status.idle": "2026-05-21T16:34:54.413010Z", "shell.execute_reply": "2026-05-21T16:34:54.412791Z", "shell.execute_reply.started": "2026-05-21T16:34:54.409918Z" } }, "outputs": [ { "data": { "text/plain": [ "category\n", "de 29896.945738\n", "en 65302.434726\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_cat = df3.groupby(\"category\")\n", "\n", "\n", "def get_wavg(g):\n", " return np.average(g[\"data\"], weights=g[\"weights\"])\n", "\n", "\n", "grouped_cat.apply(get_wavg, include_groups=False)" ] }, { "cell_type": "markdown", "id": "485bde5b", "metadata": {}, "source": [ "## Correlation" ] }, { "cell_type": "markdown", "id": "ffaf9697", "metadata": {}, "source": [ "An interesting task is to compute a `DataFrame` of percentage changes and to examine how these changes correlate within each group."
] }, { "cell_type": "markdown", "id": "b2018813", "metadata": {}, "source": [ "To this end, we first create a function that computes the pairwise correlation of every column with the `2021-12` column:" ] }, { "cell_type": "code", "execution_count": 18, "id": "2731ed0c", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.413372Z", "iopub.status.busy": "2026-05-21T16:34:54.413308Z", "iopub.status.idle": "2026-05-21T16:34:54.415225Z", "shell.execute_reply": "2026-05-21T16:34:54.414913Z", "shell.execute_reply.started": "2026-05-21T16:34:54.413366Z" } }, "outputs": [], "source": [ "def corr(x):\n", " return x.corrwith(x[\"2021-12\"])" ] }, { "cell_type": "markdown", "id": "c9d312fa", "metadata": {}, "source": [ "Next, we compute the percentage change from one row to the next:" ] }, { "cell_type": "code", "execution_count": 19, "id": "33a3d392", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.415691Z", "iopub.status.busy": "2026-05-21T16:34:54.415569Z", "iopub.status.idle": "2026-05-21T16:34:54.419892Z", "shell.execute_reply": "2026-05-21T16:34:54.419571Z", "shell.execute_reply.started": "2026-05-21T16:34:54.415681Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_40670/3358811060.py:1: FutureWarning: The default fill_method='pad' in DataFrame.pct_change is deprecated and will be removed in a future version. Either fill in any non-leading NA values prior to calling pct_change or specify 'fill_method=None' to not fill NA values.\n", " pcts = df.pct_change().dropna()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
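<table><thead><tr><th></th><th></th><th>2021-12</th><th>2022-01</th><th>2022-02</th></tr>
<tr><th>Title</th><th>Language</th><th></th><th></th><th></th></tr></thead><tbody>
<tr><th>Jupyter Tutorial</th><th>en</th><td>-0.798467</td><td>-0.768253</td><td>-0.666836</td></tr>
<tr><th>PyViz Tutorial</th><th>de</th><td>-0.197596</td><td>-0.490669</td><td>-0.606996</td></tr>
<tr><th></th><th>en</th><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>
<tr><th>Python Basics</th><th>de</th><td>-0.912374</td><td>-0.929771</td><td>-0.795958</td></tr>
<tr><th></th><th>en</th><td>-0.777518</td><td>-0.181159</td><td>-0.700952</td></tr></tbody></table>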
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial en -0.798467 -0.768253 -0.666836\n", "PyViz Tutorial de -0.197596 -0.490669 -0.606996\n", " en 0.000000 0.000000 0.000000\n", "Python Basics de -0.912374 -0.929771 -0.795958\n", " en -0.777518 -0.181159 -0.700952" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pcts = df.pct_change().dropna()\n", "\n", "pcts" ] }, { "cell_type": "markdown", "id": "7ff5fdec", "metadata": {}, "source": [ "Finally, we group these percentage changes by the `Language` index level and apply `corr` to each group:" ] }, { "cell_type": "code", "execution_count": 20, "id": "8a566a2a", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.420311Z", "iopub.status.busy": "2026-05-21T16:34:54.420232Z", "iopub.status.idle": "2026-05-21T16:34:54.424279Z", "shell.execute_reply": "2026-05-21T16:34:54.423989Z", "shell.execute_reply.started": "2026-05-21T16:34:54.420304Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
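<table><thead><tr><th></th><th>2021-12</th><th>2022-01</th><th>2022-02</th></tr>
<tr><th>Language</th><th></th><th></th><th></th></tr></thead><tbody>
<tr><th>de</th><td>1.0</td><td>1.000000</td><td>1.00000</td></tr>
<tr><th>en</th><td>1.0</td><td>0.699088</td><td>0.99781</td></tr></tbody></table>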
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language \n", "de 1.0 1.000000 1.00000\n", "en 1.0 0.699088 0.99781" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "by_language = pcts.groupby(\"Language\")\n", "\n", "by_language.apply(corr)" ] }, { "cell_type": "code", "execution_count": 21, "id": "96f1a586", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.427395Z", "iopub.status.busy": "2026-05-21T16:34:54.427298Z", "iopub.status.idle": "2026-05-21T16:34:54.430130Z", "shell.execute_reply": "2026-05-21T16:34:54.429773Z", "shell.execute_reply.started": "2026-05-21T16:34:54.427389Z" } }, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 1.000000\n", "en 0.699088\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "by_language.apply(lambda g: g[\"2021-12\"].corr(g[\"2022-01\"]))" ] }, { "cell_type": "markdown", "id": "9800b02b", "metadata": {}, "source": [ "## Performance problems with `apply`\n", "\n", "Because `apply` typically acts on each individual value of a `Series` (or, with `GroupBy.apply`, on each group), the function is called once per value or group. With thousands of values, the function is called thousands of times. Unless you use NumPy functions, this bypasses the fast vectorised operations of pandas and falls back on slow Python. For example, we previously grouped the data by title and then called our `top` function with `apply`. 
Let us time this:" ] }, { "cell_type": "code", "execution_count": 22, "id": "b6815e84", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:54.431088Z", "iopub.status.busy": "2026-05-21T16:34:54.430983Z", "iopub.status.idle": "2026-05-21T16:34:57.721745Z", "shell.execute_reply": "2026-05-21T16:34:57.721445Z", "shell.execute_reply.started": "2026-05-21T16:34:54.431081Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "399 μs ± 23.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "grouped_titles.apply(top)" ] }, { "cell_type": "markdown", "id": "1035c064", "metadata": {}, "source": [ "We can avoid `apply` altogether by passing the `DataFrame` directly to our `top` function; note that this ranks all rows together rather than within each title group:" ] }, { "cell_type": "code", "execution_count": 23, "id": "111d1777", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:34:57.722186Z", "iopub.status.busy": "2026-05-21T16:34:57.722110Z", "iopub.status.idle": "2026-05-21T16:35:00.975265Z", "shell.execute_reply": "2026-05-21T16:35:00.974949Z", "shell.execute_reply.started": "2026-05-21T16:34:57.722178Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "39.9 μs ± 785 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "top(df)" ] }, { "cell_type": "markdown", "id": "67aeb04c", "metadata": {}, "source": [ "This call is about 10 times faster (39.9 μs versus 399 μs)." ] }, { "cell_type": "markdown", "id": "81fe0de3", "metadata": {}, "source": [ "## Optimising `apply` with Cython\n", "\n", "An alternative to `apply` is not always this easy to find. However, numerical operations such as our `top` function can be sped up with [Cython](https://cython.org/). 
To use Cython in Jupyter, we load the following [IPython magic](../ipython/magics.ipynb):" ] }, { "cell_type": "code", "execution_count": 24, "id": "c8b32bd2", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:35:00.975684Z", "iopub.status.busy": "2026-05-21T16:35:00.975605Z", "iopub.status.idle": "2026-05-21T16:35:01.219304Z", "shell.execute_reply": "2026-05-21T16:35:01.218908Z", "shell.execute_reply.started": "2026-05-21T16:35:00.975676Z" } }, "outputs": [], "source": [ "%load_ext Cython" ] }, { "cell_type": "markdown", "id": "662eb122", "metadata": {}, "source": [ "Then we can define our `top` function with Cython:" ] }, { "cell_type": "code", "execution_count": 25, "id": "4c016e1a", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:35:01.219944Z", "iopub.status.busy": "2026-05-21T16:35:01.219874Z", "iopub.status.idle": "2026-05-21T16:35:01.277139Z", "shell.execute_reply": "2026-05-21T16:35:01.276878Z", "shell.execute_reply.started": "2026-05-21T16:35:01.219938Z" } }, "outputs": [], "source": [ "%%cython\n", "def top_cy(df, n=5, column=\"2021-12\"):\n", " return df.sort_values(by=column, ascending=False)[:n]" ] }, { "cell_type": "code", "execution_count": 26, "id": "ea728ba7", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T16:35:01.277568Z", "iopub.status.busy": "2026-05-21T16:35:01.277490Z", "iopub.status.idle": "2026-05-21T16:35:04.554226Z", "shell.execute_reply": "2026-05-21T16:35:04.553950Z", "shell.execute_reply.started": "2026-05-21T16:35:01.277561Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "399 μs ± 2.95 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "grouped_titles.apply(top_cy)" ] }, { "cell_type": "markdown", "id": "57a39279", "metadata": {}, "source": [ "This has not gained us much: at 399 μs per loop, the run time is unchanged. 
A further optimisation would be to declare the types in the Cython code with `cpdef`. However, we would have to restructure our function for this, because a `DataFrame` could then no longer be passed in." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }