{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Verwalten fehlender Daten mit pandas\n", "\n", "Fehlende Daten treten häufig bei Datenanalysen auf. pandas vereinfacht die Arbeit mit fehlenden Daten so weit wie möglich. Zum Beispiel schließen alle [deskriptiven Statistiken](../workspace/pandas/descriptive-statistics.ipynb) von pandas-Objekten standardmäßig fehlende Daten aus. pandas verwendet für numerische Daten den Fließkommawert `NaN` (*Not a Number*), um fehlende Daten darzustellen.\n", "\n", "In pandas wurde eine der Programmiersprache R entlehnte Konvention übernommen und fehlende Daten als `NA` bezeichnet, was für *not available* (engl.: nicht verfügbar) steht. In statistischen Anwendungen können `NA`-Daten entweder Daten sein, die nicht existieren oder die zwar existieren, aber nicht beobachtet wurden (z.B. durch Probleme bei der Datenerfassung). Auch das [None](https://docs.python.org/3/c-api/none.html)-Objekt von Python wird in nicht-numerischen Arrays als `NA` behandelt.\n", "\n", "Methoden zum Handling von `NA`-Objekten:\n", "\n", "Argument | Beschreibung\n", ":------- | :-----------\n", "`dropna` | filtert Achsenbeschriftungen auf der Grundlage, ob Werte für die einzelnen Label fehlende Daten aufweisen, wobei unterschiedliche Schwellenwerte für die zu tolerierende Menge fehlender Daten gelten.\n", "`fillna` | füllt fehlende Daten mit einem Wert oder mit einer Interpolationsmethode wie `ffill'` oder `bfill` auf.\n", "`isna` | gibt boolesche Werte zurück, die angeben, welche Werte fehlen/`NA` sind.\n", "`notna` | negiert `isna` und gibt `True` für nicht-`NA`-Werte und `False` für `NA`-Werte zurück." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In diesem Notebook werden einige Möglichkeiten vorgestellt, wie fehlende Daten mit pandas DataFrames verwaltet werden können. Weitere Informationen findet ihr in der Pandas-Dokumentation: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) und [Missing data cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-missing-data).\n", "\n", "> **Siehe auch:**\n", "> \n", "> * [Dora](https://github.com/NathanEpstein/Dora)\n", "> * [Badfish](https://github.com/harshnisar/badfish)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:08.245740Z", "iopub.status.busy": "2026-05-22T13:42:08.245593Z", "iopub.status.idle": "2026-05-22T13:42:08.534472Z", "shell.execute_reply": "2026-05-22T13:42:08.534149Z", "shell.execute_reply.started": "2026-05-22T13:42:08.245723Z" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:08.534910Z", "iopub.status.busy": "2026-05-22T13:42:08.534788Z", "iopub.status.idle": "2026-05-22T13:42:10.539711Z", "shell.execute_reply": "2026-05-22T13:42:10.539394Z", "shell.execute_reply.started": "2026-05-22T13:42:08.534901Z" } }, "outputs": [], "source": [ "df = pd.read_csv(\n", " \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Überprüfen der Daten\n", "\n", "Bei der Bereinigung von Daten für die Analyse ist es oft wichtig, die fehlenden Daten selbst zu analysieren, um Probleme bei der Datenerfassung oder potenzielle Verzerrungen in den Daten aufgrund der fehlenden Daten zu ermitteln. Zunächst lassen wir uns die ersten 20 Datensätze anzeigen:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:10.540438Z", "iopub.status.busy": "2026-05-22T13:42:10.540337Z", "iopub.status.idle": "2026-05-22T13:42:10.549434Z", "shell.execute_reply": "2026-05-22T13:42:10.549111Z", "shell.execute_reply.started": "2026-05-22T13:42:10.540430Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timestampusernametemperatureheartratebuildlatestnote
02017-01-01T12:00:23michaelsmith12.0674e6a7805-8faa-2768-6ef6-eb3198b483ac0.0interval
12017-01-01T12:01:09kharrison6.0787256b7b0-e502-f576-62ec-ed73533c9c840.0wake
22017-01-01T12:01:34smithadam5.0899226c94b-bb4b-a6c8-8e02-cb42b53e9c900.0NaN
32017-01-01T12:02:09eddierodriguez28.076NaN0.0update
42017-01-01T12:02:36kenneth9429.062122f1c6a-403c-2221-6ed1-b5caa08f11e0NaNNaN
52017-01-01T12:03:04bryanttodd13.0860897dbe5-9c5b-71ca-73a1-7586959ca1980.0interval
62017-01-01T12:03:51andrea9817.0811c07ab9b-5f66-137d-a74f-921a41001f4e1.0NaN
72017-01-01T12:04:35scott2816.0767a60219f-6621-e548-180e-ca69624f9824NaNinterval
82017-01-01T12:05:05hillpamela5.082a8b87754-a162-da28-2527-4bce4b3d41911.0NaN
92017-01-01T12:05:41moorejeffrey25.063585f1a3c-0679-0ffe-9132-508933c703430.0wake
102017-01-01T12:06:21njohnsonNaN63e09b6001-125d-51cf-9c3f-9cb686c19d02NaNNaN
112017-01-01T12:06:53gsutton29.080607c9f6e-2bdf-a606-6d16-3004c69584361.0update
122017-01-01T12:07:41jessica4822.08303e1a07b-3e14-412c-3a69-6b45bc79f81cNaNupdate
132017-01-01T12:08:08hornjohn16.073NaN0.0interval
142017-01-01T12:08:35gramirez24.073NaN0.0wake
152017-01-01T12:09:05schmidtsamuelNaN78b9890c1e-79d5-8979-63ae-6c08a4cd476a0.0NaN
162017-01-01T12:09:48derrick47NaN63b60bd7de-4057-8a85-f806-e6eec1350338NaNinterval
172017-01-01T12:10:23beckercharles12.061b1dacc73-c8b7-1d7d-ee02-578da781a71e0.0test
182017-01-01T12:10:57ipittman11.0691aef7db8-9a3e-7dc9-d7a5-781ec0efd200NaNuser
192017-01-01T12:11:34sabrina6522.0828075d058-7dae-e2ec-d47e-58ec6d26899b1.0NaN
\n", "
" ], "text/plain": [ " timestamp username temperature heartrate \\\n", "0 2017-01-01T12:00:23 michaelsmith 12.0 67 \n", "1 2017-01-01T12:01:09 kharrison 6.0 78 \n", "2 2017-01-01T12:01:34 smithadam 5.0 89 \n", "3 2017-01-01T12:02:09 eddierodriguez 28.0 76 \n", "4 2017-01-01T12:02:36 kenneth94 29.0 62 \n", "5 2017-01-01T12:03:04 bryanttodd 13.0 86 \n", "6 2017-01-01T12:03:51 andrea98 17.0 81 \n", "7 2017-01-01T12:04:35 scott28 16.0 76 \n", "8 2017-01-01T12:05:05 hillpamela 5.0 82 \n", "9 2017-01-01T12:05:41 moorejeffrey 25.0 63 \n", "10 2017-01-01T12:06:21 njohnson NaN 63 \n", "11 2017-01-01T12:06:53 gsutton 29.0 80 \n", "12 2017-01-01T12:07:41 jessica48 22.0 83 \n", "13 2017-01-01T12:08:08 hornjohn 16.0 73 \n", "14 2017-01-01T12:08:35 gramirez 24.0 73 \n", "15 2017-01-01T12:09:05 schmidtsamuel NaN 78 \n", "16 2017-01-01T12:09:48 derrick47 NaN 63 \n", "17 2017-01-01T12:10:23 beckercharles 12.0 61 \n", "18 2017-01-01T12:10:57 ipittman 11.0 69 \n", "19 2017-01-01T12:11:34 sabrina65 22.0 82 \n", "\n", " build latest note \n", "0 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval \n", "1 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake \n", "2 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN \n", "3 NaN 0.0 update \n", "4 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN \n", "5 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval \n", "6 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN \n", "7 7a60219f-6621-e548-180e-ca69624f9824 NaN interval \n", "8 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN \n", "9 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake \n", "10 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN \n", "11 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update \n", "12 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update \n", "13 NaN 0.0 interval \n", "14 NaN 0.0 wake \n", "15 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN \n", "16 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval \n", "17 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test \n", "18 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user \n", "19 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dann schauen wir uns an, von welchem Datentyp die Spalten sind:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:10.549968Z", "iopub.status.busy": "2026-05-22T13:42:10.549883Z", "iopub.status.idle": "2026-05-22T13:42:10.552686Z", "shell.execute_reply": "2026-05-22T13:42:10.552385Z", "shell.execute_reply.started": "2026-05-22T13:42:10.549960Z" } }, "outputs": [ { "data": { "text/plain": [ "timestamp object\n", "username object\n", "temperature float64\n", "heartrate int64\n", "build object\n", "latest float64\n", "note object\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit `pandas.api.types as ptypes` können wir auch automatisiert überprüfen, ob die Datentypen euren Erwartungen entsprechen:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:10.553783Z", "iopub.status.busy": "2026-05-22T13:42:10.553688Z", "iopub.status.idle": "2026-05-22T13:42:10.651289Z", "shell.execute_reply": "2026-05-22T13:42:10.649555Z", "shell.execute_reply.started": "2026-05-22T13:42:10.553775Z" } }, "outputs": [ { "ename": "ValueError", "evalue": "Expected datetimes in ['timestamp'].", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[5], line 13\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mall\u001b[39m(\n\u001b[1;32m 10\u001b[0m ptypes\u001b[38;5;241m.\u001b[39mis_datetime64_any_dtype(df[col]) \u001b[38;5;28;01mfor\u001b[39;00m col \u001b[38;5;129;01min\u001b[39;00m expected_datetimes\n\u001b[1;32m 11\u001b[0m ):\n\u001b[1;32m 12\u001b[0m msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected datetimes in \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mexpected_datetimes\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m---> 13\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(msg)\n\u001b[1;32m 14\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mall\u001b[39m(ptypes\u001b[38;5;241m.\u001b[39mis_object_dtype(df[col]) \u001b[38;5;28;01mfor\u001b[39;00m col \u001b[38;5;129;01min\u001b[39;00m expected_objects):\n\u001b[1;32m 15\u001b[0m msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected objects in \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mexpected_objects\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", "\u001b[0;31mValueError\u001b[0m: Expected datetimes in ['timestamp']." ] } ], "source": [ "import pandas.api.types as ptypes\n", "\n", "\n", "expected_datetimes = [\"timestamp\"]\n", "expected_objects = [\"username\", \"build\", \"note\"]\n", "expected_floats = [\"temperature\", \"latest\"]\n", "expected_ints = [\"heartrate\"]\n", "\n", "if not all(\n", " ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes\n", "):\n", " msg = f\"Expected datetimes in {expected_datetimes}.\"\n", " raise ValueError(msg)\n", "if not all(ptypes.is_object_dtype(df[col]) for col in expected_objects):\n", " msg = f\"Expected objects in {expected_objects}.\"\n", " raise ValueError(msg)\n", "if not all(ptypes.is_float_dtype(df[col]) for col in expected_floats):\n", " msg = f\"Expected floats in {expected_floats}\"\n", " raise ValueError(msg)\n", "if not all(ptypes.is_int64_dtype(df[col]) for col in expected_ints):\n", " msg = f\"Expected ints in {expected_ints}\"\n", " raise ValueError(msg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nun können wir die `timestamp`-Spalte in den passenden Typ umwandeln, siehe auch [dtype konvertieren](convert-dtypes.ipynb):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:17.977797Z", "iopub.status.busy": "2026-05-22T13:42:17.977502Z", "iopub.status.idle": "2026-05-22T13:42:18.022332Z", "shell.execute_reply": "2026-05-22T13:42:18.022037Z", "shell.execute_reply.started": "2026-05-22T13:42:17.977777Z" } }, "outputs": [], "source": [ "df[\"timestamp\"] = df[\"timestamp\"].astype(\"datetime64[ns]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir können uns auch die Werte und deren Häufigkeit anzeigen lassen, z.B. für die Spalte `note`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:18.023157Z", "iopub.status.busy": "2026-05-22T13:42:18.023052Z", "iopub.status.idle": "2026-05-22T13:42:18.029936Z", "shell.execute_reply": "2026-05-22T13:42:18.029562Z", "shell.execute_reply.started": "2026-05-22T13:42:18.023147Z" } }, "outputs": [ { "data": { "text/plain": [ "note\n", "wake 16496\n", "user 16416\n", "interval 16274\n", "sleep 16226\n", "update 16213\n", "test 16068\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.note.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Entfernen aller Nullwerte (einschließlich der Angabe `n/a`)\n", "\n", "### 2.1 … mit `pandas.read_csv`\n", "\n", "[pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) filtert normalerweise bereits viele Werte heraus, die es als `NA` oder `NaN` erkennt. Weitere Werte können mit `na_values` angegeben werden." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:18.030588Z", "iopub.status.busy": "2026-05-22T13:42:18.030484Z", "iopub.status.idle": "2026-05-22T13:42:20.066338Z", "shell.execute_reply": "2026-05-22T13:42:20.066040Z", "shell.execute_reply.started": "2026-05-22T13:42:18.030581Z" } }, "outputs": [], "source": [ "df = pd.read_csv(\n", " \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\",\n", " na_values=[\"n/a\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 … mit `pandas.DataFrame.dropna`\n", "\n", "Mit [pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) lassen sich fehlende Werte löschen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Um den Umfang der Löschungen zu analysieren lassen wir uns den Umfang des DataFrame vor und nach dem Läschen mit [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) anzeigen:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.067453Z", "iopub.status.busy": "2026-05-22T13:42:20.067353Z", "iopub.status.idle": "2026-05-22T13:42:20.070197Z", "shell.execute_reply": "2026-05-22T13:42:20.069917Z", "shell.execute_reply.started": "2026-05-22T13:42:20.067445Z" } }, "outputs": [ { "data": { "text/plain": [ "(146397, 7)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.070564Z", "iopub.status.busy": "2026-05-22T13:42:20.070497Z", "iopub.status.idle": "2026-05-22T13:42:20.088946Z", "shell.execute_reply": "2026-05-22T13:42:20.088144Z", "shell.execute_reply.started": "2026-05-22T13:42:20.070557Z" } }, "outputs": [ { "data": { "text/plain": [ "(46116, 7)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir würden mit `pandas.DataFrame.dropna` also mehr als 2/3 der Datensätze verlieren." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Im nächsten Versuch wollen wir analysieren, ob ganze Zeilen oder Spalten keine Daten enthalten. Dabei werden mit `how='all'` Zeilen oder Spalten entfernt, die keine Werte enthalten; `axis=1` besagt, dass leere Zeilen entfernt werden sollen." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.089531Z", "iopub.status.busy": "2026-05-22T13:42:20.089414Z", "iopub.status.idle": "2026-05-22T13:42:20.106947Z", "shell.execute_reply": "2026-05-22T13:42:20.106693Z", "shell.execute_reply.started": "2026-05-22T13:42:20.089523Z" } }, "outputs": [ { "data": { "text/plain": [ "(146397, 7)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(how=\"all\", axis=1).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Auch dies bringt uns noch nicht weiter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Alle Spalten finden, in denen die meisten Daten vorhanden sind" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.107505Z", "iopub.status.busy": "2026-05-22T13:42:20.107361Z", "iopub.status.idle": "2026-05-22T13:42:20.124713Z", "shell.execute_reply": "2026-05-22T13:42:20.124462Z", "shell.execute_reply.started": "2026-05-22T13:42:20.107496Z" } }, "outputs": [ { "data": { "text/plain": [ "['timestamp', 'username', 'heartrate']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`thresh` erfordert eine bestimmte Anzahl von NA-Werten, in unserem Fall 90 %, bevor `axis=1` eine Spalte ausblendet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Alle Spalten mit fehlenden Daten finden\n", "\n", "Mit [pandas.DataFrame.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) können wir fehlende Werte finden und mit [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) erfahren wir, ob ein Element gültig ist, normalerweise über einer Spalte." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.125147Z", "iopub.status.busy": "2026-05-22T13:42:20.125073Z", "iopub.status.idle": "2026-05-22T13:42:20.139856Z", "shell.execute_reply": "2026-05-22T13:42:20.139543Z", "shell.execute_reply.started": "2026-05-22T13:42:20.125139Z" } }, "outputs": [], "source": [ "incomplete_columns = list(df.columns[df.isna().any()])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.140371Z", "iopub.status.busy": "2026-05-22T13:42:20.140273Z", "iopub.status.idle": "2026-05-22T13:42:20.142750Z", "shell.execute_reply": "2026-05-22T13:42:20.142437Z", "shell.execute_reply.started": "2026-05-22T13:42:20.140363Z" } }, "outputs": [ { "data": { "text/plain": [ "['temperature', 'build', 'latest', 'note']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "incomplete_columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit `num_missing` können wir uns nun die Anzahl der fehlenden Werte pro Spalte ausgeben lassen:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.144338Z", "iopub.status.busy": "2026-05-22T13:42:20.144194Z", "iopub.status.idle": "2026-05-22T13:42:20.162366Z", "shell.execute_reply": "2026-05-22T13:42:20.161696Z", "shell.execute_reply.started": "2026-05-22T13:42:20.144329Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number missing for column temperature: 32357\n", "number missing for column build: 32350\n", "number missing for column latest: 32298\n", "number missing for column note: 48704\n" ] } ], "source": [ "for col in incomplete_columns:\n", " num_missing = df[df[col].isna()].shape[0]\n", " print(f\"number missing for column {col}: {num_missing}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Diese Werte können wir uns auch prozentual ausgeben lassen:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.163080Z", "iopub.status.busy": "2026-05-22T13:42:20.162925Z", "iopub.status.idle": "2026-05-22T13:42:20.179218Z", "shell.execute_reply": "2026-05-22T13:42:20.178838Z", "shell.execute_reply.started": "2026-05-22T13:42:20.163072Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "percent missing for column temperature: 0.22102228870810195\n", "percent missing for column build: 0.22097447352063226\n", "percent missing for column latest: 0.22061927498514314\n", "percent missing for column note: 0.332684412931959\n" ] } ], "source": [ "for col in incomplete_columns:\n", " percent_missing = df[df[col].isna()].shape[0] / df.shape[0]\n", " print(f\"percent missing for column {col}: {percent_missing}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Ersetzen fehlender Daten\n", "\n", "Um unsere Änderungen in der Spalte `latest` überprüfen zu können, verwenden wir [pandas.Series.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). Die Methode gibt eine Serie zurück, die die Anzahl der eindeutigen Werte enthält:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.180045Z", "iopub.status.busy": "2026-05-22T13:42:20.179886Z", "iopub.status.idle": "2026-05-22T13:42:20.184512Z", "shell.execute_reply": "2026-05-22T13:42:20.184220Z", "shell.execute_reply.started": "2026-05-22T13:42:20.180037Z" } }, "outputs": [ { "data": { "text/plain": [ "temperature\n", "29.0 4688\n", "26.0 4674\n", "16.0 4656\n", "28.0 4648\n", "10.0 4632\n", "13.0 4629\n", "7.0 4624\n", "27.0 4621\n", "21.0 4585\n", "9.0 4576\n", "23.0 4571\n", "5.0 4568\n", "6.0 4563\n", "19.0 4561\n", "18.0 4557\n", "17.0 4556\n", "11.0 4529\n", "15.0 4525\n", "8.0 4486\n", "12.0 4484\n", "20.0 4473\n", "25.0 4469\n", "14.0 4464\n", "22.0 4455\n", "24.0 4446\n", "Name: count, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.temperature.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jetzt ersetzen wir die fehlenden Werte in der Spalte `temperature` durch den auf eine Nachkommastelle gerundeten Mittelwert mit [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.184935Z", "iopub.status.busy": "2026-05-22T13:42:20.184862Z", "iopub.status.idle": "2026-05-22T13:42:20.189732Z", "shell.execute_reply": "2026-05-22T13:42:20.189484Z", "shell.execute_reply.started": "2026-05-22T13:42:20.184928Z" } }, "outputs": [ { "data": { "text/plain": [ "temperature\n", "17.0 36913\n", "29.0 4688\n", "26.0 4674\n", "16.0 4656\n", "28.0 4648\n", "10.0 4632\n", "13.0 4629\n", "7.0 4624\n", "27.0 4621\n", "21.0 4585\n", "9.0 4576\n", "23.0 4571\n", "5.0 4568\n", "6.0 4563\n", "19.0 4561\n", "18.0 4557\n", "11.0 4529\n", "15.0 4525\n", "8.0 4486\n", "12.0 4484\n", "20.0 4473\n", "25.0 4469\n", "14.0 4464\n", "22.0 4455\n", "24.0 4446\n", "Name: count, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp_mean = round(df.temperature.mean(), 1)\n", "fill_mean = df.temperature.fillna(temp_mean)\n", "fill_mean.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6 Ersetzen fehlender Daten durch `backfill`\n", "\n", "Damit die Datensätze in ihrer zeitlichen Reihenfolge aufeinanderfolgen, setzen wir zunächst den Index für `timestamp` mit [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.190080Z", "iopub.status.busy": "2026-05-22T13:42:20.190022Z", "iopub.status.idle": "2026-05-22T13:42:20.194998Z", "shell.execute_reply": "2026-05-22T13:42:20.194654Z", "shell.execute_reply.started": "2026-05-22T13:42:20.190074Z" } }, "outputs": [], "source": [ "df = df.set_index(\"timestamp\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.195518Z", "iopub.status.busy": "2026-05-22T13:42:20.195440Z", "iopub.status.idle": "2026-05-22T13:42:20.200375Z", "shell.execute_reply": "2026-05-22T13:42:20.200152Z", "shell.execute_reply.started": "2026-05-22T13:42:20.195510Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
usernametemperatureheartratebuildlatestnote
timestamp
2017-01-01T12:00:23michaelsmith12.0674e6a7805-8faa-2768-6ef6-eb3198b483ac0.0interval
2017-01-01T12:01:09kharrison6.0787256b7b0-e502-f576-62ec-ed73533c9c840.0wake
2017-01-01T12:01:34smithadam5.0899226c94b-bb4b-a6c8-8e02-cb42b53e9c900.0NaN
2017-01-01T12:02:09eddierodriguez28.076NaN0.0update
2017-01-01T12:02:36kenneth9429.062122f1c6a-403c-2221-6ed1-b5caa08f11e0NaNNaN
2017-01-01T12:03:04bryanttodd13.0860897dbe5-9c5b-71ca-73a1-7586959ca1980.0interval
2017-01-01T12:03:51andrea9817.0811c07ab9b-5f66-137d-a74f-921a41001f4e1.0NaN
2017-01-01T12:04:35scott2816.0767a60219f-6621-e548-180e-ca69624f9824NaNinterval
2017-01-01T12:05:05hillpamela5.082a8b87754-a162-da28-2527-4bce4b3d41911.0NaN
2017-01-01T12:05:41moorejeffrey25.063585f1a3c-0679-0ffe-9132-508933c703430.0wake
2017-01-01T12:06:21njohnsonNaN63e09b6001-125d-51cf-9c3f-9cb686c19d02NaNNaN
2017-01-01T12:06:53gsutton29.080607c9f6e-2bdf-a606-6d16-3004c69584361.0update
2017-01-01T12:07:41jessica4822.08303e1a07b-3e14-412c-3a69-6b45bc79f81cNaNupdate
2017-01-01T12:08:08hornjohn16.073NaN0.0interval
2017-01-01T12:08:35gramirez24.073NaN0.0wake
2017-01-01T12:09:05schmidtsamuelNaN78b9890c1e-79d5-8979-63ae-6c08a4cd476a0.0NaN
2017-01-01T12:09:48derrick47NaN63b60bd7de-4057-8a85-f806-e6eec1350338NaNinterval
2017-01-01T12:10:23beckercharles12.061b1dacc73-c8b7-1d7d-ee02-578da781a71e0.0test
2017-01-01T12:10:57ipittman11.0691aef7db8-9a3e-7dc9-d7a5-781ec0efd200NaNuser
2017-01-01T12:11:34sabrina6522.0828075d058-7dae-e2ec-d47e-58ec6d26899b1.0NaN
\n", "
" ], "text/plain": [ " username temperature heartrate \\\n", "timestamp \n", "2017-01-01T12:00:23 michaelsmith 12.0 67 \n", "2017-01-01T12:01:09 kharrison 6.0 78 \n", "2017-01-01T12:01:34 smithadam 5.0 89 \n", "2017-01-01T12:02:09 eddierodriguez 28.0 76 \n", "2017-01-01T12:02:36 kenneth94 29.0 62 \n", "2017-01-01T12:03:04 bryanttodd 13.0 86 \n", "2017-01-01T12:03:51 andrea98 17.0 81 \n", "2017-01-01T12:04:35 scott28 16.0 76 \n", "2017-01-01T12:05:05 hillpamela 5.0 82 \n", "2017-01-01T12:05:41 moorejeffrey 25.0 63 \n", "2017-01-01T12:06:21 njohnson NaN 63 \n", "2017-01-01T12:06:53 gsutton 29.0 80 \n", "2017-01-01T12:07:41 jessica48 22.0 83 \n", "2017-01-01T12:08:08 hornjohn 16.0 73 \n", "2017-01-01T12:08:35 gramirez 24.0 73 \n", "2017-01-01T12:09:05 schmidtsamuel NaN 78 \n", "2017-01-01T12:09:48 derrick47 NaN 63 \n", "2017-01-01T12:10:23 beckercharles 12.0 61 \n", "2017-01-01T12:10:57 ipittman 11.0 69 \n", "2017-01-01T12:11:34 sabrina65 22.0 82 \n", "\n", " build latest note \n", "timestamp \n", "2017-01-01T12:00:23 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval \n", "2017-01-01T12:01:09 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake \n", "2017-01-01T12:01:34 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN \n", "2017-01-01T12:02:09 NaN 0.0 update \n", "2017-01-01T12:02:36 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN \n", "2017-01-01T12:03:04 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval \n", "2017-01-01T12:03:51 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN \n", "2017-01-01T12:04:35 7a60219f-6621-e548-180e-ca69624f9824 NaN interval \n", "2017-01-01T12:05:05 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN \n", "2017-01-01T12:05:41 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake \n", "2017-01-01T12:06:21 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN \n", "2017-01-01T12:06:53 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update \n", "2017-01-01T12:07:41 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update \n", "2017-01-01T12:08:08 NaN 0.0 interval \n", "2017-01-01T12:08:35 NaN 0.0 wake \n", "2017-01-01T12:09:05 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN \n", "2017-01-01T12:09:48 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval \n", "2017-01-01T12:10:23 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test \n", "2017-01-01T12:10:57 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user \n", "2017-01-01T12:11:34 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anschließend verwenden wir [pandas.DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), um die Datensätze nach `username` zu gruppieren und dann die fehlenden Daten mit der `backfill`-Methode von [pandas.core.groupby.DataFrameGroupBy.fillna](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.fillna.html) zu füllen. `limit` definiert die maximale Anzahl aufeinanderfolgender `NaN`-Werte:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.200718Z", "iopub.status.busy": "2026-05-22T13:42:20.200662Z", "iopub.status.idle": "2026-05-22T13:42:20.272918Z", "shell.execute_reply": "2026-05-22T13:42:20.272616Z", "shell.execute_reply.started": "2026-05-22T13:42:20.200712Z" } }, "outputs": [], "source": [ "df.temperature = df.groupby(\"username\").temperature.bfill(limit=3)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:42:20.273408Z", "iopub.status.busy": "2026-05-22T13:42:20.273342Z", "iopub.status.idle": "2026-05-22T13:42:20.293479Z", "shell.execute_reply": "2026-05-22T13:42:20.293195Z", "shell.execute_reply.started": "2026-05-22T13:42:20.273401Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number missing for column temperature: 22633\n", "number missing for column build: 32350\n", "number missing for column latest: 32298\n", "number missing for column note: 48704\n" ] } ], "source": [ "for col in incomplete_columns:\n", " num_missing = df[df[col].isna()].shape[0]\n", " print(f\"number missing for column {col}: {num_missing}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Argumente der Funktion `fillna`:\n", "\n", "Argument | Beschreibung\n", ":------- | :-----------\n", "`value` | Skalarwert oder dict-atähnliches Objekt, das zum Auffüllen fehlender Werte verwendet wird\n", "Methode | Interpolation; standardmäßig `ffill`, wenn die Funktion ohne weitere Argumente aufgerufen wird\n", "`axis` | Aufzufüllende Achse; Voreinstellung `axis=0`\n", "`inplace` | ändert das aufrufende Objekt, ohne eine Kopie zu erzeugen\n", "`limit` | Für das Auffüllen in Vorwärts- und Rückwärtsrichtung, maximale Anzahl von aufeinanderfolgenden Perioden zum Auffüllen" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }