{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Verwalten fehlender Daten mit pandas\n",
    "\n",
    "Fehlende Daten treten häufig bei Datenanalysen auf. pandas vereinfacht die Arbeit mit fehlenden Daten so weit wie möglich. Zum Beispiel schließen alle [deskriptiven Statistiken](../workspace/pandas/descriptive-statistics.ipynb) von pandas-Objekten standardmäßig fehlende Daten aus. pandas verwendet für numerische Daten den Fließkommawert `NaN` (*Not a Number*), um fehlende Daten darzustellen.\n",
    "\n",
    "In pandas wurde eine der Programmiersprache R entlehnte Konvention übernommen und fehlende Daten als `NA` bezeichnet, was für *not available* (engl.: nicht verfügbar) steht. In statistischen Anwendungen können `NA`-Daten entweder Daten sein, die nicht existieren oder die zwar existieren, aber nicht beobachtet wurden (z.B. durch Probleme bei der Datenerfassung). Auch das [None](https://docs.python.org/3/c-api/none.html)-Objekt von Python wird in nicht-numerischen Arrays als `NA` behandelt.\n",
    "\n",
    "Methoden zum Handling von `NA`-Objekten:\n",
    "\n",
    "Argument | Beschreibung\n",
    ":------- | :-----------\n",
    "`dropna` | filtert Achsenbeschriftungen auf der Grundlage, ob Werte für die einzelnen Label fehlende Daten aufweisen, wobei unterschiedliche Schwellenwerte für die zu tolerierende Menge fehlender Daten gelten.\n",
    "`fillna` | füllt fehlende Daten mit einem Wert oder mit einer Interpolationsmethode wie `ffill'` oder `bfill` auf.\n",
    "`isna` | gibt boolesche Werte zurück, die angeben, welche Werte fehlen/`NA` sind.\n",
    "`notna` | negiert `isna` und gibt `True` für nicht-`NA`-Werte und `False` für `NA`-Werte zurück."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In diesem Notebook werden einige Möglichkeiten vorgestellt, wie fehlende Daten mit pandas DataFrames verwaltet werden können. Weitere Informationen findet ihr in der Pandas-Dokumentation: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) und [Missing data cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-missing-data).\n",
    "\n",
    "> **Siehe auch:**\n",
    "> \n",
    "> * [Dora](https://github.com/NathanEpstein/Dora)\n",
    "> * [Badfish](https://github.com/harshnisar/badfish)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:08.245740Z",
     "iopub.status.busy": "2026-05-22T13:42:08.245593Z",
     "iopub.status.idle": "2026-05-22T13:42:08.534472Z",
     "shell.execute_reply": "2026-05-22T13:42:08.534149Z",
     "shell.execute_reply.started": "2026-05-22T13:42:08.245723Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:08.534910Z",
     "iopub.status.busy": "2026-05-22T13:42:08.534788Z",
     "iopub.status.idle": "2026-05-22T13:42:10.539711Z",
     "shell.execute_reply": "2026-05-22T13:42:10.539394Z",
     "shell.execute_reply.started": "2026-05-22T13:42:08.534901Z"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Überprüfen der Daten\n",
    "\n",
    "Bei der Bereinigung von Daten für die Analyse ist es oft wichtig, die fehlenden Daten selbst zu analysieren, um Probleme bei der Datenerfassung oder potenzielle Verzerrungen in den Daten aufgrund der fehlenden Daten zu ermitteln. Zunächst lassen wir uns die ersten 20 Datensätze anzeigen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:10.540438Z",
     "iopub.status.busy": "2026-05-22T13:42:10.540337Z",
     "iopub.status.idle": "2026-05-22T13:42:10.549434Z",
     "shell.execute_reply": "2026-05-22T13:42:10.549111Z",
     "shell.execute_reply.started": "2026-05-22T13:42:10.540430Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>username</th>\n",
       "      <th>temperature</th>\n",
       "      <th>heartrate</th>\n",
       "      <th>build</th>\n",
       "      <th>latest</th>\n",
       "      <th>note</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2017-01-01T12:00:23</td>\n",
       "      <td>michaelsmith</td>\n",
       "      <td>12.0</td>\n",
       "      <td>67</td>\n",
       "      <td>4e6a7805-8faa-2768-6ef6-eb3198b483ac</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2017-01-01T12:01:09</td>\n",
       "      <td>kharrison</td>\n",
       "      <td>6.0</td>\n",
       "      <td>78</td>\n",
       "      <td>7256b7b0-e502-f576-62ec-ed73533c9c84</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2017-01-01T12:01:34</td>\n",
       "      <td>smithadam</td>\n",
       "      <td>5.0</td>\n",
       "      <td>89</td>\n",
       "      <td>9226c94b-bb4b-a6c8-8e02-cb42b53e9c90</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2017-01-01T12:02:09</td>\n",
       "      <td>eddierodriguez</td>\n",
       "      <td>28.0</td>\n",
       "      <td>76</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2017-01-01T12:02:36</td>\n",
       "      <td>kenneth94</td>\n",
       "      <td>29.0</td>\n",
       "      <td>62</td>\n",
       "      <td>122f1c6a-403c-2221-6ed1-b5caa08f11e0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2017-01-01T12:03:04</td>\n",
       "      <td>bryanttodd</td>\n",
       "      <td>13.0</td>\n",
       "      <td>86</td>\n",
       "      <td>0897dbe5-9c5b-71ca-73a1-7586959ca198</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2017-01-01T12:03:51</td>\n",
       "      <td>andrea98</td>\n",
       "      <td>17.0</td>\n",
       "      <td>81</td>\n",
       "      <td>1c07ab9b-5f66-137d-a74f-921a41001f4e</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2017-01-01T12:04:35</td>\n",
       "      <td>scott28</td>\n",
       "      <td>16.0</td>\n",
       "      <td>76</td>\n",
       "      <td>7a60219f-6621-e548-180e-ca69624f9824</td>\n",
       "      <td>NaN</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2017-01-01T12:05:05</td>\n",
       "      <td>hillpamela</td>\n",
       "      <td>5.0</td>\n",
       "      <td>82</td>\n",
       "      <td>a8b87754-a162-da28-2527-4bce4b3d4191</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2017-01-01T12:05:41</td>\n",
       "      <td>moorejeffrey</td>\n",
       "      <td>25.0</td>\n",
       "      <td>63</td>\n",
       "      <td>585f1a3c-0679-0ffe-9132-508933c70343</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2017-01-01T12:06:21</td>\n",
       "      <td>njohnson</td>\n",
       "      <td>NaN</td>\n",
       "      <td>63</td>\n",
       "      <td>e09b6001-125d-51cf-9c3f-9cb686c19d02</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2017-01-01T12:06:53</td>\n",
       "      <td>gsutton</td>\n",
       "      <td>29.0</td>\n",
       "      <td>80</td>\n",
       "      <td>607c9f6e-2bdf-a606-6d16-3004c6958436</td>\n",
       "      <td>1.0</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2017-01-01T12:07:41</td>\n",
       "      <td>jessica48</td>\n",
       "      <td>22.0</td>\n",
       "      <td>83</td>\n",
       "      <td>03e1a07b-3e14-412c-3a69-6b45bc79f81c</td>\n",
       "      <td>NaN</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2017-01-01T12:08:08</td>\n",
       "      <td>hornjohn</td>\n",
       "      <td>16.0</td>\n",
       "      <td>73</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2017-01-01T12:08:35</td>\n",
       "      <td>gramirez</td>\n",
       "      <td>24.0</td>\n",
       "      <td>73</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2017-01-01T12:09:05</td>\n",
       "      <td>schmidtsamuel</td>\n",
       "      <td>NaN</td>\n",
       "      <td>78</td>\n",
       "      <td>b9890c1e-79d5-8979-63ae-6c08a4cd476a</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2017-01-01T12:09:48</td>\n",
       "      <td>derrick47</td>\n",
       "      <td>NaN</td>\n",
       "      <td>63</td>\n",
       "      <td>b60bd7de-4057-8a85-f806-e6eec1350338</td>\n",
       "      <td>NaN</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2017-01-01T12:10:23</td>\n",
       "      <td>beckercharles</td>\n",
       "      <td>12.0</td>\n",
       "      <td>61</td>\n",
       "      <td>b1dacc73-c8b7-1d7d-ee02-578da781a71e</td>\n",
       "      <td>0.0</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>2017-01-01T12:10:57</td>\n",
       "      <td>ipittman</td>\n",
       "      <td>11.0</td>\n",
       "      <td>69</td>\n",
       "      <td>1aef7db8-9a3e-7dc9-d7a5-781ec0efd200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2017-01-01T12:11:34</td>\n",
       "      <td>sabrina65</td>\n",
       "      <td>22.0</td>\n",
       "      <td>82</td>\n",
       "      <td>8075d058-7dae-e2ec-d47e-58ec6d26899b</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              timestamp        username  temperature  heartrate  \\\n",
       "0   2017-01-01T12:00:23    michaelsmith         12.0         67   \n",
       "1   2017-01-01T12:01:09       kharrison          6.0         78   \n",
       "2   2017-01-01T12:01:34       smithadam          5.0         89   \n",
       "3   2017-01-01T12:02:09  eddierodriguez         28.0         76   \n",
       "4   2017-01-01T12:02:36       kenneth94         29.0         62   \n",
       "5   2017-01-01T12:03:04      bryanttodd         13.0         86   \n",
       "6   2017-01-01T12:03:51        andrea98         17.0         81   \n",
       "7   2017-01-01T12:04:35         scott28         16.0         76   \n",
       "8   2017-01-01T12:05:05      hillpamela          5.0         82   \n",
       "9   2017-01-01T12:05:41    moorejeffrey         25.0         63   \n",
       "10  2017-01-01T12:06:21        njohnson          NaN         63   \n",
       "11  2017-01-01T12:06:53         gsutton         29.0         80   \n",
       "12  2017-01-01T12:07:41       jessica48         22.0         83   \n",
       "13  2017-01-01T12:08:08        hornjohn         16.0         73   \n",
       "14  2017-01-01T12:08:35        gramirez         24.0         73   \n",
       "15  2017-01-01T12:09:05   schmidtsamuel          NaN         78   \n",
       "16  2017-01-01T12:09:48       derrick47          NaN         63   \n",
       "17  2017-01-01T12:10:23   beckercharles         12.0         61   \n",
       "18  2017-01-01T12:10:57        ipittman         11.0         69   \n",
       "19  2017-01-01T12:11:34       sabrina65         22.0         82   \n",
       "\n",
       "                                   build  latest      note  \n",
       "0   4e6a7805-8faa-2768-6ef6-eb3198b483ac     0.0  interval  \n",
       "1   7256b7b0-e502-f576-62ec-ed73533c9c84     0.0      wake  \n",
       "2   9226c94b-bb4b-a6c8-8e02-cb42b53e9c90     0.0       NaN  \n",
       "3                                    NaN     0.0    update  \n",
       "4   122f1c6a-403c-2221-6ed1-b5caa08f11e0     NaN       NaN  \n",
       "5   0897dbe5-9c5b-71ca-73a1-7586959ca198     0.0  interval  \n",
       "6   1c07ab9b-5f66-137d-a74f-921a41001f4e     1.0       NaN  \n",
       "7   7a60219f-6621-e548-180e-ca69624f9824     NaN  interval  \n",
       "8   a8b87754-a162-da28-2527-4bce4b3d4191     1.0       NaN  \n",
       "9   585f1a3c-0679-0ffe-9132-508933c70343     0.0      wake  \n",
       "10  e09b6001-125d-51cf-9c3f-9cb686c19d02     NaN       NaN  \n",
       "11  607c9f6e-2bdf-a606-6d16-3004c6958436     1.0    update  \n",
       "12  03e1a07b-3e14-412c-3a69-6b45bc79f81c     NaN    update  \n",
       "13                                   NaN     0.0  interval  \n",
       "14                                   NaN     0.0      wake  \n",
       "15  b9890c1e-79d5-8979-63ae-6c08a4cd476a     0.0       NaN  \n",
       "16  b60bd7de-4057-8a85-f806-e6eec1350338     NaN  interval  \n",
       "17  b1dacc73-c8b7-1d7d-ee02-578da781a71e     0.0      test  \n",
       "18  1aef7db8-9a3e-7dc9-d7a5-781ec0efd200     NaN      user  \n",
       "19  8075d058-7dae-e2ec-d47e-58ec6d26899b     1.0       NaN  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dann schauen wir uns an, von welchem Datentyp die Spalten sind:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:10.549968Z",
     "iopub.status.busy": "2026-05-22T13:42:10.549883Z",
     "iopub.status.idle": "2026-05-22T13:42:10.552686Z",
     "shell.execute_reply": "2026-05-22T13:42:10.552385Z",
     "shell.execute_reply.started": "2026-05-22T13:42:10.549960Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "timestamp       object\n",
       "username        object\n",
       "temperature    float64\n",
       "heartrate        int64\n",
       "build           object\n",
       "latest         float64\n",
       "note            object\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mit `pandas.api.types as ptypes` können wir auch automatisiert überprüfen, ob die Datentypen euren Erwartungen entsprechen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:10.553783Z",
     "iopub.status.busy": "2026-05-22T13:42:10.553688Z",
     "iopub.status.idle": "2026-05-22T13:42:10.651289Z",
     "shell.execute_reply": "2026-05-22T13:42:10.649555Z",
     "shell.execute_reply.started": "2026-05-22T13:42:10.553775Z"
    }
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Expected datetimes in ['timestamp'].",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[5], line 13\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mall\u001b[39m(\n\u001b[1;32m     10\u001b[0m     ptypes\u001b[38;5;241m.\u001b[39mis_datetime64_any_dtype(df[col]) \u001b[38;5;28;01mfor\u001b[39;00m col \u001b[38;5;129;01min\u001b[39;00m expected_datetimes\n\u001b[1;32m     11\u001b[0m ):\n\u001b[1;32m     12\u001b[0m     msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected datetimes in \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mexpected_datetimes\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m---> 13\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(msg)\n\u001b[1;32m     14\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mall\u001b[39m(ptypes\u001b[38;5;241m.\u001b[39mis_object_dtype(df[col]) \u001b[38;5;28;01mfor\u001b[39;00m col \u001b[38;5;129;01min\u001b[39;00m expected_objects):\n\u001b[1;32m     15\u001b[0m     msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected objects in \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mexpected_objects\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n",
      "\u001b[0;31mValueError\u001b[0m: Expected datetimes in ['timestamp']."
     ]
    }
   ],
   "source": [
    "import pandas.api.types as ptypes\n",
    "\n",
    "\n",
    "expected_datetimes = [\"timestamp\"]\n",
    "expected_objects = [\"username\", \"build\", \"note\"]\n",
    "expected_floats = [\"temperature\", \"latest\"]\n",
    "expected_ints = [\"heartrate\"]\n",
    "\n",
    "if not all(\n",
    "    ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes\n",
    "):\n",
    "    msg = f\"Expected datetimes in {expected_datetimes}.\"\n",
    "    raise ValueError(msg)\n",
    "if not all(ptypes.is_object_dtype(df[col]) for col in expected_objects):\n",
    "    msg = f\"Expected objects in {expected_objects}.\"\n",
    "    raise ValueError(msg)\n",
    "if not all(ptypes.is_float_dtype(df[col]) for col in expected_floats):\n",
    "    msg = f\"Expected floats in {expected_floats}\"\n",
    "    raise ValueError(msg)\n",
    "if not all(ptypes.is_int64_dtype(df[col]) for col in expected_ints):\n",
    "    msg = f\"Expected ints in {expected_ints}\"\n",
    "    raise ValueError(msg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nun können wir die `timestamp`-Spalte in den passenden Typ umwandeln, siehe auch [dtype konvertieren](convert-dtypes.ipynb):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:17.977797Z",
     "iopub.status.busy": "2026-05-22T13:42:17.977502Z",
     "iopub.status.idle": "2026-05-22T13:42:18.022332Z",
     "shell.execute_reply": "2026-05-22T13:42:18.022037Z",
     "shell.execute_reply.started": "2026-05-22T13:42:17.977777Z"
    }
   },
   "outputs": [],
   "source": [
    "df[\"timestamp\"] = df[\"timestamp\"].astype(\"datetime64[ns]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wir können uns auch die Werte und deren Häufigkeit anzeigen lassen, z.B. für die Spalte `note`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:18.023157Z",
     "iopub.status.busy": "2026-05-22T13:42:18.023052Z",
     "iopub.status.idle": "2026-05-22T13:42:18.029936Z",
     "shell.execute_reply": "2026-05-22T13:42:18.029562Z",
     "shell.execute_reply.started": "2026-05-22T13:42:18.023147Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "note\n",
       "wake        16496\n",
       "user        16416\n",
       "interval    16274\n",
       "sleep       16226\n",
       "update      16213\n",
       "test        16068\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.note.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Entfernen aller Nullwerte (einschließlich der Angabe `n/a`)\n",
    "\n",
    "### 2.1  … mit `pandas.read_csv`\n",
    "\n",
    "[pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) filtert normalerweise bereits viele Werte heraus, die es als `NA` oder `NaN` erkennt. Weitere Werte können mit `na_values` angegeben werden."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:18.030588Z",
     "iopub.status.busy": "2026-05-22T13:42:18.030484Z",
     "iopub.status.idle": "2026-05-22T13:42:20.066338Z",
     "shell.execute_reply": "2026-05-22T13:42:20.066040Z",
     "shell.execute_reply.started": "2026-05-22T13:42:18.030581Z"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\",\n",
    "    na_values=[\"n/a\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 … mit `pandas.DataFrame.dropna`\n",
    "\n",
    "Mit [pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) lassen sich fehlende Werte löschen."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Um den Umfang der Löschungen zu analysieren lassen wir uns den Umfang des DataFrame vor und nach dem Läschen mit [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) anzeigen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.067453Z",
     "iopub.status.busy": "2026-05-22T13:42:20.067353Z",
     "iopub.status.idle": "2026-05-22T13:42:20.070197Z",
     "shell.execute_reply": "2026-05-22T13:42:20.069917Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.067445Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(146397, 7)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.070564Z",
     "iopub.status.busy": "2026-05-22T13:42:20.070497Z",
     "iopub.status.idle": "2026-05-22T13:42:20.088946Z",
     "shell.execute_reply": "2026-05-22T13:42:20.088144Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.070557Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(46116, 7)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna().shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wir würden mit `pandas.DataFrame.dropna` also mehr als 2/3 der Datensätze verlieren."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Im nächsten Versuch wollen wir analysieren, ob ganze Zeilen oder Spalten keine Daten enthalten. Dabei werden mit `how='all'` Zeilen oder Spalten entfernt, die keine Werte enthalten; `axis=1` besagt, dass leere Zeilen entfernt werden sollen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.089531Z",
     "iopub.status.busy": "2026-05-22T13:42:20.089414Z",
     "iopub.status.idle": "2026-05-22T13:42:20.106947Z",
     "shell.execute_reply": "2026-05-22T13:42:20.106693Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.089523Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(146397, 7)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(how=\"all\", axis=1).shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Auch dies bringt uns noch nicht weiter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Alle Spalten finden, in denen die meisten Daten vorhanden sind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.107505Z",
     "iopub.status.busy": "2026-05-22T13:42:20.107361Z",
     "iopub.status.idle": "2026-05-22T13:42:20.124713Z",
     "shell.execute_reply": "2026-05-22T13:42:20.124462Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.107496Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['timestamp', 'username', 'heartrate']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`thresh` erfordert eine bestimmte Anzahl von NA-Werten, in unserem Fall 90 %, bevor `axis=1` eine Spalte ausblendet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Alle Spalten mit fehlenden Daten finden\n",
    "\n",
    "Mit [pandas.DataFrame.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) können wir fehlende Werte finden und mit [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) erfahren wir, ob ein Element gültig ist, normalerweise über einer Spalte."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.125147Z",
     "iopub.status.busy": "2026-05-22T13:42:20.125073Z",
     "iopub.status.idle": "2026-05-22T13:42:20.139856Z",
     "shell.execute_reply": "2026-05-22T13:42:20.139543Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.125139Z"
    }
   },
   "outputs": [],
   "source": [
    "incomplete_columns = list(df.columns[df.isna().any()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.140371Z",
     "iopub.status.busy": "2026-05-22T13:42:20.140273Z",
     "iopub.status.idle": "2026-05-22T13:42:20.142750Z",
     "shell.execute_reply": "2026-05-22T13:42:20.142437Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.140363Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['temperature', 'build', 'latest', 'note']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "incomplete_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mit `num_missing` können wir uns nun die Anzahl der fehlenden Werte pro Spalte ausgeben lassen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.144338Z",
     "iopub.status.busy": "2026-05-22T13:42:20.144194Z",
     "iopub.status.idle": "2026-05-22T13:42:20.162366Z",
     "shell.execute_reply": "2026-05-22T13:42:20.161696Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.144329Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number missing for column temperature: 32357\n",
      "number missing for column build: 32350\n",
      "number missing for column latest: 32298\n",
      "number missing for column note: 48704\n"
     ]
    }
   ],
   "source": [
    "for col in incomplete_columns:\n",
    "    num_missing = df[df[col].isna()].shape[0]\n",
    "    print(f\"number missing for column {col}: {num_missing}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Diese Werte können wir uns auch prozentual ausgeben lassen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.163080Z",
     "iopub.status.busy": "2026-05-22T13:42:20.162925Z",
     "iopub.status.idle": "2026-05-22T13:42:20.179218Z",
     "shell.execute_reply": "2026-05-22T13:42:20.178838Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.163072Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "percent missing for column temperature: 0.22102228870810195\n",
      "percent missing for column build: 0.22097447352063226\n",
      "percent missing for column latest: 0.22061927498514314\n",
      "percent missing for column note: 0.332684412931959\n"
     ]
    }
   ],
   "source": [
    "for col in incomplete_columns:\n",
    "    percent_missing = df[df[col].isna()].shape[0] / df.shape[0]\n",
    "    print(f\"percent missing for column {col}: {percent_missing}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 Ersetzen fehlender Daten\n",
    "\n",
    "Um unsere Änderungen in der Spalte `latest` überprüfen zu können, verwenden wir [pandas.Series.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). Die Methode gibt eine Serie zurück, die die Anzahl der eindeutigen Werte enthält:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.180045Z",
     "iopub.status.busy": "2026-05-22T13:42:20.179886Z",
     "iopub.status.idle": "2026-05-22T13:42:20.184512Z",
     "shell.execute_reply": "2026-05-22T13:42:20.184220Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.180037Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "temperature\n",
       "29.0    4688\n",
       "26.0    4674\n",
       "16.0    4656\n",
       "28.0    4648\n",
       "10.0    4632\n",
       "13.0    4629\n",
       "7.0     4624\n",
       "27.0    4621\n",
       "21.0    4585\n",
       "9.0     4576\n",
       "23.0    4571\n",
       "5.0     4568\n",
       "6.0     4563\n",
       "19.0    4561\n",
       "18.0    4557\n",
       "17.0    4556\n",
       "11.0    4529\n",
       "15.0    4525\n",
       "8.0     4486\n",
       "12.0    4484\n",
       "20.0    4473\n",
       "25.0    4469\n",
       "14.0    4464\n",
       "22.0    4455\n",
       "24.0    4446\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.temperature.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jetzt ersetzen wir die fehlenden Werte in der Spalte `temperature` durch den auf eine Nachkommastelle gerundeten Mittelwert mit [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.184935Z",
     "iopub.status.busy": "2026-05-22T13:42:20.184862Z",
     "iopub.status.idle": "2026-05-22T13:42:20.189732Z",
     "shell.execute_reply": "2026-05-22T13:42:20.189484Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.184928Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "temperature\n",
       "17.0    36913\n",
       "29.0     4688\n",
       "26.0     4674\n",
       "16.0     4656\n",
       "28.0     4648\n",
       "10.0     4632\n",
       "13.0     4629\n",
       "7.0      4624\n",
       "27.0     4621\n",
       "21.0     4585\n",
       "9.0      4576\n",
       "23.0     4571\n",
       "5.0      4568\n",
       "6.0      4563\n",
       "19.0     4561\n",
       "18.0     4557\n",
       "11.0     4529\n",
       "15.0     4525\n",
       "8.0      4486\n",
       "12.0     4484\n",
       "20.0     4473\n",
       "25.0     4469\n",
       "14.0     4464\n",
       "22.0     4455\n",
       "24.0     4446\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp_mean = round(df.temperature.mean(), 1)\n",
    "fill_mean = df.temperature.fillna(temp_mean)\n",
    "fill_mean.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.6 Ersetzen fehlender Daten durch `backfill`\n",
    "\n",
    "Damit die Datensätze in ihrer zeitlichen Reihenfolge aufeinanderfolgen, setzen wir zunächst den Index für `timestamp` mit [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.190080Z",
     "iopub.status.busy": "2026-05-22T13:42:20.190022Z",
     "iopub.status.idle": "2026-05-22T13:42:20.194998Z",
     "shell.execute_reply": "2026-05-22T13:42:20.194654Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.190074Z"
    }
   },
   "outputs": [],
   "source": [
    "df = df.set_index(\"timestamp\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.195518Z",
     "iopub.status.busy": "2026-05-22T13:42:20.195440Z",
     "iopub.status.idle": "2026-05-22T13:42:20.200375Z",
     "shell.execute_reply": "2026-05-22T13:42:20.200152Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.195510Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>username</th>\n",
       "      <th>temperature</th>\n",
       "      <th>heartrate</th>\n",
       "      <th>build</th>\n",
       "      <th>latest</th>\n",
       "      <th>note</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:00:23</th>\n",
       "      <td>michaelsmith</td>\n",
       "      <td>12.0</td>\n",
       "      <td>67</td>\n",
       "      <td>4e6a7805-8faa-2768-6ef6-eb3198b483ac</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:01:09</th>\n",
       "      <td>kharrison</td>\n",
       "      <td>6.0</td>\n",
       "      <td>78</td>\n",
       "      <td>7256b7b0-e502-f576-62ec-ed73533c9c84</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:01:34</th>\n",
       "      <td>smithadam</td>\n",
       "      <td>5.0</td>\n",
       "      <td>89</td>\n",
       "      <td>9226c94b-bb4b-a6c8-8e02-cb42b53e9c90</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:02:09</th>\n",
       "      <td>eddierodriguez</td>\n",
       "      <td>28.0</td>\n",
       "      <td>76</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:02:36</th>\n",
       "      <td>kenneth94</td>\n",
       "      <td>29.0</td>\n",
       "      <td>62</td>\n",
       "      <td>122f1c6a-403c-2221-6ed1-b5caa08f11e0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:03:04</th>\n",
       "      <td>bryanttodd</td>\n",
       "      <td>13.0</td>\n",
       "      <td>86</td>\n",
       "      <td>0897dbe5-9c5b-71ca-73a1-7586959ca198</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:03:51</th>\n",
       "      <td>andrea98</td>\n",
       "      <td>17.0</td>\n",
       "      <td>81</td>\n",
       "      <td>1c07ab9b-5f66-137d-a74f-921a41001f4e</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:04:35</th>\n",
       "      <td>scott28</td>\n",
       "      <td>16.0</td>\n",
       "      <td>76</td>\n",
       "      <td>7a60219f-6621-e548-180e-ca69624f9824</td>\n",
       "      <td>NaN</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:05:05</th>\n",
       "      <td>hillpamela</td>\n",
       "      <td>5.0</td>\n",
       "      <td>82</td>\n",
       "      <td>a8b87754-a162-da28-2527-4bce4b3d4191</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:05:41</th>\n",
       "      <td>moorejeffrey</td>\n",
       "      <td>25.0</td>\n",
       "      <td>63</td>\n",
       "      <td>585f1a3c-0679-0ffe-9132-508933c70343</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:06:21</th>\n",
       "      <td>njohnson</td>\n",
       "      <td>NaN</td>\n",
       "      <td>63</td>\n",
       "      <td>e09b6001-125d-51cf-9c3f-9cb686c19d02</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:06:53</th>\n",
       "      <td>gsutton</td>\n",
       "      <td>29.0</td>\n",
       "      <td>80</td>\n",
       "      <td>607c9f6e-2bdf-a606-6d16-3004c6958436</td>\n",
       "      <td>1.0</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:07:41</th>\n",
       "      <td>jessica48</td>\n",
       "      <td>22.0</td>\n",
       "      <td>83</td>\n",
       "      <td>03e1a07b-3e14-412c-3a69-6b45bc79f81c</td>\n",
       "      <td>NaN</td>\n",
       "      <td>update</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:08:08</th>\n",
       "      <td>hornjohn</td>\n",
       "      <td>16.0</td>\n",
       "      <td>73</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:08:35</th>\n",
       "      <td>gramirez</td>\n",
       "      <td>24.0</td>\n",
       "      <td>73</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>wake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:09:05</th>\n",
       "      <td>schmidtsamuel</td>\n",
       "      <td>NaN</td>\n",
       "      <td>78</td>\n",
       "      <td>b9890c1e-79d5-8979-63ae-6c08a4cd476a</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:09:48</th>\n",
       "      <td>derrick47</td>\n",
       "      <td>NaN</td>\n",
       "      <td>63</td>\n",
       "      <td>b60bd7de-4057-8a85-f806-e6eec1350338</td>\n",
       "      <td>NaN</td>\n",
       "      <td>interval</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:10:23</th>\n",
       "      <td>beckercharles</td>\n",
       "      <td>12.0</td>\n",
       "      <td>61</td>\n",
       "      <td>b1dacc73-c8b7-1d7d-ee02-578da781a71e</td>\n",
       "      <td>0.0</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:10:57</th>\n",
       "      <td>ipittman</td>\n",
       "      <td>11.0</td>\n",
       "      <td>69</td>\n",
       "      <td>1aef7db8-9a3e-7dc9-d7a5-781ec0efd200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017-01-01T12:11:34</th>\n",
       "      <td>sabrina65</td>\n",
       "      <td>22.0</td>\n",
       "      <td>82</td>\n",
       "      <td>8075d058-7dae-e2ec-d47e-58ec6d26899b</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           username  temperature  heartrate  \\\n",
       "timestamp                                                     \n",
       "2017-01-01T12:00:23    michaelsmith         12.0         67   \n",
       "2017-01-01T12:01:09       kharrison          6.0         78   \n",
       "2017-01-01T12:01:34       smithadam          5.0         89   \n",
       "2017-01-01T12:02:09  eddierodriguez         28.0         76   \n",
       "2017-01-01T12:02:36       kenneth94         29.0         62   \n",
       "2017-01-01T12:03:04      bryanttodd         13.0         86   \n",
       "2017-01-01T12:03:51        andrea98         17.0         81   \n",
       "2017-01-01T12:04:35         scott28         16.0         76   \n",
       "2017-01-01T12:05:05      hillpamela          5.0         82   \n",
       "2017-01-01T12:05:41    moorejeffrey         25.0         63   \n",
       "2017-01-01T12:06:21        njohnson          NaN         63   \n",
       "2017-01-01T12:06:53         gsutton         29.0         80   \n",
       "2017-01-01T12:07:41       jessica48         22.0         83   \n",
       "2017-01-01T12:08:08        hornjohn         16.0         73   \n",
       "2017-01-01T12:08:35        gramirez         24.0         73   \n",
       "2017-01-01T12:09:05   schmidtsamuel          NaN         78   \n",
       "2017-01-01T12:09:48       derrick47          NaN         63   \n",
       "2017-01-01T12:10:23   beckercharles         12.0         61   \n",
       "2017-01-01T12:10:57        ipittman         11.0         69   \n",
       "2017-01-01T12:11:34       sabrina65         22.0         82   \n",
       "\n",
       "                                                    build  latest      note  \n",
       "timestamp                                                                    \n",
       "2017-01-01T12:00:23  4e6a7805-8faa-2768-6ef6-eb3198b483ac     0.0  interval  \n",
       "2017-01-01T12:01:09  7256b7b0-e502-f576-62ec-ed73533c9c84     0.0      wake  \n",
       "2017-01-01T12:01:34  9226c94b-bb4b-a6c8-8e02-cb42b53e9c90     0.0       NaN  \n",
       "2017-01-01T12:02:09                                   NaN     0.0    update  \n",
       "2017-01-01T12:02:36  122f1c6a-403c-2221-6ed1-b5caa08f11e0     NaN       NaN  \n",
       "2017-01-01T12:03:04  0897dbe5-9c5b-71ca-73a1-7586959ca198     0.0  interval  \n",
       "2017-01-01T12:03:51  1c07ab9b-5f66-137d-a74f-921a41001f4e     1.0       NaN  \n",
       "2017-01-01T12:04:35  7a60219f-6621-e548-180e-ca69624f9824     NaN  interval  \n",
       "2017-01-01T12:05:05  a8b87754-a162-da28-2527-4bce4b3d4191     1.0       NaN  \n",
       "2017-01-01T12:05:41  585f1a3c-0679-0ffe-9132-508933c70343     0.0      wake  \n",
       "2017-01-01T12:06:21  e09b6001-125d-51cf-9c3f-9cb686c19d02     NaN       NaN  \n",
       "2017-01-01T12:06:53  607c9f6e-2bdf-a606-6d16-3004c6958436     1.0    update  \n",
       "2017-01-01T12:07:41  03e1a07b-3e14-412c-3a69-6b45bc79f81c     NaN    update  \n",
       "2017-01-01T12:08:08                                   NaN     0.0  interval  \n",
       "2017-01-01T12:08:35                                   NaN     0.0      wake  \n",
       "2017-01-01T12:09:05  b9890c1e-79d5-8979-63ae-6c08a4cd476a     0.0       NaN  \n",
       "2017-01-01T12:09:48  b60bd7de-4057-8a85-f806-e6eec1350338     NaN  interval  \n",
       "2017-01-01T12:10:23  b1dacc73-c8b7-1d7d-ee02-578da781a71e     0.0      test  \n",
       "2017-01-01T12:10:57  1aef7db8-9a3e-7dc9-d7a5-781ec0efd200     NaN      user  \n",
       "2017-01-01T12:11:34  8075d058-7dae-e2ec-d47e-58ec6d26899b     1.0       NaN  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Anschließend verwenden wir [pandas.DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), um die Datensätze nach `username` zu gruppieren und dann die fehlenden Daten mit der `backfill`-Methode von [pandas.core.groupby.DataFrameGroupBy.fillna](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.fillna.html) zu füllen. `limit` definiert die maximale Anzahl aufeinanderfolgender `NaN`-Werte:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.200718Z",
     "iopub.status.busy": "2026-05-22T13:42:20.200662Z",
     "iopub.status.idle": "2026-05-22T13:42:20.272918Z",
     "shell.execute_reply": "2026-05-22T13:42:20.272616Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.200712Z"
    }
   },
   "outputs": [],
   "source": [
    "df.temperature = df.groupby(\"username\").temperature.bfill(limit=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T13:42:20.273408Z",
     "iopub.status.busy": "2026-05-22T13:42:20.273342Z",
     "iopub.status.idle": "2026-05-22T13:42:20.293479Z",
     "shell.execute_reply": "2026-05-22T13:42:20.293195Z",
     "shell.execute_reply.started": "2026-05-22T13:42:20.273401Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number missing for column temperature: 22633\n",
      "number missing for column build: 32350\n",
      "number missing for column latest: 32298\n",
      "number missing for column note: 48704\n"
     ]
    }
   ],
   "source": [
    "for col in incomplete_columns:\n",
    "    num_missing = df[df[col].isna()].shape[0]\n",
    "    print(f\"number missing for column {col}: {num_missing}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Argumente der Funktion `fillna`:\n",
    "\n",
    "Argument | Beschreibung\n",
    ":------- | :-----------\n",
    "`value` | Skalarwert oder dict-atähnliches Objekt, das zum Auffüllen fehlender Werte verwendet wird\n",
    "Methode | Interpolation; standardmäßig `ffill`, wenn die Funktion ohne weitere Argumente aufgerufen wird\n",
    "`axis` | Aufzufüllende Achse; Voreinstellung `axis=0`\n",
    "`inplace` | ändert das aufrufende Objekt, ohne eine Kopie zu erzeugen\n",
    "`limit` | Für das Auffüllen in Vorwärts- und Rückwärtsrichtung, maximale Anzahl von aufeinanderfolgenden Perioden zum Auffüllen"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.13 Kernel",
   "language": "python",
   "name": "python313"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}