{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dabf3b3d",
   "metadata": {},
   "source": [
    "# XML/HTML examples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6368f510",
   "metadata": {},
   "source": [
    "## HTML\n",
    "\n",
    "Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats, for example [lxml](#lxml), [Beautiful Soup](beautifulsoup.ipynb) and html5lib. While lxml is generally much faster, the other libraries cope better with malformed HTML or XML files.\n",
    "\n",
    "pandas has a built-in function, `read_html`, which uses libraries such as lxml, html5lib and Beautiful Soup to automatically parse tables in HTML files into DataFrame objects. These libraries have to be installed separately. With [Spack](../../../productive/envs/spack/index.rst) you can make lxml, Beautiful Soup and html5lib available in your kernel:\n",
    "\n",
    "``` bash\n",
    "$ spack env activate python-311\n",
    "$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib\n",
    "```\n",
    "\n",
    "Alternatively, you can install lxml, Beautiful Soup and html5lib with other package managers, for example:\n",
    "\n",
    "``` bash\n",
    "$ uv add lxml beautifulsoup4 html5lib\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b69188b1",
   "metadata": {},
   "source": [
    "To show how this works, I use a page from the Python documentation, the one for `xml.dom`, which contains several tables, among them one that maps IDL types to Python types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "44a6ae85",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:58.738053Z",
     "iopub.status.busy": "2026-05-22T08:18:58.737857Z",
     "iopub.status.idle": "2026-05-22T08:18:59.107269Z",
     "shell.execute_reply": "2026-05-22T08:18:59.106968Z",
     "shell.execute_reply.started": "2026-05-22T08:18:58.738032Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "tables = pd.read_html(\n",
    "    \"https://docs.python.org/3/library/xml.dom.html\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "342f5ba2",
   "metadata": {},
   "source": [
    "The `pandas.read_html` function has a number of options, but by default it searches for all tabular data contained in `<table>` tags and tries to parse it. The result is a list of DataFrame objects:"
   ]
  },
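  {
   "cell_type": "markdown",
   "id": "7c1e5a90",
   "metadata": {},
   "source": [
    "One of these options is `match`, which keeps only tables whose text matches a given string or regular expression. The following is a minimal sketch that assumes a small, made-up HTML snippet instead of a real page, so that it runs without network access; it still needs one of the parser libraries mentioned above:\n",
    "\n",
    "```python\n",
    "from io import StringIO\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Made-up HTML with two tables; only the first one mentions \"IDL Type\"\n",
    "html = \"\"\"\n",
    "<table>\n",
    "  <tr><th>IDL Type</th><th>Python Type</th></tr>\n",
    "  <tr><td>boolean</td><td>bool or int</td></tr>\n",
    "  <tr><td>int</td><td>int</td></tr>\n",
    "</table>\n",
    "<table>\n",
    "  <tr><th>Name</th><th>Value</th></tr>\n",
    "  <tr><td>a</td><td>1</td></tr>\n",
    "</table>\n",
    "\"\"\"\n",
    "\n",
    "# Literal HTML has to be wrapped in a file-like object;\n",
    "# match drops every table that does not contain the given text\n",
    "tables = pd.read_html(StringIO(html), match=\"IDL Type\")\n",
    "```\n",
    "\n",
    "With this input only the first table matches, so `tables` holds a single DataFrame."
   ]
  },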
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3cdaf9e9",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.107844Z",
     "iopub.status.busy": "2026-05-22T08:18:59.107723Z",
     "iopub.status.idle": "2026-05-22T08:18:59.113425Z",
     "shell.execute_reply": "2026-05-22T08:18:59.113094Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.107835Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(tables)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1e0b348e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.114227Z",
     "iopub.status.busy": "2026-05-22T08:18:59.114138Z",
     "iopub.status.idle": "2026-05-22T08:18:59.119303Z",
     "shell.execute_reply": "2026-05-22T08:18:59.119028Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.114220Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>IDL Type</th>\n",
       "      <th>Python Type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>boolean</td>\n",
       "      <td>bool or int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>int</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>long int</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>unsigned int</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>DOMString</td>\n",
       "      <td>str or bytes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       IDL Type   Python Type\n",
       "0       boolean   bool or int\n",
       "1           int           int\n",
       "2      long int           int\n",
       "3  unsigned int           int\n",
       "4     DOMString  str or bytes"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xml_idl = tables[2]\n",
    "\n",
    "xml_idl.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fca0f763",
   "metadata": {},
   "source": [
    "From here we can do some [data cleaning and analysis](../../../clean-prep/index.rst), such as counting how often each Python type occurs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e6672fe3",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.119815Z",
     "iopub.status.busy": "2026-05-22T08:18:59.119710Z",
     "iopub.status.idle": "2026-05-22T08:18:59.123680Z",
     "shell.execute_reply": "2026-05-22T08:18:59.123260Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.119807Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Python Type\n",
       "int             3\n",
       "bool or int     1\n",
       "str or bytes    1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xml_idl[\"Python Type\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52bec83f",
   "metadata": {},
   "source": [
    "## XML\n",
    "\n",
    "pandas has a `read_xml` function that makes reading XML files very easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "923f990d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.124212Z",
     "iopub.status.busy": "2026-05-22T08:18:59.124134Z",
     "iopub.status.idle": "2026-05-22T08:18:59.129104Z",
     "shell.execute_reply": "2026-05-22T08:18:59.128808Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.124204Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>title</th>\n",
       "      <th>language</th>\n",
       "      <th>author</th>\n",
       "      <th>license</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id             title language        author       license        date\n",
       "0   1     Python basics       en  Veit Schiele  BSD-3-Clause  2021-10-28\n",
       "1   2  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause  2019-06-27\n",
       "2   3  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause  2020-10-26\n",
       "3   4    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause  2020-04-13"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_xml(\"books.xml\")"
   ]
  },
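  {
   "cell_type": "markdown",
   "id": "b3d9e1f2",
   "metadata": {},
   "source": [
    "By default, `read_xml` turns both the attributes and the child elements of each row node into columns, which would explain why `id` appears next to `title` above. The following is a minimal sketch, assuming a made-up two-book document shaped like `books.xml` (whose actual content is not shown here) and the standard-library `etree` parser, so that lxml is not required:\n",
    "\n",
    "```python\n",
    "from io import StringIO\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical document; the real books.xml may look different\n",
    "xml = \"\"\"<books>\n",
    "  <book id=\"1\">\n",
    "    <title>Python basics</title>\n",
    "    <language>en</language>\n",
    "  </book>\n",
    "  <book id=\"2\">\n",
    "    <title>Jupyter Tutorial</title>\n",
    "    <language>de</language>\n",
    "  </book>\n",
    "</books>\"\"\"\n",
    "\n",
    "# xpath selects the nodes that become rows;\n",
    "# parser=\"etree\" uses xml.etree from the standard library\n",
    "df = pd.read_xml(StringIO(xml), xpath=\"./book\", parser=\"etree\")\n",
    "```\n",
    "\n",
    "The `id` attribute and the `title` and `language` elements each end up as one column of `df`."
   ]
  },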
  {
   "cell_type": "markdown",
   "id": "745031e6",
   "metadata": {},
   "source": [
    "### `lxml`\n",
    "\n",
    "Alternatively, `lxml.objectify` can be used first to parse XML files. With `getroot` we get a reference to the root node of the XML file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3849303b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.130467Z",
     "iopub.status.busy": "2026-05-22T08:18:59.130367Z",
     "iopub.status.idle": "2026-05-22T08:18:59.211738Z",
     "shell.execute_reply": "2026-05-22T08:18:59.211424Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.130459Z"
    }
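   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from lxml import objectify\n",
    "\n",
    "\n",
    "# Open in binary mode so that lxml can apply the encoding declared in the\n",
    "# XML file, and close the file handle as soon as parsing is done\n",
    "with Path(\"books.xml\").open(\"rb\") as f:\n",
    "    parsed = objectify.parse(f)\n",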
    "root = parsed.getroot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d45c8806",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.212139Z",
     "iopub.status.busy": "2026-05-22T08:18:59.212055Z",
     "iopub.status.idle": "2026-05-22T08:18:59.214234Z",
     "shell.execute_reply": "2026-05-22T08:18:59.213958Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.212131Z"
    }
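   },
   "outputs": [],
   "source": [
    "books = []\n",
    "\n",
    "# root.book iterates over all <book> elements directly below the root node\n",
    "for element in root.book:\n",
    "    data = {}\n",
    "    # Collect tag name and value of each child element (title, language, ...)\n",
    "    for child in element.iterchildren():\n",
    "        data[child.tag] = child.pyval\n",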
    "    books.append(data)"
   ]
  },
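  {
   "cell_type": "markdown",
   "id": "e58a2c47",
   "metadata": {},
   "source": [
    "The loop above only visits child elements, so attributes of `<book>`, such as the `id` column returned by `read_xml`, are not collected. The following is a minimal sketch of how they could be included, assuming a made-up two-book document in which `id` is an attribute (the real `books.xml` is not shown here):\n",
    "\n",
    "```python\n",
    "from lxml import objectify\n",
    "\n",
    "\n",
    "# Hypothetical document; the real books.xml may look different\n",
    "xml = b\"\"\"<books>\n",
    "  <book id=\"1\"><title>Python basics</title><language>en</language></book>\n",
    "  <book id=\"2\"><title>Jupyter Tutorial</title><language>de</language></book>\n",
    "</books>\"\"\"\n",
    "\n",
    "root = objectify.fromstring(xml)\n",
    "\n",
    "records = []\n",
    "for book in root.book:\n",
    "    # Start with the attributes of <book>, then add the child elements\n",
    "    record = dict(book.attrib)\n",
    "    for child in book.iterchildren():\n",
    "        record[child.tag] = child.pyval\n",
    "    records.append(record)\n",
    "```\n",
    "\n",
    "Attribute values are always strings, so `id` would still have to be converted, for example with `pd.to_numeric`."
   ]
  },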
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f44e22f2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-22T08:18:59.214552Z",
     "iopub.status.busy": "2026-05-22T08:18:59.214487Z",
     "iopub.status.idle": "2026-05-22T08:18:59.219192Z",
     "shell.execute_reply": "2026-05-22T08:18:59.218979Z",
     "shell.execute_reply.started": "2026-05-22T08:18:59.214545Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>language</th>\n",
       "      <th>author</th>\n",
       "      <th>license</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              title language        author       license        date\n",
       "0     Python basics       en  Veit Schiele  BSD-3-Clause  2021-10-28\n",
       "1  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause  2019-06-27\n",
       "2  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause  2020-10-26\n",
       "3    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause  2020-04-13"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(books)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.13 Kernel",
   "language": "python",
   "name": "python313"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}