{ "cells": [ { "cell_type": "markdown", "id": "dabf3b3d", "metadata": {}, "source": [ "# XML/HTML examples" ] }, { "cell_type": "markdown", "id": "6368f510", "metadata": {}, "source": [ "## HTML\n", "\n", "Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples are [lxml](#lxml), [Beautiful Soup](beautifulsoup.ipynb) and html5lib. While lxml is generally much faster, the other libraries cope better with malformed HTML or XML files.\n", "\n", "pandas has a built-in function, `read_html`, which uses libraries such as lxml, html5lib and Beautiful Soup to automatically parse tables from HTML files into DataFrame objects. These libraries must be installed separately. With [Spack](../../../productive/envs/spack/index.rst) you can provide lxml, Beautiful Soup and html5lib in your kernel:\n", "\n", "``` bash\n", "$ spack env activate python-311\n", "$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib\n", "```\n", "\n", "Alternatively, you can install the libraries with other package managers, for example\n", "\n", "``` bash\n", "$ uv add lxml beautifulsoup4 html5lib\n", "```" ] }, { "cell_type": "markdown", "id": "b69188b1", "metadata": {}, "source": [ "To show how this works, I use an HTML page from the Python documentation that describes the `xml.dom` API and contains several tables."
] }, { "cell_type": "code", "execution_count": 1, "id": "44a6ae85", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:58.738053Z", "iopub.status.busy": "2026-05-22T08:18:58.737857Z", "iopub.status.idle": "2026-05-22T08:18:59.107269Z", "shell.execute_reply": "2026-05-22T08:18:59.106968Z", "shell.execute_reply.started": "2026-05-22T08:18:58.738032Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "\n", "tables = pd.read_html(\n", "    \"https://docs.python.org/3/library/xml.dom.html\",\n", ")" ] }, { "cell_type": "markdown", "id": "342f5ba2", "metadata": {}, "source": [ "The function `pandas.read_html` has a number of options, but by default it searches for all tabular data contained in `<table>` tags and tries to parse it. The result is a list of DataFrame objects:" ] }, { "cell_type": "code", "execution_count": 2, "id": "3cdaf9e9", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.107844Z", "iopub.status.busy": "2026-05-22T08:18:59.107723Z", "iopub.status.idle": "2026-05-22T08:18:59.113425Z", "shell.execute_reply": "2026-05-22T08:18:59.113094Z", "shell.execute_reply.started": "2026-05-22T08:18:59.107835Z" } }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(tables)" ] }, { "cell_type": "code", "execution_count": 3, "id": "1e0b348e", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.114227Z", "iopub.status.busy": "2026-05-22T08:18:59.114138Z", "iopub.status.idle": "2026-05-22T08:18:59.119303Z", "shell.execute_reply": "2026-05-22T08:18:59.119028Z", "shell.execute_reply.started": "2026-05-22T08:18:59.114220Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>IDL Type</th><th>Python Type</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>boolean</td><td>bool or int</td></tr>\n",
"    <tr><th>1</th><td>int</td><td>int</td></tr>\n",
"    <tr><th>2</th><td>long int</td><td>int</td></tr>\n",
"    <tr><th>3</th><td>unsigned int</td><td>int</td></tr>\n",
"    <tr><th>4</th><td>DOMString</td><td>str or bytes</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "       IDL Type   Python Type\n", "0       boolean   bool or int\n", "1           int           int\n", "2      long int           int\n", "3  unsigned int           int\n", "4     DOMString  str or bytes" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xml_idl = tables[2]\n", "\n", "xml_idl.head()" ] }, { "cell_type": "markdown", "id": "fca0f763", "metadata": {}, "source": [ "From here, we can carry out some [data cleansing and analysis](../../../clean-prep/index.rst), such as counting how often each Python type occurs:" ] }, { "cell_type": "code", "execution_count": 4, "id": "e6672fe3", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.119815Z", "iopub.status.busy": "2026-05-22T08:18:59.119710Z", "iopub.status.idle": "2026-05-22T08:18:59.123680Z", "shell.execute_reply": "2026-05-22T08:18:59.123260Z", "shell.execute_reply.started": "2026-05-22T08:18:59.119807Z" } }, "outputs": [ { "data": { "text/plain": [ "Python Type\n", "int             3\n", "bool or int     1\n", "str or bytes    1\n", "Name: count, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xml_idl[\"Python Type\"].value_counts()" ] }, { "cell_type": "markdown", "id": "52bec83f", "metadata": {}, "source": [ "## XML\n", "\n", "pandas has a function `read_xml` that makes reading XML files very easy:" ] }, { "cell_type": "code", "execution_count": 5, "id": "923f990d", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.124212Z", "iopub.status.busy": "2026-05-22T08:18:59.124134Z", "iopub.status.idle": "2026-05-22T08:18:59.129104Z", "shell.execute_reply": "2026-05-22T08:18:59.128808Z", "shell.execute_reply.started": "2026-05-22T08:18:59.124204Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>id</th><th>title</th><th>language</th><th>author</th><th>license</th><th>date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Python basics</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2021-10-28</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>Jupyter Tutorial</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2019-06-27</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>Jupyter Tutorial</td><td>de</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2020-10-26</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>PyViz Tutorial</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2020-04-13</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "   id             title  language        author       license        date\n", "0   1     Python basics        en  Veit Schiele  BSD-3-Clause  2021-10-28\n", "1   2  Jupyter Tutorial        en  Veit Schiele  BSD-3-Clause  2019-06-27\n", "2   3  Jupyter Tutorial        de  Veit Schiele  BSD-3-Clause  2020-10-26\n", "3   4    PyViz Tutorial        en  Veit Schiele  BSD-3-Clause  2020-04-13" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_xml(\"books.xml\")" ] }, { "cell_type": "markdown", "id": "745031e6", "metadata": {}, "source": [ "### `lxml`\n", "\n", "Alternatively, `lxml.objectify` can first be used to parse XML files. With `getroot` we get a reference to the root node of the XML file:" ] }, { "cell_type": "code", "execution_count": 6, "id": "3849303b", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.130467Z", "iopub.status.busy": "2026-05-22T08:18:59.130367Z", "iopub.status.idle": "2026-05-22T08:18:59.211738Z", "shell.execute_reply": "2026-05-22T08:18:59.211424Z", "shell.execute_reply.started": "2026-05-22T08:18:59.130459Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "from lxml import objectify\n", "\n", "\n", "# Open the file in binary mode so lxml can read the XML encoding declaration\n", "with Path(\"books.xml\").open(\"rb\") as f:\n", "    parsed = objectify.parse(f)\n", "root = parsed.getroot()" ] }, { "cell_type": "code", "execution_count": 7, "id": "d45c8806", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:18:59.212139Z", "iopub.status.busy": "2026-05-22T08:18:59.212055Z", "iopub.status.idle": "2026-05-22T08:18:59.214234Z", "shell.execute_reply": "2026-05-22T08:18:59.213958Z", "shell.execute_reply.started": "2026-05-22T08:18:59.212131Z" } }, "outputs": [], "source": [ "books = []\n", "\n", "# Build one dictionary per <book> element\n", "for element in root.book:\n", "    data = {}\n", "    for child in element.getchildren():\n", "        # pyval converts the element text to a matching Python type\n", "        data[child.tag] = child.pyval\n", "    books.append(data)" ] }, { "cell_type": "code", "execution_count": 8, "id": "f44e22f2", "metadata": { "execution": { "iopub.execute_input": 
"2026-05-22T08:18:59.214552Z", "iopub.status.busy": "2026-05-22T08:18:59.214487Z", "iopub.status.idle": "2026-05-22T08:18:59.219192Z", "shell.execute_reply": "2026-05-22T08:18:59.218979Z", "shell.execute_reply.started": "2026-05-22T08:18:59.214545Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>title</th><th>language</th><th>author</th><th>license</th><th>date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Python basics</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2021-10-28</td></tr>\n",
"    <tr><th>1</th><td>Jupyter Tutorial</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2019-06-27</td></tr>\n",
"    <tr><th>2</th><td>Jupyter Tutorial</td><td>de</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2020-10-26</td></tr>\n",
"    <tr><th>3</th><td>PyViz Tutorial</td><td>en</td><td>Veit Schiele</td><td>BSD-3-Clause</td><td>2020-04-13</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "              title  language        author       license        date\n", "0     Python basics        en  Veit Schiele  BSD-3-Clause  2021-10-28\n", "1  Jupyter Tutorial        en  Veit Schiele  BSD-3-Clause  2019-06-27\n", "2  Jupyter Tutorial        de  Veit Schiele  BSD-3-Clause  2020-10-26\n", "3    PyViz Tutorial        en  Veit Schiele  BSD-3-Clause  2020-04-13" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(books)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }