{
"cells": [
{
"cell_type": "markdown",
"id": "dabf3b3d",
"metadata": {},
"source": [
"# XML/HTML-Beispiele"
]
},
{
"cell_type": "markdown",
"id": "6368f510",
"metadata": {},
"source": [
"## HTML\n",
"\n",
"Python verfügt über zahlreiche Bibliotheken zum Lesen und Schreiben von Daten in den allgegenwärtigen HTML- und XML-Formaten. Beispiele sind [lxml](#lxml), [Beautiful Soup](beautifulsoup.ipynb) und html5lib. Während lxml im Allgemeinen vergleichsweise viel schneller ist, können die anderen Bibliotheken besser mit fehlerhaften HTML- oder XML-Dateien umgehen.\n",
"\n",
"pandas hat eine eingebaute Funktion, `read_html`, die Bibliotheken wie lxml, html5lib und Beautiful Soup verwendet, um automatisch Tabellen aus HTML-Dateien als DataFrame-Objekte zu parsen. Diese müssen zusätzlich installiert werden. Mit [Spack](../../../productive/envs/spack/index.rst) könnt ihr lxml, BeautifulSoup und html5lib in eurem Kernel bereitstellen:\n",
"\n",
"``` bash\n",
"$ spack env activate python-311\n",
"$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib\n",
"```\n",
"\n",
"Alternativ könnt ihr BeautifulSoup auch mit anderen Paketmanagern installieren, z.B.\n",
"\n",
"``` bash\n",
"$ uv add lxml beautifulsoup4 html5lib\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b69188b1",
"metadata": {},
"source": [
"Um zu zeigen, wie das funktioniert, verwende ich eine HTML-Datei der Wikipedia, die einen Überblick über verschiedene Serialisierungsformate gibt."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "44a6ae85",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:58.738053Z",
"iopub.status.busy": "2026-05-22T08:18:58.737857Z",
"iopub.status.idle": "2026-05-22T08:18:59.107269Z",
"shell.execute_reply": "2026-05-22T08:18:59.106968Z",
"shell.execute_reply.started": "2026-05-22T08:18:58.738032Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"tables = pd.read_html(\n",
" \"https://docs.python.org/3/library/xml.dom.html\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "342f5ba2",
"metadata": {},
"source": [
"Die Funktion `pandas.read_html` hat eine Reihe von Optionen, aber standardmäßig sucht sie nach allen Tabellendaten, die in `
`-Tags enthalten sind, und versucht, diese zu analysieren. Das Ergebnis ist eine Liste von DataFrame-Objekten:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3cdaf9e9",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.107844Z",
"iopub.status.busy": "2026-05-22T08:18:59.107723Z",
"iopub.status.idle": "2026-05-22T08:18:59.113425Z",
"shell.execute_reply": "2026-05-22T08:18:59.113094Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.107835Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(tables)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1e0b348e",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.114227Z",
"iopub.status.busy": "2026-05-22T08:18:59.114138Z",
"iopub.status.idle": "2026-05-22T08:18:59.119303Z",
"shell.execute_reply": "2026-05-22T08:18:59.119028Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.114220Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" IDL Type | \n",
" Python Type | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" boolean | \n",
" bool or int | \n",
"
\n",
" \n",
" | 1 | \n",
" int | \n",
" int | \n",
"
\n",
" \n",
" | 2 | \n",
" long int | \n",
" int | \n",
"
\n",
" \n",
" | 3 | \n",
" unsigned int | \n",
" int | \n",
"
\n",
" \n",
" | 4 | \n",
" DOMString | \n",
" str or bytes | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" IDL Type Python Type\n",
"0 boolean bool or int\n",
"1 int int\n",
"2 long int int\n",
"3 unsigned int int\n",
"4 DOMString str or bytes"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xml_idl = tables[2]\n",
"\n",
"xml_idl.head()"
]
},
{
"cell_type": "markdown",
"id": "fca0f763",
"metadata": {},
"source": [
"Von hier aus können wir einige [Datenbereinigungen und -analysen](../../../clean-prep/index.rst) vornehmen, wie z.B. die Anzahl der verschiedenen Schema-IDL:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e6672fe3",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.119815Z",
"iopub.status.busy": "2026-05-22T08:18:59.119710Z",
"iopub.status.idle": "2026-05-22T08:18:59.123680Z",
"shell.execute_reply": "2026-05-22T08:18:59.123260Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.119807Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Python Type\n",
"int 3\n",
"bool or int 1\n",
"str or bytes 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xml_idl[\"Python Type\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "52bec83f",
"metadata": {},
"source": [
"## XML\n",
"\n",
"pandas hat eine Funktion `read_xml`, wodurch das Lesen von XML-Dateien sehr einfach wird:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "923f990d",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.124212Z",
"iopub.status.busy": "2026-05-22T08:18:59.124134Z",
"iopub.status.idle": "2026-05-22T08:18:59.129104Z",
"shell.execute_reply": "2026-05-22T08:18:59.128808Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.124204Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" title | \n",
" language | \n",
" author | \n",
" license | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Python basics | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2021-10-28 | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Jupyter Tutorial | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2019-06-27 | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Jupyter Tutorial | \n",
" de | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2020-10-26 | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" PyViz Tutorial | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2020-04-13 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id title language author license date\n",
"0 1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n",
"1 2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n",
"2 3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n",
"3 4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_xml(\"books.xml\")"
]
},
{
"cell_type": "markdown",
"id": "745031e6",
"metadata": {},
"source": [
"### `lxml`\n",
"\n",
"Alternativ kann auch zunächst `lxml.objectify` zum Parsen von XML-Dateien verwendet werden. Dabei erhalten wir mit `getroot` einen Verweis auf den Wurzelknoten der XML-Datei:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3849303b",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.130467Z",
"iopub.status.busy": "2026-05-22T08:18:59.130367Z",
"iopub.status.idle": "2026-05-22T08:18:59.211738Z",
"shell.execute_reply": "2026-05-22T08:18:59.211424Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.130459Z"
}
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"from lxml import objectify\n",
"\n",
"\n",
"parsed = objectify.parse(Path.open(\"books.xml\"))\n",
"root = parsed.getroot()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d45c8806",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.212139Z",
"iopub.status.busy": "2026-05-22T08:18:59.212055Z",
"iopub.status.idle": "2026-05-22T08:18:59.214234Z",
"shell.execute_reply": "2026-05-22T08:18:59.213958Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.212131Z"
}
},
"outputs": [],
"source": [
"books = []\n",
"\n",
"for element in root.book:\n",
" data = {}\n",
" for child in element.getchildren():\n",
" data[child.tag] = child.pyval\n",
" books.append(data)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f44e22f2",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T08:18:59.214552Z",
"iopub.status.busy": "2026-05-22T08:18:59.214487Z",
"iopub.status.idle": "2026-05-22T08:18:59.219192Z",
"shell.execute_reply": "2026-05-22T08:18:59.218979Z",
"shell.execute_reply.started": "2026-05-22T08:18:59.214545Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" language | \n",
" author | \n",
" license | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Python basics | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2021-10-28 | \n",
"
\n",
" \n",
" | 1 | \n",
" Jupyter Tutorial | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2019-06-27 | \n",
"
\n",
" \n",
" | 2 | \n",
" Jupyter Tutorial | \n",
" de | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2020-10-26 | \n",
"
\n",
" \n",
" | 3 | \n",
" PyViz Tutorial | \n",
" en | \n",
" Veit Schiele | \n",
" BSD-3-Clause | \n",
" 2020-04-13 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title language author license date\n",
"0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n",
"1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n",
"2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n",
"3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(books)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.13 Kernel",
"language": "python",
"name": "python313"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}