{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BeautifulSoup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:21:53.511105Z", "iopub.status.busy": "2026-05-22T08:21:53.510952Z", "iopub.status.idle": "2026-05-22T08:21:53.785603Z", "shell.execute_reply": "2026-05-22T08:21:53.785067Z", "shell.execute_reply.started": "2026-05-22T08:21:53.511088Z" } }, "outputs": [], "source": [ "import httpx\n", "\n", "\n", "url = \"https://docs.python.org/3/library/xml.dom.html\"\n", "r = httpx.get(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Install:\n", "\n", "   With [Spack](../../../productive/envs/spack/index.rst) you can make BeautifulSoup available in your kernel:\n", "\n", "   ``` bash\n", "   $ spack env activate python-38\n", "   $ spack install py-beautifulsoup4@4.10.0%gcc@11.2.0~html5lib~lxml\n", "   ```\n", "\n", "   Alternatively, you can install BeautifulSoup with other package managers, for example:\n", "\n", "   ``` bash\n", "   $ uv add beautifulsoup4\n", "   ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. With `r.content` we can output the HTML of the page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Next, we need to parse this content with BeautifulSoup into a Python representation of the page:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:21:53.786290Z", "iopub.status.busy": "2026-05-22T08:21:53.786197Z", "iopub.status.idle": "2026-05-22T08:21:53.863567Z", "shell.execute_reply": "2026-05-22T08:21:53.863166Z", "shell.execute_reply.started": "2026-05-22T08:21:53.786282Z" } }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "\n", "soup = BeautifulSoup(r.content, \"html.parser\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 
To structure the code, we create a new function `get_dom` (Document Object Model) that wraps all of the preceding code:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:21:53.864063Z", "iopub.status.busy": "2026-05-22T08:21:53.863954Z", "iopub.status.idle": "2026-05-22T08:21:53.866091Z", "shell.execute_reply": "2026-05-22T08:21:53.865822Z", "shell.execute_reply.started": "2026-05-22T08:21:53.864054Z" } }, "outputs": [], "source": [ "def get_dom(url):\n", "    r = httpx.get(url)\n", "    # Raise an exception for 4xx/5xx responses instead of parsing an error page\n", "    r.raise_for_status()\n", "    return BeautifulSoup(r.content, \"html.parser\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual elements can be filtered out with CSS selectors, for example. You can determine these on a website by, for example, right-clicking one of the table cells in the first column of the table in Firefox. In the *Inspector* that opens, you can right-click the element again and choose *Copy → CSS Selector*. The clipboard then contains, for example,\n", "`table.wikitable:nth-child(13) > tbody:nth-child(2) > tr:nth-child(1)`. We now clean up this *CSS selector*, because we want to restrict the match neither to a `table.wikitable` that is the 13th child element of its parent nor to a `tbody` that is the 2nd child element, but only to the first column within `tbody`.\n", "\n", "Finally, with `limit=4`, we display only the first four results in this notebook as an example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:21:53.866456Z", "iopub.status.busy": "2026-05-22T08:21:53.866366Z", "iopub.status.idle": "2026-05-22T08:21:53.870939Z", "shell.execute_reply": "2026-05-22T08:21:53.870591Z", "shell.execute_reply.started": "2026-05-22T08:21:53.866447Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DOMImplementation, Node, NodeList, DocumentType]\n" ] } ], "source": [ "# Match elements with class `pre` inside `code` elements in the table\n", "# of the section with the id `objects-in-the-dom`\n", "links = soup.select(\n", "    \"#objects-in-the-dom > table > tbody > tr > td > p > code .pre\",\n", "    limit=4,\n", ")\n", "\n", "print(links)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we do not want the entire HTML element, only its text content:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:21:53.871393Z", "iopub.status.busy": "2026-05-22T08:21:53.871321Z", "iopub.status.idle": "2026-05-22T08:21:53.873608Z", "shell.execute_reply": "2026-05-22T08:21:53.873364Z", "shell.execute_reply.started": "2026-05-22T08:21:53.871385Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DOMImplementation\n", "Node\n", "NodeList\n", "DocumentType\n" ] } ], "source": [ "for content in links:\n", "    print(content.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }