XML/HTML examples

HTML

Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats, for example lxml, Beautiful Soup and html5lib. lxml is generally much faster, whereas the other libraries cope better with malformed HTML or XML files.

pandas has a built-in function, read_html, which uses lxml, html5lib and Beautiful Soup to parse the tables in an HTML file automatically into DataFrame objects. These libraries are not installed together with pandas and have to be added separately. With Spack you can provide lxml, Beautiful Soup and html5lib in your kernel:

$ spack env activate python-311
$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib

Alternatively, you can install these libraries with other package managers, for example

$ uv add lxml beautifulsoup4 html5lib

To show how this works, I use an HTML page from the Python documentation that describes the xml.dom module.

[1]:
import pandas as pd


tables = pd.read_html(
    "https://docs.python.org/3/library/xml.dom.html",
)

The pandas.read_html function has a number of options, but by default it looks for and tries to parse all table data contained in <table> tags. The result is a list of DataFrame objects:

[2]:
len(tables)
[2]:
3
[3]:
xml_idl = tables[2]

xml_idl.head()
[3]:
       IDL Type   Python Type
0       boolean   bool or int
1           int           int
2      long int           int
3  unsigned int           int
4     DOMString  str or bytes
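
If only one table of a page is of interest, the match option of read_html can narrow the search. The following is a minimal sketch that works on a small, made-up HTML string instead of the live page, so both tables and their contents are assumptions for illustration only. Like the example above, it needs lxml or Beautiful Soup with html5lib to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; only the first one mentions "IDL Type".
html = """
<table>
  <tr><th>IDL Type</th><th>Python Type</th></tr>
  <tr><td>boolean</td><td>bool or int</td></tr>
  <tr><td>int</td><td>int</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>example</td><td>1</td></tr>
</table>
"""

# Without options, every <table> element is parsed.
all_tables = pd.read_html(StringIO(html))

# match keeps only the tables whose text matches the string or regex.
matched = pd.read_html(StringIO(html), match="IDL Type")
idl_types = matched[0]
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions emit when literal HTML is passed directly.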

From here we can do some data cleansing and analysis, such as counting how often each Python type occurs:

[4]:
xml_idl["Python Type"].value_counts()
[4]:
Python Type
int             3
bool or int     1
str or bytes    1
Name: count, dtype: int64
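
One possible cleansing step is to split combined entries such as bool or int, so that each Python type is counted on its own. The sketch below rebuilds the five rows shown by xml_idl.head() by hand, so it runs without network access. The names single_types and type_counts are chosen for illustration only.

```python
import pandas as pd

# The five rows shown by xml_idl.head(), rebuilt by hand.
xml_idl = pd.DataFrame(
    {
        "IDL Type": ["boolean", "int", "long int", "unsigned int", "DOMString"],
        "Python Type": ["bool or int", "int", "int", "int", "str or bytes"],
    }
)

# Split "bool or int" into ["bool", "int"], then give each type its own row.
single_types = xml_idl["Python Type"].str.split(" or ").explode()

# Count how often each individual type occurs.
type_counts = single_types.value_counts()
```

With this approach, int is counted four times, because the bool or int row now contributes to both bool and int.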

XML

pandas has a function read_xml, which makes reading XML files very easy:

[5]:
pd.read_xml("books.xml")
[5]:
   id             title  language        author       license        date
0   1     Python basics        en  Veit Schiele  BSD-3-Clause  2021-10-28
1   2  Jupyter Tutorial        en  Veit Schiele  BSD-3-Clause  2019-06-27
2   3  Jupyter Tutorial        de  Veit Schiele  BSD-3-Clause  2020-10-26
3   4    PyViz Tutorial        en  Veit Schiele  BSD-3-Clause  2020-04-13
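
The file books.xml itself is not shown here. The following sketch therefore assumes a layout in which every book element carries an id attribute and holds its fields as child elements. The element name catalog and the shortened records are assumptions for illustration. By default, read_xml turns both the attributes and the child elements of each row element into columns.

```python
from io import StringIO

import pandas as pd

# Assumed layout of books.xml, shortened to two records and two fields.
xml = """<catalog>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
  </book>
  <book id="2">
    <title>Jupyter Tutorial</title>
    <language>en</language>
  </book>
</catalog>"""

# parser="etree" uses the standard library instead of lxml, the default.
df = pd.read_xml(StringIO(xml), parser="etree")
```

The id attribute ends up as the first column, followed by one column per child element.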

lxml

Alternatively, the XML file can first be parsed with lxml.objectify. getroot then returns a reference to the root node of the XML file:

[6]:
from pathlib import Path

from lxml import objectify


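# Open the file in binary mode so that lxml can detect the encoding itself
with Path("books.xml").open("rb") as xml_file:
    parsed = objectify.parse(xml_file)
root = parsed.getroot()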
[7]:
books = []

for element in root.book:
    data = {}
    # iterchildren() replaces the deprecated getchildren()
    for child in element.iterchildren():
        data[child.tag] = child.pyval
    books.append(data)
[8]:
pd.DataFrame(books)
[8]:
              title  language        author       license        date
0     Python basics        en  Veit Schiele  BSD-3-Clause  2021-10-28
1  Jupyter Tutorial        en  Veit Schiele  BSD-3-Clause  2019-06-27
2  Jupyter Tutorial        de  Veit Schiele  BSD-3-Clause  2020-10-26
3    PyViz Tutorial        en  Veit Schiele  BSD-3-Clause  2020-04-13
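
Unlike the read_xml result, this DataFrame has no id column, because the loop only visits child elements and not attributes. A minimal sketch of how the attributes could be included is shown below. It again assumes that id is an attribute of each book element, and the inline XML is a shortened, assumed stand-in for books.xml.

```python
import pandas as pd
from lxml import objectify

# Assumed layout of books.xml, shortened to two records and two fields.
xml = b"""<catalog>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
  </book>
  <book id="2">
    <title>Jupyter Tutorial</title>
    <language>en</language>
  </book>
</catalog>"""

root = objectify.fromstring(xml)

books = []
for element in root.book:
    # Start with the attributes, for example id="1"; their values are strings.
    data = dict(element.attrib)
    for child in element.iterchildren():
        data[child.tag] = child.pyval
    books.append(data)

df = pd.DataFrame(books)
```

Attribute values are not converted by lxml.objectify, so the id column contains strings here and may need a call such as astype(int) afterwards.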