Git for binary files

git diff can be configured so that it can also display meaningful diffs for binary files.

… for Excel files

For this we need openpyxl and pandas:

$ uv add openpyxl pandas

Then we can use pandas.DataFrame.to_csv in exceltocsv.py to convert the Excel files:

exceltocsv.py
# SPDX-FileCopyrightText: 2023 cusy GmbH
#
# SPDX-License-Identifier: BSD-3-Clause

import sys
from io import StringIO

import pandas as pd

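# Convert every sheet of the workbook passed as first argument to CSV
for sheet_name in pd.ExcelFile(sys.argv[1]).sheet_names:
    output = StringIO()
    # The sheet name serves as a heading in the textual diff
    print(f"Sheet: {sheet_name}")
    pd.read_excel(sys.argv[1], sheet_name=sheet_name).to_csv(
        output, header=True, index=False
    )
    print(output.getvalue())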

Now add the following section to your global Git configuration ~/.config/git/config:

[diff "excel"]
    textconv=python3 /PATH/TO/exceltocsv.py
    binary=true

Finally, in the global ~/.config/git/attributes file, our Excel converter is linked to *.xlsx files:

*.xlsx diff=excel
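To check that the attribute is actually picked up, git check-attr reports which diff driver Git assigns to a path. The following is a minimal sketch, assuming it is run inside a Git repository; the helper names parse_check_attr and diff_driver and the file name report.xlsx are illustrative and not part of the setup above.

```python
import subprocess


def parse_check_attr(line: str) -> str:
    """Extract the value from a '<path>: <attribute>: <value>' line."""
    # Split from the right so that a path containing ': ' does not interfere
    return line.rstrip("\n").rsplit(": ", 2)[-1]


def diff_driver(path: str) -> str:
    """Ask Git which diff driver is assigned to path (hypothetical helper)."""
    result = subprocess.run(
        ["git", "check-attr", "diff", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_check_attr(result.stdout)
```

With the configuration above in place, diff_driver("report.xlsx") should return excel; the value unspecified indicates that no pattern in the attributes file matched the path.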

… for PDF files

For this, pdftohtml is additionally required. It can be installed with

$ sudo apt install poppler-utils
$ brew install pdftohtml

Add the following section to the global Git configuration ~/.config/git/config:

[diff "pdf"]
    textconv=pdftohtml -stdout

Finally, in the global ~/.config/git/attributes file, our PDF converter is linked to *.pdf files:

*.pdf diff=pdf

Now, when git diff is called, the PDF files are first converted and then a diff is performed over the outputs of the converter.
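The convert-then-diff principle can be illustrated without Git. The following is only a sketch of the idea, not of Git's implementation: textconv_diff is a hypothetical helper, and fake_converter is a stand-in that merely imitates a real converter such as pdftohtml.

```python
import difflib
from typing import Callable


def textconv_diff(
    old: bytes, new: bytes, textconv: Callable[[bytes], str], name: str
) -> str:
    """Convert both versions to text, then return a unified diff of the text."""
    old_lines = textconv(old).splitlines(keepends=True)
    new_lines = textconv(new).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, f"a/{name}", f"b/{name}")
    )


def fake_converter(data: bytes) -> str:
    """Stand-in for a real converter (illustration only)."""
    return data.decode("latin-1").replace(";", "\n") + "\n"


diff = textconv_diff(b"title;page 1", b"title;page 2", fake_converter, "demo.pdf")
print(diff)
```

The diff is computed on the converter output only, so it is meant for reading; the binary files themselves remain unchanged.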

… for documents

Differences in documents can also be displayed with Pandoc, which can be installed with

$ sudo apt install pandoc
$ brew install pandoc

On Windows, download and install the *.msi file from GitHub.

Then add the following section to your global Git configuration ~/.config/git/config:

[diff "pandoc-to-markdown"]
    textconv = pandoc --to markdown
    cachetextconv = true

Finally, in the global ~/.config/git/attributes file, our pandoc-to-markdown converter is linked to *.docx, *.odt and *.rtf files:

*.docx diff=pandoc-to-markdown
*.odt diff=pandoc-to-markdown
*.rtf diff=pandoc-to-markdown

Tip

Jupyter notebooks are saved as JSON files (*.ipynb), which are quite dense and difficult to read, especially in diffs. Pandoc's Markdown representation simplifies this:

*.ipynb diff=pandoc-to-markdown

The same procedure can be used to obtain useful diffs from other binaries, for example from *.zip, *.jar and other archives with unzip, or for changes in the meta information of images with exiv2. There are also conversion tools for converting *.odt, *.doc and other document formats into plain text. For binary files for which there is no converter, the strings command is often sufficient.
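A converter for archives can also be written in Python instead of calling unzip. The following is a minimal sketch using only the standard library; the script name ziptotext.py and the function describe_archive are assumptions for illustration. As with exceltocsv.py, Git passes the path of the file to convert as the only argument and reads the text from standard output.

```python
import sys
import zipfile


def describe_archive(path: str) -> str:
    """Return one line per archive member: CRC-32, size and name."""
    with zipfile.ZipFile(path) as archive:
        # Sort by name so that a changed member order does not create noise
        members = sorted(archive.infolist(), key=lambda info: info.filename)
    return "".join(
        f"{info.CRC:08x} {info.file_size:>10} {info.filename}\n"
        for info in members
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Write the listing to stdout, where Git picks it up for the diff
    sys.stdout.write(describe_archive(sys.argv[1]))
```

Such a script could be registered like the Excel converter: a diff section, for example named zip, whose textconv entry points to python3 /PATH/TO/ziptotext.py, and an attributes line such as *.zip diff=zip. A changed member then shows up as a line with a different checksum or size.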

… for media files

ExifTool can be used to convert the metadata of media files to text.

$ sudo apt install libimage-exiftool-perl
$ brew install exiftool
> choco install exiftool

You can then add the following section to the global Git configuration file ~/.config/git/config:

[diff "exiftool"]
    textconv = exiftool --composite -x 'Exiftool:*'
    cachetextconv = true
    xfuncname = "^-.*$"

Finally, in ~/.config/git/attributes the exiftool converter is linked to the file extensions of media files:

*.avif diff=exiftool
*.bmp diff=exiftool
*.gif diff=exiftool
*.jpeg diff=exiftool
*.jpg diff=exiftool
*.png diff=exiftool
*.webp diff=exiftool

See also

exiftool can process many more media file types. You can find a complete list in Supported File Types.