String comparisons
In this notebook we use fuzzywuzzy, a popular library for string comparisons. It is based on the built-in Python library difflib. For more information on the various methods available and how they differ, see the blog post FuzzyWuzzy: Fuzzy String Matching in Python.
1. Installation
With Spack you can make fuzzywuzzy and the optional python-levenshtein library available in your kernel:
$ spack env activate python-311
$ spack install py-fuzzywuzzy+speedup
Alternatively, you can install the two libraries with other package managers, for example with uv:
$ uv add "fuzzywuzzy[speedup]"
2. Import
[1]:
from fuzzywuzzy import fuzz, process
3. Example
[2]:
berlin = ["Berlin, Germany", "Berlin, Deutschland", "Berlin", "Berlin, DE"]
String similarity
[3]:
fuzz.ratio(berlin[0], berlin[1])
[3]:
65
[4]:
fuzz.ratio(berlin[0], berlin[2])
[4]:
57
[5]:
fuzz.ratio(berlin[0], berlin[3])
[5]:
64
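Because fuzzywuzzy builds on difflib, the simple ratio can be sketched with the standard library alone. The following is a minimal illustration, not fuzzywuzzy's own code; `simple_ratio` is a hypothetical helper name, and results may differ slightly when python-levenshtein is installed:

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Return a 0-100 similarity score (hypothetical helper, not fuzzywuzzy API)."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T the total number of characters in both strings.
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


berlin = ["Berlin, Germany", "Berlin, Deutschland", "Berlin", "Berlin, DE"]
scores = [simple_ratio(berlin[0], other) for other in berlin[1:]]
print(scores)
```

Note that the comparison is case-sensitive and does not strip punctuation, so "Berlin, DE" only matches on the shared prefix "Berlin, ".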
Partial string similarity
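Inconsistent substrings are a common problem, for example when one string is only a fragment of the other. To get around this, fuzzywuzzy uses a heuristic called best partial: the shorter string is compared with substrings of the longer string that have the same length, and the best score is returned.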
[6]:
fuzz.partial_ratio(berlin[0], berlin[1])
[6]:
60
[7]:
fuzz.partial_ratio(berlin[0], berlin[2])
[7]:
100
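The best-partial idea can be sketched as a brute-force sliding window. This is an illustration under assumptions, not fuzzywuzzy's implementation, which selects candidate windows from difflib's matching blocks instead of trying every position; `partial_ratio_sketch` is a hypothetical name:

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2):
    """Best score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    m = len(shorter)
    best = 0.0
    # Slide a window of length m over the longer string (brute force).
    for start in range(len(longer) - m + 1):
        window = longer[start:start + m]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# "Berlin" occurs verbatim in the longer string, so one window matches exactly.
print(partial_ratio_sketch("Berlin, Germany", "Berlin"))  # 100
```

Trying every window costs more than fuzzywuzzy's approach on long strings, but it makes the heuristic easy to follow.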
Token sorting
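In token sorting, each string is split into tokens, the tokens are sorted alphabetically and then rejoined into a string before the comparison, so that word order no longer matters; this is what fuzz.token_sort_ratio does. The examples here use the related fuzz.token_set_ratio, which additionally separates the tokens into the intersection and the remainders of both strings, so that a string whose tokens all occur in the other one scores 100: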
[8]:
fuzz.token_set_ratio(berlin[0], berlin[1])
[8]:
62
[9]:
fuzz.token_set_ratio(berlin[0], berlin[2])
[9]:
100
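The difference between token sorting and token sets can be sketched with the standard library. The helpers below (`tokens`, `ratio`, `token_sort_sketch`, `token_set_sketch`) are hypothetical names, and the tokenisation (lower-casing, splitting on non-word characters) is an assumption that only approximates fuzzywuzzy's preprocessing:

```python
import re
from difflib import SequenceMatcher


def tokens(s):
    # Lower-case the string and split it on anything that is not a word character.
    return re.findall(r"\w+", s.lower())


def ratio(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_sketch(s1, s2):
    # Sort the tokens alphabetically and rejoin them, so word order is ignored.
    return ratio(" ".join(sorted(tokens(s1))), " ".join(sorted(tokens(s2))))


def token_set_sketch(s1, s2):
    t1, t2 = set(tokens(s1)), set(tokens(s2))
    common = " ".join(sorted(t1 & t2))
    # Shared tokens first, followed by the tokens unique to each string.
    rest1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    rest2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(ratio(common, rest1), ratio(common, rest2), ratio(rest1, rest2))


print(token_sort_sketch("Berlin, Germany", "Germany Berlin"))  # 100
print(token_set_sketch("Berlin, Germany", "Berlin"))  # 100
```

In the second call, every token of "Berlin" also occurs in "Berlin, Germany", so the comparison of the shared tokens with themselves yields 100. Token sorting alone would score this pair lower, because the extra token "germany" still counts against it.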
Further information
[10]:
fuzz.ratio?
Signature: fuzz.ratio(s1, s2)
Docstring: <no docstring>
File: ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/fuzzywuzzy/fuzz.py
Type: function
Extract from a list
[11]:
choices = [
"Germany",
"Deutschland",
"France",
"United Kingdom",
"Great Britain",
"United States",
]
[12]:
process.extract("DE", choices, limit=2)
[12]:
[('Deutschland', 90), ('Germany', 45)]
[13]:
process.extract("Vereinigtes Königreich", choices)
[13]:
[('United Kingdom', 51),
('United States', 41),
('Germany', 39),
('Great Britain', 35),
('Deutschland', 31)]
[14]:
process.extractOne("frankreich", choices)
[14]:
('France', 62)
[15]:
process.extractOne("U.S.", choices)
[15]:
('United States', 86)
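Conceptually, extracting from a list means scoring every choice against the query and returning the best-scoring pairs. The following sketch shows that idea with a simple case-insensitive difflib ratio. `rank_choices` is a hypothetical helper, and because fuzzywuzzy's default scorer for process.extract (fuzz.WRatio) is more elaborate, the scores will generally differ from the cells above:

```python
from difflib import SequenceMatcher


def rank_choices(query, choices, limit=5):
    """Score every choice against the query and return the best (choice, score) pairs."""

    def score(choice):
        # Case-insensitive simple ratio; an assumption made for this sketch only.
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        return round(100 * matcher.ratio())

    scored = [(choice, score(choice)) for choice in choices]
    # Highest score first; sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = [
    "Germany",
    "Deutschland",
    "France",
    "United Kingdom",
    "Great Britain",
    "United States",
]
print(rank_choices("frankreich", choices, limit=1))
```

A plain character ratio handles near-spellings such as "frankreich" reasonably well, but abbreviations like "U.S." benefit from the partial and token-based scorers that fuzzywuzzy combines.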
Known ports
FuzzyWuzzy has also been ported to other languages. Here are some known ports:
Java: xpresso
Java: xdrop fuzzywuzzy
Rust: fuzzyrusty
JavaScript: fuzzball.js
C++: tmplt fuzzywuzzy
C#: FuzzySharp
Go: go-fuzzywuzzy
Pascal: FuzzyWuzzy.pas
Kotlin: FuzzyWuzzy-Kotlin
R: fuzzywuzzyR