Deduplicate data¶
1. Load sample data¶
[1]:
import pandas as pd
[2]:
customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)
2. Deduplication with pandas¶
2.1 Overview¶
[3]:
customers
[3]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain |
| 2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael |
| 3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz |
| 4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2075 | Maurice Stey | Systems developer | Linke Margraf GmbH & Co. OHG | Laila-Scheibe-Allee 2/0 | Luckenwalde | Hamburg | gutknechtevelyn@niemeier.com | dkreusel |
| 2076 | Linda Alexander | Commrcil horiculuri | Webb, Ballald and Vasquel | 5594 Persn Ciff | Mooneybury | Maryland | ahleythoa@ail.co | kennethrchn |
| 2077 | Diane Bailly | Pharmacien | Voisin | 527, rue Dijoux | Duval-les-Bains | CH | aruiz@reynaud.fr | dorothee41 |
| 2078 | Jorge Riba Cerdán | Hotel manager | Amador-Diego | Rambla de Adriana Barceló 854 Puerta 3 | Huesca | Asturias | manuelamosquera@yahoo.com | eugenia17 |
| 2079 | Ryan Thompson | Brewing technologist | Smith-Sullivan | 136 Rodriguez Point | Bradfordborough | North Dakota | lcruz@gmail.com | cnewton |
2080 rows × 8 columns
2.2 Display data types¶
For this we use pandas.DataFrame.dtypes:
[4]:
customers.dtypes
[4]:
name object
job object
company object
street_address object
city object
state object
email object
user_name object
dtype: object
2.3 Determining missing values¶
pandas.isna (alias: pandas.isnull) shows whether values are missing in an array-like object. The following values count as missing:
NaN in numeric arrays
None or NaN in object arrays
NaT in datetimelike arrays
See also:
notna for the Boolean inverse of pandas.isna
Series.isna for the missing values in a series
DataFrame.isna for the missing values in a DataFrame
Index.isna for the missing values in an index
[5]:
for col in customers.columns:
    print(col, customers[col].isna().sum())
name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0
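The same counts can be obtained without an explicit loop, because DataFrame.isna returns a Boolean frame whose column sums are the numbers of missing values. The following is a minimal sketch on a small made-up frame; the `sample` data is hypothetical and not part of the customer file:

```python
import pandas as pd

# Hypothetical toy data standing in for the customers DataFrame
sample = pd.DataFrame(
    {
        "name": ["Ada", None, "Grace"],
        "city": ["London", "Paris", None],
        "state": ["A", "B", "C"],
    }
)

# isna() marks missing cells; sum() counts the True values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
```

Applied to the tutorial data, `customers.isna().sum()` should give the same per-column counts as the loop.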
2.4 Determine duplicated data records¶
[6]:
customers.duplicated()
[6]:
0 False
1 False
2 False
3 False
4 False
...
2075 False
2076 False
2077 False
2078 False
2079 False
Length: 2080, dtype: bool
Because the output is truncated, customers.duplicated() does not yet tell us whether there are duplicate data records. In the following, we therefore display only those data records for which True is returned:
[7]:
customers[customers.duplicated()]
[7]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
The result is empty, so there are no data records that are identical in every column.
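Instead of inspecting the filtered frame, the Boolean mask can be reduced to a single answer. The sketch below uses a hypothetical toy frame (`sample` is made up for illustration) in which one row is an exact copy of another:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical in every column
sample = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Ada"],
        "user_name": ["ada1", "grace7", "ada1"],
    }
)

# True for the second and later copies only (default keep="first")
mask = sample.duplicated()

has_duplicates = bool(mask.any())  # is there at least one duplicate?
n_duplicates = int(mask.sum())     # how many rows would be dropped?
print(has_duplicates, n_duplicates)
```

For the customer data, `customers.duplicated().any()` should accordingly return False.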
2.5 Deleting duplicated data¶
Deleting duplicate data records with drop_duplicates should therefore not change anything and leave the number of data records at 2080:
[8]:
customers.drop_duplicates()
[8]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain |
| 2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael |
| 3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz |
| 4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2075 | Maurice Stey | Systems developer | Linke Margraf GmbH & Co. OHG | Laila-Scheibe-Allee 2/0 | Luckenwalde | Hamburg | gutknechtevelyn@niemeier.com | dkreusel |
| 2076 | Linda Alexander | Commrcil horiculuri | Webb, Ballald and Vasquel | 5594 Persn Ciff | Mooneybury | Maryland | ahleythoa@ail.co | kennethrchn |
| 2077 | Diane Bailly | Pharmacien | Voisin | 527, rue Dijoux | Duval-les-Bains | CH | aruiz@reynaud.fr | dorothee41 |
| 2078 | Jorge Riba Cerdán | Hotel manager | Amador-Diego | Rambla de Adriana Barceló 854 Puerta 3 | Huesca | Asturias | manuelamosquera@yahoo.com | eugenia17 |
| 2079 | Ryan Thompson | Brewing technologist | Smith-Sullivan | 136 Rodriguez Point | Bradfordborough | North Dakota | lcruz@gmail.com | cnewton |
2080 rows × 8 columns
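Now we want to display those data records whose user_name has already occurred in an earlier record. By default, duplicated marks only the second and later occurrences, not the first one: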
[9]:
customers[customers.duplicated(["user_name"])]
[9]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 337 | Aysel Binner | Reccig officer | Kuhl Kalleww Swifwunw & Co. KGaA | Batix-Kanz-Staß 5/4 | Fulda | Berli | frncoise@wgnerco | christinefinke |
| 377 | Jolanta Rogge | Accommodation managr | Scholl e.V. | Lrchplz 4/6 | Mettmnn | Thüringen | inrharff@yah.d | walentinabeier |
| 506 | Mrs. Frances Peters | Fuiue desie | Rsgers, Lawrence and Richards | Studio \nCarpntr kys | Wes Simn | BO | halenewilliams@wilson-sandes.og | amy17 |
| 545 | Gerhart Krebs MBA. | Surgeon | Roskoth | Kühnertweg 863 | Stade | Bayern | olav44@bolander.de | bettyhahn |
| 592 | Folkert Gnatz | Meteorologist | Bolnbach | Heinfried-Austermühle-Ring 05 | Eilenburg | Thüringen | jaentschbirgitt@boerner.org | francesco44 |
| 633 | Manon Jacquot | Ingénieur en aéronautique | Jacob | 8, chemin Éléonore Evrard | Marechal-les-Bains | AR | ilemaitre@voila.fr | astrid58 |
| 658 | Austin Waller | Insurance risk surveyor | Sexton Group | 11097 Hansen Field | Davidmouth | Texas | christina74@doyle-baker.biz | olynn |
| 723 | Wanda Moran | Solicitor, Scotland | Estes PLC | 08011 Hernandez Streets Apt. 149 | Natalieshire | Oregon | howardreginald@gmail.com | dana91 |
| 762 | Charles Russell | Scientist, research (physical sciences) | Preston-Wilson | 6709 Ashley Circle Apt. 309 | Danielberg | South Dakota | nancyescobar@brown.net | ruben71 |
| 772 | Waltrud Wohlgemut | Designer, fashion/clothing | Nerger AG | Elmar-Ullmann-Allee 6 | Schlüchtern | Rheinland-Pfalz | auch-schlauchindietlind@gmx.de | zitakuhl |
| 783 | Caroline Mata | Engineer, elecrical | Grimes Grrur | 80157 Whte Alley Sute 79 | Soh Mark | Iw | jared52@aoo.com | thomasthompson |
| 889 | Ricardo Ripoll Lucena | Teevisi camera peratr | Luzq Estraqa anq Galinqq | Caejón Rosario Viapana 16 | Palencia | Lgo | ev0@oo.com | colomerenrique |
| 928 | Sophie Letellier du Carpentier | Cnucteu e ét | Valle7 SARL | 3, boulvard Jan Augr | Saint Daviddan | BS | rdorm@dbmi.com | anne28 |
| 979 | Irene Roda Dávila | Eitor, maazine featres | Daza Inc | Roda Carla Miró 5 | Viy | La Rioa | sldrpére@ps.cm | ipeñalver |
| 995 | Abigail Hernandez | Mechanical engineer | Smith Ltd | 766 Adrian Ranch | Ellismouth | Colorado | jordan60@gmail.com | mendozajody |
| 1015 | Mr. Paul Newton | Government soa researh offer | LemnardmWatsmn | Studi 86\nKaty ill | West Jue | VE | em@mil.cm | bbennett |
| 1043 | Anna Adams | Programmer alcatons | Jones Gjoup | 22 Kateen ova | Noth Joa | KZ | asleig65@aisay.co | lloydann |
| 1052 | Aurélie Vidal | Magistrat | Martins | 88, rue Stéphanie Letellier | Rouxnec | SE | boutineric@blin.fr | iwagner |
| 1062 | Regina Schacht-Kusch | Herbalist | Hartung GmbH & Co. KGaA | Wenke-Hörle-Ring 36 | Eggenfelden | Sachsen-Anhalt | oluebs@troest.de | xklotz |
| 1120 | Jeffrey Benjamin | Publ house manager | Chcn Inc | 27 Rodgrs Rdgs Apt. 269 | Suth Jeffererg | Iinois | stepanie90@rogers.co | lori67 |
| 1170 | Julio Agustín Amaya | Tax aviser | Piñolk Belmonke and Codina | Calleón de Gregorio Bustamante 28 Piso 7 | La Pala | Salamanca | usolana@jáuregui-pedraza.om | gloriaolmo |
| 1339 | Ing. Andrew Schleich B.A. | Ln | Holt Putz GnR | Hugasse 8/8 | Hainichn | Neersachsen | jun@putz.com | jesselmaja |
| 1360 | Frédérique Lejeune-Daniel | Tecce cse | Sctmitt | chemin Denise Ferrand | Saint ChalotteVille | IE | jchretien@costacom | joseph60 |
| 1384 | Kenneth Moore | Magazine journalist | Cross, Bfll anf Diaz | 753 Lindsey Pine | Thompsonshe | Colorao | ashey28@rice.co | todd72 |
| 1423 | Thomas Coulon | Collecteur de fonds | Levy | 91, rue Laetitia Collet | Dias-sur-Normand | SC | deschampsgabriel@guyot.fr | michelepetit |
| 1433 | Jerry Barnes | Tour mner | Col-Wllllams | 30 Mpy Ovepass | Jeiferview | Utah | insnashl@gas-hais.cm | christopher62 |
| 1452 | Karen Weeks | Psychotherapist, child | Rodriguez, Brady and Jackson | 233 Kevin Street | Larryside | Indiana | gregg39@hernandez-gomez.com | knapprobert |
| 1489 | Herr Johann Eigenwillig | Immigration officer | Süßebier Hänel GmbH | Langernplatz 0 | Stadtsteinach | Thüringen | haasemarieluise@noack.com | istoll |
| 1544 | Pasquale Schwital | Trade mark attorney | Finke | Detlef-Binner-Platz 0/1 | Burg | Niedersachsen | hanne-lore98@gmx.de | thomas14 |
| 1557 | Stephanie Young | Herpetologist | Bryant and Sons | 5163 Rebecca Creek Suite 421 | North Theresaberg | Alaska | stephenwilliams@summers.com | ahawkins |
| 1567 | Carolina Reguera Sanz | Fam manae | Cami77, C7aparr7 a7d N7gu7ra | Vil e Imel Oorio 25 | Madd | Vicaya | mordóñ@cámara.info | eva16 |
| 1616 | Sonia Amores | Senir tax prfessina/tax inspectr | J5an-Núñez | Avnida d Grgorio Manón 344 Prta 8 | Ponevedr | Lugo | icent4@montenero-brroso.info | sanmartínguillermo |
| 1647 | Juan Carlos Iker Boix Ros | Pre phtgrapher | Pont, P44om4r4s 4nd Arjon4 | Pasadzo de Josep Bentez Pso | Las Palmas | Mia | srgio24@gail.co | luis-miguel23 |
| 1652 | Jörg Henschel | Chaity office | Schicke AG | HennyLorchRng 484 | Hohensein-Ensh | BadenWürtteberg | huerhes@hmal.de | anne-katrin51 |
| 1703 | Marc Tate | Ship broker | Wagner, Mitchell and Grimes | 721 Christopher View Suite 840 | Watsonmouth | Connecticut | chenjessica@hotmail.com | patricia34 |
| 1707 | Joseph Hines | Pyhiatri nre | Cr4ig, G4rci4 4nd Rich4rds | 85663 Savage Gles | Mcgeeon | Als | bcaldern@htmail.cm | emilytorres |
| 1722 | Julie Baldwin | Set deigner | W5ll55mson-G5rz5 | 58513 Paricia Res Suie 45 | So Me | Alaska | diuez@uess. | cmoss |
| 1759 | Sarah Hoffman | Exhibitin designe | Hensont Wiley and Ryan | 9490 Curts Spur Sute 82 | Jseptwn | Arizona | ncole@yahoo.com | csmith |
| 1796 | Valentine Devaux-Roger | Direceur d'ôial | Leiris | 57, enue de Gros | BenadBou | AL | rogrlro@munoz.om | xherve |
| 1809 | Slavica Seidel | Psychotherapist, child | Wulff Hande KG | Preißgasse 0/4 | Soest | Rheinland-Pfalz | tloos@krause.net | abien |
| 1820 | Wenke Schweitzer | Enginr, automoti | Wesa4k KG | Eies. 7 | Ba Lnwra | Thürige | rsthveriue@mies.rg | kwernecke |
| 1829 | Dr. Thomas Hein | Copy | Geisel | Ladeckgasse 11 | Rockenhausen | Nordrhein-Westfalen | grein-grotharnim@kallert.de | siegmar08 |
| 1837 | Andrew Hart | Engineer, civil (contracting) | Barnett LLC | 258 Day Hollow Suite 410 | Kimberlyhaven | Colorado | brandy00@yahoo.com | amy30 |
| 1914 | Shelby Fowler | Air traffic controller | Fields-Sanchez | 533 Fitzpatrick Bypass | Francesberg | Michigan | terrystephen@anderson.org | gcain |
| 1938 | Susan Aubry | Directeur d'agence bancaire | Payet Georges S.A.S. | 67, rue Inès Valentin | Nicolas | FI | milletedith@sfr.fr | tthierry |
| 1948 | Richard Karge-Kobelt | Junalist maaine | Abberb Keubeb AG | Mitschkeee 8 | Mß | SachsnAnhalt | nrejwgner@gmx.e | muehlehenni |
| 1960 | Anna de Lobato | Medcl techcl ocer | Maciag PLC | Calleón de Dolore Parea 21 At 7 | Palncia | Cantaria | vázqzlornzo@al.om | daniel70 |
| 1968 | Zoltan Wähner B.A. | Professor Emerits | Th8e8 | Stotr. 1 | Saulgau | Shlsg-Holst | arlenpruschke@salz.or | kklemm |
| 1995 | Kenneth Dunn | Programmer, systems | Leonard Inc | 5361 Patterson Mission Suite 504 | Villaburgh | Rhode Island | kristen54@gmail.com | jkent |
| 2010 | Gertraude Schomber | Insurance risk surveyor | Bruder | Christa-Ullrich-Allee 0/1 | Schwäbisch Hall | Hessen | gumprichalice@schmidt.de | fruppert |
| 2075 | Maurice Stey | Systems developer | Linke Margraf GmbH & Co. OHG | Laila-Scheibe-Allee 2/0 | Luckenwalde | Hamburg | gutknechtevelyn@niemeier.com | dkreusel |
Now we can display the associated data records, for example with:
[10]:
customers[customers["user_name"] == "christinefinke"]
[10]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 236 | Aysel Binner | Recycling officer | Kuhl Kallert Stiftung & Co. KGaA | Beatrix-Kranz-Straße 5/4 | Fulda | Berlin | francoise22@wagner.com | christinefinke |
| 337 | Aysel Binner | Reccig officer | Kuhl Kalleww Swifwunw & Co. KGaA | Batix-Kanz-Staß 5/4 | Fulda | Berli | frncoise@wgnerco | christinefinke |
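Looking up each user name by hand becomes tedious with many duplicates. With `keep=False`, duplicated marks every member of a duplicate group, including the first occurrence, so all affected records can be listed at once. A minimal sketch on hypothetical toy data (`sample` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two records share the user_name "ada1"
sample = pd.DataFrame(
    {
        "name": ["Ada L.", "Grace H.", "Ada Lovelace", "Alan T."],
        "user_name": ["ada1", "grace7", "ada1", "alan42"],
    }
)

# keep=False marks every row whose user_name occurs more than once,
# including the first occurrence
all_copies = sample[sample.duplicated(["user_name"], keep=False)]

# Sorting places records with the same user_name next to each other
all_copies = all_copies.sort_values("user_name")
print(all_copies)
```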
Finally, we can drop those data records whose user_name has already occurred; the first record for each user_name is kept:
[11]:
customers.drop_duplicates(["user_name"])
[11]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain |
| 2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael |
| 3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz |
| 4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2074 | Rhonda James | Recruitment consultant | Turner, Bradley and Scott | 28382 Stokes Expressway | Port Gabrielaport | New Hampshire | zroberts@hotmail.com | heathscott |
| 2076 | Linda Alexander | Commrcil horiculuri | Webb, Ballald and Vasquel | 5594 Persn Ciff | Mooneybury | Maryland | ahleythoa@ail.co | kennethrchn |
| 2077 | Diane Bailly | Pharmacien | Voisin | 527, rue Dijoux | Duval-les-Bains | CH | aruiz@reynaud.fr | dorothee41 |
| 2078 | Jorge Riba Cerdán | Hotel manager | Amador-Diego | Rambla de Adriana Barceló 854 Puerta 3 | Huesca | Asturias | manuelamosquera@yahoo.com | eugenia17 |
| 2079 | Ryan Thompson | Brewing technologist | Smith-Sullivan | 136 Rodriguez Point | Bradfordborough | North Dakota | lcruz@gmail.com | cnewton |
2029 rows × 8 columns
The result contains 2029 data records, 51 fewer than the original 2080.
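Note that drop_duplicates returns a new DataFrame and leaves the original untouched unless the result is assigned. Which record of a group survives is controlled by the `keep` parameter. The following sketch illustrates both points on hypothetical toy data (`sample` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two records share the user_name "ada1"
sample = pd.DataFrame(
    {
        "name": ["Ada L.", "Grace H.", "Ada Lovelace"],
        "user_name": ["ada1", "grace7", "ada1"],
    }
)

# Default keep="first": the first record per user_name survives
first_kept = sample.drop_duplicates(["user_name"])

# keep="last" keeps the last record per user_name instead
last_kept = sample.drop_duplicates(["user_name"], keep="last")

# The original frame is unchanged, so the difference in length
# is the number of dropped rows
n_dropped = len(sample) - len(first_kept)
print(n_dropped, list(first_kept["name"]), list(last_kept["name"]))
```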
3. Dedupe¶
Alternatively, we can detect the duplicated data with the Dedupe library, which uses machine learning to learn matching rules from a small set of hand-labelled training pairs.
See also:
csvdedupe offers a command line tool for dedupe.
In addition, the same developers have created parserator, which you can use to train your own probabilistic parsers for extracting structured fields from text.
3.1 Configuring Dedupe¶
Now we define the fields to be taken into account during deduplication and create a new deduper object:
[12]:
from pathlib import Path
import dedupe
customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)
[13]:
variables = [
    dedupe.variables.String("name"),
    dedupe.variables.String("job"),
    dedupe.variables.String("company"),
    dedupe.variables.String("street_address"),
    dedupe.variables.String("city"),
    dedupe.variables.String("state"),
    dedupe.variables.String("email"),
    dedupe.variables.String("user_name"),
]
deduper = dedupe.Dedupe(variables)
If the value of a field is missing, it should be represented as None. If a variable is additionally defined with has_missing=True, Dedupe creates a new, additional field that indicates whether the data was present or not, and the missing data is assigned zero.
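The customer data has no missing values, so no conversion is needed here. For data that does have gaps, pandas stores them as NaN rather than None, so they would need to be converted before the records are passed on. A minimal sketch, assuming a hypothetical frame `sample` with one missing city:

```python
import pandas as pd

# Hypothetical toy data with one missing value
sample = pd.DataFrame(
    {"name": ["Ada", "Grace"], "city": ["London", float("nan")]}
)

# Build a dictionary keyed by row index and replace every missing
# value (NaN, None, NaT) with None
records = {
    idx: {col: (None if pd.isna(val) else val) for col, val in row.items()}
    for idx, row in sample.to_dict(orient="index").items()
}
print(records)
```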
[14]:
deduper
[14]:
<dedupe.api.Dedupe at 0x14a970c20>
[15]:
customers.shape
[15]:
(2080, 8)
4. Create training data¶
[16]:
deduper.prepare_training(customers.T.to_dict())
prepare_training initialises active learning with our data and, optionally, with existing training data.
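T is shorthand for pandas.DataFrame.transpose: it reflects the DataFrame across its diagonal by writing rows as columns and vice versa. Calling to_dict on the transposed frame therefore yields a dictionary keyed by row index, with one dictionary of column values per data record.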
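To make the resulting structure concrete, the sketch below converts a hypothetical two-row frame (`sample` is made up for illustration) and compares the result with `to_dict(orient="index")`, which produces the same nested dictionary without transposing:

```python
import pandas as pd

# Hypothetical toy data standing in for the customers DataFrame
sample = pd.DataFrame(
    {"name": ["Ada", "Grace"], "city": ["London", "New York"]}
)

# Transposing and converting yields {row_index: {column: value}}
via_transpose = sample.T.to_dict()

# orient="index" produces the same nested structure directly
via_orient = sample.to_dict(orient="index")

print(via_transpose)
```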
5. Active learning¶
You can train your Dedupe instance with dedupe.console_label. Dedupe shows you pairs of data records it is uncertain about and asks whether they refer to the same entity. Answer with y (yes), n (no) or u (unsure); p returns to the previous pair. Press f when you are finished.
[17]:
dedupe.console_label(deduper)
name : Lauren Green
job : Market researcher
company : Chen-Kelly
street_address : 75836 Lopez Plain Suite 513
city : South Matthew
state : Indiana
email : simskevin@gmail.com
user_name : briggsjamie
name : Lauren Green
job : Maet eeache
company : Chen-Kelly
street_address : 75836 Lopez Plai Suite 53
city : Soh Mahew
state : Indiana
email : smskevn@gmalcom
user_name : risjamie
0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
name : Anthony Walker
job : Solicitor, Scotland
company : Barrera-Wilcox
street_address : 649 Jacob Harbors
city : Drewton
state : Virginia
email : fjackson@gmail.com
user_name : bethburch
name : Anthony Walker
job : Solicior Scoland
company : Barrera-Wildox
street_address : 649 Jacb Harbrs
city : Drewton
state : Virginia
email : fjaso@gmail.om
user_name : betbr
1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Michael Miller
job : Algcal scns
company : Landry Grmmp
street_address : 895 Randy Plains
city : Braorogh
state : Nebaska
email : jul@hotmlcom
user_name : tlas
name : Michael Miller
job : Audiological scientist
company : Landry Group
street_address : 895 Randy Plains
city : Brayborough
state : Nebraska
email : juliabaird@hotmail.com
user_name : tlucas
2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Sarah Hoffman
job : Exhibition designer
company : Henson, Wiley and Ryan
street_address : 97490 Curtis Spur Suite 825
city : Josephtown
state : Arizona
email : ncole@yahoo.com
user_name : csmith
name : Sarah Hoffman
job : Exhibitin designe
company : Hensont Wiley and Ryan
street_address : 9490 Curts Spur Sute 82
city : Jseptwn
state : Arizona
email : ncole@yahoo.com
user_name : csmith
3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Jonathan Campos
job : Edior, ommissioning
company : Azvazez Inc
street_address : 740 Willia Sals
city : Lake Case
state : Wahington
email : james4@taorcom
user_name : yduy
name : Jonathan Campos
job : Editor, commissioning
company : Alvarez Inc
street_address : 78840 William Shoals
city : Lake Chase
state : Washington
email : james04@taylor.com
user_name : yduffy
4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Joerg Hornich
job : Rtal managr
company : H\.\.n
street_address : Astrd-Spß-All 09
city : Shmön
state : Meclebr-Vorpommer
email : christine@gx.de
user_name : rsbtte
name : Joerg Hornich
job : Retail manager
company : Hein
street_address : Astrid-Spieß-Allee 09
city : Schmölln
state : Mecklenburg-Vorpommern
email : christine74@gmx.de
user_name : ursbutte
5/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Martin Butte
job : Sale executve
company : Junhk Stiftung & Chh KG
street_address : Tadus-Tröst-Rin 1/
city : Gen
state : Niedersacsen
email : beckercarlo@googlemail.com
user_name : cmaeler
name : Hans Eberhardt
job : Research scientist (life sciences)
company : Pruschke Stiftung & Co. KG
street_address : Notburga-Reising-Weg 452
city : Griesbach Rottal
state : Niedersachsen
email : twohlgemut@hotmail.de
user_name : qmuehle
6/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
name : Andrea Jurado Bas
job : Prnmakr
company : Znrita, Madrid and Cnlladn
street_address : Pasaje Lís Blazqez 29
city : Huesc
state : Ciu
email : foso@rocmor-dgdocom
user_name : jan-franiso
name : Andrea Jurado Bas
job : Printmaker
company : Zurita, Madrid and Collado
street_address : Pasaje Luís Blazquez 29
city : Huesca
state : Ciudad
email : alfonso33@rocamora-delgado.com
user_name : juan-francisco06
6/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling
The last pair of data records compared makes it clear that our drop_duplicates example above did not remove this duplicate: the user names jan-franiso and juan-francisco06 were treated as different values.
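The difference between exact and approximate comparison can be illustrated with the standard library. Note that difflib.SequenceMatcher is only a stand-in chosen for this sketch; it is not the comparison that Dedupe itself learns and applies:

```python
from difflib import SequenceMatcher

# user_name values taken from the last pair shown above
clean = "juan-francisco06"
noisy = "jan-franiso"

# Exact comparison, as used by drop_duplicates, treats them as different
exact_match = clean == noisy

# A similarity ratio between 0 and 1 still shows how close the strings are
similarity = SequenceMatcher(None, clean, noisy).ratio()
print(exact_match, round(similarity, 2))
```

A fuzzy score like this shows why a learned matcher can link records that an equality test keeps apart.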
With Dedupe.train, the data record pairs you have marked are added to the training data and the matching model is updated.
With index_predicates=True, deduplication also takes into account predicates based on the indexing of the data.
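When you are finished, save the learned settings (the data model and predicates) with Dedupe.write_settings. The labelled pairs themselves can be saved separately with Dedupe.write_training.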
[18]:
settings_file = Path("csv_example_learned_settings")
if settings_file.exists():
    # Reuse previously learned settings instead of retraining
    print("reading from", settings_file)
    with settings_file.open("rb") as f:
        deduper = dedupe.StaticDedupe(f)
else:
    # Train on the labelled pairs and persist the learned settings
    deduper.train(index_predicates=True)
    with settings_file.open("wb") as sf:
        deduper.write_settings(sf)
reading from csv_example_learned_settings
With dedupe.Dedupe.partition, data records that all refer to the same entity are identified and returned as a list of tuples. Each tuple consists of a sequence of record IDs and a sequence of the corresponding confidence values. Further details on the confidence value can be found at dedupe.Dedupe.cluster.
[19]:
dupes = deduper.partition(customers.T.to_dict())
We can also display only individual entries:
[20]:
dupes[0]
[20]:
((np.int64(0), np.int64(963)),
(np.float32(0.95884323), np.float32(0.95884323)))
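Because customers still has its default integer index, these record IDs match the row positions, so we can display the records with pandas.DataFrame.iloc: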
[21]:
customers.iloc[[0, 963]]
[21]:
| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 963 | Patricia Schaefer | Prorammer, ytem | Es:rada-Bes: | 39 Pul Drve | Chistianview | Delwre | mbdvid@gmim | ndvdson |
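To work with all clusters rather than a single entry, the result can be flattened into a lookup from record ID to cluster. The sketch below does not call Dedupe; `clusters` is a hypothetical list written by hand in the same shape as `dupes[0]` above, and the IDs and scores other than the first pair are invented for illustration:

```python
# Hypothetical result in the same shape as the output of partition:
# a list of (record_ids, confidence_scores) tuples
clusters = [
    ((0, 963), (0.96, 0.96)),
    ((1,), (1.0,)),
    ((2, 17, 40), (0.91, 0.88, 0.93)),
]

# Map every record ID to its cluster number and confidence score
membership = {}
for cluster_id, (record_ids, scores) in enumerate(clusters):
    for record_id, score in zip(record_ids, scores):
        membership[record_id] = {"cluster_id": cluster_id, "confidence": score}

# Keep only clusters with more than one member, i.e. the actual duplicates
duplicate_clusters = [ids for ids, _ in clusters if len(ids) > 1]
print(duplicate_clusters)
```

A mapping like `membership` could then be joined back onto the customers DataFrame as additional columns, for example to keep one record per cluster.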