TDDA: Test-Driven Data Analysis

TDDA uses file inputs (such as NumPy arrays or Pandas DataFrames) and a set of constraints that are stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.

1. Imports

[1]:
import numpy as np
import pandas as pd

from tdda.constraints import discover_df, verify_df, detect_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)

2. Check data

With pandas.DataFrame.sample we display ten random data sets:

[3]:
df.sample(10)
[3]:
timestamp username temperature heartrate build latest note
83649 2017-02-03T22:57:50 richardsjoshua 7 71 e78590a9-7ac2-d06c-4874-c4a8aad68d97 0 user
97132 2017-02-09T08:10:00 galvanlisa 26 60 73713d18-2348-2af0-df22-b8353291a2b0 0 user
15024 2017-01-07T12:10:23 martinvelez 14 85 fa2234c1-6b94-e42d-7c1e-04f0d93d511b 0 update
10587 2017-01-05T17:32:16 johndiaz 18 81 e0479902-1510-df3f-5b11-4ec804c29c0a 0 update
10895 2017-01-05T20:30:45 kerrycarter 28 76 5284ac27-131e-5ec6-a7ca-23897e877d8d 0 sleep
128611 2017-02-21T21:14:03 moorechristopher 10 62 65852418-68ce-cdc9-e2b5-68ecb095c398 0 test
9842 2017-01-05T10:23:58 randy54 24 73 a715f30d-448c-58a8-b7d2-1dc996f0dc26 1 wake
34461 2017-01-15T06:25:25 shirley88 24 69 87950e6c-27f7-97d6-fcff-6785d80eb6df 0 wake
93567 2017-02-07T21:59:16 bullockjames 15 84 a78c79af-34f5-3a96-375c-534cab19fe13 1 user
37045 2017-01-16T07:02:26 joshua04 12 87 1fded4f3-9fdf-6a69-3921-de7ed939e03b 1 update

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object

With discover_constraints a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x114e52270>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x114e51a90>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x114d8a350>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x114d8ac10>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x1269a8510>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x1269a88a0>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x1269b0a70>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x114e198c0>)])

4. Writing the constraints into a file

[8]:
with open("../../data/iot_example.json", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see that, for example, a string with 19 characters is expected for the timestamp column and temperature expects integers with values from 5-29.

[9]:
!cat ../../data/iot_example.json
{
    "creation_metadata": {
        "local_time": "2025-12-15T19:43:37",
        "utc_time": "2025-12-15T18:43:37",
        "creator": "TDDA 2.2.05",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}

5. Checking data frames

To do this, we first read in a new csv file with pandas and then have ten data records output as examples:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
timestamp username temperature heartrate build latest note
30764 2017-01-13T18:54:44 mcdanielthomas NaN 87 NaN NaN NaN
104026 2017-02-12T01:58:52 eking 23.0 61 NaN 1.0 interval
143254 2017-02-27T17:54:45 kross 10.0 71 NaN NaN user
82725 2017-02-03T14:03:12 kterry 8.0 80 NaN 0.0 NaN
49289 2017-01-21T04:45:24 zstewart 29.0 70 1b67ef10-6d22-0c00-32bd-2db70c7a5557 NaN interval
51277 2017-01-21T23:48:25 adamsmith 10.0 72 d50ff57d-0980-ce81-d347-06bacd75f381 NaN update
104873 2017-02-12T10:03:57 jasminemcguire 25.0 73 9b8740d9-f7fd-aa8d-f61b-934e34c05beb 1.0 NaN
122166 2017-02-19T07:27:49 vgolden 5.0 82 NaN 1.0 wake
73503 2017-01-30T21:24:05 hsanders NaN 65 ab685dd0-64e6-00b4-fd87-46dc496d7b86 NaN interval
85301 2017-02-04T14:50:19 leroychapman NaN 78 7e579149-f44e-9e74-e097-1a00ba317fcb 1.0 sleep

We see several fields that are output as NaN. Now, to analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.

[11]:
v = verify_df(new_df, '../../data/iot_example.json')
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x114e53230>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
field failures passes type min min_length max max_length sign max_nulls no_duplicates allowed_values
0 timestamp 0 5 True NaN True NaN True NaN True True NaN
1 username 0 4 True NaN True NaN True NaN True NaN NaN
2 temperature 1 4 True True NaN True NaN True False NaN NaN
3 heartrate 0 5 True True NaN True NaN True True NaN NaN
4 build 1 4 True NaN True NaN True NaN False True NaN
5 latest 1 4 True True NaN True NaN True False NaN NaN
6 note 0 4 True NaN True NaN True NaN NaN NaN True

6. Finding the faulty rows

tdda.constraints.pd.constraints.detect_df() detects records in the pandas DataFrame that violate one of the constraints in the provided JSON file. We can then call the detected() function on the created PandasDetection object to output the rows that are faulty:

[17]:
d = detect_df(new_df, "iot_example.json")

d.detected()
[17]:
n_failures
Index
3 1
4 1
7 1
10 2
12 1
... ...
146385 1
146387 2
146391 2
146393 2
146394 1

77260 rows × 1 columns

We can display all incorrect data records by using only the part of the index from new_df that also appears in d.detected():

[18]:
d_index = d.detected().index
[19]:
new_df[new_df.index.isin(d_index)]
[19]:
timestamp username temperature heartrate build latest note
3 2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
4 2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
7 2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 NaN interval
10 2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN
12 2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update
... ... ... ... ... ... ... ...
146385 2017-02-28T23:53:59 powelleric 20.0 86 152eda10-676a-069c-b664-19443f2c8081 NaN test
146387 2017-02-28T23:54:50 jthompson NaN 66 8da10303-fe49-e313-8fda-0d5e79ded054 NaN update
146391 2017-02-28T23:57:21 aaronbecker NaN 87 7e52f4a8-345c-5ee0-e515-b8c392213062 NaN sleep
146393 2017-02-28T23:58:43 joelrusso NaN 89 NaN 0.0 NaN
146394 2017-02-28T23:59:23 lellis NaN 84 dac87426-e147-9c39-6e4c-790bb11f8fc9 0.0 update

77260 rows × 7 columns