Managing missing data with pandas

Missing data often occurs in data analysis. pandas simplifies working with missing data as much as possible. For example, all descriptive statistics of pandas objects exclude missing data by default. pandas uses the floating point value NaN (Not a Number) to represent missing data for numerical data.

pandas has adopted a convention borrowed from the R programming language and refers to missing data as NA, which stands for not available. In statistical applications, NA data can be either data that does not exist or data that exists but has not been observed (for example, due to problems with data collection). Python’s None object is also treated as NA in non-numeric arrays.

Methods for handling NA objects:

Argument

Description

dropna

filters axis labels based on whether values for individual labels have missing data, applying different thresholds for the amount of missing data to be tolerated.

fillna

fills missing data with a value or with an interpolation method such as ffill or bfill.

isna

returns Boolean values indicating which values are missing/NA.

notna

negates isna and returns True for non-NA values and False for NA values.

This notebook presents some ways to manage missing data with pandas DataFrames. For more information, see the Pandas documentation: Working with missing data and Missing data cookbook.

See also:

[1]:
import pandas as pd
[2]:
df = pd.read_csv("https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv")

1. Checking the data

When cleaning data for analysis, it is often important to analyse the missing data itself in order to identify problems with data collection or potential biases in the data due to the missing data. First, let’s display the first 20 data records:

[3]:
df.head(20)
[3]:
timestamp username temperature heartrate build latest note
0 2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
1 2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2 2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
3 2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
4 2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
5 2017-01-01T12:03:04 bryanttodd 13.0 86 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval
6 2017-01-01T12:03:51 andrea98 17.0 81 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN
7 2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 NaN interval
8 2017-01-01T12:05:05 hillpamela 5.0 82 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN
9 2017-01-01T12:05:41 moorejeffrey 25.0 63 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake
10 2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN
11 2017-01-01T12:06:53 gsutton 29.0 80 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update
12 2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update
13 2017-01-01T12:08:08 hornjohn 16.0 73 NaN 0.0 interval
14 2017-01-01T12:08:35 gramirez 24.0 73 NaN 0.0 wake
15 2017-01-01T12:09:05 schmidtsamuel NaN 78 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN
16 2017-01-01T12:09:48 derrick47 NaN 63 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval
17 2017-01-01T12:10:23 beckercharles 12.0 61 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test
18 2017-01-01T12:10:57 ipittman 11.0 69 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user
19 2017-01-01T12:11:34 sabrina65 22.0 82 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN

Then let’s take a look at what data type the columns are:

[4]:
df.dtypes
[4]:
timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object

With pandas.api.types, we can also automatically check whether the data types meet our expectations:

[5]:
import pandas.api.types as ptypes


expected_datetimes = ["timestamp"]
expected_objects = ["username", "build", "note"]
expected_floats = ["temperature", "latest"]
expected_ints = ["heartrate"]

assert all(ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes)
assert all(ptypes.is_object_dtype(df[col]) for col in expected_objects)
assert all(ptypes.is_float_dtype(df[col]) for col in expected_floats)
assert all(ptypes.is_int64_dtype(df[col]) for col in expected_ints)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 9
      6 expected_floats = ["temperature", "latest"]
      7 expected_ints = ["heartrate"]
----> 9 assert all(ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes)
     10 assert all(ptypes.is_object_dtype(df[col]) for col in expected_objects)
     11 assert all(ptypes.is_float_dtype(df[col]) for col in expected_floats)

AssertionError:

Now we can convert the timestamp column to the appropriate type, see also Converting dtype:

[6]:
df["timestamp"] = df["timestamp"].astype("datetime64[ns]")

We can also display the values and their frequency, for example for the column note:

[7]:
df.note.value_counts()
[7]:
note
wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: count, dtype: int64

2. Removing all zero values (including n/a)

2.1 … with pandas.read_csv

pandas.read_csv normally filters out many values that it recognises as NA or NaN. Additional values can be specified with na_values.

[8]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
    na_values=["n/a"],
)

2.2 … with pandas.DataFrame.dropna

Missing values can be deleted using pandas.DataFrame.dropna.

To analyse the scope of the deletions, we display the scope of the DataFrame before and after deletion using pandas.DataFrame.shape:

[9]:
df.shape
[9]:
(146397, 7)
[10]:
df.dropna().shape
[10]:
(46116, 7)

Using pandas.DataFrame.dropna would therefore result in the loss of more than two-thirds of the data records.

In the next attempt, we want to analyse whether entire rows or columns contain no data. Using how="all", rows or columns that contain no values are removed; axis=1 specifies that empty rows should be removed.

[11]:
df.dropna(how="all", axis=1).shape
[11]:
(146397, 7)

This does not get us any further either.

2.3 Find all columns containing the most data

[12]:
list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)
[12]:
['timestamp', 'username', 'heartrate']

thresh requires a certain number of NA values, in our case 90%, before axis=1 hides a column.

2.4 Find all columns with missing data

With pandas.DataFrame.isnull, we can find missing values, and with pandas.DataFrame.any, we can find out whether an element is valid, usually across a column.

[13]:
incomplete_columns = list(df.columns[df.isnull().any()])
[14]:
incomplete_columns
[14]:
['temperature', 'build', 'latest', 'note']

With num_missing, we can now display the number of missing values per column:

[15]:
for col in incomplete_columns:
    num_missing = df[df[col].isnull() == True].shape[0]
    print("number missing for column {}: {}".format(col, num_missing))
number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704

We can also display these values as percentages:

[16]:
for col in incomplete_columns:
    percent_missing = df[df[col].isnull() == True].shape[0] / df.shape[0]
    print("percent missing for column {}: {}".format(col, percent_missing))
percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959

2.5 Replacing missing data

To verify our changes in the latest column, we use pandas.Series.value_counts. The method returns a series containing the number of unique values:

[17]:
df.temperature.value_counts()
[17]:
temperature
29.0    4688
26.0    4674
16.0    4656
28.0    4648
10.0    4632
13.0    4629
7.0     4624
27.0    4621
21.0    4585
9.0     4576
23.0    4571
5.0     4568
6.0     4563
19.0    4561
18.0    4557
17.0    4556
11.0    4529
15.0    4525
8.0     4486
12.0    4484
20.0    4473
25.0    4469
14.0    4464
22.0    4455
24.0    4446
Name: count, dtype: int64

Now we replace the missing values in the temperature column with the mean value rounded to one decimal place using DataFrame.fillna:

[18]:
temp_mean = round(df.temperature.mean(), 1)
fill_mean = df.temperature.fillna(temp_mean)
fill_mean.value_counts()
[18]:
temperature
17.0    36913
29.0     4688
26.0     4674
16.0     4656
28.0     4648
10.0     4632
13.0     4629
7.0      4624
27.0     4621
21.0     4585
9.0      4576
23.0     4571
5.0      4568
6.0      4563
19.0     4561
18.0     4557
11.0     4529
15.0     4525
8.0      4486
12.0     4484
20.0     4473
25.0     4469
14.0     4464
22.0     4455
24.0     4446
Name: count, dtype: int64

2.6 Replacing missing data with backfill

To ensure that the data records follow each other in chronological order, we first set the index for timestamp with set_index:

[19]:
df = df.set_index("timestamp")
[20]:
df.head(20)
[20]:
username temperature heartrate build latest note
timestamp
2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
2017-01-01T12:03:04 bryanttodd 13.0 86 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval
2017-01-01T12:03:51 andrea98 17.0 81 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN
2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 NaN interval
2017-01-01T12:05:05 hillpamela 5.0 82 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN
2017-01-01T12:05:41 moorejeffrey 25.0 63 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake
2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN
2017-01-01T12:06:53 gsutton 29.0 80 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update
2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update
2017-01-01T12:08:08 hornjohn 16.0 73 NaN 0.0 interval
2017-01-01T12:08:35 gramirez 24.0 73 NaN 0.0 wake
2017-01-01T12:09:05 schmidtsamuel NaN 78 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN
2017-01-01T12:09:48 derrick47 NaN 63 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval
2017-01-01T12:10:23 beckercharles 12.0 61 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test
2017-01-01T12:10:57 ipittman 11.0 69 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user
2017-01-01T12:11:34 sabrina65 22.0 82 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN

We then use pandas.DataFrame.groupby to group the records by username and then fill in the missing data using the backfill method from pandas.core.groupby.DataFrameGroupBy.fillna. limit defines the maximum number of consecutive NaN values:

[21]:
df.temperature = df.groupby("username").temperature.fillna(
    method="backfill", limit=3
)
[22]:
for col in incomplete_columns:
    num_missing = df[df[col].isnull() == True].shape[0]
    print(f"number missing for column {col}: {num_missing}")
number missing for column temperature: 22633
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704

Arguments of the fillna function:

Argument

Description

value

Scalar value or dict-like object used to fill in missing values

Methode

Interpolation; default is ffill when the function is called without any other arguments

axis

Axis to fill; default is axis=0

inplace

Modifies the calling object without creating a copy

limit

For filling in forward and backward directions, maximum number of consecutive periods to fill