Detecting and filtering outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(1000, 4)))
df.describe()
[1]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 0.023623 | 0.029803 | -0.028434 | 0.005751 |
| std | 1.012689 | 0.999117 | 0.980014 | 0.969842 |
| min | -3.240759 | -2.757613 | -3.372727 | -2.697726 |
| 25% | -0.633097 | -0.650075 | -0.685945 | -0.684345 |
| 50% | 0.001117 | 0.022371 | -0.009798 | 0.002211 |
| 75% | 0.705048 | 0.710901 | 0.657235 | 0.654270 |
| max | 3.262954 | 3.302385 | 2.956046 | 2.634073 |
Suppose you want to find values in one of the columns whose absolute value is greater than 3:
[2]:
col = df[1]
col[col.abs() > 3]
[2]:
220 3.011150
640 3.201065
674 3.302385
Name: 1, dtype: float64
To select all rows in which at least one column holds a value greater than 3 or less than -3, you can apply pandas.DataFrame.any to a Boolean DataFrame; any(axis=1) returns True for every row that contains at least one True:
[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 3.073224 | 0.392376 | 0.464029 | -1.086741 |
| 50 | 0.456746 | 0.313551 | -3.372727 | 0.232789 |
| 78 | 3.262954 | -1.511093 | -0.243049 | -0.424410 |
| 220 | 0.657494 | 3.011150 | -0.733968 | 0.549828 |
| 504 | -3.240759 | 0.202480 | -0.181495 | 0.088678 |
| 640 | 0.913886 | 3.201065 | -0.896181 | 1.048140 |
| 674 | -0.283886 | 3.302385 | -0.541091 | 0.524652 |
| 833 | 3.010898 | -0.341878 | -0.409523 | 0.264089 |
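To make the interplay of the Boolean mask and any(axis=1) easier to follow, here is a minimal sketch on a small hand-made DataFrame. The name `small` and its values are made up for illustration and are not part of the data above:

```python
import pandas as pd

# Hypothetical toy data, chosen so the result is easy to verify by eye
small = pd.DataFrame({0: [0.5, 3.2, -0.1, -3.5], 1: [3.1, 0.0, 0.2, 1.0]})

# Element-wise Boolean DataFrame: True where the absolute value exceeds 3
mask = small.abs() > 3

# Boolean Series: True for each row that contains at least one True
row_has_outlier = mask.any(axis=1)

# Keep only the rows with at least one outlier
outlier_rows = small[row_has_outlier]

# Summing the mask counts the outliers per column (True counts as 1)
outliers_per_column = mask.sum()
```

Here `outlier_rows` keeps the original index labels, so the outliers can still be located in the full DataFrame afterwards.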
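Outliers can also be capped instead of just selected. For this we use np.sign(df), which returns 1 for positive values, -1 for negative values and 0 for zeros, so that np.sign(df) * 3 yields a cap of 3 with the sign of the original value. Note that the mask df > 3 used here selects only values above 3: the assignment caps the upper tail, while values below -3 remain unchanged, as the min row of the output shows: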
[4]:
df[df > 3] = np.sign(df) * 3
df.describe()
[4]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 0.023276 | 0.029288 | -0.028434 | 0.005751 |
| std | 1.011630 | 0.997518 | 0.980014 | 0.969842 |
| min | -3.240759 | -2.757613 | -3.372727 | -2.697726 |
| 25% | -0.633097 | -0.650075 | -0.685945 | -0.684345 |
| 50% | 0.001117 | 0.022371 | -0.009798 | 0.002211 |
| 75% | 0.705048 | 0.710901 | 0.657235 | 0.654270 |
| max | 3.000000 | 3.000000 | 2.956046 | 2.634073 |
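The min row above still holds values below -3, since df > 3 matches only the positive outliers. A minimal sketch of capping both tails follows, again on made-up toy data (`small` is a hypothetical name). It uses either a mask on the absolute values or pandas.DataFrame.clip:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with outliers on both sides
small = pd.DataFrame({0: [0.5, 3.2, -0.1, -3.5], 1: [3.1, 0.0, 0.2, 1.0]})

# Variant 1: select outliers by absolute value and replace them
# with 3 or -3, depending on their sign
capped = small.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Variant 2: clip limits all values to the interval [-3, 3] in one call
clipped = small.clip(lower=-3, upper=3)
```

Both variants should give the same result here. clip is shorter when the limits are fixed numbers, while the mask variant makes it explicit which values are being replaced.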