Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(1000, 4)))

df.describe()
[1]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.023623     0.029803    -0.028434     0.005751
std       1.012689     0.999117     0.980014     0.969842
min      -3.240759    -2.757613    -3.372727    -2.697726
25%      -0.633097    -0.650075    -0.685945    -0.684345
50%       0.001117     0.022371    -0.009798     0.002211
75%       0.705048     0.710901     0.657235     0.654270
max       3.262954     3.302385     2.956046     2.634073

Suppose you want to find values in one of the columns whose absolute value is greater than 3:

[2]:
col = df[1]

col[col.abs() > 3]
[2]:
220    3.011150
640    3.201065
674    3.302385
Name: 1, dtype: float64
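
To see how many such values each column holds, the Boolean mask can be summed column by column. The following is a minimal sketch: the seed 42 and the names mask and outlier_counts are assumptions for illustration, so the counts will not match the output above.

```python
import numpy as np
import pandas as pd

# A fixed seed (an assumption here) makes the sketch reproducible
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 4)))

# Boolean DataFrame: True where the absolute value exceeds 3
mask = df.abs() > 3

# Summing Booleans counts the True values in each column
outlier_counts = mask.sum()
print(outlier_counts)
```

Summing works because pandas treats True as 1 and False as 0, so the result is a Series with one count per column.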

To select all rows that contain a value greater than 3 or less than -3 in any column, build a Boolean DataFrame with df.abs() > 3 and apply pandas.DataFrame.any to it; any(axis=1) returns True for each row that contains at least one True value:

[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
             0          1          2          3
0     3.073224   0.392376   0.464029  -1.086741
50    0.456746   0.313551  -3.372727   0.232789
78    3.262954  -1.511093  -0.243049  -0.424410
220   0.657494   3.011150  -0.733968   0.549828
504  -3.240759   0.202480  -0.181495   0.088678
640   0.913886   3.201065  -0.896181   1.048140
674  -0.283886   3.302385  -0.541091   0.524652
833   3.010898  -0.341878  -0.409523   0.264089
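
The same row mask can also be inverted to filter the outlier rows out instead of selecting them. This is a minimal sketch, not part of the original example: the seed 42 and the names has_outlier and df_filtered are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A fixed seed (an assumption here) makes the sketch reproducible
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 4)))

# True for every row that holds at least one value outside [-3, 3]
has_outlier = (df.abs() > 3).any(axis=1)

# Invert the mask with ~ to keep only the rows without outliers
df_filtered = df[~has_outlier]

print(len(df), len(df_filtered))
```

Dropping a row discards its values in every column, including the ones that are not outliers, so this suits cases where the whole observation is considered suspect.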

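On this basis, values can be capped at a limit. np.sign(df) returns 1, -1, or 0 depending on whether a value in df is positive, negative, or zero, so np.sign(df) * 3 gives the limit with the matching sign. To cap both tails, the mask would be df.abs() > 3. With df > 3, as here, only values above 3 are capped and values below -3 stay unchanged, as the min row of the output shows: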

[4]:
df[df > 3] = np.sign(df) * 3

df.describe()
[4]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.023276     0.029288    -0.028434     0.005751
std       1.011630     0.997518     0.980014     0.969842
min      -3.240759    -2.757613    -3.372727    -2.697726
25%      -0.633097    -0.650075    -0.685945    -0.684345
50%       0.001117     0.022371    -0.009798     0.002211
75%       0.705048     0.710901     0.657235     0.654270
max       3.000000     3.000000     2.956046     2.634073
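
To limit the values to the interval from -3 to 3 on both sides, the mask can be built from absolute values, or pandas.DataFrame.clip can be used. The following is a minimal sketch under assumptions: the seed 42 and the names capped and clipped are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# A fixed seed (an assumption here) makes the sketch reproducible
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 4)))

# Variant 1: mask on the absolute value so that both tails are capped
capped = df.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Variant 2: DataFrame.clip limits the values to the given bounds directly
clipped = df.clip(lower=-3, upper=3)

# Both min and max now lie within [-3, 3]
print(capped.describe().loc[["min", "max"]])
```

Both variants should give the same result here. clip returns a new DataFrame and leaves df untouched, whereas the masked assignment modifies the DataFrame in place, which is why the sketch works on a copy.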