Detecting and filtering outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
[1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 4))
df.describe()
[1]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | -0.034508 | 0.011824 | -0.024031 | -0.048423 |
| std | 1.023096 | 1.069939 | 1.037148 | 0.972926 |
| min | -2.998919 | -2.939683 | -3.980539 | -3.180228 |
| 25% | -0.735324 | -0.739318 | -0.690162 | -0.699223 |
| 50% | -0.020213 | 0.009185 | -0.041272 | -0.046438 |
| 75% | 0.661472 | 0.728629 | 0.675814 | 0.588834 |
| max | 3.187850 | 3.693235 | 3.950033 | 3.089895 |
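Because np.random.randn draws new numbers on every run, the values shown here will differ from yours. One way to get repeatable numbers is NumPy's default_rng with a fixed seed. This is only a sketch: the seed 42 and the name df_seeded are arbitrary choices for illustration and were not used to produce the tables in this section.

```python
import numpy as np
import pandas as pd

# Arbitrary seed, chosen only for illustration (not the source of the tables above)
rng = np.random.default_rng(42)

# Same shape as df: 1000 rows, 4 columns of standard normal draws
df_seeded = pd.DataFrame(rng.standard_normal((1000, 4)))

df_seeded.describe()
```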
Suppose you want to find values in one of the columns whose absolute value is greater than 3:
[2]:
col = df[1]
col[col.abs() > 3]
[2]:
435 3.693235
Name: 1, dtype: float64
To select all rows in which at least one value is greater than 3 or less than -3, build a Boolean DataFrame with df.abs() > 3 and apply pandas.DataFrame.any to it. With axis=1, any checks each row and returns True if at least one column in that row is True:
[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 103 | -0.477368 | -0.100079 | -1.466754 | -3.180228 |
| 188 | 1.962728 | -0.072791 | 3.950033 | -0.012231 |
| 210 | 1.498744 | -0.057742 | 3.412662 | 0.586651 |
| 245 | 3.016760 | 1.527263 | 1.790951 | -0.015122 |
| 282 | 1.006073 | -0.480924 | 0.259646 | 3.089895 |
| 385 | 3.187850 | -1.069850 | -0.641928 | 1.733524 |
| 435 | -0.303929 | 3.693235 | -0.590390 | 0.052511 |
| 606 | -0.220844 | -0.479557 | -3.012150 | -1.476384 |
| 613 | 0.715983 | 0.134178 | -3.835888 | -1.358231 |
| 666 | -0.351409 | 1.919364 | -3.014478 | -0.340513 |
| 743 | 0.227552 | -0.831102 | -0.905155 | -3.046226 |
| 824 | 0.109159 | 0.501608 | -3.980539 | -0.783160 |
| 829 | 3.075201 | 1.517391 | 1.191999 | -0.690774 |
| 882 | -0.445649 | 0.455558 | -3.241675 | 2.569407 |
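To make the role of any(axis=1) concrete, here is a minimal sketch on a small hand-made DataFrame. The name small and its values are made up for illustration and are not part of the data above. The Boolean mask has one entry per cell, and any(axis=1) reduces it to one entry per row:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 each contain one value outside [-3, 3]
small = pd.DataFrame(
    {
        0: [0.5, -3.2, 1.0, 0.1],
        1: [2.9, 0.3, -0.7, 3.5],
    }
)

# Boolean DataFrame with the same shape as small
mask = small.abs() > 3

# One Boolean per row: True if any column in that row is True
row_has_outlier = mask.any(axis=1)

# Boolean indexing keeps only the flagged rows
outlier_rows = small[row_has_outlier]
print(outlier_rows)
```

Using all(axis=1) instead would keep only rows in which every column exceeds the threshold, which is rarely what you want when hunting for outliers.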
Values found in this way can be capped so that all data lies in the interval from -3 to 3. For this we use np.sign(df), which returns 1 for positive values, -1 for negative values and 0 for zeros. Multiplying the result by 3 gives the cap with the matching sign:
[4]:
df[df.abs() > 3] = np.sign(df) * 3
df.describe()
[4]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | -0.034787 | 0.011131 | -0.023309 | -0.048286 |
| std | 1.022245 | 1.067774 | 1.025773 | 0.971934 |
| min | -2.998919 | -2.939683 | -3.000000 | -3.000000 |
| 25% | -0.735324 | -0.739318 | -0.690162 | -0.699223 |
| 50% | -0.020213 | 0.009185 | -0.041272 | -0.046438 |
| 75% | 0.661472 | 0.728629 | 0.675814 | 0.588834 |
| max | 3.000000 | 3.000000 | 3.000000 | 3.000000 |
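The same capping can also be written with pandas.DataFrame.clip. The following sketch uses a small made-up DataFrame to compare both approaches. The names toy, capped_sign and capped_clip are illustrative assumptions and do not appear in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one value below -3 and one above 3
toy = pd.DataFrame({0: [0.5, -3.2, 1.0], 1: [3.5, 0.0, -0.7]})

# Approach from the text: replace outliers with +3 or -3 according to their sign
capped_sign = toy.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Alternative: clip every value to the interval [-3, 3]
capped_clip = toy.clip(lower=-3, upper=3)

# Compare the two results element by element
same = np.allclose(capped_sign, capped_clip)
print(same)
```

Unlike the assignment with np.sign, clip returns a new DataFrame and leaves the original unchanged, which can be convenient when the uncapped data is still needed.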