Subdividing and categorising data¶
Continuous data is often divided into domains or otherwise grouped for analysis.
Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we generate a dataframe with 250 entries between 0 and 99:
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
ages = rng.integers(0, 99, 250)
df = pd.DataFrame({"Age": ages})
df
[1]:
| Age | |
|---|---|
| 0 | 40 |
| 1 | 33 |
| 2 | 8 |
| 3 | 90 |
| 4 | 98 |
| ... | ... |
| 245 | 59 |
| 246 | 46 |
| 247 | 88 |
| 248 | 10 |
| 249 | 37 |
250 rows × 1 columns
Afterwards, pandas offers us a simple way to divide the results into ten ranges with pandas.cut. To get only whole years, we additionally set precision=0:
[2]:
cats = pd.cut(ages, 10, precision=0)
cats
[2]:
[(39.0, 49.0], (29.0, 39.0], (-0.1, 10.0], (88.0, 98.0], (88.0, 98.0], ..., (59.0, 69.0], (39.0, 49.0], (78.0, 88.0], (10.0, 20.0], (29.0, 39.0]]
Length: 250
Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]
With pandas.Categorical.categories you can display the categories:
[3]:
cats.categories
[3]:
IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0],
(39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0],
(78.0, 88.0], (88.0, 98.0]],
dtype='interval[float64, right]')
… or even just a single category:
[4]:
cats.categories[0]
[4]:
Interval(-0.1, 10.0, closed='right')
With pandas.Categorical.codes you can display an array where for each value the corresponding category is shown:
[5]:
cats.codes
[5]:
array([4, 3, 0, 9, 9, 1, 3, 4, 1, 8, 9, 6, 8, 7, 1, 3, 4, 2, 1, 2, 7, 0,
8, 9, 1, 4, 2, 9, 4, 5, 8, 6, 3, 8, 6, 5, 5, 9, 0, 0, 4, 8, 0, 5,
5, 4, 0, 5, 9, 6, 2, 6, 9, 6, 4, 4, 8, 6, 7, 7, 2, 2, 4, 6, 8, 0,
6, 8, 3, 4, 6, 0, 1, 7, 7, 6, 3, 9, 6, 2, 0, 7, 8, 0, 7, 2, 6, 0,
2, 3, 4, 1, 5, 0, 7, 0, 0, 9, 0, 7, 6, 0, 0, 1, 1, 6, 5, 2, 1, 2,
5, 0, 0, 0, 0, 0, 1, 2, 0, 6, 8, 4, 3, 6, 6, 5, 9, 6, 1, 1, 7, 5,
3, 7, 6, 1, 3, 3, 8, 6, 1, 4, 4, 9, 7, 2, 7, 5, 2, 7, 5, 5, 4, 5,
5, 9, 4, 9, 3, 7, 3, 3, 1, 2, 0, 4, 0, 4, 0, 3, 5, 8, 9, 0, 5, 7,
8, 2, 4, 0, 4, 3, 3, 6, 0, 9, 7, 7, 3, 1, 0, 1, 7, 8, 0, 2, 3, 5,
3, 3, 1, 2, 5, 8, 0, 3, 1, 5, 0, 3, 9, 2, 0, 1, 0, 9, 4, 6, 3, 8,
7, 1, 9, 1, 0, 3, 1, 8, 4, 0, 5, 2, 6, 9, 2, 5, 5, 4, 9, 1, 3, 6,
9, 0, 6, 6, 4, 8, 1, 3], dtype=int8)
With value_counts we can now look at how the number is distributed among the individual areas:
[6]:
pd.Series(cats).value_counts()
[6]:
(-0.1, 10.0] 38
(29.0, 39.0] 27
(59.0, 69.0] 27
(10.0, 20.0] 26
(39.0, 49.0] 25
(49.0, 59.0] 24
(88.0, 98.0] 22
(20.0, 29.0] 21
(69.0, 78.0] 21
(78.0, 88.0] 19
Name: count, dtype: int64
It is striking that the age ranges do not contain an equal number of years, but with 20.0, 29.0 and 69.0, 78.0 two ranges contain only 9 years. This is due to the fact that the age range only extends from 0 to 98:
[7]:
df.min()
[7]:
Age 0
dtype: int64
[8]:
df.max()
[8]:
Age 98
dtype: int64
With pandas.qcut, on the other hand, the set is divided into areas that are approximately the same size:
[9]:
cats = pd.qcut(ages, 10, precision=0)
[10]:
pd.Series(cats).value_counts()
[10]:
(44.0, 55.0] 31
(-1.0, 7.0] 28
(76.0, 88.0] 27
(25.0, 34.0] 26
(34.0, 44.0] 25
(7.0, 15.0] 24
(65.0, 76.0] 24
(15.0, 25.0] 23
(88.0, 98.0] 22
(55.0, 65.0] 20
Name: count, dtype: int64
If we want to ensure that each age group actually includes exactly ten years, we can specify this directly with pandas.Categorical:
[11]:
age_groups = [f"{i} - {i + 9}" for i in range(0, 99, 10)]
cats = pd.Categorical(age_groups)
cats.categories
[11]:
Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',
'60 - 69', '70 - 79', '80 - 89', '90 - 99'],
dtype='object')
For grouping we can now use pandas.cut. However, the number of labels must be one less than the number of edges:
[12]:
df["Age group"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)
df
[12]:
| Age | Age group | |
|---|---|---|
| 0 | 40 | 40 - 49 |
| 1 | 33 | 30 - 39 |
| 2 | 8 | 0 - 9 |
| 3 | 90 | 90 - 99 |
| 4 | 98 | 90 - 99 |
| ... | ... | ... |
| 245 | 59 | 50 - 59 |
| 246 | 46 | 40 - 49 |
| 247 | 88 | 80 - 89 |
| 248 | 10 | 10 - 19 |
| 249 | 37 | 30 - 39 |
250 rows × 2 columns