Sorting and ranking
Sorting a data set by a criterion is another important built-in operation. Sorting lexicographically by row or column index is already described in the section Reordering and sorting from levels. For comparison, the first example sorts a Series by its index in descending order with Series.sort_index; the examples after it sort by value with DataFrame.sort_values and Series.sort_values:
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s.sort_index(ascending=False)
[1]:
6 -0.287551
5 -0.073895
4 0.077808
3 0.647918
2 1.370572
1 -0.071934
0 0.823556
dtype: float64
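To make the difference between sort_index and sort_values concrete, here is a minimal sketch on a small hand-picked Series. The name s_demo and its values are illustrative assumptions, not data from this section:

```python
import pandas as pd

# Illustrative data: the index labels are deliberately not in order
s_demo = pd.Series([0.5, -1.0, 2.0, 0.0], index=[3, 1, 0, 2])

# sort_index orders by the index labels and ignores the data
by_index = s_demo.sort_index(ascending=False)

# sort_values orders by the data and carries the labels along
by_value = s_demo.sort_values(ascending=False)

print(by_index.index.tolist())  # [3, 2, 1, 0]
print(by_value.index.tolist())  # [0, 3, 2, 1]
```

Both calls return a new Series and leave s_demo unchanged.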
By default, sort_values places all missing values at the end of the Series:
[2]:
s = pd.Series(rng.normal(size=7))
s[s < 0] = np.nan
s.sort_values()
[2]:
5 0.502380
3 1.347849
4 1.488811
0 NaN
1 NaN
2 NaN
6 NaN
dtype: float64
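Where missing values end up can be changed with the na_position parameter of sort_values. The following is a small sketch with made-up values (s_nan is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing values
s_nan = pd.Series([1.5, np.nan, 0.5, np.nan])

# Default behaviour: na_position="last"
nan_last = s_nan.sort_values()

# Put the missing values in front of the valid ones instead
nan_first = s_nan.sort_values(na_position="first")

# The sort direction does not affect where NaN goes
nan_last_desc = s_nan.sort_values(ascending=False)

print(nan_first)
```

Note that ascending=False reverses only the order of the valid values; the missing values still follow na_position.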
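With a DataFrame you can sort along either axis. The by parameter names the column or row labels whose values supply the sort keys; here the rows are sorted by the values in column 2, in descending order: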
[3]:
df = pd.DataFrame(rng.normal(size=(7, 3)))
df.sort_values(by=2, ascending=False)
[3]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 1 | -0.122280 | -0.013553 | 1.622476 |
| 2 | -0.316663 | 0.823117 | 0.678331 |
| 5 | 0.545206 | -1.685777 | 0.533224 |
| 4 | 0.661617 | 0.054888 | -0.228683 |
| 6 | -0.368610 | 1.419950 | -0.467401 |
| 0 | 0.701885 | 0.046049 | -1.685828 |
| 3 | 0.537244 | 1.251408 | -2.482741 |
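The by parameter also accepts a list of labels, and ascending can then be a list with one flag per key. A minimal sketch, assuming a small hypothetical frame df_demo with the invented columns group and score:

```python
import pandas as pd

# Illustrative data, not taken from the example above
df_demo = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [1, 3, 2, 4]}
)

# Sort by group ascending, then by score descending within each group
result = df_demo.sort_values(by=["group", "score"], ascending=[True, False])

print(result.index.tolist())  # [3, 1, 2, 0]
```

The first key decides the order; later keys only break ties among rows that agree on the earlier ones.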
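With axis=1, sort_values reorders the columns instead; by then names the row labels whose values supply the sort keys. Here the values in row 0 are already in descending order, so the column order does not change: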
[4]:
df.sort_values(axis=1, by=[0, 1], ascending=False)
[4]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.701885 | 0.046049 | -1.685828 |
| 1 | -0.122280 | -0.013553 | 1.622476 |
| 2 | -0.316663 | 0.823117 | 0.678331 |
| 3 | 0.537244 | 1.251408 | -2.482741 |
| 4 | 0.661617 | 0.054888 | -0.228683 |
| 5 | 0.545206 | -1.685777 | 0.533224 |
| 6 | -0.368610 | 1.419950 | -0.467401 |
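Because the random data above happens to leave the columns in place, a sketch with fixed values may show the effect of axis=1 more clearly. The frame df_wide and its labels are assumptions made for illustration:

```python
import pandas as pd

# Illustrative data: two labelled rows, three labelled columns
df_wide = pd.DataFrame(
    [[3, 1, 2], [30, 10, 20]],
    index=["r0", "r1"],
    columns=["a", "b", "c"],
)

# Reorder the columns by the values in row "r0", ascending
by_r0 = df_wide.sort_values(by="r0", axis=1)

# Reorder the columns by the values in row "r1", descending
by_r1_desc = df_wide.sort_values(by="r1", axis=1, ascending=False)

print(by_r0.columns.tolist())       # ['b', 'c', 'a']
print(by_r1_desc.columns.tolist())  # ['a', 'c', 'b']
```

The rows keep their order in both results; only the columns move.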
Ranking
DataFrame.rank and Series.rank assign ranks from one to the number of valid data points. For a DataFrame, each column is ranked separately by default:
[5]:
df.rank()
[5]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 7.0 | 3.0 | 2.0 |
| 1 | 3.0 | 2.0 | 7.0 |
| 2 | 2.0 | 5.0 | 6.0 |
| 3 | 4.0 | 6.0 | 1.0 |
| 4 | 6.0 | 4.0 | 4.0 |
| 5 | 5.0 | 1.0 | 5.0 |
| 6 | 1.0 | 7.0 | 3.0 |
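The phrase "valid data points" matters when values are missing: by default, rank leaves NaN entries unranked. A short sketch with invented values (s_rank is a hypothetical name) that also shows ranking in descending order:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value
s_rank = pd.Series([10.0, np.nan, 30.0, 20.0])

# The smallest valid value gets rank 1; NaN stays NaN
ranks = s_rank.rank()

# With ascending=False the largest valid value gets rank 1
ranks_desc = s_rank.rank(ascending=False)

print(ranks.tolist())       # [1.0, nan, 3.0, 2.0]
print(ranks_desc.tolist())  # [3.0, nan, 1.0, 2.0]
```

The highest rank equals s_rank.count(), the number of non-missing values, rather than the length of the Series.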
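If ties occur, each member of a tied group is assigned the average of the ranks the group spans by default. Here rows 5 and 6 appear twice, so in column 0 the two copies of row 5 share ranks 6 and 7 and both receive 6.5: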
[6]:
df2 = pd.concat([df, df[5:]])
df2.rank()
[6]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 9.0 | 4.0 | 2.0 |
| 1 | 4.0 | 3.0 | 9.0 |
| 2 | 3.0 | 6.0 | 8.0 |
| 3 | 5.0 | 7.0 | 1.0 |
| 4 | 8.0 | 5.0 | 5.0 |
| 5 | 6.5 | 1.5 | 6.5 |
| 6 | 1.5 | 8.5 | 3.5 |
| 5 | 6.5 | 1.5 | 6.5 |
| 6 | 1.5 | 8.5 | 3.5 |
Passing method="min", on the other hand, assigns the smallest rank in the group to every member:
[7]:
df2.rank(method="min")
[7]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 9.0 | 4.0 | 2.0 |
| 1 | 4.0 | 3.0 | 9.0 |
| 2 | 3.0 | 6.0 | 8.0 |
| 3 | 5.0 | 7.0 | 1.0 |
| 4 | 8.0 | 5.0 | 5.0 |
| 5 | 6.0 | 1.0 | 6.0 |
| 6 | 1.0 | 8.0 | 3.0 |
| 5 | 6.0 | 1.0 | 6.0 |
| 6 | 1.0 | 8.0 | 3.0 |
Other methods with rank
| Method | Description |
|---|---|
| average | default: assigns the average rank to each entry in the same group |
| min | uses the minimum rank for the whole group |
| max | uses the maximum rank for the whole group |
| first | assigns the ranks in the order in which the values appear in the data |
| dense | like min, but the ranks always increase by 1 between groups |
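The tie-breaking methods of rank are easiest to compare side by side. The following sketch applies each of the five method values that pandas accepts to one small Series with ties; s_ties and comparison are illustrative names:

```python
import pandas as pd

# Illustrative data: 7 occurs three times, 5 twice, 3 once
s_ties = pd.Series([7, 5, 7, 3, 5, 7])

# One column of ranks per tie-breaking method
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: s_ties.rank(method=m) for m in methods})

print(comparison)
#    average  min  max  first  dense
# 0      5.0  4.0  6.0    4.0    3.0
# 1      2.5  2.0  3.0    2.0    2.0
# 2      5.0  4.0  6.0    5.0    3.0
# 3      1.0  1.0  1.0    1.0    1.0
# 4      2.5  2.0  3.0    3.0    2.0
# 5      5.0  4.0  6.0    6.0    3.0
```

The three sevens occupy positions 4 to 6 in sorted order, which explains the values 5.0 (average), 4.0 (min) and 6.0 (max). Only dense yields consecutive ranks without gaps, and only first gives every entry a distinct rank.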