Padhai Time

Get Highest Record Per Category

In some of the situations, the data that we get contains multiple entries for an object.

For e.g.

There are multiple salary information available for a user
There are multiple phone numbers available for a user
There are redundant/duplicate records available for a user

In all of these situations, we may have to remove some of the rows and we have to make sure that we have a single entry per user. Hence in these cases, it becomes mandatory to be aware about some Python trick to achieve this.

Although there can be ‘n’ no. of ways for solving this problem, however, we are going to show two approaches.

Consider the data is available in below form where we have Population information for different states of Countries:

Expected Output:

Get the single row for each country where the population is highest.

Approach 1:

Sort the data based on country name and population in ascending order
Drop duplicate records for each Country while keeping the last record
All the rows will get deleted except the last one for each category

Code:

data.sort_values(['CountryName', 'Population'], ascending=True, inplace = True)

data.drop_duplicates(['CountryName'], keep="last", inplace = True)

Alternate Approach:

You can sort the data in descending order and can keep the first row as well.

Code:

data.sort_values(['CountryName', 'Population'], ascending=False, inplace = True)

data.drop_duplicates(['CountryName'], keep="first", inplace = True)

Approach 2:

Group the data based on Country Name
Find the rows which have the highest population
Take the index numbers of those rows
Finally filter out the rows for those selected indexes

Code:

data.iloc[data.groupby('CountryName')['Population'].nlargest(1).index.get_level_values(1)]

Python has just lot of these one liner tricks to solve big big problems.

Bengaluru, India

contact.padhaitime@gmail.com