Remove duplicate rows from a pandas dataframe: Case Insensitive comparison

Asked 4 years, 3 months ago

Modified 8 months ago

Viewed 7k times

I want to remove duplicate rows from a dataframe based on the values in two columns, Column1 and Column2.

Given this dataframe:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

When I run:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

I get:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

But when I run the same code on this dataframe, where Cat and Bat are capitalized in the first row:

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

I get:

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Expected Output:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

How can this comparison be done case insensitively?

asked May 14, 2019 at 8:15

3 Answers


I figured it out. Create new uppercase columns and then use them to remove the duplicates. Once done, drop the uppercase columns.

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

df['Column1_Upper'] = df['Column1'].astype(str).str.upper()
df['Column2_Upper'] = df['Column2'].astype(str).str.upper()

This gives:

+---+---------+----------+---------+---------------+---------------+
|   | Column1 | Column2  | Column3 | Column1_Upper | Column2_Upper |
+---+---------+----------+---------+---------------+---------------+
| 0 | 'Cat'   | 'Bat'    | 'xyz'   | 'CAT'         | 'BAT'         |
| 1 | 'toy'   | 'flower' | 'abc'   | 'TOY'         | 'FLOWER'      |
| 2 | 'cat'   | 'bat'    | 'lmn'   | 'CAT'         | 'BAT'         |
+---+---------+----------+---------+---------------+---------------+

Finally, run the code below to drop the duplicates and then the helper columns.

result_df = df.drop_duplicates(subset=['Column1_Upper', 'Column2_Upper'], keep='first')
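# Reassign rather than using inplace=True, which can trigger a
# SettingWithCopyWarning on the result of drop_duplicates in some pandas versions.
result_df = result_df.drop(['Column1_Upper', 'Column2_Upper'], axis=1)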
print(result_df)

This gives:

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
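
The same idea can be sketched without adding and dropping helper columns, by building the upper/lower-cased keys in a separate frame and using them only as a mask. This is a minimal sketch, not part of the original answer; the names `keys` and `mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

# Build a lower-cased copy of the key columns only; df itself is untouched.
keys = df[['Column1', 'Column2']].apply(lambda s: s.astype(str).str.lower())

# duplicated() marks every repeat after the first occurrence as True.
mask = keys.duplicated(keep='first')

# Keep the rows that are not marked as duplicates, with their original casing.
result_df = df[~mask]
print(result_df)
```

Because the mask is computed on a copy, `df` never gains extra columns and the surviving rows keep the casing they had in the input.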

edited Dec 2, 2020 at 14:20

answered May 14, 2019 at 9:25

You could convert the dataframe to lower case and then apply your solution.

Starting from your dataframe:

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

print(df)

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then convert every column to lower case and drop the duplicates.

result_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Column1', 'Column2'], keep='first')

print(result_df)
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

Then use the index of result_df to select the matching rows from the original df, which restores the original casing.

df.loc[result_df.index]

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
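
If this is needed in more than one place, the steps above could be wrapped in a small function. The sketch below is one possible way to do that; `drop_duplicates_ignore_case` is a hypothetical helper name, not a pandas function, and it uses `str.casefold()` instead of `str.lower()` on the assumption that a more aggressive, Unicode-aware normalization is wanted.

```python
import pandas as pd


def drop_duplicates_ignore_case(frame, subset, keep='first'):
    """Hypothetical helper: drop rows whose `subset` values match ignoring case.

    The original casing of the surviving rows is preserved.
    """
    # Normalize only the key columns. astype(str) also turns missing values
    # into the string 'nan', so they compare equal to each other.
    keys = frame[subset].apply(lambda s: s.astype(str).str.casefold())
    # Select the rows that are not flagged as duplicates of an earlier/later row.
    return frame.loc[~keys.duplicated(keep=keep)]


df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

result_df = drop_duplicates_ignore_case(df, ['Column1', 'Column2'])
last_df = drop_duplicates_ignore_case(df, ['Column1', 'Column2'], keep='last')
print(result_df)
print(last_df)
```

The `keep` argument is passed straight through to `duplicated`, so `keep='last'` retains row 2 instead of row 0 for the cat/bat pair.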

edited May 14, 2019 at 8:34

answered May 14, 2019 at 8:29

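First, convert the string values in the key columns to lowercase so that the comparison ignores case. Note that this overwrites the original casing in df:

# Column-wise str.lower(); DataFrame.applymap is deprecated since pandas 2.1.
df[['Column1', 'Column2']] = df[['Column1', 'Column2']].apply(lambda col: col.str.lower())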

You will get the output as follows.

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Now apply the drop duplicates function.

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

reference: here

edited Dec 20, 2022 at 21:05

answered May 14, 2019 at 8:42
