7

Remove duplicate rows from a pandas dataframe: Case Insenstive comparison

 1 year ago
source link: https://stackoverflow.com/questions/56125863/remove-duplicate-rows-from-a-pandas-dataframe-case-insenstive-comparison/56127032#56127032
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Remove duplicate rows from a pandas dataframe: Case Insenstive comparison

Asked 4 years, 3 months ago
Modified 8 months ago
Viewed 7k times

I want to remove duplicate rows from the dataframe based on values in two columns: Column1 and Column2

If dataframe is:

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

On using:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

I get:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

But using same code for dataframe (Cat and Bat cases changed)

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

I get:

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Expected Output:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

How can this comparison be done case insensitively?

asked May 14, 2019 at 8:15

3 Answers

I figured it out. Create new uppercase columns and then use them to remove the duplicates. Once done, drop the uppercase columns.

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                       'Column2': ["'Bat'", "'flower'", "'bat'"],
                       'Column3': ["'xyz'", "'abc'", "'lmn'"]})

df['Column1_Upper'] = df['Column1'].astype(str).str.upper()
df['Column2_Upper'] = df['Column2'].astype(str).str.upper()

This gives:

+---+---------+----------+---------+---------------+---------------+
|   | Column1 | Column2  | Column3 | Column1_Upper | Column2_Upper |
+---+---------+----------+---------+---------------+---------------+
| 0 | 'Cat'   | 'Bat'    | 'xyz'   | 'CAT'         | 'BAT'         |
| 1 | 'toy'   | 'flower' | 'abc'   | 'TOY'         | 'FLOWER'      |
| 2 | 'cat'   | 'bat'    | 'lmn'   | 'CAT'         | 'BAT'         |
+---+---------+----------+---------+---------------+---------------+

Finally, run the below to drop the duplicates and created columns.

result_df = df.drop_duplicates(subset=['Column1_Upper', 'Column2_Upper'], keep='first')
result_df.drop(['Column1_Upper', 'Column2_Upper'], axis=1, inplace=True)
print(result_df)

This gives:

+-----------------------------+
|   Column1   Column2 Column3 |
+-----------------------------+
| 0   'Cat'     'Bat'   'xyz' |
| 1   'toy'  'flower'   'abc' |
+-----------------------------+
answered May 14, 2019 at 9:25

You could convert the dataframe to lower case and then apply your solution.

Your dataframe.

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

print(df)

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then apply lower string.

result_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Column1', 'Column2'], keep='first')

print(result_df)
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

Then filter df for upper case.

df.loc[result_df.index]

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
answered May 14, 2019 at 8:29

First, convert all the string values to lowercase to make them case insensitive using the following line:

df[['Column1', 'Column2']] = df[['Column1', 'Column2']].applymap(lambda x: x.lower())

You will get the output as follows.

    Column1   Column2  Column3 
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Now apply the drop duplicates function.

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

reference: here

answered May 14, 2019 at 8:42

Your Answer

Sign up or log in

Sign up using Google
Sign up using Facebook
Sign up using Email and Password

Post as a guest

Name
Email

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.

Not the answer you're looking for? Browse other questions tagged or ask your own question.

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK