Remove duplicate rows from a pandas dataframe: Case Insenstive comparison
source link: https://stackoverflow.com/questions/56125863/remove-duplicate-rows-from-a-pandas-dataframe-case-insenstive-comparison/56127032#56127032
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Remove duplicate rows from a pandas dataframe: Case Insenstive comparison
I want to remove duplicate rows from the dataframe based on values in two columns: Column1
and Column2
If dataframe
is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
On using:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
I get:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
But using same code for dataframe
(Cat and Bat cases changed)
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
I get:
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Expected Output:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
How can this comparison be done case insensitively?
3 Answers
I figured it out. Create new uppercase columns and then use them to remove the duplicates. Once done, drop the uppercase columns.
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
df['Column1_Upper'] = df['Column1'].astype(str).str.upper()
df['Column2_Upper'] = df['Column2'].astype(str).str.upper()
This gives:
+---+---------+----------+---------+---------------+---------------+
| | Column1 | Column2 | Column3 | Column1_Upper | Column2_Upper |
+---+---------+----------+---------+---------------+---------------+
| 0 | 'Cat' | 'Bat' | 'xyz' | 'CAT' | 'BAT' |
| 1 | 'toy' | 'flower' | 'abc' | 'TOY' | 'FLOWER' |
| 2 | 'cat' | 'bat' | 'lmn' | 'CAT' | 'BAT' |
+---+---------+----------+---------+---------------+---------------+
Finally, run the below to drop the duplicates and created columns.
result_df = df.drop_duplicates(subset=['Column1_Upper', 'Column2_Upper'], keep='first')
result_df.drop(['Column1_Upper', 'Column2_Upper'], axis=1, inplace=True)
print(result_df)
This gives:
+-----------------------------+
| Column1 Column2 Column3 |
+-----------------------------+
| 0 'Cat' 'Bat' 'xyz' |
| 1 'toy' 'flower' 'abc' |
+-----------------------------+
You could convert the dataframe to lower case and then apply your solution.
Your dataframe.
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then apply lower string.
result_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
Then filter df for upper case.
df.loc[result_df.index]
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
First, convert all the string values to lowercase to make them case insensitive using the following line:
df[['Column1', 'Column2']] = df[['Column1', 'Column2']].applymap(lambda x: x.lower())
You will get the output as follows.
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Now apply the drop duplicates function.
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
reference: here
Your Answer
Post as a guest
Required, but never shown
By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK