anonympy - Data Anonymization with Python

Introduction

Our world is flooded with digital data: by a commonly cited estimate, about 2.5 quintillion bytes are produced every day. Much of that data is personal and sensitive, something the person it relates to would not want disclosed. Examples include names, identification card numbers, and ethnicity. However, data also contains valuable business insights. So how do we balance privacy with the need to gather and share valuable information? That's where data anonymization comes in.

Background

With the rising need for data anonymization and the extensibility of Python's package ecosystem, I thought it would be nice to create a library that provides numerous data anonymization techniques and is easy to use. Please meet my very first package, anonympy, created in the hope of contributing to the open-source community and helping other users deal with sensitive data. For now, the package provides functions to anonymize tabular (pd.DataFrame) and image data.

Using the Code

As a usage example, let's anonymize the following dataset - sample.csv.
Let's start by installing the package. It can be achieved in two steps:

pip install anonympy
pip install cape-privacy==0.3.0 --no-deps

Next, load our sample dataset which we will try to anonymize:

import pandas as pd

# the URL is split into two adjacent string literals, which Python joins automatically
url = ('https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/'
       '0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv')
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()

Looking at the columns, we can see that all of them contain personal and sensitive data, so we will have to apply a suitable technique to each one. First, we need to initialize our dfAnonymizer object.

from anonympy.pandas import dfAnonymizer 

anonym = dfAnonymizer(df)

It’s important to know a column’s data type before applying any functions. Let’s check the data types and see what methods are available to us.

# check dtypes 
print(anonym.numeric_columns) 
print(anonym.categorical_columns) 
print(anonym.datetime_columns) 

... ['salary', 'age']
... ['first_name', 'address', 'city', 'phone', 'email', 'web']
... ['birthdate']

# available methods for each data type
from anonympy.pandas.utils import available_methods

print(available_methods())

... `numeric`:        
  * Perturbation - "numeric_noise"         
  * Binning - "numeric_binning"         
  * PCA Masking - "numeric_masking"        
  * Rounding - "numeric_rounding" 
`categorical`:         
  * Synthetic Data - "categorical_fake"         
  * Synthetic Data Auto - "categorical_fake_auto"         
  * Resampling from same Distribution - "categorical_resampling"         
  * Tokenazation - "categorical_tokenization"         
  * Email Masking - "categorical_email_masking" 
`datetime`:         
  * Synthetic Date - "datetime_fake"         
  * Perturbation - "datetime_noise" 
`general`:         
  * Drop Column - "column_suppression"

In our dataset, we have 6 categorical columns, 2 numeric columns and 1 datetime column. The list returned by available_methods shows which functions apply to each data type.
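
To make the idea of grouping columns by data type concrete, here is a minimal standard-library sketch. It is not how anonympy works internally (the library operates on a pandas DataFrame); the `classify_columns` helper and the sample rows are hypothetical and exist only for illustration.

```python
from datetime import date, datetime
from numbers import Number

def classify_columns(rows):
    """Group column names by the Python type of their first non-None value."""
    groups = {"numeric": [], "categorical": [], "datetime": []}
    for name in rows[0]:
        sample = next((r[name] for r in rows if r[name] is not None), None)
        if isinstance(sample, (date, datetime)):
            groups["datetime"].append(name)
        elif isinstance(sample, Number) and not isinstance(sample, bool):
            groups["numeric"].append(name)
        else:
            # strings, booleans and empty columns are treated as categorical
            groups["categorical"].append(name)
    return groups

# Hypothetical rows, not taken from the sample dataset
rows = [
    {"first_name": "Ann", "age": 33, "salary": 59234.32, "birthdate": date(1989, 4, 2)},
    {"first_name": "Bob", "age": 51, "salary": 49324.53, "birthdate": date(1971, 9, 17)},
]
groups = classify_columns(rows)
print(groups)
```

Knowing which group a column falls into tells you which family of techniques (numeric, categorical or datetime) makes sense for it.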

Let’s add some random noise to the age column, round the values in the salary column and partially mask the email column.

anonym.numeric_noise('age')   
anonym.numeric_rounding('salary')  
anonym.categorical_email_masking('email') 

# or with a single call
# anonym.anonymize({'age': 'numeric_noise',
#                   'salary': 'numeric_rounding',
#                   'email': 'categorical_email_masking'})

To see the changes, call to_df(); for a short summary, call the info() method.

anonym.info()
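
To build intuition for the three techniques applied so far (perturbation, rounding and partial email masking), here is a small standard-library sketch. It only illustrates the general ideas and makes no claim about how anonympy implements them; the function names, the noise range, the rounding base and the masking rule are all assumptions chosen for the example.

```python
import random

def add_noise(values, max_shift=3, seed=None):
    """Perturbation: shift each value by a random integer in [-max_shift, max_shift]."""
    rng = random.Random(seed)
    return [v + rng.randint(-max_shift, max_shift) for v in values]

def round_to(value, base=1000):
    """Rounding: snap a number to the nearest multiple of `base`."""
    return base * round(value / base)

def mask_email(email, mask_char="*"):
    """Partial masking: keep the first character of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not email-like; leave unchanged
    return local[0] + mask_char * (len(local) - 1) + sep + domain

ages = [33, 51, 27]                            # hypothetical values
noisy_ages = add_noise(ages, seed=0)           # each value moves by at most 3
rounded = round_to(59234.32)                   # 59000
masked = mask_email("jane.doe@example.com")    # j*******@example.com
print(noisy_ages, rounded, masked)
```

All three trade precision for privacy: the more noise, the coarser the rounding, or the more characters masked, the less each record reveals and the less exact any analysis on it becomes.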

Now we would like to substitute the names in the first_name column with fake ones. First, we have to check whether Faker has a corresponding method.

from anonympy.pandas.utils import fake_methods  

print(fake_methods('f')) # args: None / 'all' / any letter

... factories, file_extension, file_name, file_path, firefox, first_name, 
first_name_female, first_name_male, first_name_nonbinary, fixed_width, 
format, free_email, free_email_domain, future_date, future_datetime

Good, Faker has a method called first_name. Let’s use it to replace the values in the column.

anonym.categorical_fake('first_name') 

# passing a dictionary is also valid -> {column_name: method_name} 
# anonym.categorical_fake({'first_name': 'first_name_female'})

Checking fake_methods for the other column names shows that Faker also has methods for address and city. The web column can be substituted using the url method, and phone using phone_number.

anonym.categorical_fake_auto() # this will change `address` and `city`
                               # because the column names match Faker method names
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'}) # here we need to specify the methods
                               # because the column names differ from the method names
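
The difference between the automatic and the explicit variant can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not anonympy's or Faker's code: `GENERATORS`, `fake_auto` and `fake_explicit` are hypothetical names, and the tiny generators stand in for real Faker methods.

```python
import random

# Hypothetical stand-ins for Faker methods, keyed by method name
GENERATORS = {
    "city": lambda rng: rng.choice(["Springfield", "Riverton", "Lakeside"]),
    "url": lambda rng: f"https://www.example{rng.randint(1, 99)}.com/",
}

def fake_auto(table, generators, rng):
    """Replace only the columns whose name equals a generator name."""
    changed = [col for col in table if col in generators]
    result = {col: list(vals) for col, vals in table.items()}
    for col in changed:
        result[col] = [generators[col](rng) for _ in table[col]]
    return result, changed

def fake_explicit(table, mapping, generators, rng):
    """Replace the columns named in `mapping` ({column_name: method_name})."""
    result = {col: list(vals) for col, vals in table.items()}
    for col, method in mapping.items():
        result[col] = [generators[method](rng) for _ in table[col]]
    return result

rng = random.Random(0)
table = {"city": ["Old Town", "New Town"],
         "web": ["http://a.example", "http://b.example"]}

auto_result, changed = fake_auto(table, GENERATORS, rng)
print(changed)  # ['city']: 'web' matches no generator name, so it is untouched

explicit_result = fake_explicit(auto_result, {"web": "url"}, GENERATORS, rng)
print(explicit_result["web"])
```

Matching on names is convenient, but it silently skips any column whose name does not line up with a method, which is why the explicit mapping is still needed for web and phone.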

The last column left to anonymize is birthdate. Since the age column carries the same information, we could drop birthdate using the column_suppression method. However, for the sake of illustration, let’s add some noise to it instead.

anonym.datetime_noise('birthdate')
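
As a rough picture of what date perturbation means, the sketch below shifts each date by a random number of days using only the standard library. The `shift_dates` helper, the 30-day range and the sample dates are assumptions made for this example; they do not describe the noise that anonympy's datetime_noise actually applies.

```python
import random
from datetime import date, timedelta

def shift_dates(dates, max_days=30, seed=None):
    """Shift each date by a random whole number of days in [-max_days, max_days]."""
    rng = random.Random(seed)
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]

birthdates = [date(1989, 4, 2), date(1971, 9, 17)]  # hypothetical values
shifted = shift_dates(birthdates, max_days=30, seed=0)
for original, noisy in zip(birthdates, shifted):
    print(original, "->", noisy, f"({(noisy - original).days:+d} days)")
```

A larger `max_days` hides the true date better but also distorts anything derived from it, such as age, so the range should be chosen with the intended analysis in mind.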

That’s it. Let’s now compare our datasets before and after anonymization.

Before:

After:

And now, your dataset is safe for public release.

Points of Interest

Data privacy and protection are an important part of data handling and deserve proper attention. Everyone wants their personal and sensitive data to be protected and secure. In this article, I showed you how to use anonympy for simple anonymization and pseudonymization with Python. This library should not be treated as a magic wand that will do everything: you still have to thoroughly understand your data and the techniques being applied, and always keep your end goal in mind.
Here is the GitHub repository for the package - anonympy.

Good luck with anonymizing your data!

History

9th February, 2022: Initial version
