
anonympy - Data Anonymization with Python

 2 years ago
source link: https://www.codeproject.com/Articles/5324569/anonympy-Data-Anonymization-with-Python

Introduction

Our world is flooded with digital data: by a commonly cited estimate, about 2.5 quintillion bytes are produced every day. Much of that data is personal and sensitive, something the person it relates to would not want disclosed. Examples of personal and sensitive data include names, identification card numbers, and ethnicity. However, data also contains valuable business insights. So how do we balance privacy with the need to gather and share valuable information? That's where data anonymization comes in.

Background

With the rising need for data anonymization and the extensibility of Python's package ecosystem, I thought it would be nice to create a library that provides numerous data anonymization techniques and is easy to use. Please meet my very first package, anonympy, created in the hope of contributing to the open-source community and helping other users deal with sensitive data. For now, the package provides functions to anonymize tabular (pd.DataFrame) and image data.

Using the Code

As a usage example, let's anonymize the following dataset - sample.csv.
Let's start by installing the package. It can be achieved in two steps:

Python
pip install anonympy
pip install cape-privacy==0.3.0 --no-deps

Next, load our sample dataset which we will try to anonymize:

Python
import pandas as pd

# the URL is split across two adjacent string literals, which Python joins into one
url = ('https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/'
       '0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv')
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()

Image 1

Looking at the columns, we can see that all of them are personal and sensitive. Therefore, we will have to apply a relevant technique to each and every column. First, we need to initialize our dfAnonymizer object.

Python
from anonympy.pandas import dfAnonymizer 

anonym = dfAnonymizer(df)

It’s important to know a column’s data type before applying any functions to it. Let’s check the data types and see what methods are available to us.

Python
# check dtypes 
print(anonym.numeric_columns) 
print(anonym.categorical_columns) 
print(anonym.datetime_columns) 

... ['salary', 'age']
... ['first_name', 'address', 'city', 'phone', 'email', 'web']
... ['birthdate']

# available methods for each data type
from anonympy.pandas.utils import available_methods

print(available_methods())

... `numeric`:        
  * Perturbation - "numeric_noise"         
  * Binning - "numeric_binning"         
  * PCA Masking - "numeric_masking"        
  * Rounding - "numeric_rounding" 
`categorical`:         
  * Synthetic Data - "categorical_fake"         
  * Synthetic Data Auto - "categorical_fake_auto"         
  * Resampling from same Distribution - "categorical_resampling"         
  * Tokenazation - "categorical_tokenization"         
  * Email Masking - "categorical_email_masking" 
`datetime`:         
  * Synthetic Date - "datetime_fake"         
  * Perturbation - "datetime_noise" 
`general`:         
  * Drop Column - "column_suppression" 

In our dataset, we have six categorical columns, two numeric columns, and one datetime column. The list returned by available_methods shows which functions apply to each data type.

Let’s add some random noise to the age column, round the values in the salary column, and partially mask the email column.

Python
anonym.numeric_noise('age')   
anonym.numeric_rounding('salary')  
anonym.categorical_email_masking('email') 

# or with a single call
# anonym.anonymize({'age':'numeric_noise',
#                   'salary':'numeric_rounding',
#                   'email':'categorical_email_masking'})
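
To make the three techniques concrete, here is a minimal plain-Python sketch of what perturbation, rounding, and partial email masking do to a value. It is not anonympy's implementation: the function names, the noise range, the rounding base, and the masking rule are all assumptions chosen for illustration.

```python
import random

# Hypothetical helpers for illustration only; anonympy's own logic may differ.

def add_noise(values, low=-3, high=3, seed=None):
    """Perturbation: add an integer offset drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [v + rng.randint(low, high) for v in values]

def round_to_base(values, base=10_000):
    """Rounding: snap each value to the nearest multiple of `base`."""
    return [base * round(v / base) for v in values]

def mask_email(email, keep=1, mask_char="*"):
    """Partial masking: keep the first `keep` characters of the local part
    and the whole domain, and hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # no "@" found, so leave the value untouched
    hidden = mask_char * max(len(local) - keep, 0)
    return local[:keep] + hidden + "@" + domain

ages = [33, 48, 27]
salaries = [59234, 49324, 61000]
emails = ["jane.doe@example.com"]

noisy_ages = add_noise(ages, seed=0)        # each age moves by at most 3
rounded_salaries = round_to_base(salaries)  # [60000, 50000, 60000]
masked = [mask_email(e) for e in emails]    # ['j*******@example.com']
```

All three trade accuracy for privacy: a wider noise range or a larger rounding base hides more but distorts the column's statistics more.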

To see the changes, call to_df(); for a short summary, call the info() method.

Python
anonym.info()

Image 2

Now we would like to substitute the names in the first_name column with fake ones. First, we have to check whether Faker has a corresponding method.

Python
from anonympy.pandas.utils import fake_methods  

print(fake_methods('f')) # args: None / 'all' / any letter

... factories, file_extension, file_name, file_path, firefox, first_name, 
first_name_female, first_name_male, first_name_nonbinary, fixed_width, 
format, free_email, free_email_domain, future_date, future_datetime 

Good, Faker has a method called first_name. Let’s use it to replace the values in the column.

Python
anonym.categorical_fake('first_name') 

# passing a dictionary is also valid -> {column_name: method_name} 
# anonym.categorical_fake({'first_name': 'first_name_female'})

Checking fake_methods for the other column names, it turns out that Faker also has methods for address and city. The web column can be substituted using the url method, and phone using phone_number.

Python
anonym.categorical_fake_auto() # this will change `address` and `city`
                               # because the column names match method names
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'}) # here we need to specify the methods,
                               # because the column names differ from the method names
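
The difference between the two calls comes down to how a column is paired with a generator. The sketch below shows that matching logic in plain Python. The GENERATORS table is a made-up stand-in for Faker's methods, and fake_auto and fake_explicit are hypothetical names, not anonympy functions.

```python
import random

# Hypothetical stand-ins for Faker methods; the value pools are invented.
GENERATORS = {
    "first_name": lambda rng: rng.choice(["Alex", "Sam", "Robin", "Kim"]),
    "city": lambda rng: rng.choice(["Springfield", "Riverton", "Lakeside"]),
    "url": lambda rng: f"https://www.example-{rng.randint(100, 999)}.com/",
}

def fake_auto(table, generators, rng):
    """Replace every column whose name equals a generator name.
    Returns the names of the columns that were changed."""
    changed = []
    for column in list(table):
        if column in generators:
            table[column] = [generators[column](rng) for _ in table[column]]
            changed.append(column)
    return changed

def fake_explicit(table, mapping, generators, rng):
    """Replace columns using an explicit {column_name: generator_name} mapping."""
    for column, method in mapping.items():
        table[column] = [generators[method](rng) for _ in table[column]]

rng = random.Random(42)
table = {
    "city": ["Boston", "Austin"],
    "web": ["http://a.example", "http://b.example"],
    "salary": [50000, 60000],
}

auto_changed = fake_auto(table, GENERATORS, rng)       # only "city" matches by name
fake_explicit(table, {"web": "url"}, GENERATORS, rng)  # "web" needs an explicit mapping
```

Name matching is convenient but silent: a column called web is skipped unless you map it yourself, so it is worth checking which columns were actually changed.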

The last column left to anonymize is birthdate. Since the age column contains the same information, we could drop birthdate using the column_suppression method. However, for the sake of illustration, let’s add some noise to it instead.

Python
anonym.datetime_noise('birthdate')
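
As a rough idea of what datetime perturbation means, the sketch below shifts each date by a random number of days using only the standard library. The function name, the 30-day bound, and the uniform distribution are assumptions for illustration; anonympy may draw its noise differently.

```python
import random
from datetime import date, timedelta

def datetime_noise_sketch(dates, max_days=30, seed=None):
    """Shift each date by a random whole number of days in [-max_days, max_days].
    Hypothetical helper, not part of anonympy."""
    rng = random.Random(seed)
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]

birthdates = [date(1990, 5, 17), date(1975, 11, 2)]
shifted = datetime_noise_sketch(birthdates, max_days=30, seed=1)

for before, after in zip(birthdates, shifted):
    print(before, "->", after, f"({(after - before).days:+d} days)")
```

With a small bound, the shifted date usually stays close enough to the original that derived values such as age remain plausible. A larger bound hides more but makes the column less useful.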

That’s it. Let’s now compare our datasets before and after anonymization.

Before:

After:

And now, your dataset is safe for public release.

Points of Interest

Data privacy and protection are an important part of data handling and deserve proper attention. Everyone wants their personal and sensitive data to be protected and secure. In this article, I showed how to use anonympy for simple anonymization and pseudonymization with Python. The library is not a magic wand that will do everything for you: you still have to thoroughly understand your data and the techniques being applied, and always keep your end goal in mind.
Here is the GitHub repository for the package - anonympy.

Good luck with anonymizing your data!

History

  • 9th February, 2022: Initial version
