15

Data Masking in Percona Server for MySQL to Protect Sensitive Data

 4 years ago
source link: https://www.percona.com/blog/2020/01/06/data-masking-in-percona-server-for-mysql-to-protect-sensitive-data/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

vuuieiR.png!web Starting with Percona Server for MySQL 8.0.17 , Percona Server ships with a data masking plugin, using the same API as the MySQL Enterprise Masking and De-identification feature. This plugin was developed by Francisco Miguel Biete and was submitted to Percona Server as a community contribution. He also wrote an excellent post describing his contribution for Percona’s Community Blog titled Percona Server for MySQL 8.0 – New Data Masking Feature and the full list of functions can be found in the documentation of the MySQL Enterprise plugin .

What is Data Masking?

The data masking plugin mentioned above provides easy-to-use methods to hide sensitive data, such as social security numbers:

mysql> SELECT ssn, mask_ssn(ssn) FROM employees WHERE id = 1337;
+-------------+---------------+
| ssn         | mask_ssn(ssn) |
+-------------+---------------+
| 935-20-5725 | XXX-XX-5725   |
+-------------+---------------+
1 row in set (0.00 sec)

Or even to blacklist words contained in a dictionary, replacing them with randomly chosen replacements:

mysql> SELECT city, gen_blacklist(city, 'US_Cities', 'DE_Cities') FROM employees WHERE id = 1337;
+----------+-----------------------------------------------+
| city     | gen_blacklist(city, 'US_Cities', 'DE_Cities') |
+----------+-----------------------------------------------+
| Houston  | Berlin                                        |
+----------+-----------------------------------------------+
1 row in set (0.00 sec)
 
mysql> SELECT city, gen_blacklist(city, 'US_Cities', 'DE_Cities') FROM employees WHERE id = 1337;
+----------+-----------------------------------------------+
| city     | gen_blacklist(city, 'US_Cities', 'DE_Cities') |
+----------+-----------------------------------------------+
| Houston  | Essen                                         |
+----------+-----------------------------------------------+
1 row in set (0.00 sec)

Performance

An important question before using a plugin like this is its cost: How does it compare with a different approach? As the goal is hiding some data from the user, this approach could be:

  1. Not masking at the SQL-side at all, and implementing it completely in the program that uses the database
  2. Using existing functions and user functions in views/queries directly to achieve a similar result
  3. Adding masked columns, or masked tables, to the database

The runtime performance of selects in 1 and 3 are roughly the same: for 1, the masking performance is moved to the client program, not the SQL server. For 3, it happens during inserts. Both also have additional costs, as (1) requires changes in (all) client programs, and (3) increases the space requirements significantly.

The second option is only realistic for some of the features: for example, the SSN masking above could be implemented with the right substring and concatenation.

Masking Functions

Most functions provided by the plugin perform simple string replacement, such as the function that masks an SSN, credit card number, or simply a substring at the given positions. These have the same performance characteristic, for example, we can see the following numbers running sysbench tests using different SSN masking scenarios:

Implementation Queries per second SELECT ssn (unmasked) 2503 SELECT mask_ssn(ssn) 2287 SELECT CONCAT(“XXX-XX-“, RIGHT(ssn, 4)) 2303

As this table shows, mask_ssn and manual string manipulation have similar performance, while if the plugin is loaded, the masking function is easier to use and maintain.

Replacement Functions

Some functions, however, operate on a list of strings, such as the already mentioned gen_blacklist. These lists use linear complexity algorithms in 8.0.17 and 8.0.18, and such are best suited for small dictionaries. Starting with 8.0.19, the algorithm is changed to a logarithmic complexity search, at the cost of increased dictionary loading cost. Since dictionaries are typically loaded once during the servers run but used many times, this reduces execution times with larger dictionaries.

With logarithmic search times, the performance of some dictionary sizes are shown in the table below:

Dictionary size Queries per second 0 (no blacklisting) 2506 10 2063 100 1885 200 1863 500 1731 1000 1558 1000 (100 matching) 1875

Note that while the search itself is logarithmic in complexity, the queries were all run on the same dataset: one million records, each randomly chosen from a dictionary of 5000 entries. In most rows in the table, not only the dictionary size is larger, but the blacklist function also finds more matches, resulting in more and more string replacement. As all items in the dictionary are from the dictionary used to generate the column, the smallest, 10 item dictionary only replaced 0.2% of the results, while with a size of 1000, it reached 20%. The test was kept this way as this is realistic – a larger dictionary usually results in more matches.

Also, the last entry showcases a dictionary where only 10% of the records are present in the actual table, showing the search performance against the other 1000 entry table.

Contributions

Percona Server welcomes contributions, both new features such as this plugin or bug fixes correcting existing issues. Patches should be submitted via a github pull request to the percona-server repository . Before submitting something, make sure to:

  1. Base the change on the branch of the lowest version that makes sense (5.6, 5.7, 8.0). For bug fixes, this generally means the earliest version where the issue can be found. For new features or improvements, it can be a higher version.
  2. Follow the coding style and conventions of the base version. For 8.0, the source code has a clang-format configuration file, and new changes should be formatted with clang-format-8. For 5.6 and 5.7, the code should be manually formatted similarly to the existing code; keep in mind that InnoDB and the rest of the server uses a different style!
  3. Only submit original work, or if the diff includes third party code (libraries, samples, etc.), make sure to keep the original copyright, and that the license is compatible with GPL.

After receiving the submission, a Percona Server developer will review it. This may take time, depending on the size of the patch. The reviewer is able to request changes before the pull request is merged or may change the code directly before merging.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK