Thoughts on Data Masking

Often times, production data needs to be moved to different environments for testing/developing purposes. However, some of that data can be people’s name, birthday, address, account number, etc., that we don’t want testers and/or developers to see, due to privacy and regulatory concerns. Hence the need to mask those data. I can certainly see this needs grow over time for all database platforms. There are software out there that does this sort of task, or similar tasks, such as data generation tool. Oracle actually has a Data Masking Pack since 10g for this purpose. Here are some of my thoughts on this topic.

One method of masking data is through reshuffling, which shuffles the value in target column(s) that you want to protect randomly across different rows.

Another way of doing it is through data generation. For instance, for target column(s), we just replace its value with something else.

For reshuffling, obviously the data element is still meaningful. In other words, a reshuffled account number is still a valid account number, only now its original owner has been changed. Depending on how stringent the requirements are, this may or may not be enough.

For data generation, we have this question to consider: is the format of the generated data important to us? If yes, then obviously some intelligence needs to be built in so that the generated data follows the format we define. For instance, a valid credit card number is 16 digits long, has certain prefix and/or suffix, the nth digit has a certain meaning, so on and so forth.

Another example is people’s name. Do we replace the name with some random letters we concoct together, or do we want the name to be realistic? If we want realistic names, then we may have to supply a dictionary for the masking software to pull that data from.

In either case, we also have the unique and foreign key constraints to deal with, if there are any. In certain instances where more than one schema/database is involved, the complexity increases exponentially.

Regardless of the method being used, performance of data masking process is important to consider. If the volume of data to be masked is small, then it may not be a big deal. But, as is often the case, you may have a huge transaction table that has millions and millions of rows to mask, then performance is a definite concern.

One idea I am toying around with for data masking performance issue is through low-level data manipulation. For instance, in MySQL, maybe play with rowid. And for Sql Server, play around with fileid, pageid, and such.

Another way to get around that is to do masking through batches. In other words, divide a big task into smaller tasks and tackle them one at a time.

Personally, I like the idea of data reshuffling. On one hand, the data element is meaningful. I know I don’t want to work with randomly generated gibberish that does not make sense to me. On the other hand, if one wants to do performance testing in test or development environment, one would like to have the data distribution as close to production as possible. And data reshuffling can probably keep the data distribution pretty close to that of production.

In my next entry, I will share a simple C# program I wrote to reshuffle data inside a CSV file.

Leave a Comment