If you work as a data architect in an organisation that wants to migrate its data and workloads to the cloud, you will always be asked one question:
How do you guarantee the security of data in the cloud?
Data security is one of the fundamental functions of any data platform.
In today’s article, we will have a look at various approaches to implementing data security in the cloud.
At a very broad level, data security can be categorised as:
Security of data at rest
Security of data in transit
There are multiple other dimensions of data security, including the right levels of access control, hiding sensitive data, secure data sharing and many more.
Today, we will focus only on the security of “data at rest”.
What does “data at rest” mean?
Data at rest is data that is stored in your storage layer, whether that is a data lake, a warehouse or a lakehouse.
Any data that resides in your data platform’s storage repository is considered “data at rest” and should be protected at all times.
What are the approaches to securing “data at rest”?
Data at rest can be secured using the approaches below. They can be used individually or combined for better security, depending on the use case.
Encryption
Encryption transforms stored data into ciphertext that cannot be read without the right key. If you are using the AWS platform, you can use a KMS-based approach for encrypting your data.
Encryption ensures that only authorised users with the decryption keys can decrypt the data and see its actual values.
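As a rough illustration of the KMS-based approach, the sketch below encrypts and decrypts a single small value through a KMS client. It assumes a boto3 KMS client and a symmetric key; the key alias `alias/my-data-key` and the function names are hypothetical. The KMS `Encrypt` call only accepts small payloads (up to 4 KB), so for whole files or tables you would more likely rely on storage-level encryption with a KMS key rather than calling it per record.

```python
def encrypt_value(kms_client, key_id: str, plaintext: str) -> bytes:
    """Encrypt a small string with a KMS key and return the ciphertext bytes."""
    response = kms_client.encrypt(
        KeyId=key_id,
        Plaintext=plaintext.encode("utf-8"),
    )
    return response["CiphertextBlob"]


def decrypt_value(kms_client, ciphertext: bytes) -> str:
    """Decrypt ciphertext produced by encrypt_value (symmetric KMS key assumed)."""
    response = kms_client.decrypt(CiphertextBlob=ciphertext)
    return response["Plaintext"].decode("utf-8")


# Example usage (needs AWS credentials and an existing key; names are hypothetical):
#   import boto3
#   kms = boto3.client("kms")
#   blob = encrypt_value(kms, "alias/my-data-key", "4111 1111 1111 1234")
#   original = decrypt_value(kms, blob)
```

Only callers whose IAM permissions allow `kms:Decrypt` on the key can recover the original value, which is what ties access to the data to access to the key.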
Masking
Masking means hiding the original values by replacing them with substitute values, such as a standard set of placeholder characters. E.g.
Replace customer mobile numbers with 0s or customer names with ‘XXX’
You can also go for partial replacement instead of replacing the complete values. E.g. replace credit card numbers with XXXX, excluding the last 4 digits.
You can either mask the data while storing it in the data lake layers that users access (static masking), or apply a dynamic masking scheme at the time users query the data.
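The masking rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the column names, the `data_steward` role and the function names are all assumptions made for the example. The same `apply_masking` function could be run once while writing data to a user-facing layer (static masking) or on every read based on the caller's role (dynamic masking).

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask every digit except the last `visible` ones; separators are kept."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


# Hypothetical column-to-rule mapping for a customer table.
MASKING_RULES = {
    "name": lambda v: "XXX",              # full replacement
    "mobile": lambda v: "0" * len(v),     # replace with 0s
    "card_number": mask_keep_last,        # partial replacement
}

# Hypothetical roles that are allowed to see unmasked values.
UNMASKED_ROLES = {"data_steward"}


def apply_masking(row: dict, role: str) -> dict:
    """Return a copy of the row, masked unless the role is privileged."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {
        col: MASKING_RULES[col](val) if col in MASKING_RULES else val
        for col, val in row.items()
    }


customer = {
    "name": "Asha Rao",
    "mobile": "9876543210",
    "card_number": "4111 1111 1111 1234",
    "city": "Pune",
}
print(apply_masking(customer, role="analyst"))
print(apply_masking(customer, role="data_steward"))
```

Keeping the rules in one mapping means a new sensitive column only needs one new entry, and the privileged-role check stays in a single place.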
Tokenization
Tokenization replaces the underlying data with substitute values (tokens), using logic that allows the tokens to be reversed (de-tokenization) to get the actual values back.
You can use an algorithm to tokenize your data before storing it in layers that are accessed by users.
Tokenization has two approaches:
Vault-based tokenization - The mapping between tokens and original values is stored in a central database, and the original values are retrieved by doing a lookup.
Vaultless tokenization - There is no central store for tokens. Data is tokenized or detokenized on the fly by an algorithm while it is being stored or accessed.
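The two approaches can be contrasted with a small sketch. Everything here is illustrative: `TokenVault` is an in-memory stand-in for a central token database, and the vaultless functions use a simple keyed Feistel construction over digit strings that I have chosen for the example. It is not a vetted format-preserving encryption scheme, so a real system should use a reviewed algorithm or a managed tokenization service.

```python
import hashlib
import hmac
import secrets


class TokenVault:
    """Vault-based: random tokens, with the mapping kept in a central store."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:      # reuse the existing token
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]     # lookup in the vault


def _round_value(key: bytes, round_no: int, half: str, width: int) -> int:
    """Keyed round function: HMAC-SHA256 reduced to `width` decimal digits."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)


def vaultless_tokenize(digits: str, key: bytes, rounds: int = 8) -> str:
    """Vaultless: derive the token from the value and a secret key only."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of digits")
    width = len(digits) // 2
    left, right = digits[:width], digits[width:]
    for r in range(rounds):
        mixed = (int(left) + _round_value(key, r, right, width)) % (10 ** width)
        left, right = right, str(mixed).zfill(width)
    return left + right


def vaultless_detokenize(token: str, key: bytes, rounds: int = 8) -> str:
    """Run the rounds backwards with the same key to recover the value."""
    width = len(token) // 2
    left, right = token[:width], token[width:]
    for r in reversed(range(rounds)):
        prev_right = left
        prev_left = (int(right) - _round_value(key, r, prev_right, width)) % (10 ** width)
        left, right = str(prev_left).zfill(width), prev_right
    return left + right


vault = TokenVault()
vault_token = vault.tokenize("4111111111111234")
print(vault_token, "->", vault.detokenize(vault_token))

secret_key = b"demo-key-do-not-use"            # hypothetical key for the example
card_token = vaultless_tokenize("4111111111111234", secret_key)
print(card_token, "->", vaultless_detokenize(card_token, secret_key))
```

The trade-off shows up directly in the code: the vault needs a store that must be protected and scaled, while the vaultless version needs only the key, which then becomes the asset to protect.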
Each of these topics needs to be explored in greater depth to understand it fully and to decide which approach you should use for your use case.
In some of my future newsletters, I’ll try to discuss each of these separately, along with their benefits. Till then, keep exploring, keep learning!
Nice article, could you pls elaborate "masking" with some real time examples ? mainly interested in "How-to" of implementation.