Companies today hold large volumes of data, which is typically stored in structured form in a data warehouse (DWH). As companies gathered more data over time, the DWH became increasingly expensive to expand, even though not all data needs to be immediately available at all times.
To solve this problem, companies started storing historic data in Hadoop
platforms, which are easy and cheap to expand and provide large distributed data
storage, as well as distributed computing power for data processing.
However, not all data can be stored forever, because of the General Data Protection Regulation (GDPR) enforced by the European Union.
GDPR specifies what customers can request a company to do with their data and states that companies cannot process or store personal information forever. How long personal information can be stored is defined by the laws of each country.
Since companies cannot keep personal data forever, it must eventually be either deleted or anonymized. However, deleting it means the company loses valuable data, while fully anonymizing it may render the data useless for analytical purposes.
For our customers, we want to keep as much clean data as possible for as long as possible, so that analysts can still work with it efficiently and deliver innovative AI and machine learning solutions. To make this possible, our approach is split into three phases, starting with analysis.
The analysis process is the most important part, since each customer has different data, different needs, and a different definition of what is considered personally identifiable information.
Therefore, each table and each column is analyzed to determine whether it contains personally identifiable information and, if so, which anonymization or pseudonymization method should be used.
All of this is validated with the customer.
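As an illustration only, the outcome of such an analysis could be recorded as one decision per column. The sketch below is not taken from the Adastra framework: the names ColumnDecision, Method and unvalidated, as well as the example tables and columns, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    KEEP = "keep"                  # not personal data, left untouched
    ANONYMIZE = "anonymize"        # irreversible replacement
    PSEUDONYMIZE = "pseudonymize"  # consistent surrogate value
    TOKENIZE = "tokenize"          # reversible through a protected lookup


@dataclass(frozen=True)
class ColumnDecision:
    """One analyzed column and the treatment chosen for it (hypothetical)."""
    table: str
    column: str
    is_pii: bool
    method: Method
    validated_by_customer: bool = False


def unvalidated(decisions):
    """Return PII decisions the customer has not yet confirmed."""
    return [d for d in decisions if d.is_pii and not d.validated_by_customer]


# Hypothetical analysis result for a single table.
decisions = [
    ColumnDecision("customer", "email", True, Method.PSEUDONYMIZE, True),
    ColumnDecision("customer", "birth_date", True, Method.ANONYMIZE),
    ColumnDecision("customer", "segment", False, Method.KEEP, True),
]

pending = unvalidated(decisions)
```

Keeping the validation flag next to each decision makes it easy to see which columns still need to be confirmed with the customer before anything is deployed.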
Next steps include configuration, deployment and execution of the anonymization process.
Adastra has developed an anonymization framework for both DWH and Hadoop environments. Out of the box, it supports anonymization, pseudonymization and, where some values must remain reversible, tokenization.
New anonymization methods can be added if the existing ones are not sufficient. Once the framework is deployed and configured, adding a new table to anonymize is simply a matter of configuration.
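To make the idea of configuration-driven anonymization more concrete, here is a minimal Python sketch. It does not show the actual framework or its API. Every name in it (METHODS, register, CONFIG, anonymize_row, the secret key and the in-memory token store) is an assumption made for illustration, and a real deployment would keep keys and token mappings in protected storage.

```python
import hashlib
import hmac
import secrets

# Registry of available methods; new ones are added with @register.
METHODS = {}


def register(name):
    def decorator(func):
        METHODS[name] = func
        return func
    return decorator


SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical placeholder key
_TOKENS_BY_VALUE = {}  # in-memory stand-in for a protected token store
_VALUES_BY_TOKEN = {}


@register("anonymize")
def anonymize(value):
    # Irreversible: the original value is discarded.
    return "***"


@register("pseudonymize")
def pseudonymize(value):
    # Keyed hash: the same input always yields the same surrogate.
    return hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


@register("tokenize")
def tokenize(value):
    # Reversible: a random token is mapped back to the value via the store.
    if value not in _TOKENS_BY_VALUE:
        token = secrets.token_hex(8)
        _TOKENS_BY_VALUE[value] = token
        _VALUES_BY_TOKEN[token] = value
    return _TOKENS_BY_VALUE[value]


def detokenize(token):
    return _VALUES_BY_TOKEN[token]


# Hypothetical configuration: table -> column -> method name.
CONFIG = {
    "customer": {
        "name": "anonymize",
        "email": "pseudonymize",
        "account_no": "tokenize",
    },
}


def anonymize_row(table, row):
    """Apply the configured method to each column; others pass through."""
    rules = CONFIG.get(table, {})
    return {
        col: METHODS[rules[col]](val) if col in rules else val
        for col, val in row.items()
    }
```

In this sketch, covering a new table only requires a new entry in CONFIG, and a new method only requires one more function decorated with register.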