Beyond GDPR
Damien Clochard
PostgreSQL DBA & Co-founder at Dalibo
President of PostgreSQLFr Association
I Am Not A Lawyer
I Am Not A Privacy Expert
Don’t take my word for it / Check the links !
GDPR: 1 year later
Why Anonymization is hard
Anonymization Pipelines
PostgreSQL Anonymizer
Individual Rights
Principles
Impact
Pseudonymization vs Anonymization
(source: Individual Rights)
(source: GDPR Principles)
(source: GDPR Enforcement Tracker)
Most sanctions are linked to Article 32:
« Insufficient technical and organisational measures to ensure information security »
(source Article 32 - Security of processing )
« Personally identifiable information is pseudonymised when it is modified in a way that it can no longer be linked to a single data subject without the use of additional data. »
Not even mentioned in the GDPR !
Pseudonymized data still falls within the scope of the Regulation.
Pseudonymization is a security requirement
Anonymization is an exit door
The additional data should be kept separate from the pseudonymized data and subject to technical and organisational measures to make it hard to link a piece of data to someone’s identity
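As a minimal sketch of that separation, the pseudonym can be derived with a keyed hash, so that the key plays the role of the "additional data". The names `SECRET_KEY` and `pseudonymize` are hypothetical and only illustrate the idea; this is not how any particular tool does it.

```python
import hashlib
import hmac

# Hypothetical secret: the "additional data". In practice it would be kept
# in a separate, access-controlled store, never next to the dataset.
SECRET_KEY = b"stored-separately-from-the-dataset"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identifier (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# The same input always maps to the same alias, so joins keep working.
alias = pseudonymize("alice@example.com")
```

Anyone holding the key can recompute the alias for a known identifier and re-link the record, which is why such data stays within the scope of the Regulation.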
Encryption is not anonymization !
Encrypted data are still covered by GDPR because the original data can be retrieved with the encryption key.
(source: WP29 Opinion on Anonymisation Techniques)
The possibility to isolate a record and identify a subject in the dataset.
Identify a subject in the dataset using other datasets
Netflix Ratings + IMDB Ratings
Hospital visits + State voting records
(sources: Netflix prize + Hospital Reidentification )
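A linkage attack of this kind can be sketched in a few lines. The two datasets below are made up, and the function name `link` and the column names are assumptions chosen for the illustration.

```python
def link(anonymous_rows, public_rows, keys):
    """Re-identify rows whose shared attributes match exactly one public record."""
    # Index the public dataset by the attributes both datasets have in common.
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in anonymous_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match reveals the identity
            matches.append((candidates[0]["name"], row))
    return matches


# Made-up "anonymous" dataset: names removed, indirect identifiers kept.
hospital = [
    {"zip": "02138", "birth": "1945-07-31", "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth": "1980-01-02", "sex": "F", "diagnosis": "asthma"},
]
# Made-up public dataset that still carries names.
voters = [
    {"name": "Pat Doe", "zip": "02138", "birth": "1945-07-31", "sex": "M"},
    {"name": "Sam Roe", "zip": "02140", "birth": "1990-05-05", "sex": "F"},
]

reidentified = link(hospital, voters, ["zip", "birth", "sex"])
```

Removing the name column alone does not help here: the first hospital record is linked back to a named voter through the three remaining attributes.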
Identify a subject using a set of indirect identifiers.
87% of the U.S. population are uniquely identified by date of birth, gender and zip code
(source : Latanya Sweeney)
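The same kind of measurement can be run on any table. The sketch below computes the share of records that are unique on a chosen set of indirect identifiers; the sample rows and the name `unique_fraction` are hypothetical.

```python
from collections import Counter


def unique_fraction(rows, quasi_identifiers):
    """Share of rows whose combination of quasi-identifiers appears only once."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    unique = sum(1 for row in rows if counts[key(row)] == 1)
    return unique / len(rows)


# Made-up sample: the first two people share all three attributes.
people = [
    {"birth": "1970-01-01", "sex": "F", "zip": "10001"},
    {"birth": "1970-01-01", "sex": "F", "zip": "10001"},
    {"birth": "1982-03-14", "sex": "M", "zip": "10001"},
    {"birth": "1970-01-01", "sex": "F", "zip": "94110"},
]

ratio = unique_fraction(people, ["birth", "sex", "zip"])
```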
you can’t prove that re-identification is impossible
(source: De-identification still doesn’t work)
« To determine [if] a person is identifiable, account should be taken of all the means reasonably likely to be used […] to identify the person directly or indirectly. »
« To ascertain whether means are reasonably likely to be used to identify the person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing »
(source: Recital 26)
This means you have to measure the “reasonable risk” of re-identification, on a regular basis.
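One common way to put a number on that risk is k-anonymity: the size of the smallest group of records sharing the same indirect identifiers. The sketch below is only an illustration with made-up rows; the function names and the choice to generalize zip codes to a two-character prefix are assumptions.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalize_zip(rows, digits=2):
    """Return copies of the rows with the zip code truncated to a prefix."""
    return [{**row, "zip": row["zip"][:digits]} for row in rows]


# Made-up rows: every (zip, age) pair is unique before generalization.
rows = [
    {"zip": "75001", "age": 34},
    {"zip": "75002", "age": 34},
    {"zip": "69001", "age": 51},
    {"zip": "69003", "age": 51},
]

k_before = k_anonymity(rows, ["zip", "age"])
k_after = k_anonymity(generalize_zip(rows), ["zip", "age"])
```

Running such a check on a schedule gives a repeatable figure to track; k = 1 means at least one record can be singled out.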
Minimizing the risk of data leaks by reducing the attack surface
This is a direct implementation of the “Storage Limitation” principle
Started as a personal project last year
Now part of the “Dalibo Labs” initiative
This is a prototype !
Currently in version 0.4
Declare masking rules within the database model
Anonymization is done internally
Dynamic Masking or In-Place Substitution
Batteries included : Builtin masking functions
Inspired by MS SQL Server Dynamic Data Masking
Using the Community RPM Repo:
( thanks Devrim ! )
( thanks Alvaro ! )
This will update all lines of all tables containing at least one masking rule.
This will be slow and will trigger a heavy write workload.
$ psql [...] -qtA -c 'SELECT anon.dump()' your_database > dump.sql
Let’s take a basic example :
Step 1 : Activate the dynamic masking engine
Step 2 : Declare a masked user
The masked user has a read-only access to the anonymized data of the masked tables.
Step 3 : Declare the masking rules
Step 4 : Connect with the masked user
Basically :
search_path
tsm_system_rows for random functions
The extension provides functions to implement 5 main anonymization techniques:
=# SECURITY LABEL FOR anon
-# ON COLUMN employee.salary
-# IS 'MASKED WITH FUNCTION
-# anon.add_noise_on_numeric_column(user, salary, 0.33)
-# ';
All values of the column will be randomly shifted with a ratio of +/- 33%
The dataset remains meaningful
AVG() and SUM() are similar to the original
works only for dates and numeric values
“extreme values” may cause re-identification (“singling out”)
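The effect of a +/- 33% shift can be reproduced outside the database. This is a sketch of the general idea with uniform multiplicative noise, not the extension's implementation; `add_noise` and the sample salaries are hypothetical.

```python
import random


def add_noise(values, ratio, rng):
    """Shift each value by a random factor drawn uniformly from [-ratio, +ratio]."""
    return [v * (1 + rng.uniform(-ratio, ratio)) for v in values]


# Made-up salaries; the last one is an "extreme value".
salaries = [30000, 32000, 35000, 41000, 250000]

rng = random.Random(42)  # seeded only to make the sketch reproducible
noisy = add_noise(salaries, 0.33, rng)
```

Even after the shift, the last salary is at least 167,500 (250,000 × 0.67), far above every other row, which is how an extreme value can still single out a person.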
The dataset remains meaningful
Perfect for Foreign Keys
Works badly on columns with few distinct values (e.g. boolean)
The table must have a primary key
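The bullets above appear to describe shuffling the values of a column between rows. Under that assumption, a minimal sketch looks like this; `shuffle_column` and the sample table are hypothetical and do not reproduce the extension's code.

```python
import random


def shuffle_column(rows, column, rng):
    """Return copies of the rows with the values of one column permuted."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Every other column, including the primary key, is left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


# Made-up table with a primary key "id".
employees = [
    {"id": 1, "name": "A", "salary": 30000},
    {"id": 2, "name": "B", "salary": 35000},
    {"id": 3, "name": "C", "salary": 41000},
    {"id": 4, "name": "D", "salary": 52000},
]

shuffled = shuffle_column(employees, "salary", random.Random(1))
```

The set of values is preserved, so aggregates and foreign-key references stay valid. With only two distinct values, as in a boolean column, most rows would keep their original value, which is why the technique does little there.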
Simple and Fast
Useful for columns with NOT NULL constraints
Useless for analytics
Just a more elaborate version of Randomization
Great for developers and CI tests
You can load your own dictionaries !
=# SECURITY LABEL FOR anon
-# ON COLUMN employee.phone
-# IS 'MASKED WITH FUNCTION anon.partial(phone,4,$$******$$,2)';
+33142928107 becomes +331******07
Perfect for phone number, credit cards, etc.
The user can still recognize his/her own data
Transformation is IMMUTABLE
Works only for TEXT / VARCHAR types
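The transformation itself is simple string slicing. The sketch below mirrors the argument order used in the rule above (visible prefix length, padding, visible suffix length) but is a stand-alone illustration, not the extension's code.

```python
def partial(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Keep the first `prefix` and last `suffix` characters, replace the middle."""
    head = value[:prefix]
    # max() keeps the slice valid when suffix is 0 or longer than the value.
    tail = value[max(len(value) - suffix, 0):]
    return head + padding + tail


masked = partial("+33142928107", 4, "******", 2)
```

The function has no random component: the same input always yields the same output.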
PostgreSQL 9.6 and later
Dynamic Masking works with only one schema
Research on K-Anonymity
Measure the risk of re-identification
Suggest masking rules based on heuristics
Implement Generalization functions
Differential Privacy extension by Google
Smart Sampling with pg_sample
Feedback and bugs !
Images and geodata
Join the project at :
GDPR sanctions are really real
Data Leak is your main risk
Reduce your attack surface (“Storage Limitation”)
Anonymize whenever you can
Anonymize inside the database
Encryption is not Anonymization !
Developers should write the masking rules
It’s hard… PostgreSQL must help them.
The Postgres community has won so many battles
Now we have to focus on data privacy
Dalibo is a French-speaking, employee-owned, remote-working company
We’re looking for:
Contact : damien.clochard@dalibo.com
Follow : @daamien
Feedback : https://2019.pgconf.eu/f
Other Projects : Dalibo Labs