Anonymization

Beyond GDPR

Who I Am

Damien Clochard
PostgreSQL DBA & Co-founder at Dalibo
President of PostgreSQLFr Association

Who I Am Not

I Am Not A Lawyer
I Am Not A Privacy Expert
Don’t take my word for it / Check the links !

My Story

Menu

GDPR: 1 year later
Why Anonymization is hard
Anonymization Pipelines
PostgreSQL Anonymizer

GDPR: Individual Rights

The right to be informed
The right of access
The right to rectification
The right to erasure
The right to restrict processing
The right to data portability
The right to object
etc.

(source: Individual Rights)

GDPR: Principles & Concepts

Lawfulness, fairness and transparency
Security
Data Minimization
Privacy By Design
Data Protection By Design
Pseudonymization
Storage Limitation
Accuracy
Purpose Limitation

(source: GDPR Principles)

Sanctions are coming

July 2019 : Marriott (UK), intended fine of 110 M€ announced
July 2019 : British Airways (UK), intended fine of 204 M€ announced
June 2019 : Sergic (France) fined 400 k€
June 2019 : LaLiga (Spain) fined 250 k€
May 2019 : Municipality of Bergen (Norway) fined 170 k€
April 2019 : Airbus (France) fined 200 k€
And many more

(source: GDPR Enforcement Tracker)

Beware of Article 32 !

Most sanctions are linked to Article 32:

« Insufficient technical and organisational measures to ensure information security »

(source: Article 32 - Security of processing)

In other words: “Data Leaks”

Pseudonymization

« Personally identifiable information is pseudonymised when it is modified in a way that it can no longer be linked to a single data subject without the use of additional data. »

Anonymization

Not even mentioned in the GDPR !

Does it really matter ?

Yes

Pseudonymized data still falls within the scope of the Regulation.

2 different things

Pseudonymization is a security requirement
Anonymization is an exit door

Pseudonymization

The additional data should be kept separate from the pseudonymized data and subject to technical and organisational measures to make it hard to link a piece of data to someone’s identity
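As an illustration of the "additional data" idea, here is a minimal Python sketch of pseudonymization with a keyed hash. It is not taken from any tool presented in this talk; the key value and the function names `pseudonymize` and `relink` are hypothetical. Whoever holds the key can link a pseudonym back to a person, which is why the key has to be stored apart from the dataset.

```python
import hashlib
import hmac
from typing import List, Optional

# Hypothetical secret: this plays the role of the "additional data" and
# must be kept separate from the pseudonymized dataset.
SECRET_KEY = b"keep-me-in-a-separate-vault"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identifier, using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def relink(pseudonym: str, candidates: List[str],
           key: bytes = SECRET_KEY) -> Optional[str]:
    """With the key, test known identities against a pseudonym."""
    for name in candidates:
        if hmac.compare_digest(pseudonymize(name, key), pseudonym):
            return name
    return None


if __name__ == "__main__":
    token = pseudonymize("Sarah Conor")
    # The key holder can re-identify the subject: still personal data.
    print(relink(token, ["Chuck Norris", "Sarah Conor"]))
```

Without the key the pseudonym cannot be recomputed from a name, but the data remains within the scope of the Regulation because the link can be restored.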

Example: Encryption

Encryption is not anonymization !

Encrypted data are still covered by GDPR because the original data can be retrieved with the encryption key.

Why Anonymization is hard

Singling out
Linkability
Inference

(source: WP29 Opinion on Anonymisation Techniques)

Singling Out

The possibility to isolate a record and identify a subject in the dataset.

SELECT * FROM employees;

  id  |  name          | job  | salary
------+----------------+------+--------
 1578 | xkjefus3sfzd   | NULL |    1498
 2552 | cksnd2se5dfa   | NULL |    2257
 5301 | fnefckndc2xn   | NULL |   45489
 7114 | npodn5ltyp3d   | NULL |    1821
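In the table above the names are scrambled, yet one salary is so far from the others that the record can still be isolated. A minimal Python sketch of that check, assuming an arbitrary threshold of 5 times the median (the function name `singled_out` and the threshold are illustrative assumptions, not a standard metric):

```python
from statistics import median

# Same rows as the query result above (job column omitted).
employees = [
    {"id": 1578, "name": "xkjefus3sfzd", "salary": 1498},
    {"id": 2552, "name": "cksnd2se5dfa", "salary": 2257},
    {"id": 5301, "name": "fnefckndc2xn", "salary": 45489},
    {"id": 7114, "name": "npodn5ltyp3d", "salary": 1821},
]


def singled_out(rows, column, factor=5.0):
    """Return the rows whose value exceeds `factor` times the median."""
    mid = median(row[column] for row in rows)
    return [row for row in rows if row[column] > factor * mid]


if __name__ == "__main__":
    # Only one record stands out, despite the scrambled name.
    print([row["id"] for row in singled_out(employees, "salary")])
```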

Linkability

Identify a subject in the dataset using other datasets

Netflix Ratings + IMDB Ratings
Hospital visits + State voting records

(sources: Netflix prize + Hospital Reidentification )
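A linkage attack of this kind boils down to a join on the columns both datasets share. The Python sketch below uses made-up records and a hypothetical `link` helper; it only reports a match when exactly one candidate is found in the public dataset.

```python
def link(anonymous_rows, public_rows, keys):
    """Pair each anonymous row with its single public match, if any."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in anonymous_rows:
        found = index.get(tuple(row[k] for k in keys), [])
        if len(found) == 1:  # exactly one candidate: re-identified
            matches.append((row, found[0]))
    return matches


# Made-up data, for illustration only.
hospital_visits = [
    {"birth": "1952-07-17", "zipcode": "90001", "diagnosis": "flu"},
    {"birth": "1940-03-10", "zipcode": "75001", "diagnosis": "asthma"},
]
voting_records = [
    {"name": "Alice Example", "birth": "1952-07-17", "zipcode": "90001"},
    {"name": "Bob Example", "birth": "1940-03-10", "zipcode": "75001"},
    {"name": "Carol Example", "birth": "1980-01-01", "zipcode": "75001"},
]

if __name__ == "__main__":
    for visit, voter in link(hospital_visits, voting_records,
                             ["birth", "zipcode"]):
        print(voter["name"], "->", visit["diagnosis"])
```

Neither dataset contains a name next to a diagnosis, but the combination does.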

Inference

Identify a subject using a set of indirect identifiers.

87% of the U.S. population can be uniquely identified by date of birth, gender and ZIP code

(source: Latanya Sweeney)

This is a losing game !

You can’t prove that re-identification is impossible

(source: De-identification still doesn’t work)

GDPR gives a margin of error

« To determine [if] a person is identifiable, account should be taken of all the means reasonably likely to be used […] to identify the person directly or indirectly. »

« To ascertain whether means are reasonably likely to be used to identify the person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing »

(source: Recital 26)

Measure the threat

This means you have to measure the “reasonable risk” of re-identification, on a regular basis.
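One possible way to put a number on that risk is k-anonymity: the size of the smallest group of records sharing the same indirect identifiers. The Python sketch below is an illustration under that assumption, with made-up rows and hypothetical helper names; it is not a legal test and not a feature of any tool described here.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifiers."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_ratio(rows, quasi_identifiers):
    """Share of records that are alone in their group."""
    sizes = group_sizes(rows, quasi_identifiers)
    unique = sum(1 for size in sizes.values() if size == 1)
    return unique / len(rows) if rows else 0.0


# Made-up rows, for illustration only.
people = [
    {"birth_year": 1970, "gender": "M", "zipcode": "75001"},
    {"birth_year": 1970, "gender": "F", "zipcode": "75001"},
    {"birth_year": 1985, "gender": "M", "zipcode": "69001"},
    {"birth_year": 1985, "gender": "F", "zipcode": "69001"},
]

if __name__ == "__main__":
    print(k_anonymity(people, ["birth_year", "gender", "zipcode"]))
    print(k_anonymity(people, ["birth_year", "zipcode"]))
```

Running such a check on a schedule is one way to track the risk "on a regular basis": a k of 1 means at least one record is unique on those columns.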

Anonymization Pipelines

Minimizing the risk of data leaks by reducing the attack surface

This is a direct implementation of the “Storage Limitation” principle

Basic Example

Worst Scenario

ETL

Cloud Anonymization

PostgreSQL Anonymizer

What is this ?

Started as a personal project last year
Now part of the “Dalibo Labs” initiative
This is a prototype !
Currently in version 0.4

Goals

Declare masking rules within the database model
Anonymization is done internally
Dynamic Masking or In-Place Substitution
Batteries included : Built-in masking functions
Inspired by MS SQL Server Dynamic Data Masking

Example: Real Data

=# SELECT * FROM customer;
 id  |   full_name      |   birth    | zipcode | fk_shop
-----+------------------+------------+---------+---------
 911 | Chuck Norris     | 1940-03-10 | 75001   | 12
 112 | David Hasselhoff | 1952-07-17 | 90001   | 423

Example: Anonymized Data

=# SELECT * FROM customer;
 id  |     full_name     |   birth    | zipcode | fk_shop
-----+-------------------+------------+---------+---------
 911 | Michel Duffus     | 1970-03-24 | 63824   | 12
 112 | Andromache Tulip  | 1921-03-24 | 38199   | 423

Install

$ sudo pgxn install ddlx
$ sudo pgxn install postgresql_anonymizer

Install

Using the Community RPM Repo:

$ yum install https://.../pgdg-redhat-repo-latest.noarch.rpm
$ yum install postgresql_anonymizer12

( thanks Devrim ! )

Configure

shared_preload_libraries = '[...], anon'

Load

=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.load();

Declare a masking rule

SECURITY LABEL FOR anon
ON COLUMN customer.zipcode
IS 'MASKED WITH FUNCTION anon.random_zipcode()';

( thanks Alvaro ! )

Now we have 3 options

In-Place Anonymization
Anonymous Dumps
Dynamic Masking

In-Place Anonymization

=# SELECT anon.anonymize_column('customer','zipcode');

=# SELECT anon.anonymize_table('customer');

=# SELECT anon.anonymize_database();

In-Place Anonymization

This will update all rows of all tables containing at least one masking rule.

This will be slow and will trigger a heavy write workload.

Anonymous Dumps

=# SELECT anon.dump();

Anonymous Dumps

$ psql [...] -qtA -c 'SELECT anon.dump()' your_database > dump.sql

Dynamic Masking

Let’s take a basic example :

=# SELECT * FROM people;
 id | firstname | lastname |   phone
----+-----------+----------+------------
 T1 | Sarah     | Conor    | 0609110911
(1 row)

Dynamic Masking

Step 1 : Activate the dynamic masking engine

=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.start_dynamic_masking();

Dynamic Masking

Step 2 : Declare a masked user

=# CREATE ROLE skynet LOGIN;
=# SECURITY LABEL FOR anon ON ROLE skynet
-# IS 'MASKED';

The masked user has read-only access to the anonymized data of the masked tables.

Dynamic Masking

Step 3 : Declare the masking rules

SECURITY LABEL FOR anon ON COLUMN people.lastname
IS 'MASKED WITH FUNCTION anon.random_last_name()';

SECURITY LABEL FOR anon ON COLUMN people.phone
IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';

Dynamic Masking

Step 4 : Connect with the masked user

=# \! psql peopledb -U skynet -c 'SELECT * FROM people;'
 id | firstname | lastname  |   phone
----+-----------+-----------+------------
 T1 | Sarah     | Stranahan | 06******11
(1 row)

How it Works

Basically :

500 lines of PL/pgSQL
An event trigger on DDL commands
Silently creates a “masking view” upon the real table
Tricks masked users with search_path
Use of TABLESAMPLE with tsm_system_rows for random functions
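To make the "masking view" idea concrete, here is a Python sketch that builds such a view definition from a set of masking rules. It is only an illustration of the principle, not the extension's actual code: the `mask` schema name and the `masking_view_sql` helper are assumptions, and identifiers are not quoted or escaped.

```python
def masking_view_sql(schema, table, columns, rules, mask_schema="mask"):
    """Build a CREATE VIEW statement that applies the masking rules.

    `rules` maps a column name to the SQL expression replacing it.
    Illustration only: identifiers are neither quoted nor validated.
    """
    select_list = []
    for column in columns:
        if column in rules:
            select_list.append(f"{rules[column]} AS {column}")
        else:
            select_list.append(column)
    return (
        f"CREATE OR REPLACE VIEW {mask_schema}.{table} AS "
        f"SELECT {', '.join(select_list)} FROM {schema}.{table};"
    )


if __name__ == "__main__":
    print(masking_view_sql(
        "public", "people",
        ["id", "firstname", "lastname", "phone"],
        {
            "lastname": "anon.random_last_name()",
            "phone": "anon.partial(phone,2,$$******$$,2)",
        },
    ))
```

If the masked role's search_path lists the masking schema first, an unqualified `SELECT * FROM people` resolves to the view instead of the real table.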

Masking Functions

The extension provides functions to implement 5 main anonymization techniques:

Noise Addition
Shuffling / Permutation
Randomization
Faking / Synthesizing
Partial destruction

Noise Addition

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.salary
-# IS 'MASKED WITH FUNCTION
-#     anon.add_noise_on_numeric_column(employee, salary, 0.33)
-# ';

All values of the column will be randomly shifted with a ratio of +/- 33%

Noise Addition

The dataset remains meaningful
AVG() and SUM() are similar to the original
Works only for dates and numeric values
“extreme values” may cause re-identification (“singling out”)
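The technique itself can be sketched in a few lines of Python. This is a stand-alone illustration with a hypothetical `add_noise` helper, not the extension's implementation: each value is multiplied by a random factor between 1 - ratio and 1 + ratio.

```python
import random


def add_noise(values, ratio, rng=None):
    """Shift each value by a random proportion within +/- ratio."""
    if rng is None:
        rng = random.Random()
    return [v * (1 + rng.uniform(-ratio, ratio)) for v in values]


salaries = [1498, 2257, 45489, 1821]

if __name__ == "__main__":
    noisy = add_noise(salaries, 0.33, random.Random(42))
    # The third salary is still by far the largest after the noise.
    print([round(v) for v in noisy])
```

Even at +/- 33%, the highest salary stays well above all the others, which is the "singling out" risk mentioned in the last bullet.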

Shuffling

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.fk_company
-# IS 'MASKED WITH FUNCTION
-#     anon.shuffle_column(employee, fk_company, id)
-# ';

Shuffling

The dataset remains meaningful
Perfect for Foreign Keys
Works badly with low-cardinality columns (e.g. boolean)
The table must have a primary key
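A minimal Python sketch of the shuffling idea, assuming rows held as dictionaries and a hypothetical `shuffle_column` helper (the extension does this in SQL): the values of one column are permuted across rows, so every value still exists in the column but no longer sits on its original row.

```python
import random


def shuffle_column(rows, column, rng=None):
    """Return copies of the rows with one column's values permuted."""
    if rng is None:
        rng = random.Random()
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


employees = [
    {"id": 1, "fk_company": 10},
    {"id": 2, "fk_company": 20},
    {"id": 3, "fk_company": 30},
]

if __name__ == "__main__":
    print(shuffle_column(employees, "fk_company"))
```

Because the set of values is unchanged, foreign keys stay valid. With a boolean column, most rows would get back the value they already had, hence the low-cardinality caveat.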

Randomization

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.birth
-# IS 'MASKED WITH FUNCTION
-#     anon.random_date_between(''01/01/1920'',now())
-#';

Randomization

Simple and Fast
Useful for columns with NOT NULL constraints
Useless for analytics
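For comparison, a Python sketch of what a random-date function might look like. The name mirrors the masking function above, but this is an independent illustration: the result ignores the original value entirely, which is why it is fast and why it is useless for analytics.

```python
import random
from datetime import date, timedelta


def random_date_between(start, end, rng=None):
    """Return a uniformly random date in the closed range [start, end]."""
    if rng is None:
        rng = random.Random()
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


if __name__ == "__main__":
    print(random_date_between(date(1920, 1, 1), date.today()))
```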

Faking

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.lastname
-# IS 'MASKED WITH FUNCTION
-#     anon.fake_last_name()
-# ';

Faking

Just a more elaborate version of Randomization
Great for developers and CI tests
You can load your own dictionaries !

Partial Destruction

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.phone
-# IS 'MASKED WITH FUNCTION anon.partial(phone,4,$$******$$,2)';

+33142928107 becomes +331******07

Partial Destruction

Perfect for phone number, credit cards, etc.
Users can still recognize their own data
Transformation is IMMUTABLE
Works only for TEXT / VARCHAR types
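The behaviour shown in the examples can be reproduced with a short Python sketch. The argument order follows the calls above (value, prefix length, padding, suffix length); the handling of values too short to mask is an assumption of this sketch, not documented behaviour of the extension.

```python
def partial(value, prefix, padding, suffix):
    """Keep `prefix` leading and `suffix` trailing characters.

    Everything in between is replaced by `padding`.
    """
    if value is None:
        return None
    if prefix + suffix >= len(value):
        # Assumption: a value too short to hide anything is fully masked.
        return padding
    tail = value[-suffix:] if suffix > 0 else ""
    return value[:prefix] + padding + tail


if __name__ == "__main__":
    print(partial("+33142928107", 4, "******", 2))
    print(partial("0609110911", 2, "******", 2))
```

The same input always gives the same output, which matches the IMMUTABLE property listed above.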

Known Limitations

PostgreSQL 9.6 and later
Dynamic Masking works with only one schema

Future developments

Research on K-Anonymity
Measure the risk of re-identification
Suggest masking rules based on heuristics
Implement Generalization functions

How to Contribute ?

Feedback and bugs !
Images and geodata
Join the project at :

https://gitlab.com/dalibo/postgresql_anonymizer

In a nutshell

GDPR sanctions are really real
Data leaks are your main risk
Reduce your attack surface (“Storage Limitation”)
Anonymize whenever you can
Anonymize inside the database
Encryption is not Anonymization !

Our Next Challenge:

Privacy By Design

Developers should write the masking rules
It’s hard… PostgreSQL must help them.
The Postgres community has won so many battles
Now we have to focus on data privacy

We’re Hiring !

Dalibo is a French-speaking, employee-owned, remote-working company

We’re looking for:

PostgreSQL Development DBAs
PostgreSQL Production DBAs
Python Backend Developer
Key Account Manager

Grazie !

Contact : damien.clochard@dalibo.com
Follow : @daamien
Feedback : https://2019.pgconf.eu/f
Other Projects : Dalibo Labs
