PostgreSQL Anonymizer

Beyond GDPR

Who I Am

  • Damien Clochard

  • PostgreSQL DBA & Co-founder at Dalibo

  • President of PostgreSQLFr Association

Who I Am Not

  • I Am Not A Lawyer

  • I Am Not A Privacy Expert

  • Don’t take my word for it / Check the links !

My Journey

  • Why Anonymization is hard

  • Anonymization Pipelines

  • PostgreSQL Anonymizer

Why Anonymization is hard



(source: WP29 Opinion on Anonymisation Techniques)

Singling Out

The possibility to isolate a record and identify a subject in the dataset.

SELECT * FROM employees;

  id  |  name          | job  | salary
------+----------------+------+--------
 1578 | xkjefus3sfzd   | NULL |    1498
 2552 | cksnd2se5dfa   | NULL |    2257
 5301 | fnefckndc2xn   | NULL |   45489
 7114 | npodn5ltyp3d   | NULL |    1821

Linkability

Identify a subject in the dataset using other datasets

  • Netflix Ratings + IMDB Ratings

  • Hospital visits + State voting records



(sources: Netflix prize + Hospital Reidentification )

Inference

Identify a subject using a set of indirect identifiers.


87% of the U.S. population are uniquely identified by date of birth, gender and zip code


(source : Latanya Sweeney)

Anonymization Pipelines

Minimizing the risk of data leaks by reducing the attack surface

Basic Example

Worst Scenario

ETL

Cloud Anonymization

PostgreSQL Anonymizer

What is this ?

  • Started as a research project in 2018

  • Now part of the “Dalibo Labs” initiative

  • Currently in beta

  • Version 1.0 is coming by the end of 2020

Goals

  • Declare masking rules within the database model

  • Anonymization is done internally

  • Dynamic Masking / Anonymous Export / In-Place Masking

  • Batteries included : Builtin masking functions

  • Inspired by MS SQL Server Dynamic Data Masking

Example: Real Data

=# SELECT * FROM customer;
 id  |   full_name      |   birth    | zipcode | fk_shop
-----+------------------+------------+---------+---------
 911 | Chuck Norris     | 1940-03-10 | 75001   | 12
 112 | David Hasselhoff | 1952-07-17 | 90001   | 423

Example: Anonymized Data

=# SELECT * FROM customer;
 id  |     full_name     |   birth    | zipcode | fk_shop
-----+-------------------+------------+---------+---------
 911 | Michel Duffus     | 1970-03-24 | 63824   | 12
 112 | Andromache Tulip  | 1921-03-24 | 38199   | 423

Install

Using the Community RPM Repo:

$ yum install https://.../pgdg-redhat-repo-latest.noarch.rpm
$ yum install postgresql_anonymizer12

Load & Init

ALTER DATABASE foo SET session_preload_libraries = 'anon';
CREATE EXTENSION anon CASCADE;
SELECT anon.init();

Declare a masking rule

SECURITY LABEL FOR anon
ON COLUMN customer.zipcode
IS 'MASKED WITH FUNCTION anon.random_zipcode()';

Example

CREATE TABLE player( id SERIAL, name TEXT, points INT);

INSERT INTO player VALUES
  ( 1, 'Kareem Abdul-Jabbar', 38387),
  ( 5, 'Michael Jordan', 32292 );

SECURITY LABEL FOR anon ON COLUMN player.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';

SECURITY LABEL FOR anon ON COLUMN player.id
  IS 'MASKED WITH VALUE NULL';

Now we have 3 options

  • In-Place Anonymization
  • Anonymous Dumps
  • Dynamic Masking

In-Place Anonymization

=# SELECT anon.anonymize_column('customer','zipcode');
=# SELECT anon.anonymize_table('customer');
=# SELECT anon.anonymize_database();

In-Place Anonymization

This will update all lines of all tables containing at least one masking rule.

This is gonna be slow and trigger heavy write workloads.

Anonymous Dumps

$ pg_dump_anon -h localhost -U bob foo > anonymous_dump.sql

Dynamic Masking

Let’s take a basic example :

=# SELECT * FROM people;
 id | fistname | lastname |   phone
----+----------+----------+------------
 T1 | Sarah    | Conor    | 0609110911
(1 row)

Dynamic Masking

Step 1 : Activate the dynamic masking engine

=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.start_dynamic_masking();

Dynamic Masking

Step 2 : Declare a masked user

CREATE ROLE skynet LOGIN;

SECURITY LABEL FOR anon ON ROLE skynet
  IS 'MASKED';

The masked user has a read-only access to the anonymized data of the masked tables.

Dynamic Masking

Step 3 : Declare the masking rules

SECURITY LABEL FOR anon 
  ON COLUMN people.name
  IS 'MASKED WITH FUNCTION anon.random_last_name()';

SECURITY LABEL FOR anon 
  ON COLUMN people.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';

Dynamic Masking

Step 4 : Connect with the masked user

=# \! psql peopledb -U skynet -c 'SELECT * FROM people;'
 id | fistname | lastname  |   phone
----+----------+-----------+------------
 T1 | Sarah    | Stranahan | 06******11
(1 row)

How it Works

Batteries Included: 10 Masking techniques

  • Destruction
  • Noise Addition
  • Shuffling / Permutation
  • Randomization
  • Faking / Synthetizing
  • Advanced Faking
  • Pseudonymization
  • Hashing
  • Partial Scrambling
  • Generalization

Destruction

SECURITY LABEL FOR anon
  ON COLUMN users.address
  IS 'MASKED WITH VALUE ''CONFIDENTIAL'' ';

Destruction

  • Simple, Fast, Efficient

  • Required for NOT NULL columns

Noise Addition

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.salary
-# IS 'MASKED WITH FUNCTION
-#     anon.add_noise_on_numeric_column(user, salary, 0.33)
-# ';

All values of the column will be randomly shifted with a ratio of +/- 33%

Noise Addition

  • The dataset remains meaningful

  • AVG() and SUM() are similar to the original

  • works only for dates and numeric values

  • “extreme values” may cause re-identification (“singling out”)

  • risk repetition attack, especially with dynamic masking

Shuffling

SECURITY LABEL FOR anon
   ON COLUMN employee.fk_company
   IS 'MASKED WITH FUNCTION
     anon.shuffle_column(employee, fk_company, id)
';

Shuffling

  • The dataset remains meaningful

  • Perfect for Foreign Keys

  • Works bad with low distribution (ex: boolean)

  • The table must have a primary key

Randomization

SECURITY LABEL FOR anon
   ON COLUMN employee.birth
   IS 'MASKED WITH FUNCTION
     anon.random_date_between(''01/01/1920'',now())
';

Randomization

  • Simple and Fast

  • Usefull for columns with NOT NULL constraints

  • Useless for analytics

Faking

SECURITY LABEL FOR anon
  ON COLUMN employee.lastname
  IS 'MASKED WITH FUNCTION 
      anon.fake_last_name()
';

Faking

  • Just a more realistic version of Randomization

  • Great for developpers and CI tests

  • You can load your own dictionnaries !

  • Very basic implementation

Advanced Faking

CREATE EXTENSION faker SCHEMA faker CASCADE;

SELECT faker.faker('it_IT');

SELECT faker.name();
     name
---------------------------
   Sig. Alighieri Monti

Advanced Faking

  • Based on the well-known Python Faker library

  • Complete, Powerful, Extensible

  • Slow

Partial Scrambling

=# SECURITY LABEL FOR anon
-# ON COLUMN employee.phone
-# IS 'MASKED WITH FUNCTION anon.partial(phone,4,'******',2)';

+33142928107 becomes +331******07

Partial Destruction

  • Similar to the “Destruction” approach

  • Perfect for phone number, credit cards, etc.

  • The user can still recognize his/her own data

  • Transformation is IMMUTABLE

  • Works only for TEXT / VARCHAR types

Pseudonymization

SECURITY LABEL FOR anon
  ON COLUMN users.city
  IS 'MASKED WITH FUNCTION anon.pseudo_city(users.email) ';

Pseudonymization

This is an IMMUTABLE transformation:

 SELECT anon.pseudo_city('bob@gmail.com');
 pseudo_city
-------------
 Moriki

SELECT anon.pseudo_city('bob@gmail.com');
 pseudo_city
-------------
 Moriki

Pseudonymization

  • Useful for Foreing Keys and UNIQUE columns

  • You can build an index on pseudonymized columns

  • Pseudonymized Data are still covered by GDPR !

Hashing

SECURITY LABEL FOR anon
  ON COLUMN users.login
  IS 'MASKED WITH FUNCTION anon.pseudo_email(users.login) ';

Hashing

SELECT anon.hash('bob@gmail');
                   hash
------------------------------------------------------------------
95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0

Generalization

SELECT * FROM patient;
     ssn     | firstname |  zip  |   birth    |  disease    
-------------+-----------+---------+------------+---------------
 253-51-6170 | Alice     | 47012 | 1989-12-29 | Flu
 091-20-0543 | Bob       | 42678 | 1979-03-22 | Allergy
 565-94-1926 | Caroline  | 42678 | 1971-07-22 | Flu
 510-56-7882 | Eleanor   | 47909 | 1989-12-15 | Acne

Generalization

CREATE MATERIALIZED VIEW generalized_patient AS
SELECT
  'REDACTED'::TEXT AS firstname,
  anon.generalize_int4range(zipcode,1000) AS zipcode,
  anon.generalize_daterange(birth,'decade') AS birth,
  disease
FROM patient;

Generalization

SELECT * FROM generalized_patient;
 firstname |      zip      |          birth          |  disease    
-----------+---------------+-------------------------+---------------
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Flu
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Allergy
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Flu
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Acne

Generalization

  • The data remains true but less precise

  • Ideal for data science, reporting and analytics (RANGE types)

  • The degree of Anonymization can be measured with the k-anonymity function

  • Dynamic masking won’t work (because the data model has changed)

  • Can’t be used in CI

Write your own Masks !

  • Use your own set of fake data

  • write simple SQL functions, easy to test and maintain

  • Useful for JSON columns

In a Nutshell

  • Write your masking rules inside the database

  • Different strategies for different use cases

  • Combine with other tools (pg_sample, pg_audit, etc.)

thanks !