Data Masking and Anonymization for PostgreSQL

The Anonymization Challenge
8 Strategies
PostgreSQL Anonymizer
The Future

My Story

Let’s do this !

Meh.

Do I really need Java or Ruby for this ?
Describe data manipulations in a YAML file ?
Extract data from Postgres and load them back ?

So I started building my own SQL script…

What is anonymization ?

Modify a dataset to avoid any identification while remaining suitable for testing, data analysis and data processing.

Why ?

Open Data
Continuous Integration
Functional Testing
Analytics
Audit
Development

Static vs Dynamic Anonymization

Dynamic Masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.
Permanent Alteration is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.

Why it’s hard

Singling out
Linkability
Indirect Identifiers

Singling Out

The possibility to isolate a record and identify a subject in the dataset.

SELECT * FROM employees;

  id  |  name          | job  | salary 
------+----------------+------+--------
 1578 | xkjefus3sfzd   | NULL |    1498 
 2552 | cksnd2se5dfa   | NULL |    2257 
 5301 | fnefckndc2xn   | NULL |   45489
 7114 | npodn5ltyp3d   | NULL |    1821

Linkability

Identify a subject in the dataset using other datasets

Netflix Ratings + IMDB Ratings ¹
Hospital visits + State voting records ²

Indirect Identifiers

87% of the U.S. Population are uniquely identified by date of birth, gender and zip code ³

This is a losing game !

You can’t measure the usefulness of the anonymized dataset

You can’t prove that re-identification is impossible ⁴

« Anonymization is hard » (WP29 Opinion 05/2014)
« Pseudonymization is enough » (Recital 83)
« Data Protection By Design » ( Article 25 )

8 strategies to anonymize data

Example

CREATE TABLE marriott_client (
    id SERIAL,
    name TEXT NOT NULL,
    passwd TEXT NOT NULL,
    address TEXT,
    age INTEGER,
    points INTEGER,
    phone TEXT
);

0. Sampling

-- Work only on 20% of a table

SELECT * FROM marriott_client 
TABLESAMPLE BERNOULLI(20);

0. Sampling

Sampling is not Anonymization but….

Direct implementation of the “Data Minimisation” principle of GPDR
Reducing dataset will reduce the risk of re-identification
The anonymization process will be faster
Use pg_sample to keep referential integrity among several tables

1. Suppression

-- Just remove the data

UPDATE marriott_client 
SET name = NULL;

UPDATE marriott_client 
SET points= 0;

UPDATE marriott_client 
SET passwd = "CONFIDENTIAL";

1. Suppression

Simple and Efficient
Direct implementation of the “Data Minimisation” principle of GPDR
Breaks integrity constraints ( CHECK / NOT NULL)
Useless for functional testing

2. Random Substitution

-- Replace the data with a purely random value

UPDATE marriott_client 
SET name = md5(random()::text);

UPDATE marriott_client 
SET points= 100*random();

2. Random Substitution

Simple and Fast
Avoid breaking NOT NULL constraints
Still useless for functional testing

3. Adding Noise

-- Randomly "shifting" the value of +/- 25%
UPDATE marriott_client 
SET points = points *  (1+(2*random()-1) * 0.25) ;

3. Adding Noise

The dataset remains meaningful
AVG() and SUM() are similar to the original
works only for dates and numeric values
“extreme values” may cause re-identification (“singling out”)

4. Encryption

-- uses an encryption algorithm 
CREATE EXTENSION pgcrypto;

-- generate a random salt and throw it away
UPDATE marriott_client
SET name = crypt('name', gen_salt('md5'));

4. Encryption

Respect the UNIQUE constraint
Possible implementation of “Pseudonymization”
The transformation can be IMMUTABLE
Functional testing is weird
If the key is stolen, authentic data can be revealed.

5. Shuffling

-- Mixing values within the same column
WITH p1 AS (
    SELECT row_number() over (order by random()) n,
           points AS points1
    FROM marriott_client),  
p2 AS (
    SELECT row_number() over (order by random()) n,
           id AS id2
    FROM marriott_client )
UPDATE marriott_client
SET points = p1.points1
FROM p1 join p2 on p1.n = p2.n
WHERE id = p2.id2;

5. Shuffling

The dataset remains meaningful
Perfect for Foreign Keys
Works bad with low distribution (ex: boolean)

6. Faking / Mocking

-- replace data with **random-but-plausible** values.

UPDATE marriott_client 
SET address = fake_address();

6. Faking / Mocking

The faking function is hard to write (see faker)
For complex data types, it’s hard produce relevant synthetic data
Not appropriate for analytics because the values are not “real”

7. Partial Suppression

--  "01 42 92 81 00" becomes "XX XX XX 81 00"

UPDATE marriott_client
SET phone = 'XX XX XX ' || substring(phone FROM 9 FOR 5 );

7. Partial Suppression

The user can still recognize his/her own data
Transformation is IMMUTABLE
Works only for TEXT / VARCHAR types

8. Generalization

-- Instead of "Client X is 28 years old", 
-- Let's say  "Client X is between 20 and 30 years old." 

CREATE TABLE anonymous_client
AS  SELECT
        id,
        '*' AS name,
        int4range(age/10*10, (age/10+1)*10) AS age
    FROM marriott_client
;

8. Generalization

The data type has changed
Breaks CI, functional tests and any use related to the application.
Fine for data analytics and aggregation.
Risk of singling-out.

Recap

Suppression : Useless attributes
Random : Useless attributes with constraints
Noise (Numeric and Dates) : Dev / CI / Functional Testing
Encryption (Text) : UNIQUE attributes
Shuffle : Foreign keys / Analytics
Faking : Dev / CI / Functional Testing
Partial (Text) : Direct Identifiers
Generalization ( Numeric and Dates) : Analytics

https://gitlab.com/dalibo/postgresql_anonymizer/

What is this ?

Started as a personal project last year
Now part of the “Dalibo Labs” initiative
This is a prototype !

Goals

Transform data inside PostgreSQL
Implement useful features ( noise, shuffling, faking, etc.)
Define anonymization policy with SQL statements
PoC for Dynamic Masking

Install

$ sudo pgxn install postgresql_anonymizer

Load

CREATE EXTENSION anon CASCADE;
SELECT anon.load();

Random

UPDATE marriott_client
SET birth=anon.random_date_between('01/01/1920',now());

Noise

-- shift date d1 on table t1 by +/- 2 years
SELECT anon.add_noise_on_datetime_column(t1,d1,'2 years');

Shuffle

SELECT anon.shuffle_column(marriott_client,points);

Faking

UPDATE marriott_client
SET company = anon.fake_company();

Partial

--  replace 01 42 92 81 00 by XX XX XX 81 00
UPDATE marriott_client
SET phone = anon.partial(phone,0, 'XX XX XX ', 5);

Declarative Dynamic Masking !

Regular user can see the real data
Others can only view anonymized data

Create a masked user

CREATE ROLE skynet;
COMMENT ON ROLE skynet IS 'MASKED';

Put masks on columns

-- Random Mask
COMMENT ON COLUMN marriott_client.surname
IS 
'MASKED WITH FUNCTION anon.random_last_name()';

-- Partial Mask 
COMMENT ON COLUMN marriott_client.phone
IS 
'MASKED WITH FUNCTION anon.partial(phone,2,$$*-***-**$$,2)';

Normal user will see :

SELECT * FROM marriott_client WHERE id = '800';
  id  | firstname | surname      |   phone
------+-----------+--------------+------------
  800 | Sarah     | Connor       | 408-555-1439
(1 row)

Masked user will see :

SET ROLE skynet;
SELECT * FROM marriott_client WHERE id = '800';
  id  | firstname | surname      |   phone
------+-----------+--------------+------------
  800 | Sarah     | Nunziata     | 40*-***-*19
(1 row)

Masked users can’t read/write masked columns

SET ROLE skynet;
SELECT * FROM marriott_client 
WHERE surname ILIKE 'CONNOR';
(0 rows)

SET ROLE skynet;
DELETE FROM marriott_client 
WHERE surname ILIKE 'CONNOR';
ERROR:  permission denied for view marriott_client

How it Works

Just 1 big fat hack

Basically :

500 lines of pl/pgsql
A misappropriation of the COMMENT syntax
An event trigger on DDL commands
Silently creates a “masking view” upon the real table
Tricks masked users with search_path
use of TABLESAMPLE with tms_system_rows for random functions

Limits

PostgreSQL 9.6 and later
Only one schema
What if the columns COMMENTs are really used ?
Masked users can’t use pg_dump
Performances ?
SELECT * from pg_stats

Toward «Anonymity by Design»

Any anonymization policy should be defined as close as possible to the data. Just like integrity constraints, security rules and triggers. PostgreSQL should allow developers to declare how a column will be masked.

Extend Postgres DDL dialect

ALTER TABLE  marriott_client
ALTER COLUMN email 
ADD MASK WITH FUNCTION foo(email);

GRANT UNMASK TO admin;

( MS SQL Server already has it )

The plan

Step 1 : build a PoC in pl/pgsql
Step 2 : implement C functions
Step 3 : build a patch for Postgres ?

How to HELP us

Feedback and bugs !
images and geodata
Ideas on how to implement this as an extension
Advice about extending the SQL syntax
Research on K-Anonymity and Differential_Privacy

Contact me !

Free software communities must lead the way to build a future where privacy and anonymity are available to everyone. And of course PostgreSQL has an important role to in this domain because it’s by far the wolrd’s most dynamic and innovative database engine.

Photo Credits

qiagen, arvin_benitez, tonynewell, gorbould, vshioshvili, w4nd3rl0st, ysn, 125222692@N04, thecostumeguild

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf↩︎
https://techscience.org/a/2015092903/↩︎
http://latanyasweeney.org/work/identifiability.html↩︎
http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf↩︎

Data Masking and Anonymization for PostgreSQL

My Story

Let’s do this !

Meh.

What is anonymization ?

Why ?

Static vs Dynamic Anonymization

Why it’s hard

Singling Out

Linkability

Indirect Identifiers

This is a losing game !

What does the GDPR say ?

8 strategies to anonymize data

Example

0. Sampling

0. Sampling

0. Sampling

1. Suppression

1. Suppression

1. Suppression

2. Random Substitution

2. Random Substitution

2. Random Substitution

3. Adding Noise

3. Adding Noise

3. Adding Noise

4. Encryption

4. Encryption

4. Encryption

5. Shuffling

5. Shuffling

5. Shuffling

6. Faking / Mocking

6. Faking / Mocking

6. Faking / Mocking

7. Partial Suppression

7. Partial Suppression

7. Partial Suppression

8. Generalization

8. Generalization

8. Generalization

Recap

What is this ?

Goals

Install

Load

Random

Noise

Shuffle

Faking

Partial

Declarative Dynamic Masking !

Create a masked user

Put masks on columns

Normal user will see :

Masked user will see :

Masked users can’t read/write masked columns

How it Works

Just 1 big fat hack

Limits

Toward «Anonymity by Design»

Extend Postgres DDL dialect

The plan

How to HELP us

Contact me !

Not Just about GDPR….

Links

More Links

Photo Credits