Data Masking and Anonymization for PostgreSQL
- The Anonymization Challenge
- 8 Strategies
- PostgreSQL Anonymizer
- The Future
Meh.
Do I really need Java or Ruby for this ?
Describe data manipulations in a YAML file ?
Extract data from Postgres and load them back ?
So I started building my own SQL script…
What is anonymization ?
Modify a dataset to avoid any identification while remaining suitable for testing, data analysis and data processing.
Why ?
- Open Data
- Continuous Integration
- Functional Testing
- Analytics
- Audit
- Development
Static vs Dynamic Anonymization
Dynamic Masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.
Permanent Alteration is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.
Why it’s hard
Singling out
Linkability
Indirect Identifiers
Singling Out
The possibility to isolate a record and identify a subject in the dataset.
SELECT * FROM employees;
id | name | job | salary
------+----------------+------+--------
1578 | xkjefus3sfzd | NULL | 1498
2552 | cksnd2se5dfa | NULL | 2257
5301 | fnefckndc2xn | NULL | 45489
7114 | npodn5ltyp3d | NULL | 1821
Linkability
Identify a subject in the dataset using other datasets
- Netflix Ratings + IMDB Ratings
- Hospital visits + State voting records
Indirect Identifiers
87% of the U.S. Population are uniquely identified by date of birth, gender and zip code
This is a losing game !
You can’t measure the usefulness of the anonymized dataset
You can’t prove that re-identification is impossible
8 strategies to anonymize data
Example
CREATE TABLE marriott_client (
id SERIAL,
name TEXT NOT NULL,
passwd TEXT NOT NULL,
address TEXT,
age INTEGER,
points INTEGER,
phone TEXT
);
0. Sampling
-- Work only on 20% of a table
SELECT * FROM marriott_client
TABLESAMPLE BERNOULLI(20);
0. Sampling
Sampling is not Anonymization but….
Direct implementation of the “Data Minimisation” principle of GPDR
Reducing dataset will reduce the risk of re-identification
The anonymization process will be faster
Use pg_sample to keep referential integrity among several tables
1. Suppression
-- Just remove the data
UPDATE marriott_client
SET name = NULL;
UPDATE marriott_client
SET points= 0;
UPDATE marriott_client
SET passwd = "CONFIDENTIAL";
1. Suppression
Simple and Efficient
Direct implementation of the “Data Minimisation” principle of GPDR
Breaks integrity constraints ( CHECK
/ NOT NULL
)
Useless for functional testing
2. Random Substitution
-- Replace the data with a purely random value
UPDATE marriott_client
SET name = md5(random()::text);
UPDATE marriott_client
SET points= 100*random();
3. Adding Noise
-- Randomly "shifting" the value of +/- 25%
UPDATE marriott_client
SET points = points * (1+(2*random()-1) * 0.25) ;
3. Adding Noise
The dataset remains meaningful
AVG()
and SUM()
are similar to the original
works only for dates and numeric values
“extreme values” may cause re-identification (“singling out”)
4. Encryption
-- uses an encryption algorithm
CREATE EXTENSION pgcrypto;
-- generate a random salt and throw it away
UPDATE marriott_client
SET name = crypt('name', gen_salt('md5'));
4. Encryption
Respect the UNIQUE
constraint
Possible implementation of “Pseudonymization”
The transformation can be IMMUTABLE
Functional testing is weird
If the key is stolen, authentic data can be revealed.
5. Shuffling
-- Mixing values within the same column
WITH p1 AS (
SELECT row_number() over (order by random()) n,
points AS points1
FROM marriott_client),
p2 AS (
SELECT row_number() over (order by random()) n,
id AS id2
FROM marriott_client )
UPDATE marriott_client
SET points = p1.points1
FROM p1 join p2 on p1.n = p2.n
WHERE id = p2.id2;
6. Faking / Mocking
-- replace data with **random-but-plausible** values.
UPDATE marriott_client
SET address = fake_address();
6. Faking / Mocking
The faking function is hard to write (see faker)
For complex data types, it’s hard produce relevant synthetic data
Not appropriate for analytics because the values are not “real”
7. Partial Suppression
-- "01 42 92 81 00" becomes "XX XX XX 81 00"
UPDATE marriott_client
SET phone = 'XX XX XX ' || substring(phone FROM 9 FOR 5 );
7. Partial Suppression
The user can still recognize his/her own data
Transformation is IMMUTABLE
Works only for TEXT / VARCHAR types
8. Generalization
-- Instead of "Client X is 28 years old",
-- Let's say "Client X is between 20 and 30 years old."
CREATE TABLE anonymous_client
AS SELECT
id,
'*' AS name,
int4range(age/10*10, (age/10+1)*10) AS age
FROM marriott_client
;
8. Generalization
The data type has changed
Breaks CI, functional tests and any use related to the application.
Fine for data analytics and aggregation.
Risk of singling-out.
Recap
- Suppression : Useless attributes
- Random : Useless attributes with constraints
- Noise (Numeric and Dates) : Dev / CI / Functional Testing
- Encryption (Text) : UNIQUE attributes
- Shuffle : Foreign keys / Analytics
- Faking : Dev / CI / Functional Testing
- Partial (Text) : Direct Identifiers
- Generalization ( Numeric and Dates) : Analytics
What is this ?
- Started as a personal project last year
- Now part of the “Dalibo Labs” initiative
- This is a prototype !
Goals
- Transform data inside PostgreSQL
- Implement useful features ( noise, shuffling, faking, etc.)
- Define anonymization policy with SQL statements
- PoC for Dynamic Masking
Install
$ sudo pgxn install postgresql_anonymizer
Load
CREATE EXTENSION anon CASCADE;
SELECT anon.load();
Random
UPDATE marriott_client
SET birth=anon.random_date_between('01/01/1920',now());
Noise
-- shift date d1 on table t1 by +/- 2 years
SELECT anon.add_noise_on_datetime_column(t1,d1,'2 years');
Shuffle
SELECT anon.shuffle_column(marriott_client,points);
Faking
UPDATE marriott_client
SET company = anon.fake_company();
Partial
-- replace 01 42 92 81 00 by XX XX XX 81 00
UPDATE marriott_client
SET phone = anon.partial(phone,0, 'XX XX XX ', 5);
Declarative Dynamic Masking !
- Regular user can see the real data
- Others can only view anonymized data
Create a masked user
CREATE ROLE skynet;
COMMENT ON ROLE skynet IS 'MASKED';
Put masks on columns
-- Random Mask
COMMENT ON COLUMN marriott_client.surname
IS
'MASKED WITH FUNCTION anon.random_last_name()';
-- Partial Mask
COMMENT ON COLUMN marriott_client.phone
IS
'MASKED WITH FUNCTION anon.partial(phone,2,$$*-***-**$$,2)';
Normal user will see :
SELECT * FROM marriott_client WHERE id = '800';
id | firstname | surname | phone
------+-----------+--------------+------------
800 | Sarah | Connor | 408-555-1439
(1 row)
Masked user will see :
SET ROLE skynet;
SELECT * FROM marriott_client WHERE id = '800';
id | firstname | surname | phone
------+-----------+--------------+------------
800 | Sarah | Nunziata | 40*-***-*19
(1 row)
Masked users can’t read/write masked columns
SET ROLE skynet;
SELECT * FROM marriott_client
WHERE surname ILIKE 'CONNOR';
(0 rows)
SET ROLE skynet;
DELETE FROM marriott_client
WHERE surname ILIKE 'CONNOR';
ERROR: permission denied for view marriott_client
How it Works
Just 1 big fat hack
Basically :
- 500 lines of pl/pgsql
- A misappropriation of the
COMMENT
syntax
- An event trigger on DDL commands
- Silently creates a “masking view” upon the real table
- Tricks masked users with
search_path
- use of TABLESAMPLE with
tms_system_rows
for random functions
Limits
- PostgreSQL 9.6 and later
- Only one schema
- What if the columns COMMENTs are really used ?
- Masked users can’t use pg_dump
- Performances ?
SELECT * from pg_stats
Toward «Anonymity by Design»
Any anonymization policy should be defined as close as possible to the data. Just like integrity constraints, security rules and triggers. PostgreSQL should allow developers to declare how a column will be masked.
Extend Postgres DDL dialect
ALTER TABLE marriott_client
ALTER COLUMN email
ADD MASK WITH FUNCTION foo(email);
GRANT UNMASK TO admin;
( MS SQL Server already has it )
The plan
Step 1 : build a PoC in pl/pgsql
Step 2 : implement C functions
Step 3 : build a patch for Postgres ?
How to HELP us
Feedback and bugs !
images and geodata
Ideas on how to implement this as an extension
Advice about extending the SQL syntax
Research on K-Anonymity and Differential_Privacy
Not Just about GDPR….
Free software communities must lead the way to build a future where privacy and anonymity are available to everyone. And of course PostgreSQL has an important role to in this domain because it’s by far the wolrd’s most dynamic and innovative database engine.