Damien Clochard
Co-founder of Dalibo
Active member of the French PostgreSQL community
In 2018, someone asked me to anonymize a database
I thought it would be quick and easy
I started writing a bunch of PL/pgSQL helper functions…
I called those functions PostgreSQL Anonymizer
… and somehow 7 years later I’m still working on it
This is my story :)
The Paradox of Privacy
6 principles to protect Data Privacy
How to implement them with PostgreSQL Anonymizer
« What is privacy ? »
As our everyday lives become more and more digital,
we want more and more privacy.
If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.
Eric Schmidt, Google CEO, 2009
Collecting and selling your personal info is their business.
The data broker industry is worth $200 billion a year.
That’s the GDP of Hungary
Do you know any of these companies ?
| Top 5 Data Brokers | Annual Revenue |
|---|---|
| Experian | $9.7 billion |
| Equifax | $5.1 billion |
| Epsilon | $2.9 billion |
| Acxiom | $2.7 billion |
| CoreLogic | $1 billion |
Those 5 brokers combined are bigger than Red Hat
There are actually 200+ data brokers
In theory, you can send a letter asking them to remove your data
… and do it again next month
Of course, some companies will write these letters for you for $10/month !
This market is absolutely NOT regulated
… and the number of data leaks is exploding
In 2024, 1.7 billion individual data breach notifications were issued.
That’s +300% compared to 2023
People want more Privacy
Tech Companies want more data
The masking policy should be written by the application developers
Define different roles with different masking rules
Minimizing the risk of data leaks by reducing the places where the data is stored
The type and amount of personal information you collect must be limited to what is directly relevant and necessary to accomplish a specified purpose.
Frequent impact assessments to evaluate the origin, nature, particularity and severity of the risk of leaking personal data
If you’re not sure whether or not a column contains personal information,
you should treat it like it contains personal information.
An open-source Postgres extension published by Dalibo
In production since 2020
Written in Rust + PL/pgSQL
Install it via RPM / DEB / Docker / Ansible
Available on Google Cloud SQL, Azure, Crunchy Bridge, Neon, …
Many outside contributors : DGFIP, ANR, …
5 different masking methods
10 types of masking functions
The extension takes a declarative approach to anonymization.
Declare masking rules within the database model
Using the SECURITY LABEL syntax
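A minimal sketch of such a declaration. The `people` table and its columns are hypothetical, and the two masking functions are quoted from the extension's documentation as I recall it, so check them against the version you have installed:

```sql
-- Hypothetical table, for illustration only
CREATE TABLE people (
  id        SERIAL PRIMARY KEY,
  firstname TEXT,
  lastname  TEXT,
  phone     TEXT
);

-- Assumption: the anon library is already installed and preloaded
-- for this database, as described in the extension's install guide
CREATE EXTENSION IF NOT EXISTS anon;

-- A masking rule is a security label attached to a column
SECURITY LABEL FOR anon ON COLUMN people.lastname
  IS 'MASKED WITH FUNCTION anon.dummy_last_name()';

-- Keep the first 2 and last 2 characters, hide the middle
SECURITY LABEL FOR anon ON COLUMN people.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 2, $$******$$, 2)';
```

Because the rules live in the database model, they are versioned, dumped and restored together with the schema.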
Declare that a role is MASKED
The masking rules will be automatically applied to this role
By definition a MASKED role is read-only
Other roles can still read and write the real data
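A sketch of how a masked role might be declared, assuming a hypothetical table `people` that already carries masking rules, a hypothetical role `skynet` and a hypothetical database `mydb`. The activation setting is the one I recall from the 2.x documentation; older releases used a different mechanism:

```sql
-- Hypothetical role that should only ever see masked data
CREATE ROLE skynet LOGIN;

-- Declare the role as masked
SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';

-- The role still needs regular read privileges on the table
GRANT SELECT ON people TO skynet;

-- Assumption: in extension 2.x, dynamic masking is switched on
-- per database with this setting (it applies to new sessions)
ALTER DATABASE mydb SET anon.transparent_dynamic_masking TO true;
```

When `skynet` runs `SELECT * FROM people`, it should get the masked values, while the table owner keeps seeing the real ones.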
You can now share and reload the anonymous_dump.sql file anywhere you want
Use the anonymous dump to refresh development environments
This is the regular pg_dump
--format=custom is supported
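The export described above can be sketched as follows. The role name `dump_anon`, the database name `mydb` and the password are hypothetical, and the `pg_dump` options are the ones I recall from the extension's documentation, so treat them as assumptions to verify:

```sql
-- Hypothetical dedicated role used only for exports
CREATE ROLE dump_anon LOGIN PASSWORD 'change-me';

-- The export role is a masked role, so it only reads masked data
SECURITY LABEL FOR anon ON ROLE dump_anon IS 'MASKED';
ALTER ROLE dump_anon SET anon.transparent_dynamic_masking = True;

-- Read access to every table (predefined role, PostgreSQL 14+)
GRANT pg_read_all_data TO dump_anon;

-- Then, from a shell, run the regular pg_dump as that role:
--   pg_dump mydb --username=dump_anon --no-security-labels \
--           --exclude-extension=anon --file=anonymous_dump.sql
-- Assumption: --exclude-extension needs pg_dump 17 or later.
```

Dropping the security labels from the dump keeps the masking rules themselves out of the exported file.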
In most use cases, the masked role does not need to access ALL the data
A sample is sufficient for tests, analytics, demo, training data
This is called sampling
Did you know that PostgreSQL already has a TABLESAMPLE clause ?
Let’s say you have a huge amount of HTTP logs stored in a table. You want to remove the IP addresses and extract only 10% of the table:
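A sketch of this scenario, assuming a hypothetical `http_logs` table. The first query is plain PostgreSQL; the two security labels use the rule syntax as I recall it from the extension's documentation:

```sql
-- Hypothetical table of HTTP logs
CREATE TABLE http_logs (
  id          INTEGER NOT NULL,
  date_opened DATE,
  ip_address  INET,
  url         TEXT
);

-- Plain PostgreSQL: read roughly 10% of the rows, without the IPs
SELECT id, date_opened, NULL::INET AS ip_address, url
FROM http_logs TABLESAMPLE BERNOULLI (10);

-- The same intent declared as masking rules
SECURITY LABEL FOR anon ON COLUMN http_logs.ip_address
  IS 'MASKED WITH VALUE NULL';

SECURITY LABEL FOR anon ON TABLE http_logs
  IS 'TABLESAMPLE BERNOULLI(10)';
```

BERNOULLI picks each row independently with the given probability, so the sample size is approximate and changes from one run to the next.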
Another approach for sampling is to use Row Level Security Policies, also known as RLS or Row Security Policies.
Let’s use the same example as above, this time we want to define a limit so the masked users can only see the logs of the last 6 months.
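A sketch of such a policy, assuming a hypothetical `http_logs` table with a `date_opened` column. The policy name is made up, and `anon.hasmask()` is the helper I recall from the extension's documentation for testing whether a role is masked:

```sql
-- Row Level Security must be enabled on the table first
ALTER TABLE http_logs ENABLE ROW LEVEL SECURITY;

-- Unmasked roles see every row;
-- masked roles only see rows from the last 6 months
CREATE POLICY http_logs_sampling_for_masked_users
  ON http_logs
  USING (
    NOT anon.hasmask(CURRENT_USER::REGROLE)
    OR date_opened >= now() - INTERVAL '6 months'
  );
```

Unlike TABLESAMPLE, this filter is deterministic: the masked users always get the same window of rows.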
The extension provides an implementation of k-anonymity
A detection function to find columns that should have a mask
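A sketch of how the detection function might be called. Both function names and the `'en_US'` dictionary argument are taken from the extension's documentation as I recall it, so treat them as assumptions:

```sql
-- Load the extension's default dictionaries,
-- which the detection relies on
SELECT anon.init();

-- List the columns whose names look like personal identifiers,
-- using the English dictionary
SELECT * FROM anon.detect('en_US');
```

The detection is based on column names only, so it is a starting point for writing masking rules rather than a guarantee that nothing was missed.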
k-anonymity is an industry-standard term used to describe a property of an anonymized dataset.
No anonymized individual can be distinguished from at least k-1 other individuals inside the dataset
The higher the value, the better…
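The definition above can be checked with plain SQL: k is the size of the smallest group of rows sharing the same combination of indirect identifiers. A sketch, assuming a hypothetical `generalized_patient` table; the extension calls at the end are quoted from its documentation as I recall it:

```sql
-- Hypothetical anonymized table: zipcode and birth_decade are
-- the indirect identifiers, disease is the sensitive value
CREATE TABLE generalized_patient (
  zipcode      TEXT,
  birth_decade INTEGER,
  disease      TEXT
);

-- Plain PostgreSQL: size of the smallest group of look-alike rows
SELECT min(group_size) AS k
FROM (
  SELECT count(*) AS group_size
  FROM generalized_patient
  GROUP BY zipcode, birth_decade
) AS g;

-- With the extension: declare the indirect identifiers,
-- then ask for the k factor of the table
SECURITY LABEL FOR k_anonymity ON COLUMN generalized_patient.zipcode
  IS 'INDIRECT IDENTIFIER';
SECURITY LABEL FOR k_anonymity ON COLUMN generalized_patient.birth_decade
  IS 'INDIRECT IDENTIFIER';

SELECT anon.k_anonymity('generalized_patient');
```

A result of 1 means at least one individual is unique on those columns and can therefore be singled out.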
Imagine a database named access_logs with a basic table containing HTTP logs:
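Such a table could look like this; the column names and defaults are hypothetical:

```sql
-- Hypothetical HTTP log table inside the access_logs database
CREATE TABLE public.access_logs (
  date_open        TIMESTAMP,
  session_id       TEXT,
  ip_addr          INET,
  url              TEXT,
  browser          TEXT DEFAULT 'unknown',
  operating_system TEXT DEFAULT NULL,
  locale           TEXT
);
```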
Now let’s activate privacy by default:
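A sketch of the activation, using the setting name I recall from the extension's documentation:

```sql
-- Assumption: privacy by default is a per-database setting
-- of the extension; it applies to new sessions
ALTER DATABASE access_logs SET anon.privacy_by_default = True;
```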
We can now anonymize the table without writing any masking rule.
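A sketch of that step, assuming a table named `public.access_logs` and privacy by default already enabled for the database:

```sql
-- Permanently rewrite the table; with privacy by default enabled,
-- a column without an explicit rule is expected to be replaced by
-- its default value, or NULL when it has none
SELECT anon.anonymize_table('public.access_logs');
```

This is static masking: the original values are overwritten, so run it on a copy if the real data must be kept.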
The Battle for Privacy is happening now…
Whether you want it or not : you’re in it
Privacy by Design is the key concept
then Role Separation and Attack Surface Reduction (ASR)
Finally : Data Minimisation, Risk Evaluation, Privacy By Default
This is a team effort
Application developers
Sysadmins
DPO
etc.
Anonymous Dumps : Simply export the masked data into an SQL file
Static Masking : Permanently remove the personal info
Dynamic Masking : Hide personal info only for the masked users
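Of the three methods listed above, static masking is the destructive one. A sketch, assuming masking rules are already declared and a hypothetical `people` table; the function names are the ones I recall from the extension's documentation:

```sql
-- Apply the declared masking rules to one table, in place
SELECT anon.anonymize_table('people');

-- Or apply them to every table of the current database
SELECT anon.anonymize_database();
```

The original values are overwritten and cannot be recovered, so this is meant for copies of production, not for production itself.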
Project
https://gitlab.com/dalibo/postgresql_anonymizer/
Tutorial
Contact