{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Open Diffix demo - pg_diffix\n",
"\n",
"This notebook demonstrates how to use the `pg_diffix`, a PostgreSQL extention that implements Diffix Elm anonymization.\n",
"\n",
"The notebook demonstrates\n",
"1. The SQL limitations imposed by Diffix Elm\n",
"2. How Diffix Elm perturbs data to achieve anonymization\n",
"\n",
"Feel free to use this notebook as a starting point for playing with Diffix Elm.\n",
"\n",
"## For more information\n",
"\n",
"To read more about Diffix Elm, visit [open-diffix.org](https://open-diffix.org).\n",
"A good overview of Diffix Elm can be found [here](https://www.open-diffix.org/blog/diffix-elm-automates-what-statistics-offices-have-been-doing-for-decades/).\n",
"A detailed description is [available on ArXiv](https://arxiv.org/abs/2201.04351).\n",
"Besides including a **full specification**, the latter includes a complete privacy analysis and guidance for writing a risk assessment.\n",
"\n",
"## How to run\n",
"\n",
"To run this notebook locally you need `psycopg2` and `ipython-sql` installed. For the meta-commands you need `pgspecial`.\n",
"\n",
"Below you can find connection details for 3 users in this database instance:\n",
"\n",
"- `direct_user` with direct (non-anonymized) access to `banking`\n",
"- `trusted_user` with anonymized access to `banking` in trusted mode\n",
"- `untrusted_user` with anonymized access to `banking` in untrusted mode"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"vscode": {
"languageId": "python"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connected to database.\n"
]
}
],
"source": [
"%load_ext sql\n",
"%sql postgresql://direct_user:demo@demo-pg.open-diffix.org/banking\n",
"%sql postgresql://trusted_user:demo@demo-pg.open-diffix.org/banking\n",
"%sql postgresql://untrusted_user:demo@demo-pg.open-diffix.org/banking\n",
"%config SqlMagic.displaycon = False # No connection string in output.\n",
"%config SqlMagic.feedback = False # No Done, rows affected.\n",
"print('Connected to database.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset has the following tables:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
" \n",
" Schema | \n",
" Name | \n",
" Type | \n",
" Owner | \n",
"
\n",
" \n",
" public | \n",
" accounts | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" accounts_receivables | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" clients | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" credit_cards | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" dispositions | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" loans | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" loss_events | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" orders | \n",
" table | \n",
" postgres | \n",
"
\n",
" \n",
" public | \n",
" transactions | \n",
" table | \n",
" postgres | \n",
"
\n",
"
"
],
"text/plain": [
"[('public', 'accounts', 'table', 'postgres'),\n",
" ('public', 'accounts_receivables', 'table', 'postgres'),\n",
" ('public', 'clients', 'table', 'postgres'),\n",
" ('public', 'credit_cards', 'table', 'postgres'),\n",
" ('public', 'dispositions', 'table', 'postgres'),\n",
" ('public', 'loans', 'table', 'postgres'),\n",
" ('public', 'loss_events', 'table', 'postgres'),\n",
" ('public', 'orders', 'table', 'postgres'),\n",
" ('public', 'transactions', 'table', 'postgres')]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql \\dt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can explore columns of tables with `\\d `"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" Column | \n",
" Type | \n",
" Modifiers | \n",
"
\n",
" \n",
" loan_id | \n",
" integer | \n",
" not null | \n",
"
\n",
" \n",
" account_id | \n",
" integer | \n",
" | \n",
"
\n",
" \n",
" date | \n",
" text | \n",
" | \n",
"
\n",
" \n",
" amount | \n",
" integer | \n",
" | \n",
"
\n",
" \n",
" duration | \n",
" integer | \n",
" | \n",
"
\n",
" \n",
" payments | \n",
" double precision | \n",
" | \n",
"
\n",
" \n",
" status | \n",
" text | \n",
" | \n",
"
\n",
"
"
],
"text/plain": [
"[['loan_id', 'integer', ' not null'],\n",
" ['account_id', 'integer', ''],\n",
" ['date', 'text', ''],\n",
" ['amount', 'integer', ''],\n",
" ['duration', 'integer', ''],\n",
" ['payments', 'double precision', ''],\n",
" ['status', 'text', '']]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql \\d loans"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples\n",
"\n",
"Diffix Elm supports a very limited subset of SQL.\n",
"\n",
"It supports only one aggregate, `count()`. Specifically, it supports `count(*)`, `count(column)`, and `count(distinct column)`.\n",
"\n",
"It supports the following generalization functions for numeric and text columns:\n",
"\n",
"- `diffix.floor_by(numeric_column, K)`\n",
"- `diffix.round_by(numeric_column, K)`\n",
"- `diffix.ceil_by(numeric_column, K)`\n",
"- `substring(text_column, index, length)`\n",
"\n",
"Let's see what it looks like when data is queried by the trusted user (receives anonymized output) and the direct user (receives raw data)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" status | \n",
" real_count | \n",
"
\n",
" \n",
" A | \n",
" 203 | \n",
"
\n",
" \n",
" B | \n",
" 31 | \n",
"
\n",
" \n",
" C | \n",
" 403 | \n",
"
\n",
" \n",
" D | \n",
" 45 | \n",
"
\n",
"
"
],
"text/plain": [
"[('A', 203), ('B', 31), ('C', 403), ('D', 45)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT status, count(*) AS real_count\n",
"FROM loans\n",
"GROUP BY status\n",
"ORDER BY status ASC"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" status | \n",
" anon_count | \n",
"
\n",
" \n",
" A | \n",
" 202 | \n",
"
\n",
" \n",
" B | \n",
" 30 | \n",
"
\n",
" \n",
" C | \n",
" 402 | \n",
"
\n",
" \n",
" D | \n",
" 44 | \n",
"
\n",
"
"
],
"text/plain": [
"[('A', 202), ('B', 30), ('C', 402), ('D', 44)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT status, count(*) AS anon_count\n",
"FROM loans\n",
"GROUP BY status\n",
"ORDER BY status ASC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two queries above are identical, but the anonymized results are slightly perturbed.\n",
"\n",
"Diffix Elm perturbs data in three ways:\n",
"\n",
"- Adding noise to counts\n",
"- Suppressing column values that pertain to too few individuals\n",
"- Hiding the impact of individuals that contribute disproportionately to counts\n",
"\n",
"The following gives examples of these.\n",
"\n",
"## Adding noise\n",
"\n",
"Diffix Elm adds noise to the results of aggregates.\n",
"This is why the counts in the above two queries differ.\n",
"\n",
"## Suppressing values with few individuals\n",
"\n",
"Let us take another look at the `loans` table we queried above.\n",
"The following query shows that many of the loan dates are unique:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" date | \n",
" count | \n",
"
\n",
" \n",
" 971019 | \n",
" 1 | \n",
"
\n",
" \n",
" 951002 | \n",
" 1 | \n",
"
\n",
" \n",
" 980522 | \n",
" 1 | \n",
"
\n",
" \n",
" 970519 | \n",
" 1 | \n",
"
\n",
" \n",
" 980129 | \n",
" 1 | \n",
"
\n",
" \n",
" 981201 | \n",
" 1 | \n",
"
\n",
" \n",
" 950922 | \n",
" 1 | \n",
"
\n",
" \n",
" 940825 | \n",
" 1 | \n",
"
\n",
" \n",
" 981029 | \n",
" 1 | \n",
"
\n",
" \n",
" 961024 | \n",
" 1 | \n",
"
\n",
"
"
],
"text/plain": [
"[('971019', 1),\n",
" ('951002', 1),\n",
" ('980522', 1),\n",
" ('970519', 1),\n",
" ('980129', 1),\n",
" ('981201', 1),\n",
" ('950922', 1),\n",
" ('940825', 1),\n",
" ('981029', 1),\n",
" ('961024', 1)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT date, count(*)\n",
"FROM loans\n",
"GROUP BY date\n",
"HAVING count(*) = 1\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact 457 out of the 682 loans in the table (or approximately 67%) had a date that only appeared once:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" num_occurrences | \n",
" count_loans_sharing_date | \n",
"
\n",
" \n",
" 457 | \n",
" 1 | \n",
"
\n",
" \n",
" 84 | \n",
" 2 | \n",
"
\n",
" \n",
" 15 | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
"
\n",
"
"
],
"text/plain": [
"[(457, 1), (84, 2), (15, 3), (3, 4)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT count(*) num_occurrences, count_loans_sharing_date\n",
"FROM (\n",
" SELECT date, count(*) AS count_loans_sharing_date\n",
" FROM loans\n",
" GROUP BY date\n",
") t\n",
"GROUP BY count_loans_sharing_date\n",
"ORDER BY count(*) DESC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means that, without anonymization, if we know the date on which somebody went to the bank to take out a loan,\n",
"we could most likely learn the amount of the loan and other information:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" date | \n",
" amount | \n",
" duration | \n",
" status | \n",
" count | \n",
"
\n",
" \n",
" 930705 | \n",
" 96396 | \n",
" 12 | \n",
" B | \n",
" 1 | \n",
"
\n",
" \n",
" 930711 | \n",
" 165960 | \n",
" 36 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930728 | \n",
" 127080 | \n",
" 60 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930803 | \n",
" 105804 | \n",
" 36 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930906 | \n",
" 274740 | \n",
" 60 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930913 | \n",
" 87840 | \n",
" 24 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930915 | \n",
" 52788 | \n",
" 12 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 930924 | \n",
" 174744 | \n",
" 24 | \n",
" B | \n",
" 1 | \n",
"
\n",
" \n",
" 931013 | \n",
" 154416 | \n",
" 48 | \n",
" A | \n",
" 1 | \n",
"
\n",
" \n",
" 931104 | \n",
" 117024 | \n",
" 24 | \n",
" A | \n",
" 1 | \n",
"
\n",
"
"
],
"text/plain": [
"[('930705', 96396, 12, 'B', 1),\n",
" ('930711', 165960, 36, 'A', 1),\n",
" ('930728', 127080, 60, 'A', 1),\n",
" ('930803', 105804, 36, 'A', 1),\n",
" ('930906', 274740, 60, 'A', 1),\n",
" ('930913', 87840, 24, 'A', 1),\n",
" ('930915', 52788, 12, 'A', 1),\n",
" ('930924', 174744, 24, 'B', 1),\n",
" ('931013', 154416, 48, 'A', 1),\n",
" ('931104', 117024, 24, 'A', 1)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT date, amount, duration, status, count(*)\n",
"FROM loans\n",
"GROUP BY date, amount, duration, status\n",
"ORDER BY date\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we know that an individual has taken a loan on Sept. 6, 1993, then we learn that it is a 60 month loan for 274740 dollars.\n",
"\n",
"When Diffix is used to anonymize the data, the system detects output rows that can be linked to a single or a small number of individuals,\n",
"and suppresses that information.\n",
"\n",
"We can see this in practice by repeating the two queries we just ran on the raw dataset, but protected by Diffix Elm."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" date | \n",
" count | \n",
"
\n",
" \n",
" * | \n",
" 667 | \n",
"
\n",
" \n",
" 971208 | \n",
" 3 | \n",
"
\n",
" \n",
" 960429 | \n",
" 3 | \n",
"
\n",
" \n",
" 970307 | \n",
" 3 | \n",
"
\n",
" \n",
" 970317 | \n",
" 3 | \n",
"
\n",
" \n",
" 970503 | \n",
" 3 | \n",
"
\n",
"
"
],
"text/plain": [
"[('*', 667),\n",
" ('971208', 3),\n",
" ('960429', 3),\n",
" ('970307', 3),\n",
" ('970317', 3),\n",
" ('970503', 3)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT date, count(*)\n",
"FROM loans\n",
"GROUP BY date"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that almost all of the data has been suppressed. \n",
"Counts for only five dates are displayed.\n",
"These are dates where there are enough distinct individuals with loans to show the data.\n",
"All other dates have one or very loans. Diffix Elm automatically recognizes this and suppresses the dates,\n",
"merging them into a single bin with the value '`*`' which we call the _suppression bin_."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" date | \n",
" amount | \n",
" duration | \n",
" status | \n",
" count | \n",
"
\n",
" \n",
" * | \n",
" None | \n",
" None | \n",
" * | \n",
" 683 | \n",
"
\n",
"
"
],
"text/plain": [
"[('*', None, None, '*', 683)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT date, amount, duration, status, count(*)\n",
"FROM loans\n",
"GROUP BY date, amount, duration, status"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query above, which attempts to display all columns, suppresses even more aggressively:\n",
"all of the data is suppressed and merged into a single suppression bin.\n",
"This is because the four column values taken together are distinct for every individual. \n",
"\n",
"The text columns of suppressed bins are denoted with the value '`*`' (this is configurable), while numeric columns are denoted with `NULL`.\n",
"Diffix Elm provides a custom function, `diffix.is_suppress_bin(*)`, which returns `True` for the suppression bin, and `False` otherwise:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" date | \n",
" count | \n",
" is_suppress_bin | \n",
"
\n",
" \n",
" * | \n",
" 667 | \n",
" True | \n",
"
\n",
" \n",
" 971208 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
" 960429 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
" 970307 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
" 970317 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
" 970503 | \n",
" 3 | \n",
" False | \n",
"
\n",
"
"
],
"text/plain": [
"[('*', 667, True),\n",
" ('971208', 3, False),\n",
" ('960429', 3, False),\n",
" ('970307', 3, False),\n",
" ('970317', 3, False),\n",
" ('970503', 3, False)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT date, count(*), diffix.is_suppress_bin(*)\n",
"FROM loans\n",
"GROUP BY date"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generalization\n",
"\n",
"The queries above are anonymous, but they are also useless. All of the data has been suppressed.\n",
"**Generalization** is used to avoid suppression.\n",
"\n",
"For example, rather than attempt to count the number of loans for individual days,\n",
"we can generalize these to years using `substring()`:\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" year | \n",
" count | \n",
"
\n",
" \n",
" 93 | \n",
" 21 | \n",
"
\n",
" \n",
" 94 | \n",
" 102 | \n",
"
\n",
" \n",
" 98 | \n",
" 159 | \n",
"
\n",
" \n",
" 96 | \n",
" 113 | \n",
"
\n",
" \n",
" 95 | \n",
" 91 | \n",
"
\n",
" \n",
" 97 | \n",
" 198 | \n",
"
\n",
"
"
],
"text/plain": [
"[('93', 21), ('94', 102), ('98', 159), ('96', 113), ('95', 91), ('97', 198)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT substring(date, 1, 2) AS year, count(*)\n",
"FROM loans\n",
"GROUP BY substring(date, 1, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that while no data has been suppressed, it comes at the cost of a loss of precision.\n",
"We can improve this by generalizing to month (here showing only the first 10 rows):"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" year_month | \n",
" count | \n",
"
\n",
" \n",
" * | \n",
" 24 | \n",
"
\n",
" \n",
" 9312 | \n",
" 7 | \n",
"
\n",
" \n",
" 9401 | \n",
" 5 | \n",
"
\n",
" \n",
" 9403 | \n",
" 7 | \n",
"
\n",
" \n",
" 9405 | \n",
" 10 | \n",
"
\n",
" \n",
" 9406 | \n",
" 13 | \n",
"
\n",
" \n",
" 9407 | \n",
" 13 | \n",
"
\n",
" \n",
" 9408 | \n",
" 7 | \n",
"
\n",
" \n",
" 9409 | \n",
" 11 | \n",
"
\n",
" \n",
" 9410 | \n",
" 6 | \n",
"
\n",
"
"
],
"text/plain": [
"[('*', 24),\n",
" ('9312', 7),\n",
" ('9401', 5),\n",
" ('9403', 7),\n",
" ('9405', 10),\n",
" ('9406', 13),\n",
" ('9407', 13),\n",
" ('9408', 7),\n",
" ('9409', 11),\n",
" ('9410', 6)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT substring(date, 1, 4) AS year_month, count(*)\n",
"FROM loans\n",
"GROUP BY substring(date, 1, 4)\n",
"ORDER BY substring(date, 1, 4)\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here only 3.5% of the data is suppressed (roughly 24 of 682 loans). \n",
"\n",
"Diffix Elm allows generalization of both text and numeric columns.\n",
"In the following, we use the `difix.floor_by(col, amount)` function to generate loan amount bins of 50000:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" amount | \n",
" count | \n",
"
\n",
" \n",
" None | \n",
" 5 | \n",
"
\n",
" \n",
" 0.0 | \n",
" 125 | \n",
"
\n",
" \n",
" 50000.0 | \n",
" 178 | \n",
"
\n",
" \n",
" 100000.0 | \n",
" 89 | \n",
"
\n",
" \n",
" 150000.0 | \n",
" 103 | \n",
"
\n",
" \n",
" 200000.0 | \n",
" 57 | \n",
"
\n",
" \n",
" 250000.0 | \n",
" 41 | \n",
"
\n",
" \n",
" 300000.0 | \n",
" 35 | \n",
"
\n",
" \n",
" 350000.0 | \n",
" 25 | \n",
"
\n",
" \n",
" 400000.0 | \n",
" 10 | \n",
"
\n",
" \n",
" 450000.0 | \n",
" 11 | \n",
"
\n",
"
"
],
"text/plain": [
"[(None, 5),\n",
" (0.0, 125),\n",
" (50000.0, 178),\n",
" (100000.0, 89),\n",
" (150000.0, 103),\n",
" (200000.0, 57),\n",
" (250000.0, 41),\n",
" (300000.0, 35),\n",
" (350000.0, 25),\n",
" (400000.0, 10),\n",
" (450000.0, 11)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT diffix.floor_by(amount, 50000) AS amount, count(*)\n",
"FROM loans\n",
"GROUP BY diffix.floor_by(amount, 50000)\n",
"ORDER BY diffix.floor_by(amount, 50000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Post Processing\n",
"\n",
"Although Diffix Elm places strong SQL restrictions on anonymizing queries,\n",
"it allows arbitrary SQL when post-processing the result of an anonymized query.\n",
"\n",
"The following post processes the anonymizing sub-query to make the output more readable."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" year | \n",
" amount | \n",
" count | \n",
"
\n",
" \n",
" Other | \n",
" Other | \n",
" 37 | \n",
"
\n",
" \n",
" 1993 | \n",
" 50k - 100k | \n",
" 8 | \n",
"
\n",
" \n",
" 1994 | \n",
" 0k - 50k | \n",
" 16 | \n",
"
\n",
" \n",
" 1994 | \n",
" 50k - 100k | \n",
" 30 | \n",
"
\n",
" \n",
" 1994 | \n",
" 100k - 150k | \n",
" 14 | \n",
"
\n",
" \n",
" 1994 | \n",
" 150k - 200k | \n",
" 16 | \n",
"
\n",
" \n",
" 1994 | \n",
" 200k - 250k | \n",
" 9 | \n",
"
\n",
" \n",
" 1994 | \n",
" 250k - 300k | \n",
" 8 | \n",
"
\n",
" \n",
" 1994 | \n",
" 300k - 350k | \n",
" 3 | \n",
"
\n",
" \n",
" 1995 | \n",
" 0k - 50k | \n",
" 21 | \n",
"
\n",
"
"
],
"text/plain": [
"[('Other', 'Other', 37),\n",
" ('1993', '50k - 100k', 8),\n",
" ('1994', '0k - 50k', 16),\n",
" ('1994', '50k - 100k', 30),\n",
" ('1994', '100k - 150k', 14),\n",
" ('1994', '150k - 200k', 16),\n",
" ('1994', '200k - 250k', 9),\n",
" ('1994', '250k - 300k', 8),\n",
" ('1994', '300k - 350k', 3),\n",
" ('1995', '0k - 50k', 21)]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT\n",
" CASE WHEN anonymized.year = '*' THEN 'Other'\n",
" ELSE '19' || anonymized.year\n",
" END AS year,\n",
" CASE WHEN anonymized.amount IS NULL THEN 'Other'\n",
" ELSE anonymized.amount/1000 || 'k - ' || (anonymized.amount/1000 + 50) || 'k'\n",
" END AS amount,\n",
" count\n",
"FROM (\n",
" SELECT substring(date, 1, 2) as year, diffix.floor_by(amount, 50000) AS amount, count(*) AS count\n",
" FROM loans\n",
" GROUP BY 1, 2\n",
") AS anonymized\n",
"ORDER BY anonymized.year ASC, anonymized.amount ASC\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following counts the number of unsuppressed bins in an output (including the suppression bin itself):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" number_of_unsuppressed_bins_including_suppress_bin | \n",
"
\n",
" \n",
" 59 | \n",
"
\n",
"
"
],
"text/plain": [
"[(59,)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT count(*) AS number_of_unsuppressed_bins_including_suppress_bin\n",
"FROM (\n",
" SELECT substring(date, 1, 4), count(*)\n",
" FROM loans\n",
" GROUP BY substring(date, 1, 4)\n",
") t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following computes the min and max loan amount bins. Note that this may not the true min and max,\n",
"but simply the min and max of the bin value for unsuppressed bins. The true values may be more or less."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" min_loan_amount | \n",
" max_loan_amount | \n",
"
\n",
" \n",
" 0.0 | \n",
" 550000.0 | \n",
"
\n",
"
"
],
"text/plain": [
"[(0.0, 550000.0)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT min(amount_floor) AS min_loan_amount,\n",
" max(amount_ceiling) AS max_loan_amount\n",
"FROM (\n",
" SELECT diffix.floor_by(amount, 50000) AS amount_floor,\n",
"\t diffix.ceil_by(amount, 50000) AS amount_ceiling,\n",
"\t count(*)\n",
" FROM loans\n",
" GROUP BY 1, 2\n",
") t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following computes the average number of transactions per user."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" average_trans_per_user | \n",
"
\n",
" \n",
" 234 | \n",
"
\n",
"
"
],
"text/plain": [
"[(234,)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT total_transactions/total_users AS average_trans_per_user\n",
"FROM (\n",
" SELECT count(DISTINCT account_id) AS total_users,\n",
" count(*) AS total_transactions\n",
" FROM transactions\n",
") t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Proportional Noise\n",
"\n",
"Diffix Elm uses noise to hide the presence of absence of individuals.\n",
"If individuals contribute more to a count, then Diffix Elm recognizes this and increases the noise proprotionally.\n",
"\n",
"As an example, consider the following query made on the `transactions` table for both the raw and anonymized answers.\n",
"This query counts the number of distinct individuals (as identified by their account_id) for each transaction type."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" type | \n",
" count | \n",
"
\n",
" \n",
" PRIJEM | \n",
" 4500 | \n",
"
\n",
" \n",
" VYBER | \n",
" 1144 | \n",
"
\n",
" \n",
" VYDAJ | \n",
" 4500 | \n",
"
\n",
"
"
],
"text/plain": [
"[('PRIJEM', 4500), ('VYBER', 1144), ('VYDAJ', 4500)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT type, count(DISTINCT account_id)\n",
"FROM transactions\n",
"GROUP BY type\n",
"ORDER BY type"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" type | \n",
" count | \n",
"
\n",
" \n",
" PRIJEM | \n",
" 4501 | \n",
"
\n",
" \n",
" VYBER | \n",
" 1143 | \n",
"
\n",
" \n",
" VYDAJ | \n",
" 4500 | \n",
"
\n",
"
"
],
"text/plain": [
"[('PRIJEM', 4501), ('VYBER', 1143), ('VYDAJ', 4500)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT type, count(DISTINCT account_id)\n",
"FROM transactions\n",
"GROUP BY type\n",
"ORDER BY type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The amount of noise is very small. This is because each individual contributes only one to the count.\n",
"The amount of noise needed to hide one user is quite small (roughly plus or minus five).\n",
"\n",
"Now consider counting the number of transactions rather than the number of distince individuals:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" type | \n",
" count | \n",
"
\n",
" \n",
" PRIJEM | \n",
" 405083 | \n",
"
\n",
" \n",
" VYBER | \n",
" 16666 | \n",
"
\n",
" \n",
" VYDAJ | \n",
" 634571 | \n",
"
\n",
"
"
],
"text/plain": [
"[('PRIJEM', 405083), ('VYBER', 16666), ('VYDAJ', 634571)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT type, count(*)\n",
"FROM transactions\n",
"GROUP BY type\n",
"ORDER BY type"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" type | \n",
" count | \n",
"
\n",
" \n",
" PRIJEM | \n",
" 405145 | \n",
"
\n",
" \n",
" VYBER | \n",
" 16647 | \n",
"
\n",
" \n",
" VYDAJ | \n",
" 634406 | \n",
"
\n",
"
"
],
"text/plain": [
"[('PRIJEM', 405145), ('VYBER', 16647), ('VYDAJ', 634406)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql trusted_user@banking\n",
"SELECT type, count(*)\n",
"FROM transactions\n",
"GROUP BY type\n",
"ORDER BY type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the counts differ by 62, 19, and 165 respectively.\n",
"To see why, let's look at the average number of transactions per individual per type:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" type | \n",
" average_transactions | \n",
"
\n",
" \n",
" PRIJEM | \n",
" 90 | \n",
"
\n",
" \n",
" VYBER | \n",
" 14 | \n",
"
\n",
" \n",
" VYDAJ | \n",
" 141 | \n",
"
\n",
"
"
],
"text/plain": [
"[('PRIJEM', 90), ('VYBER', 14), ('VYDAJ', 141)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql direct_user@banking\n",
"SELECT type, count_transactions/count_users AS average_transactions\n",
"FROM (\n",
" SELECT type,\n",
" count(DISTINCT account_id) AS count_users,\n",
" count(*) AS count_transactions,\n",
" count(*)/count(distinct account_id) AS avg\n",
" FROM transactions\n",
" GROUP BY 1\n",
") t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the amount of noise applied by Diffix Elm is proportional to the amount contributed by individuals."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"orig_nbformat": 3
},
"nbformat": 4,
"nbformat_minor": 2
}