{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Open Diffix demo - pg_diffix\n", "\n", "This notebook demonstrates how to use the `pg_diffix`, a PostgreSQL extention that implements Diffix Elm anonymization.\n", "\n", "The notebook demonstrates\n", "1. The SQL limitations imposed by Diffix Elm\n", "2. How Diffix Elm perturbs data to achieve anonymization\n", "\n", "Feel free to use this notebook as a starting point for playing with Diffix Elm.\n", "\n", "## For more information\n", "\n", "To read more about Diffix Elm, visit [open-diffix.org](https://open-diffix.org).\n", "A good overview of Diffix Elm can be found [here](https://www.open-diffix.org/blog/diffix-elm-automates-what-statistics-offices-have-been-doing-for-decades/).\n", "A detailed description is [available on ArXiv](https://arxiv.org/abs/2201.04351).\n", "Besides including a **full specification**, the latter includes a complete privacy analysis and guidance for writing a risk assessment.\n", "\n", "## How to run\n", "\n", "To run this notebook locally you need `psycopg2` and `ipython-sql` installed. For the meta-commands you need `pgspecial`.\n", "\n", "Below you can find connection details for 3 users in this database instance:\n", "\n", "- `direct_user` with direct (non-anonymized) access to `banking`\n", "- `trusted_user` with anonymized access to `banking` in trusted mode\n", "- `untrusted_user` with anonymized access to `banking` in untrusted mode" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "vscode": { "languageId": "python" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to database.\n" ] } ], "source": [ "%load_ext sql\n", "%sql postgresql://direct_user:demo@demo-pg.open-diffix.org/banking\n", "%sql postgresql://trusted_user:demo@demo-pg.open-diffix.org/banking\n", "%sql postgresql://untrusted_user:demo@demo-pg.open-diffix.org/banking\n", "%config SqlMagic.displaycon = False # No connection string in output.\n", "%config SqlMagic.feedback = False # No Done, rows affected.\n", "print('Connected to database.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has the following tables:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SchemaNameTypeOwner
publicaccountstablepostgres
publicaccounts_receivablestablepostgres
publicclientstablepostgres
publiccredit_cardstablepostgres
publicdispositionstablepostgres
publicloanstablepostgres
publicloss_eventstablepostgres
publicorderstablepostgres
publictransactionstablepostgres
" ], "text/plain": [ "[('public', 'accounts', 'table', 'postgres'),\n", " ('public', 'accounts_receivables', 'table', 'postgres'),\n", " ('public', 'clients', 'table', 'postgres'),\n", " ('public', 'credit_cards', 'table', 'postgres'),\n", " ('public', 'dispositions', 'table', 'postgres'),\n", " ('public', 'loans', 'table', 'postgres'),\n", " ('public', 'loss_events', 'table', 'postgres'),\n", " ('public', 'orders', 'table', 'postgres'),\n", " ('public', 'transactions', 'table', 'postgres')]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql \\dt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can explore columns of tables with `\\d `" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ColumnTypeModifiers
loan_idinteger not null
account_idinteger
datetext
amountinteger
durationinteger
paymentsdouble precision
statustext
" ], "text/plain": [ "[['loan_id', 'integer', ' not null'],\n", " ['account_id', 'integer', ''],\n", " ['date', 'text', ''],\n", " ['amount', 'integer', ''],\n", " ['duration', 'integer', ''],\n", " ['payments', 'double precision', ''],\n", " ['status', 'text', '']]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql \\d loans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples\n", "\n", "Diffix Elm supports a very limited subset of SQL.\n", "\n", "It supports only one aggregate, `count()`. Specifically, it supports `count(*)`, `count(column)`, and `count(distinct column)`.\n", "\n", "It supports the following generalization functions for numeric and text columns:\n", "\n", "- `diffix.floor_by(numeric_column, K)`\n", "- `diffix.round_by(numeric_column, K)`\n", "- `diffix.ceil_by(numeric_column, K)`\n", "- `substring(text_column, index, length)`\n", "\n", "Let's see what it looks like when data is queried by the trusted user (receives anonymized output) and the direct user (receives raw data)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statusreal_count
A203
B31
C403
D45
" ], "text/plain": [ "[('A', 203), ('B', 31), ('C', 403), ('D', 45)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT status, count(*) AS real_count\n", "FROM loans\n", "GROUP BY status\n", "ORDER BY status ASC" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statusanon_count
A202
B30
C402
D44
" ], "text/plain": [ "[('A', 202), ('B', 30), ('C', 402), ('D', 44)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT status, count(*) AS anon_count\n", "FROM loans\n", "GROUP BY status\n", "ORDER BY status ASC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two queries above are identical, but the anonymized results are slightly perturbed.\n", "\n", "Diffix Elm perturbs data in three ways:\n", "\n", "- Adding noise to counts\n", "- Suppressing column values that pertain to too few individuals\n", "- Hiding the impact of individuals that contribute disproportionately to counts\n", "\n", "The following gives examples of these.\n", "\n", "## Adding noise\n", "\n", "Diffix Elm adds noise to the results of aggregates.\n", "This is why the counts in the above two queries differ.\n", "\n", "## Suppressing values with few individuals\n", "\n", "Let us take another look at the `loans` table we queried above.\n", "The following query shows that many of the loan dates are unique:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecount
9710191
9510021
9805221
9705191
9801291
9812011
9509221
9408251
9810291
9610241
" ], "text/plain": [ "[('971019', 1),\n", " ('951002', 1),\n", " ('980522', 1),\n", " ('970519', 1),\n", " ('980129', 1),\n", " ('981201', 1),\n", " ('950922', 1),\n", " ('940825', 1),\n", " ('981029', 1),\n", " ('961024', 1)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT date, count(*)\n", "FROM loans\n", "GROUP BY date\n", "HAVING count(*) = 1\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact 457 out of the 682 loans in the table (or approximately 67%) had a date that only appeared once:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num_occurrencescount_loans_sharing_date
4571
842
153
34
" ], "text/plain": [ "[(457, 1), (84, 2), (15, 3), (3, 4)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT count(*) num_occurrences, count_loans_sharing_date\n", "FROM (\n", " SELECT date, count(*) AS count_loans_sharing_date\n", " FROM loans\n", " GROUP BY date\n", ") t\n", "GROUP BY count_loans_sharing_date\n", "ORDER BY count(*) DESC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that, without anonymization, if we know the date on which somebody went to the bank to take out a loan,\n", "we could most likely learn the amount of the loan and other information:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateamountdurationstatuscount
9307059639612B1
93071116596036A1
93072812708060A1
93080310580436A1
93090627474060A1
9309138784024A1
9309155278812A1
93092417474424B1
93101315441648A1
93110411702424A1
" ], "text/plain": [ "[('930705', 96396, 12, 'B', 1),\n", " ('930711', 165960, 36, 'A', 1),\n", " ('930728', 127080, 60, 'A', 1),\n", " ('930803', 105804, 36, 'A', 1),\n", " ('930906', 274740, 60, 'A', 1),\n", " ('930913', 87840, 24, 'A', 1),\n", " ('930915', 52788, 12, 'A', 1),\n", " ('930924', 174744, 24, 'B', 1),\n", " ('931013', 154416, 48, 'A', 1),\n", " ('931104', 117024, 24, 'A', 1)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT date, amount, duration, status, count(*)\n", "FROM loans\n", "GROUP BY date, amount, duration, status\n", "ORDER BY date\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we know that an individual has taken a loan on Sept. 6, 1993, then we learn that it is a 60 month loan for 274740 dollars.\n", "\n", "When Diffix is used to anonymize the data, the system detects output rows that can be linked to a single or a small number of individuals,\n", "and suppresses that information.\n", "\n", "We can see this in practice by repeating the two queries we just ran on the raw dataset, but protected by Diffix Elm." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecount
*667
9712083
9604293
9703073
9703173
9705033
" ], "text/plain": [ "[('*', 667),\n", " ('971208', 3),\n", " ('960429', 3),\n", " ('970307', 3),\n", " ('970317', 3),\n", " ('970503', 3)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT date, count(*)\n", "FROM loans\n", "GROUP BY date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that almost all of the data has been suppressed. \n", "Counts for only five dates are displayed.\n", "These are dates where there are enough distinct individuals with loans to show the data.\n", "All other dates have one or very loans. Diffix Elm automatically recognizes this and suppresses the dates,\n", "merging them into a single bin with the value '`*`' which we call the _suppression bin_." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateamountdurationstatuscount
*NoneNone*683
" ], "text/plain": [ "[('*', None, None, '*', 683)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT date, amount, duration, status, count(*)\n", "FROM loans\n", "GROUP BY date, amount, duration, status" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query above, which attempts to display all columns, suppresses even more aggressively:\n", "all of the data is suppressed and merged into a single suppression bin.\n", "This is because the four column values taken together are distinct for every individual. \n", "\n", "The text columns of suppressed bins are denoted with the value '`*`' (this is configurable), while numeric columns are denoted with `NULL`.\n", "Diffix Elm provides a custom function, `diffix.is_suppress_bin(*)`, which returns `True` for the suppression bin, and `False` otherwise:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecountis_suppress_bin
*667True
9712083False
9604293False
9703073False
9703173False
9705033False
" ], "text/plain": [ "[('*', 667, True),\n", " ('971208', 3, False),\n", " ('960429', 3, False),\n", " ('970307', 3, False),\n", " ('970317', 3, False),\n", " ('970503', 3, False)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT date, count(*), diffix.is_suppress_bin(*)\n", "FROM loans\n", "GROUP BY date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generalization\n", "\n", "The queries above are anonymous, but they are also useless. All of the data has been suppressed.\n", "**Generalization** is used to avoid suppression.\n", "\n", "For example, rather than attempt to count the number of loans for individual days,\n", "we can generalize these to years using `substring()`:\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearcount
9321
94102
98159
96113
9591
97198
" ], "text/plain": [ "[('93', 21), ('94', 102), ('98', 159), ('96', 113), ('95', 91), ('97', 198)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT substring(date, 1, 2) AS year, count(*)\n", "FROM loans\n", "GROUP BY substring(date, 1, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that while no data has been suppressed, it comes at the cost of a loss of precision.\n", "We can improve this by generalizing to month (here showing only the first 10 rows):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year_monthcount
*24
93127
94015
94037
940510
940613
940713
94087
940911
94106
" ], "text/plain": [ "[('*', 24),\n", " ('9312', 7),\n", " ('9401', 5),\n", " ('9403', 7),\n", " ('9405', 10),\n", " ('9406', 13),\n", " ('9407', 13),\n", " ('9408', 7),\n", " ('9409', 11),\n", " ('9410', 6)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT substring(date, 1, 4) AS year_month, count(*)\n", "FROM loans\n", "GROUP BY substring(date, 1, 4)\n", "ORDER BY substring(date, 1, 4)\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here only 3.5% of the data is suppressed (roughly 24 of 682 loans). \n", "\n", "Diffix Elm allows generalization of both text and numeric columns.\n", "In the following, we use the `difix.floor_by(col, amount)` function to generate loan amount bins of 50000:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amountcount
None5
0.0125
50000.0178
100000.089
150000.0103
200000.057
250000.041
300000.035
350000.025
400000.010
450000.011
" ], "text/plain": [ "[(None, 5),\n", " (0.0, 125),\n", " (50000.0, 178),\n", " (100000.0, 89),\n", " (150000.0, 103),\n", " (200000.0, 57),\n", " (250000.0, 41),\n", " (300000.0, 35),\n", " (350000.0, 25),\n", " (400000.0, 10),\n", " (450000.0, 11)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT diffix.floor_by(amount, 50000) AS amount, count(*)\n", "FROM loans\n", "GROUP BY diffix.floor_by(amount, 50000)\n", "ORDER BY diffix.floor_by(amount, 50000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Post Processing\n", "\n", "Although Diffix Elm places strong SQL restrictions on anonymizing queries,\n", "it allows arbitrary SQL when post-processing the result of an anonymized query.\n", "\n", "The following post processes the anonymizing sub-query to make the output more readable." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearamountcount
OtherOther37
199350k - 100k8
19940k - 50k16
199450k - 100k30
1994100k - 150k14
1994150k - 200k16
1994200k - 250k9
1994250k - 300k8
1994300k - 350k3
19950k - 50k21
" ], "text/plain": [ "[('Other', 'Other', 37),\n", " ('1993', '50k - 100k', 8),\n", " ('1994', '0k - 50k', 16),\n", " ('1994', '50k - 100k', 30),\n", " ('1994', '100k - 150k', 14),\n", " ('1994', '150k - 200k', 16),\n", " ('1994', '200k - 250k', 9),\n", " ('1994', '250k - 300k', 8),\n", " ('1994', '300k - 350k', 3),\n", " ('1995', '0k - 50k', 21)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT\n", " CASE WHEN anonymized.year = '*' THEN 'Other'\n", " ELSE '19' || anonymized.year\n", " END AS year,\n", " CASE WHEN anonymized.amount IS NULL THEN 'Other'\n", " ELSE anonymized.amount/1000 || 'k - ' || (anonymized.amount/1000 + 50) || 'k'\n", " END AS amount,\n", " count\n", "FROM (\n", " SELECT substring(date, 1, 2) as year, diffix.floor_by(amount, 50000) AS amount, count(*) AS count\n", " FROM loans\n", " GROUP BY 1, 2\n", ") AS anonymized\n", "ORDER BY anonymized.year ASC, anonymized.amount ASC\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following counts the number of unsuppressed bins in an output (including the suppression bin itself):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
number_of_unsuppressed_bins_including_suppress_bin
59
" ], "text/plain": [ "[(59,)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT count(*) AS number_of_unsuppressed_bins_including_suppress_bin\n", "FROM (\n", " SELECT substring(date, 1, 4), count(*)\n", " FROM loans\n", " GROUP BY substring(date, 1, 4)\n", ") t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following computes the min and max loan amount bins. Note that this may not the true min and max,\n", "but simply the min and max of the bin value for unsuppressed bins. The true values may be more or less." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_loan_amountmax_loan_amount
0.0550000.0
" ], "text/plain": [ "[(0.0, 550000.0)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT min(amount_floor) AS min_loan_amount,\n", " max(amount_ceiling) AS max_loan_amount\n", "FROM (\n", " SELECT diffix.floor_by(amount, 50000) AS amount_floor,\n", "\t diffix.ceil_by(amount, 50000) AS amount_ceiling,\n", "\t count(*)\n", " FROM loans\n", " GROUP BY 1, 2\n", ") t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following computes the average number of transactions per user." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
average_trans_per_user
234
" ], "text/plain": [ "[(234,)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT total_transactions/total_users AS average_trans_per_user\n", "FROM (\n", " SELECT count(DISTINCT account_id) AS total_users,\n", " count(*) AS total_transactions\n", " FROM transactions\n", ") t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Proportional Noise\n", "\n", "Diffix Elm uses noise to hide the presence of absence of individuals.\n", "If individuals contribute more to a count, then Diffix Elm recognizes this and increases the noise proprotionally.\n", "\n", "As an example, consider the following query made on the `transactions` table for both the raw and anonymized answers.\n", "This query counts the number of distinct individuals (as identified by their account_id) for each transaction type." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typecount
PRIJEM4500
VYBER1144
VYDAJ4500
" ], "text/plain": [ "[('PRIJEM', 4500), ('VYBER', 1144), ('VYDAJ', 4500)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT type, count(DISTINCT account_id)\n", "FROM transactions\n", "GROUP BY type\n", "ORDER BY type" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typecount
PRIJEM4501
VYBER1143
VYDAJ4500
" ], "text/plain": [ "[('PRIJEM', 4501), ('VYBER', 1143), ('VYDAJ', 4500)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT type, count(DISTINCT account_id)\n", "FROM transactions\n", "GROUP BY type\n", "ORDER BY type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The amount of noise is very small. This is because each individual contributes only one to the count.\n", "The amount of noise needed to hide one user is quite small (roughly plus or minus five).\n", "\n", "Now consider counting the number of transactions rather than the number of distince individuals:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typecount
PRIJEM405083
VYBER16666
VYDAJ634571
" ], "text/plain": [ "[('PRIJEM', 405083), ('VYBER', 16666), ('VYDAJ', 634571)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT type, count(*)\n", "FROM transactions\n", "GROUP BY type\n", "ORDER BY type" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typecount
PRIJEM405145
VYBER16647
VYDAJ634406
" ], "text/plain": [ "[('PRIJEM', 405145), ('VYBER', 16647), ('VYDAJ', 634406)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql trusted_user@banking\n", "SELECT type, count(*)\n", "FROM transactions\n", "GROUP BY type\n", "ORDER BY type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the counts differ by 62, 19, and 165 respectively.\n", "To see why, let's look at the average number of transactions per individual per type:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeaverage_transactions
PRIJEM90
VYBER14
VYDAJ141
" ], "text/plain": [ "[('PRIJEM', 90), ('VYBER', 14), ('VYDAJ', 141)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql direct_user@banking\n", "SELECT type, count_transactions/count_users AS average_transactions\n", "FROM (\n", " SELECT type,\n", " count(DISTINCT account_id) AS count_users,\n", " count(*) AS count_transactions,\n", " count(*)/count(distinct account_id) AS avg\n", " FROM transactions\n", " GROUP BY 1\n", ") t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the amount of noise applied by Diffix Elm is proportional to the amount contributed by individuals." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "orig_nbformat": 3 }, "nbformat": 4, "nbformat_minor": 2 }