{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1-final" }, "orig_nbformat": 2, "kernelspec": { "name": "python3", "display_name": "Python 3", "language": "python" } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Anonymization Primitives aka Diffix Toolbox\n", "\n", "This notebook will showcase current features of the Diffix PostgreSQL extension.\n", "\n", "The functionality presented below is the raw interface to Diffix, intended only for people who know what they're doing.\n", "I like to think of it as a \"Diffix Level 0\" or \"Diffix Toolbox\".\n", "\n", "To run this notebook locally you need `psycopg2`, `ipython-sql`, and of course a PostgreSQL database with the extension installed.\n", "\n", "Let's connect to a database that has `pg_diffix` installed." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Connected to database.\n" ] } ], "source": [ "%load_ext sql\n", "%sql postgresql://edon:secret@localhost:5432/diffixdb\n", "%config SqlMagic.displaycon = False # No connection string in output.\n", "%config SqlMagic.feedback = False # No Done, rows affected.\n", "# Make sure pg_diffix is installed and loaded.\n", "%sql load 'pg_diffix'\n", "print('Connected to database.')" ] }, { "source": [ "## Setting up the dataset\n", "\n", "Let's create a table `purchases` with some columns." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, "metadata": {}, "execution_count": 2 } ], "source": [ "%%sql\n", "DROP TABLE IF EXISTS purchases;\n", "CREATE TABLE purchases (\n", " uid INTEGER,\n", " product TEXT,\n", " brand TEXT,\n", " price FLOAT\n", ");" ] }, { "source": [ "We'll use this helper function to insert many copies of a row at once." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def insert(how_many, row):\n", " uid, product, brand, price = row\n", " for i in range(how_many):\n", " %sql \\\n", " INSERT INTO purchases (uid, product, brand, price) \\\n", " VALUES (:uid, :product, :brand, :price)" ] }, { "source": [ "Let's insert some data." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Done\n" ] } ], "source": [ "# User 1\n", "insert(3, [ 1, 'phone', 'apple', 1000 ])\n", "insert(1, [ 1, 'laptop', 'apple', 2000 ])\n", "insert(1, [ 1, 'chair', 'ikea', 50 ])\n", "\n", "# User 2\n", "insert(1, [ 2, 'phone', 'samsung', 600 ])\n", "insert(2, [ 2, 'laptop', 'dell', 1000 ])\n", "insert(1, [ 2, 'chair', 'ikea', 50 ])\n", "\n", "# User 3\n", "insert(1, [ 3, 'phone', 'apple', 900 ])\n", "insert(7, [ 3, 'phone', 'samsung', 400 ])\n", "insert(1, [ 3, 'chair', 'ikea', 50 ])\n", "\n", "# User 4\n", "insert(2, [ 4, 'phone', 'samsung', 500 ])\n", "insert(1, [ 4, 'laptop', 'hp', 1000 ])\n", "insert(1, [ 4, 'chair', 'ikea', 50 ])\n", "\n", "# User 5\n", "insert(1, [ 5, 'phone', 'apple', 1000 ])\n", "insert(1, [ 5, 'laptop', 'hp', 1000 ])\n", "insert(1, [ 5, 'chair', 'ikea', 50 ])\n", "insert(1, [ 5, 'candy', 'milka', 1 ])\n", "\n", "# User 6\n", "insert(2, [ 6, 'phone', 'apple', 800 ])\n", "insert(1, [ 6, 'laptop', 'asus', 1000 ])\n", "insert(4, [ 6, 'candy', 'milka', 1 ])\n", "insert(2, [ 6, 'candy', 'kitkat', 1 ])\n", "insert(1, [ 6, 'candy', 'snickers', 1 ])\n", "insert(2, [ 6, 'candy', 'mars', 1 ])\n", "insert(1, [ 6, 'candy', 'm&m', 1 ])\n", "\n", "# User 7\n", "insert(1, [ 7, 'phone', 'samsung', 500 ])\n", "insert(1, [ 7, 'laptop', 'hp', 1000 ])\n", "insert(1, [ 7, 'chair', 'ikea', 50 ])\n", "insert(1, [ 7, 'rocket', 'spacex', 1e9 ])\n", "\n", "print('Done')" ] }, { "source": [ "Let's get an overview of products and how many times they have been purchased.\n", "No anonymization yet - these are regular counts." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('candy', 11, 2),\n", " ('chair', 6, 6),\n", " ('laptop', 7, 6),\n", " ('phone', 18, 7),\n", " ('rocket', 1, 1)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productsalesunique_buyers
candy112
chair66
laptop76
phone187
rocket11
" }, "metadata": {}, "execution_count": 5 } ], "source": [ "%%sql\n", "SELECT product, count(*) as sales, count(distinct uid) as unique_buyers\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "We notice that some products have few unique buyers.\n", "Our goal is to protect their identity by applying anonymizing aggregates.\n", "\n", "The anonymizing version of the `count()` aggregate is `diffix_count()`.\n", "\n", "`diffix_x()` aggregates accept the AID as their first argument, followed by other possible arguments.\n", "\n", "In this case, `count(*)` becomes `diffix_count(uid)` and an aggregate such as `count(price)` becomes `diffix_count(uid, price)`.\n", "\n", "Let's see it in action." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 7, 7),\n", " ('chair', 6, 5),\n", " ('phone', 18, 12),\n", " ('candy', 11, 3),\n", " ('rocket', 1, 2)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
producttrue_countdiffix_count
laptop77
chair65
phone1812
candy113
rocket12
" }, "metadata": {}, "execution_count": 6 } ], "source": [ "%%sql\n", "SELECT product, count(*) as true_count, diffix_count(uid)\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "Some results are lower than their real count. Let's find out why that happened.\n", "\n", "`diffix_x()` functions include an `explain_diffix_x()` helper for debugging." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 'uniq=6, seed=e02c556f5e0f\\ntop=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]\\ntrue=7, flat=6, final=7'),\n", " ('chair', 'uniq=6, seed=d6ce455f4e08\\ntop=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]\\ntrue=6, flat=6, final=5'),\n", " ('phone', 'uniq=7, seed=c33a827d3c7e\\ntop=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]\\ntrue=18, flat=10, final=12'),\n", " ('candy', 'uniq=2, seed=f7f7a7d92a5a\\ntop=[6➔10 | 5➔1]\\ntrue=11, flat=2, final=3'),\n", " ('rocket', 'uniq=1, seed=842059fe8157\\ntop=[7➔1]\\ntrue=1, flat=0, final=2')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productexplain_diffix_count
laptopuniq=6, seed=e02c556f5e0f
top=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]
true=7, flat=6, final=7
chairuniq=6, seed=d6ce455f4e08
top=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]
true=6, flat=6, final=5
phoneuniq=7, seed=c33a827d3c7e
top=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]
true=18, flat=10, final=12
candyuniq=2, seed=f7f7a7d92a5a
top=[6➔10 | 5➔1]
true=11, flat=2, final=3
rocketuniq=1, seed=842059fe8157
top=[7➔1]
true=1, flat=0, final=2
" }, "metadata": {}, "execution_count": 7 } ], "source": [ "%%sql\n", "SELECT product, explain_diffix_count(uid)\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "We see that some rows have too few AIDs (`candy`, `rocket`) or they have individuals with large contributions (`phone`).\n", "\n", "We should filter out rows that have too few AIDs from the result set.\n", "We do that by using the `diffix_lcf(aid)` aggregate." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 7), ('chair', 5), ('phone', 12)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productdiffix_count
laptop7
chair5
phone12
" }, "metadata": {}, "execution_count": 8 } ], "source": [ "%%sql\n", "SELECT product, diffix_count(uid)\n", "FROM purchases\n", "GROUP BY product\n", "HAVING diffix_lcf(uid)" ] }, { "source": [ "Rows `candy` and `rocket` are gone.\n", "\n", "How does `diffix_lcf` work? Let's see." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 'uniq=6, seed=fbd63230ac41\\nthresh=4, pass=true'),\n", " ('chair', 'uniq=6, seed=630013b163d5\\nthresh=5, pass=true'),\n", " ('phone', 'uniq=7, seed=9e8c7f3f93c7\\nthresh=4, pass=true'),\n", " ('candy', 'uniq=2, seed=ebb4a2ec3ca5\\nthresh=5, pass=false'),\n", " ('rocket', 'uniq=1, seed=bff99cd5977f\\nthresh=5, pass=false')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productexplain_diffix_lcf
laptopuniq=6, seed=fbd63230ac41
thresh=4, pass=true
chairuniq=6, seed=630013b163d5
thresh=5, pass=true
phoneuniq=7, seed=9e8c7f3f93c7
thresh=4, pass=true
candyuniq=2, seed=ebb4a2ec3ca5
thresh=5, pass=false
rocketuniq=1, seed=bff99cd5977f
thresh=5, pass=false
" }, "metadata": {}, "execution_count": 9 } ], "source": [ "%%sql\n", "SELECT product, explain_diffix_lcf(uid)\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "Great! By using `diffix_count` and `diffix_lcf` have achieved our basic goal of anonymizing the count and hiding underrepresented buckets.\n", "\n", "Let's check out how many buckets have been suppressed in the process.\n" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[(2,)]" ], "text/html": "\n \n \n \n \n \n \n
suppressed
2
" }, "metadata": {}, "execution_count": 10 } ], "source": [ "%%sql\n", "SELECT count(*) as suppressed\n", "FROM (\n", " SELECT product, diffix_lcf(uid) as passes_lcf\n", " FROM purchases\n", " GROUP BY product\n", ") x\n", "WHERE not x.passes_lcf" ] }, { "source": [ "Let's try something else. Let's find out from what brands our products are sold from." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('candy', 'kitkat, mars, milka, m&m, snickers'),\n", " ('chair', 'ikea'),\n", " ('laptop', 'apple, asus, dell, hp'),\n", " ('phone', 'apple, samsung'),\n", " ('rocket', 'spacex')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productbrands
candykitkat, mars, milka, m&m, snickers
chairikea
laptopapple, asus, dell, hp
phoneapple, samsung
rocketspacex
" }, "metadata": {}, "execution_count": 11 } ], "source": [ "%%sql\n", "SELECT product, string_agg(DISTINCT brand, ', ') as brands\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "Some products have few brands. We'll try to apply anonymization with the goal of protecting the brands' identity. In this case we'll use `brand` as the AID." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 4), ('chair', 2), ('phone', 2), ('candy', 9), ('rocket', 2)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productdiffix_count
laptop4
chair2
phone2
candy9
rocket2
" }, "metadata": {}, "execution_count": 12 } ], "source": [ "%%sql\n", "SELECT product, diffix_count(brand)\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "The result is different from the `uid` anonymized count we had earlier.\n", "\n", "Let's compare the results and see what happened." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 7, 7, 4),\n", " ('chair', 6, 5, 2),\n", " ('phone', 18, 12, 2),\n", " ('candy', 11, 3, 9),\n", " ('rocket', 1, 2, 2)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
producttrue_countby_uidby_brand
laptop774
chair652
phone18122
candy1139
rocket122
" }, "metadata": {}, "execution_count": 13 } ], "source": [ "%%sql\n", "SELECT product,\n", " count(*) as true_count,\n", " diffix_count(uid) as by_uid,\n", " diffix_count(brand) as by_brand\n", "FROM purchases\n", "GROUP BY product" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 'uniq=6, seed=e02c556f5e0f\\ntop=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]\\ntrue=7, flat=6, final=7', 'uniq=4, seed=097d1f3ed09b\\ntop=[hp➔3, dell➔2 | apple➔1, asus➔1]\\ntrue=7, flat=4, final=4'),\n", " ('chair', 'uniq=6, seed=d6ce455f4e08\\ntop=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]\\ntrue=6, flat=6, final=5', 'uniq=1, seed=741a3c802039\\ntop=[ikea➔6]\\ntrue=6, flat=0, final=2'),\n", " ('phone', 'uniq=7, seed=c33a827d3c7e\\ntop=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]\\ntrue=18, flat=10, final=12', 'uniq=2, seed=9c4771a1c67f\\ntop=[samsung➔11, apple➔7 | ]\\ntrue=18, flat=0, final=2'),\n", " ('candy', 'uniq=2, seed=f7f7a7d92a5a\\ntop=[6➔10 | 5➔1]\\ntrue=11, flat=2, final=3', 'uniq=5, seed=ccbd9692e7da\\ntop=[milka➔5 | kitkat➔2, mars➔2, snickers➔1, m&m➔1]\\ntrue=11, flat=8, final=9'),\n", " ('rocket', 'uniq=1, seed=842059fe8157\\ntop=[7➔1]\\ntrue=1, flat=0, final=2', 'uniq=1, seed=c8d7aed4b124\\ntop=[spacex➔1 | ]\\ntrue=1, flat=0, final=2')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productby_uidby_brand
laptopuniq=6, seed=e02c556f5e0f
top=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]
true=7, flat=6, final=7
uniq=4, seed=097d1f3ed09b
top=[hp➔3, dell➔2 | apple➔1, asus➔1]
true=7, flat=4, final=4
chairuniq=6, seed=d6ce455f4e08
top=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]
true=6, flat=6, final=5
uniq=1, seed=741a3c802039
top=[ikea➔6]
true=6, flat=0, final=2
phoneuniq=7, seed=c33a827d3c7e
top=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]
true=18, flat=10, final=12
uniq=2, seed=9c4771a1c67f
top=[samsung➔11, apple➔7 | ]
true=18, flat=0, final=2
candyuniq=2, seed=f7f7a7d92a5a
top=[6➔10 | 5➔1]
true=11, flat=2, final=3
uniq=5, seed=ccbd9692e7da
top=[milka➔5 | kitkat➔2, mars➔2, snickers➔1, m&m➔1]
true=11, flat=8, final=9
rocketuniq=1, seed=842059fe8157
top=[7➔1]
true=1, flat=0, final=2
uniq=1, seed=c8d7aed4b124
top=[spacex➔1 | ]
true=1, flat=0, final=2
" }, "metadata": {}, "execution_count": 14 } ], "source": [ "%%sql\n", "SELECT product, \n", " explain_diffix_count(uid) as by_uid,\n", " explain_diffix_count(brand) as by_brand\n", "FROM purchases\n", "GROUP BY product" ] }, { "source": [ "Let's see what we get if we also include the overall count." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[(None, 'uniq=7, seed=c33a827d3c7e\\ntop=[6➔13, 3➔9 | 1➔5, 2➔4, 4➔4, 5➔4, 7➔4]\\ntrue=43, flat=29, final=31', 'uniq=12, seed=6f888ae1c421\\ntop=[samsung➔11 | apple➔8, ikea➔6, milka➔5, hp➔3, dell➔2]\\ntrue=43, flat=37, final=37'),\n", " ('laptop', 'uniq=6, seed=e02c556f5e0f\\ntop=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]\\ntrue=7, flat=6, final=7', 'uniq=4, seed=097d1f3ed09b\\ntop=[hp➔3, dell➔2 | apple➔1, asus➔1]\\ntrue=7, flat=4, final=4'),\n", " ('chair', 'uniq=6, seed=d6ce455f4e08\\ntop=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]\\ntrue=6, flat=6, final=5', 'uniq=1, seed=741a3c802039\\ntop=[ikea➔6]\\ntrue=6, flat=0, final=2'),\n", " ('phone', 'uniq=7, seed=c33a827d3c7e\\ntop=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]\\ntrue=18, flat=10, final=12', 'uniq=2, seed=9c4771a1c67f\\ntop=[samsung➔11, apple➔7 | ]\\ntrue=18, flat=0, final=2'),\n", " ('candy', 'uniq=2, seed=f7f7a7d92a5a\\ntop=[6➔10 | 5➔1]\\ntrue=11, flat=2, final=3', 'uniq=5, seed=ccbd9692e7da\\ntop=[milka➔5 | kitkat➔2, mars➔2, snickers➔1, m&m➔1]\\ntrue=11, flat=8, final=9'),\n", " ('rocket', 'uniq=1, seed=842059fe8157\\ntop=[7➔1]\\ntrue=1, flat=0, final=2', 'uniq=1, seed=c8d7aed4b124\\ntop=[spacex➔1 | ]\\ntrue=1, flat=0, final=2')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
productby_uidby_brand
Noneuniq=7, seed=c33a827d3c7e
top=[6➔13, 3➔9 | 1➔5, 2➔4, 4➔4, 5➔4, 7➔4]
true=43, flat=29, final=31
uniq=12, seed=6f888ae1c421
top=[samsung➔11 | apple➔8, ikea➔6, milka➔5, hp➔3, dell➔2]
true=43, flat=37, final=37
laptopuniq=6, seed=e02c556f5e0f
top=[2➔2, 1➔1 | 4➔1, 5➔1, 6➔1, 7➔1]
true=7, flat=6, final=7
uniq=4, seed=097d1f3ed09b
top=[hp➔3, dell➔2 | apple➔1, asus➔1]
true=7, flat=4, final=4
chairuniq=6, seed=d6ce455f4e08
top=[1➔1, 2➔1 | 3➔1, 4➔1, 5➔1, 7➔1]
true=6, flat=6, final=5
uniq=1, seed=741a3c802039
top=[ikea➔6]
true=6, flat=0, final=2
phoneuniq=7, seed=c33a827d3c7e
top=[3➔8, 1➔3 | 4➔2, 6➔2, 2➔1, 5➔1, 7➔1]
true=18, flat=10, final=12
uniq=2, seed=9c4771a1c67f
top=[samsung➔11, apple➔7 | ]
true=18, flat=0, final=2
candyuniq=2, seed=f7f7a7d92a5a
top=[6➔10 | 5➔1]
true=11, flat=2, final=3
uniq=5, seed=ccbd9692e7da
top=[milka➔5 | kitkat➔2, mars➔2, snickers➔1, m&m➔1]
true=11, flat=8, final=9
rocketuniq=1, seed=842059fe8157
top=[7➔1]
true=1, flat=0, final=2
uniq=1, seed=c8d7aed4b124
top=[spacex➔1 | ]
true=1, flat=0, final=2
" }, "metadata": {}, "execution_count": 15 } ], "source": [ "%%sql\n", "SELECT product, \n", " explain_diffix_count(uid) as by_uid,\n", " explain_diffix_count(brand) as by_brand\n", "FROM purchases\n", "GROUP BY GROUPING SETS (\n", " (product),\n", " ()\n", ")" ] }, { "source": [ "## Choosing an AID\n", "\n", "As we just saw, the choice of AID matters.\n", "\n", "Considering these are regular aggregates, we could combine them in a primitive attempt to anonymize by both AIDs.\n", "\n", "For each bucket, we take the lowest anonymized count and only print those rows that have sufficient unique AIDs of both types." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('laptop', 4), ('phone', 2)]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n
productanon_count
laptop4
phone2
" }, "metadata": {}, "execution_count": 16 } ], "source": [ "%%sql\n", "SELECT product, least( diffix_count(uid), diffix_count(brand) ) as anon_count\n", "FROM purchases\n", "GROUP BY product\n", "HAVING diffix_lcf(uid) AND diffix_lcf(brand)" ] }, { "source": [ "The results are pretty poor. Let's try if we can get something better if we take the overall count." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[(43, 31)]" ], "text/html": "\n \n \n \n \n \n \n \n \n
true_countanon_count
4331
" }, "metadata": {}, "execution_count": 17 } ], "source": [ "%%sql\n", "SELECT\n", " count(*) as true_count,\n", " least( diffix_count(uid), diffix_count(brand) ) as anon_count\n", "FROM purchases" ] }, { "source": [ "That's not too bad!" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Configuring Diffix\n", "\n", "Running the same query multiple times returns the same results because we are using pseudorandom sequences derived from AIDs and a base seed.\n", "\n", "Let's see our current seed." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('diffix',)]" ], "text/html": "\n \n \n \n \n \n \n
pg_diffix.noise_seed
diffix
" }, "metadata": {}, "execution_count": 18 } ], "source": [ "%%sql\n", "SHOW pg_diffix.noise_seed" ] }, { "source": [ "Let's compare results when using different seeds." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('uniq=7, seed=307da96580c4\\nthresh=3, pass=true',)]" ], "text/html": "\n \n \n \n \n \n \n
explain_diffix_lcf
uniq=7, seed=307da96580c4
thresh=3, pass=true
" }, "metadata": {}, "execution_count": 19 } ], "source": [ "%%sql\n", "SET pg_diffix.noise_seed = 'some seed';\n", "\n", "SELECT explain_diffix_lcf(uid)\n", "FROM purchases;" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('uniq=7, seed=e41c45431789\\nthresh=4, pass=true',)]" ], "text/html": "\n \n \n \n \n \n \n
explain_diffix_lcf
uniq=7, seed=e41c45431789
thresh=4, pass=true
" }, "metadata": {}, "execution_count": 20 } ], "source": [ "%%sql\n", "SET pg_diffix.noise_seed = 'some other seed';\n", "\n", "SELECT explain_diffix_lcf(uid)\n", "FROM purchases;" ] }, { "source": [ "Let's return it to default value." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('diffix',)]" ], "text/html": "\n \n \n \n \n \n \n
pg_diffix.noise_seed
diffix
" }, "metadata": {}, "execution_count": 21 } ], "source": [ "%%sql\n", "RESET pg_diffix.noise_seed;\n", "SHOW pg_diffix.noise_seed;" ] }, { "source": [ "Here are all parameters that we can customize. Only superusers are allowed to modify these." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('pg_diffix.low_count_threshold_max', '5', 'Maximum low count threshold (inclusive).'),\n", " ('pg_diffix.low_count_threshold_min', '2', 'Minimum low count threshold (inclusive).'),\n", " ('pg_diffix.noise_seed', 'diffix', 'Seed used for initializing noise layers.'),\n", " ('pg_diffix.noise_sigma', '1', 'Standard deviation of noise added to aggregates.'),\n", " ('pg_diffix.outlier_count_max', '2', 'Maximum outlier count (inclusive).'),\n", " ('pg_diffix.outlier_count_min', '1', 'Minimum outlier count (inclusive).'),\n", " ('pg_diffix.top_count_max', '6', 'Maximum top contributors count (inclusive).'),\n", " ('pg_diffix.top_count_min', '4', 'Minimum top contributors count (inclusive).')]" ], "text/html": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
namesettingshort_desc
pg_diffix.low_count_threshold_max5Maximum low count threshold (inclusive).
pg_diffix.low_count_threshold_min2Minimum low count threshold (inclusive).
pg_diffix.noise_seeddiffixSeed used for initializing noise layers.
pg_diffix.noise_sigma1Standard deviation of noise added to aggregates.
pg_diffix.outlier_count_max2Maximum outlier count (inclusive).
pg_diffix.outlier_count_min1Minimum outlier count (inclusive).
pg_diffix.top_count_max6Maximum top contributors count (inclusive).
pg_diffix.top_count_min4Minimum top contributors count (inclusive).
" }, "metadata": {}, "execution_count": 22 } ], "source": [ "%%sql\n", "SELECT name, setting, short_desc FROM pg_settings\n", "WHERE name LIKE 'pg_diffix.%'" ] } ] }