# Analyst guide

This document describes features and restrictions of `pg_diffix` for users with anonymized access to a database.
The [banking notebook](banking.ipynb) provides a walkthrough with examples and explanations of various
mechanisms that Diffix uses to protect personal data.

## Table of Contents

- [Access levels](#access-levels)
- [Anonymizing queries](#anonymizing-queries)
  - [Explicit grouping](#explicit-grouping)
  - [Implicit grouping](#implicit-grouping)
  - [Input filtering](#input-filtering)
- [Unrestricted queries](#unrestricted-queries)
- [Post processing](#post-processing)
- [Utility statements](#utility-statements)
  - [Other SQL statements](#other-sql-statements)
  - [diffix.show_labels()](#diffixshow_labels)
  - [diffix.access_level()](#diffixaccess_level)
- [Suppress bin](#suppress-bin)
- [Supported functions](#supported-functions)
  - [Aggregates](#aggregates)
    - [diffix.count_histogram(aid, bin_size)](#diffixcount_histogramaid-bin_size)
  - [Numeric generalization functions](#numeric-generalization-functions)
    - [diffix.floor_by(col, K)](#diffixfloor_bycol-k)
    - [diffix.round_by(col, K)](#diffixround_bycol-k)
    - [diffix.ceil_by(col, K)](#diffixceil_bycol-k)
    - [width_bucket(operand, low, high, count)`](#width_bucketoperand-low-high-count)
  - [String generalization functions](#string-generalization-functions)
    - [substring(text_column, start, count)](#substringtext_column-start-count)
  - [Type casts](#type-casts)
  - [Utility functions](#utility-functions)
    - [diffix.is_suppress_bin(*)](#diffixis_suppress_bin)
    - [diffix.unnest_histogram(histogram)](#diffixunnest_histogramhistogram)

# Access levels

Users can have one of the following access levels to a database:

- `direct` - Direct (non-anonymized) access to data. Restrictions listed in this document do not apply in direct mode.
- `anonymized_trusted` - Anonymized access to data. Prevents accidental release of personal data.
- `anonymized_untrusted` - Anonymized access to data. Prevents intentional release of personal data.

Use `SELECT diffix.access_level()` to see the current access level.

# Anonymizing queries

In anonymized access level, queries targeting personal tables are restricted to a limited subset of SQL.
A personal table is a relation that contains data of individuals or other protected entities.
Administrators identify and mark such tables during configuration.

When selecting data from personal tables, the following SQL features are allowed:

## Explicit grouping

Grouping queries have the following form:

```
SELECT col1, col2, ..., count(...)
FROM personal_table
GROUP BY col1, col2, ...
```

Zero or more table columns `col1`, `col2`, ... may be specified.
[Numeric](#numeric-generalization-functions) and [string](#string-generalization-functions) columns may optionally be generalized.

`count(...)` is any of the supported [aggregate](#aggregates) variants.
Any number of aggregates may be specified (including none).

**Example:**

```
SELECT city, year_of_birth, count(*), diffix.count_noise(*)
FROM customers
GROUP BY city, year_of_birth
```

**Example:**

```
SELECT count(*), count(DISTINCT city)
FROM customers
```

## Implicit grouping

The `GROUP BY` clause may be omitted:

```
SELECT col1, col2, ..., colX
FROM personal_table
```

In this case, the results are grouped implicitly by the selected columns.
Bins with insufficient individuals are filtered out, and the remaining
bins are repeated for (anonymized) `count(*)` times.

Semantically this is equivalent to:

```
SELECT col1, col2, ..., colX, generate_series(1, anon_count)
FROM (
  SELECT col1, col2, ..., colX, count(*) AS anon_count
  FROM personal_table
  GROUP BY col1, col2, ..., colX
) x
```

**Example:**

```
SELECT city, year_of_birth
FROM customers
```

## Input filtering

The `WHERE` clause may specify one or more simple equality conditions which all have to match simultaneously for a row to be included in the analysis.

A simple equality condition consists of an equality between a raw or generalized column, on the left side, and a constant value, on the right side.

Combining conditions using `OR` or specifying oher types of conditions is not permitted.

**Example:**

```
SELECT count(*)
FROM customers
WHERE city = 'Berlin' AND gender = 'F'
```

**Example:**

```
SELECT city, count(*)
FROM customers
WHERE substring(date_of_birth, 1, 4) = 2000
GROUP BY city
```

AID columns are not allowed in filtering conditions.
Administrators can additionally restrict the usage of certain columns for data filtering by untrusted users.

# Unrestricted queries

Queries against personal tables are limited to the SQL subset described in the [anonymized queries](#anonymized-queries) section.
However, queries that do not target personal tables are unrestricted, meaning any SQL can be executed.

If `pg_diffix.treat_unmarked_tables_as_public` is set to `false`, then tables not marked as `personal` or `public` are not queryable in any way.
Use [diffix.show_labels()](#diffixshow_labels) to see which tables are marked as `personal` or `public`.

# Post processing

Any SQL may be executed on the output of anonymizing subqueries as their output is no longer considered personal.

**Example:** The following query selects the average number of individuals in each city.

```
SELECT avg(x.num_individuals)
FROM (
  SELECT city, count(*) as num_individuals
  FROM customers
  GROUP BY city
) x
```

Projections of grouping expressions in `SELECT` and the `HAVING` clause of anonymizing
queries are also considered post processing and are accepted.

**Example:** The following query post processes the grouping labels and filters the output bins.

```
SELECT 'City: ' || upper(city), count(*)
FROM customers
GROUP BY city
HAVING count(*) > 10
```

# Utility statements

## Other SQL statements

Most utility statements are disallowed in anonymized access level.
Allowed SQL statements include:

- `EXPLAIN` - explain is partially allowed. Options `COSTS` and `ANALYZE` are rejected.
- `SET` - a subset of parameters can be changed by non-superusers, even in anonymized mode.
  Note that most `pg_diffix` parameters require superuser access.

Any statement that alters the database (`INSERT`, `DELETE`, `UPDATE`, `DROP`, ...) or causes any other persistent state change is forbidden.

`COPY` statements and other indirect methods of reading data from the database are forbidden.

## diffix.show_labels()

Run `SELECT * FROM diffix.show_labels()` to see the complete list of labels on database objects.

## diffix.access_level()

Run `SELECT diffix.access_level()` to see the current [access level](#access-levels) of the session.
Possible values are `direct`, `anonymized_trusted`, and `anonymized_untrusted`.

# Suppress bin

When performing grouping in queries that target personal tables, bins that pertain to too few individuals are suppressed.

Diffix combines the counts of all suppressed rows to a special **suppress bin**.
The suppress bin is the first row of the anonymized query result, unless sorted or [post processed](#post-processing) by other means.
The suppress bin may itself be suppressed.
All grouping columns of the suppress bin are set to `*` for text-typed columns and `NULL` for other types.

The configuration parameter `pg_diffix.compute_suppress_bin` controls whether the suppress bin is computed and emitted.

The configuration parameter `pg_diffix.text_label_for_suppress_bin` specifies the value to use for text-typed grouping labels.
Default value is `*`.

The aggregate `diffix.is_suppress_bin(*)` returns `true` for the suppress bin, and false for any other bin.
This can be used to identify the suppress bin if `NULL` values are ambiguous for a grouping label.

# Supported functions

## Aggregates

The following versions of aggregates are supported:

- `count(*)` - count all rows.
- `count(col)` - counts non-null occurrences of the given column.
- `count(distinct col)` - counts distinct values of the given column.
- `sum(col)` - sums values in the given column.
- `avg(col)` - calculates the average of the given column.
- `diffix.count_histogram(aid, bin_size=1)` - computes a histogram that describes the distribution of rows among entities.
  See [below](#diffixcount_histogramaid-bin_size) for details.

Results of these aggregates are anonymized by applying noise as described in the specification.

Each of the `count(...)`, `sum(...)`, `avg(...)` has an accompanying aggregate,
which returns the approximate magnitude of noise added during anonymization (in terms of its standard deviation).
These are: `diffix.count_noise(...)`, `diffix.sum_noise(...)`, `diffix.avg_noise(...)`, respectively.

### diffix.count_histogram(aid, bin_size)

Returns a 2-dimensional array of shape `bigint[][2]`, where each entry is a pair of `[row_count, num_entities]`.
The `row_count` represents the number of rows contributed by `num_entities` distinct protected entities.

**Example:**

```
SELECT diffix.count_histogram(account)
FROM transactions;

        count_histogram
--------------------------------
 {{NULL,7},{1,15},{2,13},{4,6}}
(1 row)
```

The result of the above query can be interpreted as:
15 accounts have made a single transaction (1 row in result bucket), 13 accounts have made 2 transactions (2 rows),
6 accounts have made 4 transactions, and 7 accounts have made some other number of transactions (identified by the `NULL` count).

The reported `num_entities` is a noisy value, but not the `row_count` itself. Bins with insufficient `num_entities` are merged to
a suppress bin of shape `{NULL, num_entities}` where `num_entities` is also noisy. The suppress bin may itself be suppressed.

The optional `bin_size` parameter allows generalizing the bins' `row_count` to minimize suppression.
It acts identically to the `diffix.floor_by()` function.

The histogram array can be unwrapped to a set of pairs by using [diffix.unnest_histogram()](#diffixunnest_histogramhistogram).

**Restrictions:** The `aid` parameter must be a reference to a column tagged as an AID (identifier of a protected entity).

In untrusted mode, `bin_size` is restricted to a money style number:
1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.

## Numeric generalization functions

### diffix.floor_by(col, K)

Rounds column to width `K` and aligns to the lower edge of the interval. Equivalent to `floor(col / K) * K`.

**Restrictions:** In untrusted mode, `K` is restricted to a money style number:
1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.

### diffix.round_by(col, K)

Rounds column to width `K` and aligns to the closest edge of the interval. Equivalent to `round(col / K) * K`.

**Restrictions:** Not allowed in untrusted mode.

### diffix.ceil_by(col, K)

Rounds column to width `K` and aligns to the upper edge of the interval. Equivalent to `ceil(col / K) * K`.

**Restrictions:** Not allowed in untrusted mode.

### width_bucket(operand, low, high, count)`

Default [Postgres function](https://www.postgresql.org/docs/14/functions-math.html) that returns the
bucket number of the column value in an equal-width histogram.

**Restrictions:** Not allowed in untrusted mode.

## String generalization functions

### substring(text_column, start, count)

Default [Postgres function](https://www.postgresql.org/docs/14/functions-string.html).

**Restrictions:** In untrusted mode, only `start = 1` is allowed.

## Type casts

When selecting or generalizing columns of personal tables, the following type conversions are allowed:

- All numeric types may be converted to other numeric types.
  Numeric types are: `smallint`, `integer`, `bigint`, `float`, `double`, and `numeric` (`decimal`).
  When converting from a real type to an integer type, implicit rounding occurs (away from zero).
  Such casts count as generalizing expressions and no further generalizations are allowed.
- Date/time types may be converted to text.
  Date/time types are: `date`, `time`, `timestamp`, `timetz`, and `timestamptz`.
  The [DateStyle](https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-DATESTYLE)
  configuration parameter determines the output format of the string.

**Example:**

```
SELECT substring(date_of_birth::text, 1, 4) AS year, count(*)
FROM customers
GROUP BY 1
```

## Utility functions

### diffix.is_suppress_bin(*)

Aggregate that returns `true` only for the suppress bin, `false` otherwise.

### diffix.unnest_histogram(histogram)

Unnests a 2-dimensional array into a result set of 1-dimensional arrays.

**Example:**

```
SELECT diffix.unnest_histogram(diffix.count_histogram(account)) AS bins
FROM transactions;

   bins
----------
 {NULL,7}
 {1,15}
 {2,13}
 {4,6}
(4 rows)
```