Query Rewriting Pipeline
========================

This page is a detailed walkthrough of :cfile:`provsql.c`, the core of
ProvSQL.  It describes how every SQL query is intercepted and rewritten
to propagate provenance tokens.

For the high-level architecture and data-flow overview, see
:doc:`architecture`.


Entry Point: ``provsql_planner``
--------------------------------

PostgreSQL's planner hook allows extensions to intercept every query
before planning.  :cfunc:`provsql_planner` is installed by
:cfunc:`_PG_init` and is called for every query:

1. ``INSERT ... SELECT`` (``CMD_INSERT``): delegates to
   :cfunc:`process_insert_select` to propagate provenance from the
   source ``SELECT`` into the target table's ``provsql`` column.

2. ``SELECT`` (``CMD_SELECT``): checks whether any relation in the
   range table carries a ``provsql`` column (:cfunc:`has_provenance`).
   If so, calls :cfunc:`process_query` to rewrite the query tree.

3. After rewriting, the (possibly modified) query is passed to the
   previous planner hook or ``standard_planner``.

When ``provsql.verbose_level >= 20`` (PostgreSQL 15+), the full query
text is logged before and after rewriting.  At level ≥ 40, the time
spent in the rewriter is logged.


``process_query``: The Main Rewriting Function
-----------------------------------------------

:cfunc:`process_query` is the heart of ProvSQL.  It receives a
``Query`` tree, rewrites it to carry provenance, and returns the
modified tree.  The function is recursive: subqueries and CTEs are
processed by re-entering :cfunc:`process_query`.

The function proceeds in the following order:

Step 0: Early Exit
^^^^^^^^^^^^^^^^^^

If the query has no ``FROM`` clause (``q->rtable == NULL``), there is
nothing to track -- return immediately.

Step 1: CTE Inlining
^^^^^^^^^^^^^^^^^^^^

Non-recursive common table expressions (``WITH`` clauses) are
inlined as subqueries in the range table via :cfunc:`inline_ctes`.
This converts ``RTE_CTE`` entries to ``RTE_SUBQUERY`` so that the
subsequent recursive processing can track provenance through them.
Recursive CTEs are not supported and raise an error.

Inlining happens before set-operation handling because ``UNION`` /
``EXCEPT`` branches may reference CTEs.

Step 2: Strip Existing Provenance Columns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`remove_provenance_attributes_select` scans the target list and
removes any existing ``provsql`` UUID columns (which might be inherited
from base tables or subqueries).  These are stripped so that the
rewriter can append a single, freshly computed provenance expression at
the end.  Matching ``GROUP BY`` / ``ORDER BY`` references and
set-operation column lists are also adjusted.

Step 3: Set-Operation Handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the query has ``setOperations`` (``UNION``, ``EXCEPT``):

- **Non-ALL variants** (``UNION``, ``EXCEPT`` without ``ALL``):
  :cfunc:`rewrite_non_all_into_external_group_by` wraps the set
  operation in a new outer query with ``GROUP BY`` on all columns.
  This implements duplicate elimination as provenance addition (⊕).
  The function then re-enters :cfunc:`process_query` on the wrapper.

- ``UNION ALL``: each branch is processed independently.  The
  provenance tokens from different branches are combined with ⊕
  (``provenance_plus``) by :cfunc:`process_set_operation_union`.

- ``EXCEPT ALL``: :cfunc:`transform_except_into_join` rewrites
  ``A EXCEPT ALL B`` as a ``LEFT JOIN`` with a ``provenance_monus``
  (⊖) gate, plus a filter removing zero-provenance tuples.

- ``INTERSECT`` is not supported (raises an error).

Step 4: Aggregate ``DISTINCT`` Rewrite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the query has aggregates with ``DISTINCT`` (e.g.,
``COUNT(DISTINCT x)``), :cfunc:`rewrite_agg_distinct` performs a
structural rewrite: the ``DISTINCT`` inside the aggregate is moved to
an inner subquery with ``GROUP BY``, and the outer query aggregates
over the deduplicated results.  The function returns a new query tree,
and :cfunc:`process_query` re-enters on it.

Step 5: Discovery -- ``get_provenance_attributes``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`get_provenance_attributes` walks every entry in the range table
and collects ``Var`` nodes pointing to provenance columns:

- **RTE_RELATION** (base table): scans column names for one called
  ``provsql`` of type UUID.

- **RTE_SUBQUERY**: recursively calls :cfunc:`process_query` on the
  subquery.  If the subquery's rewritten target list contains a
  ``provsql`` column, a ``Var`` pointing to it is added to the parent's
  provenance list.  Outer ``Var`` attribute numbers are adjusted to
  account for any columns removed from the inner query.

- **RTE_JOIN**: for inner, left, right, and full joins, nothing is
  collected directly -- the underlying base-table RTEs supply the
  tokens.  Semi-joins and anti-joins raise an error.

- **RTE_FUNCTION**: if the function returns a single UUID column named
  ``provsql``, it is collected.

- **RTE_VALUES** / **RTE_GROUP**: no provenance to collect.

If no provenance attributes are found, the query is returned unmodified.

Step 6: Unsupported Feature Checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before proceeding, the function checks for:

- **Sublinks** (``EXISTS``, ``IN``, scalar subqueries): not supported.
- ``DISTINCT ON``: not supported.
- ``DISTINCT`` (plain): converted to ``GROUP BY`` via
  :cfunc:`transform_distinct_into_group_by`.
- ``GROUPING SETS`` / ``CUBE`` / ``ROLLUP``: not supported (except
  the trivial ``GROUP BY ()``).

Step 7: Build Column Map
^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`build_column_map` creates a per-RTE mapping from column
attribute numbers to global column identifiers.  This is used by
where-provenance to record which columns participate in equijoin
conditions and projections.

Step 8: Aggregation Rewriting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the query has aggregates, :cfunc:`replace_aggregations_by_provenance_aggregate`
walks the target list and replaces each standard aggregate (e.g.,
``SUM``, ``COUNT``) with a ``provenance_aggregate`` call that wraps
the original aggregate result and the provenance of the aggregated
rows.  The provenance of grouped rows is combined with ⊕
(``provenance_plus``) via ``array_agg`` + ``provenance_plus``.

After aggregation rewriting:

- :cfunc:`migrate_aggtoken_quals_to_having` moves any ``WHERE``
  comparisons on aggregate results to ``HAVING``, because
  aggregate-typed values can only be filtered after grouping.

- :cfunc:`insert_agg_token_casts` inserts type casts for
  :cfunc:`agg_token` values used in arithmetic or window functions.

See :doc:`aggregation` for the semantics of the ``agg`` /
``semimod`` / ``value`` gates produced here, and for the exact
shape of the ``provenance_aggregate`` call that replaces each
``Aggref``.

Step 9: Expression Building -- ``make_provenance_expression``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`make_provenance_expression` combines the collected provenance
``Var`` nodes into a single expression tree:

**Combining tokens from multiple tables** (the ⊗ / ⊕ / ⊖ choice):

- ``SR_TIMES`` (default for ``SELECT ... FROM ... WHERE``): wraps
  the list of provenance tokens in ``provenance_times(ARRAY[...])``
  to create a multiplication gate.

- ``SR_PLUS`` (``UNION ALL``): uses the single provenance token
  from the union directly (each branch already has its own token).

- ``SR_MONUS`` (``EXCEPT ALL``): wraps the two tokens in
  ``provenance_monus(left, right)``.

If a single table is in the ``FROM`` clause, no combining function is
needed -- the token is used as-is.

**GROUP BY / aggregation**:  When ``group_by_rewrite`` or
``aggregation`` is true, the combined token is wrapped in
``provenance_plus(array_agg(token))`` -- this sums the provenance of
all tuples in each group.

**Delta gate**:  For queries with aggregation but no ``HAVING``
clause, a ``provenance_delta`` gate is added.  This implements the
δ-semiring operator that normalizes aggregate provenance.

**HAVING**:  When a ``HAVING`` clause is present,
:cfunc:`having_Expr_to_provenance_cmp` translates the ``HAVING``
predicate into a ``provenance_cmp`` gate tree.  The original
``havingQual`` is removed from the query (its semantics are now
captured in the provenance circuit).

**Where-provenance** (when ``provsql.where_provenance`` is enabled):

- **Equijoin gates**: :cfunc:`add_eq_from_Quals_to_Expr` scans
  ``JOIN ... ON`` conditions and ``WHERE`` equalities, wrapping
  the provenance expression in ``provenance_eq`` gates that record
  which column pairs were compared.

- **Projection gates**: if the output columns are a strict subset
  of the input columns (or are reordered), a ``provenance_project``
  gate is added with an integer array recording the column mapping.

Step 10: Splicing -- ``add_to_select``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`add_to_select` appends the provenance expression to the
query's target list as a new column named ``provsql``.  If the query
has ``GROUP BY``, the column is added to the ``groupClause`` as well.

Step 11: Replace ``provenance()`` Calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:cfunc:`replace_provenance_function_by_expression` walks the target
list looking for calls to the ``provenance()`` SQL function.  Each
occurrence is replaced with the computed provenance expression, so
that ``SELECT provenance() FROM ...`` returns the actual provenance
token.


Rewriting Rules and Formal Semantics
-------------------------------------

The rewriting implemented by :cfile:`provsql.c` realises the
rewriting rules (R1)--(R5) from the ICDE 2026 paper
:cite:`sen2026provsql`, which is the authoritative reference for
the formal semantics and correctness results.  The rules are stated
over an extended relational algebra on annotated relations;
:cfile:`provsql.c` has to reproduce them on PostgreSQL ``Query``
trees, so the mapping below is approximate rather than literal --
the pipeline phases described earlier on this page interleave the
rules with PostgreSQL-specific bookkeeping (range-table surgery,
target-list rewriting, ``HAVING`` handling, where-provenance...).

.. list-table::
   :header-rows: 1
   :widths: 15 25 60

   * - Rule
     - Algebra operator
     - Concrete realisation in :cfile:`provsql.c`
   * - (R1)
     - Projection :math:`\Pi`
     - The provenance token column is *kept* on the target list.
       :cfunc:`get_provenance_attributes` collects the token
       columns from the range table and :cfunc:`add_to_select`
       appends the computed token expression back onto the
       rewritten target list.
   * - (R2)
     - Cross product / join :math:`\times`
     - The ``SR_TIMES`` branch of
       :cfunc:`make_provenance_expression` wraps the list of
       per-input tokens in a ``provenance_times(ARRAY[...])``
       (⊗) call.
   * - (R3)
     - Duplicate elimination :math:`\varepsilon`
     - ``SELECT DISTINCT``, ``UNION``, and ``EXCEPT`` (the
       non-``ALL`` set operations) are all rewritten into an outer
       ``GROUP BY`` by
       :cfunc:`transform_distinct_into_group_by` /
       :cfunc:`rewrite_non_all_into_external_group_by`, and the
       grouped tokens are combined with ``provenance_plus`` (⊕)
       via ``array_agg`` inside
       :cfunc:`make_provenance_expression`.
   * - (R4)
     - Multiset difference :math:`-`
     - ``EXCEPT ALL`` is rewritten by
       :cfunc:`transform_except_into_join` into a ``LEFT JOIN`` on
       all data columns, and the ``SR_MONUS`` branch of
       :cfunc:`make_provenance_expression` wraps the two tokens in
       ``provenance_monus`` (⊖).
   * - (R5)
     - Aggregation :math:`\gamma`
     - :cfunc:`replace_aggregations_by_provenance_aggregate` walks
       the target list and replaces each ``Aggref`` with a
       ``provenance_aggregate`` call built by
       :cfunc:`make_aggregation_expression` (which in turn wraps
       arguments in ``provenance_semimod``).  The enclosing
       provenance expression is then wrapped in a
       ``provenance_delta`` (δ) gate by
       :cfunc:`make_provenance_expression`.

Two algebra operators are deliberately **not** rewritten, matching
the paper:

- **Selection** :math:`\sigma` -- ``WHERE`` clauses pass through
  unchanged because selection does not affect provenance (with
  where-provenance enabled, the rewriter additionally wraps the
  result in ``provenance_eq`` / ``provenance_project`` gates, but
  that is orthogonal to R1--R5).
- **Multiset sum** :math:`\uplus` -- ``UNION ALL`` is left alone by
  :cfunc:`process_set_operation_union`; the tokens from each branch
  flow through independently.

Formal Verification
^^^^^^^^^^^^^^^^^^^

The correctness of rules (R1) and (R2) -- projection and cross
product -- has a machine-checked proof in the ProvSQL Lean 4
library.  The main theorem is
`Query.rewriting_valid
<https://provsql.org/lean-docs/Provenance/QueryRewriting.html#Query.rewriting_valid>`_
in the ``Provenance.QueryRewriting`` module: for every
supported relational-algebra query ``q`` and every annotated
database ``d``, evaluating ``q`` against the annotated semantics
and then projecting to the composite (tuple + annotation)
representation yields the same result as evaluating the
rewritten query against the composite database.  This is the
formal analogue of the "the rewritten query produces the same
provenance as the annotated semantics" correctness statement
from the ICDE paper, for the partial fragment proved.