# v0.66.0 — DuckLake Phase 3a: Parquet Sink Infrastructure

**Status:** Planned  
**Scope:** Large  
**Theme:** DuckLake Sink — Core Write Path  
**Plan reference:** [plans/ecosystem/PLAN_DUCKLAKE.md — Phase 3](../plans/ecosystem/PLAN_DUCKLAKE.md#phase-3-ducklake-sink-and-view-registration-v066v068)

---

## Why This Release

v0.65.0 (Phase 2) made pg_trickle an efficient *reader* of DuckLake — it can
consume change feeds and maintain stream tables with O(Δ) work. v0.66.0 closes
the other half of the loop: pg_trickle becomes a *writer* into DuckLake.

The value of this is significant. Any stream table — a per-minute aggregation, a
running total, a filtered join, a machine-learning feature vector — can now be
published back into the data lake as a first-class DuckLake table, readable by
DuckDB, Spark, Trino, and every other analytics engine in the ecosystem. No
custom ETL, no Kafka, no Airflow: a single `CREATE STREAM TABLE … SINK 'ducklake'`
call turns any incrementally maintained result set into a live Parquet dataset.

This release implements the write-path plumbing. The next release (v0.67.0) adds
discoverability and the tutorials that make the feature approachable.

---

## Deliverables

### Features (extension code)

| ID | Feature | Effort |
|----|---------|--------|
| F-4 | Parquet delta export via `arrow-rs` | 2 weeks |
| F-2 | DuckLake sink output mode (`sink => 'ducklake'`) | 2 weeks |
| —  | S3 / object-store upload integration | 1 week |
| —  | DuckLake catalog transaction writer | 1 week |
| F-9 | Encryption key pass-through | 1 week |

#### F-4: Parquet Delta Export

For any stream table, allow the computed delta from each refresh cycle to be
serialised as a Parquet file and written to a configurable object-store path.
This is the foundation for F-2 and is also independently useful for users who
want to keep a change log in their data lake:

```sql
SELECT pgtrickle.alter_stream_table(
    'revenue_by_region',
    delta_export_path => 's3://my-lake/deltas/revenue_by_region/'
);
```

Implementation uses the `arrow-rs` crate, which is already a transitive
dependency of several pgrx-ecosystem crates. Parquet files are written with
snappy compression by default; `delta_export_compression` GUC allows `gzip`,
`zstd`, or `none`.

Each file is named `<epoch_ms>_<refresh_id>.parquet` so files sort
chronologically and are easily consumed by downstream processes.

#### F-2: DuckLake Sink Output Mode

Add a `sink => 'ducklake'` parameter to `pgtrickle.create_stream_table()` and
`pgtrickle.alter_stream_table()`. When this mode is active, the refresh worker:

1. Runs the differential SQL to produce the Δ result set.
2. Serialises the delta to Parquet (F-4).
3. Uploads the Parquet file to object storage.
4. Opens a DuckLake catalog transaction and records the new data file.
5. Commits the DuckLake snapshot, making the results visible to all DuckLake
   clients atomically.

The sink is append-only by default (stream tables produce deltas, and DuckLake
accumulates them). A `ducklake_sink_mode = 'append' | 'replace'` option is
provided for users who want to overwrite the full result on each refresh cycle
instead of accumulating deltas.

#### S3 / Object-Store Upload Integration

A thin upload layer wraps the `object_store` crate (already in the Rust
ecosystem) to support S3, GCS, Azure Blob, and local-filesystem paths. The
upload layer:

- Retries on transient network errors (up to a configurable limit).
- Uses multipart upload for files above a configurable threshold (default 5 MB).
- Respects the AWS environment-variable credential chain and PostgreSQL secrets
  stored via `pg_trickle.ducklake_s3_access_key` / `ducklake_s3_secret_key`
  GUCs (values stored encrypted via `pg_trickle.encryption_key`).

#### DuckLake Catalog Transaction Writer

The catalog writer opens a PostgreSQL transaction, inserts a row into
`ducklake_data_file`, updates `ducklake_table_stats`, and inserts a new row in
`ducklake_snapshot`. It uses Spi::connect() with a short-lived connection —
consistent with existing pg_trickle catalog access patterns — and rolls back
cleanly if the S3 upload fails so that the catalog never references a file that
does not exist.

#### F-9: Encryption Key Pass-Through

If the DuckLake catalog has `ducklake_data_file.encryption_key` set for existing
files, the writer must generate a fresh Parquet encryption key for each new file
it writes, store that key in `ducklake_data_file.encryption_key` using the same
format DuckLake itself uses, and apply the key during Parquet serialisation.
This is a small but essential feature: many production lakes are encrypted from
day one, and a sink that cannot write to them is not useful.

### Tests

| Item | Effort |
|------|--------|
| E2E tests: DuckLake sink (write path) | 1 week |
| Unit tests: catalog transaction writer rollback on S3 failure | 2 days |

The E2E tests stand up a full stack (PostgreSQL + DuckLake catalog + mock S3
via `MinIO`) and verify:

- Parquet files appear on object storage after each refresh cycle.
- DuckLake `SELECT` from the sink table returns the same rows as a direct
  `SELECT` from the pg_trickle stream table.
- A simulated S3 failure rolls back the DuckLake catalog transaction and does
  not leave dangling snapshot records.
- Encrypted-lake mode round-trips correctly.

---

## Release Gate

This release does **not** ship until:

1. The E2E DuckLake sink tests pass with a live MinIO target.
2. A catalog-rollback test proves that a failed S3 upload leaves zero stale
   `ducklake_snapshot` or `ducklake_data_file` rows.
3. `just lint` passes with zero warnings.
4. `just test-e2e` passes.

---

## What Comes Next

[v0.67.0](v0.67.0.md) — DuckLake Phase 3b: view registration, snapshot
provenance, pg-tide pipeline tutorial, and the tutorials and demos that make the
sink story accessible to a broad audience.