# v0.66.0 — DuckLake Phase 3a: Parquet Sink Infrastructure **Status:** Planned **Scope:** Large **Theme:** DuckLake Sink — Core Write Path **Plan reference:** [plans/ecosystem/PLAN_DUCKLAKE.md — Phase 3](../plans/ecosystem/PLAN_DUCKLAKE.md#phase-3-ducklake-sink-and-view-registration-v066v068) --- ## Why This Release v0.65.0 (Phase 2) made pg_trickle an efficient *reader* of DuckLake — it can consume change feeds and maintain stream tables with O(Δ) work. v0.66.0 closes the other half of the loop: pg_trickle becomes a *writer* into DuckLake. The value of this is significant. Any stream table — a per-minute aggregation, a running total, a filtered join, a machine-learning feature vector — can now be published back into the data lake as a first-class DuckLake table, readable by DuckDB, Spark, Trino, and every other analytics engine in the ecosystem. No custom ETL, no Kafka, no Airflow: a single `CREATE STREAM TABLE … SINK 'ducklake'` call turns any incrementally maintained result set into a live Parquet dataset. This release implements the write-path plumbing. The next release (v0.67.0) adds discoverability and the tutorials that make the feature approachable. --- ## Deliverables ### Features (extension code) | ID | Feature | Effort | |----|---------|--------| | F-4 | Parquet delta export via `arrow-rs` | 2 weeks | | F-2 | DuckLake sink output mode (`sink => 'ducklake'`) | 2 weeks | | — | S3 / object-store upload integration | 1 week | | — | DuckLake catalog transaction writer | 1 week | | F-9 | Encryption key pass-through | 1 week | #### F-4: Parquet Delta Export For any stream table, allow the computed delta from each refresh cycle to be serialised as a Parquet file and written to a configurable object-store path. This is the foundation for F-2 and is also independently useful for users who want to keep a change log in their data lake: ```sql SELECT pgtrickle.alter_stream_table( 'revenue_by_region', delta_export_path => 's3://my-lake/deltas/revenue_by_region/' ); ``` Implementation uses the `arrow-rs` crate, which is already a transitive dependency of several pgrx-ecosystem crates. Parquet files are written with snappy compression by default; `delta_export_compression` GUC allows `gzip`, `zstd`, or `none`. Each file is named `_.parquet` so files sort chronologically and are easily consumed by downstream processes. #### F-2: DuckLake Sink Output Mode Add a `sink => 'ducklake'` parameter to `pgtrickle.create_stream_table()` and `pgtrickle.alter_stream_table()`. When this mode is active, the refresh worker: 1. Runs the differential SQL to produce the Δ result set. 2. Serialises the delta to Parquet (F-4). 3. Uploads the Parquet file to object storage. 4. Opens a DuckLake catalog transaction and records the new data file. 5. Commits the DuckLake snapshot, making the results visible to all DuckLake clients atomically. The sink is append-only by default (stream tables produce deltas, and DuckLake accumulates them). A `ducklake_sink_mode = 'append' | 'replace'` option is provided for users who want to overwrite the full result on each refresh cycle instead of accumulating deltas. #### S3 / Object-Store Upload Integration A thin upload layer wraps the `object_store` crate (already in the Rust ecosystem) to support S3, GCS, Azure Blob, and local-filesystem paths. The upload layer: - Retries on transient network errors (up to a configurable limit). - Uses multipart upload for files above a configurable threshold (default 5 MB). - Respects the AWS environment-variable credential chain and PostgreSQL secrets stored via `pg_trickle.ducklake_s3_access_key` / `ducklake_s3_secret_key` GUCs (values stored encrypted via `pg_trickle.encryption_key`). #### DuckLake Catalog Transaction Writer The catalog writer opens a PostgreSQL transaction, inserts a row into `ducklake_data_file`, updates `ducklake_table_stats`, and inserts a new row in `ducklake_snapshot`. It uses Spi::connect() with a short-lived connection — consistent with existing pg_trickle catalog access patterns — and rolls back cleanly if the S3 upload fails so that the catalog never references a file that does not exist. #### F-9: Encryption Key Pass-Through If the DuckLake catalog has `ducklake_data_file.encryption_key` set for existing files, the writer must generate a fresh Parquet encryption key for each new file it writes, store that key in `ducklake_data_file.encryption_key` using the same format DuckLake itself uses, and apply the key during Parquet serialisation. This is a small but essential feature: many production lakes are encrypted from day one, and a sink that cannot write to them is not useful. ### Tests | Item | Effort | |------|--------| | E2E tests: DuckLake sink (write path) | 1 week | | Unit tests: catalog transaction writer rollback on S3 failure | 2 days | The E2E tests stand up a full stack (PostgreSQL + DuckLake catalog + mock S3 via `MinIO`) and verify: - Parquet files appear on object storage after each refresh cycle. - DuckLake `SELECT` from the sink table returns the same rows as a direct `SELECT` from the pg_trickle stream table. - A simulated S3 failure rolls back the DuckLake catalog transaction and does not leave dangling snapshot records. - Encrypted-lake mode round-trips correctly. --- ## Release Gate This release does **not** ship until: 1. The E2E DuckLake sink tests pass with a live MinIO target. 2. A catalog-rollback test proves that a failed S3 upload leaves zero stale `ducklake_snapshot` or `ducklake_data_file` rows. 3. `just lint` passes with zero warnings. 4. `just test-e2e` passes. --- ## What Comes Next [v0.67.0](v0.67.0.md) — DuckLake Phase 3b: view registration, snapshot provenance, pg-tide pipeline tutorial, and the tutorials and demos that make the sink story accessible to a broad audience.