[← Back to Blog Index](https://trickle-labs.github.io/pg-trickle/blog/) | [Documentation](https://trickle-labs.github.io/pg-trickle/) # Foreign Tables as Stream Table Sources ## IVM over data that lives in another database — or in S3 --- Your data isn't all in one PostgreSQL database. Some of it is in another PostgreSQL instance across the network (`postgres_fdw`). Some of it is in Parquet files on S3 (`parquet_fdw`). Some of it comes from a CSV feed (`file_fdw`). Can you create a stream table that aggregates across these sources? Yes — with caveats. pg_trickle supports foreign tables as stream table sources. The CDC mechanism is different (no triggers on foreign tables, so polling-based detection is used), and the performance characteristics are different (every change detection requires a full scan of the foreign table). But it works, and for small-to-medium foreign tables, it works well. --- ## How It Works Regular source tables use trigger-based CDC: a row-level trigger fires on every DML and writes the change to a buffer table. Foreign tables can't have triggers (in most FDW implementations), so pg_trickle uses a different approach: **Polling-based change detection:** 1. On each refresh cycle, pg_trickle reads the current contents of the foreign table. 2. It compares the current contents with the last known state (using content hashing). 3. Rows that are new, changed, or deleted are identified and written to the change buffer. 4. The delta query proceeds as normal. Step 2 is the expensive part. It requires a full scan of the foreign table. For a 1-million-row foreign table, this means a full network round-trip and comparison on every refresh cycle. --- ## Setting Up ```sql -- Foreign server to another PostgreSQL instance CREATE SERVER remote_analytics FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'analytics-db', dbname 'analytics', port '5432'); CREATE USER MAPPING FOR CURRENT_USER SERVER remote_analytics OPTIONS (user 'reader', password 'secret'); -- Foreign table CREATE FOREIGN TABLE remote_products ( id INT, name TEXT, category TEXT, price NUMERIC ) SERVER remote_analytics OPTIONS (table_name 'products'); -- Stream table using the foreign table SELECT pgtrickle.create_stream_table( name => 'product_summary', query => $$ SELECT rp.category, COUNT(*) AS product_count, AVG(rp.price) AS avg_price, MIN(rp.price) AS min_price FROM remote_products rp GROUP BY rp.category $$, schedule => '30s', cdc_mode => 'trigger' -- only trigger mode works with FDW ); ``` **Note:** You must use `cdc_mode => 'trigger'` (or `'auto'` which starts with triggers). WAL-based CDC doesn't work with foreign tables because foreign table DML doesn't produce local WAL records. pg_trickle detects that `remote_products` is a foreign table and automatically switches to polling-based change detection for that source. --- ## Mixed Sources: Foreign + Local Stream tables can reference both foreign and local tables: ```sql SELECT pgtrickle.create_stream_table( name => 'order_product_summary', query => $$ SELECT o.customer_id, rp.category, SUM(o.total) AS total_spent, COUNT(*) AS order_count FROM orders o JOIN remote_products rp ON rp.id = o.product_id GROUP BY o.customer_id, rp.category $$, schedule => '10s' ); ``` Here: - `orders` is a local table → trigger-based CDC (fast, per-row). - `remote_products` is foreign → polling-based detection (full scan). When `orders` changes, the delta is computed using only the changed orders (trigger CDC). When `remote_products` changes, pg_trickle detects the change via polling and recomputes the affected groups. The practical effect: changes to local tables are reflected in 10 seconds (the schedule). Changes to foreign tables are also reflected in 10 seconds, but each check requires a full scan of the foreign table. --- ## File-Based FDWs `file_fdw` reads from CSV/TSV files on the server's filesystem: ```sql CREATE FOREIGN TABLE exchange_rates ( currency TEXT, rate NUMERIC, effective_date DATE ) SERVER file_server OPTIONS (filename '/data/exchange_rates.csv', format 'csv'); SELECT pgtrickle.create_stream_table( name => 'revenue_in_usd', query => $$ SELECT o.customer_id, SUM(o.total * er.rate) AS revenue_usd FROM orders o JOIN exchange_rates er ON er.currency = o.currency AND er.effective_date = CURRENT_DATE GROUP BY o.customer_id $$, schedule => '1m' ); ``` When the CSV file is updated (new exchange rates), pg_trickle detects the change on the next polling cycle and recomputes the affected aggregates. **Caveat:** `file_fdw` doesn't support transactions. If the file is updated while pg_trickle is reading it, you might get inconsistent results. Use atomic file replacement (write to a temp file, then `mv`) to avoid this. --- ## Parquet and S3 With `parquet_fdw` or `parquet_s3_fdw`: ```sql CREATE FOREIGN TABLE s3_events ( event_id BIGINT, event_type TEXT, payload JSONB, created_at TIMESTAMP ) SERVER parquet_s3 OPTIONS ( filename 's3://my-bucket/events/*.parquet' ); SELECT pgtrickle.create_stream_table( name => 'event_type_counts', query => $$ SELECT event_type, COUNT(*) AS cnt FROM s3_events GROUP BY event_type $$, schedule => '5m' ); ``` This brings S3 data into pg_trickle's IVM pipeline. The full scan of S3 data happens every 5 minutes (the schedule), and only changes are propagated to the stream table. **Performance note:** S3 reads are slow compared to local tables. Set a longer schedule (minutes, not seconds) to amortize the scan cost. --- ## Performance Characteristics | Source Type | CDC Method | Per-Cycle Cost | Best Schedule | |-------------|-----------|----------------|---------------| | Local table (with PK) | Trigger | O(delta) | 1s–10s | | Local table (no PK) | Trigger + content hash | O(delta) | 1s–10s | | Foreign table (postgres_fdw) | Polling | O(foreign table size) | 10s–5m | | Foreign table (file_fdw) | Polling | O(file size) | 1m–1h | | Foreign table (parquet_fdw/S3) | Polling | O(S3 read latency + data size) | 5m–1h | The fundamental limitation: foreign table change detection requires a full scan because there's no trigger or WAL mechanism to capture individual changes. The schedule should be set long enough that the scan cost is acceptable. --- ## Optimization: Materialize First For large foreign tables or frequent refreshes, consider materializing the foreign data into a local table first: ```sql -- Local copy of foreign data CREATE TABLE local_products AS SELECT * FROM remote_products; -- Periodic sync (pg_cron) SELECT cron.schedule('sync-products', '*/5 * * * *', $$ TRUNCATE local_products; INSERT INTO local_products SELECT * FROM remote_products; $$); -- Stream table uses local copy (trigger CDC, fast) SELECT pgtrickle.create_stream_table( name => 'product_summary', query => $$ SELECT ... FROM local_products ... $$, schedule => '5s' ); ``` This separates the foreign-table sync (every 5 minutes, full scan) from the stream table refresh (every 5 seconds, trigger-based delta). The stream table gets fast incremental maintenance; the foreign data sync happens on a longer cadence. --- ## Summary Foreign tables work as stream table sources with polling-based change detection. Every refresh cycle requires a full scan of the foreign table to detect changes. This is inherently slower than trigger-based CDC but enables IVM over data that lives outside PostgreSQL. Use foreign tables directly when: - The foreign table is small (<100K rows) - The refresh schedule is long enough to amortize the scan cost - Simplicity matters more than optimal performance Materialize into a local table first when: - The foreign table is large - You need sub-second refresh latency - The foreign table changes infrequently relative to local tables Either way, pg_trickle handles the rest. Foreign or local, the delta rules are the same. --- ## DuckLake Sources [DuckLake](https://ducklake.select/) (v1.0, April 2026) is a lakehouse format that stores its entire catalog in PostgreSQL. When you run DuckLake on the same PostgreSQL instance as pg_trickle, DuckLake tables become a first-class source for stream tables. ### The Bridge-Table Pattern (Works Today) DuckLake's "data inlining" feature writes small DuckDB inserts directly into regular PostgreSQL tables. For large writes, DuckDB stores rows in Parquet on S3 and records file metadata in the PostgreSQL catalog. The bridge-table pattern works for both: ```sql -- Create a bridge table that receives DuckLake data -- (populated by DuckDB writes via the inlined-data path, -- or via a periodic postgres_query() sync from DuckDB) CREATE TABLE events_bridge ( event_id BIGINT PRIMARY KEY, user_id INT, product_id INT, event_type TEXT, revenue_usd NUMERIC(10,2), occurred_at TIMESTAMPTZ ); -- pg_trickle watches events_bridge with trigger-based CDC (sub-millisecond) SELECT pgtrickle.create_stream_table( name => 'revenue_by_minute', query => $$ SELECT date_trunc('minute', occurred_at) AS minute, product_id, SUM(revenue_usd) AS total_revenue, COUNT(*) AS purchase_count FROM events_bridge WHERE event_type = 'purchase' GROUP BY date_trunc('minute', occurred_at), product_id $$, schedule => '5s', refresh_mode => 'DIFFERENTIAL' ); ``` ### DuckLake Metadata Tables as Sources DuckLake's ~28 metadata tables live in PostgreSQL and are excellent stream table sources for operational monitoring. These tables record every snapshot, every data file, every schema change, and every compaction event: ```sql -- Stream table: small-file count per DuckLake table (compaction alerts) SELECT pgtrickle.create_stream_table( name => 'ducklake_small_file_counts', query => $$ SELECT dt.schema_name, dt.table_name, COUNT(*) AS small_file_count, SUM(df.file_size_bytes) AS total_small_file_bytes FROM ducklake_data_file df JOIN ducklake_table dt ON dt.table_id = df.table_id WHERE df.file_size_bytes < 10 * 1024 * 1024 AND df.deleted_snapshot_id IS NULL GROUP BY dt.schema_name, dt.table_name $$, schedule => '1m', refresh_mode => 'DIFFERENTIAL' ); ``` See [Monitoring Your DuckLake with pg_trickle](ducklake-monitoring.md) for the full set of monitoring stream tables. ### Foreign-Table Path (Alternative for Large Tables) For large DuckLake tables where the full Parquet content must be read, use the `duckdb_fdw` or `parquet_fdw` path with polling-based CDC: ```sql -- Install a Parquet-aware FDW CREATE EXTENSION parquet_fdw; CREATE SERVER duckdb_server FOREIGN DATA WRAPPER parquet_fdw; -- Map the DuckLake-managed Parquet files as a foreign table CREATE FOREIGN TABLE ducklake_events ( user_id INT, event_type TEXT, revenue_usd NUMERIC(10,2), occurred_at TIMESTAMPTZ ) SERVER duckdb_server OPTIONS (filename 's3://my-lake/data/*.parquet'); -- Enable foreign-table polling SET pg_trickle.foreign_table_polling = on; SELECT pgtrickle.create_stream_table( name => 'events_per_user', query => $$ SELECT user_id, COUNT(*) AS event_count FROM ducklake_events GROUP BY user_id $$, schedule => '30s', refresh_mode => 'DIFFERENTIAL' ); ``` **Performance note:** The foreign-table path requires a full Parquet scan on each refresh cycle. For frequently updated DuckLake tables, prefer the bridge-table or DuckLake inlined-data paths. ### What's Coming: Native DuckLake Change-Feed Adapter DuckLake provides a `table_changes(table, from_snapshot, to_snapshot)` function that returns a signed-multiset delta stream — the exact format pg_trickle's DVM engine consumes. A native adapter (planned for v0.65.0) will use this function to eliminate polling entirely, enabling true O(Δ) incremental maintenance over large DuckLake tables without any full scans. See [DuckLake's `table_changes()` Meets pg_trickle's DVM Engine](ducklake-table-changes-dvm.md) for the technical deep-dive. --- *DuckLake tutorials:* - [Real-Time Dashboards on Your Data Lake](ducklake-real-time-dashboards.md) - [The Modern Data Stack in One Box](ducklake-modern-data-stack.md) - [Monitoring Your DuckLake with pg_trickle](ducklake-monitoring.md)