# PLAN_ERGONOMICS — Developer & User Ergonomics Improvements ## Status **All five original tasks implemented** (2026-03-05). Three v0.4.0 ergonomic-hardening items implemented 2026-03-11. See git history for details. | Task | Status | |------|--------| | Task 1 — `'calculated'` keyword + default change | ✅ Done | | Task 2 — Lower `min_schedule_seconds` default to 1 | ✅ Done | | Task 3 — Add `default_schedule_seconds` GUC | ✅ Done | | Task 4 — Remove diamond GUCs | ✅ Done | | Task 5 — Add TOC to SQL_REFERENCE.md and CONFIGURATION.md | ✅ Done | | ERG-B — Warn at `_PG_init` when `cdc_mode='auto'` but `wal_level != 'logical'` | ✅ Done | | ERG-C — Warn at `create_stream_table` when source has no PRIMARY KEY | ✅ Done (pre-existing) | | ERG-F — Emit `WARNING` when `alter_stream_table` triggers implicit full refresh | ✅ Done | | ERG-D — Record manual `refresh_stream_table()` in pgt_refresh_history with `initiated_by='MANUAL'` | ✅ Done | | ERG-E — `pgtrickle.quick_health` view (single-row status summary) | ✅ Done | ### Remaining follow-up ✅ (completed 2026-03-14) - ✅ E2E tests for `'calculated'` as schedule input (verify `NULL` input is rejected) — ERG-T1 - ✅ E2E test that `SHOW pg_trickle.diamond_consistency` returns an error — ERG-T2 - ✅ Update CHANGELOG.md with breaking behavioural changes (default schedule, NULL rejection) — ERG-T5 - ✅ E2E test for ERG-B warning (assert WARNING appears in server log when `wal_level != logical`) — ERG-T4 - ✅ E2E test for ERG-F warning (assert WARNING in client output when `alter_stream_table` with `query =>`) — ERG-T3 --- ## Overview Five focused changes to make pg_trickle friendlier for new users while keeping full flexibility for production use. **No catalog schema migration required.** `'calculated'` is accepted/displayed as a parse-time alias for `NULL`; internal storage stays `NULL`. --- ## Decisions Made | Topic | Decision | |---|---| | `'calculated'` storage | Accepted/rejected at parse time; stored as `NULL` — no SQL migration needed | | `create_stream_table` default | Change from `'1m'` to `'calculated'` (breaking behavioral change; document in CHANGELOG) | | `default_schedule_seconds` default | `1` (matches new `min_schedule_seconds` default) | | Diamond GUC fallbacks | Hardcoded to `'atomic'` / `'fastest'` in Rust (EC-13 set `'atomic'` as correct default); per-table params in `create/alter_stream_table` are kept | --- ## Task 1 — Replace `NULL` with `'calculated'` as the schedule keyword **Files:** [src/api.rs](../src/api.rs), [docs/SQL_REFERENCE.md](../docs/SQL_REFERENCE.md) ### Background Currently, passing `schedule => NULL` to `create_stream_table` activates CALCULATED mode (schedule derived from downstream dependents). `NULL` is unintuitive — `'calculated'` is explicit and self-documenting. `NULL` is kept in catalog storage (no migration), but is no longer accepted as SQL input. The alter sentinel (no-change) stays as SQL `NULL` internally. ### Steps 1. **Input pre-processing** — In `create_stream_table` and `alter_stream_table`, before calling `parse_schedule` / `validate_schedule`, add a check: - If `schedule` input is `Some("calculated")` → convert to `None` (CALCULATED mode). - If `schedule` input is `None` (SQL `NULL`) on a **create** call → return error: *"use 'calculated' instead of NULL to set CALCULATED schedule"*. - `alter_stream_table`'s own sentinel for "no change" remains `NULL` internally and is handled before the pre-processing step. 2. **Change `create_stream_table` SQL default** — Change the `schedule` parameter default from `"'1m'"` to `"'calculated'"` (around line 35 of [src/api.rs](../src/api.rs)). 3. **Update docs** — In [docs/SQL_REFERENCE.md](../docs/SQL_REFERENCE.md), replace all `NULL` schedule examples with `'calculated'` and add a note that `NULL` is no longer valid as schedule input. --- ## Task 2 — Lower `pg_trickle.min_schedule_seconds` default to 1 **Files:** [src/config.rs](../src/config.rs), [docs/CONFIGURATION.md](../docs/CONFIGURATION.md) ### Background The current default of `60` is appropriate for production but makes local development and testing slow and awkward. New users hitting "minimum 60 seconds" immediately is a bad first experience. Production operators know what they are doing and can raise this explicitly. ### Steps 4. In [src/config.rs](../src/config.rs), change the default value of `PGS_MIN_SCHEDULE_SECONDS` from `60` to `1`. 5. Update [docs/CONFIGURATION.md](../docs/CONFIGURATION.md): change the documented default from `60` to `1`. --- ## Task 3 — Introduce `pg_trickle.default_schedule_seconds` GUC **Files:** [src/config.rs](../src/config.rs), [src/api.rs](../src/api.rs), [src/scheduler.rs](../src/scheduler.rs), [docs/CONFIGURATION.md](../docs/CONFIGURATION.md) ### Background `min_schedule_seconds` currently plays two distinct roles: 1. **Floor** (`validate_schedule`): rejects schedules shorter than this value. 2. **Default** (`resolve_calculated_schedule`): isolated CALCULATED stream tables (no downstream dependents) fall back to this as their effective refresh interval. These should be separate GUCs so each can be tuned independently. ### Current caller map | Location | Current role | |---|---| | `src/api.rs` ~L1786 `validate_schedule()` | Floor — keep using `min_schedule_seconds` | | `src/api.rs` ~L1321 `build_from_catalog()` | Default — switch to `default_schedule_seconds` | | `src/api.rs` ~L2247 `build_from_catalog()` | Default — switch to `default_schedule_seconds` | | `src/scheduler.rs` ~L395 `build_from_catalog()` | Default — switch to `default_schedule_seconds` | ### Steps 6. In [src/config.rs](../src/config.rs), add a new GUC static: ```rust static PGS_DEFAULT_SCHEDULE_SECONDS: GucSetting = GucSetting::::new(1); ``` - Default: `1` - Range: `1`–`86400` - Context: `SUSET` - Description: *"Default effective schedule (in seconds) for isolated CALCULATED stream tables that have no downstream dependents."* Register it alongside the existing GUCs. 7. Add accessor: ```rust pub fn pg_trickle_default_schedule_seconds() -> i32 { PGS_DEFAULT_SCHEDULE_SECONDS.get() } ``` 8. In [src/api.rs](../src/api.rs) at ~L1321 and ~L2247, replace: ```rust config::pg_trickle_min_schedule_seconds() ``` with: ```rust config::pg_trickle_default_schedule_seconds() ``` when passing `fallback_schedule_secs` to `StDag::build_from_catalog()`. 9. In [src/scheduler.rs](../src/scheduler.rs) at ~L395, same replacement. 10. The `validate_schedule()` call at ~L1786 continues to use `pg_trickle_min_schedule_seconds()` as the **floor** — no change there. 11. Update [docs/CONFIGURATION.md](../docs/CONFIGURATION.md): add a new GUC section for `pg_trickle.default_schedule_seconds`. --- ## Task 4 — Remove `pg_trickle.diamond_consistency` and `pg_trickle.diamond_schedule_policy` GUCs **Files:** [src/config.rs](../src/config.rs), [src/api.rs](../src/api.rs), [src/scheduler.rs](../src/scheduler.rs), [tests/e2e_diamond_tests.rs](../tests/e2e_diamond_tests.rs), [docs/CONFIGURATION.md](../docs/CONFIGURATION.md) ### Background The GUC defaults (`'none'` and `'fastest'`) are sensible for all practical use cases, already match the SQL column `DEFAULT` values in [src/lib.rs](../src/lib.rs), and are already the values asserted by the diamond E2E tests. Exposing them as settable GUCs adds API surface without meaningful benefit — users who need non-default values can specify them per stream table via `create_stream_table` / `alter_stream_table` parameters, which are **not** removed. ### Current GUC usage | Location | GUC | Replace with | |---|---|---| | `src/api.rs` ~L80 | `pg_trickle_diamond_consistency()` | hardcoded `"none"` | | `src/api.rs` ~L94 | `pg_trickle_diamond_schedule_policy()` | `DiamondSchedulePolicy::Fastest` | | `src/scheduler.rs` ~L679 | `pg_trickle_diamond_schedule_policy()` | `DiamondSchedulePolicy::Fastest` | ### Steps 12. In [src/config.rs](../src/config.rs), remove: - `PGS_DIAMOND_CONSISTENCY` and `PGS_DIAMOND_SCHEDULE_POLICY` statics - Their GUC registrations - Accessor functions `pg_trickle_diamond_consistency()` and `pg_trickle_diamond_schedule_policy()` 13. In [src/api.rs](../src/api.rs) at ~L80 and ~L94, replace the GUC accessor calls with hardcoded defaults: - `"none"` for `diamond_consistency` - `DiamondSchedulePolicy::Fastest` (or equivalent string `"fastest"`) for `diamond_schedule_policy` 14. In [src/scheduler.rs](../src/scheduler.rs) at ~L679, replace the GUC accessor call with the hardcoded `DiamondSchedulePolicy::Fastest` value. 15. The SQL column `DEFAULT 'none'` and `DEFAULT 'fastest'` in [src/lib.rs](../src/lib.rs) (~L119–122) already match — no change needed. 16. In [tests/e2e_diamond_tests.rs](../tests/e2e_diamond_tests.rs), scan for any tests that explicitly **set** these GUCs via `SET pg_trickle.diamond_consistency` or reference the GUC names, and update those to use per-table params instead. Tests that only check per-table default values (e.g. `test_diamond_consistency_default`, `test_diamond_schedule_policy_default`) should continue to pass unchanged. 17. Update [docs/CONFIGURATION.md](../docs/CONFIGURATION.md): - Remove the two GUC sections. - Update the "fifteen configuration variables" count in the Overview. --- ## Task 5 — Add table of contents to SQL_REFERENCE.md and CONFIGURATION.md **Files:** [docs/SQL_REFERENCE.md](../docs/SQL_REFERENCE.md), [docs/CONFIGURATION.md](../docs/CONFIGURATION.md) ### Steps 18. In [docs/SQL_REFERENCE.md](../docs/SQL_REFERENCE.md): add a markdown TOC after the `# SQL Reference` heading, linking to all `##` and `###` headings. 19. In [docs/CONFIGURATION.md](../docs/CONFIGURATION.md): add a markdown TOC after the `# Configuration` heading, linking to all `##` and `###` headings. The TOC should reflect the post-Task-4 state: diamond GUC sections removed, `default_schedule_seconds` section added. --- ## Verification ```bash just fmt && just lint # zero warnings required just test-unit # pure Rust tests just test-integration # Testcontainers tests just test-e2e # E2E including diamond tests ``` ### Manual smoke checks ```sql -- 'calculated' should be the default; no schedule argument needed SELECT pgtrickle.create_stream_table('t', 'SELECT 1 AS x', 'src', 'public'); -- Passing NULL should return an error SELECT pgtrickle.create_stream_table('t2', 'SELECT 1 AS x', 'src', 'public', schedule => NULL); -- Expected: ERROR: use 'calculated' instead of NULL to set CALCULATED schedule -- New GUC defaults SHOW pg_trickle.min_schedule_seconds; -- 1 SHOW pg_trickle.default_schedule_seconds; -- 1 -- Removed GUCs should error SHOW pg_trickle.diamond_consistency; -- ERROR: unrecognized configuration parameter SHOW pg_trickle.diamond_schedule_policy; -- ERROR: unrecognized configuration parameter ``` --- --- ## Proposals (not yet decided) The following improvements were identified during an ergonomics review. Each is described with enough context to make a decision. None are committed to implementation yet. --- ### Proposal A — Return a record from `create_stream_table` and `refresh_stream_table` **Priority:** High **Complexity:** Low **Type:** API addition (non-breaking if done as overload; breaking if current void signature changes) Currently both functions return `VOID`. To check the outcome, callers must issue a follow-up `SELECT` on the catalog. This pattern is repeated everywhere in the test suite and will be repeated in every operator script. Proposed return shapes: ```sql -- create_stream_table (pgt_id bigint, status text, created_at timestamptz) -- refresh_stream_table (status text, rows_inserted bigint, rows_deleted bigint, duration_ms float8) ``` **Decision factors:** - Does changing the return type break existing callers? Yes, if they use `SELECT pgtrickle.create_stream_table(...)` in a script that ignores the result — PostgreSQL will still succeed, but any code using `PERFORM` would need to be unchanged. - Should these be new overloaded functions or replace the existing ones? - Should `status` use a fixed enum (`'created'`, `'already_exists'`, `'skipped'`, `'error'`) or a free-form string? --- ### Proposal B — Warn at init if `cdc_mode = 'auto'` but `wal_level != 'logical'` **Priority:** High **Complexity:** Low **Type:** Safety / UX `pg_trickle.cdc_mode = 'auto'` is designed to automatically upgrade from TRIGGER to WAL-based CDC once PostgreSQL's `wal_level` is set to `logical`. On a standard PostgreSQL install, `wal_level` defaults to `replica`. In this state, `auto` silently stays in TRIGGER mode indefinitely — producing no error, no warning, and no visible status change. FAQ documentation acknowledges this but it is easy to miss. **Proposal:** At extension load (`_PG_init`), check: ```rust if cdc_mode == "auto" && wal_level != "logical" { pgrx::warning!("pg_trickle: cdc_mode='auto' but wal_level is '{}', not 'logical'. \ WAL CDC will not activate until wal_level is changed and PostgreSQL restarted.", wal_level); } ``` **Decision factors:** - Is it acceptable to emit a WARNING at every backend startup if this condition is true? Could be noisy in shared environments. - Alternative: emit the warning only via a dedicated health check function, not at init time. --- ### Proposal C — Warn on `create_stream_table` if source table has no PRIMARY KEY **Priority:** High **Complexity:** Low **Type:** Proactive safety check Stream tables over keyless sources silently mishandle duplicate rows: identical rows share the same internal `__pgt_row_id` and a duplicate INSERT may be treated as a no-op. This is documented as a known limitation under "Keyless Table Duplicate Row Limitation (G7.1)" in the FAQ, but users hit it without warning. **Proposal:** During `create_stream_table`, query `pg_constraint` for the source table(s). If none of the sources referenced in the query have a PRIMARY KEY, emit: ``` WARNING: source table "foo" has no PRIMARY KEY — duplicate rows will not be tracked correctly. Add a primary key or use a unique column as a row identity. ``` **Decision factors:** - Should this be a WARNING (non-fatal) or an ERROR (require explicit opt-in with `allow_keyless => true` parameter)? - The check only applies to the directly named source tables, not to derived sources (e.g., subqueries, CTEs). That limitation should be noted. - Cost: requires a catalog lookup per source table at creation time (cheap). --- ### Proposal D — Record manual `refresh_stream_table` calls in `pgt_refresh_history` **Priority:** Medium **Complexity:** Medium **Type:** Observability / consistency `pgt_refresh_history` only contains rows inserted by the background scheduler worker (`initiated_by = 'SCHEDULER'`). Manual calls to `pgtrickle.refresh_stream_table()` are not recorded. This means: - Operators auditing history miss manual refreshes entirely. - Graphs of refresh frequency are inaccurate. - Test assertions that check "were there N refreshes?" reach around the history table and read `data_timestamp` instead. **Proposal:** At the end of a successful `refresh_stream_table` call, insert a history row with `initiated_by = 'MANUAL'` and the same `duration_ms`, `rows_affected`, and `status` fields the scheduler already writes. **Decision factors:** - History table is currently append-only and retention is controlled by `pg_trickle.history_retention_days`. Manual records would share that policy and clean up automatically — no special handling needed. - Should error outcomes (failed refreshes) also be recorded? Currently even scheduler failures are recorded; consistency suggests yes. - Volume concern: a hot path calling `refresh_stream_table` in a loop would create many history rows. Acceptable for now given the append-only design. --- ### Proposal E — `pgtrickle.quick_health` convenience view **Priority:** Medium **Complexity:** Low **Type:** Monitoring / new SQL object The extension exposes 15+ monitoring functions with overlapping scopes. New operators have no obvious starting point for "is everything OK right now?". The closest existing function is `pgtrickle.health_check()`, but it returns multiple rows with varying structure and requires understanding which columns to focus on. **Proposed view:** ```sql CREATE VIEW pgtrickle.quick_health AS SELECT (SELECT COUNT(*) FROM pgtrickle.pgt_stream_tables) AS total_stream_tables, (SELECT COUNT(*) FROM pgtrickle.pgt_stream_tables WHERE consecutive_errors > 0) AS error_tables, (SELECT COUNT(*) FROM pgtrickle.pgt_stream_tables WHERE data_timestamp < now() - make_interval(secs => schedule_secs * 3)) AS stale_tables, pg_trickle_scheduler_running() AS scheduler_running, -- ... cdc health summary ... CASE WHEN error_tables = 0 AND scheduler_running THEN 'OK' ELSE 'DEGRADED' END AS status; ``` One row, one glance. If `status = 'OK'`, stop reading. **Decision factors:** - Should this be a VIEW (no parameters, always computes) or a function `quick_health()` returning a record? - A VIEW requires thinking about column stability across versions — harder to evolve. A function is easier to change. - Should it be in schema `pgtrickle` (user-facing) or `pgtrickle_internal`? - The "staleness" heuristic (3× schedule interval) needs a defined policy. --- ### Proposal F — Emit WARNING when `alter_stream_table` triggers a full refresh **Priority:** Medium **Complexity:** Low **Type:** Transparency / silent expensive operation When `alter_stream_table` changes the `refresh_mode`, definition query, or certain parameters that invalidate the current materialization, it automatically executes a full refresh (equivalent to truncating and repopulating the stream table). This can be expensive on large tables and currently only emits a `pgrx::info!()` log, which is suppressed at most log levels and never surfaced to the calling client. **Proposal:** Upgrade the message to `pgrx::warning!()` so it appears at the client's session regardless of server `log_min_messages`: ``` WARNING: pg_trickle: stream table "foo" refresh mode changed from INCREMENTAL to FULL; a full refresh was applied (N rows). This may take time on large tables. ``` Optionally also send a `NOTIFY pgtrickle_events, '{"event":"full_refresh","table":"foo"}'` for monitoring automation. **Decision factors:** - Is WARNING the right severity, or should it be NOTICE? NOTICE is always shown to the client; WARNING implies something requires attention. - Should a full NOTIFY channel be introduced (could be useful for Proposal E as well), or is the log message sufficient? - Are there cases where the full refresh is expected and the warning would be spurious (e.g., during initial setup scripts)? --- ### Proposal G — "Which monitoring function?" decision tree in docs **Priority:** Medium **Complexity:** Very Low **Type:** Documentation With 15+ monitoring functions in `pgtrickle`, new operators don't know which to call. Common questions and their current answers: | Question | Function needed | |---|---| | Is my stream table up to date? | `pgtrickle.get_staleness('name')` | | What failed and why? | `pgtrickle.get_refresh_history('name')` | | Is CDC capturing changes? | `pgtrickle.check_cdc_health()` | | Are triggers installed correctly? | `pgtrickle.trigger_inventory()` | | Is the scheduler running? | `pgtrickle.pg_stat_stream_tables` | | Overall system health? | `pgtrickle.health_check()` | **Proposal:** Add a "Monitoring Quick Reference" section near the top of [docs/SQL_REFERENCE.md](../docs/SQL_REFERENCE.md) (or as a dedicated `docs/MONITORING.md`) mapping common operational questions to the right function, with a one-line example for each. **Decision factors:** - Where should it live: a new `MONITORING.md`, a section in `SQL_REFERENCE.md`, or a separate `GETTING_STARTED` sub-section? - Should it be a prose decision tree, a table, or a flowchart (Mermaid)? --- ### Proposal H — "Special Cases & Known Limitations" section in SQL_REFERENCE.md **Priority:** Medium **Complexity:** Very Low **Type:** Documentation Several important behavioral edge cases are currently scattered through different documents: | Case | Currently documented in | |---|---| | TopK (ORDER BY + LIMIT rewrite) | FAQ + buried note in SQL_REFERENCE | | Keyless table duplicate rows | FAQ "Known Delta Computation Limitations" | | SELECT \* schema drift | GETTING_STARTED + FAQ | | JOIN key-change + delete edge case | Footnote deep in SQL_REFERENCE | | Direct DML blocked on stream tables | On-demand error only | Users who hit one of these problems must search across multiple documents. A dedicated section consolidating all special cases and limitations would reduce support burden. **Decision factors:** - Should this be a new top-level `## Special Cases` section within `SQL_REFERENCE.md`, or a standalone `docs/EDGE_CASES.md`? - Should existing references in FAQ/GETTING_STARTED be replaced or kept with cross-links? --- ### Proposal I — Improve cycle-detection error messages **Priority:** Low **Complexity:** Low **Type:** Error quality The current cycle detection error shows the cycle path but doesn't guide the user on how to resolve it: ``` ERROR: dependency cycle detected: foo → bar → foo ``` Users unfamiliar with DAG concepts won't know what to do. A better message: ``` ERROR: dependency cycle detected: foo → bar → foo HINT: Stream tables cannot form circular dependencies. To fix this, ensure each stream table only reads from base tables or stream tables that do not (directly or transitively) read from it. ``` **Decision factors:** - Should the HINT be added inline in the error string, or via a separate `pgrx::hint!()` call (which PostgreSQL displays as a "HINT:" line)? --- ### Proposal J — Emit NOTICE when a query is auto-rewritten **Priority:** Low **Complexity:** Medium **Type:** Transparency pg_trickle rewrites user queries in up to five passes (view inlining, CTE flattening, alias stripping, etc.) before storing them in the catalog. Users who query `pgt_stream_tables.definition` after creating a stream table see a different query from the one they wrote, with no explanation. **Proposal:** After the rewrite passes, if the stored definition differs from the input, emit a NOTICE: ``` NOTICE: query was rewritten for incremental maintenance. Original: SELECT v.x FROM my_view v WHERE v.y > 0 Stored: SELECT base.x FROM base_table base WHERE base.y > 0 ``` **Decision factors:** - How verbose should the notice be? Full before/after, or just "query was simplified"? - Rewriting is always intentional; would this notice confuse users into thinking something is wrong? - The rewrite is currently undone if refresh fails — the NOTICE should only fire after the definition is successfully committed. --- ## CHANGELOG entries (to add under unreleased) - **Breaking**: `create_stream_table` now defaults `schedule` to `'calculated'` instead of `'1m'`. Stream tables without an explicit schedule are now CALCULATED by default. - **Breaking**: Passing `schedule => NULL` to `create_stream_table` is now an error. Use `schedule => 'calculated'` instead. - **Breaking**: GUCs `pg_trickle.diamond_consistency` and `pg_trickle.diamond_schedule_policy` have been removed. Use per-table parameters in `create_stream_table` / `alter_stream_table` instead. - **Changed**: `pg_trickle.min_schedule_seconds` default lowered from `60` to `1` for better out-of-the-box developer experience. - **New**: `pg_trickle.default_schedule_seconds` GUC (default `1`) controls the effective refresh interval for isolated CALCULATED stream tables.