--- layout: default title: Huge-Table Compaction nav_order: 16 --- # Spec: Huge-Table Compaction Operating Model Status: implemented first pass Risk tier: CAUTION Primary goal: make the disk-space and locking contract explicit for large `sorted_heap` deployments. ## Problem `sorted_heap` gets its read performance from physical locality plus zone maps. Eventually, write-heavy tables need maintenance to restore sorted order and fresh pruning metadata. The maintenance verbs are intentionally rewrite-based: - `sorted_heap_compact(regclass)` rewrites the relation in globally sorted PK order and rebuilds the zone map. - `sorted_heap_merge(regclass)` detects the already-sorted prefix, sorts the unsorted tail, and writes a new relation. - online variants reduce blocking time with trigger-based change capture, but they still build a replacement relation before the final swap. That means compaction is not an in-place defragmenter. Operators need temporary disk headroom for the rewrite unit. ## Current Contracts ### Concrete table operations | Operation | Rewrite unit | Main lock shape | Disk headroom | |-----------|--------------|-----------------|---------------| | `sorted_heap_compact` | whole relation | `AccessExclusiveLock` for operation | replacement relation + rebuilt indexes | | `sorted_heap_merge` | whole relation | `AccessExclusiveLock` for operation | replacement relation + rebuilt indexes | | `sorted_heap_compact_online` | whole relation | long copy under weaker lock, brief final `AccessExclusiveLock` | replacement relation + log table + rebuilt indexes | | `sorted_heap_merge_online` | whole relation | same online shape as compact | replacement relation + log table + rebuilt indexes | `sorted_heap_merge` may be faster than full compact when the sorted prefix is large, but it still produces a new relation. It is a CPU/sort optimization, not a disk-headroom optimization. ### Partitioned parent operations Partition helpers make the leaf the rewrite unit: ```sql SELECT * FROM sorted_heap_compact_partitions('events_parent'::regclass); SELECT * FROM sorted_heap_merge_partitions('events_parent'::regclass); ``` Contract: - recurse through concrete leaves; - preflight unsupported leaves and sorted_heap leaves without primary keys; - process supported leaves one at a time; - return one result row per processed or skipped leaf; - reduce temporary disk headroom from "whole logical parent" to "current leaf"; - do not provide one global all-leaves transaction. If a later leaf fails, earlier leaves remain processed. This is intentional: the helper is an operational wrapper over concrete relation maintenance, not a distributed transaction protocol. ## Recommended Huge-Table Strategy Use declarative partitioning to bound the maintenance unit. Typical partition keys: - time range for event/time-series tables; - tenant / knowledge-base / shard id for multi-tenant retrieval; - coarse lifecycle segment for hot/cold data. Recommended flow: 1. Keep each leaf at a size where one rewrite fits operational headroom. 2. Bulk load or ingest into leaves. 3. Run `sorted_heap_compact_partitions(...)` or `sorted_heap_merge_partitions(...)` during maintenance windows. 4. Before the run, inspect `sorted_heap_partition_maintenance_plan(parent, 'compact')` or `sorted_heap_partition_maintenance_plan(parent, 'merge')` to see all blockers, the per-leaf rewrite headroom estimate, and the tablespace that should be checked by external free-space monitoring. 5. Inspect state with `sorted_heap_partition_status(parent)`. 6. For live systems, prefer online concrete operations on the hot leaf when blocking time matters more than total runtime. ## Sizing Heuristic For a concrete leaf, reserve enough free space for: ```text replacement heap relation + rebuilt secondary indexes + transient online log table, for online variants + normal PostgreSQL WAL/checkpoint headroom ``` This spec does not give a universal percentage because row width, index count, TOAST use, WAL settings, and filesystem behavior dominate. The actionable rule is simpler: make partitions small enough that one leaf rewrite is safe. ## Non-Goals - No promise of in-place compaction. - No segment-level rewrite inside one relation yet. - No global all-partition atomic maintenance. - No automatic repartitioning or partition-size advisor. ## Future Segment-Level Compaction Segment-level compaction could reduce disk headroom inside a single large relation, but it needs a separate storage design: - crash-safe mapping from old segment pages to new segment pages; - index TID update or indirection semantics; - zone-map and sorted-prefix updates that remain valid across partial rewrite; - WAL/recovery story for interrupted segment swaps; - planner/executor behavior while old and new segments coexist. Until those are specified and tested, partition-scoped rewrite is the supported large-table operating model. ## Acceptance Tests Already covered: - `SH23` verifies partition parent helpers, nested leaves, unsupported leaves, and no-PK leaves. - `make test-partition-lock` verifies a held leaf lock blocks parent compact and that the helper succeeds after release. - Existing dump/restore, TOAST, crash, and concurrent online tests cover the underlying concrete rewrite paths. Future tests: - benchmark partitioned maintenance over multiple differently sized leaves; - verify failure reporting when one later leaf cannot acquire lock; - continue validating the dry-run estimate against operator feedback; the SQL plan now reports tablespace identity, while actual free bytes remain an external filesystem/platform metric. - attach/detach/default-partition lifecycle regression is covered by SH23-9. ## Quadrumvirate Notes Cassandra: - Likely failure mode: users infer "compact" means in-place. The docs must say rewrite-and-swap plainly. - Likely failure mode: online compact is mistaken for lower disk usage. It lowers blocking, not rewrite headroom. Daedalus: - Reframe from "how do we compact a huge table" to "how do we bound the rewrite unit". Partitioning is the current answer. Maieutic: - Assumption: one logical table must be one physical rewrite unit. Refuted by PostgreSQL declarative partitioning and leaf-scoped helpers. - Assumption: segment-level compaction is a small extension of merge. It is not; it changes crash recovery and index/TID semantics. Adversary: - Any future in-place/segment claim must specify index updates and crash recovery before implementation. - Any operator-facing helper must expose partial failures explicitly.