# RStats Type Documentation ## Overview The `RStats` type implements numerically stable running statistics using Welford's algorithm. This document describes the design decisions, implementation details, and rationale for the sentinel value approach used to represent empty state. ## Sentinel Value Design ### Design Choice: Zero Sentinels The `RStats` type uses **count=0 with all other fields set to 0.0** as the canonical empty state representation. ### Rationale #### Why Count=0 is Primary - **Unambiguous**: count=0 is the natural definition of "no data accumulated" - **Simple**: Single boolean check `(count == 0)` for emptiness - **No conflicts**: Cannot collide with legitimate statistics (any real data has count >= 1) #### Why All Fields Must Be Zero When Empty While count=0 alone is sufficient to identify empty state, we enforce that all other fields (mean, m2, min, max) must also be 0.0 when count=0. This provides: 1. **Canonical Representation** - Empty state has exactly one valid serialization: `(count:0,mean:0,min:0,max:0,stddev:0)` - Enables byte-for-byte comparison of empty values - Simplifies debugging and testing 2. **Clean Display** - Text output shows intuitive zeros: `(count:0,mean:0,min:0,max:0,stddev:0)` - No confusing sentinel values like -1 that might suggest data 3. **Corruption Detection** - If count=0 but other fields are non-zero, we know data is corrupted - Catches bugs in serialization/deserialization - Detects memory corruption or improper initialization 4. **Consistency Across Formats** - Binary and text formats produce identical empty representations - No format-specific edge cases ### Implementation: Defense in Depth The canonical empty state (count=0, all zeros) is enforced at four validation points: #### 1. Text Input Validation (`rstats_in`) **Purpose**: Reject malformed text input from users or external systems **When**: Parsing SQL input like `SELECT '(count:0,mean:1,...)'::rstats` **Error Code**: `ERRCODE_INVALID_TEXT_REPRESENTATION` #### 2. Binary Input Validation (`rstats_recv`) **Purpose**: Catch corruption in network protocol or disk storage **When**: Receiving binary data from client or reading from disk **Error Code**: `ERRCODE_DATA_CORRUPTED` #### 3. Binary Output Validation (`rstats_send`) **Purpose**: Catch bugs in C code that corrupts RStats in memory **When**: Before sending binary data to client **Error Code**: `ERRCODE_DATA_CORRUPTED` **Note**: Includes hint that this is an internal bug #### 4. Runtime Emptiness Check (`rstats_is_empty`) **Purpose**: Guard any code path checking if statistics are empty **When**: Called by `rstats_add_value()` to detect lazy initialization **Error Code**: `ERRCODE_DATA_CORRUPTED` ### Trade-offs and Limitations #### Cannot Distinguish Empty vs. Init(0) The current design cannot distinguish between: - Truly empty statistics: `rstats_set_empty()` - Statistics initialized with single zero value: `rstats_init_internal(&stats, 0.0)` Both result in: count=1 for init(0), but count=0 for empty. **This is acceptable because:** 1. The count field correctly reflects the difference (0 vs. 1) 2. Both are valid states with clear semantics 3. The distinction is not operationally important 4. User can check count if they need to distinguish ### Memory Layout ```c typedef struct RStats { int64 count; /* 8 bytes - Primary empty indicator */ double mean; /* 8 bytes - Must be 0.0 when count=0 */ double m2; /* 8 bytes - Must be 0.0 when count=0 */ double min; /* 8 bytes - Must be 0.0 when count=0 */ double max; /* 8 bytes - Must be 0.0 when count=0 */ } RStats; /* Total: 40 bytes, fixed-size, no varlena header */ ``` ### Performance Considerations The sentinel validation has minimal performance impact: - **Validation Cost**: 4 floating-point equality comparisons - **Compiler Optimization**: Often optimized to SIMD comparison - **Zero Overhead in Happy Path**: If count > 0, skip validation entirely - **Only on Boundaries**: Validation only at I/O and emptiness checks, not during computation ### Compatibility and Migration #### Future Changes If sentinel approach needs modification in the future: 1. Update this documentation with rationale 2. Increment type version (create `rstats--0.1--0.2.sql` migration script) 3. Add compatibility layer if needed 4. Update all four validation points consistently ## RStats Operators and Functions This section documents the available operators and their semantics. ### Type Casts #### Numeric to RStats Implicit casts from numeric types initialize RStats with a single value: ```sql -- From double precision SELECT 42.5::rstats; -- Result: (count:1,mean:42.5,min:42.5,max:42.5,stddev:0) -- From integer SELECT 100::rstats; -- Result: (count:1,mean:100,min:100,max:100,stddev:0) -- From numeric SELECT 3.14159::numeric::rstats; -- Result: (count:1,mean:3.14159,min:3.14159,max:3.14159,stddev:0) ``` **Semantics**: Creates RStats initialized with a single data point. #### Binary Serialization Casts Assignment casts for binary serialization (backup/restore): ```sql -- RStats to bytea (serialization) SELECT rstats()::bytea; -- Result: \x0000000000000000... -- Bytea to RStats (deserialization) SELECT '\x0000000000000000...'::bytea::rstats; -- Result: (count:0,mean:0,min:0,max:0,stddev:0) -- Round-trip preserves state SELECT (10::rstats + 20 + 30)::bytea::rstats; -- Result: (count:3,mean:20,min:10,max:30,stddev:10) ``` **Semantics**: Enables binary backup/restore and external storage. ### Addition Operator (+) Accumulates a new value into running statistics: ```sql -- Add double precision SELECT 10.0::rstats + 20.0 + 30.0; -- Result: (count:3,mean:20,min:10,max:30,stddev:10) -- Add integer SELECT 5::rstats + 10 + 15; -- Result: (count:3,mean:10,min:5,max:15,stddev:5) -- Chain multiple additions SELECT ((10.0::rstats + 20.0) + 30.0) + 40.0; -- Result: (count:4,mean:25,min:10,max:40,stddev:12.91) ``` **Semantics**: - Updates mean, stddev (via m2), min, max using Welford's algorithm - Order-dependent: accumulates values sequentially - Deterministic: same sequence → same result - Handles NULL gracefully: `stats + NULL` returns NULL ### Equality Operator (=) Tests for exact identity (bit-identical state): ```sql -- Exact match: same sequence SELECT (10::rstats + 20 + 30) = (10::rstats + 20 + 30); -- Result: true -- Different sequence, same statistics SELECT (10::rstats + 20) = (15::rstats + 15); -- Result: false (mean is same, but internal state differs) -- Empty state comparison SELECT rstats() = rstats(); -- Result: true ``` **Semantics**: - **Identity, not numerical equivalence**: Tests if two RStats accumulated the exact same sequence of values - **Deterministic algorithm**: Welford's algorithm guarantees same inputs → bit-identical state - **Exact float comparison**: No epsilon tolerance (by design) - **Use cases**: Caching, deduplication, identity checks in hash tables **For numerical similarity**, compare extracted statistics: ```sql -- Check if means are approximately equal SELECT ABS((stats1 -> 'mean') - (stats2 -> 'mean')) < 0.001; ``` ### Distance Operator (<->) Calculates the Mahalanobis distance between two statistical distributions: ```sql -- Compare two distributions SELECT (10::rstats + 20 + 30) <-> (15::rstats + 25 + 35); -- Result: small value (similar distributions) SELECT (10::rstats + 20 + 30) <-> (100::rstats + 200 + 300); -- Result: larger value (dissimilar distributions) -- Find queries with statistics most different from a reference SELECT queryid, stats <-> reference_stats AS distance FROM pg_track_optimizer, (SELECT 0::rstats + 1 + 2 AS reference_stats) ref ORDER BY distance DESC; ``` **Semantics**: - Returns `double precision` representing statistical distance - Lower values indicate more similar distributions - Commutative: `a <-> b` equals `b <-> a` - Useful for clustering queries by statistical similarity or detecting outliers ### Field Accessor Operator (->) Extracts statistical properties as double precision: ```sql SELECT stats -> 'count' AS count, stats -> 'mean' AS mean, stats -> 'stddev' AS stddev, stats -> 'min' AS min, stats -> 'max' AS max FROM (SELECT (10::rstats + 20 + 30) AS stats) t; ``` **Available Fields**: - `count`: Number of accumulated values (int64 → float8) - `mean`: Arithmetic mean - `stddev`: Sample standard deviation (n-1 denominator) - `min`: Minimum value observed - `max`: Maximum value observed **Semantics**: - Returns `double precision` (cast as needed) - Stddev uses sample formula: `sqrt(m2 / (count - 1))` - Returns 0 stddev for count ≤ 1 - Works with expression indexes: `CREATE INDEX ON table ((stats -> 'mean'))` ### Constructor Functions #### Empty Constructor ```sql SELECT rstats(); -- Result: (count:0,mean:0,min:0,max:0,stddev:0) ``` **Semantics**: Creates canonical empty state. #### Polymorphic Constructor ```sql -- Accepts any numeric-convertible type SELECT rstats(42::int2); SELECT rstats(3.14::float4); SELECT rstats(100::bigint); ``` **Semantics**: Initializes with a single value (delegates to type-specific casts). ### Aggregate Function #### rstats_agg(double precision) Aggregates multiple values into a single RStats object: ```sql -- Aggregate values from a table column SELECT rstats_agg(value) FROM measurements; -- Result: (count:N,mean:...,min:...,max:...,stddev:...) -- Aggregate with grouping SELECT category, rstats_agg(price) AS price_stats FROM products GROUP BY category; -- Combine with field accessor SELECT category, rstats_agg(price) -> 'mean' AS avg_price, rstats_agg(price) -> 'stddev' AS price_stddev FROM products GROUP BY category; ``` **Semantics**: - Collects all non-NULL values in the group into running statistics - Returns canonical empty state if no values are aggregated - Uses Welford's algorithm internally for numerical stability - Order of aggregation may affect floating-point rounding (typically negligible) ### References - **Welford's Algorithm**: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm - **PostgreSQL Type System**: https://www.postgresql.org/docs/current/xtypes.html - **Error Codes**: https://www.postgresql.org/docs/current/errcodes-appendix.html --- *Last Updated: 2026-01-10* *Author: Andrei Lepikhov* *Reviewers: Claude Code Review*