# Weighted Statistics PostgreSQL Extension

High-performance PostgreSQL extension providing weighted statistical functions optimized for sparse data (automatic handling when `sum(weights) < 1.0`).

## Quick Start

```bash
git clone https://github.com/your-repo/weighted_statistics
cd weighted_statistics
make clean && make && sudo make install
psql -c "CREATE EXTENSION weighted_statistics;"
```

**Available Functions:**

- `weighted_mean(values[], weights[])` - Weighted mean
- `weighted_variance(values[], weights[], ddof)` - Weighted variance (ddof: 0=population, 1=sample)
- `weighted_std(values[], weights[], ddof)` - Weighted standard deviation
- `weighted_quantile(values[], weights[], quantiles[])` - Empirical CDF quantiles
- `wquantile(values[], weights[], quantiles[])` - Type 7 (Hyndman-Fan) quantiles
- `whdquantile(values[], weights[], quantiles[])` - Harrell-Davis quantiles
- `weighted_median(values[], weights[])` - 50th-percentile shortcut for the empirical CDF method

```sql
-- Basic usage examples
SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
-- Result: 2.3

SELECT weighted_quantile(ARRAY[10.0, 20.0, 30.0], ARRAY[0.3, 0.4, 0.3], ARRAY[0.25, 0.5, 0.75]);
-- Result: {15.0, 20.0, 25.0}
```

## Key Features

- **Up to 14x faster** than PL/pgSQL implementations
- **Sparse data handling**: automatically adds an implicit zero when `sum(weights) < 1.0`
- **Mathematically validated**: 100% accuracy against the Python reference implementations

**Requirements**: PostgreSQL 12+, development headers (`postgresql-server-dev-*`), C compiler

## Usage Examples

```sql
-- Multiple statistics in one query
WITH example_data AS (
    SELECT array_agg(val) AS vals, array_agg(weight) AS weights
    FROM risk_events
)
SELECT
    weighted_mean(vals, weights) AS expected_vals,
    weighted_std(vals, weights, 1) AS vals_volatility,
    weighted_quantile(vals, weights, ARRAY[0.05, 0.95]) AS vals_bounds
FROM example_data;

-- Sparse data handling
-- These are equivalent:
SELECT weighted_mean(ARRAY[10, 20, 0, 0], ARRAY[0.2,
  0.3, 0.25, 0.25]);
SELECT weighted_mean(ARRAY[10, 20], ARRAY[0.2, 0.3]);  -- Auto-adds 0.5 weight of zeros
-- Both return: 8.0
```

## Function Comparison

| Function            | Method                   | Best For                            |
| ------------------- | ------------------------ | ----------------------------------- |
| `weighted_quantile` | Empirical CDF            | General use, fast computation       |
| `wquantile`         | Type 7 (R/NumPy default) | Standard statistical analysis       |
| `whdquantile`       | Harrell-Davis            | Smooth estimates, light-tailed data |

## Testing & Validation

Multi-tier testing ensures both mathematical correctness and behavioral consistency:

```bash
# Mathematical validation (against the Python reference)
cd reference && python validate_against_reference.py --database test_db

# Regression testing
make installcheck
# OR alternatively:
./test/run_tests.sh

# Performance benchmarking
./benchmark/run_benchmark.sh
```

**Validation Results**: 100% accuracy against the Python reference implementations, which follow the methods of Akinshin (2023).

## Performance

Benchmarked results (5-iteration averages) show significant performance advantages:

- **Up to 14x faster** than optimized PL/pgSQL (varies by function and array size)
- **Sub-millisecond to millisecond execution**: from 0.06 ms (quantiles, 1K elements) to 18.77 ms (mean, 100K elements in PL/pgSQL)
- **Quantiles show the biggest gains**: consistently 12-14x faster than PL/pgSQL across all sizes
- **Mean functions scale differently**: on par at 1K elements, 4x faster at 100K elements
- **Relative cost of quantile methods**: empirical CDF is fastest; Type 7 is ~1.7x slower; Harrell-Davis is 25-33x slower

For more details, see [benchmark/README.md](benchmark/README.md).
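To make the sparse-weight convention concrete, here is a minimal Python sketch in the spirit of the `reference/` implementations. It assumes the behavior described above (missing weight mass is treated as implicit zero values, i.e. the denominator is effectively `max(sum(weights), 1.0)`); the extension's exact C implementation may differ in details such as validation and numeric types.

```python
def weighted_mean(values, weights):
    """Weighted mean with the sparse-data convention described in this README.

    If the weights sum to less than 1.0, the missing mass is treated as
    implicit zero values: they add nothing to the numerator but keep the
    denominator at 1.0.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    # Implicit zeros only affect the total mass, never the weighted sum.
    return weighted_sum / max(total_weight, 1.0)

# Mirrors the SQL examples above:
print(weighted_mean([1.0, 2.0, 3.0], [0.2, 0.3, 0.5]))  # ≈ 2.3
print(weighted_mean([10.0, 20.0], [0.2, 0.3]))          # ≈ 8.0 (0.5 implicit zero mass)
```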
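For intuition about the empirical-CDF approach, the following Python sketch inverts the step-function empirical CDF: it returns the smallest value whose cumulative weight reaches the requested quantile. This is a deliberate simplification, not the extension's algorithm — `weighted_quantile` interpolates between steps (hence the `{15, 20, 25}` result in the Quick Start example), so the two only agree at the jump points.

```python
def weighted_quantile_step(values, weights, quantiles):
    """Simplified weighted quantiles via the step-function empirical CDF.

    For each quantile q, returns the smallest value whose cumulative
    weight is at least q * sum(weights). No interpolation is performed.
    """
    pairs = sorted(zip(values, weights))  # sort by value, carrying weights
    total = sum(weights)
    results = []
    for q in quantiles:
        target = q * total
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight
            if cumulative >= target:
                results.append(value)
                break
    return results

# Same inputs as the Quick Start example; without interpolation the
# step CDF snaps each quantile to an observed value:
print(weighted_quantile_step([10.0, 20.0, 30.0], [0.3, 0.4, 0.3], [0.25, 0.5, 0.75]))
```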
## Development & Testing

```bash
# Build
make clean && make && sudo make install

# Test mathematical correctness
python reference/validate_against_reference.py

# Run regression tests
make installcheck

# Benchmark performance
./benchmark/run_benchmark.sh
```

**Troubleshooting**: Install the development headers with `sudo apt-get install postgresql-server-dev-$(pg_config --version | grep -oP '\d+' | head -1)` (the `head -1` keeps only the major version, since `grep -o` prints every numeric match on its own line)

## Disclaimer

The C implementation and PostgreSQL extension setup were generated with AI assistance, based on the self-written Python reference implementations in the `reference/` directory. These Python implementations follow the weighted statistics methods described in Akinshin (2023).

All functions have been rigorously tested against the Python reference implementations to ensure mathematical correctness. The validation harness at `reference/validate_against_reference.py` compares results with a tight tolerance (typically 1e-10) across diverse test cases.

---

**Reference**: Akinshin, A. (2023). Weighted quantile estimators. *arXiv preprint*.

Production-ready PostgreSQL extension: 7 mathematically validated weighted statistics functions with high performance and automatic sparse data handling.
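As a closing illustration of the `ddof` parameter from the function list, here is a hedged Python sketch of weighted variance in the spirit of the `reference/` implementations. The `ddof=1` correction shown is the reliability-weights (unbiased) convention, `V1 - V2/V1`; this is an assumption for illustration — the extension's exact sample correction may differ, and sparse-weight handling is omitted for brevity.

```python
def weighted_variance(values, weights, ddof=0):
    """Weighted variance with a ddof switch (illustrative sketch only).

    ddof=0: population variance, sum(w * (x - mean)^2) / sum(w).
    ddof=1: reliability-weights unbiased estimate, dividing by
            V1 - V2/V1 where V1 = sum(w) and V2 = sum(w^2).
    """
    v1 = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / v1
    squared_dev = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    if ddof == 0:
        return squared_dev / v1
    v2 = sum(w * w for w in weights)
    return squared_dev / (v1 - v2 / v1)

# Population variance of the Quick Start example data (mean is 2.3):
print(weighted_variance([1.0, 2.0, 3.0], [0.2, 0.3, 0.5], ddof=0))  # ≈ 0.61
```

Note that the reliability-weights correction stays well defined for normalized weights (where `sum(w) == 1.0` would make a naive `sum(w) - 1` denominator vanish).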