# Weighted Statistics PostgreSQL Extension

High-performance PostgreSQL extension providing weighted statistical functions optimized for sparse data (automatic handling when `sum(weights) < 1.0`).

## Quick Start

```bash
git clone https://github.com/your-repo/weighted_statistics
cd weighted_statistics
make clean && make && sudo make install
psql -c "CREATE EXTENSION weighted_statistics;"
```

**Available Functions:**

- `weighted_mean(values[], weights[])` - Weighted mean
- `weighted_variance(values[], weights[], ddof)` - Weighted variance (ddof: 0=population, 1=sample)
- `weighted_std(values[], weights[], ddof)` - Weighted standard deviation
- `weighted_quantile(values[], weights[], quantiles[])` - Empirical CDF quantiles
- `wquantile(values[], weights[], quantiles[])` - Type 7 (Hyndman-Fan) quantiles
- `whdquantile(values[], weights[], quantiles[])` - Harrell-Davis quantiles
- `weighted_median(values[], weights[])` - 50th-percentile shortcut for the empirical CDF method

```sql
-- Basic usage examples
SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
-- Result: 2.3

SELECT weighted_quantile(ARRAY[10.0, 20.0, 30.0], ARRAY[0.3, 0.4, 0.3], ARRAY[0.25, 0.5, 0.75]);
-- Result: {15.0, 20.0, 25.0}
```

## Key Features

- **Up to 14x faster** than PL/pgSQL implementations
- **Sparse data handling**: automatically adds an implicit zero when `sum(weights) < 1.0`
- **Mathematically validated**: 100% accuracy against the Python reference implementations

**Requirements**: PostgreSQL 12+, development headers (`postgresql-server-dev-*`), C compiler

## Usage Examples

```sql
-- Multiple statistics in one query
WITH example_data AS (
    SELECT array_agg(val) AS vals, array_agg(weight) AS weights
    FROM risk_events
)
SELECT
    weighted_mean(vals, weights) AS expected_vals,
    weighted_std(vals, weights, 1) AS vals_volatility,
    weighted_quantile(vals, weights, ARRAY[0.05, 0.95]) AS vals_bounds
FROM example_data;

-- Sparse data handling
-- These are equivalent:
SELECT weighted_mean(ARRAY[10, 20, 0, 0], ARRAY[0.2,
  0.3, 0.25, 0.25]);
SELECT weighted_mean(ARRAY[10, 20], ARRAY[0.2, 0.3]);  -- Auto-adds 0.5 weight of zeros
-- Both return: 8.0
```

## Function Comparison

| Function            | Method                   | Best For                            |
| ------------------- | ------------------------ | ----------------------------------- |
| `weighted_quantile` | Empirical CDF            | General use, fast computation       |
| `wquantile`         | Type 7 (R/NumPy default) | Standard statistical analysis       |
| `whdquantile`       | Harrell-Davis            | Smooth estimates, light-tailed data |

## Testing & Validation

Multi-tier testing ensures both mathematical correctness and behavioral consistency:

```bash
# Mathematical validation (against the Python reference)
cd reference && python validate_against_reference.py --database test_db

# Regression testing
make installcheck
# OR alternatively:
./test/run_tests.sh

# Performance benchmarking
./benchmark/run_benchmark.sh
```

**Validation Results**: 100% accuracy against the Python reference implementations, which follow the methods of Akinshin (2023).

## Performance

Benchmarked results (5-iteration averages) show significant performance advantages:

- **Up to 14x faster** than optimized PL/pgSQL (varies by function and array size)
- **Sub-millisecond to millisecond execution**: from 0.06 ms (quantiles, 1K elements) to 18.77 ms (mean, 100K elements in PL/pgSQL)
- **Quantiles show the biggest gains**: consistently 12-14x faster than PL/pgSQL across all sizes
- **Mean functions scale differently**: on par at 1K elements, 4x faster at 100K elements
- **Relative cost of quantile methods**: empirical CDF is fastest; Type 7 is ~1.7x slower; Harrell-Davis is 25-33x slower

For more details, see [benchmark/README.md](benchmark/README.md).
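To make the sparse-weight convention concrete, here is a minimal Python sketch in the spirit of the `reference/` implementations. It assumes the behavior described above (missing weight mass is treated as implicit zero values, i.e. the denominator is effectively `max(sum(weights), 1.0)`); the extension's exact C implementation may differ in details such as validation and numeric types.

```python
def weighted_mean(values, weights):
    """Weighted mean with the sparse-data convention described in this README.

    If the weights sum to less than 1.0, the missing mass is treated as
    implicit zero values: they add nothing to the numerator but keep the
    denominator at 1.0.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    # Implicit zeros only affect the total mass, never the weighted sum.
    return weighted_sum / max(total_weight, 1.0)

# Mirrors the SQL examples above:
print(weighted_mean([1.0, 2.0, 3.0], [0.2, 0.3, 0.5]))  # ≈ 2.3
print(weighted_mean([10.0, 20.0], [0.2, 0.3]))          # ≈ 8.0 (0.5 implicit zero mass)
```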
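For intuition about the empirical-CDF approach, the following Python sketch inverts the step-function empirical CDF: it returns the smallest value whose cumulative weight reaches the requested quantile. This is a deliberate simplification, not the extension's algorithm — `weighted_quantile` interpolates between steps (hence the `{15, 20, 25}` result in the Quick Start example), so the two only agree at the jump points.

```python
def weighted_quantile_step(values, weights, quantiles):
    """Simplified weighted quantiles via the step-function empirical CDF.

    For each quantile q, returns the smallest value whose cumulative
    weight is at least q * sum(weights). No interpolation is performed.
    """
    pairs = sorted(zip(values, weights))  # sort by value, carrying weights
    total = sum(weights)
    results = []
    for q in quantiles:
        target = q * total
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight
            if cumulative >= target:
                results.append(value)
                break
    return results

# Same inputs as the Quick Start example; without interpolation the
# step CDF snaps each quantile to an observed value:
print(weighted_quantile_step([10.0, 20.0, 30.0], [0.3, 0.4, 0.3], [0.25, 0.5, 0.75]))
```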
## Development & Testing

```bash
# Build
make clean && make && sudo make install

# Test mathematical correctness
python reference/validate_against_reference.py

# Run regression tests
make installcheck

# Benchmark performance
./benchmark/run_benchmark.sh
```

**Troubleshooting**: Install the development headers with `sudo apt-get install postgresql-server-dev-$(pg_config --version | grep -oP '\d+' | head -1)` (the `head -1` keeps only the major version, since `grep -o` prints every numeric match on its own line)

## Disclaimer

The C implementation and PostgreSQL extension setup were generated with AI assistance, based on the self-written Python reference implementations in the `reference/` directory. These Python implementations follow the weighted statistics methods described in Akinshin (2023).

All functions have been rigorously tested against the Python reference implementations to ensure mathematical correctness. The validation harness at `reference/validate_against_reference.py` compares results with a tight tolerance (typically 1e-10) across diverse test cases.

---

**Reference**: Akinshin, A. (2023). Weighted quantile estimators. *arXiv preprint*.

Production-ready PostgreSQL extension: 7 mathematically validated weighted statistics functions with high performance and automatic sparse data handling.
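As a closing illustration of the `ddof` parameter from the function list, here is a hedged Python sketch of weighted variance in the spirit of the `reference/` implementations. The `ddof=1` correction shown is the reliability-weights (unbiased) convention, `V1 - V2/V1`; this is an assumption for illustration — the extension's exact sample correction may differ, and sparse-weight handling is omitted for brevity.

```python
def weighted_variance(values, weights, ddof=0):
    """Weighted variance with a ddof switch (illustrative sketch only).

    ddof=0: population variance, sum(w * (x - mean)^2) / sum(w).
    ddof=1: reliability-weights unbiased estimate, dividing by
            V1 - V2/V1 where V1 = sum(w) and V2 = sum(w^2).
    """
    v1 = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / v1
    squared_dev = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    if ddof == 0:
        return squared_dev / v1
    v2 = sum(w * w for w in weights)
    return squared_dev / (v1 - v2 / v1)

# Population variance of the Quick Start example data (mean is 2.3):
print(weighted_variance([1.0, 2.0, 3.0], [0.2, 0.3, 0.5], ddof=0))  # ≈ 0.61
```

Note that the reliability-weights correction stays well defined for normalized weights (where `sum(w) == 1.0` would make a naive `sum(w) - 1` denominator vanish).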