# Preparing a Dataset

This document walks through the full process of preparing a new non-synthetic dataset for benchmarking. The high-level steps are:

1. Load source data into S3
2. Create a config and sample the data at each size you need
3. Convert the sampled parquet data to CSV

## Step 1: Load Source Data into S3

Upload your source data to S3 as partitioned parquet files. Each table must live in its own subdirectory under a `source/parquet/` path; the file names inside each subdirectory don't matter:

```text
s3://paradedb-ci-benchmark/datasets/{dataset-name}/source/parquet/
├── {table_a}/
│   ├── part-0001.parquet
│   ├── part-0002.parquet
│   └── ...
├── {table_b}/
│   └── ...
└── {table_c}/
    └── ...
```

For example, for the Stack Overflow dataset:

```text
s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/
├── stackoverflow_posts/
├── comments/
└── users/
```
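
If the parquet files already sit in a local directory with one subdirectory per table, the upload could look roughly like the sketch below. It assumes the AWS CLI is installed and configured; the function name `upload_source` and the local directory are hypothetical and not part of this repository.

```bash
# Hypothetical helper: mirror a local directory of per-table parquet folders
# into the dataset's source/parquet/ prefix.
# Usage: upload_source <dataset-name> <local-dir>
upload_source() {
  local dataset="$1"
  local local_dir="$2"
  local dest="s3://paradedb-ci-benchmark/datasets/${dataset}/source/parquet/"

  # Copy only *.parquet files; the per-table subdirectories are preserved.
  aws s3 sync "${local_dir}" "${dest}" --exclude "*" --include "*.parquet"
}
```

For example, `upload_source stackoverflow ./export/stackoverflow` would mirror `./export/stackoverflow/{table}/` into the layout shown above.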

## Step 2: Create a Config and Sample

### Writing the Config

Create a TOML config file at `datasets/{dataset-name}/config.toml` that describes your table relationships. The config specifies which table is the root (the one that gets sampled directly) and how child tables relate to it via joins.

```toml
root_table = "root_table_name"
sampling_seed = 723               # Fixed seed for deterministic results

[[tables]]
name = "root_table_name"

[[tables]]
name = "child_table"
parent = "root_table_name"
parent_join_col = "id"            # Column in the parent table
join_col = "parent_id"            # Corresponding column in the child table
```

Fields:

- `root_table` -- The primary table. The `--rows` argument controls how many rows are sampled from this table.
- `sampling_seed` -- Seed for deterministic, reproducible sampling.
- `[[tables]]` -- One entry per table. The root table has no `parent`. Child tables specify `parent`, `parent_join_col`, and `join_col` to define the relationship.

Child tables are not sampled independently. Instead, they are filtered via an inner join with their parent table, so only rows that reference a sampled parent row are kept. This preserves referential integrity across the dataset.
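
To make the filtering effect concrete, here is a minimal sketch on plain text files. It is not how the tool works internally (the tool joins parquet tables); the function `filter_children` and the file layout are assumptions made for illustration only.

```bash
# Keep only child rows whose join column matches an id of a sampled parent.
# Usage: filter_children <parent-ids-file> <child-csv> <join-col-index>
#   parent-ids-file: one sampled parent id per line
#   child-csv:       comma-separated rows without quoted fields
filter_children() {
  awk -F',' -v col="$3" '
    FILENAME == ARGV[1] { keep[$1]; next }  # first file: remember sampled parent ids
    $col in keep                            # second file: print rows with a kept parent
  ' "$1" "$2"
}
```

Child rows that point at a parent outside the sample are dropped, which is why every surviving `join_col` value still has a matching parent row.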

Tables can form a hierarchy: a child table can itself be the parent of another table. Tables are processed in topological order, so every parent is handled before its children.
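
If you want to check by hand which order a hierarchy implies, the standard `tsort` utility computes a topological order from parent/child pairs. The table names below are placeholders and `processing_order` is a hypothetical wrapper, not part of the tool.

```bash
# Print tables so that every parent appears before its children.
# Input on stdin: one "parent child" pair per line.
processing_order() {
  tsort
}

# table_a is the root, table_b joins to table_a, table_c joins to table_b.
printf '%s\n' 'table_a table_b' 'table_b table_c' | processing_order
```

For a simple chain like this one, the only valid order is `table_a`, `table_b`, `table_c`. With several children under one parent, siblings may appear in any order.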

See `datasets/stackoverflow/config.toml` for a real example.

### Running the Sampling Tool

Run the `sample` command once for each dataset size you need. The `--rows` argument sets the target row count for the root table. By convention, write the output of each run to the dataset's `sampled/{size}/parquet/` path.

```bash
# Sample to 10k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 10000

# Sample to 100k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 100000

# Sample to 1m rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 1000000
```
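
The three invocations above differ only in the size label and the `--rows` value, so they can be wrapped in a loop. The following is a sketch under that assumption; `sample_all_sizes` is a hypothetical helper name, and the size list is simply the one used in this document.

```bash
# Run `sample` once per size for a dataset.
# Usage: sample_all_sizes <dataset-name>
sample_all_sizes() {
  local dataset="$1"
  local base="s3://paradedb-ci-benchmark/datasets/${dataset}"
  local spec size rows

  for spec in 10k:10000 100k:100000 1m:1000000; do
    size="${spec%%:*}"   # label used in the output path, e.g. 10k
    rows="${spec##*:}"   # value passed to --rows, e.g. 10000

    cargo run --release -- sample \
      --input "${base}/source/parquet/" \
      --output "${base}/sampled/${size}/parquet/" \
      --config "./datasets/${dataset}/config.toml" \
      --rows "${rows}" || return 1   # stop at the first failing size
  done
}
```

Calling `sample_all_sizes stackoverflow` issues the same three commands as the block above.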

Notes:

- The output path must be empty (no pre-existing data).
- For small targets (<=100k rows), the tool uses reservoir sampling, so the root table ends up with exactly the requested number of rows. For larger targets, it uses system sampling, so the resulting row count is approximate (within roughly 3-5% of the target).
- Use `--dry-run` to validate inputs and see planned row counts without writing anything.
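
As background on why reservoir sampling can hit the target exactly, here is a minimal sketch of the classic single-pass algorithm (Algorithm R) on text lines. It is an illustration only, not the tool's implementation; `reservoir_sample` is a hypothetical name.

```bash
# Pick exactly k lines uniformly at random from stdin in a single pass.
# Usage: reservoir_sample <k> <seed> < input
reservoir_sample() {
  awk -v k="$1" -v seed="$2" '
    BEGIN { srand(seed) }                  # fixed seed for repeatable picks
    NR <= k { pool[NR] = $0; next }        # fill the reservoir with the first k lines
    {
      j = int(rand() * NR) + 1             # uniform integer in 1..NR
      if (j <= k) pool[j] = $0             # replace a slot with probability k/NR
    }
    END {
      n = (NR < k) ? NR : k                # fewer than k input lines: return them all
      for (i = 1; i <= n; i++) print pool[i]
    }
  '
}
```

For instance, `seq 1 1000 | reservoir_sample 10 723` always prints exactly 10 lines, and repeating the call with the same seed and the same awk implementation repeats the same picks.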

## Step 3: Convert Sampled Data to CSV

Run the `convert` command for each sampled size to produce CSV versions. The `--tables` flag takes a comma-separated list of all tables to convert.

```bash
# Convert 10k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 100k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 1m sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/csv/ \
  --tables stackoverflow_posts,comments,users
```

You can also convert the full source data:

```bash
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/source/csv/ \
  --tables stackoverflow_posts,comments,users
```

Notes:

- The output path must be empty.
- Row counts are verified after conversion to ensure no data is lost.
- Use `--dry-run` to validate without writing.
- AWS credentials must be accessible via the standard credential chain (env vars, `~/.aws/credentials`, or instance metadata).
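
Putting the notes together, one possible workflow is to validate each size with `--dry-run` first and only then run the real conversion. The sketch below assumes `--dry-run` is passed alongside the other `convert` flags; `convert_all_sizes` is a hypothetical helper name.

```bash
# Dry-run, then convert, each sampled size of a dataset.
# Usage: convert_all_sizes <dataset-name> <comma-separated-tables>
convert_all_sizes() {
  local dataset="$1"
  local tables="$2"
  local base="s3://paradedb-ci-benchmark/datasets/${dataset}/sampled"
  local size

  for size in 10k 100k 1m; do
    # Validate inputs first; nothing is written in this pass.
    cargo run --release -- convert \
      --input "${base}/${size}/parquet/" \
      --output "${base}/${size}/csv/" \
      --tables "${tables}" \
      --dry-run || return 1

    # Real conversion; the output path must still be empty.
    cargo run --release -- convert \
      --input "${base}/${size}/parquet/" \
      --output "${base}/${size}/csv/" \
      --tables "${tables}" || return 1
  done
}
```

For the Stack Overflow dataset this would be called as `convert_all_sizes stackoverflow stackoverflow_posts,comments,users`.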

## Final S3 Layout

After completing all three steps, including the optional conversion of the full source data, your dataset prefix will look like this:

```text
s3://paradedb-ci-benchmark/datasets/{dataset-name}/
├── source/
│   ├── parquet/
│   │   ├── {table_a}/
│   │   ├── {table_b}/
│   │   └── {table_c}/
│   └── csv/
│       ├── {table_a}/
│       ├── {table_b}/
│       └── {table_c}/
└── sampled/
    ├── 10k/
    │   ├── parquet/
    │   │   ├── {table_a}/
    │   │   ├── {table_b}/
    │   │   └── {table_c}/
    │   └── csv/
    │       ├── {table_a}/
    │       ├── {table_b}/
    │       └── {table_c}/
    ├── 100k/
    │   └── ...
    └── 1m/
        └── ...
```
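
To check a finished dataset against this layout, it can help to generate the list of expected prefixes and inspect each one. The helper below is a sketch; `expected_prefixes` is a hypothetical name, and it assumes the full source data was converted to CSV as well.

```bash
# Print every table prefix the final layout should contain.
# Usage: expected_prefixes <dataset-name> <comma-separated-tables> "<sizes>"
expected_prefixes() {
  local dataset="$1"
  local tables sizes root size fmt table
  tables="$(printf '%s' "$2" | tr ',' ' ')"   # comma list -> space list
  sizes="$3"                                  # space-separated, e.g. "10k 100k 1m"
  root="s3://paradedb-ci-benchmark/datasets/${dataset}"

  for fmt in parquet csv; do
    for table in $tables; do
      echo "${root}/source/${fmt}/${table}/"
    done
  done

  for size in $sizes; do
    for fmt in parquet csv; do
      for table in $tables; do
        echo "${root}/sampled/${size}/${fmt}/${table}/"
      done
    done
  done
}
```

Each printed prefix can then be passed to `aws s3 ls` to confirm that it exists and contains files.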