# Design & Historical Implementation Plan

This document preserves the initial phased implementation plan and design considerations for `pgproto`.

## 🏗️ Architecture (Historical)

### 1. Internal Storage

Protobuf messages are binary. We store them internally using a Postgres `varlena` (variable-length) structure.

```c
typedef struct {
    int32 vl_len_;                     /* varlena header: total size including this header; set via SET_VARSIZE */
    char  data[FLEXIBLE_ARRAY_MEMBER]; /* serialized Protobuf bytes */
} ProtobufData;
```

### 2. Schema Registry (Dynamic Reflection)

To interpret the fields in a binary blob, the extension needs the schema. We will use the **Schema-Registered** model.

1. **Registry Table:** A system table (or extension-owned table) will store `FileDescriptorSet` blobs generated by `protoc`.
2. **Caching (Session Memory):** To avoid parsing the schema on every row access, we will cache parsed descriptors in a backend-local hash table allocated in Postgres' `TopMemoryContext`, so they live for the duration of the session.

---

## 📅 Phased Implementation Plan

### Phase 0: Toolchain Setup (Docker)

Establish the development environment inside an isolated Docker container to avoid polluting the host machine.

- **Base Environment:** A `Dockerfile` based on the official `postgres:18` image (Latest Stable).
- **System Dependencies:** `build-essential`, `postgresql-server-dev-18`, `libprotobuf-c-dev`, `protobuf-c-compiler`.

### Phase 1: Varlena Infrastructure & Field-Tag Extraction

Establish the custom type and the C build environment.

- **Required Files:** `pgproto.control`, `Makefile` (PGXS), `pgproto--1.0.sql`, `pgproto.c`.
- **Internal Custom Type:** `protobuf`, backed by a varlena structure (`vl_len_` and `vl_dat`).
- **I/O Handlers:** `protobuf_in` and `protobuf_out`, using hex encoding for the text representation.
- **Target Functions:** `pb_get_int32(protobuf, tag_number)`.

### Phase 2: Schema Registry & Dynamic Reflection

Transition from hardcoded tag numbers to named query paths.

- **Schema Table:** `pb_schemas` storing `FileDescriptorSet` binary blobs.
- **Caching Architecture:** Cache parsed descriptors in a session-wide hash table (`TopMemoryContext`) to avoid re-parsing the schema on every row fetch.
- **Target Functions:** `pb_get_string(protobuf, 'schema_name.MessageName', 'field.subfield')`.

### Phase 3: Optimizations & Lazy Parsing

Improve the performance of reading large protobuf messages.

- **Core Logic:** Instead of fully deserializing each message, skip over the bytes of unrelated tags. Use `protobuf-c` pointer skipping or raw wire-format tag jumps.

### Phase 4: Query Polish (TOAST, Operators)

Improve developer ergonomics.

- **TOAST Support:** Mark storage as `extended` so Postgres can automatically compress large protobuf messages and move them out-of-line.
- **Operators:** Shorthand syntaxes like `protobuf -> 'field'` and `protobuf #> '{path,to_field}'`.

### Phase 5: Purge JSONB (Strict Native Purity)

The final objective: zero reliance on JSONB.

- **Removals:** Strip any `pb_to_jsonb` utilities or internal `jsonb` conversion pathways used as bridges.
- **Custom Indexing:** Implement direct indexing using custom C operator classes rather than relying on JSONB indexes.

---

## 💻 API Draft (Initial)

### Custom Types

- `protobuf`: The custom type for storing serialized bytes.

### Functions

- `pb_to_jsonb(protobuf, text schema_name)` returns `jsonb`
- `pb_get_string(protobuf, text schema_name, text path)` returns `text`
- `pb_get_int(protobuf, text schema_name, text path)` returns `int4`

### Operators

- `protobuf -> path` (shorthand for extraction).
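
As an illustration of the tag-based extraction in Phase 1 and the skip-ahead parsing in Phase 3, the following is a minimal sketch of scanning the raw Protobuf wire format for one varint field while jumping over everything else. It is standalone C with no Postgres headers. The names `read_varint` and `scan_int32_field` are hypothetical and are not taken from the `pgproto` sources, and deprecated group wire types are simply rejected here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one base-128 varint starting at buf[*pos].
 * Returns false if the input is truncated or longer than 10 bytes. */
static bool read_varint(const uint8_t *buf, size_t len, size_t *pos, uint64_t *out)
{
    uint64_t value = 0;

    for (int shift = 0; shift < 64; shift += 7) {
        if (*pos >= len)
            return false;
        uint8_t byte = buf[(*pos)++];
        value |= (uint64_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {   /* high bit clear: last byte of the varint */
            *out = value;
            return true;
        }
    }
    return false;
}

/* Hypothetical helper: find a varint-encoded field by its tag number.
 * Unrelated fields are skipped by wire type, never decoded.
 * Returns true and stores the value if the field is present;
 * returns false if it is absent or the message is malformed. */
bool scan_int32_field(const uint8_t *buf, size_t len,
                      uint32_t field_number, int32_t *result)
{
    size_t pos = 0;
    bool found = false;

    while (pos < len) {
        uint64_t key, value;

        if (!read_varint(buf, len, &pos, &key))
            return false;

        uint32_t wire_type = (uint32_t)(key & 0x7);
        uint64_t number = key >> 3;

        switch (wire_type) {
        case 0: /* varint */
            if (!read_varint(buf, len, &pos, &value))
                return false;
            if (number == field_number) {
                /* Truncate to 32 bits; the last occurrence wins. */
                *result = (int32_t)(uint32_t)value;
                found = true;
            }
            break;
        case 1: /* fixed 64-bit: skip 8 bytes */
            if (len - pos < 8)
                return false;
            pos += 8;
            break;
        case 2: /* length-delimited: skip the payload without parsing it */
            if (!read_varint(buf, len, &pos, &value) || value > len - pos)
                return false;
            pos += (size_t)value;
            break;
        case 5: /* fixed 32-bit: skip 4 bytes */
            if (len - pos < 4)
                return false;
            pos += 4;
            break;
        default: /* groups (3, 4) and unknown wire types are not handled */
            return false;
        }
    }
    return found;
}
```

Inside the extension, a function such as `pb_get_int32` would presumably pass the detoasted datum's payload to a scanner like this, for example via `VARDATA_ANY` and `VARSIZE_ANY_EXHDR`. That wiring is an assumption and is not shown here.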