json_fdw ======== This PostgreSQL extension implements a Foreign Data Wrapper (FDW) for JSON files. The extension doesn't require any data to be loaded into the database, and supports analytic queries against array types, nested fields, and heterogeneous documents. json\_fdw currently only works with PostgreSQL 9.2, and uses YAJL to parse JSON files. Future releases of this wrapper will use the JSON parser functions that are to going to be introduced in the PostgreSQL 9.3 release. Building -------- json\_fdw depends on yajl-2.0 for parsing, and zlib-devel to read compressed files. So we need to install these packages first: ## Fedora 17+ sudo yum install zlib-devel yajl-devel ## Ubuntu 12.10+ sudo apt-get update sudo apt-get install zlib1g-dev libyajl-dev ## Other Linux Distributions (First install zlib-devel, cmake, and ruby) wget http://github.com/lloyd/yajl/tarball/2.0.1 -O yajl-2.0.1.tar.gz tar -xzvf yajl-2.0.1.tar.gz cd lloyd-yajl-f4b2b1a ./configure make sudo make install echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/libyajl.conf sudo ldconfig Once you have yajl-2.0 and zlib installed on your machine, you are ready to build json\_fdw. For this, you need to include the pg\_config directory path in your make command. This path is typically the same as your PostgreSQL installation's bin/ directory path. For example: PATH=/usr/local/pgsql/bin/:$PATH make sudo PATH=/usr/local/pgsql/bin/:$PATH make install **Note**: In RedHat 5.X and CentOS 5.X you may need to edit the Makefile and change "-l:libyajl.so.2" to "-lyajl". Usage ----- These two parameters can be set on a JSON foreign table object. * filename: The absolute path of a json file or a gzipped json file. * max\_error\_count: Maximum number of invalid json documents to skip before erroring out. Defaults to 0. As an example, we demonstrate querying a compressed JSON file from scratch here. We note that the underlying file contains JSON documents separated by newlines, and that no data needs to be loaded into the database. Let's now start with downloading the file. wget http://examples.citusdata.com/customer_reviews_nested_1998.json.gz Next, let's log into Postgres, and run the following commands to create a foreign table associated with this JSON file. -- load extension first time after install CREATE EXTENSION json_fdw; -- create server object CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw; -- create foreign table CREATE FOREIGN TABLE customer_reviews ( customer_id TEXT, "review.date" DATE, "review.rating" INTEGER, "product.id" CHAR(10), "product.group" TEXT, "product.title" TEXT, "product.similar_ids" CHAR(10)[] ) SERVER json_server OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz'); -- optionally, collect data distribution statistics ANALYZE customer_reviews; Finally, let's run some example SQL queries on your JSON file. -- find all reviews a particular customer made on the Dune series in 1998 SELECT customer_id, "review.rating", "product.id", "product.title" FROM customer_reviews WHERE customer_id ='A27T7HVDXA3K2A' AND "product.title" LIKE '%Dune%' AND "review.date" >= '1998-01-01' AND "review.date" <= '1998-12-31'; -- do we have a correlation between a book's title's length and its review ratings? SELECT width_bucket(length("product.title"), 1, 50, 5) title_length_bucket, round(avg("review.rating"), 2) AS review_average, count(*) FROM customer_reviews WHERE "product.group" = 'Book' GROUP BY title_length_bucket ORDER BY title_length_bucket; Table Schema Conventions ------------------------ There are three things worth noting about table schemas. First, nested fields in JSON documents are referenced using dot separators. For example, a field defined as "review": { "rating" : 5 } in a JSON document is declared as "review.rating" in the foreign table schema. The quotes around "review.rating" are necessary, as identifiers that include dots aren't valid in Postgres otherwise. Second, the foreign table schema is defined at read-time. If you have an additional field that you'd like to query, such as "review.votes", you can simply add the column name and start querying for data. You can even create multiple table schemas for the same underlying file, and query through them. Third, json\_fdw assumes that underlying data can be heterogeneous. If you are querying for a column, and this field doesn't exist in a document, or the field's data type doesn't match the declared column type, json\_fdw considers that particular field to be null. Querying Multiple Files ----------------------- json\_fdw borrows its semantics from file\_fdw, and associates one foreign table with one JSON file. If you'd like to query all your JSON files from one table, you could take one of two approaches. You could either use PostgreSQL's basic table partitioning feature, and manually create one child table per JSON file. Alternatively, you could use CitusDB binaries, and "stage" data into a distributed foreign table. With this approach, you can also have the database automatically collect statistics about the underlying data, and apply query optimizations such as partition pruning. For more info, please see our documentation page at [http://citusdata.com/docs/foreign-data](http://citusdata.com/docs/foreign-data) , or contact us at engage @ citusdata.com. Limitations ----------- * json\_fdw only supports files that consist of one JSON document per line. It doesn't support objects that span multiple lines. * PostgreSQL limits column names to 63 characters by default. If you need column names that are longer, you can increase the NAMEDATALEN constant in src/include/pg\_config\_manual.h, compile, and reinstall. Copyright --------- Copyright (c) 2013 Citus Data, Inc. This module is free software; you can redistribute it and/or modify it under the GNU GPL v3.0 License. For all types of questions and comments about the wrapper, please contact us at engage @ citusdata.com.