# pgPDF: `pdf` type for Postgres

[![build](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml/badge.svg)](https://github.com/Florents-Tselai/pgpdf/actions/workflows/build.yml)
![GitHub Repo stars](https://img.shields.io/github/stars/Florents-Tselai/pgpdf)
<a href="https://hub.docker.com/repository/docker/florents/pgpdf"><img alt="Docker Pulls" src="https://img.shields.io/docker/pulls/florents/pgpdf"></a>

This extension for PostgreSQL provides a `pdf` data type and assorted functions.

You can create a `pdf` type, by casting either a `text` filepath or `bytea` column.

```tsql
SELECT '/tmp/pgintro.pdf'::pdf;
```

```tsql
                                       pdf                                        
----------------------------------------------------------------------------------
 PostgreSQL Introduction                                                         +
 Digoal.Zhou                                                                     +
 7/20/2011Catalog                                                                +
  PostgreSQL Origin 
```

If you don’t have the PDF file in your filesystem,
but have already stored its content in a `bytea` column,
you can just cast it to `pdf`.

```tsql
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;
```

**Why?**: 
This allows you to work with PDFs in an ACID-compliant way.
The usual alternative relies on external scripts or services which can easily 
make your data ingestion pipeline brittle and leave your raw data out-of-sync.

The actual PDF parsing is done by [poppler](https://poppler.freedesktop.org).

Also check blog: 
- [Full Text Search on PDFs With Postgres](https://tselai.com/full-text-search-pdf-postgres)
- [pgpdf: pdf type for Postgres](https://tselai.com/pgpdf-pdf-type-postgres)

## Usage

Download some PDFs. 

```sh
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf
```

Create a table with a `pdf` column:

```tsql
CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');
```

Parsing and validation should happen automatically.
The files will be read from the disk only once!

> [!NOTE]
> The filepath should be accessible by the `postgres` process / user!
> That's different than the user running psql.
> If you don't understand what this means, as your DBA!

### String Functions and Operators

Standard Postgres [String Functions and Operators](https://www.postgresql.org/docs/17/functions-string.html)
should work as usual:

```tsql
SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
```

```tsql
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
```

``` tsql
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';
```

### Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a `pdf` file like normal text.

```tsql
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
```

```tsql
 ?column? 
----------
 t
(1 row)
```

```tsql
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
```

```tsql
 ?column? 
----------
 f
(1 row)
```

### Document similarity with `pg_trgm`

You can use [pg_trgm](https://postgresql.org/docs/17/interactive/pgtrgm.html)
to get the similarity between two documents:

```tsql
CREATE EXTENSION pg_trgm;

SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);
```

### Metadata

The following functions are available:

- `pdf_title(pdf) → text`
- `pdf_author(pdf) → text`
- `pdf_num_pages(pdf) → integer`

  Total number of pages in the document
- `pdf_page(pdf, integer) → text`

  Get the i-th page as text
- `pdf_creator(pdf) → text`
- `pdf_keywords(pdf) → text`
- `pdf_metadata(pdf) → text`
- `pdf_version(pdf) → text`
- `pdf_subject(pdf) → text`
- `pdf_creation(pdf) → timestamp`
- `pdf_modification(pdf) → timestamp`

```tsql
SELECT pdf_title('/tmp/pgintro.pdf');
```

```tsql
        pdf_title        
-------------------------
 PostgreSQL Introduction
(1 row)
```

```tsql
SELECT pdf_author('/tmp/pgintro.pdf');
```

```tsql
 pdf_author 
------------
 周正中
(1 row)
```

Getting a subset of pages

```tsql
SELECT pdf_num_pages('/tmp/pgintro.pdf');
```

```tsql
 pdf_num_pages 
---------------
            24
(1 row)
```

```tsql
SELECT pdf_page('/tmp/pgintro.pdf', 1);
```

```tsql
           pdf_page           
------------------------------
 Catalog                     +
  PostgreSQL Origin         +
  Layout                    +
  Features                  +
  Enterprise Class Attribute+
  Case
(1 row)
```

```tsql
SELECT pdf_subject('/tmp/pgintro.pdf');
```

```tsql
 pdf_subject 
-------------
 
(1 row)
```

```tsql
SELECT pdf_creation('/tmp/pgintro.pdf');
```

```tsql
       pdf_creation       
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)
```

```tsql
SELECT pdf_modification('/tmp/pgintro.pdf');
```

```tsql
     pdf_modification     
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)
```

```tsql
SELECT pdf_creator('/tmp/pgintro.pdf');
```

```tsql
            pdf_creator             
------------------------------------
 Microsoft® Office PowerPoint® 2007
(1 row)
```

```tsql
SELECT pdf_metadata('/tmp/pgintro.pdf');
```

```tsql
 pdf_metadata 
--------------
 
(1 row)
```

```tsql
SELECT pdf_version('/tmp/pgintro.pdf');
```

```tsql
 pdf_version 
-------------
 PDF-1.5
(1 row)
```

## Installation

Install [poppler](https://poppler.freedesktop.org) dependencies

**Linux**
```
sudo apt install -y libpoppler-glib-dev pkg-config
```

**Homebrew/MacOS**

```
brew install poppler pkgconf
```

```
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
```

After the installation, in a session:

```tsql
CREATE EXTENSION pgpdf;
```

### Docker

Get the [Docker image](https://hub.docker.com/r/florents/pgpdf) with:

```sh
docker pull florents/pgpdf:pg17
```

This adds pgpdf to the [Postgres image](https://hub.docker.com/_/postgres) (replace `17` with your Postgres server version, and run it the same way).

Run the image in a container.

```sh
docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17
```

Through another terminal, connect to the running server (container).

```sh
PGPASSWORD=pass psql -h localhost -p 5432 -U postgres
```

> [!WARNING]
> Reading arbitrary binary data (PDF) into your database can pose security risks.
> Only use this for files you trust.