Case Study: Île-de-France Public Transit
==========================================
This case study, extending the scenario introduced in
:cite:`DBLP:journals/pvldb/SenellartJMR18`, applies ProvSQL to the
real-world GTFS dataset for Île-de-France public transit, demonstrating
Boolean provenance at scale for wheelchair accessibility reasoning.
The Scenario
------------
The `STIF GTFS dataset `_
describes hundreds of transit routes and tens of thousands of stops. You
want to answer: starting from **Bagneux** station (served by RER B and
several bus lines), which stops and routes are reachable, and is the
entire journey from Bagneux to each destination *fully*
wheelchair-accessible?
Boolean provenance answers the second question: a result token evaluates
to ``true`` if and only if *every* record along the path has the
wheelchair flag set.
.. warning::
This case study requires external data files. The dataset is **not**
bundled with ProvSQL due to its size (the compressed download is
several hundred megabytes). Download instructions are in
:ref:`stif-setup` below.
.. _stif-setup:
Setup
-----
**Download the GTFS data.** Obtain the Île-de-France GTFS archive from
`data.gouv.fr `_
(direct download: `IDFM-gtfs.zip `_).
Extract the archive; you will need the four text files
``routes.txt``, ``stops.txt``, ``trips.txt``, and ``stop_times.txt``.
**Load the schema and data.** Download
:download:`setup.sql <../../casestudy3/setup.sql>`
and run it from the directory containing the four GTFS files::
cd /path/to/gtfs-files
psql -d mydb -f /path/to/setup.sql
This creates four tables:
* ``routes`` – transit lines (RER A, B, M1, bus 91, …)
* ``stops`` – individual stop points with GPS coordinates and a
``wheelchair_boarding`` flag
* ``trips`` – individual scheduled journeys, each with a
``wheelchair_accessible`` flag
* ``stop_times`` – arrival and departure times at each stop for each trip
The script also adds provenance tracking and creates a combined
``wheelchair`` mapping table from both the trip and stop wheelchair columns.
.. note::
The setup script already creates the most important indexes
(on ``stop_id``, ``trip_id``, and parent station), which are
essential for acceptable performance on this large dataset.
Step 1: Explore the Database
-----------------------------
At the start of every session, set the search path so that ProvSQL functions
can be called without the ``provsql.`` prefix:
.. code-block:: postgresql
SET search_path TO public, provsql;
Inspect the four tables:
.. code-block:: postgresql
SELECT COUNT(*) FROM routes;
SELECT COUNT(*) FROM stops;
SELECT COUNT(*) FROM trips;
SELECT COUNT(*) FROM stop_times;
To find the stop IDs for Bagneux station and its platforms:
.. code-block:: postgresql
SELECT * FROM stops WHERE stop_name = 'Bagneux';
Step 2: Provenance and Wheelchair Mapping
------------------------------------------
Provenance has already been added by ``setup.sql``. The ``wheelchair``
mapping table combines ``wheelchair_accessible`` from ``trips`` and
``wheelchair_boarding`` from ``stops``. A result token evaluates to
``true`` (1) under :sqlfunc:`sr_boolean` if *every* contributing row
has its wheelchair column set to 1.
Inspect the mapping:
.. code-block:: postgresql
SELECT * FROM wheelchair LIMIT 10;
Step 3: Reachable Stops from Bagneux
-------------------------------------
Find all stops reachable from Bagneux on the same trip and later in the
sequence – in other words, stops you can reach by boarding a vehicle at
Bagneux without changing:
.. code-block:: postgresql
SELECT DISTINCT s2.stop_name, r2.route_long_name
FROM stops s0
JOIN stops s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
AND t1.stop_sequence < t2.stop_sequence
JOIN stops s2 ON s2.stop_id = t2.stop_id
JOIN trips u2 ON u2.trip_id = t2.trip_id
JOIN routes r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
ORDER BY r2.route_long_name, s2.stop_name;
This returns several dozen distinct (stop, route) pairs covering the reachable
network (the exact number depends on the GTFS dataset version).
Step 4: Boolean Provenance – Full Wheelchair Accessibility
----------------------------------------------------------
Add Boolean provenance evaluation to mark which results are fully
wheelchair-accessible along *every* leg. Because the query returns one
row per trip (each with its own provenance circuit), materialize the
result first and then aggregate per destination:
.. code-block:: postgresql
CREATE TEMP TABLE bagneux_b AS
SELECT s2.stop_name,
r2.route_long_name,
sr_boolean(provenance(), 'wheelchair') AS accessible
FROM stops s0
JOIN stops s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
AND t1.stop_sequence < t2.stop_sequence
JOIN stops s2 ON s2.stop_id = t2.stop_id
JOIN trips u2 ON u2.trip_id = t2.trip_id
JOIN routes r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux';
SELECT stop_name, route_long_name, bool_or(accessible) AS accessible
FROM bagneux_b
GROUP BY stop_name, route_long_name
ORDER BY route_long_name, stop_name;
:sqlfunc:`sr_boolean` evaluates the provenance token under the Boolean
semiring, looking up each leaf token in the ``wheelchair`` table.
A result of ``true`` means every record along *some* path from Bagneux
to that stop has the wheelchair flag set; ``false`` means no fully
accessible path exists.
Step 5: Inspect Individual Results with :sqlfunc:`sr_formula`
--------------------------------------------------------------
For a stop that is *not* fully accessible, use :sqlfunc:`sr_formula` to
identify which specific trip or stop is responsible. Here we inspect
the ``Paul Bert`` stop on route 391 as an example:
.. code-block:: postgresql
SELECT s2.stop_name,
sr_formula(provenance(), 'wheelchair') AS formula
FROM stops s0
JOIN stops s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
AND t1.stop_sequence < t2.stop_sequence
JOIN stops s2 ON s2.stop_id = t2.stop_id
JOIN trips u2 ON u2.trip_id = t2.trip_id
JOIN routes r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
AND r2.route_long_name = '391'
AND s2.stop_name = 'Paul Bert'
LIMIT 1;
The formula shows which token carries a ``0`` wheelchair value,
pinpointing the accessibility barrier. For example::
stop_name | formula
----------+-----------------
Paul Bert | (1 ⊗ 1 ⊗ 0 ⊗ 1)
The four factors correspond to the four provenance-enabled table
instances in the join: the Bagneux station record (``stops``, 1), its
platform record (``stops``, 1), the Paul Bert stop record
(``stops``, 0), and the trip record (``trips``, 1). The ``0`` on the
third factor pinpoints the specific Paul Bert stop served by route 391
as the accessibility barrier. Note that there are several stops named
``Paul Bert`` in the dataset; the one served by route 391 has
``wheelchair_boarding = 0``, as we can verify:
.. code-block:: postgresql
SELECT DISTINCT s2.stop_name, s2.wheelchair_boarding
FROM stops s0
JOIN stops s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
AND t1.stop_sequence < t2.stop_sequence
JOIN stops s2 ON s2.stop_id = t2.stop_id
JOIN trips u2 ON u2.trip_id = t2.trip_id
JOIN routes r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
AND r2.route_long_name = '391'
AND s2.stop_name = 'Paul Bert';