Help! It's broken!
------------------------------

You're having trouble getting it to work, and are scratching your head as to
what might be wrong. Here are some idiosyncrasies that other people have
stumbled over that might help you to "stumble more quickly." If you have
additions to make to this document, please contact Christopher Browne.

1. I looked for the _clustername namespace, and it wasn't there.

If the DSNs are wrong, then slon instances can't connect to the nodes. This
will generally lead to nodes remaining entirely untouched. Recheck the
connection configuration. By the way, since slon links to libpq, password
information may also be coming from $HOME/.pgpass, so check whether right or
wrong authentication information has been (partially) filled in there.

2. Everything in my script _looks_ OK, and some data is getting pushed
around, but not all of it.

Slony logs might look like the following:

DEBUG1 remoteListenThread_1: connected to 'host=host004 dbname=pgbenchrep user=postgres port=5432'
ERROR  remoteListenThread_1: "select ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type, ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from "_pgbenchtest".sl_event e where (e.ev_origin = '1' and e.ev_seqno > '1') order by e.ev_origin, e.ev_seqno" - could not receive data from server: Operation now in progress

On AIX and Solaris (and possibly elsewhere), both Slony-I _and PostgreSQL_
must be compiled with the --enable-thread-safety option. The above results
when PostgreSQL isn't so compiled. What happens is that libc (threadsafe)
and libpq (non-threadsafe) use different memory locations for errno, thereby
leading to the request failing.

Problems like this crop up with disadmirable regularity on AIX and Solaris;
it may take something of an "object code audit" to make sure that ALL of the
necessary components have been compiled and linked with
--enable-thread-safety.

For instance, I once ran into the problem where LD_LIBRARY_PATH had been
set, on Solaris, to point to libraries from an old PostgreSQL compile. That
meant that even though the database had been compiled with
--enable-thread-safety, and slon had been compiled against that, slon was
being dynamically linked to the "bad old thread-unsafe version," so slon
didn't work. It wasn't clear what was wrong until I ran "ldd" against slon.
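For instance, running something like the following (the path to slon is
purely illustrative; use wherever your slon binary actually lives) shows
which shared libraries, and in particular which libpq, slon really picks up
at run time:

$ ldd /usr/local/pgsql/bin/slon

If the libpq that shows up in the output lives under an old PostgreSQL
installation, that is very likely the thread-unsafe library described above;
fix LD_LIBRARY_PATH (or relink) so that slon picks up the
--enable-thread-safety build.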
3. I tried creating a CLUSTER NAME with a "-" in it. That didn't work.

Slony-I uses the same rules for unquoted identifiers as the PostgreSQL main
parser, so no, you probably shouldn't put a "-" in your identifier name. You
may be able to defeat this by putting "quotes" around identifier names, but
it's liable to bite you some, so this is probably not worth working around.

4. After an immediate stop of postgresql (simulation of a system crash), a
tuple with relname='_${cluster_name}_Restart' exists in
pg_catalog.pg_listener. slon doesn't start because it thinks another process
is serving the cluster on this node. What can I do? The tuples can't be
dropped from this relation. The logs claim that "Another slon daemon is
serving this node already".

It's handy to keep a slonik script like the following one around to run in
such cases:

================================================================================
twcsds004[/opt/twcsds004/OXRS/slony-scripts]$ cat restart_org.slonik
cluster name = oxrsorg ;
node 1 admin conninfo = 'host=32.85.68.220 dbname=oxrsorg user=postgres port=5532';
node 2 admin conninfo = 'host=32.85.68.216 dbname=oxrsorg user=postgres port=5532';
node 3 admin conninfo = 'host=32.85.68.244 dbname=oxrsorg user=postgres port=5532';
node 4 admin conninfo = 'host=10.28.103.132 dbname=oxrsorg user=postgres port=5532';
restart node 1;
restart node 2;
restart node 3;
restart node 4;
================================================================================

'restart node n' cleans this stuff up so that you can restart the node. In
version 1.0.3 and above, the startup process of slon looks for this
condition and automatically cleans it up.

5. If I run a "ps" command, I, and everyone else, can see passwords on the
command line.

Take the passwords out of the Slony configuration, and put them into
$(HOME)/.pgpass.

6. When I run the sample setup script I get an error message similar to:

:64: PGRES_FATAL_ERROR load '$libdir/xxid'; - ERROR: LOAD: could not open file '$libdir/xxid': No such file or directory

Evidently, you haven't got the xxid.so library in the $libdir directory that
the PostgreSQL instance is using. Note that the Slony components need to be
installed on EACH ONE of the nodes, not just on the "master."

This may also point to there being some other mismatch between the
PostgreSQL binary instance and the Slony-I instance. If you compiled Slony-I
yourself, on a machine that may have multiple PostgreSQL builds "lying
around," it's possible that the slon or slonik binaries are asking to load
something that isn't actually in the library directory for the PostgreSQL
database cluster that they're hitting.

Long and short: this points to a need to "audit" what installations of
PostgreSQL and Slony you have in place on the machine(s). Unfortunately,
just about any mismatch will cause things not to link up quite right. Look
back at #2...
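One way to check where $libdir actually points for a given PostgreSQL build,
and whether xxid.so is really there, is something along these lines (the
output path shown is only illustrative; make sure the pg_config you run
belongs to the installation that slon and slonik are connecting to):

$ pg_config --pkglibdir
/usr/local/pgsql/lib
$ ls /usr/local/pgsql/lib/xxid*

If the ls comes up empty, the Slony-I objects haven't been installed into
that PostgreSQL installation.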
7. An oddity - no need for Fully Qualified Name for table keys...

set add table (set id = 1, origin = 1, id = 27,
               full qualified name = 'nspace.some_table',
               key = 'key_on_whatever',
               comment = 'Table some_table in namespace nspace with a candidate primary key');

If you have key = 'nspace.key_on_whatever' the request will FAIL.

8. I'm trying to get a slave subscribed, and get the following messages in
the logs:

DEBUG1 copy_set 1
DEBUG1 remoteWorkerThread_1: connected to provider DB
WARN   remoteWorkerThread_1: transactions earlier than XID 127314958 are still in progress
WARN   remoteWorkerThread_1: data copy for set 1 failed - sleep 60 seconds

Oops. What I forgot to mention, as well, was that I was trying to add TWO
subscribers, concurrently.

That doesn't work out: Slony-I won't run the COPY commands concurrently. See
src/slon/remote_worker.c, function copy_set(). This has the (perhaps
unfortunate) implication that you cannot populate two slaves concurrently.
You have to subscribe one to the set, and only once it has completed setting
up the subscription (copying table contents and such) can the second
subscriber start setting up its subscription.

It could also be possible for there to be an old outstanding transaction
blocking Slony-I from processing the sync. You might want to take a look at
pg_locks to see what's up:

sampledb=# select * from pg_locks where transaction is not null order by transaction;
 relation | database | transaction |   pid   |     mode      | granted
----------+----------+-------------+---------+---------------+---------
          |          |   127314921 | 2605100 | ExclusiveLock | t
          |          |   127326504 | 5660904 | ExclusiveLock | t
(2 rows)

See? 127314921 is indeed older than 127314958, and it's still running.

$ ps -aef | egrep '[2]605100'
postgres 2605100  205018   0 18:53:43  pts/3  3:13 postgres: postgres sampledb localhost COPY

This happens to be a COPY transaction involved in setting up the
subscription for one of the nodes. All is well; the system is busy setting
up the first subscriber, and it won't start on the second one until the
first one has completed subscribing.

By the way, if there is more than one database on the PostgreSQL cluster,
and activity is taking place on the OTHER database, that will lead to
"transactions earlier than XID whatever" being found to be still in
progress. The fact that it's a separate database on the cluster is
irrelevant; Slony-I will wait until those old transactions terminate.

9. I tried setting up a second replication set, and got the following error:

:9: Could not create subscription set 2 for oxrslive!
:11: PGRES_FATAL_ERROR select "_oxrslive".setAddTable(2, 1, 'public.replic_test', 'replic_test__Slony-I_oxrslive_rowID_key', 'Table public.replic_test without primary key');
 - ERROR: duplicate key violates unique constraint "sl_table-pkey"
CONTEXT: PL/pgSQL function "setaddtable_int" line 71 at SQL statement

The table IDs used in SET ADD TABLE are required to be unique ACROSS ALL
SETS. Thus, you can't restart numbering at 1 for a second set; if you are
numbering them consecutively, a subsequent set has to start with IDs after
where the previous set(s) left off.

10. I need to drop a table from a replication set, and I can't see how to do
that.

This can be accomplished several ways, not all equally desirable ;-).

a) You could drop the whole replication set, and recreate it with just the
tables that you need. Alas, that means recopying a whole lot of data, and
kills the usability of the cluster on the rest of the set while that's
happening.

b) If you are running 1.0.3 or later, there is the command SET DROP TABLE,
which will "do the trick."

c) If you are still using 1.0.1 or 1.0.2, the _essential_ functionality of
SET DROP TABLE involves the functionality in droptable_int(). You can fiddle
this by hand by finding the table ID for the table you want to get rid of,
which you can find in sl_table, and then running the following three
queries, on each host:

  select _slonyschema.alterTableRestore(40);
  select _slonyschema.tableDropKey(40);
  delete from _slonyschema.sl_table where tab_id = 40;

The schema will obviously depend on how you defined the Slony-I cluster. The
table ID, in this case 40, will need to change to the ID of the table you
want to have go away.

You'll have to run these three queries on all of the nodes, preferably first
on the "master" node, so that the dropping propagates properly. Implementing
this via a slonik statement with a new Slony event would do that; submitting
the three queries using EXECUTE SCRIPT could accomplish the same thing. It
is also possible to connect to each database and submit the queries by hand.

11. I want to stop replicating a sequence.

If you are running 1.0.3 or later, there is a SET DROP SEQUENCE command in
Slonik to allow you to do this, paralleling SET DROP TABLE.
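As a minimal sketch of what those two commands look like in a slonik script
(the node number, table ID, and sequence ID below are made up; check the
slonik reference for your version for the exact parameter names):

cluster name = oxrsorg;
node 1 admin conninfo = 'host=32.85.68.220 dbname=oxrsorg user=postgres port=5532';
set drop table (origin = 1, id = 40);
set drop sequence (origin = 1, id = 93);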
If you are running 1.0.2 or earlier, the process is a bit more manual.

Supposing I want to get rid of the two sequences listed below,
whois_cachemgmt_seq and epp_whoi_cach_seq_, we start by needing the seq_id
values:

oxrsorg=# select * from _oxrsorg.sl_sequence where seq_id in (93,59);
 seq_id | seq_reloid | seq_set |             seq_comment
--------+------------+---------+-------------------------------------
     93 |  107451516 |       1 | Sequence public.whois_cachemgmt_seq
     59 |  107451860 |       1 | Sequence public.epp_whoi_cach_seq_
(2 rows)

The data that needs to be deleted to stop Slony from continuing to replicate
these sequences is thus:

  delete from _oxrsorg.sl_seqlog where seql_seqid in (93, 59);
  delete from _oxrsorg.sl_sequence where seq_id in (93, 59);

Those two queries could be submitted to all of the nodes via ddlscript() /
EXECUTE SCRIPT, thus eliminating the sequence everywhere "at once." Or they
may be applied by hand to each of the nodes. As noted above, this
functionality is provided in Slony-I version 1.0.3 as SET DROP SEQUENCE,
paralleling SET DROP TABLE.

12. I tried to add a table to a set, and got the following message:

Slony-I: cannot add table to currently subscribed set 1

You cannot add tables to sets that already have subscribers. The workaround
is to create ANOTHER set, add the new tables to that new set, subscribe the
same nodes subscribing to "set 1" to the new set, and then merge the sets
together, as sketched below.
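A rough slonik sketch of that workaround follows. The set ID, table ID,
table name, and conninfo lines are all made up for illustration, and (as
noted in #9) the table IDs in the new set must not collide with IDs already
used by other sets:

cluster name = oxrsorg;
node 1 admin conninfo = 'host=32.85.68.220 dbname=oxrsorg user=postgres port=5532';
node 2 admin conninfo = 'host=32.85.68.216 dbname=oxrsorg user=postgres port=5532';
create set (id = 2, origin = 1, comment = 'temporary set for new tables');
set add table (set id = 2, origin = 1, id = 51,
               full qualified name = 'public.my_new_table',
               comment = 'newly added table');
subscribe set (id = 2, provider = 1, receiver = 2, forward = no);

Then, once the subscription of set 2 has completed on every node that
subscribes to set 1, the sets can be collapsed back into one:

merge set (id = 1, add id = 2, origin = 1);

Check the MERGE SET documentation for your version; it generally requires
the subscriptions of both sets to match before the merge is accepted.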
13. Some nodes start consistently falling behind.

I have been running Slony-I on a node for a while, and am seeing system
performance suffering. I'm seeing long running queries of the form:

fetch 100 from LOG;

This is characteristic of pg_listener (which is the table containing NOTIFY
data) having plenty of dead tuples in it. That makes NOTIFY events take a
long time, and causes the affected node to gradually fall further and
further behind.

You quite likely need to do a VACUUM FULL on pg_listener, to vigorously
clean it out, and you need to vacuum pg_listener really frequently. Once
every five minutes would likely be AOK.

Slon daemons already vacuum a bunch of tables, and cleanup_thread.c contains
a list of tables that are frequently vacuumed automatically. In Slony-I
1.0.2, pg_listener is not included. In 1.0.3 and later, it is regularly
vacuumed, so this should cease to be an issue.

14. I started doing a backup using pg_dump, and suddenly Slony stops
replicating anything.

Ouch. What happens here is a conflict between:

a) pg_dump, which has taken out an AccessShareLock on all of the tables in
the database, including the Slony-I ones, and

b) a Slony-I sync event, which wants to grab an AccessExclusiveLock on the
table sl_event.

The initial query that will be blocked is thus:

select "_slonyschema".createEvent('_slonyschema', 'SYNC', NULL);

(You can see this in pg_stat_activity, if you have query display turned on
in postgresql.conf.)

The actual query combination that is causing the lock is from the function
Slony_I_ClusterStatus(), found in slony1_funcs.c, and is localized in the
code that does:

LOCK TABLE %s.sl_event;
INSERT INTO %s.sl_event (...stuff...)
SELECT currval('%s.sl_event_seq');

The LOCK statement will sit there and wait until pg_dump (or whatever else
has pretty much any kind of access lock on sl_event) completes. Every
subsequent query submitted that touches sl_event will block behind the
createEvent call.

There are a number of possible answers to this:

a) Have pg_dump specify the schema dumped using --schema=whatever, and don't
try dumping the cluster's schema.

b) It would be nice to add an "--exclude-schema" option to pg_dump to
exclude the Slony cluster schema. Maybe in 8.0 or 8.1...

Note that 1.0.3 uses a more precise lock that is less exclusive, which
should relieve this problem.

15. The slons spent the weekend out of commission [for some reason], and
it's taking a long time to get a sync through.

You might want to take a look at the sl_log_1/sl_log_2 tables, and do a
summary to see if there are any really enormous Slony-I transactions in
there. Up until at least 1.0.2, there needs to be a slon connected to the
master in order for SYNC events to be generated. If none are being
generated, then all of the updates until the next one is generated will
collect into one rather enormous Slony-I transaction.

Conclusion: even if there is not going to be a subscriber around, you
_really_ want to have a slon running to service the "master" node.

Some future version (probably 1.1) may provide a way for SYNC counts to be
updated on the master by the stored function that is invoked by the table
triggers.

16. I pointed a subscribing node to a different parent and it stopped
replicating.

We noticed this happening when we wanted to re-initialize a node, where we
had configuration thus:

Node 1 - master
Node 2 - child of node 1 - the node we're reinitializing
Node 3 - child of node 2 - the node that should keep replicating

The subscription for node 3 was changed to have node 1 as provider, and we
did DROP SET/SUBSCRIBE SET for node 2 to get it repopulating. Unfortunately,
replication suddenly stopped to node 3.

The problem was that there was not a suitable set of "listener paths" in
sl_listen to allow the events from node 1 to propagate to node 3. The events
were going through node 2, and blocking behind the SUBSCRIBE SET event that
node 2 was working on.

The following slonik script dropped out the listen paths where node 3 had to
go through node 2, and added in direct listens between nodes 1 and 3:

cluster name = oxrslive;
node 1 admin conninfo='host=32.85.68.220 dbname=oxrslive user=postgres port=5432';
node 2 admin conninfo='host=32.85.68.216 dbname=oxrslive user=postgres port=5432';
node 3 admin conninfo='host=32.85.68.244 dbname=oxrslive user=postgres port=5432';
node 4 admin conninfo='host=10.28.103.132 dbname=oxrslive user=postgres port=5432';
try {
    store listen (origin = 1, receiver = 3, provider = 1);
    store listen (origin = 3, receiver = 1, provider = 3);
    drop listen (origin = 1, receiver = 3, provider = 2);
    drop listen (origin = 3, receiver = 1, provider = 2);
}

Immediately after this script was run, SYNC events started propagating again
to node 3.

This points out two principles:

1. If you have multiple nodes, and cascaded subscribers, you need to be
quite careful in populating the STORE LISTEN entries, and in modifying them
if the structure of the replication "tree" changes.

2. Version 1.1 probably ought to provide better tools to help manage this.
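In the meantime, you can inspect the current listen network by querying
sl_listen on any node; a minimal sketch, assuming the cluster is named
oxrslive (double-check the column names against your version's schema):

select li_origin, li_provider, li_receiver
  from _oxrslive.sl_listen
 order by li_origin, li_receiver;

Each row says that node li_receiver listens for events originating on node
li_origin by way of node li_provider; in a cascaded setup, those rows are
what determine whether events can reach a node at all.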