Failover with pglogical3 v3.7
pglogical has support for following failover of both the provider (logical master) and subscriber (logical replica) if the conditions described in the following sections are met.
Only failover to streaming physical replicas is supported. pglogical subscribers cannot switch from replication from the provider to replicating from another peer subscriber.
Provider failover setup
With appropriate configuration of the provider and the provider's physical standby(s), pglogical subscriber(s) can follow failover of the provider to a promoted physical streaming replica of the provider.
Given a topology like this:
On failure of Provider1 and promotion of Provider2 to replace it, pglogical on Subscriber1 can consistently follow the failover and promotion if:
- Provider1 and Provider2 run PostgreSQL 10 or newer
- The connection between Provider1 and Provider2 uses streaming replication
with hot standby feedback and a physical replication slot. It's OK if WAL
archiving and a
restore_command
is configured as a fallback. - Provider2 has:
recovery.conf
:primary_conninfo
pointing to Provider1primary_slot_name
naming a physical replication slot on Provider1 to be used only by Provider2
postgresql.conf
:pglogical
in itsshared_preload_libraries
hot_standby = on
hot_standby_feedback = on
pglogical.synchronize_failover_slot_names
can be modified to specify which slots should be synchronized (default is all pglogical/bdr slots)
- Provider1 has:
postgresql.conf
:pglogical.standby_slot_names
lists the physical replication slot used for Provider2'sprimary_slot_name
. Promotion will still work if this is not set, but subscribers may be inconsistent per the linked documentation on the setting.
- Provider2 has had time to sync and has created a copy of Subscriber1's
logical replication slot. pglogical3 creates master slots on replicas
automatically once the replica's resource reservations can satisfy the master
slot's requirements, so just check that all pglogical slots on the master exist
on the standby, and have
confirmed_flush_lsn
set. - Provider2 takes over Provider1's IP address or hostname or Subscriber1's
existing subscription is reconfigured to connect to Provider2 using
pglogical.alter_node_add_interface
andpglogical.alter_subscription_interface
.
It is not necessary for Subscriber1 to be aware of or able to connect to Provider2 until it is promoted.
The post-failover topology is:
The reason pglogical must run on the provider's replica, and the provider's replica must use a physical replication slot, is due to limitations in PostgreSQL itself.
Normally when a PostgreSQL instance is replaced by a promoted physical replica of the same instance, any replication slots on that node are lost. Replication slot status is not itself replicated along physical replication connections and does not appear in WAL. So if the failed-and-replaced node was the upstream provider of any logical subscribers, those subscribers stop being able to receive data and cannot recover. Physical failover breaks logical replication connections.
To work around this, pglogical3 running on the failover-candidate replica syncs
the state of the master provider's logical replication slot(s) to the replica.
It also sends information back to the master to ensure that those slots
guarantees' (like catalog_xmin
) are respected by the master. That
synchronization requires a physical replication slot to avoid creating
excessive master bloat and to ensure the reservation is respected by the master
even if the replication connection is broken.
Subscriber failover setup
pglogical automatically follows failover of a subscriber to a streaming physical replica of the subscriber. No additional configuration is required.
WARNING: At present it's possible for the promoted subscriber to lose some
transactions that were committed on the failed subscriber and confirmed-flushed
to the provider, but not yet replicated to the new subscriber at the time of
promotion. That's because the provider will silently start replication at the
greater of the position the subscriber sends from its replication origin and
the position the master has recorded in its slot's confirmed_flush_lsn
.
Where possible you should execute a planned failover by stopping the subscription on Subscriber1 and waiting until Subscriber2 is caught up to Subscriber1 before failing over.
Given the server topology:
Upon promotion of Subscriber2 to replace a failed Subscriber1, logical replication will resume normally. It doesn't matter whether Subscriber2 has the same IP address or not.
For replication to resume promptly it may be necessary to explicitly terminate the walsender for Subscriber1 on Provider1 if the connection failure is not detected promptly by Provider1. pglogical enables TCP keepalives by default so in the absence of manual action it should exit and release the slot automatically in a few minutes.
It is important that Subscriber1 be fenced or otherwise conclusively terminated before Subscriber2 is promoted. Otherwise Subscriber1 can interfere with Subscriber2's replication progress tracking on Provider1 and create gaps in the replication stream.
After failover the topology is:
Note: at this time it is possible that there can be a small window of replicated data loss around the window of failover. pglogical on Subscriber1 may send confirmation of receipt of data to Provider1 before ensuring that Subscriber2 has received and flushed that data.
Additional functions
pglogical.sync_failover_slots()
Signal the supervisor to restart the mecanism to synchronize the failover
slots specifyed in the pglogical.synchronize_failover_slot_names
Synopsis
pglogical.syncfailover_slots();
This function should be run on the subscriber.
Legacy: Provider failover with pglogical2 using failover slots
An earlier effort to support failover of logical replication used the "failover slots" patch to PostgreSQL 9.6. This patch is carried in 2ndQPostgres 9.6 (only), but did not get merged into any community PostgreSQL version. pglogical2 supports using 2ndQPostgres and failover slots to follow provider failover.
The failover slots patch is neither required nor supported by pglogical3.
pglogical3 only supports provider failover on PostgreSQL 10 or newer, since
that is the first PostgreSQL version that contains support for sending
catalog_xmin
in hot standby feedback and for logical decoding to follow
timeline switches.
This section is retained to explain the change in failover models and reduce any confusion that may arise when updating from pglogical2 to pglogical3.