Error handling in pglogical v3.7
The main tool for troubleshooting is the PostgreSQL log file.
On the upstream side, monitoring uses the views:
On the subscriber side there are numerous helper functions and views that may be consulted to gain insight into pglogical's configuration and behaviour.
Start with the configuration and status summary views:
Configuration can be fetched from:
Observe worker activity with:
Statistics are reported by:
Other views provide logs and details:
The relation pglogical.worker_error_summary
is particularly important for
getting a quick overview of recent problems, though the logs should generally
be your main reference.
pglogical.worker_error
and pglogical.worker_error_summary
These relations show the last error reported by each kind of pglogical worker. Only the most recent error is retained for each distinct worker task. Receiver workers are tracked separately to their writer(s), as are any writer(s) used for table (re)sync purposes.
walsender workers cannot record errors in pglogical.worker_error
. Their
errors are only available in the log files.
pglogical.worker_error_summary
is a convenience view over
pglogical.worker_error
available in 3.7 and above.
pglogical.worker_tasks
The pglogical.worker_tasks
view shows pglogical's current worker launch rate
limiting state as well as some basic statistics on background worker launch
and registration activity.
Unlike the other views listed here, it is not specific to the current database and pglogical node; state for all pglogical nodes on the current PostgreSQL instance is shown. Join on the current database to filter it.
pglogical.worker_tasks
does not track walsenders and output plugins.
See the configuration option
pglogical.min_worker_backoff_delay
for
rate limit settings and overrides.
pglogical.apply_log
and pglogical.apply_log_summary
The pglogical.apply_log_summary
view summarizes the record of apply worker
events kept in pglogical.apply_log
. This records human-readable information
about conflicts and errors that arose during apply.
pglogical.sub_log
The pglogical.sub_log
table contains conflict log filter definitions that
are applied when recording entries in pglogical.apply_log
, controlling
whether conflicts are recorded to a log table and/or postgres log, or silently
dropped. It's managed by pglogical.alter_subscription_add_log(...)
and
pglogical.alter_subscription_remove_log()
.
If you aren't seeing expected conflict information when debugging an issue, check to make sure you have not filtered it out.
When pglogical workers encounter an error condition during operation they
report the error to the PostgreSQL log file, record the error to the
pglogical.worker_error
table if possible, and exit.
Unlike normal PostgreSQL user backends they do not attempt to recover from most errors and resume normal operation. Instead the worker in question will be relaunched soon and will resume operations at the last recoverable point. In the case of apply workers and walsenders that generally means restarting the last uncommitted transaction from the beginning.
This is an intentional design choice to make error handling and recovery simpler and more robust.
For example, if an apply worker tries to apply an UPDATE
and the new row
violates a secondary unique constraint on the target table, the apply worker
will report the unique violation error and exit. The error information will be
visible in pglogical.worker_error_summary
(3.7+, pglogical.worker_error
on 3.6).
The walsender worker on the peer end will exit automatically as well. The apply
worker will be relaunched by the manager worker for the database in a few
seconds and will retry the failed transaction from the beginning. If the
conflicting row has since been removed the transaction will apply normally and
replication will resume. If not, the worker will error again and the cycle will
repeat until the cause of the error is fixed. In this case the fix would
typically be for another subscription or a local application write to replicate
a change that clears the unhandled conflict condition or for the administrator
to intervene to change the conflicting row.
Diagnosing and fixing errors
It's important to first check that your schema and deployment don't violate any of the restrictions imposed by pglogical. Also check the additional writer-specific restrictions from the pglogical writer you are using, most likely the HeapWriter.
Common problems
Some issues that arise when operating pglogical include:
Incorrect or changed provider address or hostname. Update the interface definition for the subscription.
Use
pglogical.alter_node_add_interface(...)
andpglogical.alter_subscription_interface(...)
to change the subscriber's recorded address for the provider.Incorrect
pg_hba.conf
on provider disallowing subscriber from connecting. The subscriber must be able to connect in both replication and ordinary non-replication mode.Correct the
pg_hba.conf
on the provider andSELECT pg_reload_conf();
on the provider.Incompatible schema definitions on provider and subscriber caused by schema changes being made without DDL replication enabled and without use of
pglogical.replicate_ddl_command
. For example, missing columns on subscriber that haven't been excluded by a column filter, differing data types for columns between provider and subscriber, etc.(Some data type differences are actually permitted, but care must be taken that the text representations are compatible. Do not use differing data types for PostgreSQL built-in data types. See restrictions.)
Incorrectly defined
ENABLE REPLICA
orENABLE ALWAYS
triggers firing on apply on the subscriber and causing errors.Heap writers configured to fire normal triggers and foreign key validation triggers (using writer option
config.session_replication_role
). Problems arise when not all triggers have been checked to ensure they'll work correctly with row-replication and without statement triggers being fired as well. Or when FK violations or check constraint violations are created by replication set configuration such as row and column filters or by referenced tables not being replicated along with the referencing tables.Inconsistent versions of PostgreSQL or extensions between provider and subscriber where the version difference affects the behaviour or limits of a data type being replicated.
pglogical explicitly supports replicating between different versions of PostgreSQL, so a version difference alone is not a problem. But the data being replicated must be valid on the subscriber's PostgreSQL version.
For example, apply errors may occur when replicating data from PostGIS 3.0 to PostGIS 2.5 where not all the 3.0 data is understood by 2.5. Similarly, replicating from a PostgreSQL configured without integer datetimes to one with integer datetimes may result in errors if there are non-integer datetimes with values outside the somewhat narrower range permitted by integer datetimes support.
Multiple data source issues
Additional classes of error tend to arise with any sort of multiple-data-source configuration, i.e. multiple subscriptions to different providers for the same tables and/or local writes to tables that are also part of a subscription. Some of these affect BDR3 as well.
These include:
Tables with multiple unique constraints may cause unique violation errors during apply if the table receives writes from multiple sources.
Updating the PRIMARY KEY value for rows, or deleting a key then inserting the same key again soon afterwards. This may cause unique violation errors during apply if the table receives writes from more than one source, i.e. multiple providers and/or local writes.