Node management v5

Each database that's member of a PGD group must be represented by its own node. A node is a unique identifier of a database in a PGD group.

At present, each node can be a member of just one node group. (This might be extended in later releases.) Each node can subscribe to one or more replication sets to give fine-grained control over replication.

A PGD group might also contain zero or more subgroups, allowing you to create a variety of different architectures.

Creating and joining a PGD group

For PGD, every node must connect to every other node. To make configuration easy, when a new node joins, it configures all existing nodes to connect to it. For this reason, every node, including the first PGD node created, must know the PostgreSQL connection string, sometimes referred to as a data source name (DSN), that other nodes can use to connect to it. Both formats of connection string are supported. So you can use either key-value format, like host=myhost port=5432 dbname=mydb, or URI format, like postgresql://myhost:5432/mydb.

The SQL function bdr.create_node_group() creates the PGD group from the local node. Doing so activates PGD on that node and allows other nodes to join the PGD group, which consists of only one node at that point. At the time of creation, you must specify the connection string for other nodes to use to connect to this node.

Once the node group is created, every further node can join the PGD group using the bdr.join_node_group() function.

Alternatively, use the command line utility bdr_init_physical to create a new node, using pg_basebackup (or a physical standby) of an existing node. If using pg_basebackup, the bdr_init_physical utility can optionally specify the base backup of only the target database. The earlier behavior was to back up the entire database cluster. With this utility, the activity completes faster and also uses less space because it excludes unwanted databases. If you specify only the target database, then the excluded databases get cleaned up and removed on the new node.

When a new PGD node is joined to an existing PGD group or a node subscribes to an upstream peer, before replication can begin the system must copy the existing data from the peer nodes to the local node. This copy must be carefully coordinated so that the local and remote data starts out identical. It's not enough to use pg_dump yourself. The BDR extension provides built-in facilities for making this initial copy.

During the join process, the BDR extension synchronizes existing data using the provided source node as the basis and creates all metadata information needed for establishing itself in the mesh topology in the PGD group. If the connection between the source and the new node disconnects during this initial copy, restart the join process from the beginning.

The node that is joining the cluster must not contain any schema or data that already exists on databases in the PGD group. We recommend that the newly joining database be empty except for the BDR extension. However, it's important that all required database users and roles are created.

Optionally, you can skip the schema synchronization using the synchronize_structure parameter of the bdr.join_node_group() function. In this case, the schema must already exist on the newly joining node.

We recommend that you select the source node that has the best connection (the closest) as the source node for joining. Doing so lowers the time needed for the join to finish.

Coordinate the join procedure using the Raft consensus algorithm, which requires most existing nodes to be online and reachable.

The logical join procedure (which uses the bdr.join_node_group() function) performs data sync doing COPY operations and uses multiple writers (parallel apply) if those are enabled.

Node join can execute concurrently with other node joins for the majority of the time taken to join. However, only one regular node at a time can be in either of the states PROMOTE or PROMOTING, which are typically fairly short if all other nodes are up and running. Otherwise the join is serialized at this stage. The subscriber-only nodes are an exception to this rule, and they can be concurrently in PROMOTE and PROMOTING states as well, so their join process is fully concurrent.

The join process uses only one node as the source, so it can be executed when nodes are down if a majority of nodes are available. This can cause a complexity when running logical join. During logical join, the commit timestamp of rows copied from the source node is set to the latest commit timestamp on the source node. Committed changes on nodes that have a commit timestamp earlier than this (because nodes are down or have significant lag) can conflict with changes from other nodes. In this case, the newly joined node can be resolved differently to other nodes, causing a divergence. As a result, we recommend not running a node join when significant replication lag exists between nodes. If this is necessary, run LiveCompare on the newly joined node to correct any data divergence once all nodes are available and caught up.

pg_dump can fail when there is concurrent DDL activity on the source node because of cache-lookup failures. Since bdr.join_node_group() uses pg_dump internally, it might fail if there's concurrent DDL activity on the source node. Retrying the join works in that case.

Joining a heterogeneous cluster

PGD 4.0 node can join a EDB Postgres Distributed cluster running 3.7.x at a specific minimum maintenance release (such as 3.7.6) or a mix of 3.7 and 4.0 nodes. This procedure is useful when you want to upgrade not just the PGD major version but also the underlying PostgreSQL major version. You can achieve this by joining a 3.7 node running on PostgreSQL 12 or 13 to a EDB Postgres Distributed cluster running 3.6.x on PostgreSQL 11. The new node can also run on the same PostgreSQL major release as all of the nodes in the existing cluster.

PGD ensures that the replication works correctly in all directions even when some nodes are running 3.6 on one PostgreSQL major release and other nodes are running 3.7 on another PostgreSQL major release. But we recommend that you quickly bring the cluster into a homogenous state by parting the older nodes once enough new nodes join the cluster. Don't run any DDLs that might not be available on the older versions and vice versa.

A node joining with a different major PostgreSQL release can't use physical backup taken with bdr_init_physical, and the node must join using the logical join method. This is necessary because the major PostgreSQL releases aren't on-disk compatible with each other.

When a 3.7 node joins the cluster using a 3.6 node as a source, certain configurations, such as conflict resolution, aren't copied from the source node. The node must be configured after it joins the cluster.

Connection DSNs and SSL (TLS)

The DSN of a node is simply a libpq connection string, since nodes connect using libpq. As such, it can contain any permitted libpq connection parameter, including those for SSL. The DSN must work as the connection string from the client connecting to the node in which it's specified. An example of such a set of parameters using a client certificate is:

sslmode=verify-full sslcert=bdr_client.crt sslkey=bdr_client.key
sslrootcert=root.crt

With this setup, the files bdr_client.crt, bdr_client.key, and root.crt must be present in the data directory on each node, with the appropriate permissions. For verify-full mode, the server's SSL certificate is checked to ensure that it's directly or indirectly signed with the root.crt certificate authority and that the host name or address used in the connection matches the contents of the certificate. In the case of a name, this can match a Subject Alternative Name or, if there are no such names in the certificate, the Subject's Common Name (CN) field. Postgres doesn't currently support subject alternative names for IP addresses, so if the connection is made by address rather than name, it must match the CN field.

The CN of the client certificate must be the name of the user making the PGD connection. This is usually the user postgres. Each node requires matching lines permitting the connection in the pg_hba.conf file. For example:

hostssl all         postgres 10.1.2.3/24 cert
hostssl replication postgres 10.1.2.3/24 cert

Another setup might be to use SCRAM-SHA-256 passwords instead of client certificates and not verify the server identity as long as the certificate is properly signed. Here the DSN parameters might be:

sslmode=verify-ca sslrootcert=root.crt

The corresponding pg_hba.conf lines are:

hostssl all         postgres 10.1.2.3/24 scram-sha-256
hostssl replication postgres 10.1.2.3/24 scram-sha-256

In such a scenario, the postgres user needs a .pgpass file containing the correct password.

Witness nodes

If the cluster has an even number of nodes, it might be useful to create an extra node to help break ties in the event of a network split (or network partition, as it is sometimes called).

Rather than create an additional full-size node, you can create a micro node, sometimes called a witness node. This is a normal PGD node that is deliberately set up not to replicate any tables or data to it.

Logical standby nodes

PGD allows you to create a logical standby node, also known as an offload node, a read-only node, receive-only node, or logical-read replicas. A master node can have zero, one, or more logical standby nodes.

Note

Logical standby nodes can be used in environments where network traffic between data centers is a concern; otherwise having more data nodes per location is always preferred.

With a physical standby node, the node never comes up fully, forcing it to stay in continual recovery mode. PGD allows something similar. bdr.join_node_group has the pause_in_standby option to make the node stay in half-way-joined as a logical standby node. Logical standby nodes receive changes but don't send changes made locally to other nodes.

Later, if you want, use bdr.promote_node() to move the logical standby into a full, normal send/receive node.

A logical standby is sent data by one source node, defined by the DSN in bdr.join_node_group. Changes from all other nodes are received from this one source node, minimizing bandwidth between multiple sites.

There are multiple options for high availability:

  • If the source node dies, one physical standby can be promoted to a master. In this case, the new master can continue to feed any or all logical standby nodes.

  • If the source node dies, one logical standby can be promoted to a full node and replace the source in a failover operation similar to single-master operation. If there are multiple logical standby nodes, the other nodes can't follow the new master, so the effectiveness of this technique is limited to one logical standby.

In case a new standby is created from an existing PGD node, the needed replication slots for operation aren't synced to the new standby until at least 16 MB of LSN has elapsed since the group slot was last advanced. In extreme cases, this might require a full 16 MB before slots are synced or created on the streaming replica. If a failover or switchover occurs during this interval, the streaming standby can't be promoted to replace its PGD node, as the group slot and other dependent slots don't exist yet.

The slot sync-up process on the standby solves this by invoking a function on the upstream. This function moves the group slot in the entire EDB Postgres Distributed cluster by performing WAL switches and requesting all PGD peer nodes to replay their progress updates. This causes the group slot to move ahead in a short time span. This reduces the time required by the standby for the initial slot's sync-up, allowing for faster failover to it, if required.

On PostgreSQL, it's important to ensure that the slot's sync-up completes on the standby before promoting it. You can run the following query on the standby in the target database to monitor and ensure that the slots synced up with the upstream. The promotion can go ahead when this query returns true.

SELECT true FROM pg_catalog.pg_replication_slots WHERE
    slot_type = 'logical' AND confirmed_flush_lsn IS NOT NULL;

You can also nudge the slot sync-up process in the entire PGD cluster by manually performing WAL switches and by requesting all PGD peer nodes to replay their progress updates. This activity causes the group slot to move ahead in a short time and also hastens the slot sync-up activity on the standby. You can run the following queries on any PGD peer node in the target database for this:

SELECT bdr.run_on_all_nodes('SELECT pg_catalog.pg_switch_wal()');
SELECT bdr.run_on_all_nodes('SELECT bdr.request_replay_progress_update()');

Use the monitoring query on the standby to check that these queries do help in faster slot sync-up on that standby.

Logical standby nodes can be protected using physical standby nodes, if desired, so Master->LogicalStandby->PhysicalStandby. You can't cascade from LogicalStandby to LogicalStandby.

A logical standby does allow write transactions, so the restrictions of a physical standby don't apply. You can use this to great benefit, since it allows the logical standby to have additional indexes, longer retention periods for data, intermediate work tables, LISTEN/NOTIFY, temp tables, materialized views, and other differences.

Any changes made locally to logical standbys that commit before the promotion aren't sent to other nodes. All transactions that commit after promotion are sent onwards. If you perform writes to a logical standby, take care to quiesce the database before promotion.

You might make DDL changes to logical standby nodes but they aren't replicated and they don't attempt to take global DDL locks. PGD functions that act similarly to DDL also aren't replicated. See DDL replication. If you made incompatible DDL changes to a logical standby, then the database is a divergent node. Promotion of a divergent node currently results in replication failing. As a result, plan to either ensure that a logical standby node is kept free of divergent changes if you intend to use it as a standby, or ensure that divergent nodes are never promoted.

Physical standby nodes

PGD also enables you to create traditional physical standby failover nodes. These are commonly intended to directly replace a PGD node in the cluster after a short promotion procedure. As with any standard Postgres cluster, a node can have any number of these physical replicas.

There are, however, some minimal prerequisites for this to work properly due to the use of replication slots and other functional requirements in PGD:

  • The connection between PGD primary and standby uses streaming replication through a physical replication slot.
  • The standby has:
    • recovery.conf (for PostgreSQL <12, for PostgreSQL 12+ these settings are in postgres.conf):
      • primary_conninfo pointing to the primary
      • primary_slot_name naming a physical replication slot on the primary to be used only by this standby
    • postgresql.conf:
      • shared_preload_libraries = 'bdr', there can be other plugins in the list as well, but don't include pglogical
      • hot_standby = on
      • hot_standby_feedback = on
  • The primary has:
    • postgresql.conf:
      • bdr.standby_slot_names specifies the physical replication slot used for the standby's primary_slot_name.

While this is enough to produce a working physical standby of a PGD node, you need to address some additional concerns.

Once established, the standby requires enough time and WAL traffic to trigger an initial copy of the primary's other PGD-related replication slots, including the PGD group slot. At minimum, slots on a standby are live and can survive a failover only if they report a nonzero confirmed_flush_lsn as reported by pg_replication_slots.

As a consequence, check physical standby nodes in newly initialized PGD clusters with low amounts of write activity before assuming a failover will work normally. Failing to take this precaution can result in the standby having an incomplete subset of required replication slots needed to function as a PGD node, and thus an aborted failover.

The protection mechanism that ensures physical standby nodes are up to date and can be promoted (as configured bdr.standby_slot_names) affects the overall replication latency of the PGD group. This is because the group replication happens only when the physical standby nodes are up to date.

For these reasons, we generally recommend to use either logical standby nodes or a subscribe-only group instead of physical standby nodes. They both have better operational characteristics in comparison.

You can can manually ensure the group slot is advanced on all nodes (as much as possible), which helps hasten the creation of PGD-related replication slots on a physical standby using the following SQL syntax:

SELECT bdr.move_group_slot_all_nodes();

Upon failover, the standby must perform one of two actions to replace the primary:

  • Assume control of the same IP address or hostname as the primary.
  • Inform the EDB Postgres Distributed cluster of the change in address by executing the bdr.alter_node_interface function on all other PGD nodes.

Once this is done, the other PGD nodes reestablish communication with the newly promoted standby -> primary node. Since replication slots are synchronized only periodically, this new primary might reflect a lower LSN than expected by the existing PGD nodes. If this is the case, PGD fast forwards each lagging slot to the last location used by each PGD node.

Take special note of the bdr.standby_slot_names parameter as well. It's important to set it in a EDB Postgres Distributed cluster where there is a primary -> physical standby relationship or when using subscriber-only groups.

PGD maintains a group slot that always reflects the state of the cluster node showing the most lag for any outbound replication. With the addition of a physical replica, PGD must be informed that there is a nonparticipating node member that, regardless, affects the state of the group slot.

Since the standby doesn't directly communicate with the other PGD nodes, the standby_slot_names parameter informs PGD to consider named slots as needed constraints on the group slot as well. When set, the group slot is held if the standby shows lag, even if the group slot is normally advanced.

As with any physical replica, this type of standby can also be configured as a synchronous replica. As a reminder, this requires:

  • On the standby:
    • Specifying a unique application_name in primary_conninfo
  • On the primary:
    • Enabling synchronous_commit
    • Including the standby application_name in synchronous_standby_names

It's possible to mix physical standby and other PGD nodes in synchronous_standby_names. CAMO and Eager All-Node Replication use different synchronization mechanisms and don't work with synchronous replication. Make sure synchronous_standby_names doesn't include any PGD node if either CAMO or Eager All-Node Replication is used. Instead use only non-PGD nodes, for example, a physical standby.

Subgroups

A group can also contain zero or more subgroups. Each subgroup can be allocated to a specific purpose in the top-level parent group. The node_group_type specifies the type when the subgroup is created.

Subscriber-only groups

As the name suggests, this type of node subscribes only to replication changes from other nodes in the cluster. However, no other nodes receive replication changes from subscriber-only nodes. This is somewhat similar to logical standby nodes. But in contrast to logical standby, the subscriber-only nodes are fully joined to the cluster. They can receive replication changes from all other nodes in the cluster and hence aren't affected by unavailability or parting of any one node in the cluster.

A subscriber-only node is a fully joined PGD node and hence it receives all replicated DDLs and acts on those. It also uses Raft to consistently report its status to all nodes in the cluster. The subscriber-only node doesn't have Raft voting rights and hence can't become a Raft leader or participate in the leader election. Also, while it receives replicated DDLs, it doesn't participate in DDL or DML lock acquisition. In other words, a currently down subscriber-only node doesn't stop a DML lock from being acquired.

The subscriber-only node forms the building block for PGD Tree topology. In this topology, a small number of fully active nodes are replicating changes in all directions. A large number of subscriber-only nodes receive only changes but never send any changes to any other node in the cluster. This topology avoids connection explosion due to a large number of nodes, yet provides an extremely large number of leaf nodes that you can use to consume the data.

To make use of subscriber-only nodes, first create a PGD group of type subscriber-only. Make it a subgroup of the group from which the member nodes receive the replication changes. Once you create the subgroup, all nodes that intend to become subscriber-only nodes must join the subgroup. You can create more than one subgroup of subscriber-only type, and they can have different parent groups.

Once a node successfully joins the subscriber-only subgroup, it becomes a subscriber-only node and starts receiving replication changes for the parent group. Any changes made directly on the subscriber-only node aren't replicated.

See bdr.create_node_group() to know how to create a subgroup of a specific type and belonging to a specific parent group.

Notes

Since a subscriber-only node doesn't replicate changes to any node in the cluster, it can't act as a source for syncing replication changes when a node is parted from the cluster. But if the subscriber-only node already received and applied replication changes from the parted node that no other node in the cluster currently has, then that causes inconsistency between the nodes.

For now, you can solve this by setting bdr.standby_slot_names and bdr.standby_slots_min_confirmed so that there is always a fully active PGD node that is ahead of the subscriber-only nodes.

This might be improved in a future release. We might either allow subscriber-only nodes to be ahead in the replication and then use them as replication source for sync or simply provide ways to optionally remove the inconsistent subscriber-only nodes from the cluster when another fully joined node is parted.

Decoding worker

PGD4 provides an option to enable a decoding worker process that performs decoding once, no matter how many nodes are sent data. This introduces a new process, the WAL decoder, on each PGD node. One WAL sender process still exists for each connection, but these processes now just perform the task of sending and receiving data. Taken together, these changes reduce the CPU overhead of larger PGD groups and also allow higher replication throughput since the WAL sender process now spends more time on communication.

enable_wal_decoder is an option for each PGD group, which is currently disabled by default. You can use bdr.alter_node_group_config() to enable or disable the decoding worker for a PGD group.

When the decoding worker is enabled, PGD stores logical change record (LCR) files to allow buffering of changes between decoding and when all subscribing nodes received data. LCR files are stored under the pg_logical directory in each local node's data directory. The number and size of the LCR files varies as replication lag increases, so this also needs monitoring. The LCRs that aren't required by any of the PGD nodes are cleaned periodically. The interval between two consecutive cleanups is controlled by bdr.lcr_cleanup_interval, which defaults to 3 minutes. The cleanup is disabled when bdr.lcr_cleanup_interval is zero.

When disabled, logical decoding is performed by the WAL sender process for each node subscribing to each node. In this case, no LCR files are written.

Even though the decoding worker is enabled for a PGD group, following GUCs control the production and use of LCR per node. By default these are false. For production and use of LCRs, enable the decoding worker for the PGD group and set these GUCs to to true on each of the nodes in the PGD group.

  • bdr.enable_wal_decoder When turned false, all WAL senders using LCRs restart to use WAL directly. When true along with the PGD group config, a decoding worker process is started to produce LCR and WAL Senders use LCR.
  • bdr.receive_lcr When true on the subscribing node, it requests WAL sender on the publisher node to use LCRs if available.

Notes

As of now, a decoding worker decodes changes corresponding to the node where it's running. A logical standby is sent changes from all the nodes in the PGD group through a single source. Hence a WAL sender serving a logical standby can't use LCRs right now.

A subscriber-only node receives changes from respective nodes directly. Hence a WAL sender serving a subscriber-only node can use LCRs.

Even though LCRs are produced, the corresponding WALs are still retained similar to the case when a decoding worker isn't enabled. In the future, it might be possible to remove WAL corresponding the LCRs, if they aren't otherwise required.

For reference, the first 24 characters of an LCR file name are similar to those in a WAL file name. The first 8 characters of the name are all '0' right now. In the future, they are expected to represent the TimeLineId similar to the first 8 characters of a WAL segment file name. The following sequence of 16 characters of the name is similar to the WAL segment number, which is used to track LCR changes against the WAL stream.

However, logical changes are reordered according to the commit order of the transactions they belong to. Hence their placement in the LCR segments doesn't match the placement of corresponding WAL in the WAL segments.

The set of last 16 characters represents the subsegment number in an LCR segment. Each LCR file corresponds to a subsegment. LCR files are binary and variable sized. The maximum size of an LCR file can be controlled by bdr.max_lcr_segment_file_size, which defaults to 1 GB.

Node restart and down node recovery

PGD is designed to recover from node restart or node disconnection. The disconnected node rejoins the group by reconnecting to each peer node and then replicating any missing data from that node.

When a node starts up, each connection begins showing bdr.node_slots.state = catchup and begins replicating missing data. Catching up continues for a period of time that depends on the amount of missing data from each peer node and will likely increase over time, depending on the server workload.

If the amount of write activity on each node isn't uniform, the catchup period from nodes with more data can take significantly longer than other nodes. Eventually, the slot state changes to bdr.node_slots.state = streaming.

Nodes that are offline for longer periods, such as hours or days, can begin to cause resource issues for various reasons. Don't plan on extended outages without understanding the following issues.

Each node retains change information (using one replication slot for each peer node) so it can later replay changes to a temporarily unreachable node. If a peer node remains offline indefinitely, this accumulated change information eventually causes the node to run out of storage space for PostgreSQL transaction logs (WAL in pg_wal), and likely causes the database server to shut down with an error similar to this:

PANIC: could not write to file "pg_wal/xlogtemp.559": No space left on device

Or, it might report other out-of-disk related symptoms.

In addition, slots for offline nodes also hold back the catalog xmin, preventing vacuuming of catalog tables.

On EDB Postgres Extended Server and EDB Postgres Advanced Server, offline nodes also hold back freezing of data to prevent losing conflict-resolution data (see Origin conflict detection).

Administrators must monitor for node outages (see monitoring) and make sure nodes have enough free disk space. If the workload is predictable, you might be able to calculate how much space is used over time, allowing a prediction of the maximum time a node can be down before critical issues arise.

Don't manually remove replication slots created by PGD. If you do, the cluster becomes damaged and the node that was using the slot must be parted from the cluster, as described in Replication slots created by PGD.

While a node is offline, the other nodes might not yet have received the same set of data from the offline node, so this might appear as a slight divergence across nodes. The parting process corrects this imbalance across nodes. (Later versions might do this earlier.)

Replication slots created by PGD

On a PGD master node, the following replication slots are created by PGD:

  • One group slot, named bdr_<database name>_<group name>
  • N-1 node slots, named bdr_<database name>_<group name>_<node name>, where N is the total number of PGD nodes in the cluster, including direct logical standbys, if any
Warning

Don't drop those slots. PGD creates and manages them and drops them when or if necessary.

On the other hand, you can create or drop replication slots required by software like Barman or logical replication using the appropriate commands for the software without any effect on PGD. Don't start slot names used by other software with the prefix bdr_.

For example, in a cluster composed of the three nodes alpha, beta, and gamma, where PGD is used to replicate the mydb database and the PGD group is called mygroup:

  • Node alpha has three slots:
    • One group slot named bdr_mydb_mygroup
    • Two node slots named bdr_mydb_mygroup_beta and bdr_mydb_mygroup_gamma
  • Node beta has three slots:
    • One group slot named bdr_mydb_mygroup
    • Two node slots named bdr_mydb_mygroup_alpha and bdr_mydb_mygroup_gamma
  • Node gamma has three slots:
    • One group slot named bdr_mydb_mygroup
    • Two node slots named bdr_mydb_mygroup_alpha and bdr_mydb_mygroup_beta

Group replication slot

The group slot is an internal slot used by PGD primarily to track the oldest safe position that any node in the PGD group (including all logical standbys) has caught up to, for any outbound replication from this node.

The group slot name is given by the function bdr.local_group_slot_name().

The group slot can:

  • Join new nodes to the PGD group without having all existing nodes up and running (although the majority of nodes should be up), without incurring data loss in case the node that was down during join starts replicating again.
  • Part nodes from the cluster consistently, even if some nodes haven't caught up fully with the parted node.
  • Hold back the freeze point to avoid missing some conflicts.
  • Keep the historical snapshot for timestamp-based snapshots.

The group slot is usually inactive and is fast forwarded only periodically in response to Raft progress messages from other nodes.

Warning

Don't drop the group slot. Although usually inactive, it's still vital to the proper operation of the EDB Postgres Distributed cluster. If you drop it, then some or all of the features can stop working or have incorrect outcomes.

Hashing long identifiers

The name of a replication slotlike any other PostgreSQL identifiercan't be longer than 63 bytes. PGD handles this by shortening the database name, the PGD group name, and the name of the node in case the resulting slot name is too long for that limit. Shortening an identifier is carried out by replacing the final section of the string with a hash of the string itself.

For example, consider a cluster that replicates a database named db20xxxxxxxxxxxxxxxx (20 bytes long) using a PGD group named group20xxxxxxxxxxxxx (20 bytes long). The logical replication slot associated to node a30xxxxxxxxxxxxxxxxxxxxxxxxxxx (30 bytes long) is called since 3597186, be9cbd0, and 7f304a2 are respectively the hashes of db20xxxxxxxxxxxxxxxx, group20xxxxxxxxxxxxx, and a30xxxxxxxxxxxxxxxxxxxxxxxxxx.

bdr_db20xxxx3597186_group20xbe9cbd0_a30xxxxxxxxxxxxx7f304a2

Removing a node from a PGD group

Since PGD is designed to recover from extended node outages, you must explicitly tell the system if you're removing a node permanently. If you permanently shut down a node and don't tell the other nodes, then performance suffers and eventually the whole system stops working.

Node removal, also called parting, is done using the bdr.part_node() function. You must specify the node name (as passed during node creation) to remove a node. You can call the bdr.part_node() function from any active node in the PGD group, including the node that you're removing.

Just like the join procedure, parting is done using Raft consensus and requires a majority of nodes to be online to work.

The parting process affects all nodes. The Raft leader manages a vote between nodes to see which node has the most recent data from the parting node. Then all remaining nodes make a secondary, temporary connection to the most-recent node to allow them to catch up any missing data.

A parted node still is known to PGD but won't consume resources. A node might be added again under the same name as a parted node. In rare cases, you might want to clear all metadata of a parted node by using the function bdr.drop_node().

Uninstalling PGD

Dropping the PGD extension removes all the PGD objects in a node, including metadata tables. You can do this with the following command:

DROP EXTENSION bdr;

If the database depends on some PGD-specific objects, then you can't drop the PGD extension. Examples include:

  • Tables using PGD-specific sequences such as SnowflakeId or galloc
  • Column using CRDT data types
  • Views that depend on some PGD catalog tables

Remove those dependencies before dropping the BDR extension. For example, drop the dependent objects, alter the column type to a non-PGD equivalent, or change the sequence type back to local.

Warning

You can drop the BDR extension only if the node was successfully parted from its PGD node group or if it's the last node in the group. Dropping PGD metadata breaks replication to and from the other nodes.

Warning

When dropping a local PGD node or the BDR extension in the local database, any preexisting session might still try to execute a PGD-specific workflow and therefore fail. You can solve the problem by disconnecting the session and then reconnecting the client or by restarting the instance.

There's also a bdr.drop_node() function. Use this function only in emergencies, such as if there's a problem with parting.

Listing PGD topology

Listing PGD groups

The following simple query lists all the PGD node groups of which the current node is a member. It currently returns only one row.

SELECT node_group_name
FROM bdr.local_node_summary;

You can display the configuration of each node group using a more complex query:

SELECT g.node_group_name
, ns.pub_repsets
, ns.sub_repsets
, g.node_group_default_repset     AS default_repset
, node_group_check_constraints    AS check_constraints
FROM bdr.local_node_summary ns
JOIN bdr.node_group g USING (node_group_name);

Listing nodes in a PGD group

You can extract the list of all nodes in a given node group (such as mygroup) from the bdr.node_summary view as shown in the following example:

SELECT node_name         AS name
, node_seq_id            AS ord
, peer_state_name        AS current_state
, peer_target_state_name AS target_state
, interface_connstr      AS dsn
FROM bdr.node_summary
WHERE node_group_name = 'mygroup';

The read-only state of a node, as shown in the current_state or in the target_state query columns, is indicated as STANDBY.

List of node states

  • NONE: Node state is unset when the worker starts, expected to be set quickly to the current known state.
  • CREATED: bdr.create_node() was executed, but the node isn't a member of any EDB Postgres Distributed cluster yet.
  • JOIN_START: bdr.join_node_group() begins to join the local node to an existing EDB Postgres Distributed cluster.
  • JOINING: The node join has started and is currently at the initial sync phase, creating the schema and data on the node.
  • CATCHUP: Initial sync phase is completed. Now the join is at the last step of retrieving and applying transactions that were performed on the upstream peer node since the join started.
  • STANDBY: Node join finished, but hasn't yet started to broadcast changes. All joins spend some time in this state, but if defined as a logical standby, the node continues in this state.
  • PROMOTE: Node was a logical standby and we just called bdr.promote_node to move the node state to ACTIVE. These two PROMOTEstates have to be coherent to the fact that only one node can be with a state higher than STANDBY but lower than ACTIVE.
  • PROMOTING: Promotion from logical standby to full PGD node is in progress.
  • ACTIVE: The node is a full PGD node and is currently ACTIVE. This is the most common node status.
  • PART_START: Node was ACTIVE or STANDBY and we just called bdr.part_node to remove the node from the EDB Postgres Distributed cluster.
  • PARTING: Node disconnects from other nodes and plays no further part in consensus or replication.
  • PART_CATCHUP: Nonparting nodes synchronize any missing data from the recently parted node.
  • PARTED: Node parting operation is now complete on all nodes.

Only one node at a time can be in either of the states PROMOTE or PROMOTING.

Node management interfaces

You can add and remove nodes dynamically using the SQL interfaces.

bdr.create_node

This function creates a node.

Synopsis

bdr.create_node(node_name text, local_dsn text)

Parameters

  • node_name Name of the new node. Only one node is allowed per database. Valid node names consist of lowercase letters, numbers, hyphens, and underscores.
  • local_dsn Connection string to the node.

Notes

This function creates a record for the local node with the associated public connection string. There can be only one local record, so once it's created, the function reports an error if run again.

This function is a transactional function. You can roll it back and the changes made by it are visible to the current transaction.

The function holds lock on the newly created bdr node until the end of the transaction.

bdr.drop_node

Drops a node.

Warning

This function isn't intended for regular use. Execute it only if instructed by Technical Support.

This function removes the metadata for a given node from the local database. The node can be either:

  • The local node, in which case all the node metadata is removed, including information about remote nodes.
  • A remote node, in which case only metadata for that specific node is removed.

Synopsis

bdr.drop_node(node_name text, cascade boolean DEFAULT false, force boolean DEFAULT false)

Parameters

  • node_name Name of an existing node.
  • cascade Deprecated, will be removed in the future.
  • force Circumvents all sanity checks and forces the removal of all metadata for the given PGD node despite a possible danger of causing inconsistencies. Only Technical Support uses a forced node drop in case of emergencies related to parting.

Notes

Before you run this, part the node using bdr.part_node().

This function removes metadata for a given node from the local database. The node can be the local node, in which case all the node metadata are removed, including information about remote nodes. Or it can be the remote node, in which case only metadata for that specific node is removed.

Note

PGD4 can have a maximum of 1024 node records (both ACTIVE and PARTED) at one time because each node has a unique sequence number assigned to it, for use by snowflakeid and timeshard sequences. PARTED nodes aren't automatically cleaned up. If this becomes a problem, you can use this function to remove those records.

bdr.create_node_group

This function creates a PGD group with the local node as the only member of the group.

Synopsis

bdr.create_node_group(node_group_name text,
                      parent_group_name text DEFAULT NULL,
                      join_node_group boolean DEFAULT true,
                      node_group_type text DEFAULT NULL)

Parameters

  • node_group_name Name of the new PGD group. As with the node name, valid group names must consist of only lowercase letters, numbers, and underscores.
  • parent_group_name The name of the parent group for the subgroup.
  • join_node_group This parameter helps a node to decide whether to join the group being created by it. The default value is true. This is used when a node is creating a shard group that it doesn't want to join. This can be false only if you specify parent_group_name.
  • node_group_type The valid values are NULL, subscriber-only, datanode, read coordinator, and write coordinator. subscriber-only type is used to create a group of nodes that receive changes only from the fully joined nodes in the cluster, but they never send replication changes to other nodes. See Subscriber-only groups for more details. Datanode implies that the group represents a shard, whereas the other values imply that the group represents respective coordinators. Except subscriber-only, the other values are reserved for future use. NULL implies a normal general-purpose node group is created.

Notes

This function passes a request to the local consensus worker that's running for the local node.

The function isn't transactional. The creation of the group is a background process, so once the function finishes, you can't roll back the changes. Also, the changes might not be immediately visible to the current transaction. You can call bdr.wait_for_join_completion to wait until they are.

The group creation doesn't hold any locks.

bdr.alter_node_group_config

This function changes the configuration parameters of an existing PGD group. Options with NULL value (default for all of them) aren't modified.

Synopsis

bdr.alter_node_group_config(node_group_name text,
                            insert_to_update boolean DEFAULT NULL,
                            update_to_insert boolean DEFAULT NULL,
                            ignore_redundant_updates boolean DEFAULT NULL,
                            check_full_tuple boolean DEFAULT NULL,
                            apply_delay interval DEFAULT NULL,
                            check_constraints boolean DEFAULT NULL,
                            num_writers int DEFAULT NULL,
							              enable_wal_decoder boolean DEFAULT NULL,
							              streaming_mode text DEFAULT NULL,
                            default_commit_scope text DEFAULT NULL)

Parameters

  • node_group_name Name of an existing PGD group. The local node must be part of the group.

  • insert_to_update Reserved for backward compatibility.

  • update_to_insert Reserved for backward compatibility.

      versions of PGD. Use `bdr.alter_node_set_conflict_resolver` instead.
  • ignore_redundant_updates Reserved for backward compatibility.

  • check_full_tuple Reserved for backward compatibility.

  • apply_delay Reserved for backward compatibility.

  • check_constraints Whether the apply process checks the constraints when writing replicated data. This option is deprecated and will be disabled or removed in future versions of PGD.

  • num_writers Number of parallel writers for subscription backing this node group. -1 means the default (as specified by the GUC bdr.writers_per_subscription) is used. Valid values are either -1 or a positive integer.

  • enable_wal_decoder Enables/disables the decoding worker process. You can't enable the decoding worker process if streaming_mode is already enabled.

  • streaming_mode Enables/disables streaming of large transactions. When set to off, streaming is disabled. When set to any other value, large transactions are decoded while they're still in progress, and the changes are sent to the downstream. If the value is set to file, then the incoming changes of streaming transactions are stored in a file and applied only after the transaction is committed on upstream. If the value is set to writer, then the incoming changes are directly sent to one of the writers, if available. If parallel apply is disabled or no writer is free to handle streaming transaction, then the changes are written to a file and applied after the transaction is committed. If the value is set to auto, PGD tries to intelligently pick between file and writer, depending on the transaction property and available resources. You can't enable streaming_mode if the WAL decoder is already enabled.

    For more details, see Transaction streaming.

  • default_commit_scope The commit scope to use by default, initially the local commit scope. This applies only to the top-level node group. You can use individual rules for different origin groups of the same commit scope. See Origin groups for more details.

Notes

This function passes a request to the group consensus mechanism to change the defaults. The changes made are replicated globally using the consensus mechanism.

The function isn't transactional. The request is processed in the background so you can't roll back the function call. Also, the changes might not be immediately visible to the current transaction.

This function doesn't hold any locks.

Warning

When you use this function to change the apply_delay value, the change doesn't apply to nodes that are already members of the group. This restriction has little consequence on production use because this value normally isn't used outside of testing.

bdr.join_node_group

This function joins the local node to an already existing PGD group.

Synopsis

bdr.join_node_group (
    join_target_dsn text,
    node_group_name text DEFAULT NULL,
    pause_in_standby boolean DEFAULT false,
    wait_for_completion boolean DEFAULT true,
    synchronize_structure text DEFAULT 'all'
)

Parameters

  • join_target_dsn Specifies the connection string to an existing (source) node in the PGD group you want to add the local node to.
  • node_group_name Optional name of the PGD group. Defaults to NULL, which tries to detect the group name from information present on the source node.
  • pause_in_standby Optionally tells the join process to join only as a logical standby node, which can be later promoted to a full member.
  • wait_for_completion Wait for the join process to complete before returning. Defaults to true.
  • synchronize_structure Set the kind of structure (schema) synchronization to do during the join. Valid options are all, which synchronizes the complete database structure, and none, which doesn't synchronize any structure. However, it still synchronizes data.

If wait_for_completion is specified as false, this is an asynchronous call that returns as soon as the joining procedure starts. You can see progress of the join in logs and the bdr.event_summary information view or by calling the bdr.wait_for_join_completion() function after bdr.join_node_group() returns.

Notes

This function passes a request to the group consensus mechanism by way of the node that the join_target_dsn connection string points to. The changes made are replicated globally by the consensus mechanism.

The function isn't transactional. The joining process happens in the background and you can't roll it back. The changes are visible only to the local transaction if wait_for_completion was set to true or by calling bdr.wait_for_join_completion later.

Node can be part of only a single group, so you can call this function only once on each node.

Node join doesn't hold any locks in the PGD group.

bdr.switch_node_group

This function switches the local node from its current subgroup to another subgroup within the same existing PGD node group.

Synopsis

bdr.switch_node_group (
    node_group_name text,
    wait_for_completion boolean DEFAULT true
)

Parameters

  • node_group_name Name of the PGD group or subgroup.
  • wait_for_completion Wait for the switch process to complete before returning. Defaults to true.

If wait_for_completion is specified as false, this is an asynchronous call that returns as soon as the switching procedure starts. You can see progress of the switch in logs and the bdr.event_summary information view or by calling the bdr.wait_for_join_completion() function after bdr.switch_node_group() returns.

Notes

This function passes a request to the group consensus mechanism. The changes made are replicated globally by the consensus mechanism.

The function isn't transactional. The switching process happens in the background and you can't roll it back. The changes are visible only to the local transaction if wait_for_completion was set to true or by calling bdr.wait_for_join_completion later.

The local node changes membership from its current subgroup to another subgroup within the same PGD node group without needing to part the cluster. The node's kind must match that of existing nodes within the target subgroup.

Node switching doesn't hold any locks in the PGD group.

Restrictions: Currently, the function only allows switching between a subgroup and its PGD node group. To effect a move between subgroups it is necessary to make two separate calls: 1) switch from subgroup to node group and, 2) switch from node group to other subgroup.

bdr.promote_node

This function promotes a local logical standby node to a full member of the PGD group.

Synopsis

bdr.promote_node(wait_for_completion boolean DEFAULT true)

Notes

This function passes a request to the group consensus mechanism to change the defaults. The changes made are replicated globally by the consensus mechanism.

The function isn't transactional. The promotion process happens in the background, and you can't roll it back. The changes are visible only to the local transaction if wait_for_completion was set to true or by calling bdr.wait_for_join_completion later.

The promotion process holds lock against other promotions. This lock doesn't block other bdr.promote_node calls but prevents the background process of promotion from moving forward on more than one node at a time.

bdr.wait_for_join_completion

This function waits for the join procedure of a local node to finish.

Synopsis

bdr.wait_for_join_completion(verbose_progress boolean DEFAULT false)

Parameters

  • verbose_progress Optionally prints information about individual steps taken during the join procedure.

Notes

This function waits until the checks state of the local node reaches the target state, which was set by bdr.create_node_group, bdr.join_node_group, or bdr.promote_node.

bdr.part_node

Removes (parts) the node from the PGD group but doesn't remove data from the node.

You can call the function from any active node in the PGD group, including the node that you're removing. However, once the node is parted, it can't part other nodes in the cluster.

Note

If you're parting the local node, you must set wait_for_completion to false. Otherwise, it reports an error.

Warning

This action is permanent. If you want to temporarily halt replication to a node, see bdr.alter_subscription_disable().

Synopsis

bdr.part_node (
    node_name text,
	wait_for_completion boolean DEFAULT true,
	force boolean DEFAULT false
)

Parameters

  • node_name Name of an existing node to part.
  • wait_for_completion If true, the function doesn't return until the node is fully parted from the cluster. Otherwise the function starts the parting procedure and returns immediately without waiting. Always set to false when executing on the local node or when using force.
  • force Forces removal of the node on the local node. This sets the node state locally if consensus can't be reached or if the node parting process is stuck.
Warning

Using force = true can leave the PGD group in an inconsistent state. Use it only to recover from failures in which you can't remove the node any other way.

Notes

This function passes a request to the group consensus mechanism to part the given node. The changes made are replicated globally by the consensus mechanism. The parting process happens in the background, and you can't roll it back. The changes made by the parting process are visible only to the local transaction if wait_for_completion was set to true.

With force set to true, on consensus failure, this function sets the state of the given node only on the local node. In such a case, the function is transactional (because the function changes the node state) and you can roll it back. If the function is called on a node that is already in process of parting with force set to true, it also marks the given node as parted locally and exits. This is useful only when the consensus can't be reached on the cluster (that is, the majority of the nodes are down) or if the parting process is stuck. But it's important to take into account that when the parting node that was receiving writes, the parting process can take a long time without being stuck. The other nodes need to resynchronize any missing data from the given node. The force parting completely skips this resynchronization and can leave the other nodes in an inconsistent state.

The parting process doesn't hold any locks.

bdr.alter_node_interface

This function changes the connection string (DSN) of a specified node.

Synopsis

bdr.alter_node_interface(node_name text, interface_dsn text)

Parameters

  • node_name Name of an existing node to alter.
  • interface_dsn New connection string for a node.

Notes

Run this function and make the changes only on the local node. This means that you normally execute it on every node in the PGD group, including the node that is being changed.

This function is transactional. You can roll it back, and the changes are visible to the current transaction.

The function holds lock on the local node.

bdr.alter_subscription_enable

This function enables either the specified subscription or all the subscriptions of the local PGD node. This is also known as resume subscription. No error is thrown if the subscription is already enabled. Returns the number of subscriptions affected by this operation.

Synopsis

bdr.alter_subscription_enable(
    subscription_name name DEFAULT NULL,
    immediate boolean DEFAULT false
)

Parameters

  • subscription_name Name of the subscription to enable. If NULL (the default), all subscriptions on the local node are enabled.
  • immediate This currently has no effect.

Notes

This function isn't replicated and affects only local node subscriptions (either a specific node or all nodes).

This function is transactional. You can roll it back, and the current transaction can see any catalog changes. The subscription workers are started by a background process after the transaction has committed.

bdr.alter_subscription_disable

This function disables either the specified subscription or all the subscriptions of the local PGD node. Optionally, it can also immediately stop all the workers associated with the disabled subscriptions. This is also known as pause subscription. No error is thrown if the subscription is already disabled. Returns the number of subscriptions affected by this operation.

Synopsis

bdr.alter_subscription_disable(
    subscription_name name DEFAULT NULL,
    immediate boolean DEFAULT false,
    fast boolean DEFAULT true
)

Parameters

  • subscription_name Name of the subscription to disable. If NULL (the default), all subscriptions on the local node are disabled.
  • immediate Used to force the action immediately, stopping all the workers associated with the disabled subscription. When this option is true, you can't run this function inside of the transaction block.
  • fast This argument influences the behavior of immediate. If set to true (the default) it stops all the workers associated with the disabled subscription without waiting for them to finish current work.

Notes

This function isn't replicated and affects only local node subscriptions (either a specific subscription or all subscriptions).

This function is transactional. You can roll it back, and the current transaction can see any catalog changes. However, the timing of the subscription worker stopping depends on the value of immediate. If set to true, the workers receive the stop without waiting for the COMMIT. If the fast argument is set to true, the interruption of the workers doesn't wait for current work to finish.

Node-management commands

PGD also provides a command-line utility for adding nodes to the PGD group using physical copy (pg_basebackup) of an existing node and for converting a physical standby of an existing node to a new node in the PGD group.

bdr_init_physical

This is a regular command that's added to PostgreSQL's bin directory.

You must specify a data directory. If this data directory is empty, use the pg_basebackup -X stream to fill the directory using a fast block-level copy operation.

If the specified data directory isn't empty, this is used as the base for the new node. If the data directory is already active as a physical standby node, you need to stop the standby before running bdr_init_physical, which manages Postgres. Initially it waits for catchup and then promotes to a master node before joining the PGD group. The --standby option, if used, turns the existing physical standby into a logical standby node. It refers to the end state of the new PGD node, not the starting state of the specified data directory.

This command drops all PostgreSQL-native logical replication subscriptions from the database (or disables them when the -S option is used) as well as any replication origins and slots.

Synopsis

bdr_init_physical [OPTION] ...

Options

General options
  • -D, --pgdata=DIRECTORY The data directory to use for the new node. It can be either an empty or nonexistent directory or a directory populated using the pg_basebackup -X stream command (required).
  • -l, --log-file=FILE Use FILE for logging. The default is bdr_init_physical_postgres.log.
  • -n, --node-name=NAME The name of the newly created node (required).
  • --replication-sets=SETS The name of a comma-separated list of replication set names to use. All replication sets are used if not specified.
  • --standby Create a logical standby (receive-only node) rather than full send/receive node.
  • --node-group-name Group to join. Defaults to the same group as source node.
  • -s, --stop Stop the server once the initialization is done.
  • -v Increase logging verbosity.
  • -L Perform selective pg_basebackup when used with an empty/nonexistent data directory (-D option). This is a feature of EDB Postgres Extended Server only.
  • -S Instead of dropping logical replication subscriptions, disable them.
Connection options
  • -d, --remote-dsn=CONNSTR Connection string for remote node (required).
  • --local-dsn=CONNSTR Connection string for local node (required).
Configuration files override
  • --hba-conf Path to the new pg_hba.conf.
  • --postgresql-conf Path to the new postgresql.conf.
  • --postgresql-auto-conf Path to the new postgresql.auto.conf.

Notes

The replication set names specified in the command don't affect the data that exists in the data directory before the node joins the PGD group. This is true whether bdr_init_physical makes its own base backup or an existing base backup is being promoted to a new PGD node. Thus the --replication-sets option affects only the data published and subscribed to after the node joins the PGD node group. This behavior is different from the way replication sets are used in a logical join, as when using bdr.join_node_group().

The operator can truncate unwanted tables after the join completes. Refer to the bdr.tables catalog to determine replication set membership and identify tables that aren't members of any subscribed-to replication set. We strongly recommend that you truncate the tables rather than drop them, because:

  • DDL replication sets aren't necessarily the same as row (DML) replication sets, so you might inadvertently drop the table on other nodes.
  • If you later want to add the table to a replication set and you dropped it on some subset of nodes, you need to re-create it only on those nodes without creating DDL conflicts before you can add it to any replication sets.

It's simpler and safer to truncate your nonreplicated tables, leaving them present but empty.

A future version of PGD might automatically omit or remove tables that aren't part of the selected replication sets for a physical join, so your application should not rely on details of the behavior documented here.