Architectural Overview v3.7
BDR provides loosely coupled multi-master logical replication using a mesh topology. This means that you can write to any server and the changes are sent directly, row by row, to all the other servers that are part of the same BDR group.
By default, BDR uses asynchronous replication, applying changes on the peer nodes only after the local commit. An optional Eager All-Node Replication feature allows committing on all nodes using consensus.
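As a minimal sketch of the eager option, the transaction below requests all-node commit, assuming the bdr.commit_scope setting selects Eager All-Node Replication as in BDR 3.7; the payments table is a hypothetical example.

```sql
-- A minimal sketch, assuming bdr.commit_scope = 'global' selects Eager
-- All-Node Replication for this transaction (the default scope is local).
-- The payments table is a hypothetical example.
BEGIN;
SET LOCAL bdr.commit_scope = 'global';
INSERT INTO payments (id, amount) VALUES (42, 100.00);
COMMIT;  -- returns only after consensus is reached across the BDR group
```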
Basic Architecture
Multiple Groups
A BDR node is a member of at least one Node Group, and in the most basic architecture there is a single node group for the whole BDR cluster.
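As a hedged sketch of this most basic case, the statements below register a node and create a single node group on the first server, then join a second node to that group; the node names, group name, and DSNs are examples only.

```sql
-- On the first node: register the local node and create the (only) group.
SELECT bdr.create_node(node_name := 'node1',
                       local_dsn := 'host=node1 dbname=bdrdb');
SELECT bdr.create_node_group(node_group_name := 'bdrgroup');

-- On the second node: register itself, then join the existing group.
SELECT bdr.create_node(node_name := 'node2',
                       local_dsn := 'host=node2 dbname=bdrdb');
SELECT bdr.join_node_group(join_target_dsn := 'host=node1 dbname=bdrdb',
                           node_group_name := 'bdrgroup');
```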
Multiple Masters
Each node (database) participating in a BDR group both receives changes from other members and can be written to directly by the user.
This is distinct from Hot or Warm Standby, where only one master server accepts writes, and all the other nodes are standbys that replicate either from the master or from another standby.
You don't have to write to all the masters all of the time; a frequent configuration directs writes mostly to just one master. However, if you only need one-way replication, pglogical may be more appropriate.
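For that one-way case, the following is a hedged sketch using the pglogical 2 SQL interface; the node names, DSNs, and schema list are illustrative only.

```sql
-- Provider side: register the node and publish all tables in schema public.
SELECT pglogical.create_node(node_name := 'provider1',
                             dsn := 'host=provider1 dbname=appdb');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- Subscriber side: register the node and subscribe to the provider.
SELECT pglogical.create_node(node_name := 'subscriber1',
                             dsn := 'host=subscriber1 dbname=appdb');
SELECT pglogical.create_subscription(subscription_name := 'sub1',
                                     provider_dsn := 'host=provider1 dbname=appdb');
```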
Asynchronous, by default
Changes made on one BDR node are not replicated to other nodes until they are committed locally. As a result, the data is not exactly the same on all nodes at any given time; some nodes will have data that has not yet arrived at other nodes. PostgreSQL's block-based replication solutions default to asynchronous replication as well. In BDR, because there are multiple masters and as a result multiple data streams, data on different nodes might differ even when synchronous_commit and synchronous_standby_names are used.
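To make that concrete, the standard PostgreSQL settings below add synchronous commit towards one peer, yet the other masters still apply their own streams asynchronously, so node contents can differ at any instant; the standby name is a placeholder.

```sql
-- Standard PostgreSQL settings; 'bdr_peer_1' is a placeholder for a real
-- walsender application_name in your cluster.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
ALTER SYSTEM SET synchronous_standby_names = 'bdr_peer_1';
SELECT pg_reload_conf();
```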
Mesh Topology
BDR is structured around a mesh network where every node connects to every other node and all nodes exchange data directly with each other. There is no forwarding of data within BDR except in special circumstances such as node addition and node removal. Data may arrive from outside the EDB Postgres Distributed cluster or be sent onwards using pglogical or native PostgreSQL logical replication.
Logical Replication
Logical replication is a method of replicating data rows and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. Index changes are not replicated, thereby avoiding write amplification and reducing bandwidth.
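The replication identity is standard PostgreSQL; for example, a table without a usable primary key can be switched to a unique index or, as a last resort, to full-row identity (the table and index names below are examples).

```sql
-- Use a unique index as the replication identity for UPDATEs and DELETEs.
ALTER TABLE measurements REPLICA IDENTITY USING INDEX measurements_city_ts_idx;

-- Or, as a last resort, log the entire old row (more WAL, more bandwidth).
ALTER TABLE audit_log REPLICA IDENTITY FULL;
```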
Logical replication starts by copying a snapshot of the data from the source node. Once that is done, later commits are sent to other nodes as they occur in real time. Changes are replicated without re-executing SQL, so the exact data written is replicated quickly and accurately.
Nodes apply data in the order in which commits were made on the source node, so transactional consistency is guaranteed for the changes from any single node. Changes from different nodes are applied independently of one another, to ensure the rapid replication of changes.
Replicated data is sent in binary form, when it is safe to do so.
High Availability
Each master node can be protected by one or more standby nodes, so any node that goes down can be quickly replaced and service can continue. Each standby node can be either a logical or a physical standby node.
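A physical standby is built with the usual PostgreSQL tooling (for example pg_basebackup plus streaming replication). For a logical standby, the following is a hedged sketch that assumes the pause_in_standby option of bdr.join_node_group behaves as in other BDR3 releases; the names and DSNs are examples only.

```sql
-- Join as a logical standby rather than a full write node.
SELECT bdr.create_node(node_name := 'node1_standby',
                       local_dsn := 'host=node1-standby dbname=bdrdb');
SELECT bdr.join_node_group(join_target_dsn := 'host=node1 dbname=bdrdb',
                           node_group_name := 'bdrgroup',
                           pause_in_standby := true);
```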
Replication continues between currently connected nodes even if one or more nodes are currently unavailable. When the node recovers, replication can restart from where it left off without missing any changes.
Nodes can run different release levels, negotiating the required protocols to communicate. As a result, EDB Postgres Distributed clusters can use rolling upgrades, even for major versions of database software.
DDL is automatically replicated across nodes by default. DDL execution can be user controlled to allow rolling application upgrades, if desired.
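As a hedged sketch of user-controlled DDL during a rolling application upgrade, assuming the bdr.ddl_replication setting behaves as in BDR3 (it is on by default); the table and column are examples only.

```sql
BEGIN;
SET LOCAL bdr.ddl_replication = off;   -- apply this DDL on the local node only
ALTER TABLE app_orders ADD COLUMN notes text;
COMMIT;
-- The same statement is then run on each remaining node as it is upgraded.
```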
Limits
BDR can run hundreds of nodes given sufficiently capable hardware and network; however, for mesh-based deployments it is generally not recommended to run more than 32 nodes in one cluster. Each master node can be protected by multiple physical or logical standby nodes; there is no specific limit on the number of standby nodes, but typical usage would be to have 2-3 standbys per master. Standby nodes don't add additional connections to the mesh network, so they are not included in the 32-node recommendation.
BDR currently has a hard limit of no more than 1000 active nodes, as this is the maximum number of Raft connections allowed.
BDR places a limit of at most 10 databases in any one PostgreSQL instance acting as BDR nodes, across different BDR node groups. However, BDR works best if only one BDR database is used per PostgreSQL instance.
The minimum recommended number of nodes in an EDB Postgres Distributed cluster is 3, because with 2 nodes consensus stops working if one of the nodes stops working.
Architectural Options & Performance
Characterising BDR performance
BDR can be configured in a number of different architectures, each of which has different performance and scalability characteristics.
The Group is the basic building block of a BDR cluster, consisting of 2+ nodes (servers). Within a Group, each node is in a different AZ, with a dedicated router and backup, giving Immediate Switchover and High Availability. Each Group has a dedicated Replication Set defined on it. If the Group loses a node, it is easily repaired or replaced by copying an existing node from the Group.
Adding more master nodes to a BDR Group does not result in a significant increase in write throughput when most tables are replicated, because BDR has to replay all the writes on all nodes. Because applying BDR changes is in general more efficient than executing the original writes arriving from Postgres clients via SQL, some performance increase can be achieved. Read throughput generally scales linearly with the number of nodes.
The following architectures are available:
- Multimaster/Single Group
- BDR AlwaysOn
The simplest architecture is just to have one Group, so let's examine that first:
BDR MultiMaster within one Group
By default, BDR will keep one copy of each table on each node in the Group and any changes will be propagated to all nodes in the Group.
Since copies of the data are everywhere, SELECTs need only ever access the local node. On a read-only cluster, performance on any one node is not affected by the number of nodes. Thus, adding nodes increases the total achievable SELECT throughput linearly.
INSERTs, UPDATEs and DELETEs (DML) are performed locally, then the changes will be propagated to all nodes in the Group. The overhead of DML apply is less than the original execution, so if you run a pure write workload on multiple nodes concurrently, a multi-node cluster will be able to handle more TPS than a single node.
Conflict handling has a cost that reduces throughput. The throughput is then dependent upon how much contention the application exhibits in practice. Applications with very low contention will perform better than a single node; applications with high contention could perform worse than a single node. These results are consistent with any multi-master technology; they are not a facet or peculiarity of BDR.
Eager replication can avoid conflicts, but is inherently more expensive.
Changes are sent concurrently to all nodes so that the replication lag is minimised. Adding more nodes means using more CPU for replication, so peak TPS will reduce slightly as each new node is added.
If the workload tries to use all CPU resources, then replication will be resource-constrained, which could in turn affect the replication lag.
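Replication lag per peer can be watched with standard PostgreSQL monitoring, since BDR's logical walsenders appear in pg_stat_replication:

```sql
-- Standard PostgreSQL view; each connected peer shows up as one walsender row.
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```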
BDR AlwaysOn
The AlwaysOn architecture is built from 2 Groups in 2 separate regions. Each Group provides High Availability and Immediate Switchover, and together they also provide Disaster Recovery (DR), so we refer to this architecture as AlwaysOn with Very High Availability.
Tables are created across both Groups, so any change goes to all nodes, not just to nodes in the local Group.
One node is the target for the main application. All other nodes are described as shadow nodes (or "read-write replicas"), waiting to take over when needed. If a node loses contact, we switch immediately to a shadow node to continue processing. If a Group fails, we can switch to the other Group. Scalability is not the goal of this architecture.
Since we write mainly to only one node, the possibility of contention between nodes is reduced to almost zero, and as a result the performance impact is much reduced.
CAMO (Commit At Most Once) is eager replication within the local Group, and lazy with regard to other Groups.
Secondary applications may execute against the shadow nodes, though these should be reduced or interrupted if the main application begins using that node.
Future feature: One node is elected as main replicator to other Groups, limiting CPU overhead of replication as the cluster grows and minimizing the bandwidth to other Groups.
Deployment
BDR3 is intended to be deployed in one of a small number of known-good configurations, using either TPAexec or a configuration management approach and deployment architecture approved by Technical Support.
Manual deployment is not recommended and may not be supported.
Please refer to the TPAexec Architecture User Manual for your architecture.
Log messages and documentation are currently available only in English.
Clocks and Timezones
BDR has been designed to operate with nodes in multiple timezones, allowing a truly worldwide database cluster. Individual servers do not need to be configured with matching timezones, though we do recommend using log_timezone = UTC to ensure the human-readable server log is more accessible and comparable.
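For example, using standard PostgreSQL configuration:

```sql
-- Keep server logs in UTC so logs from nodes in different timezones compare directly.
ALTER SYSTEM SET log_timezone = 'UTC';
SELECT pg_reload_conf();
```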
Server clocks should be synchronized using NTP or other solutions.
Clock synchronization is not critical to performance, as it is with some other solutions. Clock skew can impact Origin Conflict Detection, though BDR provides controls to report and manage any skew that exists. BDR also provides Row Version Conflict Detection, as described in Conflict Detection.
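As a hedged sketch of those skew controls, assuming the bdr.maximum_clock_skew and bdr.maximum_clock_skew_action settings behave as in other BDR3 releases, and that the threshold accepts a duration value like the one shown:

```sql
-- Report (WARN) when inter-node clock skew exceeds the configured limit.
ALTER SYSTEM SET bdr.maximum_clock_skew = '2s';
ALTER SYSTEM SET bdr.maximum_clock_skew_action = 'WARN';
SELECT pg_reload_conf();
```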