Column-level conflict detection v5

By default, conflicts are resolved at row level. When changes from two nodes conflict, either the local or remote tuple is selected and the other is discarded. For example, commit timestamps for the two conflicting changes might be compared and the newer one kept. This approach ensures that all nodes converge to the same result and establishes commit-order-like semantics on the whole cluster.

However, in some cases it might be appropriate to resolve conflicts at the column level rather than the row level.

Consider a simple example, in which table t has two integer columns a and b and a single row (1,1). On one node execute:

UPDATE t SET a = 100

On another node, before receiving the preceding UPDATE, concurrently execute:

UPDATE t SET b = 100

This sequence results in an UPDATE-UPDATE conflict. With the update_if_newer conflict resolution, the commit timestamps are compared, and the new row version is kept. Assuming the second node committed last, the result is (1,100), which effectively discards the change to column a.

For many use cases, this behavior is the desired and expected. However, for some use cases, this might be an issue. Consider, for example, a multi-node cluster where each part of the application is connected to a different node, updating a dedicated subset of columns in a shared table. In that case, the different components might conflict and overwrite changes.

For such use cases, it might be more appropriate to resolve conflicts on a given table at the column level. To achieve that, PGD tracks the timestamp of the last change for each column separately and uses that to pick the most recent value, essentially performing update_if_newer.

Applied to the previous example, the result is (100,100) on both nodes, despite neither of the nodes ever seeing such a row.

When thinking about column-level conflict resolution, it can be useful to see tables as vertically partitioned, so that each update affects data in only one slice. This approach eliminates conflicts between changes to different subsets of columns. In fact, vertical partitioning can even be a practical alternative to column-level conflict resolution.

Column-level conflict resolution requires the table to have REPLICA IDENTITY FULL. The bdr.alter_table_conflict_detection function checks that and fails with an error if this setting is missing.

Enabling and disabling column-level conflict resolution

The bdr.alter_table_conflict_detection() function manages column-level conflict resolution.

Example

This example creates a table test_table and then enables column-level conflict resolution on it:

db=# CREATE TABLE my_app.test_table (id SERIAL PRIMARY KEY, val INT);
CREATE TABLE

db=# ALTER TABLE my_app.test_table REPLICA IDENTITY FULL;
ALTER TABLE

db=# SELECT bdr.alter_table_conflict_detection(
db(# 'my_app.test_table'::regclass, 
db(# 'column_modify_timestamp', 'cts');
 alter_table_conflict_detection
--------------------------------
 t

db=# \d my_app.test_table

The function adds a cts column as specified in the function call. It also creates two triggers (BEFORE INSERT and BEFORE UPDATE) that are responsible for maintaining timestamps in the new column before each change.

The new column specifies NOT NULL with a default value, which means that ALTER TABLE ... ADD COLUMN doesn't perform a table rewrite.

Avoid using columns with the bdr.column_timestamps data type for other purposes, as doing so can have negative effects. For example, it switches the table to column-level conflict resolution, which doesn't work correctly without the triggers.

Listing table with column-level conflict resolution

You can list tables having column-level conflict resolution enabled with the following query. This query detects the presence of a column of type bdr.column_timestamp.

SELECT nc.nspname, c.relname
FROM pg_attribute a
JOIN (pg_class c JOIN pg_namespace nc ON c.relnamespace = nc.oid)
  ON a.attrelid = c.oid
JOIN (pg_type t  JOIN pg_namespace nt ON t.typnamespace = nt.oid)
  ON a.atttypid = t.oid
WHERE NOT pg_is_other_temp_schema(nc.oid)
  AND nt.nspname = 'bdr'
  AND t.typname = 'column_timestamps'
  AND NOT a.attisdropped
  AND c.relkind IN ('r', 'v', 'f', 'p');

bdr.column_timestamps_create

This function creates column-level conflict resolution. It's called within column_timestamp_enable.

Synopsis

bdr.column_timestamps_create(p_source cstring, p_timestamp timestampstz)

Parameters

  • p_source The two options are current or commit.
  • p_timestamp Timestamp depends on the source chosen. If commit, then TIMESTAMP_SOURCE_COMMIT. If current, then TIMESTAMP_SOURCE_CURRENT.

DDL locking

When enabling or disabling column timestamps on a table, the code uses DDL locking to ensure that there are no pending changes from before the switch. This approach ensures only conflicts with timestamps in both tuples or in neither of them are seen. Otherwise, the code might unexpectedly see timestamps in the local tuple and NULL in the remote one. It also ensures that the changes are resolved the same way (column-level or row-level) on all nodes.

Current versus commit timestamp

An important decision is the timestamp to assign to modified columns.

By default, the timestamp assigned to modified columns is the current timestamp, as if obtained from clock_timestamp. This is simple, and for many cases it is correct (for example, when the conflicting rows modify non-overlapping subsets of columns).

It can, however, have various unexpected effects:

  • The timestamp changes during statement execution. So, if an UPDATE affects multiple rows, each gets a slightly different timestamp. This means that the effects of concurrent changes might get "mixed" in various ways, depending on how the changes performed on different nodes interleave.

  • The timestamp is unrelated to the commit timestamp. Using it to resolve conflicts means that the result isn't equivalent to the commit order, which means it likely can't be serialized.

Note

Statement and transaction timestamps might be added in the future, which will address issues with mixing effects of concurrent statements or transactions. Still, neither of these options can ever produce results equivalent to commit order.

You can also use the actual commit timestamp, although this feature is considered experimental. To use the commit timestamp, set the last parameter to true when enabling column-level conflict resolution:

SELECT bdr.column_timestamps_enable('test_table'::regclass, 'cts', true);

You can disable it using bdr.column_timestamps_disable.

Commit timestamps currently have restrictions that are explained in Notes.

Inspecting column timestamps

The column storing timestamps for modified columns is maintained by triggers. Don't modify it directly. It can be useful to inspect the current timestamps value, for example, while investigating how a conflict was resolved.

Three functions are useful for this purpose:

  • bdr.column_timestamps_to_text(bdr.column_timestamps)

    This function returns a human-readable representation of the timestamp mapping and is used when casting the value to text:

db=# select cts::text from test_table;
                                                 cts
-----------------------------------------------------------------------------------------------------
 {source: current, default: 2018-09-23 19:24:52.118583+02, map: [2 : 2018-09-23 19:25:02.590677+02]}
(1 row)
  • bdr.column_timestamps_to_jsonb(bdr.column_timestamps)

    This function turns a JSONB representation of the timestamps mapping and is used when casting the value to jsonb:

db=# select jsonb_pretty(cts::jsonb) from test_table;
                   jsonb_pretty
---------------------------------------------------
 {                                                +
     "map": {                                     +
         "2": "2018-09-23T19:24:52.118583+02:00"  +
     },                                           +
     "source": "current",                         +
     "default": "2018-09-23T19:24:52.118583+02:00"+
 }
(1 row)
  • bdr.column_timestamps_resolve(bdr.column_timestamps, xid)

    This function updates the mapping with the commit timestamp for the attributes modified by the most recent transaction if it already committed. This matters only when using the commit timestamp. For example, in this case, the last transaction updated the second attribute (with attnum = 2):

test=# select cts::jsonb from test_table;
                                                                  cts
----------------------------------------------------------------------------------------------------------------------------------------
 {"map": {"2": "2018-09-23T19:29:55.581823+02:00"}, "source": "commit", "default": "2018-09-23T19:29:55.581823+02:00", "modified": [2]}
(1 row)

db=# select bdr.column_timestamps_resolve(cts, xmin)::jsonb from test_table;
                                               column_timestamps_resolve
-----------------------------------------------------------------------------------------------------------------------
 {"map": {"2": "2018-09-23T19:29:55.581823+02:00"}, "source": "commit", "default": "2018-09-23T19:29:55.581823+02:00"}
(1 row)

Handling column conflicts using CRDT data types

By default, column-level conflict resolution picks the value with a higher timestamp and discards the other one. You can, however, reconcile the conflict in different, more elaborate ways. For example, you can use CRDT types that allow merging the conflicting values without discarding any information.

Notes

  • The attributes modified by an UPDATE are determined by comparing the old and new row in a trigger. This means that if the attribute doesn't change a value, it isn't detected as modified even if it's explicitly set. For example, UPDATE t SET a = a doesn't mark a as modified for any row. Similarly, UPDATE t SET a = 1 doesn't mark a as modified for rows that are already set to 1.

  • For INSERT statements, there's no old row to compare the new one to, so all attributes are considered to be modified, and they are assigned a new timestamp. This condition applies even for columns that weren't included in the INSERT statement and received default values. PGD can detect the attributes that have a default value but can't know if it was included automatically or specified explicitly.

    This situation effectively means column-level conflict resolution doesn't work for INSERT-INSERT conflicts even if the INSERT statements specify different subsets of columns. The newer row has timestamps that are all newer than the older row.

  • By treating the columns independently, it's easy to violate constraints in a way that isn't possible when all changes happen on the same node. Consider, for example, a table like this:

CREATE TABLE t (id INT PRIMARY KEY, a INT, b INT, CHECK (a > b));
INSERT INTO t VALUES (1, 1000, 1);
Assume one node does:
UPDATE t SET a = 100;
Another node concurrently does:
UPDATE t SET b = 500;
Each of those updates is valid when executed on the initial row and so passes on each node. But when replicating to the other node, the resulting row violates the `CHECK (A > b)` constraint, and the replication stops until the issue is resolved manually.
  • The column storing timestamp mapping is managed automatically. Don't specify or override the value in your queries, as the results can be unpredictable. (The value is ignored where possible.)

  • The timestamp mapping is maintained by triggers, but the order in which triggers execute matters. So if you have custom triggers that modify tuples and are executed after the pgl_clcd_ triggers, the modified columns aren't detected correctly.

  • When using regular timestamps to order changes or commits, the conflicting changes might have exactly the same timestamp because two or more nodes happened to generate the same timestamp. This risk isn't unique to column-level conflict resolution, as it can happen even for regular row-level conflict resolution. The node id is the tie breaker in this situation. The higher node id wins. This approach ensures that the same changes are applied on all nodes.

  • A clock skew can occur between different nodes. It can induce somewhat unexpected behavior, discarding seemingly newer changes because the timestamps are inverted. However, you can manage clock skew between nodes using the parameters bdr.maximum_clock_skew and bdr.maximum_clock_skew_action.

SELECT bdr.alter_node_group_config('group', ignore_redundant_updates := false);