Sequences v3.7
Many applications require that unique surrogate ids be assigned to database entries.
Often the database SEQUENCE
object is used to produce these. In
PostgreSQL these can be either a manually created sequence using the
CREATE SEQUENCE
command and retrieved by calling nextval()
function,
or serial
and bigserial
columns or alternatively
GENERATED BY DEFAULT AS IDENTITY
columns.
However, standard sequences in PostgreSQL are not multi-node aware, and only
produce values that are unique on the local node. This is important because
unique ids generated by such sequences will cause conflict and data loss (by
means of discarded INSERTs
) in multi-master replication.
BDR Global Sequences
For this reason, BDR provides an application-transparent way to generate unique ids using sequences on bigint or bigserial datatypes across the whole BDR group, called global sequences.
BDR global sequences provide an easy way for applications to use the database to generate unique synthetic keys in an asynchronous distributed system that works for most - but not necessarily all - cases.
Using BDR global sequences allows you to avoid the problems with insert
conflicts. If you define a PRIMARY KEY
or UNIQUE
constraint on a column
which is using a global sequence, it is not possible for any node to ever get
the same value as any other node. When BDR synchronizes inserts between the
nodes, they can never conflict.
BDR global sequences extend PostgreSQL sequences, so are crash-safe. To use
them, you must have been granted the bdr_application
role.
There are various possible algorithms for global sequences:
- Timeshard sequences
- Globally-allocated range sequences
Timeshard sequences generate values using an algorithm that does not require inter-node communication at any point, so is faster and more robust, as well as having the useful property of recording the timestamp at which they were created. Timeshard sequences have the restriction that they work only for 64-bit BIGINT datatypes and produce values 19 digits long, which may be too long for use in some host language datatypes such as Javascript Integer types. Globally-allocated sequences allocate a local range of values which can be replenished as-needed by inter-node consensus, making them suitable for either BIGINT or INTEGER sequences.
A global sequence can be created using the bdr.alter_sequence_set_kind()
function. This function takes a standard PostgreSQL sequence and marks it as
a BDR global sequence. It can also convert the sequence back to the standard
PostgreSQL sequence (see below).
BDR also provides the configuration variable bdr.default_sequence_kind
, which
determines what kind of sequence will be created when the CREATE SEQUENCE
command is executed or when a serial
, bigserial
or
GENERATED BY DEFAULT AS IDENTITY
column is created. Valid settings are:
local
(the default) meaning that newly created sequences are the standard PostgreSQL (local) sequences.galloc
which always creates globally-allocated range sequences.timeshard
which creates time-sharded global sequences for BIGINT sequences, but will throw ERRORs when used with INTEGER sequences.
The bdr.sequences
view shows information about individual sequence kinds.
currval()
and lastval()
work correctly for all types of global sequence.
Timeshard Sequences
The ids generated by timeshard sequences are loosely time-ordered so they can be used to get the approximate order of data insertion, like standard PostgreSQL sequences. Values generated within the same millisecond might be out of order, even on one node. The property of loose time-ordering means they are suitable for use as range partition keys.
Timeshard sequences work on one or more nodes, and do not require any inter-node communication after the node join process completes. So they may continue to be used even if there's the risk of extended network partitions, and are not affected by replication lag or inter-node latency.
Timeshard sequences generate unique ids in a different way to standard sequences. The algorithm uses 3 components for a sequence number. The first component of the sequence is a timestamp at the time of sequence number generation. The second component of the sequence number is the unique id assigned to each BDR node, which ensures that the ids from different nodes will always be different. Finally, the third component is the number generated by the local sequence itself.
While adding a unique node id to the sequence number would be enough to ensure there are no conflicts, we also want to keep another useful property of sequences, which is that the ordering of the sequence numbers roughly corresponds to the order in which data was inserted into the table. Putting the timestamp first ensures this.
A few limitations and caveats apply to timeshard sequences.
Timeshard sequences are 64-bits wide and need a bigint
or bigserial
.
Values generated will be at least 19 digits long.
There is no practical 32-bit integer
version, so cannot be used with serial
sequences - use globally-allocated range sequences instead.
There is a limit of 8192 sequence values generated per millisecond on any
given node for any given sequence. If more than 8192 sequences per
millisecond are generated from one sequence on one node, the generated
values will wrap around and could collide. There is no check on that for
performance reasons; the value is not reset to 0 at the start of each ms.
Collision will usually result in a
UNIQUE
constraint violation on INSERT
or UPDATE
. It cannot cause a
replication conflict, because sequence values generated on different nodes
cannot ever collide since they contain the nodeid.
In practice this is harmless; values are not generated fast enough
to trigger this limitation as there will be other
work being done, rows inserted, indexes updated, etc. Despite that,
applications should have a UNIQUE
constraint in place where they
absolutely rely on a lack of collisions.
Perhaps more importantly, the timestamp component will run out of values in the year 2050, and if used in combination with bigint, the values will wrap to negative numbers in the year 2033. This means that sequences generated after 2033 will have negative values. If you plan to deploy your application beyond this date, try one of [UUIDs, KSUUIDs and Other Approaches] mentioned below, or use globally-allocated range sequences instead.
The INCREMENT
option on a sequence used as input for timeshard sequences is
effectively ignored. This could be relevant for applications that do sequence
ID caching, like many object-relational mapper (ORM) tools, notably Hibernate.
Because the sequence is time-based, this has little practical effect since the
sequence will have advanced to a new non-colliding value by the time the
application can do anything with the cached values.
Similarly, the START
, MINVALUE
, MAXVALUE
and CACHE
settings may
be changed on the underlying sequence, but there is no benefit to doing
so. The sequence's low 14 bits are used and the rest is discarded, so
the value range limits do not affect the function's result. For the same
reason, setval()
is not useful for timeshard sequences.
Globally-allocated range Sequences
The globally-allocated range (or galloc
) sequences allocate ranges (chunks)
of values to each node. When the local range is used up, a new range is
allocated globally by consensus amongst the other nodes. This uses the key
space efficiently, but requires that the local node be connected to a majority
of the nodes in the cluster for the sequence generator to progress, when the
currently assigned local range has been used up.
Unlike timeshard sequences, galloc sequences support all sequence data types provided by PostgreSQL - smallint, integer and bigint. This means that galloc sequences can be used in environments where 64-bit sequences are problematic, such as using integers in javascript, since that supports only 53-bit values, or when the sequence is displayed on output with limited space.
The range assigned by each voting is currently predetermined based on the datatype the sequence is using:
- smallint - 1 000 numbers
- integer - 1 000 000 numbers
- bigint - 1 000 000 000 numbers
Each node will allocate two chunks of seq_chunk_size, one for the current use plus a reserved chunk for future usage, so the values generated from any one node will increase monotonically. However, viewed globally, the values generated will not be ordered at all. This could cause a loss of performance due to the effects on b-tree indexes, and will typically mean that generated values will not be useful as range partition keys.
The main downside of the galloc sequences is that once the assigned range is used up, the sequence generator has to ask for consensus about the next range for the local node that requires inter-node communication, which could lead to delays or operational issues if the majority of the BDR group is not accessible. This may be avoided in later releases.
The CACHE
, START
, MINVALUE
and MAXVALUE
options work correctly
with galloc sequences, however you need to set them before transforming
the sequence to galloc kind. The INCREMENT BY
option also works
correctly, however, you cannot assign an increment value which is equal
to or more than the above ranges assigned for each sequence datatype.
setval()
does not reset the global state for galloc sequences and
should not be used.
A few limitations apply to galloc sequences. BDR tracks galloc sequences in a
special BDR catalog bdr.sequence_alloc
. This catalog is required to track the
currently allocated chunks for the galloc sequences. The sequence name and
namespace is stored in this catalog. Since the sequence chunk allocation is
managed via Raft whereas any changes to the sequence name/namespace is managed
via replication stream, BDR currently does not support renaming galloc
sequences, or moving them to another namespace or renaming the namespace that
contains a galloc sequence. The user should be mindful of this limitation while
designing application schema.
Usage
Before transforming a local sequence to galloc, you need to take care of these prerequisites:
When sequence kind is altered to galloc, it will be rewritten and restart from
the defined start value of the local sequence. If this happens on an existing
sequence in a production database you will need to query the current value
then set the start value appropriately. To assist with this use case, BDR
allows users to pass a starting value with the function bdr.alter_sequence_set_kind()
.
If you are already using offset and you have writes from multiple nodes, you
need to check what is the greatest used value and restart the sequence at least
to the next value.
Since users cannot lock a sequence, you must leave a $MARGIN value to allow operations to continue while the max() value is queried.
The bdr.sequence_alloc
table will give information on the chunk size and what
ranges are allocated around the whole cluster.
In this example we started our sequence from 333,
and we have two nodes in the
cluster, we can see that we have a number of allocation 4, that is 2 per node
and the chunk size is 1000000 that is related to an integer sequence.
To see the ranges currently assigned to a given sequence on each node, use these queries:
- Node
Node1
is using range from333
to2000333
.
- Node
Node2
is using range from2000004
to4000003
.
NOTE You can't combine it to single query (like WHERE ctid IN ('(0,2)', '(0,3)')) as that will still only show the first range.
When a node finishes a chunk, it will ask a consensus for a new one and get the first available; in our case, it will be from 4000334 to 5000333. This will be the new reserved chunk and it will start to consume the old reserved chunk.
UUIDs, KSUUIDs and Other Approaches
There are other ways to generate globally unique ids without using the global sequences that can be used with BDR. For example:
- UUIDs, and their BDR variant, KSUUIDs
- Local sequences with a different offset per node (i.e. manual)
- An externally co-ordinated natural key
Please note that BDR applications cannot use other methods safely:
counter-table
based approaches relying on SELECT ... FOR UPDATE
, UPDATE ... RETURNING ...
or similar for sequence generation will not work correctly in BDR, because BDR
does not take row locks between nodes. The same values will be generated on
more than one node. For the same reason, the usual strategies for "gapless"
sequence generation do not work with BDR. In most cases the application should
coordinate generation of sequences that must be gapless from some external
source using two-phase commit, or it should only generate them on one node in
the BDR group.
UUIDs and KSUUIDs
UUID
keys instead avoid sequences entirely and
use 128-bit universal unique identifiers. These are random
or pseudorandom values that are so large that it is nearly
impossible for the same value to be generated twice. There is
no need for nodes to have continuous communication when using UUID
keys.
In the incredibly unlikely event of a collision, conflict detection will
choose the newer of the two inserted records to retain. Conflict logging,
if enabled, will record such an event, but it is
exceptionally unlikely to ever occur, since collisions
only become practically likely after about 2^64
keys have been generated.
The main downside of UUID
keys is that they're somewhat space- and
network inefficient, consuming more space not only as a primary key, but
also where referenced in foreign keys and when transmitted on the wire.
Additionally, not all applications cope well with UUID
keys.
BDR provides functions for working with a K-Sortable variant of UUID
data,
known as KSUUID, which generates values that can be stored using PostgreSQL's
standard UUID
data type. A KSUUID
value is similar to UUIDv1
in that
it stores both timestamp and random data, following the UUID
standard.
The difference is that KSUUID
is K-Sortable, meaning that it's weakly
sortable by timestamp. This makes it more useful as a database key as it
produces more compact btree
indexes, which improves
the effectiveness of search, and allows natural time-sorting of result data.
Unlike UUIDv1
,
KSUUID
values do not include the MAC of the computer on which they were
generated, so there should be no security concerns from using KSUUID
s.
KSUUID
v2 is now recommended in all cases. Values generated are directly
sortable with regular comparison operators.
There are two versions of KSUUID
in BDR, v1 and v2.
The legacy KSUUID
v1 is
now deprecated but is kept in order to support existing installations and should
not be used for new installations.
The internal contents of the v1 and v2 are not compatible, and as such the
functions to manipulate them are also not compatible. The v2 of KSUUID
also
no longer stores the UUID
version number.
Step & Offset Sequences
In offset-step sequences, a normal PostgreSQL sequence is used on each node. Each sequence increments by the same amount and starts at differing offsets. For example with step 1000, node1's sequence generates 1001, 2001, 3001, and so on, node2's generates 1002, 2002, 3002, etc. This scheme works well even if the nodes cannot communicate for extended periods, but the designer must specify a maximum number of nodes when establishing the schema, and it requires per-node configuration. However, mistakes can easily lead to overlapping sequences.
It is relatively simple to configure this approach with BDR by creating the desired sequence on one node, like this:
... then on each node calling setval()
to give each node a different offset
starting value, e.g.:
You should be sure to allow a large enough INCREMENT
to leave room for all
the nodes you may ever want to add, since changing it in future is difficult
and disruptive.
If you use bigint
values, there is no practical concern about key exhaustion,
even if you use offsets of 10000 or more. You'll need hundreds of years,
with hundreds of machines, doing millions of inserts per second, to have any
chance of approaching exhaustion.
BDR does not currently offer any automation for configuration of the per-node offsets on such step/offset sequences.
Composite Keys
A variant on step/offset sequences is to use a composite key composed of
PRIMARY KEY (node_number, generated_value)
, where the
node number is usually obtained from a function that returns a different
number on each node. Such a function may be created by temporarily
disabling DDL replication and creating a constant SQL function, or by using
a one-row table that is not part of a replication set to store a different
value in each node.
Global Sequence Management Interfaces
BDR provides an interface for converting between a standard PostgreSQL sequence and the BDR global sequence.
Note that the following functions are considered to be DDL
, so DDL replication
and global locking applies to them.
bdr.alter_sequence_set_kind
Allows the owner of a sequence to set the kind of a sequence.
Once set, seqkind
is only visible via the bdr.sequences
view;
in all other ways the sequence will appear as a normal sequence.
BDR treats this function as DDL
, so DDL replication and global locking applies,
if that is currently active. See [DDL Replication].
Synopsis
Parameters
seqoid
- name or Oid of the sequence to be alteredseqkind
-local
for a standard PostgreSQL sequence,timeshard
for BDR global sequence which uses the "time and sharding" based algorithm described in the [BDR Global Sequences] section, orgalloc
for globally-allocated range sequences which use consensus between nodes to assign unique ranges of sequence numbers to each node
Notes
When changing the sequence kind to galloc
, the first allocated range for that
sequence will use the sequence start value as the starting point. When there are
already existing values used by the sequence before it was changed to galloc
,
it is recommended to move the starting point so that the newly generated
values will not conflict with the existing ones using the following command:
This function uses the same replication mechanism as DDL
statements. This means
that the replication is affected by the ddl filters
configuration.
The function will take a global DDL
lock. It will also lock the sequence locally.
This function is transactional - the effects can be rolled back with the
ROLLBACK
of the transaction, and the changes are visible to the current
transaction.
The bdr.alter_sequence_set_kind
function can be only executed by
the owner of the sequence, unless bdr.backwards_compatibility
is
set is set to 30618 or below.
bdr.extract_timestamp_from_timeshard
This function extracts the timestamp component of the timeshard
sequence.
The return value is of type "timestamptz".
Synopsis
Parameters
timeshard_seq
- value of a timeshard sequence
Notes
This function is only executed on the local node.
bdr.extract_nodeid_from_timeshard
This function extracts the nodeid component of the timeshard
sequence.
Synopsis
Parameters
timeshard_seq
- value of a timeshard sequence
Notes
This function is only executed on the local node.
bdr.extract_localseqid_from_timeshard
This function extracts the local sequence value component of the timeshard
sequence.
Synopsis
Parameters
timeshard_seq
- value of a timeshard sequence
Notes
This function is only executed on the local node.
bdr.timestamp_to_timeshard
This function converts a timestamp value to a dummy timeshard sequence value.
This is useful for doing indexed searches or comparisons of values in the timeshard column and for a specific timestamp.
For example, given a table foo
with a column id
which is using a timeshard
sequence, we can get the number of changes since yesterday midnight like this:
A query formulated this way will use an index scan on the column id
.
Synopsis
Parameters
ts
- timestamp to be used for the timeshard sequence generation
Notes
This function is only executed on local node.
KSUUID v2 Functions
Functions for working with KSUUID
v2 data, K-Sortable UUID data.
bdr.gen_ksuuid_v2
This function generates a new KSUUID
v2 value, using the value of timestamp passed as an
argument or current system time if NULL is passed.
If you want to generate KSUUID automatically using system time, pass NULL argument.
The return value is of type "UUID".
Synopsis
Notes
This function is only executed on the local node.
bdr.ksuuid_v2_cmp
This function compares the KSUUID
v2 values.
It returns 1 if first value is newer, -1 if second value is lower, or zero if they are equal.
Synopsis
Parameters
UUID
-KSUUID
v2 to compare
Notes
This function is only executed on local node.
bdr.extract_timestamp_from_ksuuid_v2
This function extracts the timestamp component of KSUUID
v2.
The return value is of type "timestamptz".
Synopsis
Parameters
UUID
-KSUUID
v2 value to extract timestamp from
Notes
This function is only executed on the local node.
KSUUID v1 Functions
Functions for working with KSUUID
v1 data, K-Sortable UUID data(v1).
bdr.gen_ksuuid
This function generates a new KSUUID
v1 value, using the current system time.
The return value is of type "UUID".
Synopsis
Notes
This function is only executed on the local node.
bdr.uuid_v1_cmp
This function compares the KSUUID
v1 values.
It returns 1 if first value is newer, -1 if second value is lower, or zero if they are equal.
Synopsis
Notes
This function is only executed on the local node.
Parameters
UUID
-KSUUID
v1 to compare
bdr.extract_timestamp_from_ksuuid
This function extracts the timestamp component of KSUUID
v1 or UUIDv1
values.
The return value is of type "timestamptz".
Synopsis
Parameters
UUID
-KSUUID
v1 value to extract timestamp from
Notes
This function is only executed on the local node.