Unicode Collation Algorithm v11
The Unicode Collation Algorithm (UCA) is a specification (Unicode Technical Report #10) that defines a customizable method of collating and comparing Unicode data. Collation means how data is sorted as with a SELECT … ORDER BY
clause. Comparison is relevant for searches that use ranges with less than, greater than, or equal to operators.
Customizability is an important factor for various reasons such as the following.
- Unicode supports a vast number of languages. Letters that may be common to several languages may be expected to collate in different orders depending upon the language.
- Characters that appear with letters in certain languages such as accents or umlauts have an impact on the expected collation depending upon the language.
- In some languages, combinations of several consecutive characters should be treated as a single character with regards to its collation sequence.
- There may be certain preferences as to the collation of letters according to case. For example, should the lowercase form of a letter collate before the uppercase form of the same letter or vice versa.
- There may be preferences as to whether punctuation marks such as underscore characters, hyphens, or space characters should be considered in the collating sequence or should they simply be ignored as if they did not exist in the string.
Given all of these variations with the vast number of languages supported by Unicode, there is a necessity for a method to select the specific criteria for determining a collating sequence. This is what the Unicode Collation Algorithm defines.
Note
In addition, another advantage for using ICU collations (the implementation of the Unicode Collation Algorithm) is for performance. Sorting tasks, including B-tree index creation, can complete in less than half the time it takes with a non-ICU collation. The exact performance gain depends on your operating system version, the language of your text data, and other factors.
The following sections provide a brief, simplified explanation of the Unified Collation Algorithm concepts. As the algorithm and its usage are quite complex with numerous variations, refer to the official documents cited in these sections for complete details.
Basic Unicode Collation Algorithm Concepts
The official information for the Unicode Collation Algorithm is specified in Unicode Technical Report #10, which can be found on The Unicode Consortium website at:
http://www.unicode.org/reports/tr10/
The ICU – International Components for Unicode also provides much useful information. An explanation of the collation concepts can be found on their website located at:
http://userguide.icu-project.org/collation/concepts
The basic concept behind the Unicode Collation Algorithm is the use of multilevel comparison. This means that a number of levels are defined, which are listed as level 1 through level 5 in the following bullet points. Each level defines a type of comparison. Strings are first compared using the primary level, also called level 1.
If the order can be determined based on the primary level, then the algorithm is done. If the order cannot be determined based on the primary level, then the secondary level, level 2, is applied. If the order can be determined based on the secondary level, then the algorithm is done, otherwise the tertiary level is applied, and so on. There is typically, a final tie-breaking level to determine the order if it cannot be resolved by the prior levels.
- Level 1 – Primary Level for Base Characters. The order of basic characters such as letters and digits determines the difference such as
A < B
. - Level 2 – Secondary Level for Accents. If there are no primary level differences, then the presence or absence of accents and other such characters determine the order such as
a < á
. - Level 3 – Tertiary Level for Case. If there are no primary level or secondary level differences, then a difference in case determines the order such as
a < A
. - Level 4 – Quaternary Level for Punctuation. If there are no primary, secondary, or tertiary level differences, then the presence or absence of white space characters, control characters, and punctuation determine the order such as
-A < A
. - Level 5 – Identical Level for Tie-Breaking. If there are no primary, secondary, tertiary, or quaternary level differences, then some other difference such as the code point values determines the order.
International Components for Unicode
The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.
When Advanced Server is used to create a collation that invokes the ICU components to produce the collation, the result is referred to as an ICU collation.
Locale Collations
When creating a collation for a locale, a predefined ICU short form name for the given locale is typically provided.
An ICU short form is a method of specifying collation attributes, which are the properties of a collation. Collation Attributes
provides additional information on collation attributes.
There are predefined ICU short forms for locales. The ICU short form for a locale incorporates the collation attribute settings typically used for the given locale. This simplifies the collation creation process by eliminating the need to specify the entire list of collation attributes for that locale.
The system catalog pg_catalog.pg_icu_collate_names
contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form
.
If needed, the default characteristics of an ICU short form for a given locale can be overridden by specifying the collation attributes to override that property. This is discussed in the next section.
Collation Attributes
When creating an ICU collation, the desired characteristics of the collation must be specified. As discussed in Locale Collations
, this can typically be done with an ICU short form for the desired locale. However, if more specific information is required, the specification of the collation properties can be done by using collation attributes.
Collation attributes define the rules of how characters are to be compared for determining the collation sequence of text strings. As Unicode covers a vast set of languages in numerous variations according to country, territory and culture, these collation attributes are quite complex.
For the complete, precise meaning and usage of collation attributes, see Section “Collator Naming Scheme” on the ICU – International Components for Unicode website at:
http://userguide.icu-project.org/collation/concepts
The following is a brief summary of the collation attributes and how they are specified using the ICU short form method
Each collation attribute is represented by an uppercase letter, which are listed in the following bullet points. The possible valid values for each attribute are given by codes shown within the parentheses. Some codes have general meanings for all attributes. X means to set the attribute off. O means to set the attribute on. D means to set the attribute to its default value.
- A – Alternate (N, S, D). Handles treatment of variable characters such as white spaces, punctuation marks, and symbols. When set to non-ignorable (N), differences in variable characters are treated with the same importance as differences in letters. When set to shifted (S), then differences in variable characters are of minor importance (that is, the variable character is ignored when comparing base characters).
- C – Case First (X, L, U, D). Controls whether a lowercase letter sorts before the same uppercase letter (L), or the uppercase letter sorts before the same lowercase letter (U). Off (X) is typically specified when lowercase first (L) is desired.
- E – Case Level (X, O, D). Set in combination with the Strength attribute, the Case Level attribute is used when accents are to be ignored, but not case.
- F – French Collation (X, O, D). When set to on, secondary differences (presence of accents) are sorted from the back of the string as done in the French Canadian locale.
- H – Hiragana Quaternary (X, O, D). Introduces an additional level to distinguish between the Hiragana and Katakana characters for compatibility with the JIS X 4061 collation of Japanese character strings.
- N – Normalization Checking (X, O, D). Controls whether or not text is thoroughly normalized for comparison. Normalization deals with the issue of canonical equivalence of text whereby different code point sequences represent the same character, which then present issues when sorting or comparing such characters. Languages such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese should be used with Normalization Checking set to on.
- S – Strength (1, 2, 3, 4, I, D). Maximum collation level used for comparison. Influences whether accents or case are taken into account when collating or comparing strings. Each number represents a level. A setting of I represents identical strength (that is, level 5).
- T – Variable Top (hexadecimal digits). Applicable only when the Alternate attribute is not set to non-ignorable (N). The hexadecimal digits specify the highest character sequence that is to be considered ignorable. For example, if white space is to be ignorable, but visible variable characters are not to be ignorable, then Variable Top set to 0020 would be specified along with the Alternate attribute set to S and the Strength attribute set to 3. (The space character is hexadecimal 0020. Other non-visible variable characters such as backspace, tab, line feed, carriage return, etc. have values less than 0020. All visible punctuation marks have values greater than 0020.)
A set of collation attributes and their values is represented by a text string consisting of the collation attribute letter concatenated with the desired attribute value. Each attribute/value pair is joined to the next pair with an underscore character as shown by the following example.
AN_CX_EX_FX_HX_NO_S3
Collation attributes can be specified along with a locale’s ICU short form name to override those default attribute settings of the locale.
The following is an example where the ICU short form named LROOT
is modified with a number of other collation attribute/value pairs.
AN_CX_EX_LROOT_NO_S3
In the preceding example, the Alternate attribute (A)
is set to non-ignorable (N
). The Case First attribute (C)
is set to off (X)
. The Case Level attribute (E)
is set to off (X)
. The Normalization attribute (N)
is set to on (O)
. The Strength attribute (S)
is set to the tertiary level 3
. LROOT
is the ICU short form to which these other attributes are applying modifications.
Creating an ICU Collation
Creating an ICU collation can be done a number of different ways.
When creating a new database cluster with the
initdb
command, the--icu-short-form
option can be specified to define the ICU collation to be used by default by all databases in the cluster.When creating a new database with the
CREATE DATABASE
command, theICU_SHORT_FORM
parameter can be specified to define the ICU collation to be used by default in that database.In an existing database, the
CREATE COLLATION
command can be used with theICU_SHORT_FORM
parameter to define an ICU collation to be used under specific circumstances such as when assigned with theCOLLATE
clause onto selected columns of certain tables or when appended with theCOLLATE
clause onto an expression such asORDER BY expr COLLATE "collation_name"
.
The following describes the various ways of creating an ICU collation.
CREATE COLLATION
Use the ICU_SHORT_FORM
parameter with the CREATE COLLATION
command to create an ICU collation:
To be able to create a collation, you must have CREATE
privilege on the destination schema where the collation is to reside.
For information about the general usage of the CREATE COLLATION
command, please refer to the PostgreSQL core documentation available at:
https://www.postgresql.org/docs/11/static/sql-createcollation.html
UTF-8 character encoding of the database is required. Any LOCALE
, or LC_COLLATE
and LC_CTYPE
settings that are accepted with UTF-8 encoding can be used.
Parameters
collation_name
The name of the collation to be added. The collation_name may be schema-qualified.
locale
The locale to be used. Short cut for setting LC_COLLATE
and LC_TYPE
. If LOCALE
is specified, then LC_COLLATE
and LC_TYPE
must be omitted.
lc_collate
The collation to be used. If LC_CTYPE
is specified, then LC_COLLATE
must also be specified and LOCALE
must be omitted.
lc_ctype
The character classification to be used. If LC_COLLATE
is specified, then LC_CTYPE
must also be specified and LOCALE
must be omitted.
icu_short_form
The text string specifying the collation attributes and their settings. This typically consists of an ICU short form name, possibly appended with additional collation attribute/value pairs. A list of ICU short form names is available from column icu_short_form
in system catalog pg_catalog.pg_icu_collate_names
.
Example
The following creates a collation using the LROOT
ICU short form.
The definition of the new collation can be seen with the following psql
command.
CREATE DATABASE
The following is the syntax for creating a database with an ICU collation:
For complete information about the general usage, syntax, and parameters of the CREATE DATABASE
command, please refer to the PostgreSQL core documentation available at:
https://www.postgresql.org/docs/11/static/sql-createdatabase.html
When using the CREATE DATABASE
command to create a database using an ICU collation, the TEMPLATE template0
clause must be specified and the database encoding must be UTF-8.
The following is an example of creating a database using the LROOT
ICU short form collation, but sorts an uppercase form of a letter before its lowercase counterpart (CU
) and treats variable characters as non-ignorable (AN
).
The following psql
command shows the newly created database.
The following table is created and populated with rows in the database.
The following query shows that the uppercase form of a letter sorts before the lowercase form of the same base letter, and in addition, variable characters are taken into account when sorted as they appear at the beginning of the sort list. (The default behavior for en_US.UTF-8
is to sort the lowercase form of a letter before the uppercase form of the same base letter, and to ignore variable characters.)
initdb
A database cluster can be created with a default ICU collation for all databases in the cluster by using the --icu-short-form
option with the initdb
command.
For complete information about the general usage, syntax, and parameters of the initdb
command, please refer to the PostgreSQL core documentation available at:
https://www.postgresql.org/docs/11/static/app-initdb.html
The following illustrates this process.
The following shows the databases created in the cluster which all have an ICU collation of AN_CU_EX_NX_LROOT
.
Using a Collation
A newly defined ICU collation can be used anywhere the COLLATION
"collation_name"
clause can be used in a SQL command such as in the column specifications of the CREATE TABLE
command or appended to an expression in the ORDER BY
clause of a SELECT
command.
The following are some examples of the creation and usage of ICU collations based on the English language in the United States (en_US.UTF8
).
In these examples, ICU collations are created with the following characteristics.
Collation icu_collate_lowercase
forces the lowercase form of a letter to sort before its uppercase counterpart (CL
).
Collation icu_collate_uppercase
forces the uppercase form of a letter to sort before its lowercase counterpart (CU
).
Collation icu_collate_ignore_punct
causes variable characters (white space and punctuation marks) to be ignored during sorting (AS
).
Collation icu_collate_ignore_white_sp
causes white space and other non-visible variable characters to be ignored during sorting, but visible variable characters (punctuation marks) are not ignored (AS
, T0020
).
Note
When creating collations, ICU may generate notice and warning messages when attributes are given to modify the LROOT
collation.
The following psql
command lists the collations.
The following table is created and populated.
The following query sorts on column c2
using the default collation. Note that variable characters (white space and punctuation marks) with id
column values of 9
, 10
, and 11
are ignored and sort with the letter B
.
The following query sorts on column c2
using collation icu_collate_lowercase
, which forces the lowercase form of a letter to sort before the uppercase form of the same base letter. Also note that the AN
attribute forces variable characters to be included in the sort order at the same level when comparing base characters so rows with id
values of 9
, 10
, and 11
appear at the beginning of the sort list before all letters and numbers.
The following query sorts on column c2
using collation icu_collate_uppercase
, which forces the uppercase form of a letter to sort before the lowercase form of the same base letter.
The following query sorts on column c2
using collation icu_collate_ignore_punct
, which causes variable characters to be ignored so rows with id values of 9
, 10
, and 11
sort with the letter B
as that is the character immediately following the ignored variable character.
The following query sorts on column c2
using collation icu_collate_ignore_white_sp
. The AS
and T0020
attributes of the collation cause variable characters with code points less than or equal to hexadecimal 0020
to be ignored while variable characters with code points greater than hexadecimal 0020
are included in the sort.
The row with id
value of 11
, which starts with a space character (hexadecimal 0020
) sorts with the letter B
. The rows with id
values of 9
and 10
, which start with visible punctuation marks greater than hexadecimal 0020
, appear at the beginning of the sort list as these particular variable characters are included in the sort order at the same level when comparing base characters.