[wp-trac] [WordPress Trac] #58871: support uca14.0.0 collation in database where available
WordPress Trac
noreply at wordpress.org
Sat Sep 16 00:10:37 UTC 2023
#58871: support uca14.0.0 collation in database where available
-------------------------------------------------+-------------------------
Reporter: danielblack                            | Owner: (none)
Type: enhancement                                | Status: new
Priority: normal                                 | Milestone: Awaiting Review
Component: Database                              | Version: 6.3
Severity: normal                                 | Resolution:
Keywords: has-patch has-unit-tests needs-testing | Focuses:
-------------------------------------------------+-------------------------
Comment (by craigfrancis):
I don't think we can simply use:
return version_compare( $db_version, '10.10.1', '>=' );
This depends on the database in use (MySQL vs MariaDB): today only MariaDB
has a 10.x version number, but I assume MySQL will move from 8.1 to 10.10
at some point, and the check would then match MySQL too.
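A minimal sketch of a vendor-aware version check, assuming the full server version string is available and contains "MariaDB" on MariaDB servers. The function name `supports_uca1400()` is hypothetical (not part of WordPress), and the handling of a leading `5.5.5-` prefix is an assumption about how some MariaDB 10.x servers report their version.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: guess whether the server
 * offers the UCA 14.0.0 collations from its version string alone.
 */
function supports_uca1400( string $server_info ): bool {
	// MySQL is never treated as supporting uca1400 here, whatever its
	// version number, so a future "MySQL 10.10" would not match.
	if ( false === stripos( $server_info, 'mariadb' ) ) {
		return false;
	}

	// Assumption: some MariaDB 10.x servers prepend "5.5.5-"; drop it.
	$version = preg_replace( '/^5\.5\.5-/', '', $server_info );

	// Keep only the leading numeric part, e.g. "10.11.4" from "10.11.4-MariaDB-log".
	if ( ! preg_match( '/^\d+(?:\.\d+){1,2}/', $version, $matches ) ) {
		return false;
	}

	return version_compare( $matches[0], '10.10.1', '>=' );
}

var_dump( supports_uca1400( '10.11.4-MariaDB' ) );       // bool(true)
var_dump( supports_uca1400( '5.5.5-10.6.14-MariaDB' ) ); // bool(false)
var_dump( supports_uca1400( '10.10.1' ) );               // bool(false), no vendor marker
```

This still relies on sniffing the version string, so asking the server what it actually supports may be the more robust route.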
Maybe use [https://dev.mysql.com/doc/refman/8.0/en/show-character-set.html
SHOW CHARACTER SET]?
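As a rough sketch of that capability-based approach: given the collation names the server reports (for example collected from a query such as `SHOW COLLATION`, which is my assumption rather than something tested here), pick the first one from a preference list. The function name `pick_collation()` and the preference order are hypothetical, and the exact names a given server reports have not been verified.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: return the first preferred
 * collation that the server reports as available, or null if none match.
 *
 * @param string[] $available Collation names reported by the server.
 * @param string[] $preferred Collation names, most preferred first.
 */
function pick_collation( array $available, array $preferred ): ?string {
	foreach ( $preferred as $name ) {
		if ( in_array( $name, $available, true ) ) {
			return $name;
		}
	}
	return null;
}

// Assumed preference order: newest UCA version first.
$preferred = array(
	'uca1400_ai_ci',          // UCA 14.0.0 (MariaDB)
	'utf8mb4_0900_ai_ci',     // UCA 9.0.0 (MySQL)
	'utf8mb4_unicode_520_ci', // UCA 5.2.0
	'utf8mb4_unicode_ci',
);

// Example input, standing in for what a server might report.
$available = array( 'utf8mb4_general_ci', 'utf8mb4_unicode_ci', 'utf8mb4_unicode_520_ci' );

var_dump( pick_collation( $available, $preferred ) ); // string(22) "utf8mb4_unicode_520_ci"
```

This avoids version and vendor checks entirely, at the cost of one extra query when the connection is set up.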
Also, do we want WordPress to be accent-insensitive by default? WordPress
is currently case-insensitive by default.
----
Notes:
- WordPress currently tries to use [https://github.com/WordPress
/wordpress-develop/blob/0c7ddbd67a5ca8a2eacb0e034a08549fb190db28/src/wp-
includes/class-wpdb.php#L885 utf8mb4_unicode_520_ci] when the database
version is [https://github.com/WordPress/wordpress-
develop/blob/0c7ddbd67a5ca8a2eacb0e034a08549fb190db28/src/wp-includes
/class-wpdb.php#L4100 >= 5.6].
- MariaDB added `xxx_unicode_520_ci` in
[https://mariadb.com/kb/en/mariadb-1006-changelog/ version 10.0.6] (which
was a beta release). For reference, MariaDB forked from MySQL at version 5.1
and followed its numbering scheme up to version 5.5, then jumped to version
10.0, with 10.0.10 as the first GA release (so the >= 5.6 version check
works fairly well).
- The "520" in the name represents UCA 5.2.0.
- The UCA 5.2.0 collations were a big improvement over `utf8_*` and
`utf8mb3_*` (no support for Emoji, missing CJK characters, etc.); they are
also better than `utf8mb4_general_ci` and `utf8mb4_unicode_ci`, which are
affected by the
"Sushi-Beer" problem (treating all characters in
[https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
SMP] as equal, see MySQL bug [https://bugs.mysql.com/bug.php?id=76553
#76553]).
- However, UCA 5.2.0 has the "Mother-Father issue in Japanese"; this affects
''sorting'' of multi-byte characters, where MySQL does not recognise “ハ”
(U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ”
(U+30D0 KATAKANA LETTER BA) as different characters (see
[https://dev.mysql.com/blog-archive/sushi-beer-an-introduction-of-utf8
-support-in-mysql-8-0/ Problem #3 Sorting level]).
- A [https://dev.mysql.com/blog-archive/mysql-character-sets-unicode-and-
uca-compliant-collations/ MySQL blog post from July 2021] suggests using
`utf8mb4_0900_*` (UCA 9.0.0) to address this issue; and specifically
`utf8mb4_0900_ai_ci` if you want to be accent-insensitive and case-
insensitive.
- The `utf8mb4_0900_ai_ci` collation was added in MySQL 8.0,
[https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-0.html released
2016-09-12]. MySQL 8.1 was
[https://dev.mysql.com/doc/relnotes/mysql/8.1/en/news-8-1-0.html released
2023-07-18].
- The [https://dev.mysql.com/doc/refman/8.1/en/charset-unicode-sets.html
MySQL Unicode Character Sets] page only notes support for UCA 9.0.0
(`utf8mb4_0900_ai_ci`), not UCA 14.0.0.
- [https://mariadb.com/kb/en/changes-improvements-in-mariadb-1010/ MariaDB
10.10] (GA release 10.10.2, 2022-11-17) added support for UCA 14.0.0
collations (e.g. `uca1400_ai_ci`, ref
[https://jira.mariadb.org/browse/MDEV-27009 MDEV-27009]).
- The [https://mariadb.com/kb/en/supported-character-sets-and-collations/
MariaDB character sets documentation] does not list anything for
`utf8mb4_0900_*` (and a quick test on MariaDB 10.6.14 returns an "unknown
collation" error).
----
Slightly off topic, the MariaDB Jira ticket
[https://jira.mariadb.org/browse/MDEV-30164 MDEV-30164] gets MariaDB
11.2.1 to support:
SET @@character_set_collations='utf8mb4=uca1400_ai_ci';
So a `CREATE TABLE` that uses `utf8mb4` will use `uca1400_ai_ci`
instead.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/58871#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform