[wp-trac] [WordPress Trac] #58871: support uca14.0.0 collation in database where available

WordPress Trac noreply at wordpress.org
Sat Sep 16 00:10:37 UTC 2023


#58871: support uca14.0.0 collation in database where available
-------------------------------------------------+-------------------------
 Reporter:  danielblack                          |       Owner:  (none)
     Type:  enhancement                          |      Status:  new
 Priority:  normal                               |   Milestone:  Awaiting Review
Component:  Database                             |     Version:  6.3
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch has-unit-tests             |     Focuses:
  needs-testing                                  |
-------------------------------------------------+-------------------------

Comment (by craigfrancis):

 I don't think we can simply use:

   return version_compare( $db_version, '10.10.1', '>=' );

 That depends on the database in use (MySQL vs MariaDB): MariaDB is
 currently using version 10, but I assume MySQL will move from 8.1 to 10.10
 at some point, and it would then pass this check as well.

 Maybe use [https://dev.mysql.com/doc/refman/8.0/en/show-character-set.html
 SHOW CHARACTER SET]?
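
 A minimal sketch of a check that looks at the server type before comparing
 versions. The function name `example_supports_uca1400()` is hypothetical,
 and it assumes the server version string contains "MariaDB" (optionally
 with the "5.5.5-" prefix some clients report); it is not the patch on this
 ticket.

```php
<?php
/**
 * Hypothetical helper: guess whether the server offers uca1400_* collations.
 *
 * Assumption: $server_info looks like "10.11.4-MariaDB" or
 * "5.5.5-10.11.4-MariaDB-..." for MariaDB, and like "8.1.0" for MySQL.
 */
function example_supports_uca1400( string $server_info ): bool {
	// Only treat the version number as meaningful on MariaDB.
	if ( false === stripos( $server_info, 'mariadb' ) ) {
		return false;
	}

	// Drop the "5.5.5-" prefix that some clients put in front of the real version.
	$server_info = preg_replace( '/^5\.5\.5-/', '', $server_info );

	// Keep only the leading numeric part, e.g. "10.11.4".
	if ( ! preg_match( '/^\d+(?:\.\d+){1,2}/', $server_info, $matches ) ) {
		return false;
	}

	return version_compare( $matches[0], '10.10.1', '>=' );
}
```

 With this approach a future MySQL release numbered 10.10 or higher would
 still return false, because the version is only compared once the server
 has identified itself as MariaDB.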

 Also, do we want WordPress to be accent-insensitive by default? WordPress
 is currently case-insensitive by default.

 ----

 Notes:

 - WordPress currently tries to use [https://github.com/WordPress
 /wordpress-develop/blob/0c7ddbd67a5ca8a2eacb0e034a08549fb190db28/src/wp-
 includes/class-wpdb.php#L885 utf8mb4_unicode_520_ci] when the database
 version is [https://github.com/WordPress/wordpress-
 develop/blob/0c7ddbd67a5ca8a2eacb0e034a08549fb190db28/src/wp-includes
 /class-wpdb.php#L4100 >= 5.6].

 - MariaDB added `xxx_unicode_520_ci` in
 [https://mariadb.com/kb/en/mariadb-1006-changelog/ version 10.0.6] (which
 was a beta release). For reference, MariaDB forked MySQL at version 5.1,
 and followed the numbering scheme up to version 5.5, then jumped to
 version 10.0, with 10.0.10 being the first stable release (so the >= 5.6
 version check works fairly well).

 - The "520" in the name represents UCA 5.2.0.

 - UCA 5.2.0 was a big improvement over `utf8_*` and `utf8mb3_*` (no
 support for Emoji, missing CJK characters, etc); it's also better than
 `utf8mb4_general_ci` and `utf8mb4_unicode_ci` which are affected by the
 "Sushi-Beer" problem (treating all characters in
 [https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
 SMP] as equal, see MySQL bug [https://bugs.mysql.com/bug.php?id=76553
 #76553]).

 - But UCA 5.2.0 has the "Mother-Father issue in Japanese", which affects
 ''sorting'' of multi-byte characters, where MySQL does not recognise “ハ”
 (U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ”
 (U+30D0 KATAKANA LETTER BA) as different characters (see
 [https://dev.mysql.com/blog-archive/sushi-beer-an-introduction-of-utf8
 -support-in-mysql-8-0/ Problem #3 Sorting level]).

 - A [https://dev.mysql.com/blog-archive/mysql-character-sets-unicode-and-
 uca-compliant-collations/ MySQL blog post from July 2021] suggests using
 `utf8mb4_0900_*` (UCA 9.0.0) to address this issue; and specifically
 `utf8mb4_0900_ai_ci` if you want to be accent-insensitive and case-
 insensitive.

 - The `utf8mb4_0900_ai_ci` collation was added in MySQL 8.0.0,
 [https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-0.html released
 2016-09-12]. MySQL 8.1 was
 [https://dev.mysql.com/doc/relnotes/mysql/8.1/en/news-8-1-0.html released
 2023-07-18].

 - The [https://dev.mysql.com/doc/refman/8.1/en/charset-unicode-sets.html
 MySQL Unicode Character Sets] page only notes support for UCA 9.0.0
 (`utf8mb4_0900_ai_ci`), not UCA 14.0.0.

 - [https://mariadb.com/kb/en/changes-improvements-in-mariadb-1010/ MariaDB
 10.10] (where GA 10.10.2 was released 2022-11-17), added support for UCA
 14.0.0 collations (e.g. `uca1400_ai_ci`, ref
 [https://jira.mariadb.org/browse/MDEV-27009 MDEV-27009]).

 - The [https://mariadb.com/kb/en/supported-character-sets-and-collations/
 MariaDB character sets documentation] does not list anything for
 `utf8mb4_0900_*` (and a quick test on MariaDB 10.6.14 returns an "unknown
 collation" error).
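
 The notes above could be turned into a feature check instead of a version
 check. This is only a sketch: `example_pick_collation()` is a hypothetical
 name, the preference order is an assumption, and the list of available
 collation names is assumed to come from whatever the server reports (how
 it is queried is left out).

```php
<?php
/**
 * Hypothetical helper: return the first preferred collation the server
 * reports, or null when none of them is available.
 *
 * @param string[] $available Collation names as reported by the server (assumed input).
 */
function example_pick_collation( array $available ): ?string {
	// Assumed order: newest UCA version first, older fallbacks last.
	$preferred = array(
		'uca1400_ai_ci',          // MariaDB 10.10+, UCA 14.0.0.
		'utf8mb4_0900_ai_ci',     // MySQL 8.0+, UCA 9.0.0.
		'utf8mb4_unicode_520_ci', // UCA 5.2.0, what WordPress tries today.
		'utf8mb4_unicode_ci',     // Older fallback.
	);

	foreach ( $preferred as $collation ) {
		if ( in_array( $collation, $available, true ) ) {
			return $collation;
		}
	}

	return null;
}
```

 Because the choice is driven by the names the server actually lists, it
 does not matter whether MySQL and MariaDB version numbers ever overlap.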

 ----

 Slightly off topic, the MariaDB Jira ticket
 [https://jira.mariadb.org/browse/MDEV-30164 MDEV-30164] gets MariaDB
 11.2.1 to support:

    SET @@character_set_collations='utf8mb4=uca1400_ai_ci';

 So a `CREATE TABLE` that uses `utf8mb4` will use `uca1400_ai_ci`
 instead.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/58871#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

