[wp-trac] [WordPress Trac] #58871: support uca14.0.0 collation in database where available

Sun Sep 17 12:40:29 UTC 2023

#58871: support uca14.0.0 collation in database where available
-------------------------------------------------+-------------------------
 Reporter:  danielblack                          |       Owner:  (none)
     Type:  enhancement                          |      Status:  new
 Priority:  normal                               |   Milestone:  Awaiting
                                                 |  Review
Component:  Database                             |     Version:  6.3
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch has-unit-tests needs-      |     Focuses:
  testing                                        |
-------------------------------------------------+-------------------------

Comment (by craigfrancis):

 Thanks @danielblack.

 Just a thought (as I'm not sure what the repercussions are), but if we
 added support for MySQL's `utf8mb4_0900_ai_ci` as well, to avoid multiple
 `SHOW COLLATION` queries, we could use:

     SHOW COLLATION WHERE Collation IN
         ('uca1400_ai_ci', 'utf8mb4_0900_ai_ci', 'utf8mb4_unicode_520_ci');

 Then store the results on a private wpdb property, so it's cached, and can
 be used by `has_cap()`?
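
 As a rough sketch of that caching idea (not a patch): the class name
 `CollationCache`, its `has()` method and the injected `$fetch` callable
 are all hypothetical and stand in for a private `wpdb` property plus the
 real query call. The fake callable below just pretends the server only
 knows `utf8mb4_unicode_520_ci`.

```php
<?php
// Hypothetical sketch only: none of these names exist in wpdb today.
final class CollationCache {
    // Collations we care about, checked with a single SHOW COLLATION query.
    private const CANDIDATES = array(
        'uca1400_ai_ci',
        'utf8mb4_0900_ai_ci',
        'utf8mb4_unicode_520_ci',
    );

    /** @var array<string,bool>|null Null until the first lookup. */
    private $available = null;

    /** @var callable Takes an SQL string, returns a list of collation names. */
    private $fetch;

    public function __construct( callable $fetch ) {
        $this->fetch = $fetch;
    }

    public function has( string $collation ): bool {
        if ( null === $this->available ) {
            $sql = "SHOW COLLATION WHERE Collation IN ('"
                . implode( "','", self::CANDIDATES ) . "')";
            // Run the query once and remember which names came back.
            $this->available = array_fill_keys( ( $this->fetch )( $sql ), true );
        }
        return isset( $this->available[ $collation ] );
    }
}

// Usage with a fake query function (assumption: the server only has 520).
$calls = 0;
$cache = new CollationCache( function ( string $sql ) use ( &$calls ): array {
    $calls++;
    return array( 'utf8mb4_unicode_520_ci' );
} );

var_dump( $cache->has( 'utf8mb4_unicode_520_ci' ) ); // bool(true)
var_dump( $cache->has( 'uca1400_ai_ci' ) );          // bool(false)
var_dump( $calls );                                  // int(1)
```

 The point is only that repeated capability checks hit the database once;
 where the cache would actually live in `wpdb` is still an open question.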

 Note that `determine_charset()` is called by `db_connect()`, via
 `init_charset()`; and while it's fairly fast (on my localhost ~0.0006s,
 which does not use a network connection), it won't be as fast as
 `mysqli_get_server_info()` to **guess** the supported character sets based
 on version number (~0.0000001s).

 ---

 And misc points...

 - I'm fine with accent-insensitive (like case-insensitive), I just don't
 know if it would cause any problems for anyone else (only reason I'm
 noting it).

 - Agreed, I don't think `@@character_set_collations` is useful here; I
 just thought I would mention it in case it gave any inspiration for
 alternative solutions.

 - The MariaDB documentation says "the character set name is always part of
 the collation name"
 ([https://mariadb.com/kb/en/character-set-and-collation-overview/ source]).
 I assume that's incorrect, as collation `uca1400_ai_ci` would imply a
 different character set.

 - Running `SHOW COLLATION WHERE Collation LIKE "%uca1400%"` returns NULL
 for the `Charset` column?

 - I assume it's still correct to use the `utf8mb4` character set, along
 with `mysqli_set_charset('utf8mb4')` for the connection?

 - Also, tables that exist today will use `utf8mb4_unicode_520_ci`, I don't
 think these will be changed during an update, see
 `maybe_convert_table_to_utf8mb4()`; would that cause any problems (e.g.
 adding new tables/columns that would then use a different collation)?

 - Oddly, if I manually run `ALTER TABLE wp_commentmeta CHANGE meta_key
 meta_key VARCHAR(255) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci NULL
 DEFAULT NULL`, then the `meta_key` field collation is set to
 `utf8mb4_uca1400_ai_ci`, which does kinda work with
 `maybe_convert_table_to_utf8mb4()` and its use of `explode('_')`.
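
 To illustrate that last point, here is a minimal sketch, assuming the
 character set is taken as the first `_`-separated segment of the
 collation name. The helper `charset_from_collation()` is hypothetical and
 only mirrors the `explode('_')` idea; it is not the core function.

```php
<?php
// Hypothetical helper: returns the text before the first underscore.
function charset_from_collation( string $collation ): string {
    list( $charset ) = explode( '_', $collation );
    return $charset;
}

// Column-level names carry the charset prefix, so this works:
echo charset_from_collation( 'utf8mb4_uca1400_ai_ci' ), "\n";  // utf8mb4
echo charset_from_collation( 'utf8mb4_unicode_520_ci' ), "\n"; // utf8mb4

// The short name has no charset prefix, so the first segment is not a charset:
echo charset_from_collation( 'uca1400_ai_ci' ), "\n";          // uca1400
```

 So the prefixed name stored on the column behaves like the existing
 collations, while any code that sees the bare `uca1400_ai_ci` name would
 need different handling.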

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/58871#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform