[wp-trac] [WordPress Trac] #58871: support uca14.0.0 collation in database where available

WordPress Trac noreply at wordpress.org
Mon Sep 18 00:55:43 UTC 2023


#58871: support uca14.0.0 collation in database where available
-------------------------------------------------+-------------------------
 Reporter:  danielblack                          |       Owner:  (none)
     Type:  enhancement                          |      Status:  new
 Priority:  normal                               |   Milestone:  Awaiting
                                                 |  Review
Component:  Database                             |     Version:  6.3
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch has-unit-tests needs-      |     Focuses:
  testing                                        |
-------------------------------------------------+-------------------------

Comment (by danielblack):

 {{{
  SHOW COLLATION WHERE Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');
 }}}

 Sounds useful, and this could simply be implemented in `determine_charset`,
 with a cache, so there's only one query. If we do that, maybe
 `has_cap( 'uca1400' )` need not be implemented. We'll see how easy the test
 cases are to write.
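
 A minimal sketch of that idea, written in Python rather than the PHP of
 `wpdb` purely for illustration: run the single `SHOW COLLATION` query once,
 cache the names it returns, and pick the first available entry from a
 preference list. The class name, the preference order, the fallback, and the
 `run_query` callable are all hypothetical, not part of WordPress.

```python
from typing import Callable, Iterable, Optional

# Assumed preference order, newest collation first (illustrative only).
PREFERRED = ("uca1400_ai_ci", "utf8mb4_0900_ai_ci", "utf8mb4_unicode_520_ci")


class CollationPicker:
    """Caches the server's answer so the lookup costs one query."""

    def __init__(self, run_query: Callable[[str], Iterable[str]]):
        # run_query is a hypothetical callable returning collation names.
        self._run_query = run_query
        self._available: Optional[frozenset] = None

    def available(self) -> frozenset:
        """Query the server on first use, then reuse the cached set."""
        if self._available is None:
            names = ", ".join(f"'{name}'" for name in PREFERRED)
            sql = f"SHOW COLLATION WHERE Collation IN ({names})"
            self._available = frozenset(self._run_query(sql))
        return self._available

    def best(self, fallback: str = "utf8mb4_unicode_ci") -> str:
        """Return the most preferred available collation, else the fallback."""
        for name in PREFERRED:
            if name in self.available():
                return name
        return fallback
```

 Calling `best()` repeatedly only issues the query the first time, which is
 the point of the cache.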

 I'll prepare another draft that implements this in `determine_charset`, with
 `maybe_convert_table_to_utf8mb4` doing a collation conversion too.

 From Misc:

 > `@@character_set_collations` is useful here, I just thought I would
 mention it in case it gave any inspiration for alternative solutions.

 It has some possibly useful implications as a connection-level default for
 coercing the collation, which are probably worthwhile.

 https://mariadb.com/kb/en/setting-character-sets-and-collations/#changing-default-collation

 > The MariaDB documentation says "the character set name is always part of
 the collation name...

 Yep, that needs an update. I'll see what can be written.

 > Running SHOW COLLATION WHERE Collation LIKE "%uca1400%" provides NULL
 for the Charset

 Finally got to the bottom of this with the original commit -
 https://github.com/MariaDB/server/commit/133446828c9dcb484476e4b3598af0d63d056a6e
 (also a documentation task to pick up)

 A NULL Charset implies the collation can apply to multiple character sets.

 {{{
 MariaDB [test]> select * from INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY where COLLATION_NAME='uca1400_ai_ci';
 +----------------+--------------------+-----------------------+------+------------+
 | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME   | ID   | IS_DEFAULT |
 +----------------+--------------------+-----------------------+------+------------+
 | uca1400_ai_ci  | utf8mb3            | utf8mb3_uca1400_ai_ci | 2048 |            |
 | uca1400_ai_ci  | ucs2               | ucs2_uca1400_ai_ci    | 2560 |            |
 | uca1400_ai_ci  | utf8mb4            | utf8mb4_uca1400_ai_ci | 2304 |            |
 | uca1400_ai_ci  | utf16              | utf16_uca1400_ai_ci   | 2816 |            |
 | uca1400_ai_ci  | utf32              | utf32_uca1400_ai_ci   | 3072 |            |
 +----------------+--------------------+-----------------------+------+------------+
 }}}
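
 Read this way, each full name in the output above is the character set name,
 an underscore, and the short collation name. A small Python sketch of that
 rule, using only the rows shown above; the function name is hypothetical:

```python
# Character sets listed for uca1400_ai_ci in the output above.
APPLICABLE_CHARSETS = ("utf8mb3", "ucs2", "utf8mb4", "utf16", "utf32")


def full_collation_name(charset: str, collation: str) -> str:
    """Prefix a charset-agnostic collation name with its character set."""
    if charset not in APPLICABLE_CHARSETS:
        raise ValueError(f"{collation} is not listed for charset {charset}")
    return f"{charset}_{collation}"


print(full_collation_name("utf8mb4", "uca1400_ai_ci"))  # utf8mb4_uca1400_ai_ci
```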

 MySQL 5.5 still has the first two columns, `COLLATION_NAME` and
 `CHARACTER_SET_NAME`.

 > I assume it's still correct to use the utf8mb4 character set, along with
 mysqli_set_charset('utf8mb4') for the connection?

 Yes. Or any character set from the table above, it seems.

 > Also, tables that exist today will use utf8mb4_unicode_520_ci, I don't
 think these will be changed during an update, see
 `maybe_convert_table_to_utf8mb4()`;

 But should they? I suspect doing so would be prudent.

 >  would that cause any problems (e.g. adding new tables/columns that
 would then use a different collation)?

 Only when a SQL statement involves existing tables as well, for example a
 join between a new table and an old one:

 {{{
 MariaDB [test]> create table t520 (t varchar(30) character set utf8mb4 collate utf8mb4_unicode_520_ci);
 MariaDB [test]> create table t1400 (t varchar(30) character set utf8mb4 collate utf8mb4_uca1400_ai_ci);
 MariaDB [test]> insert into t520 values ('bob'),('jack'), ('jane');
 MariaDB [test]> insert into t1400 values ('bob'),('jack'), ('jane');

 MariaDB [test]> select * from t1400 join t520 on t1400.t = t520.t;
 ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_uca1400_ai_ci,IMPLICIT) and (utf8mb4_unicode_520_ci,IMPLICIT) for operation '='
 }}}

 (There is also bug https://jira.mariadb.org/browse/MDEV-32192 for using
 `@@character_set_collations` to resolve this, for 11.2+.)

 Given how implicit this is, and the need for compatibility with existing
 tables, a conversion during the update seems a way to avoid some problems.
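
 As a rough illustration of what such a conversion step could generate, here
 is a Python sketch that builds one `ALTER TABLE ... CONVERT TO` statement per
 table that is not yet on the target collation. The function name and the
 input shape (table name mapped to its current collation) are assumptions for
 the example, not the WordPress implementation.

```python
from typing import Dict, List


def conversion_statements(
    table_collations: Dict[str, str],
    target: str = "utf8mb4_uca1400_ai_ci",
) -> List[str]:
    """Return ALTER TABLE statements for tables not yet on the target collation."""
    statements = []
    for table, collation in sorted(table_collations.items()):
        if collation == target:
            continue  # already converted, nothing to do
        # Backticks quote the identifier; embedded backticks are doubled.
        quoted = "`" + table.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET utf8mb4 COLLATE {target}"
        )
    return statements


for statement in conversion_statements(
    {"t520": "utf8mb4_unicode_520_ci", "t1400": "utf8mb4_uca1400_ai_ci"}
):
    print(statement)
```

 With both tables on the same collation, a join like the one that raised
 ERROR 1267 no longer mixes two implicit collations.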

 > Oddly, if I manually run ... which does kinda work with
 `maybe_convert_table_to_utf8mb4()` with its use of explode('_').

 I assume that was intentional.
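
 For context, taking the first `_`-separated segment of a collation name only
 yields a character set when the full name is used. A Python analogue of the
 quoted `explode('_')` step, as an illustration rather than the WordPress
 code; the function name is hypothetical:

```python
def charset_prefix(collation: str) -> str:
    """Return the text before the first underscore, lower-cased."""
    return collation.split("_", 1)[0].lower()


print(charset_prefix("utf8mb4_uca1400_ai_ci"))  # utf8mb4
print(charset_prefix("uca1400_ai_ci"))          # uca1400
```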

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/58871#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

