[wp-trac] [WordPress Trac] #56504: `sanitize_html_class()` is both too restrictive, and too permissive so it may return an invalid class name

Sun Sep 4 15:51:22 UTC 2022

#56504: `sanitize_html_class()` is both too restrictive, and too permissive so it
may return an invalid class name
--------------------------+------------------------------
 Reporter:  anrghg        |       Owner:  (none)
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  Awaiting Review
Component:  General       |     Version:
 Severity:  normal        |  Resolution:
 Keywords:                |     Focuses:
--------------------------+------------------------------

Comment (by anrghg):

 Replying to [comment:1 joyously]:
 > I discovered the leading digit part of this in my testing of my theme,
 which adds body classes using page slugs.
 > Since my theme has options that are class names, changing the sanitizing
 function can cause user input to not match, so any change to this function
 needs a prominent Dev Note or some way to be backward compatible.

 Backcompat could be ensured by changing the function to:
 {{{
 function sanitize_html_class(
         string $class,
         string $fallback = '',
         bool $css_compliant = false,
         string $prefix = '_',
         bool $decode = true
 ) {
         if ( $css_compliant ) {

                 // Do the right thing.

         } else {

                 // Legacy behavior.

         }
         return apply_filters( 'sanitize_html_class', $sanitized, $class,
 $fallback );
 }
 }}}

 However, with your themes fixing the leading digit issue, users inputting
 classes with a leading digit or hyphen-digit only need to know the string
 that their classes will be prefixed with.

 The proposed function really supports user input. Where unpredictability
 comes in awkwardly is when copy-pasting a spec-conformant slug (supporting
 all Unicode in the page title) after selecting the slug in the URL bar so
 it is plain Unicode.

 The legacy `sanitize_html_class()` leaves only bare-bone ASCII, and if
 there is no ASCII in the slug e.g. due to being written in a Non-Latin
 script, it falls back on an empty string.

 Yet another pitfall in the legacy `sanitize_html_class()` is that it may
 return **uppercase** ASCII instead of converting it to lowercase while it
 is on it.

 Uppercase in URL slugs may lead to 404 errors depending on [whether the
 server supports it or not](https://stackoverflow.com/questions/7996919
 /should-url-be-case-sensitive). Generally it’s so unsafe that A-Z should
 be banned from slugs. The Block Editor does so. I came across the issue
 while using my plugin and [hinted it in a
 comment](https://github.com/ampproject/amp-
 wp/issues/7238#issuecomment-1236226182).

 Indeed a prominent **Dev Note** should be part of the solution. It should
 not only document that `sanitize_html_class()` has been fixed to
 *optionally* comply to the CSS spec.

 The Dev Note should also promote best practice of adding not only a single
 class but **three** classes to the `body` element:

 1. `id-1234` to complete the default `postid-1234` OR `page-id-1234` (with
 an extra hyphen for readability…);
 2. The legacy plain ASCII class but **sanitized** so in non-Latin scripts
 there will be no “-1”, “-2”, “-3” class names any longer, either;
 3. The CSS conformant plain Unicode class, with only ASCII symbols and
 punctuation removed except the hyphen and underscore and the backslash
 escaping numeric references plus the space delimiting them! Its really
 very tricky.

 As a bonus, the legacy not-really-sanitized class could also be added, but
 I do not recommend invalid CSS any longer.

 It is not up to the developers to complete sanitization after using a
 function designed and supposed to do the job.

 Nor should CSS be pulled down to bare-bone ASCII, at the expense of Non-
 Latin scripts, emoji, and UX.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/56504#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform