[wp-trac] [WordPress Trac] #60295: esc_html() function returns an empty string when the last character of the input string variable is ASCII 145 or 146

WordPress Trac noreply at wordpress.org
Fri Jan 19 16:23:32 UTC 2024


#60295: esc_html() function returns an empty string when the last character of the
input string variable is ASCII 145 or 146
-------------------------------+------------------------------
 Reporter:  jani20             |       Owner:  (none)
     Type:  defect (bug)       |      Status:  new
 Priority:  normal             |   Milestone:  Awaiting Review
Component:  Formatting         |     Version:
 Severity:  normal             |  Resolution:
 Keywords:  reporter-feedback  |     Focuses:
-------------------------------+------------------------------

Comment (by dmsnell):

 @TobiasBg I think your example is using U+2019 _right single quotation
 mark_. It's hard to see because PHP is probably using UTF-8 by default, so
 your string is actually the three-byte sequence `"test\xe2\x80\x99"`
 rather than a single byte 145 or 146.

 I'm able to reproduce using this.

 {{{#!php
 <?php
 // Each comparison is true: esc_html() returns an empty string.
 var_dump( '' === esc_html( "test\x91" ) );
 var_dump( '' === esc_html( "test\x92" ) );
 }}}

 @jani20 these single quotation marks are not actually ASCII, which only
 covers bytes 0 through 127. Bytes 145 and 146 (`0x91` and `0x92`) are the
 curly single quotation marks in CP-1252, the default character encoding
 Microsoft used for its products for a long time. I'm guessing that your
 blog's charset is set to UTF-8, where these //bytes// form an invalid
 string.

 {{{#!php
 php > iconv( 'utf-8', 'utf-8', "test\x91" );
 PHP Notice:  iconv(): Detected an illegal character in input string in php
 shell code on line 1

 Notice: iconv(): Detected an illegal character in input string in php
 shell code on line 1
 }}}
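
 The same point can be checked outside WordPress. The following is only a
 minimal sketch: `is_valid_utf8()` is a hypothetical helper, not a WordPress
 function, and it assumes the mbstring extension is loaded. It shows that
 the CP-1252 byte is invalid as UTF-8 and that re-encoding it produces the
 three-byte UTF-8 form of U+2019.

 {{{#!php
 <?php
 // Hypothetical helper (not part of WordPress): reports whether a byte
 // string is well-formed UTF-8. Assumes the mbstring extension.
 function is_valid_utf8( string $bytes ): bool {
     return mb_check_encoding( $bytes, 'UTF-8' );
 }

 $cp1252 = "test\x92";         // 0x92: right single quotation mark in CP-1252
 $utf8   = "test\xe2\x80\x99"; // U+2019 encoded as UTF-8

 var_dump( is_valid_utf8( $cp1252 ) ); // bool(false)
 var_dump( is_valid_utf8( $utf8 ) );   // bool(true)

 // Re-encoding the CP-1252 bytes yields the UTF-8 form.
 $converted = mb_convert_encoding( $cp1252, 'UTF-8', 'Windows-1252' );
 var_dump( $converted === $utf8 );     // bool(true)
 }}}

 Converting at the point where the data enters the site, if the source
 encoding is known to be CP-1252, would avoid handing invalid bytes to the
 escaping functions.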

 Things you might want to check:
  - the [https://codex.wordpress.org/Converting_Database_Character_Sets
 database character encoding].
  - your browser might have a character encoding selection in the Edit
 menu, or elsewhere. UTF-8 is what it likely should be. I've seen "Default"
 fail for some sites that don't indicate their charset.
  - ensure your theme is generating a
 [https://codex.wordpress.org/Meta_Tags_in_WordPress META element] with the
 right character encoding, or "charset".

 These characters may legitimately appear in HTML; when they do, WordPress
 [https://html.spec.whatwg.org/#numeric-character-reference-end-state
 should be treating them as CP-1252 treats them]. It does this right now if
 they appear through numeric character references such as `&#x92;`, which
 decodes to `’`, but not if they come through directly as raw bytes in
 normal text.
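
 As an illustration of the rule in the linked part of the HTML
 specification, numeric character references in the range 0x80 to 0x9F are
 replaced with the code points CP-1252 assigns to those bytes. The sketch
 below covers only the two values from this ticket; `remap_c1_code_point()`
 is a hypothetical name, not a WordPress or PHP function, and `mb_chr()`
 assumes the mbstring extension.

 {{{#!php
 <?php
 // Hypothetical helper: a two-entry subset of the HTML specification's
 // replacement table for numeric character references.
 function remap_c1_code_point( int $code_point ): int {
     $table = array(
         0x91 => 0x2018, // left single quotation mark
         0x92 => 0x2019, // right single quotation mark
     );

     // Code points outside the table pass through unchanged.
     return $table[ $code_point ] ?? $code_point;
 }

 printf( "U+%04X\n", remap_c1_code_point( 0x92 ) ); // U+2019
 printf( "U+%04X\n", remap_c1_code_point( 0x41 ) ); // U+0041

 // Encoding the remapped code point gives the expected UTF-8 bytes.
 var_dump( mb_chr( remap_c1_code_point( 0x92 ), 'UTF-8' ) === "\xe2\x80\x99" ); // bool(true)
 }}}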

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60295#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

