[wp-trac] [WordPress Trac] #6077: UTF-8 strings are sometimes cut in the middle of a character

WordPress Trac wp-trac at lists.automattic.com
Mon Mar 3 16:08:42 GMT 2008


#6077: UTF-8 strings are sometimes cut in the middle of a character
------------------------+---------------------------------------------------
 Reporter:  nbachiyski  |       Owner:  anonymous            
     Type:  defect      |      Status:  new                  
 Priority:  normal      |   Milestone:  2.5                  
Component:  General     |     Version:                       
 Severity:  normal      |    Keywords:  unicode utf-8 excerpt
------------------------+---------------------------------------------------
 Using {{{substr}}} on UTF-8 strings can cut a character in the middle,
 because {{{substr}}} counts bytes, while a single UTF-8 character can
 take more than one byte.
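
 As a minimal illustration of the problem (the sample string and variable
 names are made up for this sketch and are not taken from the patch), the
 following shows a byte-based cut leaving an incomplete character behind:

```php
<?php
// "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9,
// so this 4-character string is 5 bytes long.
$s = "caf\xC3\xA9";

// strlen() and substr() both operate on bytes, not characters.
$byte_length = strlen($s);

// Cutting after 4 bytes keeps 0xC3 but drops 0xA9,
// leaving a dangling lead byte at the end of the result.
$cut = substr($s, 0, 4);

// preg_match() with the /u modifier returns false when the
// subject is not valid UTF-8, which makes it a cheap validity check.
$cut_is_valid_utf8 = (preg_match('//u', $cut) === 1);
```

 Here {{{$byte_length}}} is 5 and {{{$cut_is_valid_utf8}}} is false, since
 the result ends with half of the "é".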

 Here is a patch, which:
  * Defines {{{mb_strcut}}} in {{{compat.php}}} for users who don't have
 the {{{mbstring}}} extension.
  * Introduces a new {{{wp_html_excerpt}}} function, which uses
 {{{mb_strcut}}} and works well with HTML strings: it strips tags and
 counts each entity as one character ({{{&amp;}}} is one character, not
 five).
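
 The idea behind the two functions can be sketched roughly as follows. This
 is not the code from the attached patch: {{{example_utf8_cut}}} and
 {{{example_html_excerpt}}} are hypothetical names, the cut only supports
 starting at offset 0, the input is assumed to be valid UTF-8, and
 re-escaping the result is a design choice made for this sketch only.

```php
<?php
// Hypothetical helper (not from the patch): return at most $max_bytes
// bytes from the start of $str without splitting a UTF-8 character.
function example_utf8_cut($str, $max_bytes) {
    if ($max_bytes <= 0) {
        return '';
    }
    if (strlen($str) <= $max_bytes) {
        return $str;
    }
    $end = $max_bytes;
    // UTF-8 continuation bytes look like 10xxxxxx. If the first byte
    // that would be dropped is a continuation byte, the cut falls inside
    // a character, so step back to that character's lead byte.
    while ($end > 0 && (ord($str[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($str, 0, $end);
}

// Hypothetical helper (not from the patch): strip tags, decode entities
// so that each one counts as a single character, cut on a character
// boundary, then escape the result again for output.
function example_html_excerpt($html, $max_bytes) {
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = example_utf8_cut($text, $max_bytes);
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
```

 With these definitions, {{{example_utf8_cut("caf\xC3\xA9", 4)}}} returns
 {{{"caf"}}} rather than a string ending in a stray byte, and
 {{{example_html_excerpt('<b>Tom &amp; Jerry</b>', 5)}}} returns
 {{{Tom &amp;}}}.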

 There are some tests for the two functions:
  * [http://svn.automattic.com/wordpress-tests/wp-testcase/test_includes_compat.php _mb_strcut]
  * [http://svn.automattic.com/wordpress-tests/wp-testcase/test_includes_formatting.php wp_html_excerpt]
 (at the end of the file)

-- 
Ticket URL: <http://trac.wordpress.org/ticket/6077>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software

