[wp-trac] [WordPress Trac] #7254: WP Diff shouldn't split words in the middle of UTF-8 characters

WordPress Trac wp-trac at lists.automattic.com
Sun Jul 6 20:57:46 GMT 2008


#7254: WP Diff shouldn't split words in the middle of UTF-8 characters
------------------------+---------------------------------------------------
 Reporter:  nbachiyski  |       Owner:  anonymous
     Type:  defect      |      Status:  new      
 Priority:  high        |   Milestone:  2.6      
Component:  General     |     Version:           
 Severity:  normal      |    Keywords:  has-patch
------------------------+---------------------------------------------------
 Expected:

 When we compare Грещките and Грешките we should get the following HTML
 code for the deleted part:
 {{{
 Гре<del>щ</del>ките
 }}}

 However, we get:
 {{{
 Гре�<del>�</del>ките
 }}}

 {{{WP_Text_Diff_Renderer_inline::_splitOnWords()}}} uses the following
 regular expression to split words: {{{/([^\w])/}}}. Without the /u
 modifier the pattern is matched byte by byte, and {{{\w}}} in this case
 matches only {{{[a-zA-Z0-9_]}}}; every other byte counts as a word
 separator. This is a poor definition of a word, and it also lets a split
 fall in the middle of a multi-byte UTF-8 character, which is what happens
 above.

 The solution is to make the regular expression work on a UTF-8 string,
 using the /u modifier (available from PHP 4.1.0).
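
 The difference can be reproduced outside WordPress. The following is only
 a sketch, not the attached patch: it assumes the split is done with
 preg_split() and PREG_SPLIT_DELIM_CAPTURE (the ticket quotes only the
 pattern), the helper names are made up for illustration, and the file is
 assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: split $text with the given pattern and keep the
// captured delimiters, dropping empty pieces.
function demo_split_on_words($text, $pattern) {
    return preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: true when every piece is a valid UTF-8 string.
// An empty pattern with /u makes preg_match() fail on malformed input.
function demo_all_valid_utf8($pieces) {
    foreach ($pieces as $piece) {
        if (preg_match('//u', $piece) !== 1) {
            return false;
        }
    }
    return true;
}

$word = 'Грешките';

// Pattern quoted in the ticket: bytes are treated as characters.
$byte_pieces = demo_split_on_words($word, '/([^\w])/');

// Same pattern with the /u modifier: the subject is treated as UTF-8.
$utf8_pieces = demo_split_on_words($word, '/([^\w])/u');

var_dump(demo_all_valid_utf8($byte_pieces));
var_dump(demo_all_valid_utf8($utf8_pieces));
```

 With the default C locale the first check should report invalid pieces,
 because the bytes of each Cyrillic letter end up in separate pieces, while
 with /u every piece stays a whole character or a whole word. How many
 pieces the /u version produces depends on whether the PHP build treats
 Cyrillic letters as {{{\w}}}, so the sketch does not rely on a count.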

 Patch attached.

-- 
Ticket URL: <http://trac.wordpress.org/ticket/7254>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software

