[wp-trac] [WordPress Trac] #8759: Word count function doesn't work in several languages

WordPress Trac wp-trac at lists.automattic.com
Mon Dec 19 02:30:25 UTC 2011


#8759: Word count function doesn't work in several languages
----------------------------+-----------------------
 Reporter:  jim912          |       Owner:  westi
     Type:  task (blessed)  |      Status:  assigned
 Priority:  low             |   Milestone:  3.4
Component:  I18N            |     Version:  3.3
 Severity:  normal          |  Resolution:
 Keywords:  has-patch gci   |
----------------------------+-----------------------
Changes (by jiehanzheng):

 * version:  2.7 => 3.3


Comment:

 I suggest modifying the current zh_CN "algorithm", which will make life a
 lot easier.

 Here's what our current zh_CN-word-count.js does: it first removes HTML
 tags, then English punctuation marks and Chinese punctuation marks. It
 then counts all "non-ASCII" characters; at this point the value of ''tc''
 should be the number of non-English characters. After that, it uses the
 original word-count.js method to count English words and adds them to
 ''tc''. After the entire process, ''tc'' is the number of '''Chinese
 characters and English words'''.
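
 The steps above can be sketched roughly as follows. This is only an
 illustration: the function name, the regular expressions and the exact set
 of punctuation marks are my assumptions, not the contents of
 zh_CN-word-count.js.

```javascript
// Sketch only: the names and regular expressions here are assumptions made
// for illustration, not the contents of zh_CN-word-count.js.
function countCharsAndWords(text) {
  // 1. Remove HTML tags.
  var clean = text.replace(/<\/?[a-z][^>]*>/gi, ' ');
  // 2. Remove ASCII punctuation, then a few common CJK punctuation marks.
  clean = clean.replace(/[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E]/g, '');
  clean = clean.replace(/[\u3000-\u3003\u3008-\u3011\u3014-\u301F]/g, ' ');
  // 3. Count every remaining non-ASCII character individually.
  var nonAscii = /[^\u0000-\u007F]/g;
  var tc = (clean.match(nonAscii) || []).length;
  // 4. Drop those characters and count the whitespace-separated words left.
  var words = clean.replace(nonAscii, '').split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  return tc + words.length;
}

console.log(countCharsAndWords('欢迎使用 WordPress。')); // 5
```

 Under these assumptions the sketch returns 5 for the Chinese + English
 test string and 15 for the French one, matching the results reported
 below.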

 The Chinese word-count.js file can be found at:
 http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages
 /zh_CN-word-count.dev.js

 Points worth mentioning:
 * Please consider removing the punctuation marks of other languages as
 well, because counting them doesn't make sense.
 * As for the "Word count: %d" string, I suggest not changing
 wp-includes/script-loader.php, because translators can simply translate
 this string to whatever is actually counted in their language -- simply
 adding a translators' note will do the trick.
 * Our current zh_CN approach doesn't handle languages whose words contain
 letters outside the ASCII set, like French (see examples below). Those
 letters should be counted as parts of words, not as single characters.
 * Naming: the names of variables in the current zh_CN-word-count.js are
 not accurate (e.g. settingsWestern and settingsAsian) -- I suggest
 renaming them.


 Some testing:

 I tested the zh_CN script with some test strings in multiple languages,
 and it turns out the zh_CN script can handle most cases. Therefore I
 suggest basing the changes on the zh_CN JS file.

 == Chinese + English: PASS ==
 {{{
 欢迎使用 WordPress。
 }}}
 tc = 5: 4 Chinese chars + 1 English word; the 1 Chinese punctuation mark
 is not counted.

 == English only: PASS ==
 {{{
 Who says programmers don't have a sense of humor.
 }}}
 tc = 9.

 == Japanese + English: PASS ==
 {{{
 ログイン/ログアウト、管理、フィードと WordPress のリンク
 }}}
 tc = 21: 20 Japanese chars + 1 English word; the 1 English punctuation
 mark (slash) and the 2 Japanese punctuation marks are not counted.

 == Burmese + English: ??? ==

 {{{
 ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ
 ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊
 သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊
 အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ
 လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !
 }}}
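 tc = 304, presumably because every non-ASCII Burmese character is counted
 individually -- I need someone from Myanmar to help me out with what the
 correct count should be...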

 == French: FAIL ==
 {{{
 Le français est une langue romane parlée sur plusieurs continents,
 principalement en Afrique.
 }}}

 tc = 15, but there are actually only 13 French words. The problem may be
 caused by the ç and é characters, which are not in ASCII and are therefore
 each counted as a single character (13 words + 2 characters = 15). What we
 need to do is change t.SettingsAsian.count, currently:
 {{{
 /[^\u0000-\u007F]/g
 }}}
 to suit languages like French.
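
 One possible direction, as a minimal sketch: count per character only
 inside explicit CJK ranges, and let every other letter (including ç and
 é) stay part of a whitespace-separated word. The ranges, the variable name
 perCharacter and the function name are my assumptions and have not been
 checked against other languages.

```javascript
// Hypothetical per-character pattern; the ranges are an assumption, not
// taken from any WordPress file:
//   U+3040-30FF  Hiragana and Katakana
//   U+3400-4DBF  CJK Unified Ideographs Extension A
//   U+4E00-9FFF  CJK Unified Ideographs
var perCharacter = /[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]/g;

function countWithScriptRanges(text) {
  // Drop ASCII punctuation; turn a few common CJK punctuation marks into spaces.
  var clean = text
    .replace(/[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E]/g, '')
    .replace(/[\u3000-\u3003\u3008-\u3011\u3014-\u301F]/g, ' ');
  // Characters in the listed ranges are counted one by one.
  var chars = (clean.match(perCharacter) || []).length;
  // Everything else is counted as whitespace-separated words, so accented
  // Latin letters remain inside their words.
  var words = clean.replace(perCharacter, ' ').split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  return chars + words.length;
}

console.log(countWithScriptRanges('Le français est une langue romane')); // 6
```

 With this pattern the French test string above gives 13, and the
 Chinese + English string still gives 5. Scripts outside the listed ranges,
 such as Burmese, would fall into the word branch and still need review.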

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/8759#comment:15>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software