[wp-trac] [WordPress Trac] #8759: Word count function doesn't work in several languages
WordPress Trac
wp-trac at lists.automattic.com
Mon Dec 19 02:30:25 UTC 2011
#8759: Word count function doesn't work in several languages
----------------------------+-----------------------
Reporter: jim912 | Owner: westi
Type: task (blessed) | Status: assigned
Priority: low | Milestone: 3.4
Component: I18N | Version: 3.3
Severity: normal | Resolution:
Keywords: has-patch gci |
----------------------------+-----------------------
Changes (by jiehanzheng):
* version: 2.7 => 3.3
Comment:
I suggest modifying the current zh_CN "algorithm", which will make life a
lot easier.
Here's what our current zh_CN-word-count.js does: it first removes HTML
tags, then English punctuation marks and Chinese punctuation marks. It
then counts all "non-ASCII" characters; at this point ''tc'' is the number
of non-English characters. After that, it uses the original word-count.js
method to count English words and adds them. At the end of the process,
''tc'' is the number of '''Chinese characters and English words'''.
The Chinese word-count.js file can be found at:
http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
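To make the steps described above concrete, here is a minimal sketch of that counting order. It is not copied from zh_CN-word-count.dev.js: the function name `countMixed` and the punctuation patterns are assumptions chosen for illustration; only the non-ASCII pattern is the one quoted in this comment.

```javascript
// Hypothetical sketch of the zh_CN counting order; not the shipped script.
function countMixed(text) {
  const cleaned = text
    // 1. Remove HTML tags.
    .replace(/<[^>]*>/g, ' ')
    // 2. Remove Chinese/Japanese punctuation (assumed ranges: CJK symbols
    //    and punctuation, plus full-width ASCII-style punctuation).
    .replace(/[\u3000-\u303F\uFF01-\uFF0F\uFF1A-\uFF1F]/g, '')
    // 3. Remove English punctuation (assumed list).
    .replace(/[.,;:!?'"()\/\\-]/g, '');

  // 4. Every remaining non-ASCII character counts as one.
  const nonAscii = cleaned.match(/[^\u0000-\u007F]/g) || [];

  // 5. Drop the non-ASCII characters and count what is left as
  //    whitespace-separated words.
  const words = cleaned
    .replace(/[^\u0000-\u007F]/g, '')
    .split(/\s+/)
    .filter(Boolean);

  return nonAscii.length + words.length;
}
```

Under these assumptions the sketch returns 5 for the Chinese + English sample and 15 for the French sample, which matches the test results reported below.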
Points worth mentioning:
* Please consider removing the punctuation marks of other languages as
well, because counting them doesn't make sense.
* As for the "Word count: %d" string, I suggest not changing
wp-includes/script-loader.php: translators can render the string with
whatever meaning fits their language, so simply adding a translators' note
will do the trick.
* Our current zh_CN approach doesn't handle languages that should be
counted by words but use letters outside the ASCII set, like French (see
the example below).
* Naming: the variable names in the current zh_CN-word-count.js are not
accurate (e.g. settingsWestern and settingsAsian) -- I suggest renaming
them.
Some testing:
I tested the zh_CN script with test strings in multiple languages, and it
turns out the zh_CN script can handle most cases. Therefore I suggest
basing the changes on the zh_CN js file.
== Chinese + English: PASS ==
{{{
欢迎使用 WordPress。
}}}
tc = 5: 4 Chinese chars + 1 English word; the 1 Chinese punctuation mark is
not counted.
== English only: PASS ==
{{{
Who says programmers don't have a sense of humor.
}}}
tc = 9.
== Japanese + English: PASS ==
{{{
ログイン/ログアウト、管理、フィードと WordPress のリンク
}}}
tc = 21: 20 Japanese chars + 1 English word; the 1 English punctuation mark
(slash) and the 2 Japanese punctuation marks are not counted.
== Burmese + English: ??? ==
{{{
ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ
ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊
သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊
အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ
လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !
}}}
tc = 304 -- I need someone from Myanmar to help me out...
== French: FAIL ==
{{{
Le français est une langue romane parlée sur plusieurs continents,
principalement en Afrique.
}}}
tc = 15, but there are only 13 French words. The two extra counts come from
the ç and é characters: they are outside ASCII, so each is counted as a
separate character on top of the word that contains it. What we need to do
is change t.SettingsAsian.count, currently:
{{{
/[^\u0000-\u007F]/g
}}}
to suit languages like French.
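One possible direction, offered only as an assumption for discussion: instead of counting every non-ASCII character, count per character only inside explicit kana and CJK ideograph ranges, and leave everything else to whitespace word counting. The name `countCjkAware`, the constant `CJK_CHAR`, and the chosen ranges are hypothetical and do not cover every script (Burmese, for instance, would fall through to whitespace splitting, which may not be appropriate).

```javascript
// Hypothetical replacement pattern. Assumed ranges: hiragana + katakana,
// CJK Extension A, CJK Unified Ideographs. Other scripts are not covered.
const CJK_CHAR = /[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]/g;

function countCjkAware(text) {
  const cleaned = text
    // Remove HTML tags.
    .replace(/<[^>]*>/g, ' ')
    // Remove CJK and full-width punctuation (assumed ranges).
    .replace(/[\u3000-\u303F\uFF01-\uFF0F\uFF1A-\uFF1F]/g, '')
    // Remove English punctuation (assumed list).
    .replace(/[.,;:!?'"()\/\\-]/g, '');

  // Kana and ideographs count one per character.
  const cjk = cleaned.match(CJK_CHAR) || [];

  // Everything else, including accented Latin letters such as ç and é,
  // stays inside its word and is counted by whitespace splitting.
  const words = cleaned
    .replace(CJK_CHAR, ' ')
    .split(/\s+/)
    .filter(Boolean);

  return cjk.length + words.length;
}
```

Under these assumptions the French sample counts as 13, while the Chinese and Japanese samples stay at 5 and 21.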
--
Ticket URL: <http://core.trac.wordpress.org/ticket/8759#comment:15>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software