[wp-trac] [WordPress Trac] #4457: WP does not properly encode UTF-8 mail per RFC 2047
WordPress Trac
wp-trac at lists.automattic.com
Wed Jun 13 19:05:55 GMT 2007
#4457: WP does not properly encode UTF-8 mail per RFC 2047
-----------------------+----------------------------------------------------
 Reporter:  trauschus  |       Owner:  anonymous
     Type:  defect     |      Status:  new
 Priority:  high       |   Milestone:  2.2.1
Component:  General    |     Version:  2.2
 Severity:  critical   |    Keywords:  rfc2047 mail
-----------------------+----------------------------------------------------
RFC 2047, which is MIME Part 3, specifies that non-ASCII information in
headers such as the RFC (2)822 Subject header must be properly encoded.
WordPress gets it *mostly* right; however, it violates one very important
rule (quoted from RFC 2047):

    Each 'encoded-word' MUST encode an integral number of octets. The
    'encoded-text' in each 'encoded-word' must be well-formed according
    to the encoding specified; the 'encoded-text' may not be continued in
    the next 'encoded-word'. (For example, "=?charset?Q?=?=
    =?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
    must follow the "=" in the same 'encoded-word'.)

    Each 'encoded-word' MUST represent an integral number of characters.
    A multi-octet character may not be split across adjacent
    'encoded-word's.

However, I just received a mail from WordPress with the following subject
header:
Subject:
=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2?=
=?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=
The ’ (U+2019, RIGHT SINGLE QUOTATION MARK) is split mid-character, which
is incorrect. Per the standard, its UTF-8 octet sequence E2 80 99 cannot be
split across encoded-words, and the split causes RFC 2047-compliant mailers
such as Evolution to display the subject as transmitted (i.e., in quite an
ugly manner).
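
To make the failure concrete, here is a minimal Python sketch that decodes
the 'encoded-text' of each of the two encoded-words above on its own and
checks whether the result is complete UTF-8. The helper names
(q_decode_bytes, is_complete_utf8) are made up for this illustration and
are not part of WordPress or any mail library; the decoder only handles
what this subject line needs.

```python
def q_decode_bytes(encoded_text):
    """Decode the 'encoded-text' of a Q-encoded word into raw octets."""
    out = bytearray()
    i = 0
    while i < len(encoded_text):
        ch = encoded_text[i]
        if ch == "=":
            # "=XX" stands for the octet with hex value XX.
            out.append(int(encoded_text[i + 1:i + 3], 16))
            i += 3
        elif ch == "_":
            # "_" stands for a space in the Q encoding.
            out.append(0x20)
            i += 1
        else:
            out.append(ord(ch))
            i += 1
    return bytes(out)


def is_complete_utf8(octets):
    """Return True if the octets form a whole number of UTF-8 characters."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The 'encoded-text' portions of the two encoded-words in the subject.
first = '[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2'
second = '=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"'

print(is_complete_utf8(q_decode_bytes(first)))           # False
print(is_complete_utf8(q_decode_bytes(second)))          # False
print(is_complete_utf8(q_decode_bytes(first + second)))  # True
```

Neither word decodes by itself, because the first ends with a lone E2 and
the second starts with the orphaned 80 99; only the concatenation is valid.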
I used some code in a C# application to avoid this situation:
    // c is a byte representing one octet of a UTF-8 character.
    if (RetVal.Length > wrapLength) {
        if (((c & 0xC0) == 0xC0) || ((c & 0xC0) == 0x80)) {
            // Do nothing: c is a lead or continuation octet of a
            // multi-octet character, so we cannot split here.
        } else {
            RetVal += Ending;
            Lines.Add(RetVal);
            RetVal = "\n " + Preamble;
        }
    }
Basically, if the octet ANDed with 0xC0 equals 0xC0 (the lead octet of a
multi-octet character) or 0x80 (a continuation octet), the string should
not be split at that location. It should not be terribly hard to express
that in PHP as well. This is most likely not a security issue, though it
could cause strange behavior in mail user agents (MUAs) that attempt to
parse the malformed encoded-words anyway. Evolution follows the standard by
choosing not to parse them. From RFC 2047:

    6.3. Mail reader handling of incorrectly formed 'encoded-word's

    It is possible that an 'encoded-word' that is legal according to the
    syntax defined in section 2, is incorrectly formed according to the
    rules for the encoding being used. For example:

    (1) An 'encoded-word' which contains characters which are not legal
        for a particular encoding (for example, a "-" in the "B"
        encoding, or a SPACE or HTAB in either the "B" or "Q" encoding),
        is incorrectly formed.

    (2) Any 'encoded-word' which encodes a non-integral number of
        characters or octets is incorrectly formed.

    A mail reader need not attempt to display the text associated with an
    'encoded-word' that is incorrectly formed. However, a mail reader
    MUST NOT prevent the display or handling of a message because an
    'encoded-word' is incorrectly formed.

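
For illustration, one way an encoder could satisfy the "integral number of
characters" rule is to Q-encode one character at a time and only start a
new encoded-word between characters. The Python sketch below does that; it
is not WordPress's code and not a proposed patch, the function names
(q_encode_char, encode_words, decode_word) are hypothetical, and it assumes
a conservative set of unencoded characters and the 75-character
encoded-word limit from RFC 2047. It does not account for the length of
the "Subject: " prefix on the first line.

```python
import quopri
import string

# Characters emitted as-is; a space becomes "_" and everything else "=XX".
SAFE = set(string.ascii_letters + string.digits + "!*+-/")


def q_encode_char(ch, charset="UTF-8"):
    """Q-encode a single character, keeping all of its octets together."""
    out = []
    for octet in ch.encode(charset):
        if octet == 0x20:
            out.append("_")
        elif chr(octet) in SAFE:
            out.append(chr(octet))
        else:
            out.append("=%02X" % octet)
    return "".join(out)


def encode_words(text, charset="UTF-8", max_len=75):
    """Split text into Q-encoded words, breaking only between characters."""
    prefix, suffix = "=?%s?Q?" % charset, "?="
    budget = max_len - len(prefix) - len(suffix)
    words, current = [], ""
    for ch in text:
        piece = q_encode_char(ch, charset)
        # Close the current word before it would overflow, never mid-character.
        if current and len(current) + len(piece) > budget:
            words.append(prefix + current + suffix)
            current = ""
        current += piece
    if current:
        words.append(prefix + current + suffix)
    return words


def decode_word(word, charset="UTF-8"):
    """Decode one encoded-word on its own (raises if it is not whole)."""
    body = word[len("=?%s?Q?" % charset):-len("?=")]
    return quopri.decodestring(body.encode("ascii"), header=True).decode(charset)


subject = ("[Trausch\u2019s Little Home] Please moderate: "
           "\"Well, it\u2019s good I don\u2019t use IE\u2026\"")
words = encode_words(subject)
print("Subject: " + "\n ".join(words))
```

Because every encoded-word now holds whole characters, each one can be
decoded independently and the decoded pieces concatenate back to the
original subject.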
I have chosen the pri/sev high/crit because this is a standards-compliance
issue, and might in remote situations be a security issue if there are
particularly borked MUAs out there that do strange things with the header.
--
Ticket URL: <http://trac.wordpress.org/ticket/4457>
WordPress Trac <http://trac.wordpress.org/>
WordPress blogging software