[wp-hackers] Need internationalization-issue help, UTF-8 RSS feeds...

Sun Aug 8 16:56:07 UTC 2004

So CG-FeedRead is apparently having problems when it comes to UTF-8 feeds.
No surprise, since I haven't really worked with non-ASCII datasets.

I'm at a loss here, because while I certainly understand multibyte/Unicode
issues from a Mac/Windows C/C++ programming perspective, I am CLUELESS when
it comes to how they are handled on the web, from a feed, into PHP, etc.
I have the basic concept of UTF-8 as an encoding scheme down, but I'm not
sure how to apply said knowledge. ;)

The XML 'looks right' inside Firefox, and 'looks wrong' with FeedRead.
However, it also 'looks wrong' if I take the raw XML feed (which my own HTTP
parser retrieves), and open it in Crimson Editor.  It looks correct again if
I take the same XML file and open it in Firefox.  Presumably, that means
Crimson isn't UTF-8 aware (I can't find an encoding setting anywhere), so
I'm about to download a bunch of other text editors to help me debug further.

Now, this all makes sense, given that the XML has scattered UTF-8 multibyte
characters embedded in the stream (like the 'ae' character), plus a TON of
plain 'ASCII' characters (ord < 128).
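
To make the byte picture concrete, here is a minimal sketch of what I think
is going on. It is written in Python purely as an illustration (not PHP, and
not FeedRead code); the sample string and variable names are made up. It
shows the 'ae' character (U+00E6) taking two bytes in UTF-8, both >= 128,
and what a viewer that treats every byte as a Latin-1 character would show.

```python
# Hypothetical sample text; "\u00e6" is the 'ae' character (U+00E6).
text = "K\u00e6reste <b>ok</b>"

# The bytes as they would travel in a UTF-8 feed.
raw = text.encode("utf-8")

# The 'ae' character becomes two bytes, both >= 128.
ae_bytes = list("\u00e6".encode("utf-8"))   # [195, 166]

# A plain ASCII character stays a single byte < 128.
lt_bytes = list("<".encode("utf-8"))        # [60]

# A viewer that is not UTF-8 aware maps each byte to one character.
# Decoding as Latin-1 imitates that and yields two junk characters
# (U+00C3, U+00A6) in place of the single 'ae'.
misread = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 recovers the original text.
reread = raw.decode("utf-8")

print(ae_bytes, lt_bytes)
print(len(text), len(raw), len(misread))
```

If that is right, then 'looks wrong' in an editor only says the editor is
guessing the wrong encoding; the bytes in the file may still be fine.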

So... How do I programmatically keep from stomping the UTF-8 chars?  Even
when debugging through the feed processing, it looks like it is too late and
the UTF-8 to ASCII (or something) 'stomp' has already occurred.  I wouldn't
be surprised to find that certain PHP XML library calls that I am making to
convert the XML into structured arrays are part of the problem (and I would,
painfully, write my own XML converter if need be).  I do understand that
some of the other string functions I am using will completely bork UTF-8
strings at this point (trim/strip/substring functions, for example).
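
As a sketch of the substring worry only: the snippet below (again Python for
illustration, with hypothetical helper names, not any PHP API) cuts a UTF-8
byte string at a byte offset versus at a character offset. The byte-wise cut
can land in the middle of a multibyte character and leave an invalid
sequence, which is the kind of damage I'd expect from byte-oriented string
functions.

```python
# Hypothetical input: 'ae' character followed by "bc", as UTF-8 bytes.
raw = "\u00e6bc".encode("utf-8")   # 4 bytes for 3 characters


def byte_truncate(data, n):
    """Keep the first n BYTES; unaware of character boundaries."""
    return data[:n]


def char_truncate(data, n):
    """Keep the first n CHARACTERS by decoding first, then re-encoding."""
    return data.decode("utf-8")[:n].encode("utf-8")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


broken = byte_truncate(raw, 1)   # lone lead byte 0xC3, no continuation
safe = char_truncate(raw, 1)     # both bytes of the 'ae' character

print(is_valid_utf8(broken), is_valid_utf8(safe))
```

So any fix presumably has to either decode before cutting or only cut on
bytes that are known to be single-byte ASCII.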

Thanks all,

David Chait
CHAITGEAR
www.chait.net