Lightweight way to parse non-ascii encodings

A place to discuss the implementation and style of computer programs.


Workaphobia
Posts: 121
Joined: Thu Jan 25, 2007 12:21 am UTC

Lightweight way to parse non-ascii encodings

Postby Workaphobia » Tue Jan 01, 2008 6:31 am UTC

I'm trying to figure out a good way to search for a particular (ASCII-range) delimiter character in a stream of UTF-16 data. The most straightforward way would be to look for the octet one by one and pray that it doesn't appear as part of another character's byte sequence, but that's obviously a lousy hack. UTF-16 is a variable-width encoding, so there's really no good way for me to traverse it manually without reimplementing something that recognizes the entire encoding. I thought about using libiconv to convert some of the data to ASCII/Latin-1 and match the delimiter afterwards, but there seems to be no easy way to resume parsing the stream after the first codepoint that can't be expressed without Unicode.

A Google search turned up ICU, but it seems a bit heavyweight for this one task, although a Unicode regular expression engine would certainly do the job.

I thought about converting the UTF-16 into UCS-2, which is very similar but is a fixed-length encoding that only supports codepoints up to 0xFFFF. I could then do a linear search, incrementing two octets at a time. The input stream really shouldn't contain any data that doesn't fit in 16 bits, so the conversion should succeed. Actually, in that case maybe I don't even need UCS-2, since the input should already consist entirely of 16-bit units. The other option would be converting the UTF-16 input to UTF-8, which would ensure that every valid 7-bit ASCII byte value is a legitimate occurrence.
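For what it's worth, the "convert first, search after" route can be sketched in a few lines of Python (illustrative only; the little-endian byte order and the function name here are my assumptions, and a real stream might carry a BOM or use big-endian order):

```python
# Sketch: decode the whole UTF-16 stream up front, then search the
# decoded text for the delimiter. Assumes UTF-16LE without a BOM;
# adjust the codec name otherwise.

def find_delimiter_by_decoding(raw: bytes, delimiter: str) -> int:
    """Return the character index of the delimiter, or -1 if absent."""
    text = raw.decode("utf-16-le")  # raises UnicodeDecodeError on bad input
    return text.find(delimiter)

sample = "héllo,wörld".encode("utf-16-le")
position = find_delimiter_by_decoding(sample, ",")  # character index 5
```

Note that this gives a character index in the decoded text, not a byte offset into the original stream, which may or may not matter depending on what you do with the match.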

This is work-related, and I'll probably have implemented it one way or the other by the time I get many responses here. But I figured I'd just get some opinions and talk about that pesky non-English part of computing. Before this project I never even had to think about Unicode and text encodings.
Evidently, the key to understanding recursion is to begin by understanding recursion.

The rest is easy.

ToLazyToThink
Posts: 83
Joined: Thu Jun 14, 2007 1:08 am UTC

Re: Lightweight way to parse non-ascii encodings

Postby ToLazyToThink » Tue Jan 01, 2008 3:26 pm UTC

I'll admit I don't use Unicode too often (at least not to store anything outside the ASCII range), but looking at the wiki page, I think this should be simple.

According to that section, the individual halves of a surrogate pair never form a valid character code point on their own. So you should be able to get by just searching 16 bits at a time for 0x00?? or 0x??00 (depending on the byte order) for your 7-bit ASCII code point.
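That 16-bit scan might look something like the following (a Python sketch for illustration; the UTF-16LE byte order and the function name are my assumptions):

```python
# Scan a UTF-16LE stream one 16-bit code unit at a time. Surrogate
# halves occupy the range 0xD800-0xDFFF, so a code unit equal to an
# ASCII value (< 0x80) is always a real character, never half a pair.

def find_delimiter_utf16le(data: bytes, delimiter: str) -> int:
    """Return the byte offset of the delimiter's code unit, or -1."""
    target = ord(delimiter)
    assert target < 0x80, "delimiter must be 7-bit ASCII"
    for i in range(0, len(data) - 1, 2):
        unit = data[i] | (data[i + 1] << 8)  # assemble little-endian unit
        if unit == target:
            return i
    return -1

sample = "𝄞,x".encode("utf-16-le")  # the musical symbol needs a surrogate pair
offset = find_delimiter_utf16le(sample, ",")  # byte offset 4
```

Because the scan advances two bytes at a time from offset 0, it also stays aligned to code-unit boundaries, so a delimiter byte hiding inside the second half of a unit can't produce a false match either.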

Workaphobia
Posts: 121
Joined: Thu Jan 25, 2007 12:21 am UTC

Re: Lightweight way to parse non-ascii encodings

Postby Workaphobia » Wed Jan 02, 2008 1:46 am UTC

After giving that section of the article a much more careful read-through, I see you're correct: it really is as simple as incrementing in pairs. I didn't realize that the surrogate pairs couldn't be arranged in a more complicated fashion. For example, IIRC the original UTF-8 design allowed up to six or so bytes to represent one codepoint, though again none of those bytes fall in the ASCII range.
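For completeness, the UTF-8 route rests on the same kind of guarantee: every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte search for an ASCII delimiter can never false-match. A minimal sketch, assuming UTF-8 input (the function name is mine):

```python
# In UTF-8, byte values 0x00-0x7F appear only as complete ASCII
# characters; the lead and continuation bytes of multi-byte sequences
# are all >= 0x80, so a raw byte search is safe for ASCII delimiters.

def find_delimiter_utf8(data: bytes, delimiter: str) -> int:
    """Return the byte offset of an ASCII delimiter in UTF-8 data, or -1."""
    target = ord(delimiter)
    assert target < 0x80, "delimiter must be 7-bit ASCII"
    return data.find(bytes([target]))

raw = "héllo,wörld".encode("utf-8")
offset = find_delimiter_utf8(raw, ",")  # byte offset 6 (é takes two bytes)
```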
Evidently, the key to understanding recursion is to begin by understanding recursion.

The rest is easy.

