Thursday, 15 September 2011

regex - Parsing Character Sets without Converting to UTF-8


I am working on a parser/lexer for a set of languages that compile to CSS, and I am wondering how to handle non-ASCII input. This has obviously been dealt with by many people before.

As a general rule of thumb, I keep reading "convert to UTF-8, process, and output in whatever encoding you had as input." That seems reasonable ...

But I'm thinking: all the punctuation marks and numbers I work with directly are ASCII (code points below 128), and all other character strings just go into a hash table (i.e. the program does not need to know how many bytes are needed to express any character).

So here are my questions:

  • Is there any character set that conflicts with the ASCII definitions for the code points I am interested in (below 128)?

  • Can you see a flaw in setting up the lexer so that it matches every character I'm not dealing with directly as a single class, and skipping full-fledged UTF-8 decoding altogether?

    For example:

      // A-Z, a-z and all non-ASCII stuff
      character = (0x41..0x5A) || (0x61..0x7A) || (0x80..0xFF)
      // match 1 or more
      identifier = character+
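
To make the idea concrete, here is a minimal Python sketch of the same character class applied to raw bytes, with no decoding step. The names `IDENTIFIER` and `identifiers` are made up for illustration, and the sample input is an assumption; it only shows that the byte ranges behave the same way on UTF-8 and Latin-1 input.

```python
import re

# Byte-level pattern mirroring the grammar above:
# A-Z (0x41..0x5A), a-z (0x61..0x7A) and every byte from 0x80 to 0xFF.
IDENTIFIER = re.compile(rb"[A-Za-z\x80-\xff]+")

def identifiers(data: bytes) -> list:
    """Return every identifier run found in raw, undecoded input."""
    return IDENTIFIER.findall(data)

sample = "größe: 10px; naïve: 2"

for encoding in ("utf-8", "latin-1"):
    tokens = identifiers(sample.encode(encoding))
    # Decoding here is only for display; the lexer itself never decodes.
    print([t.decode(encoding) for t in tokens])
    # prints ['größe', 'px', 'naïve'] for both encodings
```

The lexer never needs to know whether `ö` took one byte or two, because every byte of a non-ASCII character lands in the `0x80..0xFF` range in both encodings.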

    Thanks a lot!

    If you are going with an encoding-unaware language (such as PHP), then you cannot support UTF-16 input, i.e. the input encoding must be ASCII-compatible. The encoding must be ASCII-compatible at the byte level; do not confuse a character set that merely includes the ASCII characters with an encoding that is ASCII-compatible.
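
As a rough illustration of what byte-level ASCII compatibility means, the sketch below checks, for one sample string, whether the bytes below 0x80 in the encoded form are exactly the ASCII characters of the text. The helper name `ascii_bytes_survive` is hypothetical, and passing the check for one sample does not prove an encoding is safe in general; failing it does show a conflict.

```python
def ascii_bytes_survive(text: str, encoding: str) -> bool:
    """True if every byte < 0x80 in the encoded text is one of its ASCII chars, in order."""
    encoded = text.encode(encoding)
    low_bytes = bytes(b for b in encoded if b < 0x80)
    ascii_chars = "".join(ch for ch in text if ord(ch) < 0x80).encode("ascii")
    return low_bytes == ascii_chars

print(ascii_bytes_survive("a{ é", "utf-8"))      # True
print(ascii_bytes_survive("a{ é", "cp1252"))     # True
print(ascii_bytes_survive("a{ é", "utf-16-le"))  # False: each ASCII char is followed by 0x00
# Shift_JIS encodes "表" as 0x95 0x5C, and 0x5C is also the ASCII backslash.
print(ascii_bytes_survive("a{ 表", "shift_jis"))  # False
```

So UTF-16 fails outright, and multi-byte encodings such as Shift_JIS can place bytes below 0x80 inside a non-ASCII character, which a byte-level lexer would misread as punctuation.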

    Encoding-unaware processing will work well for you as long as the data is just passed through. If you need to deal with characters in any other way, decoding is required anyway, and you may as well decode once at the beginning.
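
A small sketch of when pass-through stops being enough: counting characters is one operation that needs decoded text. The function name `char_count` is hypothetical, and the encoding is assumed to be known by the caller.

```python
def char_count(raw: bytes, encoding: str) -> int:
    """Decode once at the boundary, then work with characters."""
    text = raw.decode(encoding)
    return len(text)

raw = "größe".encode("utf-8")
print(len(raw))                  # 7 bytes, because ö and ß take two bytes each in UTF-8
print(char_count(raw, "utf-8"))  # 5 characters
```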

    Do not transcode the content to UTF-8 (and thereby take on decoding, declaration, detection and other complexity); pass it through. If the input was UTF-8, the output will be UTF-8; if the input was Windows-1252, the output will be Windows-1252. Fewer surprises ...
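
The pass-through approach can be sketched as a byte-level transformation that only ever touches ASCII syntax. The example below strips `/* ... */` comments; `strip_comments` is a hypothetical helper, and it is deliberately naive (it does not handle comment delimiters inside string literals).

```python
import re

# Only ASCII delimiters are matched; everything else is copied through untouched.
COMMENT = re.compile(rb"/\*.*?\*/", re.DOTALL)

def strip_comments(data: bytes) -> bytes:
    """Remove CSS-style comments from raw bytes without decoding them."""
    return COMMENT.sub(b"", data)

source = 'a::before { content: "é"; } /* note: café */'

for encoding in ("utf-8", "cp1252"):
    out = strip_comments(source.encode(encoding))
    # The output is still valid in whatever encoding the input used.
    print(encoding, out.decode(encoding))
```

The function never learns which encoding it was given, yet the result decodes correctly in both cases, as long as the input encoding is ASCII-compatible at the byte level.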
