Mystery UTF-8-like encoding
I've been given a file that is supposedly in UTF-8, but there are weird encodings of some non-English characters. For example, in the mystery encoding, the Hangul string
한국경북영덕군강구면
is encoded as:
0xed959c 0xeab5ad 0xeab2bd 0xebb63f 0xec983f 0xeb3f95 0xeab5b0 0xeab095 0xeab5ac 0xeba9b4
rather than the standard UTF-8 (the fourth, fifth, and sixth groups differ):
0xed959c 0xeab5ad 0xeab2bd 0xebb681 0xec9881 0xeb8d95 0xeab5b0 0xeab095 0xeab5ac 0xeba9b4
I'm seeing the same phenomenon with Cyrillic and Chinese characters: some characters have the same encoding as in UTF-8, and some are different. The garbled characters have the same byte width as the non-garbled ones, and I've verified that they aren't part of an extension set. Also, I've verified that it's not Java's "modified UTF-8".
Any other ideas what this may be?
BTW: I don't have access to the code or to the people who wrote the file.
Also, I'm on Mac OS X 10.11.6, in case that has anything to do with it.
Your example string consists of UTF-8, but with some byte values (namely 0x81 and 0x8D) replaced by the ASCII question mark ? (0x3F). A plausible explanation is that your example string has passed through a piece of software that tried to interpret its contents according to some other encoding (probably a single-byte character set), and replaced "invalid" characters with ? (analogously to how a Unicode text processor might replace invalid Unicode characters with U+FFFD).
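To see this concretely, here is a small Python sketch that compares the bytes from the question with the real UTF-8 encoding and lists the offsets that differ. The second half is only a guess on my part: it assumes the intermediate software used Windows-1252, where Python's `cp1252` codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined. The helper name `through_single_byte` is made up for illustration.

```python
# Hangul string and the bytes found in the file, both copied from the question.
text = "한국경북영덕군강구면"
observed = bytes.fromhex(
    "ed959c eab5ad eab2bd ebb63f ec983f eb3f95 eab5b0 eab095 eab5ac eba9b4"
)
expected = text.encode("utf-8")

# (offset, byte in real UTF-8, byte in the file) for every position that differs.
diffs = [(i, e, o) for i, (e, o) in enumerate(zip(expected, observed)) if e != o]
for i, e, o in diffs:
    print(f"offset {i}: 0x{e:02x} -> 0x{o:02x}")


def through_single_byte(data: bytes, codec: str = "cp1252") -> bytes:
    """Simulate software that reads bytes as `codec` and writes them back out.

    ASSUMPTION: the codec is Windows-1252. Bytes undefined in the codec become
    U+FFFD on decoding, and U+FFFD becomes b"?" on encoding.
    """
    return data.decode(codec, errors="replace").encode(codec, errors="replace")


# True means the guessed round trip reproduces the bytes seen in the file.
print(through_single_byte(expected) == observed)
```

If the round trip also reproduces the bytes for your Cyrillic and Chinese samples, that supports (but does not prove) the Windows-1252 guess; if it doesn't, try other single-byte codecs in place of `cp1252`.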
Unfortunately, this process is not reversible: since at least two distinct byte values (and perhaps more that don't happen to appear in your example) got replaced, there's no guaranteed way to identify the original byte value in every case. Depending on how important this is (that is, depending on how much time it's worth spending on it), you could potentially identify the complete set of bytes that got replaced, and write a program that tries each possible value for each such byte, comparing the resulting character sequences to (say) bigram frequencies from a corpus of text in the relevant language, and selecting the most probable byte. (Of course, this will make some mistakes. To estimate the resulting error rate, you can try the same process on known text.)
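A rough Python sketch of that idea follows. Everything in it is an assumption to adapt: `SUSPECT_BYTES` is a guessed replacement set (the bytes Python's `cp1252` codec leaves undefined), `toy_corpus` is a placeholder that happens to contain the right answer, and the function names are invented. It only repairs a ? sitting where a UTF-8 continuation byte belongs, and it picks candidates greedily from left to right.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: the set of byte values that were turned into "?".
SUSPECT_BYTES = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def seq_len(lead: int) -> int:
    """Expected length of a UTF-8 sequence, judged from its lead byte."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1  # ASCII, or a stray byte handled on its own


def split_chars(data: bytes):
    """Yield one chunk per character, letting 0x3F stand in for a continuation byte."""
    i = 0
    while i < len(data):
        end = min(i + seq_len(data[i]), len(data))
        j = i + 1
        while j < end and (0x80 <= data[j] <= 0xBF or data[j] == 0x3F):
            j += 1
        yield data[i:j]
        i = j


def candidates(chunk: bytes, suspects=SUSPECT_BYTES) -> list:
    """Return every character the chunk could decode to once each "?" is refilled."""
    holes = [k for k in range(1, len(chunk)) if chunk[k] == 0x3F]
    if not holes:
        return [chunk.decode("utf-8", errors="replace")]
    found = []
    for combo in product(suspects, repeat=len(holes)):
        trial = bytearray(chunk)
        for k, b in zip(holes, combo):
            trial[k] = b
        try:
            found.append(trial.decode("utf-8"))
        except UnicodeDecodeError:
            pass  # this filling is not valid UTF-8
    return found or ["\ufffd"]


def bigram_counts(corpus: str) -> Counter:
    """Count adjacent character pairs in the reference text."""
    return Counter(a + b for a, b in zip(corpus, corpus[1:]))


def repair(data: bytes, bigrams: Counter, suspects=SUSPECT_BYTES) -> str:
    """Greedy repair: score each candidate by the bigrams it forms with its neighbours."""
    options = [candidates(c, suspects) for c in split_chars(data)]
    out = []
    for idx, opts in enumerate(options):
        if len(opts) == 1:
            out.append(opts[0])
            continue
        prev = out[-1][-1] if out else ""
        # Only use the right-hand neighbour when it is not itself ambiguous.
        unambiguous_next = idx + 1 < len(options) and len(options[idx + 1]) == 1
        nxt = options[idx + 1][0][0] if unambiguous_next else ""
        out.append(max(opts, key=lambda c: bigrams[prev + c] + bigrams[c + nxt]))
    return "".join(out)


# Bytes from the question, plus a placeholder corpus (use a large real one in practice).
damaged = bytes.fromhex("ed959ceab5adeab2bdebb63fec983feb3f95eab5b0eab095eab5aceba9b4")
toy_corpus = "한국 경북영덕군 강구면, 경북영덕군 영덕읍"
print(repair(damaged, bigram_counts(toy_corpus)))
```

A greedy pass is the simplest thing that could work; if runs of adjacent damaged characters are common in your file, a Viterbi-style search over all candidates would probably do better. To estimate the error rate, damage some known-good text the same way and compare the repaired output with the original.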