utf-8 - Mystery UTF-8-like encoding


I've been given a file that is supposedly in UTF-8, but it contains weird encodings of non-English characters. For example, in the mystery encoding, the Hangul string

한국경북영덕군강구면

is encoded as:

0xed959c 0xeab5ad 0xeab2bd 0xebb63f 0xec983f 0xeb3f95 0xeab5b0 0xeab095 0xeab5ac 0xeba9b4

(the fourth, fifth, and sixth characters differ) rather than the standard UTF-8:

0xed959c 0xeab5ad 0xeab2bd 0xebb681 0xec9881 0xeb8d95 0xeab5b0 0xeab095 0xeab5ac 0xeba9b4

I'm seeing the same phenomenon with Cyrillic and Chinese characters: some characters have the same encoding as in UTF-8, and some are different. The garbled characters have the same byte width as the non-garbled ones, and I've verified that they aren't part of an extension set. Also, I've verified that it is not Java "modified UTF-8".

Any other ideas about what this may be?

BTW: I don't have access to the code or to the people who wrote the file.

Also, I'm on a Mac running 10.11.6, in case that has anything to do with it.

Your example string consists of UTF-8, but with some byte values (namely \x81 and \x8d) replaced by the ASCII question mark ? (\x3f). The most plausible explanation is that your example string has passed through a piece of software that tried to interpret its contents according to some other encoding (probably a single-byte character set), and that replaced "invalid" characters with ? (analogously to how a Unicode text processor might replace invalid Unicode characters with U+FFFD).
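To check this against the question's data, here is a minimal sketch that lines up the two byte dumps and reports every position where they differ. The variable names are mine, and the hex strings are copied from the question with the 0x prefixes dropped.

```python
# Byte dumps from the question: the mystery encoding and standard UTF-8.
garbled = bytes.fromhex(
    "ed959c eab5ad eab2bd ebb63f ec983f eb3f95 eab5b0 eab095 eab5ac eba9b4"
)
expected = bytes.fromhex(
    "ed959c eab5ad eab2bd ebb681 ec9881 eb8d95 eab5b0 eab095 eab5ac eba9b4"
)

# Collect (offset, expected byte, garbled byte) wherever the two dumps differ.
diffs = [
    (offset, good, bad)
    for offset, (good, bad) in enumerate(zip(expected, garbled))
    if good != bad
]
for offset, good, bad in diffs:
    print(f"offset {offset}: expected 0x{good:02x}, found 0x{bad:02x}")

# Distinct original byte values that were lost, and what replaced them.
replaced_values = sorted({good for _, good, _ in diffs})
replacement_values = sorted({bad for _, _, bad in diffs})
```

Every mismatch is a byte in the 0x80-0xBF continuation range on the standard side and 0x3f on the mystery side, which is what you would expect from a lossy pass through another character set.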

Unfortunately, this process is not reversible. Since at least two distinct byte values (and probably more that don't happen to appear in your example) got replaced, there's no guaranteed way to identify the original byte value in every case. Depending on how important this is (that is, depending on how much time it's worth spending on it), you could potentially identify the complete set of bytes that got replaced, and write a program that tries each possible value for each such byte, comparing the resulting character sequences to (say) bigram frequencies from a corpus of text in the relevant language, and selecting the most probable byte. (Of course, this will make some mistakes. To estimate the resulting error rate, you can try the same process on a known text.)
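A rough sketch of that idea, under some loud assumptions: only continuation bytes were replaced (a ? standing in for a lost lead byte is not handled), every value from 0x80 to 0xBF is a candidate unless you pass a narrower set, and each damaged character is scored only against the character to its left. The helper names `seq_len`, `bigram_counts`, and `repair` are made up for illustration.

```python
from collections import Counter
from itertools import product


def seq_len(lead: int) -> int:
    """Expected length of a UTF-8 sequence, judged from its lead byte."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1  # ASCII, or a stray byte that is not a valid lead


def bigram_counts(corpus: str) -> Counter:
    """Count adjacent character pairs in a reference text."""
    return Counter(zip(corpus, corpus[1:]))


def repair(garbled: bytes, counts: Counter, candidates=range(0x80, 0xC0)) -> str:
    out = []
    i = 0
    while i < len(garbled):
        raw = garbled[i:i + seq_len(garbled[i])]
        # A 0x3f where a continuation byte belongs can only be damage.
        holes = [k for k in range(1, len(raw)) if raw[k] == 0x3F]
        best = None
        if holes:
            trial = bytearray(raw)
            for fill in product(candidates, repeat=len(holes)):
                for k, value in zip(holes, fill):
                    trial[k] = value
                try:
                    char = bytes(trial).decode("utf-8")
                except UnicodeDecodeError:
                    continue  # this filling is not valid UTF-8
                score = counts[(out[-1][-1], char)] if out else 0
                if best is None or score > best[0]:
                    best = (score, char)
        # Undamaged or unrepairable chunks are decoded as they are.
        out.append(best[1] if best else raw.decode("utf-8", errors="replace"))
        i += len(raw)
    return "".join(out)


# Toy check on the question's data. The "corpus" here is just the known-good
# string itself, so this exercises the mechanics and says nothing about accuracy.
garbled = bytes.fromhex(
    "ed959c eab5ad eab2bd ebb63f ec983f eb3f95 eab5b0 eab095 eab5ac eba9b4"
)
known_good = bytes.fromhex(
    "ed959c eab5ad eab2bd ebb681 ec9881 eb8d95 eab5b0 eab095 eab5ac eba9b4"
).decode("utf-8")
repaired = repair(garbled, bigram_counts(known_good))
print(repaired)
```

With a real corpus you would build `counts` from a large body of text in the relevant language. When no candidate has ever been seen after the previous character, this sketch simply keeps the first valid one, so ties are effectively guesses.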

