Mabinogi Encoding Discussion

This is an archive of the mabination.com forums, which were active from 2010 to 2018. You cannot register, post or otherwise interact with the site other than browsing the content for historical purposes. The content is provided as-is, from the moment of the last backup taken of the database in 2019. Image and video embeds are disabled on purpose and represented textually since most of those links are dead.

To view other archive projects go to https://archives.mabination.com

Osayidan wrote on 2015-06-09 23:41

Nexon does weird shit... I'm not a developer in any way so I wouldn't know why, but when I made the housing search I found out the data returned from queries to that server is in UTF16. I'd never seen anyone use UTF16 before and haven't seen it since. Had to find some weird ass conversion function to be able to work with the data.
X wrote on 2015-06-09 23:51

UTF16 is used for a lot of Asian character sets, which need the full two bytes of storage to encode their characters. Pretty much all of Mabi is in UTF16 (like the XML files). UTF16 is otherwise relatively unused. A notable exception that I can think of off the bat is the .NET framework, which stores strings internally as UTF16.

UTF8 is what most of the computing world uses. It's kind of a hybrid - it behaves like ASCII (a 7-bit character set stored one byte per character) for plain English text, and switches to multi-byte sequences of two to four bytes for everything else. Most Asian characters take three bytes in UTF8, versus two in UTF16. So UTF8 is the most compact for ASCII-heavy text while still covering the full Unicode range.

There are other Unicode encodings, like UTF7 (used by mail servers) and UTF32 (I have no idea why this even exists). Wikipedia has a wonderful article on it.

Mabi was coded in UTF16 to support Korean, I guess, and they didn't like UTF8? It's funny, you can cut the size of the XML files in half without losing anything if you convert them to UTF8... Anyway, so that's why housing search uses UTF16.
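
To put rough numbers on the size claim, here is a minimal Python sketch comparing byte counts. The sample strings are made up for illustration and are not taken from Mabi's files. It suggests the "half the size" saving applies to ASCII-heavy content such as XML markup, while Korean text itself would come out larger in UTF8.

```python
# Compare how many bytes the same text takes in UTF-8 and UTF-16.
# Both sample strings are hypothetical, not real game data.

def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

ascii_markup = '<item id="1" name="sword"/>'  # pure ASCII, like most XML structure
korean_text = "마비노기"                        # four Hangul syllables

ascii_sizes = encoded_sizes(ascii_markup)
korean_sizes = encoded_sizes(korean_text)

print(ascii_sizes)   # UTF-16 is exactly twice UTF-8 for pure ASCII
print(korean_sizes)  # UTF-8 needs 3 bytes per syllable, UTF-16 needs 2
```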

Also yes, the news stuff is base 62, not 64.
Rydian wrote on 2015-06-10 08:27

Base: The lack of a double equals at the end is probably the kicker.

Encoding: I tend to need conversion functions to work with XMLs in code as well.
[code]$file_contents = mb_convert_encoding(file_get_contents($input_folder.$file), "ASCII", "UCS-2");[/code]
Except in the cases when I'm working with actual XML functions. But that tends to be /more/ of a headache.
Yai wrote on 2015-06-10 13:05

I don't know if this is relevant, but for the reading/writing the .txt files for the english patches I make for KR Mabi, I had to use UTF-16 LE. No other format is recognised by the client. I would not be surprised if they use that for absolutely everything.
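
For anyone scripting this, a small Python sketch of reading and writing UTF-16 LE text. The helper names are hypothetical, and whether the client wants a byte order mark (BOM) at the start of the file is an assumption here, so it is left as a switch.

```python
import codecs

def encode_utf16le(text: str, with_bom: bool = True) -> bytes:
    """Encode text as UTF-16 LE, optionally prefixed with the little-endian BOM."""
    body = text.encode("utf-16-le")  # this codec never writes a BOM on its own
    return codecs.BOM_UTF16_LE + body if with_bom else body

def decode_utf16le(data: bytes) -> str:
    """Decode UTF-16 LE bytes, dropping a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    return data.decode("utf-16-le")

encoded = encode_utf16le("Hi")
print(encoded)                  # BOM (FF FE) followed by two bytes per character
print(decode_utf16le(encoded))
```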
Rydian wrote on 2015-06-10 14:15

Text is definitely 2-bytes in RAM in the client, I can tell you that much. Everywhere.
Even things like string references (code.interface.msg.channel_move.current) are.
X wrote on 2015-06-10 18:33

Quote from Rydian;1277820:
Base: The lack of a double equals at the end is probably the kicker.

Encoding: I tend to need conversion functions to work with XMLs in code as well.
[code]$file_contents = mb_convert_encoding(file_get_contents($input_folder.$file), "ASCII", "UCS-2");[/code]
Except in the cases when I'm working with actual XML functions. But that tends to be /more/ of a headache.

Your headache partially (mostly?) stems from PHP and its crappy non-ascii support. Do note that if you convert the files to ASCII, it's a lossy conversion that could very well break everything ever. Pretty much three things can happen, and I don't care to read the PHP docs to find out which:

- "Wide" chars (2 byte) are simply dropped
- "Wide" chars are converted into a 1 byte char, like "?". (Anyone remember "The ??????? event is currently in progress, ????:soandso, blahblah")
- "Wide" chars are split into two single byte chars. For example, take the character 硃. 硃 is a CJK Unified Ideograph that means bright red, according to here, though it's also an important base in Japanese characters like this one: 研. It is encoded in UTF16 by 0x4378 (or reversed, depending on the BOM). When converted to ASCII, it might just be split into 0x43 0x78, which is the ASCII string "Cx", as demo'd by this C# code:

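[CODE]using System;
using System.Text;

var c = "硃"; // U+7843
var cBytes = Encoding.Unicode.GetBytes(c); // UTF-16 LE bytes: 0x43 0x78

// Reading those two bytes back as ASCII prints "Cx"
Console.WriteLine(Encoding.ASCII.GetString(cBytes));[/CODE]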

You can see why this could be a problem depending on where those characters appeared in the file. For example, a wide getting converted into a newline would mess up localization. A wide getting converted into a quote or angle bracket, for example, would handily render XML useless.
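
The three outcomes listed above can be reproduced in a short Python 3 sketch. This only illustrates the failure modes. Which one PHP's mb_convert_encoding actually picks is not verified here. The second sample character (U+3C3C) is chosen purely because both of its UTF-16 bytes happen to be 0x3C, the ASCII angle bracket.

```python
ch = "硃"  # U+7843, the character from the example above

# 1. Wide chars are simply dropped.
dropped = ch.encode("ascii", errors="ignore")

# 2. Wide chars are replaced with a single "?".
replaced = ch.encode("ascii", errors="replace")

# 3. The raw UTF-16 LE bytes are reinterpreted as two ASCII chars.
split = ch.encode("utf-16-le").decode("ascii")

# Same reinterpretation with a character whose bytes are both 0x3C ("<"),
# which is the kind of thing that would break XML.
angle = "\u3c3c".encode("utf-16-le").decode("ascii")

print(dropped, replaced, split, angle)
```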

Note that languages without proper Unicode support (PHP, C, Python < v3) are very nasty to do Mabi-related work in, for this reason. C#'s XDocument natively supports Unicode (as does all of the .NET library) and so needs nothing special to work with Mabi's files. PHP and C, on the other hand, need wrappers, if they support it at all.

And let's not even talk about case folding.

Quote from Yai;1277835:
I don't know if this is relevant, but for the reading/writing the .txt files for the english patches I make for KR Mabi, I had to use UTF-16 LE. No other format is recognised by the client. I would not be surprised if they use that for absolutely everything.

Yes, the game uses UTF16 for basically eeeeeeeeeeeverything. However, I have successfully converted the XML and text files to UTF8 with no issues, and a meager savings of 200 MB. If you tried converting to ASCII, the files would have been corrupted, as I explained above. It's also possible that the Korean game just forces UTF16, since they know most of the stuff is going to be in Korean.

Did you remember to put/update the encoding in the xml prolog?
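
As a rough illustration of why the prolog matters, here is a Python sketch that re-encodes a UTF-16 XML document as UTF-8 and rewrites the declared encoding to match. The function name, the regex and the sample document are assumptions for illustration, not anything taken from the game's files or tools.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside an XML prolog, with either quote style.
PROLOG_ENCODING = re.compile(r"""(<\?xml[^>]*?encoding=)(["'])[^"']*\2""")

def utf16_xml_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 XML bytes as UTF-8 and update the prolog's encoding label."""
    text = data.decode("utf-16")  # the "utf-16" codec consumes a leading BOM
    text = PROLOG_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")

# Hypothetical sample document, encoded as UTF-16 with a BOM.
original = '<?xml version="1.0" encoding="utf-16"?><root name="마비노기"/>'.encode("utf-16")
converted = utf16_xml_to_utf8(original)

# The converted bytes parse cleanly because the prolog now matches the real encoding.
root = ET.fromstring(converted)
print(converted.decode("utf-8"))
print(root.get("name"))
```

If the prolog still claimed utf-16 while the bytes were UTF-8, a strict parser would be expected to reject or misread the file, which is one plausible way a converted file ends up "not recognised".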

Quote from Rydian;1277837:
Text is definitely 2-bytes in RAM in the client, I can tell you that much. Everywhere.
Even things like string references (code.interface.msg.channel_move.current) are.

Most programs will normalize text to a single format, just for the sanity of all involved. C usually uses WCHARs. C#, for example, converts all text to UTF16 internally and reconverts it transparently when saved, usually to UTF8. The framework designers chose UTF16 to support interop with older languages.
Rydian wrote on 2015-06-10 18:56

It's not even the encoding anymore, it's PHP's XML support. It's another tacked-on library, and while googling for how to do the most basic stuff I ran into comments everywhere stating that people just gave up and used a function to turn XML entries into arrays-of-arrays and then just dealt with them like that.
X wrote on 2015-06-10 19:04

[SPOILER="My face"]
[Image: http://cdn.alltheragefaces.com/img/faces/large/disgusted-oh-god-l.png]
[/SPOILER]

[Image: http://qph.is.quoracdn.net/main-qimg-2e57ea9c62e9d717100610e3c846df42?convert_to_webp=true]

[SIZE="5"]

That's not what... Why even do... I don't... But you lose... I give up.[/SIZE]

[Image: http://i.lvme.me/v8ccqht.jpg]
Osayidan wrote on 2015-06-10 19:06

Quote from Rydian;1277857:
It's not even the encoding anymore, it's PHP's XML support. It's another tacked-on library, and while googling for how to do the most basic stuff I ran into comments everywhere stating that people just gave up and used a function to turn XML entries into arrays-of-arrays and then just dealt with them like that.

Pretty much what I had to do when I made my RSS feed server thingy:
https://osayidan.net/20130402/feedbooru-danbooru-rss-generator/

Danbooru's RSS feeds sucked so I wanted to use their API to make my own feeds. Then I found out PHP sucks ass at XML so I had to break all the API's XML results into arrays and do stuff with it then rebuild the RSS feed's XML the way I wanted it. I imagine it's already a pain for people who actually do programming but for someone like me who just does some PHP stuff when I can't find what I want so I do it myself, it was a nightmare D:
Rydian wrote on 2015-06-10 19:38

Ugh man, I had to go to the 2nd page of google results (desperation ho!) to find what I needed in basic XML parsing/searching because every example/sample on the first page both...

1 - Thought I would only care about info in an item and not an attribute of a self-closed one.
2 - Assumed I already knew which structure I would be pulling info from, 'cause they only pulled via index IDs.

Finally looked through enough comments to see somebody state that I could just foreach() the structure and compare stuff in the loop. I don't even know if PHP has a proper way of returning results via matching attributes at this point 'cause that's the only way I could find to do it.
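
For comparison, matching self-closed elements by attribute looks roughly like this in Python's standard library, using an XPath-style predicate instead of a manual loop. The sample XML and names are invented for illustration. PHP's SimpleXML also has an xpath() method that may cover the same case, though that is not tested here.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: self-closed elements that carry everything in attributes.
SAMPLE = """<items>
  <item id="1" name="sword"/>
  <item id="2" name="shield"/>
  <item id="3" name="sword"/>
</items>"""

root = ET.fromstring(SAMPLE)

# Select every <item> whose name attribute equals "sword", at any depth.
swords = root.findall(".//item[@name='sword']")
sword_ids = [el.get("id") for el in swords]

print(sword_ids)
```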