Quote from Rydian;1277820:
Base: The lack of a double equals at the end is probably the kicker.
Encoding: I tend to need conversion functions to work with XMLs in code as well.
[code]$file_contents = mb_convert_encoding(file_get_contents($input_folder.$file), "ASCII", "UCS-2");[/code]
Except in the cases when I'm working with actual XML functions. But that tends to be /more/ of a headache.
Your headache partially (mostly?) stems from PHP and its crappy non-ASCII support. Do note that converting the files to ASCII is a lossy conversion that could very well break everything ever. Pretty much three things can happen, and I don't care to read the PHP docs to find out which:
- "Wide" chars (2 byte) are simply dropped
- "Wide" chars are converted into a 1 byte char, like "?". (Anyone remember "The ??????? event is currently in progress, ????:soandso, blahblah")
- "Wide" chars are split into two single byte chars. For example, take the character 硃. 硃 is a CJK Unified Ideograph that means bright red, according to here, though it's also an important base in Japanese characters like this one: 研. Its code point is U+7843, which UTF16 stores as the bytes 0x43 0x78 in little-endian order (or reversed, depending on the BOM). When converted to ASCII, it might just be split into 0x43 0x78, which is the ASCII string "Cx", as demo'd by this C# code:
[CODE]using System;
using System.Text;
var c = "硃";
var cBytes = Encoding.Unicode.GetBytes(c); // UTF-16 LE: 0x43 0x78
Console.WriteLine(Encoding.ASCII.GetString(cBytes)); // prints "Cx"[/CODE]
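For comparison, here's a rough Python sketch of all three outcomes from the list above. This is Python's behavior, not PHP's, so it says nothing about which one mb_convert_encoding actually picks. The variable names are just for illustration.

```python
ch = "硃"  # code point U+7843

# Outcome 1: the wide char is simply dropped.
dropped = ch.encode("ascii", errors="ignore")

# Outcome 2: the wide char is replaced by a single "?".
replaced = ch.encode("ascii", errors="replace")

# Outcome 3: the raw UTF-16 LE bytes (0x43 0x78) are reread as ASCII.
split = ch.encode("utf-16-le").decode("ascii")

print(dropped, replaced, split)  # b'' b'?' Cx
```

Only the second outcome is obviously lossy when you look at the result. The first and third fail silently.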
You can see why this could be a problem depending on where those characters appeared in the file. A wide getting split into bytes that include a newline would mess up localization. A wide getting split into bytes that include a quote or angle bracket would handily render the XML useless.
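To get a feel for how often that could bite, here's a small Python sketch that flags characters whose UTF-16 LE bytes contain an XML-significant ASCII byte. The function name risky_chars and the choice of "dangerous" bytes are my own assumptions for the demo, and it only models the byte-splitting case.

```python
# ASCII bytes that would change XML or line structure if they appeared alone.
XML_SPECIAL = set(b"<>&\"'\n\r")

def risky_chars(text):
    """Return (char, hex bytes) for non-ASCII chars whose UTF-16 LE
    encoding contains one of the XML_SPECIAL byte values."""
    hits = []
    for ch in text:
        if ord(ch) < 0x80:
            continue  # genuine ASCII, already a legitimate single-byte char
        raw = ch.encode("utf-16-le")
        if XML_SPECIAL & set(raw):
            hits.append((ch, raw.hex()))
    return hits

# U+3C22 becomes the bytes 22 3C (a quote and "<");
# U+0A41 becomes 41 0A ("A" and a newline).
sample = "硃" + "\u3c22" + "\u0a41"
print(risky_chars(sample))
```

硃 itself is harmless here because 0x43 and 0x78 are just letters, but the other two would inject a quote, an angle bracket, or a line break.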
Note that languages without proper Unicode support (PHP, C, Python < v3) are very nasty to do Mabi-related work in, for this reason. C#'s XDocument, for example, natively supports Unicode (as does all of the .NET library) and so needs nothing special to work with Mabi's files. PHP and C need wrappers, if they support it at all.
And let's not even talk about case folding.
Quote from Yai;1277835:
I don't know if this is relevant, but for the reading/writing the .txt files for the english patches I make for KR Mabi, I had to use UTF-16 LE. No other format is recognised by the client. I would not be surprised if they use that for absolutely everything.
Yes, the game uses UTF16 for basically eeeeeeeeeeeverything. However, I have successfully converted the XML and text files to UTF8 with no issues, and a meager savings of 200 MB. If you tried converting to ASCII, the files would have been corrupted, as I explained above. It's also possible that the Korean game just forces UTF16, since they know most of the stuff is going to be in Korean.
Did you remember to put/update the encoding in the XML prolog?
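Here's a minimal Python sketch of that kind of conversion, including the prolog update. It is not the tool I actually used. The function name utf16_xml_to_utf8 is made up, and it assumes the input starts with a BOM (without one, Python's "utf-16" codec falls back to the platform's byte order).

```python
import re

def utf16_xml_to_utf8(data: bytes) -> bytes:
    """Decode BOM-prefixed UTF-16 XML and re-encode it as UTF-8,
    rewriting the encoding declaration in the prolog if there is one."""
    text = data.decode("utf-16")  # reads the BOM to pick the byte order
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r'\g<1>\g<2>UTF-8\g<2>',
        text,
        count=1,
    )
    return text.encode("utf-8")

sample = '<?xml version="1.0" encoding="utf-16"?><msg>硃</msg>'.encode("utf-16")
print(utf16_xml_to_utf8(sample).decode("utf-8"))
```

If the prolog still says utf-16 after the bytes are UTF-8, a strict parser is likely to reject the file, which is why the declaration gets rewritten in the same pass.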
Quote from Rydian;1277837:
Text is definitely 2-bytes in RAM in the client, I can tell you that much. Everywhere.
Even things like string references (code.interface.msg.channel_move.current) are.
Most programs will normalize text to a single format, just for the sanity of all involved. C usually uses WCHARs. C#, for example, converts all text to UTF16 internally, and reconverts it transparently when saved, usually to UTF8.
The framework designers chose UTF16 to support interop with older languages.
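As a rough illustration of what "2 bytes everywhere" looks like, this Python sketch compares the byte sizes of that string reference and of a short Korean string in both encodings. The Korean sample is my own pick for the demo, not something pulled from the client.

```python
key = "code.interface.msg.channel_move.current"
key_utf16 = key.encode("utf-16-le")
key_utf8 = key.encode("utf-8")

# ASCII-range text doubles in size, and every second byte is 0x00.
print(len(key_utf8), len(key_utf16), set(key_utf16[1::2]))

korean = "마비노기"
# Hangul syllables take 2 bytes each in UTF-16 but 3 bytes each in UTF-8.
print(len(korean.encode("utf-16-le")), len(korean.encode("utf-8")))
```

So UTF8 only wins on size where the text is mostly ASCII, like XML markup and string references. Korean text itself comes out bigger.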