More Unicode ruminations
I've pretty much decided to go forward with Unicode support in 513, and it's a huge deal. It's going to impact a large part of the frontend, and because the string refactor changed so much, I'll probably have to redo the early speculative work I put into this, as I mentioned before. It's really okay though; I'm glad of it.

A lot of what I have to redo concerns the skin functions. Much of the skin code calls getters and setters for various parameters, a number of which are strings. However, I'm using the CString class for all of that, and once I turn on Unicode support CString suddenly becomes wide-character. That runs counter to the UTF-8 Everywhere manifesto, which suggests not converting to wide-char format until absolutely necessary, and converting back to UTF-8 as soon as possible. In general I consider that extremely wise advice. It means I'll have to alter all the skin routines, and anything that works with them, to use DMString, which is always 8-bit and can hold UTF-8. For the most part this shouldn't be horrible, as I designed DMString so I could drop it in as a replacement in most of the places CString is currently used.
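To make the "convert only at the boundary" advice concrete, here is a minimal sketch of a UTF-8 to UTF-16 conversion that could sit right at a wide-character API call. The names decode_utf8 and utf8_to_utf16 are hypothetical, std::string stands in for an 8-bit string class like DMString, and the decoder is deliberately simplified: it does not reject overlong encodings, and it replaces malformed input with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one code point starting at s[i] and advance i past it.
// Simplified: malformed or truncated input yields U+FFFD and skips one byte.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    char32_t cp;
    std::size_t extra;  // number of continuation bytes expected
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }
    if (i + extra >= s.size()) { ++i; return 0xFFFD; }  // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Convert a whole UTF-8 string to UTF-16, the form Windows wide APIs expect.
std::u16string utf8_to_utf16(const std::string& s) {
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        char32_t cp = decode_utf8(s, i);
        // Out-of-range values and lone surrogates become U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

In a setup like this, the getters and setters would pass the 8-bit string around untouched, and only the final call into the windowing layer would widen it. On Windows the platform's own MultiByteToWideChar with CP_UTF8 would be the more likely choice in practice; the hand-rolled version is only here to show what the conversion involves.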

For string procs in the language I'm very torn. In UTF-8 a single character can take more than one byte, so a byte position is no longer the same thing as a character position. That means the indexes used by copytext(), text2ascii(), findtext(), etc. will not necessarily be accurate. Do I add something to the routines to do a more complicated calculation? (And if so, do I add a flag to detect UTF-8 and only use the more complex stuff then?) Or should I just stick with the exact byte index? The waters get muddier for regular expressions if I allow non-ASCII characters in the character class (brackets) element.
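As a rough illustration of the "more complicated calculation" option, the sketch below maps a character index to a byte offset by skipping UTF-8 continuation bytes, with a flag that short-circuits the scan for pure-ASCII text. It assumes 1-based character indexes and well-formed UTF-8; the function names and the ascii_only flag are hypothetical, not anything from the existing code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// True when every byte is below 0x80, so byte offsets and character
// positions coincide. A real string class might cache this result.
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char b) { return b < 0x80; });
}

// Map a 1-based character index to a 0-based byte offset.
// Returns s.size() when the index is past the end of the string.
std::size_t char_index_to_byte_offset(const std::string& s,
                                      std::size_t char_index,
                                      bool ascii_only) {
    assert(char_index >= 1);
    if (ascii_only) {
        // Fast path: one byte per character.
        return std::min(char_index - 1, s.size());
    }
    std::size_t seen = 0;  // characters counted so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Every byte that is not a continuation byte starts a new character.
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (++seen == char_index) return i;
        }
    }
    return s.size();
}
```

The tradeoff is visible in the loop: the character-accurate lookup is a linear scan, while the ASCII fast path stays constant-time, which is the argument for detecting UTF-8 up front and only paying for the scan when it is needed.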

And to add to the confusion, if the regex code were updated to use a strictly wide-char format, it would run into the fact that the wide-char implementation differs between Windows and Linux. Windows uses UTF-16, which means characters like emoji that don't fit in 16 bits get encoded as two wide characters (a surrogate pair), whereas Linux uses 32-bit wide chars. To deal with this I might need to do something weird like add new regex opcodes that would handle things like character ranges in a new way, not using the handy wcsspn()/wcscspn() routines. It's a puzzler.
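A small sketch of why wcsspn() gets awkward here: it works on individual wchar_t units, so on a platform with 16-bit wchar_t it can match half of a surrogate pair. The helpers encode_wide and naive_class_span are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Encode one code point as wchar_t units for the current platform:
// one unit where wchar_t is 32-bit, a surrogate pair where it is
// 16-bit and the code point is above U+FFFF.
std::wstring encode_wide(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::wstring out;
    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}

// Treat class_cp as a one-character bracket class and ask wcsspn()
// how many wide units of text_cp it "matches".
std::size_t naive_class_span(char32_t text_cp, char32_t class_cp) {
    std::wstring text = encode_wide(text_cp);
    std::wstring set = encode_wide(class_cp);
    return std::wcsspn(text.c_str(), set.c_str());
}
```

Calling naive_class_span(0x1F601, 0x1F600) compares two different emoji that share the same high surrogate. With 32-bit wchar_t the span should be 0, since the characters differ. With 16-bit wchar_t the span should be 1, because the shared high surrogate is in the set, so the class has matched half of a character it should not match at all.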

Switching between upper- and lowercase characters might not be so bad, though. There's already public domain code available to help with that, so at least things like lowertext() should be manageable.

Other than the above concerns, most of the issues will be on the frontend. Surprisingly, not much of the backend code should have to change to accommodate UTF-8. The real issue is gonna be changing over so much code on the front. With the new icon editor stuff I tried to prepare for this change as best I could, whenever I could remember to, to give myself at least a little less work to do. Basically any part of the code that currently talks to the backend needs to be aware of the need for UTF-8/wide conversion, and as I mentioned, the skin code should be using UTF-8 right up until it comes time to make the actual conversion.

So that's a little of what I'll be wrestling with in the coming weeks, and of course any input is appreciated. Wiser developers than myself have tackled problems like this, albeit maybe not on this scope, so I'd be interested to discover their solutions.
