#2491 UTF-8 Encoding

SlimerDude Tue 10 Nov 2015

Whilst writing the UTF-8 percent escaper in the last post I was unfortunate enough to peek into the brain-numbing nebulous headache that is character encoding. Still, I was conscious enough to spot this...

Going by the definition of UTF-8 in RFC 3629 (and the Wikipedia article), it would seem that UTF-8 can encode code points in the range 0x000000 -> 0x10FFFF.
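For reference, the ranges in RFC 3629 map onto encoded lengths of one to four bytes. The following is only a minimal sketch of that mapping in Java; the class name Utf8Length and the method utf8Length are made up for illustration and are not part of Fantom or the JDK. It returns -1 for values that UTF-8 cannot encode (anything above 0x10FFFF, and the surrogate range 0xD800 -> 0xDFFF).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not Fantom or JDK API).
class Utf8Length {

    /** Returns the number of bytes UTF-8 needs for a code point, or -1 if it cannot be encoded. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) return -1;       // outside the Unicode range
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return -1;  // surrogates are excluded by RFC 3629
        if (codePoint <= 0x7F)   return 1;
        if (codePoint <= 0x7FF)  return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;                                                   // 0x10000 -> 0x10FFFF
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            // Cross-check against the JDK's own UTF-8 encoder
            int jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X -> %d byte(s), JDK says %d%n", cp, utf8Length(cp), jdk);
        }
    }
}
```

Only the last branch (four bytes) covers code points above 0xFFFF, which is the range in question here.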

Whereas Fantom's Java code base seems to only recognise UTF-8 code points in the range 0x0000 -> 0xFFFF - as seen in:

  • fan.sys.Charset.Utf8Encoder
  • fan.sys.Charset.Utf8Decoder
  • fan.sys.Uri.percentEncodeChar()

I was just wondering why that is?

I also noted that fan.sys.FanInt.toChar() only recognises Unicode chars in the range 0x0000 -> 0xFFFF.

KevinKelley Tue 10 Nov 2015

I wondered about that before as well; I think it's because Java's internal representation of a Char is 2 bytes (UTF-16, Basic Multilingual Plane), likely because at the time 65K chars seemed like a lot. But Unicode kept growing...

Anyway since a Java string can only hold 16-bit chars, the upper planes can't be decoded one-to-one into an array of char. Probably the decoder should be throwing "unicode is hard and java is weak!" exceptions there...maybe better would be to not fail or throw, but substitute the Unicode replacement character U+FFFD (�) instead...

Java Character class javadoc
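To make the 16-bit limit concrete, here is a small Java sketch of what happens to a code point above 0xFFFF in a java.lang.String. The class SurrogateDemo and the method toJavaString are hypothetical names used only for this example; the Character and String calls are standard JDK methods. The U+FFFD fallback is just one possible policy for out-of-range values, as suggested above.

```java
// Hypothetical demo class, for illustration only.
class SurrogateDemo {

    /** Converts a code point to a Java String, substituting U+FFFD when it is outside 0 -> 0x10FFFF. */
    static String toJavaString(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) return "\uFFFD";
        // For code points above 0xFFFF, toChars returns two chars: a high and a low surrogate
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = toJavaString(0x1F600);
        System.out.println(s.length());                             // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (actual character)
        System.out.println(Integer.toHexString(s.charAt(0)));       // d83d (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1)));       // de00 (low surrogate)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}
```

So the upper planes do fit in a Java string, but each such character takes two chars, and any code that indexes by char rather than by code point has to be aware of that.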

brian Tue 10 Nov 2015

Yeah it's really a Java thing. They originally didn't support anything but 16-bit chars in strings. And that is still how Java works best. Java does support Unicode values above 0xFFFF using something called supplementary characters / surrogates. It's complicated, and I haven't ever really taken the time to understand the performance of using some of these methods for Fantom. It used to be that there wasn't any practical reason for supporting the higher Unicode planes. But emoticons are higher than 0xFFFF and are probably the chars that we will want to support at some point.

The good news is that from the start, all character representation in Fantom uses a 64-bit integer (long in Java), so we should be future proof in the APIs. It's really just an internal detail how we map String/char support in the Java runtime.

SlimerDude Tue 10 Nov 2015

Cool, that's fine for converting to and from native strings, no need to introduce supplementary character surrogates just yet!

But I think fan.sys.Charset.Utf8Encoder / Utf8Decoder and others should still be updated to handle the full UTF-8 range. That's because they're used by Buf and Streams, which concentrate on byte data. OutStream.writeChar(Int char), for instance, has no such Java limitation but still suffers from a 0xFFFF limit.
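As a rough idea of what a full-range encoder could look like on the byte side, here is a sketch in plain Java. It is not Fantom's actual Utf8Encoder; the class Utf8WriteSketch and its writeChar / encode methods are hypothetical stand-ins that take the character as an int and write straight to a java.io.OutputStream, so no Java char or String is involved. Throwing on invalid input is an assumption of this sketch, not a statement of how Fantom should behave.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Fantom's fan.sys.Charset.Utf8Encoder.
class Utf8WriteSketch {

    /** Writes one code point (0 -> 0x10FFFF, excluding surrogates) as UTF-8 bytes. */
    static void writeChar(OutputStream out, int c) throws IOException {
        if (c < 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            throw new IllegalArgumentException("Invalid code point: 0x" + Integer.toHexString(c));
        if (c <= 0x7F) {
            out.write(c);                          // 0xxxxxxx
        } else if (c <= 0x7FF) {
            out.write(0xC0 | (c >> 6));            // 110xxxxx
            out.write(0x80 | (c & 0x3F));          // 10xxxxxx
        } else if (c <= 0xFFFF) {
            out.write(0xE0 | (c >> 12));           // 1110xxxx
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            out.write(0xF0 | (c >> 18));           // 11110xxx, the branch needed above 0xFFFF
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
    }

    /** Convenience wrapper: encodes a single code point into a byte array. */
    static byte[] encode(int c) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeChar(buf, c);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);     // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x1F600))
            System.out.printf("%02X ", b & 0xFF);  // F0 9F 98 80
        System.out.println();
    }
}
```

The first three branches are the familiar one to three byte forms; only the final branch is new compared to an encoder that stops at 0xFFFF.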
