#233 String components

tactics Fri 30 May 2008

I was wondering, why did Fan decide to go with using integral code points instead of a Character class? It seems like a dedicated Char class would be more natural.

I know Ruby did it this way originally, then switched over to a Character class, and Java had Character from the beginning.

Having a Char class would be nicer because it gives you the ability to inspect the Unicode properties of the symbol, such as its charset and case. It would also print in a more natural way and might be useful for quick and dirty code, using chars as a super-lightweight mnemonic enum until you go back to rewrite it.

brian Fri 30 May 2008

Well, a Unicode code point is just an integer. In Java you don't really work with Character or Integer, but rather with char and int, and in reality they are handled the same way (both are a 32-bit integer on the stack). Character and Integer have convenience methods, but in Fan all of those methods are just on Int.
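To make the Java side of that comparison concrete, here is a minimal sketch. The class name CodePointDemo and its helper methods are made up for illustration; the Character methods it calls are the standard overloads that take a plain int code point.

```java
// Sketch only: CodePointDemo and its helpers are hypothetical names.
class CodePointDemo {
    // Classify a code point passed as a plain int.
    static boolean isLower(int codePoint) {
        return Character.isLowerCase(codePoint);
    }

    // Upper-case a code point; the result is still just an int.
    static int upper(int codePoint) {
        return Character.toUpperCase(codePoint);
    }

    // Turn a code point back into a one-character string for printing.
    static String show(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        char c = 'a';
        int i = c;                           // implicit widening, no conversion call
        System.out.println(i);               // 97
        System.out.println(isLower(i));      // true
        System.out.println(show(upper(i)));  // A
        System.out.println((char) (c + 1));  // b
    }
}
```

Nothing here allocates a Character or Integer object; the values stay primitive until show builds a String for output.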

So I don't view an integer and a character as different types. I think char (as a Unicode code point) is just a use case of integer. Fan's philosophy is to prefer a smaller number of types with many methods - this is especially important with something like an Int/Char because conversions require an object allocation.

So if you take a look at the Int class you will see it supports all the methods you would typically use for both a mathematical integer and a Unicode char:

'a'.toChar   =>  "a"
'a'.isLower  =>  true
'a'.upper    =>  'A'
'3'.isDigit  =>  true

Note that lower, upper, toLower, and toUpper are for ASCII only, and the locale versions work for any Unicode char with the current locale (this avoids the Turkish i problem that Java has).
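For anyone who hasn't run into the Turkish i problem: in Java, String.toUpperCase() with no argument uses the default locale, and under a Turkish locale a lowercase i maps to a dotted capital İ (U+0130) rather than I. A rough sketch of the difference follows. The asciiUpper helper is a hypothetical stand-in for an ASCII-only conversion, not Fan's actual implementation.

```java
import java.util.Locale;

// Sketch only: TurkishIDemo and asciiUpper are hypothetical names.
class TurkishIDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // ASCII-only upper-casing: changes a-z and leaves every other value alone,
    // so the result never depends on the machine's locale.
    static int asciiUpper(int ch) {
        return (ch >= 'a' && ch <= 'z') ? ch - 32 : ch;
    }

    // Locale-aware upper-casing with the locale passed explicitly.
    static String localeUpper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        System.out.println((char) asciiUpper('i'));            // I
        System.out.println(localeUpper("title", Locale.ROOT)); // TITLE
        System.out.println(localeUpper("title", TURKISH));     // TİTLE
    }
}
```

Keeping the ASCII and locale versions as separate methods means code that compares keywords or identifiers never picks up locale behavior by accident.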

helium Fri 30 May 2008

Sorry, but I can not resist:

Well a string is just an array of characters. Strings have convenience methods, but in D all of those methods/functions are just on arrays.

So D programmers don't view strings and arrays as different types. A D programmer thinks a string (as a list of characters) is just a use case of arrays. (D is typically a lot faster than Fan, so Fan's philosophy of preferring a smaller number of types with many methods, to avoid conversions that require an object allocation, is not important here, as Fan is currently slower than D anyway.)

So if you take a look at arrays you will see they support all the methods you would typically use for both normal arrays and strings:

"Hello" ~ " world"  =>  "Hello world"  // concatenation
...

brian Fri 30 May 2008

I agree that treating strings and arrays with a unified collections API can be elegant. Of course D is closer to the metal - more akin to C than Java. In that world arrays, strings, and pointers are all basically the same thing.

In Fan's world, the JVM and CLR don't let us treat arrays consistently, which is why we have the fractured collection types:

  • sys::Buf: byte[]
  • sys::Str, sys::StrBuf: char[]
  • sys::List: Object[]

Under the covers in Java/C# this makes things quite efficient. It is not 100% elegant at the API level, but I think in practice you tend to work with them differently. Right now the commonality (such as get, set, each) is only available via duck typing. There might be an opportunity for a mixin type, but I don't think it plays well with generics (see previous discussion).
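As a very rough illustration of what such a shared type could look like, here is a sketch written in Java rather than Fan. Every name in it (Indexed, StrView, BufView, IndexedDemo) is hypothetical and nothing like it exists in the sys pod; it only shows get and each being shared across a char-backed and a byte-backed container.

```java
import java.util.function.Consumer;

// Sketch only: all types here are hypothetical illustrations.
class IndexedDemo {
    // Works against any Indexed<Integer>, whatever the backing storage is.
    static int sum(Indexed<Integer> xs) {
        int[] total = {0};
        xs.each(x -> total[0] += x);
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(sum(new StrView("ab")));                // 195
        System.out.println(sum(new BufView(new byte[] {1, 2, 3}))); // 6
    }
}

// The shared protocol: each() is written once in terms of size() and get().
interface Indexed<T> {
    int size();
    T get(int index);

    default void each(Consumer<? super T> f) {
        for (int i = 0; i < size(); i++) {
            f.accept(get(i));
        }
    }
}

// char[]-style storage: each element is boxed to Integer on the way out.
class StrView implements Indexed<Integer> {
    private final String s;
    StrView(String s) { this.s = s; }
    public int size() { return s.length(); }
    public Integer get(int index) { return (int) s.charAt(index); }
}

// byte[]-style storage: bytes are exposed as unsigned values 0..255.
class BufView implements Indexed<Integer> {
    private final byte[] bytes;
    BufView(byte[] bytes) { this.bytes = bytes.clone(); }
    public int size() { return bytes.length; }
    public Integer get(int index) { return bytes[index] & 0xFF; }
}
```

One visible cost in this Java version is that the generic get has to box every element to Integer, which cuts against the efficiency of keeping byte[] and char[] as primitive arrays.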

tactics Fri 30 May 2008

Wow, I haven't done Java in a while.... I forgot char is stack-allocated.

I didn't realize integers had the toChar method. That makes up for the printing issue well enough, which is the issue that bothers me most about it.

I think the division of linear container types (Buf, Str, List) makes perfect sense. Byte strings and textual strings have an extremely elevated importance in programming. They deserve special types with their own syntactic sugar.
