Ruby Unicode String Fun

RUnicode is coming along nicely.

I just implemented a few methods for String. One was String#blocks, which returns the name of the Unicode block each codepoint in the string belongs to.

  # "今日はトム君。Niall is a ☆.".blocks
  # => ["CJK Unified Ideographs", "CJK Unified Ideographs", "Hiragana",
  # "Katakana", "Katakana", "CJK Unified Ideographs",
  # "CJK Symbols and Punctuation", "Basic Latin", "Basic Latin",
  # "Basic Latin", "Basic Latin", "Basic Latin", "Basic Latin", "Basic Latin",
  # "Basic Latin", "Basic Latin", "Basic Latin", "Basic Latin",
  # "Miscellaneous Symbols", "Basic Latin"]
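For the curious, here is a minimal sketch of how a lookup like this could work. It is not the RUnicode source: BLOCK_RANGES and block_names are hypothetical names, and the table is a small hand-picked subset of the ranges listed in the Unicode Blocks.txt file.

```ruby
# A hand-picked subset of the ranges in Unicode's Blocks.txt.
# BLOCK_RANGES and block_names are hypothetical, not RUnicode's API.
BLOCK_RANGES = [
  [0x0000..0x007F, "Basic Latin"],
  [0x2600..0x26FF, "Miscellaneous Symbols"],
  [0x3000..0x303F, "CJK Symbols and Punctuation"],
  [0x3040..0x309F, "Hiragana"],
  [0x30A0..0x30FF, "Katakana"],
  [0x4E00..0x9FFF, "CJK Unified Ideographs"]
]

# Return one block name per codepoint in a UTF-8 string.
def block_names(string)
  string.unpack("U*").map do |codepoint|
    pair = BLOCK_RANGES.find { |range, _name| range.include?(codepoint) }
    # "No_Block" is the value Unicode uses for codepoints outside any block;
    # here it also covers blocks missing from this small table.
    pair ? pair[1] : "No_Block"
  end
end

block_names("はa☆")
# => ["Hiragana", "Basic Latin", "Miscellaneous Symbols"]
```

A linear search is fine for six ranges. With the full block list, a binary search over the sorted range starts would probably be the better choice.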

And then I decided Ruby needed a real String#upcase and String#downcase. The original String#upcase only converts ASCII letters. As its documentation says:

The operation is locale insensitive; only characters "a" to "z" are affected.

My version performs simple uppercase mappings according to the data found in UnicodeData.txt, although it takes about a year to do it.

  # "天空のエスカフローネ Tenkū no Esukafurōne, wörtlich".upcase
  # => "天空のエスカフローネ TENKŪ NO ESUKAFURŌNE, WÖRTLICH"
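Here is a rough sketch of the simple-mapping idea, not the actual RUnicode code. In UnicodeData.txt each record is a semicolon-separated line where field 0 is the codepoint and field 12 is the simple uppercase mapping, both in hexadecimal. The names SAMPLE_RECORDS, build_upcase_table and simple_upcase are made up for illustration, and the three records stand in for the full file.

```ruby
# Three sample records in the UnicodeData.txt format. A real table would be
# built from every line of the file.
SAMPLE_RECORDS = [
  "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
  "00F6;LATIN SMALL LETTER O WITH DIAERESIS;Ll;0;L;006F 0308;;;;N;LATIN SMALL LETTER O DIAERESIS;;00D6;;00D6",
  "016B;LATIN SMALL LETTER U WITH MACRON;Ll;0;L;0075 0304;;;;N;LATIN SMALL LETTER U MACRON;;016A;;016A"
]

# Build a codepoint => uppercase codepoint Hash from the records.
def build_upcase_table(records)
  table = {}
  records.each do |record|
    fields = record.split(";", -1)   # -1 keeps trailing empty fields
    upper = fields[12]               # simple uppercase mapping, may be empty
    table[fields[0].hex] = upper.hex unless upper.nil? || upper.empty?
  end
  table
end

# Built once, so each later call is one Hash lookup per codepoint.
UPCASE_TABLE = build_upcase_table(SAMPLE_RECORDS)

# Map every codepoint through the table, leaving unmapped ones untouched.
def simple_upcase(string, table = UPCASE_TABLE)
  string.unpack("U*").map { |cp| table.fetch(cp, cp) }.pack("U*")
end

simple_upcase("aöū天")
# => "AÖŪ天"
```

Building the Hash a single time rather than consulting the data file on every call is one plausible way to get the running time down, though that is a guess about where the time goes, not a measurement.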

String#downcase just calls String#upcase to do its dirty work. And String#upcase! and String#downcase! just use String#replace.
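As an illustration of that arrangement, here is a toy sketch with hypothetical toy_ method names, so the built-in methods stay untouched. It assumes the downcase table is simply the upcase table inverted, and it assumes the bang methods return nil when nothing changed, as the built-in String#upcase! does. The real code may differ on both points.

```ruby
# Toy mapping tables; a real version would come from UnicodeData.txt.
TOY_UPCASE = { 0x61 => 0x41, 0xF6 => 0xD6, 0x16B => 0x16A }
TOY_DOWNCASE = TOY_UPCASE.invert

class String
  # Non-destructive versions: map each codepoint through a table.
  def toy_upcase
    unpack("U*").map { |cp| TOY_UPCASE.fetch(cp, cp) }.pack("U*")
  end

  def toy_downcase
    unpack("U*").map { |cp| TOY_DOWNCASE.fetch(cp, cp) }.pack("U*")
  end

  # Destructive versions: compute the new string, then swap it in with
  # String#replace. Return nil when the string was already in that case.
  def toy_upcase!
    changed = toy_upcase
    changed == self ? nil : replace(changed)
  end

  def toy_downcase!
    changed = toy_downcase
    changed == self ? nil : replace(changed)
  end
end
```

String#replace keeps the same object and only swaps its contents, so every other reference to the string sees the change.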

This Ruby Unicoding is rather fun. It’s a good language to work with. It’s just so hackable. Maybe the next step should be to work out how to get String#upcase running in a timeframe similar to the original String#upcase, and then I might like to make some Ruby extensions in C. Although I’m not fond of C, I do think it’s a good choice for low-level things, and this is indeed low-level stuff.

RUnicode started out as one method I needed for my KLookup final year project. It’s growing a little, but I’m keeping it in the KLookup source tree for now. You can check it out with the following command:

svn checkout svn://rubyforge.org/var/svn/klookup

There’s a cute little demo in demo/ which makes use of the String#tr method (which is also available in jcode, I discovered) to convert Arabic numerals into (Japanese-style) kanji numerals. There’s also a shell script that uses the Ruby script to print the date:

二千七年一月八日
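A date like that can be produced along these lines. This is a simplified sketch and not the demo itself: small_kanji_number and kanji_date are hypothetical names, it only handles 1 to 9999, it writes 千 rather than 一千, and it relies on a Ruby whose String#tr understands multibyte characters (1.9 or later).

```ruby
# Place-value words for ones, tens, hundreds and thousands.
KANJI_UNITS = ["", "十", "百", "千"]

# Convert an integer from 1 to 9999 into kanji numerals.
def small_kanji_number(number)
  # String#tr swaps each Arabic digit for its kanji counterpart.
  digits = number.to_s.tr("0-9", "〇一二三四五六七八九")
  parts = digits.chars.reverse.each_with_index.map do |digit, position|
    next "" if digit == "〇"                     # zero places are not written
    digit = "" if digit == "一" && position > 0  # 十, 百, 千 instead of 一十 etc.
    digit + KANJI_UNITS[position]
  end
  parts.reverse.join
end

# Format a date in the 年/月/日 style.
def kanji_date(year, month, day)
  "#{small_kanji_number(year)}年#{small_kanji_number(month)}月#{small_kanji_number(day)}日"
end

kanji_date(2007, 1, 8)
# => "二千七年一月八日"
```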

It only goes up to (10**26)-1 at the moment. If you’re interested, (10**26)-1 (that’s a nine followed by 25 nines) looks like this:

九万九千九百九十九億九千九百九十九万九千九百九十九兆九万九千九百九十九億九千九百九十九万九千九百九十九

Enough babbling, goodnight.☆
