Text-wrangling

Because I needed to get something working in Ruby (I’m prototyping), I’ve been testing the same thing in various languages.

My project (multiradical kanji lookup - it doesn’t have a name yet) will probably take its input directly from radkdict. Each line of that file contains many kanji, so I decided to just split(//) each line that I needed. ちょっと待て〜! (Hang on a minute!) Just doing that gives me a list of the bytes in the file, not a list of the characters. In UTF-8 there’s no practical difference between bytes and characters as long as you stay within ASCII, but characters from the Han block generally take three bytes each. That’s only a rule of thumb, though, and rules of thumb are not the sort of thing you want to base parsing a text file on.
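To make the bytes-versus-characters problem concrete, here is a minimal sketch. It assumes Python 3 (where bytes and text are separate types) rather than the versions discussed in this post, and the sample string and the split_chars helper are made up for illustration; they are not taken from radkdict or from the project.

```python
# Hypothetical sample: two kanji, standing in for part of a line of input.
sample = "英語"
raw = sample.encode("utf-8")  # what is actually stored in a UTF-8 file


def split_bytes(raw_line):
    """Naive split: one element per byte, which is what went wrong above."""
    return [raw_line[i:i + 1] for i in range(len(raw_line))]


def split_chars(raw_line):
    """Decode the UTF-8 bytes first, then split into whole characters."""
    return list(raw_line.decode("utf-8"))


print(len(split_bytes(raw)))  # 6: three bytes for each of the two kanji
print(split_chars(raw))       # ['英', '語']
```

The point is only that decoding has to happen before splitting; counting three bytes per kanji by hand would be the rule of thumb again.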

To cut a long story short, if you set $KCODE to ‘u’, splitting with regexps will work (but indexing into the string still won’t).

So Ruby is FTL (for the loss, not faster than light). I’ll probably still be using Ruby for the project despite its failings. I haven’t even started on the background research yet (after that come other research, analysis and design, and only at the development stage do I start on the non-prototype code), so I still have plenty of time to change my mind.

Python (2.5, of course) and Perl also suffer from the same sorry substring stupidity.
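A rough sketch of what that substring problem looks like, again assuming Python 3 and using a bytes object to stand in for an undecoded string (in Python 3, ordinary strings are already Unicode text, so slicing them behaves). The first_chars helper is a hypothetical name, not part of any library.

```python
text = "英語が話せますか"
raw = text.encode("utf-8")  # the undecoded view of the same sentence


def first_chars(data, n):
    """Return the first n characters of UTF-8 encoded bytes."""
    return data.decode("utf-8")[:n]


naive = raw[:2]  # slices two bytes, which is less than one whole kanji

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False  # the slice cut a character in half

print(naive_is_valid)       # False
print(first_chars(raw, 2))  # 英語
```

Slicing after decoding gives whole characters; slicing the raw bytes does not even give valid UTF-8.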

Java, on the other hand, handles substrings perfectly with the usual String.substring method. When I ask for the first two characters of “英語が話せますか”, it gives me “英語”.

For the past few years Java has been growing on me. Now I’m really starting to think it’s a good thing. I’m looking forward to the day they open source it. gcj/gij are pretty good, but they’re not quite compatible yet.

I noticed Jython in Ubuntu’s universe today so I installed it (no JRuby yet). It’s pretty damn swanky: you can evaluate Python from Java and access Java classes from Python. The one catch is that jythonc depends on Python 2.1 (which is over four years old), so I can’t compile Python classes to Java bytecode.

Stream of consciousness journal entries FTW. Splat.
