Archive for October, 2006

Text-wrangling

Wednesday, October 18th, 2006

Because I needed to get something working in Ruby (I’m prototyping), I’ve been testing the same thing in various languages.

My project (multiradical kanji lookup - it doesn’t have a name yet) will probably take its input directly from radkdict. Each line of that file contains many kanji, so I decided to just split(//) each line that I needed. Hold on a moment! Just doing that gives me a list of the bytes in the line, not a list of the characters. In UTF-8 there’s no difference between bytes and characters as long as you stay within ASCII, but anything beyond that takes more than one byte. Characters from the main Han block take three bytes each, but rules of thumb are not the sort of thing you want to base parsing a text file on.
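A minimal sketch of the byte/character gap, written in Java rather than Ruby so that it doesn’t depend on any particular Ruby version. The class and method names are made up for illustration, and the escaped string spells 英語. It uses String.codePoints(), so it assumes Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Split a string into a list with one entry per Unicode code point.
    static List<String> characters(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String kanji = "\u82f1\u8a9e"; // two kanji, written as escapes

        System.out.println(utf8ByteCount(ascii));     // 3: one byte per ASCII character
        System.out.println(utf8ByteCount(kanji));     // 6: three bytes per kanji
        System.out.println(characters(kanji).size()); // 2: the number of characters
    }
}
```

Splitting on bytes would hand back six fragments for those two kanji, which is exactly the problem described above.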

To cut a long story short, if you set $KCODE to ‘u’, splitting using RegExps will work (but indexing into the string still counts bytes).

So Ruby is FTL (for the loss, not faster than light). I’ll probably still be using Ruby for the project despite its failings. I haven’t even started on the background research (after that come other research, analysis and design; development is where I start on the non-prototype code), so I still have plenty of time to change my mind.

Python (2.5, of course) and Perl suffer from the same sorry substring stupidity, at least when the text is held in plain byte strings.

Java, on the other hand, works perfectly with substrings. The usual String.substring method does the job: when I ask for the first two characters of “英語が話せますか” (“Can you speak English?”), it gives me “英語” (“English”).
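A small sketch of that substring call, with hypothetical class and method names and the string written as escapes for 英語が話せますか. One caveat worth hedging: String.substring counts UTF-16 code units, which matches characters for everything in this sentence but not for characters outside the Basic Multilingual Plane, so a code-point-aware variant using String.offsetByCodePoints is shown alongside it.

```java
class SubstringDemo {
    // First n UTF-16 code units, which is what String.substring counts.
    static String firstUnits(String s, int n) {
        return s.substring(0, n);
    }

    // First n Unicode code points; also safe for characters stored as surrogate pairs.
    static String firstCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // The eight-character question from the text, written as escapes.
        String question = "\u82f1\u8a9e\u304c\u8a71\u305b\u307e\u3059\u304b";

        // Both calls return the first two characters here, since every
        // character in the question fits in a single UTF-16 code unit.
        System.out.println(firstUnits(question, 2));
        System.out.println(firstCodePoints(question, 2));
    }
}
```

For the kanji in radkdict-style input the two methods should agree; they only differ once a rarer character that needs a surrogate pair turns up.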

For the past few years Java has been growing on me. Now I’m really starting to think it’s a good thing. I’m looking forward to the day they open source it. gcj/gij are pretty good, but they’re not quite compatible yet.

I noticed Jython in Ubuntu’s universe today so I installed it (no JRuby yet). It’s pretty damn swanky: you can evaluate Python from Java and access Java classes from Python. The one catch is that jythonc depends on Python 2.1 (which is over four years old), so I can’t compile Python classes to Java bytecode.

Stream of consciousness journal entries FTW. Splat.

Yo, long time no see

Saturday, October 14th, 2006

IMEs are the best invention in the world. This week I’ve typed π and done some box drawing without having to look at the character map (Gucharmap in this case).

Today I learnt π to 9 digits thanks to the Japanese language and this post. It was HARAGUCHI Akira’s recent escapades into real numbers that made me want to find π in Japanese phrases. It’s a great language for memorising numbers.

Today I have also been looking at fun things like graphical toolkits. It’s hard to put into words why GTK+ 2 could beat Qt or Tk in a fight. Just say that GTK+ 2 has a wider deployment and hope for the best. And then try to think up a good reason to choose Ruby over Python (I don’t think “Ruby is more Japonz” will cut the mustard in final year project documentation).

I shall probably start some coding of that soon (maybe sooner than my Gantt chart tells me I should be doing - but I’m sure I’ll live).

Well. I should go to bed now so I can go swimming in the morning and then write lots and lots of documentation.

P.S. I subscribe to the Doctor Will documentation methodology: have everything in a wiki then just print it out and call it a report.