Monday, June 2, 2008

Bidirectional Unicode codes and Arabic

I have been importing Arabic data into MLSN (http://dcook.org/mlsn/); incidentally, almost all of the data currently comes from the AWN project. We have a CSV list of synonyms, and each synonym optionally has its Arabic root in square brackets. The whole list is an SQL string, surrounded by single quotes.

When I view it in SciTE it looks exactly as I'd expect: the Arabic is right-to-left, the square brackets are part of the flow, and then at the apostrophe we're back to left-to-right. (By the way, it looks the same in vi as in SciTE.)
But in gedit, Firefox and OpenOffice Writer, when the last synonym has a square bracket it gets jumbled up.

Here is how it looks in SciTE:


Here is how it looks in gedit:


So, I tried adding the Unicode RLE (0x202B) character before the opening square bracket and before the closing square bracket. (Incidentally, in the tab-separated file this fixed the problem in all editors.)
In SciTE there is no change, which is good.
In gedit et al. the square brackets are now correctly in the flow, but the Arabic string has been moved to the end of the line and the following SQL clauses are now right-to-left.
Putting a Unicode LRE (0x202A) before the following comma did not help! (More precisely, it moved the ",'awn');" part back into the left-to-right flow, but still left the Arabic string stranded on the right. In any case, the LRE on the comma causes MySQL problems; see below.)
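
For anyone who wants to reproduce the RLE placement described above, here is a minimal Python sketch. It is not the script I used for the import; the function name and the sample string are made up for illustration, and it assumes the synonym list is already a decoded Unicode string.

```python
import re

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING

# Match '[' or ']' unless the character just before it is already an RLE.
_BRACKET = re.compile("(?<!" + RLE + r")([\[\]])")


def add_rle_before_brackets(text: str) -> str:
    """Insert an RLE before every opening and closing square bracket.

    Running it twice gives the same result as running it once, because
    brackets that already follow an RLE are left alone.
    """
    return _BRACKET.sub(RLE + r"\1", text)


if __name__ == "__main__":
    # Illustrative sample only (a word followed by its root in brackets),
    # not taken from the MLSN data.
    sample = "كتاب [كتب]"
    marked = add_rle_before_brackets(sample)
    # Print the code points so the invisible characters can be checked.
    print([hex(ord(ch)) for ch in marked])
```

Printing the code points is the easiest way to confirm where the invisible characters ended up, since the editors disagree about how to display them.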

Is this a bug in all of gedit, Firefox and OpenOffice? Could it be a Linux or GNOME bug, in something all those applications happen to use while SciTE and vi do not? Or am I doing something wrong? Any advice gratefully received!

Here is how it looks in gedit with the explicit RLE codes:



MySQL and Bidirectional Codes

It seems MySQL (after doing "SET NAMES UTF8;", of course) can cope with RLE/LRE inside a quoted string, but they cannot appear in the SQL statement itself, outside the quotes. E.g. on a comma.
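
A quick way to catch this before sending a statement to MySQL is to scan it for embedding codes that sit outside the single-quoted strings. The following is only a rough Python sketch: the function name is hypothetical, it understands only single-quoted literals with '' and backslash escapes, and it ignores comments and other quoting styles.

```python
# LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def bidi_controls_outside_quotes(sql: str) -> list[int]:
    """Return the offsets of bidi control characters found outside
    single-quoted string literals in an SQL statement."""
    positions = []
    in_quote = False
    i = 0
    while i < len(sql):
        ch = sql[i]
        if in_quote:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == "'":
                if i + 1 < len(sql) and sql[i + 1] == "'":
                    i += 2  # a doubled quote is an escaped quote
                    continue
                in_quote = False
        elif ch == "'":
            in_quote = True
        elif ch in BIDI_CONTROLS:
            positions.append(i)
        i += 1
    return positions


if __name__ == "__main__":
    # Hypothetical statement: RLE inside the quotes, LRE on the comma.
    sql = "INSERT INTO t VALUES ('x\u202b[y\u202b]'\u202a,'awn');"
    print(bidi_controls_outside_quotes(sql))
```

An empty list means every embedding code is inside a quoted string, which is the case MySQL appeared to accept.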

Firefox, IE6 and Bidirectional Codes

Without the RLE codes, Firefox messed up the square brackets in both the normal display and the edit form. With the RLE codes just before each opening and each closing square bracket, it shows them correctly in the normal display, but still gets it wrong in the HTML form input box.
(As an aside, IE6 on Windows XP is the opposite of Firefox! The main table is (very) wrong but the edit box is correct!) (And as an aside to my aside, if there is one thing Windows does well it is i18n fonts: the Arabic looks beautiful.)

phpMyAdmin has a textarea that shows it correctly (with Firefox). They use an explicit dir="ltr" (?!). Using that did not work for me.

So, my solution was to dynamically set dir to "rtl" when the edit form is being used for Arabic, and to "ltr" the rest of the time. And (for the IE users) I also set dir="rtl" on the Arabic cells, and dir="ltr" on the other cells, in the main table. There is no Arabic UI currently, but when there is, dir="rtl" will be set globally via a style sheet (which is why I set the default dir="ltr" explicitly on data cells for non-Arabic languages).
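
The per-cell part of this can be sketched as follows. This is Python for illustration only, not the code MLSN actually runs; the function names are hypothetical, and it assumes each cell's language is known as a short code such as "ar".

```python
from html import escape

# Assumption: Arabic is the only right-to-left language in the data.
RTL_LANGUAGES = {"ar"}


def text_direction(lang: str) -> str:
    """Return the value for the HTML dir attribute for a language code."""
    return "rtl" if lang.lower() in RTL_LANGUAGES else "ltr"


def render_cell(lang: str, text: str) -> str:
    """Render one table cell with an explicit dir attribute, so that
    non-Arabic cells stay left-to-right even if a global default of
    rtl is added later."""
    return f'<td dir="{text_direction(lang)}">{escape(text)}</td>'


if __name__ == "__main__":
    print(render_cell("ar", "كتاب [كتب]"))
    print(render_cell("en", "book"))
```

The same text_direction value can be used for the dir attribute of the edit form's input box whenever the form is opened for a given language.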

See it in action by doing a search on MLSN. Here is one example: http://two.dcook.org/software/mlsn/main.php?c=06car0
(mouseover the table cells, then clicking the cell will make the edit form active; compare Arabic with the other languages.)

I am open to suggestions for alternative solutions, but I believe this is the "proper" standards-compliant, cross-browser solution.

2 comments:

keith.s.wilkinson said...

CSS2 unicode-bidi may be of interest, also unicode-bidi in Opera and bidi on Moz.

keith.s.wilkinson said...

This may also be of interest (if you haven't already seen it).