Sunday, August 23, 2009

php6 - maybe not so painful?

I long ago learned that if you want to get development work done then always use the distro's own version of PHP, Apache, etc, rather than compiling your own version. Otherwise every upgrade due to a security or bug fix requires you to first realize the need (costs time keeping up with the study) and then do something about it (costs time). Time is money; not spending the time is insecure. So I outsource all of that to the experts at my distro or ISP.

Consequences are that my code avoids using the latest feature, and tries to work on as wide a set of versions as possible. PHP 5.0.0 was released in July 2004 yet at the start of this year two of my clients were still using php 4. One of them was still using php 4.2. (They've both upgraded this year; one because they moved to a new ISP, the other because we were troubleshooting code and running out of ideas, and wondered if moving to php 5 might magically fix it.)

Which is a long-winded introduction to say that PHP 6 will be out before long, and I'm very unexcited. It brings native unicode; but the mbstring extension already did the job fine. It removes bad stuff that was already deprecated, but I didn't use it anyway.

This article made me concerned it may turn out to be a negative experience overall, rather than a neutral one:
http://www.linux-mag.com/id/7433/2/

It says PHP4-style classes are disappearing. Yet they work fine in PHP 5, and E_STRICT has not been complaining if I've not explicitly written "public". Of even more concern is php 5.3 apparently brings in this change too?

However, it turns out this was just a poor Linux Magazine article. What is being removed in php 6 is "ZE1 compatability mode" (see the php manual description); it had defaulted to off, and only affected the way objects were copied/compared. The "php 4" example Linux Magazine show is valid php 5 code, and (as far as I can tell) will be valid php 6 code too.

So that was a red herring. But I'm also bothered how (string) works for binary strings in php 5, but will need to be changed to (binary) for php 6 - that is a very annoying change for anyone storing binary data in strings (disabling unicode is not an option if you also need to store UTF-8 in other strings)!

Thursday, August 20, 2009

Storing website translations in SQL

In php|a magazine, April 2009, there was an article called "Storing Multilingual Records in the MySQL Database". As the title says, it has some mysql-specific elements, but the concepts are quite general. The author introduced and compared three alternatives; as I have used a fourth way I thought I'd write about it here.

The situation was a product database, where the name, url and description have to be localized. But some entries will be untranslated and should fall back to using the default language entry.


1. Translations in a separate table


Here is the database schema:

product: product_id, group_id, price.
product_translation: product_id (links to product), language_id (links to language), name, url, description.
language: language_id, collation.

This requires an SQL join which is hard for mysql to execute efficiently. Fulltext indexes are also a problem, as the name and description fields contain text from different languages. The only good point of this approach is that it is easy to add a new language.


2. Data Copy


Here the group_id and price fields have been moved into the product_translation table. That makes the SQL queries a bit cleaner, as it saves one join, but doesn't really solve the other problems. And the data redundancy is an accident waiting to happen (I'm having a physical reaction just thinking about it).

3. Translation Directly In The Database


This is like 2 above, but each language gets its own field. For instance if we have three languages, English (en), German (de) and Japanese (ja) then it looks like:

product: product_id, language_id (links to language), name_en, url_en, description_en, name_de, url_de, description_de, name_ja, url_ja, description_ja, price, group_id.

The advantage this brings is using the default language is easy; you can either just always select the default language field, or use mysql's IFNULL function. E.g. either:
SELECT name_de,name_en WHERE product_id=123;
(and then check in your PHP script to see if name_de is blank, and if so use name_en instead)
Or:
SELECT IFNULL(name_de,name_en) as name WHERE product_id=123;

The other advantage is FULLTEXT indices work well now, as only one language is kept in each text column. Disadvantages are the space-wasting if actual translations are sparse, the work required to add a new language, and possibly hitting the mysql maximum row size (64K).

4. One table per language



This is the approach I've used in MLSN, among others:

product: product_id, group_id, price.
product_translation_en: product_id, name, url, description.
product_translation_de: product_id, name, url, description.
product_translation_ja: product_id, name, url, description.

Fulltext indices work well, joins are relatively easy, no data duplication. Adding a language feels cleaner than solution three above, as it doesn't require modifying an existing table, just adding a new one. If the site takes off then different languages can be split to different servers easily. And the language table can disappear as collation will now be part of the field definition when the table is created. (I think the same can be said of solution three, but the php|a article kept the language table in its schema - was that an oversight, or am I missing something?)

The dark side? Getting the default language requires two queries. Or would something like this (untested) SQL work?
SELECT ISNULL(de.name,en.name) FROM product_translation_de de,product_translation_en en WHERE de.product_id=123 AND en.product_id=123

Thoughts?



The 4th option works well for me, but I'd be interested to here arguments against it. Perhaps you use something different again?