Thursday, August 20, 2009

Storing website translations in SQL

In php|a magazine, April 2009, there was an article called "Storing Multilingual Records in the MySQL Database". As the title says, it has some mysql-specific elements, but the concepts are quite general. The author introduced and compared three alternatives; as I have used a fourth way I thought I'd write about it here.

The situation was a product database, where the name, url and description have to be localized. But some entries will be untranslated and should fall back to using the default language entry.


1. Translations in a separate table


Here is the database schema:

product: product_id, group_id, price.
product_translation: product_id (links to product), language_id (links to language), name, url, description.
language: language_id, collation.

This requires an SQL join which is hard for mysql to execute efficiently. Fulltext indexes are also a problem, as the name and description fields contain text from different languages. The only good point of this approach is that it is easy to add a new language.


2. Data Copy


Here the group_id and price fields have been moved into the product_translation table. That makes the SQL queries a bit cleaner, as it saves one join, but doesn't really solve the other problems. And the data redundancy is an accident waiting to happen (I'm having a physical reaction just thinking about it).

3. Translation Directly In The Database


This is like 2 above, but each language gets its own field. For instance if we have three languages, English (en), German (de) and Japanese (ja) then it looks like:

product: product_id, language_id (links to language), name_en, url_en, description_en, name_de, url_de, description_de, name_ja, url_ja, description_ja, price, group_id.

The advantage this brings is using the default language is easy; you can either just always select the default language field, or use mysql's IFNULL function. E.g. either:
SELECT name_de,name_en WHERE product_id=123;
(and then check in your PHP script to see if name_de is blank, and if so use name_en instead)
Or:
SELECT IFNULL(name_de,name_en) as name WHERE product_id=123;

The other advantage is FULLTEXT indices work well now, as only one language is kept in each text column. Disadvantages are the space-wasting if actual translations are sparse, the work required to add a new language, and possibly hitting the mysql maximum row size (64K).

4. One table per language



This is the approach I've used in MLSN, among others:

product: product_id, group_id, price.
product_translation_en: product_id, name, url, description.
product_translation_de: product_id, name, url, description.
product_translation_ja: product_id, name, url, description.

Fulltext indices work well, joins are relatively easy, no data duplication. Adding a language feels cleaner than solution three above, as it doesn't require modifying an existing table, just adding a new one. If the site takes off then different languages can be split to different servers easily. And the language table can disappear as collation will now be part of the field definition when the table is created. (I think the same can be said of solution three, but the php|a article kept the language table in its schema - was that an oversight, or am I missing something?)

The dark side? Getting the default language requires two queries. Or would something like this (untested) SQL work?
SELECT ISNULL(de.name,en.name) FROM product_translation_de de,product_translation_en en WHERE de.product_id=123 AND en.product_id=123

Thoughts?



The 4th option works well for me, but I'd be interested to here arguments against it. Perhaps you use something different again?

1 comment:

Anonymous said...

I'm interested in the different alternative ways of doing multilingual web pages with fallback to the default language. Apache apparently supports adding a language tag to a (static HTML) file extension, but that method doesn't directly support multilingual PHP. Haven't yet investigated how Silverstripe does it (latest release is 2.3.3). Getting back to the topic of multilingual SQL tables, there are ways of packing sparse tables, but no canned solutions. (Even a canned solution for automatic stemming of words to enable fuzzy full-text search in MySQL would be useful. And I haven't yet found any easy solutions for searching languages like Japanese).