In
php|a magazine, April 2009, there was an article called "Storing Multilingual Records in the MySQL Database". As the title says, it has some mysql-specific elements, but the concepts are quite general. The author introduced and compared three alternatives; as I have used a fourth way I thought I'd write about it here.
The situation was a product database, where the name, url and description have to be localized. But some entries will be untranslated and should fall back to using the default language entry.
1. Translations in a separate table
Here is the database schema:
product: product_id, group_id, price.
product_translation: product_id (links to product), language_id (links to language), name, url, description.
language: language_id, collation.
This requires an SQL join which is hard for mysql to execute efficiently. Fulltext indexes are also a problem, as the name and description fields contain text from different languages. The only good point of this approach is that it is easy to add a new language.
2. Data Copy
Here the group_id and price fields have been moved into the product_translation table. That makes the SQL queries a bit cleaner, as it saves one join, but doesn't really solve the other problems. And the data redundancy is an accident waiting to happen (I'm having a physical reaction just thinking about it).
3. Translation Directly In The Database
This is like 2 above, but each language gets its own field. For instance if we have three languages, English (en), German (de) and Japanese (ja) then it looks like:
product: product_id, language_id (links to language), name_en, url_en, description_en, name_de, url_de, description_de, name_ja, url_ja, description_ja, price, group_id.
The advantage this brings is using the default language is easy; you can either just always select the default language field, or use
mysql's IFNULL function. E.g. either:
SELECT name_de,name_en WHERE product_id=123;
(and then check in your PHP script to see if name_de is blank, and if so use name_en instead)
Or:
SELECT IFNULL(name_de,name_en) as name WHERE product_id=123;
The other advantage is FULLTEXT indices work well now, as only one language is kept in each text column. Disadvantages are the space-wasting if actual translations are sparse, the work required to add a new language, and possibly hitting the mysql maximum row size (64K).
4. One table per language
This is the approach I've used in MLSN, among others:
product: product_id, group_id, price.
product_translation_en: product_id, name, url, description.
product_translation_de: product_id, name, url, description.
product_translation_ja: product_id, name, url, description.
Fulltext indices work well, joins are relatively easy, no data duplication. Adding a language feels cleaner than solution three above, as it doesn't require modifying an existing table, just adding a new one. If the site takes off then different languages can be split to different servers easily. And the language table can disappear as collation will now be part of the field definition when the table is created. (I think the same can be said of solution three, but the php|a article kept the language table in its schema - was that an oversight, or am I missing something?)
The dark side? Getting the default language requires two queries. Or would something like this (untested) SQL work?
SELECT ISNULL(de.name,en.name) FROM product_translation_de de,product_translation_en en WHERE de.product_id=123 AND en.product_id=123
Thoughts?
The 4th option works well for me, but I'd be interested to here arguments against it. Perhaps you use something different again?