Published at: 22 Mar 2023
Last modified at: 28 Dec 2023

Common Mistakes in the Interlingualization of Software

Introduction

When software is interlingualized, that is, programmed in a way that allows translations (for example, by using a library like GNU gettext), there are many common pitfalls that can introduce bugs.

By now, I’ve worked both as a programmer and as a translator for numerous software projects, and I’ve seen a healthy amount of recurring mistakes.

In this technical essay, I’m looking at common mistakes, why I consider them to be mistakes, examples, and possible solutions.

Rolling your own translation system

Problem

Rolling your own translation system for your software, rather than relying on an established and tried-and-tested library, is probably one of the biggest mistakes you can make. This is a constant source of frustration for maintainers and translators alike.

There are numerous problems with inventing a new translation system for your software:

In rare cases, there might be reasons why you still want to roll your own system despite all of these problems. So if you absolutely must roll your own system, be aware of all the challenges you need to deal with.

In my experience, the established libraries are more than sufficient for most cases.

Example: OpenRCT2

OpenRCT2 (as of version 0.4.3) suffers greatly from having a custom translation system. OpenRCT2 uses simple text files and string IDs like STR_1234. Translators have to work on the raw text files directly. We don’t get to use shiny tools like POEdit or Weblate. Submitting a translation is even more awkward: it is done by posting pull requests, something which usually only developers do. There are also no tools to check a translation file for syntax errors or missing placeholders.

Example: Hedgewars

Hedgewars (as of version 1.0.0) is an exceptionally bad offender. It uses multiple custom file formats. While all of them are documented, working with them is quite painful.

First, Hedgewars uses both GNU gettext and the Qt translation system, which is acceptable. The issue is the custom file formats. One file format is a text file for strings used in-game. Another file is literally a Lua script which is a gigantic list of strings for scripted missions. A syntax error in a Lua translation will thus lead to errors. Another file format is used for tips in the main menu. It has the file name suffix .xml, but it is not actually XML; it is a custom file format that merely looks like XML. Finally, there are numerous text files all over the place for the descriptions of campaigns, missions and maps.

Several scripts to update and check the custom files exist, but they are very incomplete. In particular, changes to the source strings are tracked only for some files, so translators are sometimes in the dark about whether a translation has become outdated.

Part of the reason for this mess lies in Hedgewars’ design because it is split into multiple components. The main menu is a separate application from the actual gameplay part. Fixing all of these issues is a daunting task.

Solution

If possible, use a library that already exists. Resist the temptation to create custom file formats; you will likely regret it later.

If you’re still in the design phase, this is an easy decision, obviously. But when your custom system is already rolled out, making a switch is much harder and more painful, as you likely have to touch a lot of code and also convert the translation files to the new format. The earlier you get it right, the better.

Usually you should use GNU gettext. GNU gettext is the most mature library out there, well-tested, well-supported and well-documented. The file format is also helpful: the *.PO files are now so popular that they are supported by lots of other software as well. Many online translation services understand *.PO files.
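To give a rough idea of what adopting GNU gettext involves on the code side, here is a minimal sketch in C. It assumes a system where libintl.h is available (e.g. glibc). The text domain "mygame", the locale directory and the function names init_i18n and greeting are placeholders made up for this illustration.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Conventional shorthand for "translate this string". */
#define _(msgid) gettext(msgid)

/* Hypothetical helper, meant to be called once at program start.
 * "mygame" and the directory are placeholder values. */
static int init_i18n(void)
{
    setlocale(LC_ALL, "");  /* pick up the user's locale settings */
    if (bindtextdomain("mygame", "/usr/share/locale") == NULL)
        return -1;          /* where the compiled .mo files are searched */
    if (textdomain("mygame") == NULL)
        return -1;          /* default domain used by gettext() */
    return 0;
}

/* Hypothetical example of a user-facing, translatable string. */
static const char *greeting(void)
{
    return _("Welcome to the game!");
}
```

If no translation catalog is found, gettext simply returns the original string, so the program keeps working in the source language.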

For Qt-based software, I recommend using Qt’s built-in translation system instead, which is in my opinion and personal experience also very solid. Qt’s translation system is also well-documented and has proper tools available. The benefit of using Qt’s translation system for a Qt application is that it is more tightly integrated with Qt.

Untranslatable strings

Problem

An untranslatable string is a string in the program that will always stay the same, no matter the translation. It’s basically hardcoded into the program. As a result, the program can’t actually be translated completely by translators, even if all indicators show 100% completion.

Causes

Sometimes, software projects try to justify leaving some strings untranslatable:

Only the last argument is valid. The others aren’t because they miss the point of translations. They imply that the user already knows English (or whatever default language you use) anyway and the translations are only considered as an accessory. Accordingly, the quality of the translations decreases. There are users who don’t speak a single word of English. For them, every single untranslated string is cryptic.

Solution

Make all user-facing strings translatable (with the exception of technical strings like URLs, ID numbers, etc.).

Too little space

Problem

The translated string does not fit into the user interface, e.g. the text exceeds the boundaries of the button, window, screen, etc. This problem is obvious when you see it, but it often goes unnoticed because developers rarely test every single translation.

Solution

Leave enough “breathing room” in UI elements that contain text. If the original language is English and your text barely fits into a button, chances are high that translators will struggle. Many languages are more verbose than English. As a rule of thumb, for things like buttons, reserve 50% more space than the English text would take.
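The rule of thumb is simple arithmetic. As a small sketch, assuming widths are measured in whole pixels (the function name padded_button_width is made up for this illustration):

```c
#include <assert.h>

/* Hypothetical helper: minimum width of a text button, given the width
 * of the English label in pixels. Adds 50% and rounds up. */
static int padded_button_width(int english_width)
{
    assert(english_width >= 0);
    return (english_width * 3 + 1) / 2;
}
```

For an English label that is 80 pixels wide, this reserves 120 pixels.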

Test translations from time to time or ask translators or users to do it for you. Listen to reports complaining about a lack of space.

Confusing source strings

Problem

A string has been marked as translatable, but the source string causes headaches for translators because it is confusingly written. The translator finds themselves unable to translate the string. Or, even worse, the translator tries to translate an ambiguous string, but misinterprets the string and thus translates it incorrectly. Remember, most translators are not coders, so strings need to make sense without reading the code.

The problem has several manifestations:

Solutions

There are several possible solutions:

Also, all character sequences with a special technical meaning (like an escape code to trigger a text color change) must be documented somewhere.
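With GNU gettext, one place for such documentation is a translator comment directly above the string. xgettext can copy these comments into the translation files when it is run with the --add-comments option and a matching tag. The following is only a sketch: the color codes {red} and {reset} and the function name are invented for this illustration.

```c
#include <libintl.h>

/* Hypothetical warning message that contains technical color codes. */
static const char *low_health_warning(void)
{
    /* TRANSLATORS: {red} and {reset} are color codes (hypothetical here).
     * Keep them exactly as they are and only translate the words. */
    return gettext("{red}Warning:{reset} Your health is low!");
}
```

Running xgettext with --add-comments=TRANSLATORS: would then show the comment next to the string in the translator's tool.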

Text is baked into images

Problem

There is an image file which contains text. The image is the same for all languages. Thus, for the user, the text is always the same and translators can’t fix it.

Example

In Hedgewars, in the singleplayer menu, there is a button for the training missions. It showed a hedgehog writing the English text “I must not eat melon bombs.” on a blackboard. This image was the same regardless of language settings.

This was solved by changing the code so that the image is only shown for English. Other languages get a new, special image in which the text is replaced by comic-style lines.

Solutions

Possible solutions include:

No consistent use of format strings

Problem

This problem occurs when a string has been broken apart too aggressively. For example, a sentence is broken apart in the middle. Another possibility is that variable text (like numbers) is inserted into the string via string concatenation, thus generating multiple strings. Format strings aren’t used.

This is bad because:

Example 1

Consider the sentence “Hello, PLAYER! How is your day?”, where “PLAYER” is supposed to become the player name. In C, one naive implementation could look like this:

strcpy(translated_string, gettext("Hello, "));
strcat(translated_string, player_name);
strcat(translated_string, gettext("! How is your day?"));
strcat(translated_string, "\n");
printf("%s", translated_string);

translated_string is a char* variable used to build the final string. gettext is the translation function from GNU gettext. When it comes to translating, the translator will see two strings:

This is quite confusing for translators since they usually see only one string at a time, although both strings belong to the same sentence.

Example 2

In Minetest Game, there is an item tooltip for a key that is written like “Key to PLAYER’s THING”, where “PLAYER” is a player name and “THING” is an object with a lock. For example, “Key to Wuzzy’s Locked Chest”.

A naive implementation in Lua code would have been:

local translated_string = S("Key to ") .. PLAYER .. S("'s ") .. THING .. S(".")

(where S is the translation function)

The same problems as in Example 1 apply, as the string is oddly broken apart into confusing fragments. However, there is an even worse problem: Word order. In many languages, the order of the parameters PLAYER and THING needs to be flipped. A correct German translation is “Schlüssel für THING von PLAYER”.

The naive implementation makes it impossible to translate it like that in German since PLAYER always comes first. There’s nothing the translators can do.

Solution

Use format strings or parameter symbols wherever variable text is used. (If you don’t know what a format string is: Look up the help for the C function printf.) Don’t do string concatenations within the same sentence. If your translation system does not support format strings, it must be replaced.

Concatenating strings to insert hardcoded line breaks is usually OK, tho.

For Example 1, the correct code would be:

const char* player_name = "Wuzzy";
printf(gettext("Hello, %s! How is your day?"), player_name);

For Example 2, the correct code would look like this:

local translated_string = S("Key to @1's @2", PLAYER, THING)

Where S is the translation function and @1 and @2 are parameter symbols that Luanti will replace with PLAYER and THING, respectively.
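The word order problem from Example 2 can be handled in C as well. The following sketch assumes a printf implementation that supports POSIX positional arguments like %1$s (glibc does; this is not part of ISO C). The function names are made up for this illustration.

```c
#include <assert.h>
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: fills buf using a format string with positional
 * arguments. %1$s is the player name, %2$s is the locked object. */
static int format_key_tooltip(char *buf, size_t size, const char *fmt,
                              const char *player, const char *thing)
{
    assert(buf != NULL && fmt != NULL);
    return snprintf(buf, size, fmt, player, thing);
}

/* The translator sees the whole sentence as one string and may reorder
 * the parameters, e.g. in German: "Schlüssel für %2$s von %1$s". */
static int key_tooltip(char *buf, size_t size,
                       const char *player, const char *thing)
{
    return format_key_tooltip(buf, size, gettext("Key to %1$s's %2$s"),
                              player, thing);
}
```

Because the parameters are numbered, the German translation can put the object first without any change to the code.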

Multiple use of the same string with different meanings

Problem

The identical string is used multiple times in the program, but has completely different meanings in different places. This problem is very easy to overlook because in the source language this is not a problem, but it is a problem in translations. In all likelihood, the translators won’t be able to translate the string correctly. This is the case when the target language actually makes a linguistic distinction.

Example

In the game “Cataclysm: Dark Days Ahead”, the string “Screwdriver” used to be ambiguous. The meanings:

The problem was solved by changing the name for the cocktail to “screwdriver cocktail”.

Solution

If your library allows this, specify context for the clashing strings. GNU gettext and Qt support this. Look up the manual to learn how context works.
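As a rough sketch of how context works in GNU gettext for C: the context and the string are joined with the byte \004 and looked up together. gettext ships a header gettext.h with a pgettext macro that does this; the version below is a simplified stand-in so that the example is self-contained. The macro name CTX and the context labels "tool" and "cocktail" are chosen for this illustration only.

```c
#include <libintl.h>

/* Looks up "context\004msgid". gettext() returns its argument unchanged
 * when there is no translation; in that case fall back to the plain msgid. */
static const char *lookup_with_context(const char *ctxt_and_id,
                                       const char *msgid)
{
    const char *translation = gettext(ctxt_and_id);
    return (translation == ctxt_and_id) ? msgid : translation;
}

/* Hypothetical macro; both arguments must be string literals. */
#define CTX(context, msgid) lookup_with_context(context "\004" msgid, msgid)

/* Same source string, two separate entries for the translator. */
static const char *tool_name(void)     { return CTX("tool", "Screwdriver"); }
static const char *cocktail_name(void) { return CTX("cocktail", "Screwdriver"); }
```

For xgettext to pick up such a custom macro, it would need to be told about it, for example with --keyword=CTX:1c,2.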

Alternatively, it is also possible to change one of the source strings, if you can come up with a reasonable alternative. That might not always be feasible, however.

Numerus not considered

Problem

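Very often the numerus is not taken into account correctly. The numerus (grammatical number) describes how words change depending on the number attached to them. In English, there is singular and plural: 1 cook, 2 cooks, etc. Other languages follow different rules, and some have more than two forms.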

Example

The program displays “You have 20 coin(s).” instead of “You have 20 coins.”.

Solution

Practically all major translation systems have built-in support for the numerus. GNU gettext calls this feature “plural forms”. If available, use this.
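In C, the plural forms feature of GNU gettext is used through the function ngettext. Here is a minimal sketch for the coin example; the function name format_coins is made up for this illustration.

```c
#include <assert.h>
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: writes the coin message into buf.
 * ngettext() picks the form that fits the number in the active language. */
static int format_coins(char *buf, size_t size, unsigned long coins)
{
    assert(buf != NULL);
    const char *fmt = ngettext("You have %lu coin.",
                               "You have %lu coins.", coins);
    return snprintf(buf, size, fmt, coins);
}
```

Without a translation, ngettext falls back to the English rule and returns the first string for 1 and the second string for everything else. A translation can define as many forms as the language needs.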

If your system does not support this, you can sometimes get away with a workaround by turning strings like “%d coin(s)” into “Coins: %d” (where %d is a number). Note this workaround is not always reasonable.

Marking the empty string as translatable

Problem

Somewhere in the source code, the empty string was marked as translatable.

This is a problem because it makes no sense to mark the empty string as translatable and can sometimes lead to errors.

In GNU gettext, you actually can’t make the empty string translatable. Here, the empty string identifier has a special meaning. It is reserved to store metadata about the translation itself.

Example

In Luanti, we had a very strange bug shortly before the release of version 5.0.0: Weblate, the service we use to translate Luanti, suddenly stopped accepting updates to the translation files.

After some digging, I found out that someone somewhere had added a fgettext("") statement, and then everything went south. To quote my own post-mortem:

The reason why Weblate failed was a regression introduced in Luanti commit 5ef9056. Namely, fgettext("") was called, which you should NOT do (using empty string in GNU gettext has a special meaning)! This caused util/updatepo.sh to be confused and spit out garbage PO files that had their metadata removed. Without the metadata the PO files were broken and as a result, Weblate complained.

(util/updatepo.sh is our script to update the translation files.) The bugfix was to remove fgettext("") (commit 81f86b0). What makes this bug remarkable is how attempting to translate the empty string in GNU gettext triggered an entire chain of events that broke things.

Solution

Never mark an empty string as translatable.
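When the string to translate is only known at runtime, a small guard can make sure the empty string never reaches gettext. This is only a sketch and the wrapper name translate is made up for this illustration.

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical wrapper: never passes NULL or "" to gettext(), because
 * gettext("") returns the catalog's metadata instead of a translation. */
static const char *translate(const char *msgid)
{
    if (msgid == NULL || msgid[0] == '\0')
        return "";
    return gettext(msgid);
}
```

Note that such a guard only helps at runtime. It does nothing against a literal empty string that is marked as translatable in the source code, since the extraction tools will still pick it up, so those have to be removed from the code itself.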

Very long source strings

Problem

This problem is not as severe as other problems, but worth mentioning. Source strings that are very long are agonizing to translate. The main problem arises when small changes are frequently made to the source string. Each change means the entire long string becomes untranslated again, which is annoying.

Solution

Long source strings should be located and broken down into smaller appetizers that translators can digest. The rule of thumb here is “one train of thought per string”. When you notice one particular long string has seen multiple minor changes in a short time, it is a sign it should be broken up.
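One way to break up a long text with GNU gettext in C is to store each train of thought as its own string and join the translated pieces at runtime. The sketch below uses the common N_ convention to mark strings for extraction without translating them immediately (xgettext would need --keyword=N_ for this). The help text and all names are invented for this illustration.

```c
#include <assert.h>
#include <libintl.h>
#include <stdio.h>

/* Marks a string for extraction; the translation happens later. */
#define N_(msgid) (msgid)

/* Hypothetical help text: one train of thought per string. */
static const char *const help_paragraphs[] = {
    N_("Use the arrow keys to move."),
    N_("Press Space to jump."),
    N_("Press Escape to open the menu."),
};

#define HELP_COUNT (sizeof help_paragraphs / sizeof help_paragraphs[0])

/* Translates each paragraph on its own and joins them with blank lines.
 * Returns the length of the text, or -1 if buf is too small. */
static int build_help_text(char *buf, size_t size)
{
    size_t used = 0;
    assert(buf != NULL);
    if (size == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < HELP_COUNT; i++) {
        int n = snprintf(buf + used, size - used, "%s%s",
                         (i == 0) ? "" : "\n\n",
                         gettext(help_paragraphs[i]));
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

If one paragraph changes later, only that paragraph becomes untranslated. The other translations stay valid.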