Common Mistakes in the Interlingualization of Software
Introduction
When software is interlingualized, that is, programmed in a way that allows translations (for example, by using a library like GNU gettext), there are many common pitfalls that can introduce bugs.
I’ve worked as both a programmer and a translator on numerous software projects by now, and I’ve seen a healthy amount of recurring mistakes.
In this technical essay, I look at common mistakes: why I consider them to be mistakes, examples, and possible solutions.
Rolling your own translation system
Problem
Rolling your own translation system for your software, rather than relying on an established and tried-and-tested library, is probably one of the biggest mistakes you can make. This is a constant source of frustration for maintainers and translators alike.
There are numerous problems with inventing a new translation system for your software:
- First, consider that you might be suffering from the Not Invented Here syndrome. Do you really think your own fancy translation system adds features the other libraries are lacking? Are these missing features worth starting from scratch?
- You have to design, test and implement a lot of things you haven’t even considered to be even on par with established libraries like GNU gettext.
- You must take into account parameters and format strings. There must be a way to insert a variable text into strings.
- You also must come up with an entire toolchain so you can update translation files and perform automatic syntax checks. You do not want to do this by hand.
- You need to track string changes. Will the translation files be updated accordingly and old translations get disabled when a source string has changed?
- You’ll likely miss out on various extremely useful translation tools that have been written over the years, like POEdit or Weblate.
- If your system uses any type of special syntax, you need to document it.
In rare cases, there might be reasons why you still want to roll your own system despite all of these problems. So if you absolutely must roll your own system, be aware of all the challenges you need to deal with.
In my experience, the established libraries are more than sufficient for most cases.
Example: OpenRCT2
OpenRCT2 (as of version 0.4.3) suffers greatly from having a custom translation system. OpenRCT2 uses simple text files and string IDs like STR_1234. Translators have to work on the raw text files directly. We don’t get to use shiny tools like POEdit or Weblate. Submitting a translation is even more awkward: it is done by posting pull requests, something which usually only developers do. Also, there are no tools to check a translation file for syntax errors or missing placeholders.
Example: Hedgewars
Hedgewars (as of version 1.0.0) is an exceptionally bad offender. It uses multiple custom file formats. While all of them are documented, working with them is quite painful.
First, Hedgewars uses both GNU gettext and the Qt translation system, which is acceptable. The issue is the custom file formats. One file format is a text file for strings used in-game. Another file is literally a Lua script which is a gigantic list of strings for scripted missions. A syntax error in a Lua translation will thus lead to errors. Another file format is used for tips in the main menu. It has the file name suffix .xml but is not actually XML; it is a custom file format that merely looks like XML. Finally, there are numerous text files all over the place for the descriptions of campaigns, missions and maps.
Several scripts to update and check the custom files exist, but they are very incomplete. In particular, changes to the source strings are tracked only for some files, so translators are sometimes in the dark when a translation has become outdated.
Part of the reason for this mess lies in Hedgewars’ design because it is split into multiple components. The main menu is a separate application from the actual gameplay part. Fixing all of these issues is a daunting task.
Solution
If possible, use a library that already exists. Resist the temptation to create custom file formats; you will likely regret it later.
If you’re still in the design phase, this is an easy decision, obviously. But when your custom system is already rolled out, making a switch is much harder and more painful, as you likely have to touch a lot of code and also convert the translation files to the new format. The earlier you get it right, the better.
Usually you should use GNU gettext. GNU gettext is the most mature library out there, well-tested, well-supported and well-documented. What is also helpful is the file format: The *.PO files are now so popular that they are supported by lots of other software as well. Many online translation services understand *.PO files.
For Qt-based software, I recommend using Qt’s built-in translation system instead, which is, in my opinion and personal experience, also very solid. Qt’s translation system is also well-documented and has proper tools available. The benefit of using Qt’s translation system for a Qt application is that it is more tightly integrated with Qt.
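As a rough sketch of what the GNU gettext route looks like in a C program: the domain name "myapp", the directory /usr/share/locale and the function names below are placeholders made up for illustration, not taken from any real project.

```c
#include <libintl.h>
#include <locale.h>

/* Conventional shorthand; xgettext can be told to extract it
   with --keyword=_ */
#define _(msgid) gettext(msgid)

/* "myapp" and the locale directory are placeholders for this sketch. */
void init_translations(void)
{
    setlocale(LC_ALL, "");                         /* adopt the user's locale */
    bindtextdomain("myapp", "/usr/share/locale");  /* where the .mo files live */
    textdomain("myapp");                           /* default message domain */
}

/* Returns the translated greeting, or the English source text if no
   translation is available. */
const char *greeting(void)
{
    return _("Welcome!");
}
```

When no translation is found, gettext simply returns the source string, so the program keeps working before any translator has touched it.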
Untranslatable strings
Problem
An untranslatable string is a string in the program that will always stay the same, no matter the translation. It’s basically hardcoded into the program. As a result, the program can’t actually be translated completely by translators, even if all indicators show 100% completion.
Causes
- The string was simply overlooked
- The string is intentionally not translated
Sometimes, software projects try to justify leaving some strings untranslatable:
- It would mean more work for the translators
- The strings are not that important
- They are proper names
- It’s a technical identifier that mustn’t be changed
Only the last argument is valid. The others aren’t because they miss the point of translations. They imply that the user already knows English (or whatever default language you use) anyway and the translations are only considered as an accessory. Accordingly, the quality of the translations decreases. There are users who don’t speak a single word of English. For them, every single untranslated string is cryptic.
Solution
Make all user-facing strings translatable (with the exception of technical strings like URLs, ID numbers, etc.).
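A small sketch of where the line can be drawn, assuming GNU gettext in C. The URL and the function name are invented for this example: the sentence is marked as translatable, while the URL is a technical string and is passed in as a parameter.

```c
#include <libintl.h>
#include <stdio.h>

#define _(msgid) gettext(msgid)

/* Technical identifier: stays the same in every language. */
#define PROJECT_URL "https://example.org/"

/* The sentence around the URL is user-facing, so it is translatable. */
int format_help_hint(char *buf, size_t size)
{
    return snprintf(buf, size, _("Visit %s for help."), PROJECT_URL);
}
```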
Too little space
Problem
The translated string does not fit into the user interface, e.g. the text exceeds the boundaries of the button, window, screen, etc. This problem is obvious when you see it, but it often goes unnoticed because developers rarely test every single translation.
Solution
Leave enough “breathing room” in UI elements that contain text. If the original language is English and your text barely fits into a button, chances are high that translators will struggle. Many languages are more verbose than English. As a rule of thumb, for things like buttons, use 50% more space than the English text would take.
Test translations from time to time or ask translators or users to do it for you. Listen to reports complaining about a lack of space.
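The rule of thumb can be written down as a tiny helper. This is only an illustration: the function name is made up, and the width is assumed to be the rendered width of the English label in pixels.

```c
/* Rule-of-thumb minimum width for a text element: the width of the
   English label plus 50%, rounded up. */
int padded_label_width(int english_width)
{
    return english_width + (english_width + 1) / 2;
}
```

An English label that renders 80 pixels wide would get at least 120 pixels.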
Confusing source strings
Problem
A string has been marked as translatable, but the source string causes headaches for translators because it is confusingly written. The translator finds themselves unable to translate the string. Or, even worse, the translator tries to translate an ambiguous string, but misinterprets it and thus translates it incorrectly. Remember, most translators are not coders, so strings need to make at least some sense on their own.
The problem has several manifestations:
- Ambiguous: The string is ambiguous and the context has not been explained to the translator.
- Cryptic: The string consists of unintelligible character sequences, presumably to be resolved by the software internally
Solutions
There are several possible solutions:
- Rewrite the string to become more readable
- Break the string into multiple strings. Line breaks, list entries and paragraphs are good points to break a string apart
- Add a comment directed at the translator explaining what this string does (Both GNU gettext and Qt support this)
Also, all character sequences with a special technical meaning (like an escape code to trigger a text color change) must be documented somewhere.
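A translator comment in C with GNU gettext could look like the following sketch. The function name and the string are made up. The comment has to sit directly before the marked string, and xgettext only copies it into the translation template when it is run with an option like --add-comments=TRANSLATORS.

```c
#include <libintl.h>

#define _(msgid) gettext(msgid)

const char *open_button_label(void)
{
    /* TRANSLATORS: Button label, a verb: opens the selected file.
       Not the adjective (as in "the door is open"). */
    return _("Open");
}
```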
Text is baked into images
Problem
There is an image file which contains text. The image is the same for all languages. Thus, for the user, the text is always the same and translators can’t fix it.
Example
In Hedgewars, in the singleplayer menu, there is a button for the training missions. It showed a hedgehog writing the English text “I must not eat melon bombs.” on a blackboard. This image was the same regardless of language settings.
This was solved by changing the code so that the image is only shown for English. Other languages get a special new image in which the text is replaced by comic-style lines.
Solutions
Possible solutions include:
- Edit the offending image and remove all text from it. Store the text as actual text so tools like GNU gettext can work with it
- Show the original image for the source language and show an alternative simplified image (without the text) to all other languages
- Use a different image for each language, with each image containing the translation. This is obviously the least pragmatic solution if you support tons of languages, because someone has to draw these new images
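The second option (original image for the source language, text-free image for everyone else) might be sketched like this. The file names and the function are hypothetical and not how Hedgewars actually implements it.

```c
#include <stddef.h>
#include <string.h>

/* Picks an image for a language code such as "en", "en_US" or "de_DE".
   Both file names are placeholders. */
const char *training_image_for(const char *lang)
{
    /* Only the source language gets the picture that contains text. */
    if (lang != NULL && strncmp(lang, "en", 2) == 0 &&
        (lang[2] == '\0' || lang[2] == '_' || lang[2] == '-' || lang[2] == '.')) {
        return "training_en.png";
    }
    /* Everyone else, including unknown languages, gets the text-free image. */
    return "training_notext.png";
}
```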
No consistent use of format strings
Problem
This problem occurs when a string has been broken apart too aggressively. For example, a sentence is broken apart in the middle. Another possibility is that variable text (like numbers) is inserted into the string via string concatenation, thus generating multiple strings. Format strings aren’t used.
This is bad because:
- Translators may not know or understand the structure of the finished concatenated string because they only see one cut-off portion at a time
- It forces a certain order of the variable text. But many languages have a different word order, and it’s often grammatically impossible to keep the word order of the source string intact
Example 1
Consider the sentence “Hello, PLAYER! How is your day?”, where “PLAYER” is supposed to become the player name. In C, one naive implementation could look like this:
strcpy(translated_string, gettext("Hello, "));
strcat(translated_string, player_name);
strcat(translated_string, gettext("! How is your day?"));
strcat(translated_string, "\n");
printf(translated_string);
translated_string is a char* variable used to build the final string. gettext is the translation function from GNU gettext. When it comes to translating, the translator will see two strings:
- “Hello, ”
- “! How is your day?”
This is quite confusing for translators since they usually see only one string at a time, even though both fragments belong to the same sentence.
Example 2
In Minetest Game, there is an item tooltip for a key that is written like “Key to PLAYER’s THING”, where “PLAYER” is a player name and “THING” is an object with a lock. For example, “Key to Wuzzy’s Locked Chest”.
A naive implementation in Lua code would have been:
local translated_string = S("Key to ") .. PLAYER .. S("'s ") .. THING .. S(".")
(where S is the translation function)
The same problems as in Example 1 apply, as the string is oddly broken apart into confusing fragments. However, there is an even worse problem: Word order. In many languages, the order of the parameters PLAYER and THING needs to be flipped. A correct German translation is “Schlüssel für THING von PLAYER”.
The naive implementation makes it impossible to translate it like that in German since PLAYER always comes first. There’s nothing the translators can do.
Solution
Use format strings or parameter symbols wherever variable text is used. (If you don’t know what a format string is: Look up the help for the C function printf.) Don’t do string concatenations within the same sentence. If your translation system does not support format strings, it must be replaced.
Concatenating strings to insert hardcoded line breaks is usually OK, though.
For Example 1, the correct code would be:
const char* player_name = "Wuzzy";
printf(gettext("Hello, %s! How is your day?"), player_name);
For Example 2, the correct code would look like this:
local translated_string = S("Key to @1's @2", PLAYER, THING)
Where S is the translation function and @1 and @2 are parameter symbols that Luanti will replace with PLAYER and THING, respectively.
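For comparison, here is how the word order problem from Example 2 could be handled in C with GNU gettext. This is a sketch with a made-up function name. It relies on numbered arguments like %1$s, which are a POSIX extension to printf and not part of ISO C, so check that your platform supports them.

```c
#include <libintl.h>
#include <stdio.h>

#define _(msgid) gettext(msgid)

/* Builds the key tooltip. Because the arguments are numbered, a
   translation may use them in any order. */
int format_key_tooltip(char *buf, size_t size,
                       const char *player, const char *thing)
{
    /* TRANSLATORS: %1$s is a player name, %2$s is the locked object. */
    return snprintf(buf, size, _("Key to %1$s's %2$s"), player, thing);
}
```

A German translator can then write “Schlüssel für %2$s von %1$s” and both parameters end up in the right place without any change to the code.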
Multiple use of the same string with different meanings
Problem
The identical string is used multiple times in the program, but has completely different meanings in different places. This problem is very easy to overlook because it is not a problem in the source language, only in translations. Whenever the target language makes a linguistic distinction between the meanings, the translators won’t be able to translate the string correctly.
Example
In the game “Cataclysm: Dark Days Ahead”, the string “Screwdriver” used to be ambiguous. The meanings:
- Screwdriver, a tool
- Screwdriver, a kind of cocktail
The problem was solved by changing the name for the cocktail to “screwdriver cocktail”.
Solution
If your library allows this, specify context for the clashing strings. GNU gettext and Qt support this. Look up the manual to learn how context works.
Alternatively, it is also possible to change one of the source strings, if you can come up with a reasonable alternative, that is. It might not always be reasonable, however.
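To give an idea of how context works under the hood in GNU gettext: the context and the string are joined by the control character "\004" and looked up together. GNU gettext ships a header file, gettext.h, with a ready-made pgettext macro for this. The sketch below only imitates that convention with made-up names (pgettext_sketch, lookup_with_context) and is not a replacement for the real thing.

```c
#include <libintl.h>
#include <locale.h>

/* Joins context and string with "\004" at compile time. Both arguments
   must be string literals. */
#define pgettext_sketch(ctx, msgid) \
    lookup_with_context(ctx "\004" msgid, msgid)

const char *lookup_with_context(const char *ctx_and_id, const char *msgid)
{
    const char *translation = dcgettext(NULL, ctx_and_id, LC_MESSAGES);
    /* No translation found: gettext hands back its argument, so fall
       back to the plain string without the context prefix. */
    return translation == ctx_and_id ? msgid : translation;
}

/* Same English text, two different entries for translators. */
const char *tool_name(void)     { return pgettext_sketch("tool", "Screwdriver"); }
const char *cocktail_name(void) { return pgettext_sketch("cocktail", "Screwdriver"); }
```

In the translation file, the two strings show up as separate entries, each labeled with its context, so a translator can pick a different word for each.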
Numerus not considered
Problem
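Very often the numerus is not taken into account correctly. The numerus (grammatical number) is the way words change depending on the number attached to them. In English, there is singular and plural: 1 cook, 2 cooks, etc. Other languages have different rules, and some have more than two forms.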
Example
The program displays “You have 20 coin(s).” instead of “You have 20 coins.”
Solution
Practically all major translation systems have built-in support for the numerus. GNU gettext calls this feature “plural forms”. If available, use this.
If your system does not support this, you can sometimes get away with a workaround by turning strings like “%d coin(s)” to “Coins: %d” (where %d is a number). Note this workaround is not always reasonable.
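With GNU gettext in C, plural forms are handled by the function ngettext, which takes the singular string, the plural string and the number. A minimal sketch for the coin example (the function name format_coins is made up):

```c
#include <libintl.h>
#include <stdio.h>

/* Writes "You have N coin(s)." with the grammatically correct form. */
int format_coins(char *buf, size_t size, unsigned long coins)
{
    /* ngettext picks the form that matches the number according to the
       plural rules of the active translation. Without a translation it
       returns the first string for 1 and the second one otherwise. */
    const char *fmt = ngettext("You have %lu coin.", "You have %lu coins.",
                               coins);
    return snprintf(buf, size, fmt, coins);
}
```

Translators for languages with more than two forms simply provide more variants in their translation file; the code stays the same.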
Marking the empty string as translatable
Problem
Somewhere in the source code, the empty string was marked as translatable.
This is a problem because it makes no sense to mark the empty string as translatable and can sometimes lead to errors.
In GNU gettext, you actually can’t make the empty string translatable. Here, the empty string identifier has a special meaning. It is reserved to store metadata about the translation itself.
Example
In Luanti, we had a very strange bug shortly before the release of version 5.0.0: Weblate, the service we use to translate Luanti, suddenly stopped accepting updates to the translation files.
After some digging, I found out that someone somewhere had added a fgettext("") statement, and then everything went south. To quote my own post-mortem:
“The reason why Weblate failed was a regression introduced in Luanti commit 5ef9056. Namely, fgettext("") was called, which you should NOT do (using empty string in GNU gettext has a special meaning)! This caused util/updatepo.sh to be confused and spit out garbage PO files that had their metadata removed. Without the metadata the PO files were broken and as a result, Weblate complained.”
(util/updatepo.sh is our script to update the translation files.) The bugfix was to remove fgettext("") (commit 81f86b0). What makes this bug remarkable is how attempting to translate the empty string in GNU gettext triggered an entire chain of events that broke things.
Solution
Never mark an empty string as translatable.
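A literal empty string like the one in the example can only be fixed by removing it from the source. A related trap at run time is passing a variable to gettext that happens to be empty: with a translation loaded, gettext("") returns the metadata of the translation file instead of an empty string. A small guard avoids that. The wrapper name translate_or_empty is made up for this sketch.

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical wrapper for strings that are only known at run time. */
const char *translate_or_empty(const char *msgid)
{
    if (msgid == NULL || msgid[0] == '\0') {
        /* Never look up "": that would fetch the metadata entry. */
        return "";
    }
    return gettext(msgid);
}
```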
Very long source strings
Problem
This problem is not as severe as other problems, but worth mentioning. Source strings that are very long are agonizing to translate. The main problem here is when small changes are frequently made to the source string. Each time, the entire long string becomes untranslated again, which is annoying.
Solution
Long source strings should be located and broken down into smaller appetizers that translators can digest. The rule of thumb here is “one train of thought per string”. When you notice one particular long string has seen multiple minor changes in a short time, it is a sign it should be broken up.
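One way to break up a long help text in C with GNU gettext is to keep each paragraph as its own string and join them in code. The following is a sketch under that assumption; the texts and names are invented. N_ is the conventional marker for strings that are extracted now but translated later (xgettext needs --keyword=N_ to pick it up).

```c
#include <libintl.h>
#include <stdio.h>

/* N_() only marks a string for extraction; the lookup happens later. */
#define N_(msgid) msgid

/* One train of thought per string. */
static const char *const help_paragraphs[] = {
    N_("Use the arrow keys to move."),
    N_("Press the space bar to jump."),
    N_("Press Escape to open the menu."),
};

/* Translates each paragraph on its own and joins them with blank lines.
   Returns the length of the text stored in buf. */
size_t build_help_text(char *buf, size_t size)
{
    size_t used = 0;
    size_t count = sizeof help_paragraphs / sizeof help_paragraphs[0];

    if (size == 0) {
        return 0;
    }
    buf[0] = '\0';
    for (size_t i = 0; i < count && used < size; i++) {
        int n = snprintf(buf + used, size - used, "%s%s",
                         i > 0 ? "\n\n" : "", gettext(help_paragraphs[i]));
        if (n < 0) {
            break;
        }
        used += (size_t)n;
    }
    /* snprintf reports the untruncated length, so clamp on overflow. */
    return used < size ? used : size - 1;
}
```

If the second paragraph is reworded later, only that one string becomes untranslated; the other two keep their translations.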