
Sorting the Baltics

Hi there, I'm Panda and lately I have been pondering the Baltic states. Did you know the Baltic states are in alphabetical order from north to south? Estonia, Latvia, Lithuania — a pretty neat mnemonic that made this panda ponder: in how many languages does this mnemonic work? A seemingly simple question that turned out to reveal some interesting insights into how different languages are stored and processed.

Our plan de campagne is straightforward: collect the Baltic countries' names in as many languages as possible. Then, for each language, check whether the names are in alphabetical order. It sounds like a walk in the park, but that usually means we are going for a hike. So, we better get started!

Data collection

The first step in our plan de campagne is collecting the Baltic countries' names in as many languages as possible. Initially, I used Google Translate to translate the countries' names into all 133 supported languages. However, while inspecting the output, I realised I had no way to assess the quality of the translations and, by extension, the quality of the final answer. I wanted a more reliable data source.

The Unicode Common Locale Data Repository (CLDR) contains, as the name suggests, a bunch of locale data. This includes high-quality translations of names for countries and regions, which is exactly what we need. You can manually download the data here or run the script below.

#!/bin/bash

LATEST_VERSION=48.1
wget https://unicode.org/Public/cldr/$LATEST_VERSION/cldr-common-$LATEST_VERSION.zip
unzip cldr-common-$LATEST_VERSION.zip "common/main/*" -d cldr-data

The downloaded folder contains one XML file per supported language variety. From each file, we extract all <territory> tags, inside of which are the names of the countries. We are only interested in the tags of type "EE", "LV" or "LT", representing Estonia, Latvia and Lithuania, respectively. Files that do not have data for all three countries are ignored. The parsing function below implements these steps. Note that the countries are always returned in the desired order, from north to south.

import xml.etree.ElementTree as ET

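def parse(path: str) -> list[str] | None:
    """Return the Baltic names from north to south, or None if any is missing."""
    tree = ET.parse(path)
    root = tree.getroot()

    territories = root.find("localeDisplayNames/territories")
    if territories is None:
        return None

    TARGETS = ["EE", "LV", "LT"]  # Estonia, Latvia, Lithuania
    result = {}

    for t in territories.findall("territory"):
        # Skip empty tags so that .lower() is never called on None.
        if (t_type := t.get("type")) in TARGETS and t.text:
            result[t_type] = t.text.lower()

        if len(result) == len(TARGETS):
            return [result[c] for c in TARGETS]

    # Not all three countries were found in this file.
    return None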
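
If you want to try the extraction without downloading anything, here is a minimal sketch that runs the same steps on a tiny hand-written snippet. The SAMPLE document and the extract() helper are my own stand-ins for illustration; the real CLDR files are much larger, and I am only assuming they follow the layout described above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a CLDR locale file (assumed layout, not real data).
SAMPLE = """<ldml>
  <localeDisplayNames>
    <territories>
      <territory type="DE">Germany</territory>
      <territory type="EE">Estonia</territory>
      <territory type="LT">Lithuania</territory>
      <territory type="LV">Latvia</territory>
    </territories>
  </localeDisplayNames>
</ldml>"""

TARGETS = ["EE", "LV", "LT"]  # north to south


def extract(xml_text):
    """Hypothetical helper: same steps as parse(), but on a string."""
    root = ET.fromstring(xml_text)
    territories = root.find("localeDisplayNames/territories")
    if territories is None:
        return None

    found = {}
    for t in territories.findall("territory"):
        code = t.get("type")
        if code in TARGETS and t.text:
            found[code] = t.text.lower()

    # Ignore documents that lack any of the three countries.
    if len(found) < len(TARGETS):
        return None
    return [found[code] for code in TARGETS]


print(extract(SAMPLE))
```

Even though Lithuania comes before Latvia in the snippet, the names come back from north to south, because the final list is built from TARGETS rather than from the order in the file.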

Sorting it out

Now that the data collection is complete, it is time to put things in order, alphabetical order to be precise. Recall that the mnemonic works for a language if the names of the Baltic states in that language appear in alphabetical order. For example, the Baltic countries' names in Tarifit (a Berber language spoken in northern Morocco) are istunya, latevya, litwanya. If we sort these names, we get istunya, latevya, litwanya. The order did not change and therefore, the mnemonic works in Tarifit. Implementing this in code is straightforward.

def in_alphabetical_order(names: list[str]) -> bool:
    return sorted(names) == names

Or is it? When dealing with human languages, things are never so straightforward. The above solution works fine when dealing with English, but how does it handle non-Latin characters? To answer that question, we need to take a look under the hood.

Python stores strings as sequences of Unicode code points. A code point is a number, conventionally written in hexadecimal, that identifies a character. For example, the character "A" has the code point U+0041. When Python's built-in sorted() function compares strings, it compares these code points, e.g. U+0041 comes before U+0042 (which represents "B"). As you can see, this works for English but it breaks down quickly for other languages.

For instance, the Swedish alphabet consists of the 26 letters of the Latin alphabet plus å, ä, ö, in that order. The code point for å is U+00E5 whereas the code point for ä is U+00E4. This means that the letter ä has a lower numerical value even though it appears later in the alphabet. Indeed, running sorted(['å', 'ä']) returns ['ä', 'å']. You can imagine that this problem gets worse for languages that have completely different alphabets.
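
To see this for yourself, the short sketch below prints the code points of the three extra Swedish letters and sorts them the naive way. I wrote the letters as escape sequences so that each one is guaranteed to be a single code point; the variable names are just for illustration.

```python
# å, ä, ö in Swedish alphabetical order, as single code points.
letters = ["\u00e5", "\u00e4", "\u00f6"]

# ord() gives the code point as an integer; format it the usual U+XXXX way.
code_points = {ch: f"U+{ord(ch):04X}" for ch in letters}
print(code_points)

# sorted() compares code points, so ä (U+00E4) jumps ahead of å (U+00E5).
by_code_point = sorted(letters)
print(by_code_point)
print(by_code_point == letters)
```

The last line prints False: the code point order and the Swedish alphabetical order disagree.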

To make matters even more complicated, there are many language-specific sorting rules. For example, French treats é and e as the same letter whereas Spanish considers ñ and n as distinct letters. Thus, we need to take into account the alphabet of each language and its particular sorting rules. This is called locale-aware sorting. Luckily, I am not the first person to run into this problem. The aforementioned CLDR also contains sorting rules for each language. To access these rules, we import the International Components for Unicode (ICU) library through its Python binding, PyICU, which uses CLDR in the background.

import icu

def in_alphabetical_order(words: list[str], language: str) -> bool:
    collator = icu.Collator.createInstance(icu.Locale(language))
    return sorted(words, key=collator.getSortKey) == words
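
If the idea of a sort key feels a bit abstract, here is a toy version that needs no ICU at all. It is only a sketch of the concept, assuming a single hard-coded alphabet: each character is replaced by its position in the Swedish alphabet, and the words are sorted on those positions. The names SWEDISH_ALPHABET, toy_sort_key and in_toy_order are made up for this example, and real collation handles far more (accents, case, multi-letter units) than this does.

```python
# The Swedish alphabet: 26 Latin letters followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def toy_sort_key(word):
    """Map each character to its alphabet position (unknown characters go last)."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


def in_toy_order(words):
    """Check the order against the toy alphabet instead of raw code points."""
    return sorted(words, key=toy_sort_key) == words


letters = ["\u00e5", "\u00e4", "\u00f6"]  # å, ä, ö

print(sorted(letters) == letters)  # code point order
print(in_toy_order(letters))       # toy alphabet order
```

The first check fails and the second passes, which is the whole point of sorting on a key instead of on the raw characters.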

Before we put everything together, you may wonder, as did I, how sorting works for right-to-left languages. For example, we know that the mnemonic works in English but if we were to reverse the names and then sort them, we would discover that the order had changed, indicating that the mnemonic does not work in English (ainotse, aivtal, ainauhtil becomes ainauhtil, ainotse, aivtal). Does that mean we need to reverse all the words from right-to-left languages before sorting them?

No. As it turns out, characters (or code points) are stored in logical order, that is, the order in which they are typed and read, as opposed to visual order. In other words, the first character of a word is always at index 0, whichever side of the screen it ends up on. Python does not care about whether the first character is displayed on the left or right side of the screen; that is a rendering issue and therefore does not affect our program logic.
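
Here is a small sketch to check this with a right-to-left word. The word is a Hebrew spelling of Estonia that I typed in myself for illustration (it is not taken from the CLDR data), written as escape sequences so that your editor's rendering cannot play tricks on you.

```python
import unicodedata

# A Hebrew spelling of "Estonia" (illustrative), in logical order.
word = "\u05d0\u05e1\u05d8\u05d5\u05e0\u05d9\u05d4"

first = word[0]

# Index 0 holds the first letter a reader would read, even though
# a screen draws it on the right-hand side.
print(unicodedata.name(first))

# The direction is a property of the character, used for rendering only.
print(unicodedata.bidirectional(first))

# The code points in memory, first letter first.
print([f"U+{ord(ch):04X}" for ch in word])
```

The first letter, alef, sits at index 0, and its right-to-left direction is stored as a character property rather than in the order of the string.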

Conclusion

Now that we have all the building blocks, it is time to put together the final solution. We iterate over every XML file in the CLDR folder and our parsing function extracts the names of the Baltic states. Then, we check whether the names are in alphabetical order with our locale-aware sorting function. Lastly, the results are saved to a CSV file.

import os
import csv

cldr_directory = "cldr-data/common/main"
header = ["language", "estonia", "latvia", "lithuania", "in_order"]
data = [header]

for filename in os.listdir(cldr_directory):
    path = os.path.join(cldr_directory, filename)

    names = parse(path)
    if names is None:
        continue

    language = filename.removesuffix(".xml")
    in_order = in_alphabetical_order(names, language)
    data.append([language, *names, in_order])

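# newline="" is what the csv module expects; utf-8 keeps non-Latin names intact.
with open("results.csv", "w", newline="", encoding="utf-8") as file: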
    writer = csv.writer(file)
    writer.writerows(data)
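
Once the file is written, counting the hits is a small job. Below is a minimal sketch of how such a tally could be computed, assuming the column layout from the header above and that csv.writer has stored the booleans as the text True or False. The tally() helper and the sample rows are mine, and the last row is entirely made up to show a failing case.

```python
import csv
import io


def tally(csv_text):
    """Hypothetical helper: return (languages in order, total languages)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = sum(row["in_order"] == "True" for row in rows)
    return hits, len(rows)


# Tiny illustrative sample, not the real results.
SAMPLE = (
    "language,estonia,latvia,lithuania,in_order\n"
    "en,estonia,latvia,lithuania,True\n"
    "rif,istunya,latevya,litwanya,True\n"
    "xx,c,b,a,False\n"  # made-up row
)

hits, total = tally(SAMPLE)
print(f"{hits} out of {total} languages, or about {hits / total:.1%}")
```

To run it on the real output, read results.csv into a string (with encoding="utf-8") and pass that to tally() instead of SAMPLE.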

The results are in! The mnemonic works for 195 out of 243 languages, or about 80.2%. So, there you go, that is our final answer. I also included some bonus facts for you below.

That is all I have for you today. Thank you so much for reading this far and I hope you have a wonderful day!
Panda <3
