English is, in many ways, convenient.
Whether you’re watching that latest show on Netflix, listening to your new favorite band, or talking to your old friend halfway around the world, English is there. It’s relatively easy to use, ubiquitous, almost normalized. The same goes for many language models, built by English speakers, in English, and with English in mind. But there are still many domains where relying on English and hoping that communication doesn’t break down simply fails.
Natural language processing (NLP) is one such domain. As an umbrella term, NLP covers the various ways in which computers generate, process, and understand natural languages. But, while the input/output (I/O) model of communication in human-computer interaction (HCI) rests on the same principles regardless of the computing device, human languages are incredibly diverse. What follows is that some linguistic systems and features are shared across many languages, while others can be lacking, even in geographically close language systems.
The sheer number of #NLProc resources available nowadays is nothing but impressive.— Krzysztof Borowski (@VoiceFirstChris) October 10, 2021
The fact that many implicitly assume that language = English, or that languages default to English-like systems, is naive and detrimental to developing multilingual NLP tools in the long run.
Recently, I have been thinking about the Anglo-centricity of our approaches to NLP, which automatically generates several problems. To use a quote from Emily Bender of the University of Washington, “English isn’t generic for language (despite what NLP papers might lead you to believe).”
Preparing to give a talk at SDSS next week. My title: “English isn’t generic for language, despite what NLP papers might lead you to believe” >>— Emily M. Bender (@emilymbender) May 25, 2019
To demonstrate that, I came up with a quick list of reasons (with examples) why language models trained on standard English data will fail when applied to other languages. The list also applies to languages from typologically, culturally, or geographically close regions. Here are four grammatical features that differ across languages, each of which can make your English-based model obsolete.
Let’s discuss these in a bit more detail now.
1. The Inflection Point
If you come from an inflection-rich language, learning English feels like a breeze. But if you happen to speak English as your first language, the fact that each word has multiple variations that slightly differ can be taxing. As a linguistic feature, inflection comes easily. After all, using dogs instead of dog, or eats instead of eat allows us to be more specific. But, there are languages (and even whole language families) where these tiny modifications carry great importance.
Take Polish, for example, with the standard package of seven cases. As it is, it prides itself on the number of alterations applied to each noun.
Look at that cake. That cake looks great, I’m watching that cake, but I don’t have that cake at home.
As you may have noticed (or not, since that cake just stole the show), I just used the word “cake” three times in the previous sentence. Because English falls on the relaxed side of the inflection spectrum, no modifications of the word “cake” are needed. In Polish, however, each of the three instances takes a slightly different form.
- To ciasto [subject, nominative] świetnie wygląda, przyglądam się temu ciastu [patient, dative], ale nie mam tego ciasta [genitive, negated object] w domu.
And if you think this is incidental, it’s not–these modifications apply to all nouns in Polish, not to mention adjectives, pronouns, numerals, etc. Think your English-based model can do that? Think again.
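To make the point concrete, here is a minimal Python sketch (not a real morphological analyzer) that contrasts exact-string matching with a lookup through a tiny hand-written table of the three forms used above. The table `CASE_FORMS` and the helper names are assumptions made purely for illustration.

```python
import re

# Hand-written table covering only the three forms from the example sentence.
# A real system would need a full morphological lexicon or analyzer.
CASE_FORMS = {
    "ciasto": ("ciasto", "nominative"),
    "ciastu": ("ciasto", "dative"),
    "ciasta": ("ciasto", "genitive"),
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def count_exact(text, word):
    """English-style assumption: one surface form per word."""
    return tokenize(text).count(word)

def count_by_lemma(text, lemma):
    """Count every token whose table entry points back to the given lemma."""
    return sum(1 for t in tokenize(text)
               if CASE_FORMS.get(t, (None, None))[0] == lemma)

english = "That cake looks great, I'm watching that cake, but I don't have that cake at home."
polish = ("To ciasto świetnie wygląda, przyglądam się temu ciastu, "
          "ale nie mam tego ciasta w domu.")

print(count_exact(english, "cake"))     # 3
print(count_exact(polish, "ciasto"))    # 1
print(count_by_lemma(polish, "ciasto")) # 3
```

Exact matching finds all three mentions in English but only one in Polish; the other two only show up once each form is mapped back to its base.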
2. The Affixation Fixation
Similar to that idea is affixation: adding small meaningful chunks called morphemes onto words to create a new meaning or modify an existing one. In English, you can think of words such as “prenatal” (pre- = before) or “loneliness” (-ness = condition, quality, state). Affixation is a common way of expanding the lexicon and occurs in that function in many languages, which includes the Slavic language family.
In Slavic languages, however, affixation is taken to a whole new level. Compared to English, Slavic languages add a bunch of different affixes to your regular pack of prefixes, suffixes, and infixes to create nuanced and unique modifications of the base word.
Take that teddy bear. If you see that teddy bear in a store, you’ll probably think “teddy bear.” But, let’s say you own one and absolutely love it and want to express it through language by modifying the original phrase. How do you do it (can you)?
In Polish (and, similarly, other Slavic languages), this becomes relatively easy.
- miś = teddy bear (neutral or somewhat endearing)
- misio = teddy bear (endearing)
- misiunio = [little to very little] teddy bear (very endearing)
- misiaczek = [little to very little] teddy bear (also, very endearing)
Do you now see what you miss as an English speaker? A similar case is that of the numerous derivatives of the most common verbs, such as jechać (= to go with means of transportation), iść (= to go on foot, prototypically), or czytać (= to read). In comparison, English tends to express these meanings through phrasal verbs.
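As a rough illustration, the Python sketch below shows what happens when a vocabulary that only knows the base form meets the diminutives listed above. The forms come from the list; the numeric "endearment" scale, the `DIMINUTIVES` table, and the function names are hypothetical and exist only for this example.

```python
# Forms copied from the list above. The 0-2 endearment scale is a
# hypothetical simplification of the glosses, not a linguistic standard.
DIMINUTIVES = {
    "miś": ("miś", 0),        # neutral or somewhat endearing
    "misio": ("miś", 1),      # endearing
    "misiunio": ("miś", 2),   # very endearing
    "misiaczek": ("miś", 2),  # also very endearing
}

def naive_lookup(word, vocabulary):
    """Exact-match lookup: anything not in the vocabulary is unknown."""
    return word if word in vocabulary else "<UNK>"

def analyze(word):
    """Return (base form, endearment level), falling back to the word itself."""
    return DIMINUTIVES.get(word, (word, 0))

vocabulary = {"miś"}  # a model that has only ever seen the base form

for form in DIMINUTIVES:
    print(form, naive_lookup(form, vocabulary), analyze(form))
```

With exact matching, three of the four forms are simply unknown, and the affection they carry is lost with them.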
3. That Sweater I Bought
As a fellow Germanic language, English takes word order seriously. In so doing, it tends to inform us about the doer of the action, the action itself, the circumstances of the action, etc. As a rule, a door doesn’t go before the person who opened it (excluding passive voice here). A car doesn’t go before the person who ran to it. A book doesn’t go before someone who read it (but this is not an absolute rule). In terms of information structure (also known as information packaging), English wants the topic or theme to come first in the sentence.
Unlike in English, free word order is a common characteristic of Polish and other Slavic languages. (As you can imagine, this is a bit of an exaggeration since not everything can go anywhere without consequences.) Because inflection expresses semantic relations between specific words in the sentence, Polish and other Slavic languages can get away with relative freedom on the sentence level.
Compare these English examples:
- Yesterday, I bought a new sweater in the store.
- I bought a new sweater in the store yesterday.
- ? In the store yesterday I bought a new sweater. (entering some shaky ground)
Now contrast them with these examples in Polish:
- Wczoraj kupiłem nowy sweter w sklepie.
= Yesterday, I bought a new sweater in the store.
- Wczoraj kupiłem w sklepie nowy sweter.
= [yesterday | I bought | in the store | new sweater]
- Wczoraj kupiłem sweter w sklepie nowy.
= [yesterday | I bought | sweater | in the store | new]
- Wczoraj w sklepie kupiłem nowy sweter.
= [yesterday | in the store | I bought | new sweater]
- Wczoraj w sklepie kupiłem sweter nowy.
= [yesterday | in the store | I bought | sweater | new]
- Wczoraj w sklepie sweter kupiłem nowy.
= [yesterday | in the store | sweater | I bought | new]
- Wczoraj w sklepie sweter nowy kupiłem.
= [yesterday | in the store | sweater | new | I bought]
- Wczoraj w sklepie nowy sweter kupiłem.
= [yesterday | in the store | new sweater | I bought]
- Wczoraj nowy sweter w sklepie kupiłem.
= [yesterday | new sweater | in the store | I bought]
- Wczoraj nowy sweter kupiłem w sklepie.
= [yesterday | new sweater | I bought | in the store]
- Wczoraj sweter w sklepie nowy kupiłem.
= [yesterday | sweater | in the store | new | I bought]
- Wczoraj sweter w sklepie kupiłem nowy.
= [yesterday | sweater | in the store | I bought | new]
- Kupiłem wczoraj nowy sweter w sklepie.
= [I bought | yesterday | new sweater | in the store]
- Kupiłem wczoraj sweter nowy w sklepie.
= [I bought | yesterday | sweater | new | in the store]
- Kupiłem wczoraj w sklepie nowy sweter.
= [I bought | yesterday | in the store | new sweater]
- ~ Kupiłem wczoraj w sklepie sweter nowy.
= [I bought | yesterday | in the store | sweater | new]
And so it continues.
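A small Python sketch can show why position-based shortcuts break here. It takes five of the variants above, checks that they all contain exactly the same words, and then applies an English-style heuristic that treats the word right after the verb as the object. The heuristic and the helper names are assumptions for illustration, not a description of any particular parser.

```python
from collections import Counter

# Word-order variants copied from the examples above.
VARIANTS = [
    "Wczoraj kupiłem nowy sweter w sklepie.",
    "Wczoraj kupiłem w sklepie nowy sweter.",
    "Wczoraj w sklepie kupiłem nowy sweter.",
    "Wczoraj nowy sweter kupiłem w sklepie.",
    "Kupiłem wczoraj nowy sweter w sklepie.",
]

def tokens(sentence):
    """Strip the final period, lowercase, and split on whitespace."""
    return sentence.rstrip(".").lower().split()

def word_after_verb(sentence, verb="kupiłem"):
    """Positional heuristic: guess that the object directly follows the verb."""
    toks = tokens(sentence)
    i = toks.index(verb)
    return toks[i + 1] if i + 1 < len(toks) else None

# Every variant is the same bag of words...
bags = [Counter(tokens(s)) for s in VARIANTS]
same_bag = all(bag == bags[0] for bag in bags)

# ...but the positional guess changes with the order.
guesses = [word_after_verb(s) for s in VARIANTS]

print(same_bag)  # True
print(guesses)   # ['nowy', 'w', 'nowy', 'w', 'wczoraj']
```

The heuristic never lands on "sweter" and returns three different answers for what is, in terms of content, the same sentence.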
4. Obsessed With Gender
Grammatical gender is one of the most fascinating features of world languages. While some languages possess developed gender systems and rules, languages such as Finnish, Hungarian, or Turkish lack it entirely.
In that regard, English is somewhere toward the weaker side of the spectrum, marking gender mainly through pronouns and some nouns, many of which are now considered obsolete. That is, however, not the case for Slavic languages. For instance, from the linguistic point of view, Polish is pretty much gender-obsessed. Since every noun has a specific grammatical gender, adjoining adjectives and numerals need the gender marked, too. Additionally, past-tense verbs agree in gender with their subject, so first-person forms reveal the speaker’s gender and second-person forms the hearer’s. As the world of conversational AI and chatbots accelerates in our ever-online reality, this poses significant challenges for effective computer-mediated communication.
Imagine building a chatbot and having to incorporate this knowledge into your conversation design process. It also means that, at minimum, you should (a) incorporate an element of gender recognition for each chatbot user and (b) apply it consistently throughout the conversational experience. In this scenario, matching the user’s preferred gender becomes yet another complication. On the flip side, Slavic names are easily identifiable as female or male.
Still, the use of grammatical gender each time the conversation turns to past actions requires building an extended model that recognizes the gender of the speaker(s) and the hearer(s). Consider this minimal pair of statements:
- Kupiłam wczoraj sweter.
= I bought a sweater yesterday [female speaker]
- Kupiłem wczoraj sweter.
= I bought a sweater yesterday [male speaker]
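For a chatbot, the pair above cuts both ways: the verb form a user types is a signal about their grammatical gender, and the bot has to pick the matching form when it speaks in the first person. Here is a minimal Python sketch limited to the two forms shown above; the table and function names are hypothetical, and a real system would need proper morphological analysis plus a way to respect the user's stated preference.

```python
# First-person singular past-tense forms copied from the pair above.
BOUGHT_1SG = {"female": "Kupiłam", "male": "Kupiłem"}

# Reverse table for recognizing the speaker's gender from the verb form.
GENDER_BY_FORM = {form.lower(): gender for gender, form in BOUGHT_1SG.items()}

def bought_sweater_yesterday(speaker_gender):
    """Build 'I bought a sweater yesterday' for the given speaker gender."""
    if speaker_gender not in BOUGHT_1SG:
        raise ValueError(f"no form known for gender: {speaker_gender!r}")
    return f"{BOUGHT_1SG[speaker_gender]} wczoraj sweter."

def speaker_gender_from(sentence):
    """Return the gender signalled by a known verb form, or None if absent."""
    for token in sentence.rstrip(".").lower().split():
        if token in GENDER_BY_FORM:
            return GENDER_BY_FORM[token]
    return None

print(bought_sweater_yesterday("female"))              # Kupiłam wczoraj sweter.
print(speaker_gender_from("Kupiłem wczoraj sweter."))  # male
```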
As if this wasn’t enough, Polish also has separate forms for groups of people, depending on (wait for it) the number of men in the group. If that number is equal to or greater than 1, then male-specific forms need to be used. But, if there are no men in that group, then their female-specific equivalents are necessary:
- Co wczoraj kupiliście?
= What did you buy yesterday? [male-only or mixed group]
- Co wczoraj kupiłyście?
= What did you buy yesterday? [female-only group]
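The group rule described above can be sketched in a few lines of Python. This is a simplified illustration that assumes each addressee's gender is already known and labeled `"male"` or `"female"`; the function names are hypothetical.

```python
def bought_2pl(addressee_genders):
    """Pick the second-person plural past form of 'to buy' for a group.

    One or more men in the group selects the male-specific form;
    a group with no men selects the female-specific form.
    """
    genders = list(addressee_genders)
    if not genders:
        raise ValueError("the group must contain at least one person")
    return "kupiliście" if any(g == "male" for g in genders) else "kupiłyście"

def ask_group(addressee_genders):
    """Build the question 'What did you buy yesterday?' for the group."""
    return f"Co wczoraj {bought_2pl(addressee_genders)}?"

print(ask_group(["female", "female"]))          # Co wczoraj kupiłyście?
print(ask_group(["female", "male", "female"]))  # Co wczoraj kupiliście?
```

A single man in an otherwise all-female group is enough to flip the form, so the bot needs information about every addressee, not just the majority.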
If you’d like to dive into this topic deeper, Laura Janda has a great overview of gender as a grammatical category in Slavic, including Polish.
As has become painfully obvious by now, one language does not equal another language. Having an English-language model or starting with one should be just that: a starting point. It cannot be an excuse to translate the lexicon into the target language in the hope it magically works in that language, too. Language systems are incredibly complex. That is why we need human brains to translate them for machines that can then do the hard computational work our brains cannot.