An A.I. Translation Tool Can Help Save Dying Languages. But at What Cost?

From Slate:

Sanjib Chaudhary chanced upon StoryWeaver, a multilingual children’s storytelling platform, while searching for books he could read to his 7-year-old daughter. Chaudhary’s mother tongue is Kochila Tharu, a language with about 250,000 speakers in eastern Nepal. (Nepali, Nepal’s official language, has 16 million speakers.) Languages with a relatively small number of speakers, like Kochila Tharu, do not have enough digitized material for linguistic communities to thrive—no Google Translate, no film or television subtitles, no online newspapers. In industry parlance, these languages are “underserved” and “underresourced.”

This is where StoryWeaver comes in. Founded by the Indian education nonprofit Pratham Books, StoryWeaver currently hosts more than 50,000 open-licensed stories across reading levels in more than 300 languages from around the world. Users can explore the repository by reading level, language, and theme, and once they select a story, they can click through illustrated slides (each as if it were the page of a book) in the selected language (there are also bilingual options, where two languages are shown side-by-side, as well as download and read-along audio options). “Smile Please,” a short tale about a fawn’s ramblings in the forest, is currently the “most read” story—originally written in Hindi for beginners, it has since been translated into 147 languages and read 281,000 times.

A majority of the languages represented on the platform are from Africa and Asia, and many are Indigenous, in danger of losing speakers in a world of almost complete English hegemony. Chaudhary’s experience as a parent reflects this tension. “The problem with children is that they prefer to read storybooks in English rather than in their own language because English is much, much easier. With Kochila Tharu, the spelling is difficult, the words are difficult, and you know, they’re exposed to English all the time, in schools, on television,” Chaudhary said.

Artificial intelligence-assisted translation tools like StoryWeaver can bring more languages into conversation with one another—but the tech is still new, and it depends on data that only speakers of underserved languages can provide. This raises concerns about how the labor of the native speakers powering A.I. tools will be valued and how repositories of linguistic data will be commercialized.

To understand how A.I.-assisted translation tools like StoryWeaver work, it’s helpful to look at neighboring India: With 22 official languages and more than 780 spoken languages, it is no accident that the country is a hub of innovation for multilingual tech. StoryWeaver’s inner core is inspired by a natural language processing tool developed at Microsoft Research India called interactive neural machine translation prediction technology, or INMT.

Unlike most A.I.-powered commercial translation tools, INMT doesn’t do away with a human intermediary altogether. Instead, it assists humans with hints in the language they’re translating into. For example, if you begin typing, “It is raining” in the target language, the model working on the back-end supplies “tonight,” “heavily,” and “cats and dogs” as options for completing your sentence, based on the context and the previous word or set of words. During translation, the tool accounts for meaning in the original language and what the target language allows, and then generates possibilities for the translator to choose from, said Kalika Bali, principal researcher at Microsoft and one of INMT’s main architects.
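
To make the idea of translator-facing hints concrete, here is a minimal Python sketch. It is not INMT's method: INMT is described above as a neural model that also accounts for the source sentence, whereas this toy only counts which word tends to follow the last two words typed. The corpus, the function names (`build_model`, `suggest`), and the two-word context are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical target-language sentences, invented for this illustration.
CORPUS = [
    "it is raining heavily",
    "it is raining heavily today",
    "it is raining tonight",
    "it is raining cats and dogs",
    "it is sunny today",
]

def build_model(sentences):
    """Count which word follows each two-word context in the corpus."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            model[context][words[i + 2]] += 1
    return model

def suggest(model, typed, k=3):
    """Return up to k candidate next words for the text typed so far."""
    words = typed.lower().split()
    if len(words) < 2:
        return []  # not enough context to suggest anything
    context = (words[-2], words[-1])
    followers = model.get(context, Counter())
    return [word for word, _ in followers.most_common(k)]

model = build_model(CORPUS)
print(suggest(model, "It is raining"))
```

With this toy corpus, "heavily" is offered first because it follows "is raining" most often. The human translator still chooses which hint to accept, if any, which is the division of labor the article describes.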

Tools like INMT allow StoryWeaver’s cadre of volunteers to generate translations of existing stories quickly. The user interface is easy to master even for amateur translators, many of whom, like Chaudhary, are either volunteering their time or already working for nonprofits in early childhood education. The latter is the case for Churki Hansda. Working in Kora and Santali, two underserved Indigenous languages spoken in eastern India, she is an employee at Suchana Uttor Chandipur Community Society, one of StoryWeaver’s many partner organizations scattered all over the world. “We didn’t really have storybooks growing up. Our school textbooks were in Bengali [the dominant regional language], and we would end up memorizing everything because we didn’t understand what we were reading,” Hansda told me. “It’s a good feeling to be able to create books in our languages for our children.”

Amna Singh, Pratham Books’ content and partnerships manager, estimates that 58 percent of the languages represented on StoryWeaver are underserved, a status quo that has cascading consequences for early childhood learning outcomes. But attempts to undo the neglect of underserved language communities are also closely linked with unlocking their potential as consumers, and A.I.-powered translation technology is a big part of this shift. Voice recognition tools and chat bots in regional Indian languages aim to woo customers outside metropolitan cities, a market that is expected to expand as cellular data usage becomes even cheaper.

These tools are only as good as their training data, and sourcing is a major challenge. For sustained multilingualism on the internet, machine translation models require large volumes of training data generated in two languages parallel to one another. Parliamentary proceedings and media publications are common sources of publicly available data that can be scraped for training purposes. However, both these sources—according to Microsoft’s researcher Bali—are too specific, and do not encompass a wide enough range in terms of topics and vocabulary to be properly representative of human speech. (This is why StoryWeaver isn’t a good source for training data, either, because sentences in children’s books are fairly simple and the reading corpus only goes up to fourth-grade reading levels.)
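
As a rough illustration of what "parallel" data looks like and why a narrow source falls short, the Python sketch below stores a few sentence pairs and measures how much of a new sentence's vocabulary the source side has already seen. The English-French pairs, the helper names (`tokenize`, `vocabulary`, `coverage`), and the coverage measure itself are assumptions chosen for this example, not anything the researchers quoted here are reported to use.

```python
# Hypothetical parallel corpus: each entry pairs a source sentence
# with its translation. Invented for illustration only.
PARALLEL = [
    ("The cat sleeps.", "Le chat dort."),
    ("The dog sleeps.", "Le chien dort."),
    ("The cat eats.", "Le chat mange."),
]

def tokenize(sentence):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    return [t for t in tokens if t]

def vocabulary(sentences):
    """Collect the set of distinct words across the given sentences."""
    words = set()
    for sentence in sentences:
        words.update(tokenize(sentence))
    return words

def coverage(known_vocab, sentence):
    """Fraction of a sentence's words that appear in the known vocabulary."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_vocab) / len(tokens)

source_vocab = vocabulary(src for src, _ in PARALLEL)
print(coverage(source_vocab, "The cat sleeps."))
print(coverage(source_vocab, "The committee adjourned the session."))
```

A sentence that resembles the training material is fully covered, while one drawn from a different domain shares only the word "the" with it. The same gap appears in either direction: simple storybook sentences say little about parliamentary language, and parliamentary proceedings say little about everyday speech.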

Link to the rest at Slate